speed up grepping a file for a long list of needles
by masavini from LinuxQuestions.org on (#51AA0)
hi,
i have a long list of needles. i need to know which ones are present inside a file.
e.g.:
Code:
$ wc -l needles.txt
3589 needles.txt
$ head -3 needles.txt
this_string_is_present
this_is_not
and_so_on
$ wc -l hay.txt
756 hay.txt
$ head -3 hay.txt
this file contains a lot of strings: this_string_is_present
some needles are present
and some are not

a simple (and SLOW) solution could be:
Code:
hay=$(< hay.txt) # store hay.txt in a variable to avoid reading the disk thousands of times
needles=()
while IFS= read -r needle; do
# -F: match the needle as a fixed string, not a regex; --: a needle may start with a dash
grep -qF -- "${needle}" <<< "${hay}" \
&& needles+=( "${needle} - verified" ) \
|| needles+=( "${needle}" )
done < needles.txt

another (still pretty slow) solution could be using grep -Fo and comm:
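Code:
sort -u needles.txt > sorted-needles.txt
# -f reads the patterns from needles.txt (without it, grep searches for the literal string "needles.txt")
grep -Fof needles.txt hay.txt | sort -u > verified-needles.txt
# -23 keeps only the lines unique to the first file, i.e. the needles that never matched
comm -23 sorted-needles.txt verified-needles.txt > unverified-needles.txt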
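
to make the intent of the grep/comm approach concrete, here is a small self-contained run of it. this is only a sketch: the three-line sample files are made up from the excerpts above, the temporary directory and the LC_ALL=C setting are my own assumptions (the latter so that sort and comm agree on ordering), and it relies on grep's -F, -o and -f options plus comm being available.

```shell
#!/bin/sh
# Toy run of the grep/comm approach; sample data is illustrative only.
set -eu

workdir=$(mktemp -d)   # scratch directory (assumption, not part of the original setup)
cd "$workdir"

printf '%s\n' this_string_is_present this_is_not and_so_on > needles.txt
printf '%s\n' \
  'this file contains a lot of strings: this_string_is_present' \
  'some needles are present' \
  'and some are not' > hay.txt

# Use byte-wise collation so sort and comm order lines the same way.
LC_ALL=C
export LC_ALL

sort -u needles.txt > sorted-needles.txt

# -F: fixed strings, -o: print only the matched text, -f: read patterns from a file.
grep -Fof needles.txt hay.txt | sort -u > verified-needles.txt

# -2 drops lines found only in the second file, -3 drops lines common to both,
# so what remains is the set of needles that never appeared in hay.txt.
comm -23 sorted-needles.txt verified-needles.txt > unverified-needles.txt

echo "verified:";   cat verified-needles.txt
echo "unverified:"; cat unverified-needles.txt
```

with this sample, verified-needles.txt ends up holding this_string_is_present, and unverified-needles.txt holds and_so_on and this_is_not. hay.txt is read only once, whatever the number of needles.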
can you suggest a better solution?
thanks!

