Random sampling from a file
I recently learned about the Linux command line utility shuf from browsing The Art of Command Line. This could be useful for random sampling.
Given just a file name, shuf randomly permutes the lines of the file.
With the option -n you can specify how many lines to return. So it's doing sampling without replacement. For example,
shuf -n 10 foo.txt
would select 10 lines from foo.txt.
Actually, it would select at most 10 lines. You can't select 10 lines without replacement from a file with fewer than 10 lines. If you ask for more lines than the file contains, shuf simply returns every line of the file in random order, as if the -n option had been ignored.
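Here is a small sketch of that edge case. The temporary file and its three lines are made up for the illustration; the only thing being checked is how many lines come back when you ask for too many.

```shell
# Build a small three-line file (hypothetical contents, for illustration only).
tmpfile=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$tmpfile"

# Ask for 10 lines without replacement. Only 3 exist, so only 3 come back.
count=$(shuf -n 10 "$tmpfile" | wc -l)
echo "lines returned: $count"

rm -f "$tmpfile"
```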
You can also sample with replacement using the -r option. In that case you can select more lines than are in the file since lines may be reused. For example, you could run
shuf -r -n 10 foo.txt
to select 10 lines drawn with replacement from foo.txt, regardless of how many lines foo.txt has. When I ran the command above on a file containing the lines
alpha
beta
gamma
I got the output
beta
gamma
gamma
beta
alpha
alpha
gamma
gamma
beta
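If you want to convince yourself that -r really does hand back the number of lines you asked for, a rough check along these lines should do. The file name and contents are again invented for the example, and the particular lines drawn will differ from run to run.

```shell
# Hypothetical three-line input file, created just for this check.
tmpfile=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$tmpfile"

# Draw 10 lines with replacement and keep them for inspection.
sample=$(shuf -r -n 10 "$tmpfile")
printf '%s\n' "$sample"

# Count the lines drawn and the number of distinct values among them.
drawn=$(printf '%s\n' "$sample" | wc -l)
distinct=$(printf '%s\n' "$sample" | sort -u | wc -l)
echo "drawn: $drawn, distinct: $distinct"

rm -f "$tmpfile"
```

Ten lines come back even though the file has only three, so repeats are unavoidable.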
I don't know how shuf seeds its random number generator. Maybe from the system time. But if you run it twice you will get different results. Probably.
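If you want the same result every time, the GNU coreutils version of shuf accepts a --random-source=FILE option that takes its random bytes from a file you supply. The sketch below assumes GNU shuf (other implementations may not have the option) and uses a made-up input file plus a seed file of fixed bytes. A block of zeros is a poor source of randomness, but it makes the point about reproducibility.

```shell
# Hypothetical input file and seed file, created just for this demonstration.
tmpfile=$(mktemp)
seedfile=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$tmpfile"

# Fixed bytes stand in for randomness; 1024 bytes is far more than 3 lines need.
head -c 1024 /dev/zero > "$seedfile"

# Two runs reading the same bytes should produce the same ordering.
first=$(shuf --random-source="$seedfile" "$tmpfile")
second=$(shuf --random-source="$seedfile" "$tmpfile")

if [ "$first" = "$second" ]; then
    echo "both runs gave the same order"
else
    echo "the runs differed"
fi

rm -f "$tmpfile" "$seedfile"
```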