In a Bash script, I want to pick out N random lines from an input file and output them to another file.
How can this be done?
Avoid sort -R as it does a lot of excess work, particularly for long files. You can use $RANDOM, % (modulo) with wc -l, jot, sed -n (à la stackoverflow.com/a/6022431/563329), and bash functionality (arrays, command redirects, etc.) to define your own peek function that will actually run on 5,000,000-line files.
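For reference, here is a rough sketch of what such a peek function could look like. The name and details are illustrative rather than taken from the linked answer; bash and GNU sed are assumed, and repeated line numbers are possible since each draw is independent.

# Print $1 random lines from file $2 in a single sed pass, without sorting.
peek() {
    local n=$1 file=$2
    local total cmds="" lineno i
    total=$(wc -l < "$file")                   # count lines once
    for ((i = 0; i < n; i++)); do
        # $RANDOM is only 15 bits, so combine two draws for big files
        lineno=$(( (RANDOM * 32768 + RANDOM) % total + 1 ))
        cmds+="${lineno}p"$'\n'                # one sed "print" command per line number
    done
    sed -n "$cmds" "$file"                     # print just those lines in one pass
}

peek 100 /usr/share/dict/words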
Sort the file randomly and pick the first 100 lines:
lines=100
input_file=/usr/share/dict/words
# This is the basic selection method
<$input_file sort -R | head -n $lines
# If the file has duplicates that must never cause duplicate results
<$input_file sort | uniq | sort -R | head -n $lines
# If the file has blank lines that must be filtered, use sed
<$input_file sed $'/^[ \t]*$/d' | sort -R | head -n $lines
Of course <$input_file can be replaced with any piped standard input. This (sort -R and $'...\t...' to get sed to match tab chars) works with GNU/Linux and BSD/macOS.
Note that sort -R actually sorts identical lines together, so if you might have duplicate lines and you have shuf (a GNU tool) installed, it's better to use it for this. shuf -n acts almost instantaneously.
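For example, with the same variables as the answer above (GNU coreutils shuf assumed):

lines=100
input_file=/usr/share/dict/words

# pick 100 random lines directly, no full sort involved
shuf -n "$lines" "$input_file"

# shuf also reads standard input, so it drops into the same pipelines
<$input_file sed $'/^[ \t]*$/d' | shuf -n "$lines"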
sort -R is probably a GNU-specific option; install GNU coreutils to get it. By the way, shuf is also part of coreutils.
sort -R input | head -n <num_lines>: the input file was 279 GB, with 2 billion+ lines (can't share it, though). Anyway, the point is that you can keep some lines in memory with shuffle to do the random selection of what to output, while sort is going to sort the entire file, regardless of what your needs are.
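As a sketch of that difference (the file name here is made up): shuf -n can sample from a pipe without buffering the whole stream, whereas sort -R has to order every line before head sees any output.

# sample 1000 lines from a hypothetical huge compressed file, no sorting
zcat huge_file.txt.gz | shuf -n 1000 > sample.txt

# the sort -R version must sort all of the lines before head gets any output
zcat huge_file.txt.gz | sort -R | head -n 1000 > sample.txt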
Well, according to a comment on the shuf answer, he shuffled 78,000,000,000 lines in under a minute.
Challenge accepted...
EDIT: I beat my own record. powershuf did it in 0.047 seconds:
$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 0.02s user 0.01s system 80% cpu 0.047 total
The reason it is so fast: I don't read the whole file; I just move the file pointer 10 times and print the line after the pointer.
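powershuf itself isn't shown here, but a rough shell approximation of the same pointer-moving idea looks like this (GNU stat and tail assumed; note the result is biased toward longer lines, since a random byte offset lands inside long lines more often):

file=lines_78000000000.txt
size=$(stat -c %s "$file")                  # file size in bytes
for i in $(seq 1 10); do
    offset=$(shuf -i 1-"$size" -n 1)        # random byte position in the file
    # seek to the offset, skip the partial line, print the next full line
    tail -c +"$offset" "$file" | head -n 2 | tail -n 1
done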
Old attempt
First I needed a file of 78,000,000,000 lines:
seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt
This gives me a file with 78 billion newlines ;-)
Now for the shuf part:
$ time shuf -n 10 lines_78000000000.txt
shuf -n 10 lines_78000000000.txt 2171.20s user 22.17s system 99% cpu 36:35.80 total
The bottleneck was the CPU and the lack of multithreading: it pinned one core at 100% while the other 15 were not used.
Python is what I regularly use, so that's what I'll use to make this faster:
#!/bin/python3
import random

f = open("lines_78000000000.txt", "rt")

# Count lines by reading the file in 64 KiB chunks
count = 0
while True:
    buffer = f.read(65536)
    if not buffer:
        break
    count += buffer.count('\n')

# Quick-and-dirty sampling pass; note that readline()'s argument is a byte
# limit, not a line number
for i in range(10):
    f.readline(random.randint(1, count))
This got me just under a minute:
$ time ./shuf.py
./shuf.py 42.57s user 16.19s system 98% cpu 59.752 total
I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe, which gives me plenty of read and write speed.
I know it can get faster, but I'll leave some room to give others a try.
Line counter source: Luther Blissett
Without the -r option, shuf doesn't output the same line twice, and of course this takes additional processing time.
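A quick way to see the difference on a toy input (GNU shuf):

# without -r: a permutation, so each input line appears at most once (only 3 lines out)
seq 3 | shuf -n 5

# with -r: lines may repeat, and exactly 5 lines are printed
seq 3 | shuf -r -n 5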
My preferred option is very fast. I sampled a tab-delimited data file with 13 columns, 23.1M rows, 2.0 GB uncompressed.
# randomly sample select 5% of lines in file
# including header row, exclude blank lines, new seed
time \
awk 'BEGIN {srand()}
!/^$/ { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt
# awk tsv004 3.76s user 1.46s system 91% cpu 5.716 total
seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'
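That prints a single random line from standard input. If you need N lines without repeats, random.sample is the natural extension (N=10 here, purely as an illustration):

seq 1 100 | python3 -c 'import random, sys; print("".join(random.sample(sys.stdin.readlines(), 10)), end="")'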
# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: N lines to be sampled
rand_line_sampler() {
    N_t=$(awk '{print $1}' "$1" | wc -l)  # Number of total lines
    N_t_m_d=$(( $N_t - $2 - 1 ))          # Number of total lines minus desired number of lines
    N_d_m_1=$(( $2 - 1 ))                 # Number of desired lines minus 1

    # vector to have the 0 (fail) with size of N_t_m_d
    echo '0' > vector_0.temp
    for i in $(seq 1 1 $N_t_m_d); do
        echo "0" >> vector_0.temp
    done

    # vector to have the 1 (success) with size of desired number of lines
    echo '1' > vector_1.temp
    for i in $(seq 1 1 $N_d_m_1); do
        echo "1" >> vector_1.temp
    done

    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    paste -d" " rand_vector.temp "$1" |
        awk '$1 != 0 {$1=""; print}' |
        sed 's/^ *//' > sampled_file.txt  # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}
rand_line_sampler "parameter_1" "parameter_2"
In the below, c is the number of lines to select from the input. Modify as needed:
#!/bin/sh
gawk '
BEGIN { srand(); c = 5 }
c/NR >= rand() { lines[x++ % c] = $0 }
END { for (i in lines) print lines[i] }
' "$@"
There is no guarantee that exactly c lines are selected; at best you can say that the average number of lines being selected is c. c/NR is guaranteed to be larger than any value produced by rand for the first c lines; after that, it may or may not be larger than rand. Therefore we can say that lines in the end contains at least c entries, and in general more than that, i.e. not exactly c entries. Furthermore, the first c lines of the file are always picked, so the whole selection is not what could be called a random pick.
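For comparison, here is a sketch of the standard reservoir-sampling form (Algorithm R), which does output exactly c lines whenever the input has at least c, with each line equally likely; gawk is assumed as in the answer above:

#!/bin/sh
gawk '
BEGIN { srand(); c = 5 }
NR <= c { pool[NR] = $0; next }                          # fill the reservoir with the first c lines
{ j = int(rand() * NR) + 1; if (j <= c) pool[j] = $0 }   # keep the current line with probability c/NR
END { for (i = 1; i <= c && i in pool; i++) print pool[i] }
' "$@"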