
Select random lines from a file

In a Bash script, I want to pick out N random lines from an input file and output them to another file.

How can this be done?

Sort the file randomly and pick the first N lines.
This is not a duplicate -- he wants N lines vs. 1 line.
I disagree with sort -R as it does a lot of excess work, particularly for long files. You can use $RANDOM, modulo against the line count from wc -l, jot, sed -n (à la stackoverflow.com/a/6022431/563329), and bash functionality (arrays, command redirects, etc.) to define your own peek function which will actually run on 5,000,000-line files.
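A minimal sketch of the kind of peek function that comment describes (the function name and details here are assumptions, not the commenter's actual code): count the lines once, pick a random line number, and print only that line. Calling it N times gives N lines, possibly with repeats.

peek() {
    # Print one random line from a large file without sorting it
    local file=$1
    local total
    total=$(wc -l < "$file")
    # Combine two $RANDOM values so line numbers above 32767 stay reachable
    local n=$(( (RANDOM * 32768 + RANDOM) % total + 1 ))
    sed -n "${n}p;${n}q" "$file"    # print line n, then stop reading
}

# e.g. peek /usr/share/dict/words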

DomainsFeatured

Use shuf with the -n option as shown below, to get N random lines:

shuf -n N input > output

If you just need a random set of lines, not in a random order, then shuf is very inefficient (for big files): better is to do reservoir sampling, as in this answer (see the sketch after these comments).
I ran this on a 500M row file to extract 1,000 rows and it took 13 min. The file had not been accessed in months, and is on an Amazon EC2 SSD Drive.
so is this in essence more random than sort -R?
@MonaJalal nope just faster, since it doesn't have to compare lines at all.
Does it eventually yield the same line more than once?
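A minimal reservoir-sampling sketch, following the comment above about big files (not the linked answer's exact code; the sample size n=1000 and the input/output names are placeholders), assuming an awk with srand()/rand() such as gawk. It reads the file once and keeps at most n lines in memory, which is why it scales to files too big for shuf or sort -R:

awk -v n=1000 '
BEGIN { srand() }
NR <= n { r[NR] = $0; next }            # the first n lines fill the reservoir
{
    i = int(rand() * NR) + 1            # random index in 1..NR
    if (i <= n) r[i] = $0               # keep line NR with probability n/NR
}
END { m = (NR < n ? NR : n); for (i = 1; i <= m; i++) print r[i] }
' input > output

Note that the selected lines come out in file order, not shuffled, which is exactly the "random set of lines, not in a random order" case described above.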
Bruno Bronosky

Sort the file randomly and pick the first 100 lines:

lines=100
input_file=/usr/share/dict/words

# This is the basic selection method
<$input_file sort -R | head -n $lines

# If the file has duplicates that must never cause duplicate results
<$input_file sort | uniq        | sort -R | head -n $lines

# If the file has blank lines that must be filtered, use sed
<$input_file sed $'/^[ \t]*$/d' | sort -R | head -n $lines

Of course <$input_file can be replaced with any piped standard input. This (sort -R and $'...\t...' to get sed to match tab chars) works with GNU/Linux and BSD/macOS.
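For example, the same selection can be fed from any pipeline instead of the file redirect (the grep stage here is just an arbitrary stand-in for whatever produces your input):

grep -v '^#' /etc/services | sort -R | head -n "$lines"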


sort actually sorts identical lines together, so if you may have duplicate lines and you have shuf (a GNU tool) installed, it's better to use it for this (see the one-liner after these comments).
And also, this is definitely going to make you wait a long time if you have a considerably huge file -- 80 million lines -- whereas shuf -n acts almost instantaneously.
sort -R is not available under Mac OS X (10.9)
@tfb785: sort -R is probably a GNU option; install GNU coreutils. BTW, shuf is also part of coreutils.
@J.F.Sebastian The code: sort -R input | head -n <num_lines>. The input file was 279GB, with 2 billion+ lines. Can't share it, though. Anyway, the point is that with a shuffle you can keep only some lines in memory to do the random selection of what to output. sort is going to sort the entire file, regardless of what your needs are.
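For example (a hypothetical one-liner combining the two suggestions above), de-duplicate first and then let shuf pick the sample, using the same variables as the answer:

sort -u "$input_file" | shuf -n "$lines"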
Stein van Broekhoven

Well, according to a comment on the shuf answer, he shuffled 78,000,000,000 lines in under a minute.

Challenge accepted...

EDIT: I beat my own record

powershuf did it in 0.047 seconds

$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total

The reason it is so fast is that I don't read the whole file; I just move the file pointer 10 times and print the line after the pointer.
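A very rough shell sketch of that idea (an assumption about the behaviour, not powershuf's actual code): jump to a random byte offset, discard the probably-partial line found there, and print the next one. As the comments further down point out, this is fast but not uniform, because lines that follow long lines are more likely to be picked.

file=lines_78000000000.txt
size=$(wc -c < "$file")    # file size in bytes
# generate 10 random byte offsets in [0, size) in a single awk call
for offset in $(awk -v s="$size" 'BEGIN { srand(); for (i = 0; i < 10; i++) print int(rand() * s) }'); do
    # skip to the offset, drop the (likely partial) first line, print the next one
    tail -c +"$((offset + 1))" "$file" | sed -n '2p;2q'
done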

Gitlab Repo

Old attempt

First I needed a file of 78,000,000,000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt
(output: 10 blank lines, since the generated file contains only newlines)
shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

The bottleneck was the CPU and the lack of multithreading: it pinned 1 core at 100% while the other 15 were not used.

Python is what I regularly use so that's what I'll use to make this faster:

#!/usr/bin/env python3
import random

# Count the lines by reading the file in 64 KiB chunks
f = open("lines_78000000000.txt", "rt")
count = 0
while 1:
  buffer = f.read(65536)
  if not buffer: break
  count += buffer.count('\n')

# Try to pick 10 random lines; note that readline(n) reads at most n bytes
# from the current position (end of file here) rather than jumping to line n
for i in range(10):
  f.readline(random.randint(1, count))

This got me just under a minute:

$ time ./shuf.py         
(output: 10 blank lines)
./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total

I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe, which gives me plenty of read and write speed.

I know it can get faster but I'll leave some room to give others a try.

Line counter source: Luther Blissett


Well, according to your description of powershuf's inner functioning, it looks like it is just random-ish. Using a file with just two lines, one being 1 character long, the other being 20 characters long, I expect both lines to be chosen with equal probability. This doesn't seem to be the case with your program.
There was an issue with files shorter than 4KB and some other math mistakes that made it horrible with small files. I fixed them as far as I could find the issues; please give it another try.
Hi Stein. It doesn't seem to work. Did you test it the way I suggested in my above comment? Before making something quicker than shuf, I reckon you should focus on making something that works as accurately as shuf. I really doubt anyone can beat shuf with a Python program. BTW, unless you use the -r option, shuf doesn't output the same line twice, and of course this takes additional processing time.
Why does powershuf discard the first line? Can it ever pick the very first line? It seems to also funnel the search in a weird way: if you have 10 lines too long, then 1 line of valid length, then 5 lines and another line of valid length, then the iteration will find the 10 lines more often than the 5, and funnel about two thirds of the time into the first valid line. The program doesn't promise this, but it would make sense to me if the lines were effectively filtered by length and then random lines were chosen from that set.
The question is how to get random lines from a text file in a bash script, not how to write a Python script.
Merlin

My preferred option is very fast. I sampled a tab-delimited data file with 13 columns, 23.1M rows, and 2.0 GB uncompressed.

# randomly sample select 5% of lines in file
# including header row, exclude blank lines, new seed

time \
awk 'BEGIN  {srand()} 
     !/^$/  { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt

# awk  tsv004  3.76s user 1.46s system 91% cpu 5.716 total

This is brilliant--and super fast.
Andelf
seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'

andrec
# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: N lines to be sampled 
rand_line_sampler() {
    N_t=$(awk '{print $1}' $1 | wc -l) # Number of total lines

    N_t_m_d=$(( $N_t - $2 - 1 )) # Total lines minus desired lines, minus 1 (the first '0' is echoed separately below)

    N_d_m_1=$(( $2 - 1)) # Number of desired lines minus 1

    # vector to have the 0 (fail) with size of N_t_m_d 
    echo '0' > vector_0.temp
    for i in $(seq 1 1 $N_t_m_d); do
            echo "0" >> vector_0.temp
    done

    # vector to have the 1 (success) with size of desired number of lines
    echo '1' > vector_1.temp
    for i in $(seq 1 1 $N_d_m_1); do
            echo "1" >> vector_1.temp
    done

    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    paste -d" " rand_vector.temp $1 |
    awk '$1 != 0 {$1=""; print}' |
    sed 's/^ *//' > sampled_file.txt # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}

rand_line_sampler "parameter_1" "parameter_2"

@dannyman: this answer is bash.
user19322235

In the script below, 'c' is the number of lines to select from the input. Modify as needed:

#!/bin/sh

gawk '
BEGIN   { srand(); c = 5 }
c/NR >= rand() { lines[x++ % c] = $0 }
END { for (i in lines)  print lines[i] }

' "$@"

This does not guarantee that exactly c lines are selected. At best you can say that the average number of lines being selected is c.
That is incorrect: c/NR will be >= 1 (larger than any possible value of rand()) for the first c lines, thus filling lines[]. x++ % c forces lines[] to c entries, assuming there are at least c lines in the input.
Right, c/NR is guaranteed to be larger than any value produced by rand for the first c lines. After that, it may or may not be larger than rand. Therefore we can say that lines[] in the end contains at least c entries, and in general more than that, i.e. not exactly c entries. Furthermore, the first c lines of the file are always picked, so the whole selection is not what could be called a random pick.
Uh, x++ % c constrains lines[] to indices 0 to c-1. Of course, the first c inputs initially fill lines[], and entries are then replaced in round-robin fashion when the random condition is met. A small change (left as an exercise for the reader) could be made to randomly replace entries in lines[], rather than in round-robin order.
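A sketch of that small change (my reading of the comment, not the commenter's code): replace a uniformly random reservoir slot instead of cycling with x++ % c, which turns the script into the same reservoir-sampling idea sketched earlier in the thread.

#!/bin/sh

gawk '
BEGIN   { srand(); c = 5 }
NR <= c { lines[NR] = $0; next }                     # the first c lines fill the reservoir
c/NR >= rand() { lines[int(rand() * c) + 1] = $0 }   # then replace a random slot with probability c/NR
END { for (i = 1; i <= c; i++) print lines[i] }
' "$@"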