Select random lines from a file

bash shell random text-processing

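In a Bash script, I want to pick N random lines from an input file and write them to another file.

How can this be done?

Sort the file randomly and take the first N lines.

See also stackoverflow.com/questions/12354659/….

This is not a duplicate: he wants N lines vs. 1 line.

Related: Randomly Pick Lines From a File Without Slurping It With Unix

I disagree with sort -R because it does a lot of unnecessary work, especially for long files. You can use $RANDOM, % wc -l, jot, sed -n (à la stackoverflow.com/a/6022431/563329), and bash features (arrays, command redirection, etc.) to define your own peek function that will actually run on 5,000,000-line files.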
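The comment above does not show its peek function, so here is a minimal sketch of what one could look like. It is an assumption, not the commenter's code: the name peek and its arguments are hypothetical, and awk's rand() is swapped in for $RANDOM and jot, because $RANDOM only covers 0 to 32767 and jot is not installed everywhere. The idea is the same: count the lines with wc -l, draw N distinct line numbers, and let sed -n print just those lines in a single pass with no sorting.

```shell
# peek FILE N: print N distinct random lines of FILE, kept in file order.
# Hypothetical helper; the name "peek" is taken from the comment above.
peek() {
    file=$1
    n=$2
    total=$(wc -l < "$file" | tr -d ' ')   # total number of lines
    if [ "$n" -gt "$total" ]; then
        n=$total                           # cannot return more lines than exist
    fi
    if [ "$n" -le 0 ]; then
        return 0
    fi

    # Draw n distinct line numbers; emit one sed command ("17p") per number.
    script=$(awk -v total="$total" -v n="$n" 'BEGIN {
        srand()
        while (k < n) {
            i = int(rand() * total) + 1
            if (!(i in seen)) { seen[i]; k++; print i "p" }
        }
    }')

    # One pass over the file; only the chosen line numbers are printed.
    sed -n "$script" "$file"
}
```

Usage would be something like peek input 100 > output. Two caveats: srand() seeds from the clock, so two calls within the same second draw the same line numbers, and sed checks every address on every line, so this suits a small N rather than a large one.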

DomainsFeatured

Use shuf with the -n option, as shown below, to get N random lines:

shuf -n N input > output

If you only need a random set of lines, not a random order, then shuf is very inefficient (for large files): it is better to do reservoir sampling, as in this answer.
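For readers who have not met reservoir sampling, here is a minimal sketch in awk wrapped in a shell function. It is not the code from the linked answer; the function name reservoir is hypothetical. It follows the textbook "Algorithm R": keep the first N lines, then for line number NR draw j uniformly from 1..NR and overwrite slot j if j is at most N. Only N lines are ever held in memory, and the input is read once.

```shell
# reservoir N [FILE...]: print N lines chosen uniformly at random.
# Hypothetical helper name; reads standard input when no file is given.
reservoir() {
    n=$1
    shift
    awk -v n="$n" '
        BEGIN   { srand() }
        NR <= n { keep[NR] = $0; next }        # fill the reservoir first
        {
            j = int(rand() * NR) + 1           # uniform in 1..NR
            if (j <= n) keep[j] = $0           # replace with probability n/NR
        }
        END {
            m = (NR < n) ? NR : n              # short input: print what exists
            for (i = 1; i <= m; i++) print keep[i]
        }' "$@"
}
```

Usage would be reservoir 1000 input > output. The output order is not shuffled (slots are printed in index order), which is fine when only the set of lines matters.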

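I ran this on a 500M-line file to extract 1,000 lines, and it took 13 minutes. The file had not been accessed in months and sits on an Amazon EC2 SSD drive.

So is this inherently more random than sort -R?

@MonaJalal No, just faster, since it does not need to compare lines at all.

Will it ever output the same line more than once?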
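On that last question, a quick check with GNU shuf: without -r it samples without replacement, so an input line can appear at most once in the output (unless the input itself contains duplicates). A small sketch, using seq only to generate sample input:

```shell
# Without -r, shuf permutes its input, so asking for all 5 lines returns
# each of them exactly once (sorted here only to make that easy to see).
seq 1 5 | shuf -n 5 | sort -n

# With -r (--repeat), every output line is drawn independently, so
# repeats are possible and -n may exceed the number of input lines.
seq 1 5 | shuf -r -n 8
```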

Bruno Bronosky

Sort the file randomly and pick the first 100 lines:

lines=100
input_file=/usr/share/dict/words

# This is the basic selection method
<$input_file sort -R | head -n $lines

# If the file has duplicates that must never cause duplicate results
<$input_file sort | uniq        | sort -R | head -n $lines

# If the file has blank lines that must be filtered, use sed
<$input_file sed $'/^[ \t]*$/d' | sort -R | head -n $lines

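Of course, <$input_file can be replaced with any piped standard input. This (sort -R, and $'...\t...' to make sed match tab characters) works on GNU/Linux and BSD/macOS.

sort actually sorts identical lines together, so if you may have duplicate lines and you have shuf (a GNU tool) installed, it is better to use that.

Also, if you have a fairly large file (80kk lines), this will definitely keep you waiting a long time, whereas shuf -n responds almost immediately.

sort -R is not available under Mac OS X (10.9)

@tfb785: sort -R is probably a GNU option; install GNU coreutils. By the way, shuf is also part of coreutils.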
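Since tool availability differs between systems, a script may want to pick whatever is installed. Below is a rough sketch under a few assumptions: the function name random_lines is hypothetical, gshuf is the name Homebrew gives GNU shuf on macOS, and the last branch swaps in a different technique (tag each line with a random key, sort on the key, strip the key) that needs only awk, sort, cut, and head. Unlike sort -R, the fallback does not group identical lines, but it still sorts the whole input.

```shell
# random_lines N [FILE...]: print N random lines using whichever tool exists.
# Hypothetical wrapper; reads standard input when no file is given.
random_lines() {
    n=$1
    shift
    if command -v shuf >/dev/null 2>&1; then
        shuf -n "$n" "$@"
    elif command -v gshuf >/dev/null 2>&1; then
        gshuf -n "$n" "$@"
    else
        # Fallback: prefix a random key, sort numerically, drop the key.
        awk 'BEGIN { srand() } { printf "%.17f\t%s\n", rand(), $0 }' "$@" |
            sort -n | cut -f2- | head -n "$n"
    fi
}
```

Usage would be random_lines 100 input > output, or some_command | random_lines 100.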

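@JFSebastian The code: sort -R input | head -n <num_lines>. The input file was 279GB, with 2bi+ lines. I can't share it, though. Anyway, the point is that with shuffling you can keep just some lines in memory while randomly choosing what to output. Sort will sort the entire file regardless of what you need.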

Stein van Broekhoven

Well, according to a comment on the shuf answer, he shuffled 78 000 000 000 lines in under a minute.

Challenge accepted...

EDIT: I beat my own record

powershuf did it in 0.047 seconds

$ time ./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null 
./powershuf.py -n 10 --file lines_78000000000.txt > /dev/null  0.02s user 0.01s system 80% cpu 0.047 total

The reason it is so fast: I don't read the whole file. I just move the file pointer 10 times and print the line after the pointer.

Gitlab Repo
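To make the "move the file pointer" idea concrete, here is a minimal shell sketch of the general technique. It is an assumption about the approach, not powershuf's actual code, and the function name seek_line is hypothetical. It jumps to a random byte offset with tail -c, skips the partial line it lands in, and prints the next complete line.

```shell
# seek_line FILE: jump to a random byte and print the next complete line.
# Hypothetical helper illustrating the seek idea; not powershuf itself.
seek_line() {
    file=$1
    size=$(wc -c < "$file" | tr -d ' ')
    [ "$size" -gt 0 ] || return 0
    # 32 random bits from /dev/urandom, reduced to an offset in 1..size.
    r=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
    offset=$((r % size + 1))
    # tail -c +K starts at byte K; line 1 of that output is (usually) a
    # partial line, so print line 2 and stop reading.
    tail -c "+$offset" "$file" | sed -n '2{p;q;}'
}
```

The cost no longer depends on the number of lines, but the choice is not uniform: a line is picked with probability proportional to the length of the line before it, the first line can never be picked, and an offset inside the last line prints nothing.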

Old attempt

First, I needed a file with 78.000.000.000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt










shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

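The bottleneck was the CPU, and it did not use multiple threads: it pinned 1 core at 100% while the other 15 sat unused.

Python is what I use regularly, so that is what I'll use to speed this up: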

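#!/bin/python3
import random
f = open("lines_78000000000.txt", "rt")
count = 0
# Pass 1: count newlines, reading the file in 64 KiB chunks.
while 1:
  buffer = f.read(65536)
  if not buffer: break
  count += buffer.count('\n')

# Note: f is now at end of file, and readline(size) reads at most `size`
# characters from the current position, so each call returns '' here and
# nothing is printed.
for i in range(10):
  f.readline(random.randint(1, count))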

This got me under a minute:

$ time ./shuf.py         










./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total

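I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe, which gives me plenty of read and write speed.

I know it can get faster, but I'll leave some room for others to give it a try.

Line counter source: Luther Blissett

Well, according to your description of powershuf's inner workings, it looks like it is only random-ish. Using a file with just two lines, one 1 character long and the other 20 characters long, I would expect both lines to be chosen with equal probability. That does not seem to be the case with your program.

There was an issue with files smaller than 4KB, plus some other math errors that made it behave horribly on small files. I fixed them as far as I could find them; please try again.

Hi Stein. It doesn't seem to work. Did you test it the way I suggested in my comment above? Before making something faster than shuf, I think you should focus on making something as accurate as shuf. I really doubt anyone can beat shuf with a Python program. By the way, unless you use the -r option, shuf does not output the same line twice, and of course that takes additional processing time.

Why does powershuf discard the first line? Can it ever pick the first line? It also seems to funnel the search in an odd way: if you have 10 lines that are too long, then 1 line of valid length, then 5 lines and another line of valid length, the iteration will land in the 10 lines more often than in the 5, and funnel about two thirds of the time into the first valid line. The program does not promise this, but it would make sense to me if the lines were effectively filtered by length and random lines were then chosen from that set.

The question is how to get random lines from a text file in a bash script, not how to write a Python script.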

Merlin

My preferred option is very fast. I sampled a tab-delimited data file with 13 columns, 23.1M rows, 2.0GB uncompressed.

# randomly sample select 5% of lines in file
# including header row, exclude blank lines, new seed

time \
awk 'BEGIN  {srand()} 
     !/^$/  { if (rand() <= .05 || FNR==1) print > "data-sample.txt"}' data.txt

# awk  tsv004  3.76s user 1.46s system 91% cpu 5.716 total

This is brilliant, and super fast.
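One thing worth spelling out about the awk approach: each line is kept independently with a fixed probability, so the output holds roughly 5% of the lines rather than exactly N. A small parameterized sketch of the same idea follows; the function name sample_fraction is hypothetical, and it writes to standard output so the caller picks the destination file.

```shell
# sample_fraction P FILE: print the header row plus each non-blank line
# with probability P (a fraction such as 0.05). Hypothetical helper name.
sample_fraction() {
    awk -v p="$1" '
        BEGIN { srand() }
        !/^$/ && (FNR == 1 || rand() < p)   # default action: print the line
    ' "$2"
}
```

Usage would be sample_fraction 0.05 data.txt > data-sample.txt. The expected size is P times the number of non-blank lines; when an exact count matters, shuf -n or reservoir sampling is the better fit.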

Andelf

seq 1 100 | python3 -c 'print(__import__("random").choice(__import__("sys").stdin.readlines()))'

andrec

# Function to sample N lines randomly from a file
# Parameter $1: Name of the original file
# Parameter $2: N lines to be sampled 
rand_line_sampler() {
    N_t=$(awk '{print $1}' $1 | wc -l) # Number of total lines

    N_t_m_d=$(( $N_t - $2 - 1 )) # Total lines minus desired lines, minus 1 (one 0 is written before the loop)

    N_d_m_1=$(( $2 - 1)) # Number of desired lines minus 1

    # vector to have the 0 (fail) with size of N_t_m_d 
    echo '0' > vector_0.temp
    for i in $(seq 1 1 $N_t_m_d); do
            echo "0" >> vector_0.temp
    done

    # vector to have the 1 (success) with size of desired number of lines
    echo '1' > vector_1.temp
    for i in $(seq 1 1 $N_d_m_1); do
            echo "1" >> vector_1.temp
    done

    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    paste -d" " rand_vector.temp $1 |
    awk '$1 != 0 {$1=""; print}' |
    sed 's/^ *//' > sampled_file.txt # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}

rand_line_sampler "parameter_1" "parameter_2"

@dannyman: this answer is bash.

user19322235

The "c" below is the number of lines to select from the input. Modify as needed:

#!/bin/sh

gawk '
BEGIN   { srand(); c = 5 }
c/NR >= rand() { lines[x++ % c] = $0 }
END { for (i in lines)  print lines[i] }

' "$@"

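This does not guarantee that exactly c lines are selected. At best you can say that the average number of lines selected is c.

That is incorrect: for the first c lines, c/NR will be >= 1 (greater than any possible value of rand()), which fills lines[]. x++ % c forces lines[] to c entries, assuming there are at least c lines in the input.

Right, c/NR is guaranteed to be greater than any value rand generates for the first c lines. After that, it may or may not be greater than rand. So we can say that lines ends up containing at least c entries, and usually more than that, i.e. not exactly c entries. Furthermore, the first c lines of the file are always picked, so the overall selection is not what you would call a random pick.

Uh, x++ % c constrains lines[] to indices 0 through c-1. Of course, the first c inputs initially fill lines[], and they are replaced in round-robin fashion whenever the random condition is met. A small change (left as an exercise for the reader) could replace entries in lines[] at random instead of round robin.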
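The disagreement about how many lines come out is easy to settle empirically. The sketch below runs the answer's awk program and counts the entries left in lines[] at the end. Assumptions: the wrapper name count_selected is hypothetical, plain awk is used instead of gawk (nothing gawk-specific seems to be needed), and c is passed in with -v rather than hard-coded.

```shell
# count_selected C LINES: feed LINES numbered lines to the answer's awk
# program and print how many entries lines[] holds at the end.
count_selected() {
    seq 1 "$2" | awk -v c="$1" '
        BEGIN { srand() }
        c/NR >= rand() { lines[x++ % c] = $0 }
        END { n = 0; for (i in lines) n++; print n }'
}

count_selected 5 1000   # prints 5
```

Because the index is always x++ % c, the array can never hold more than c entries, and the first c input lines always fill it. This only checks the count; it says nothing about whether every line is equally likely to be chosen.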
