
How do I grep for all non-ASCII characters?

I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following:

grep -e "[\x{00FF}-\x{FFFF}]" file.xml

But this returns every line in the file, regardless of whether the line contains a character in the range specified.

Do I have the syntax wrong or am I doing something else wrong? I've also tried:

egrep "[\x{00FF}-\x{FFFF}]" file.xml 

(with both single and double quotes surrounding the pattern).

ASCII characters are only one byte long, so unless the file is Unicode, there should be no characters above 0xFF.
How do we go above \xFF? Grep gives a "grep: range out of order in character class" error.
Sometimes it's nice to have a second opinion about chars with the high bit set in a file. In that case, I like tr <file.txt -d '\000-\177' >foo.out && ls -al foo.out to get a count. And/or follow with od -x foo.out to get a look at the actual values.
The awk solution and C locale + grep work on BSD.

Kuzeko

You can use the command:

grep --color='auto' -P -n "[\x80-\xFF]" file.xml

This will give you the line number and will highlight non-ASCII chars in red.

On some systems, depending on your settings, the above will not work, so you can instead grep for the inverse:

grep --color='auto' -P -n "[^\x00-\x7F]" file.xml

Note also that the important bit is the -P flag, which equates to --perl-regexp: it makes grep interpret your pattern as a Perl regular expression. The documentation also says that

this is highly experimental and grep -P may warn of unimplemented features.
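
A quick sanity check (a hedged sketch; the two-line sample input is made up, and -P requires GNU grep):

printf 'plain ascii\ncafé\n' | grep --color='auto' -P -n "[^\x00-\x7F]"

This should print only the second line, as 2:café, with the é highlighted.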


This won't work in BSD grep (on OS X 10.8 Mountain Lion), as it does not support the -P option.
To update my last comment, the GNU version of grep is available in Homebrew's dupes library (enable using brew tap homebrew/dupes): brew install grep
@BastiaanVanDeWeerd is correct, grep on OSX 10.8 no longer supports PCRE ("Perl-compatible regular expressions") as Darwin now uses BSD grep instead of GNU grep. An alternative to installing the dupes library is to install pcre instead: brew install pcre... as part of this, you will get the pcregrep utility, which you can use as follows: pcregrep --color='auto' -n "[\x80-\xFF]" file.xml
For Mac brew users, GNU's coreutils can be installed with brew install coreutils. This will give you lots of GNU tools prefixed with a 'g' - in this case use ggrep. This should avoid problems arising from replacing a system utility, since system-specific Mac scripts now depend on BSD grep.
This works fine on a Mac: ag "[\x80-\xFF]" file. You just need to install the_silver_searcher.
pvandenberk

Instead of making assumptions about the byte range of non-ASCII characters, as most of the above solutions do, it's slightly better IMO to be explicit about the actual byte range of ASCII characters.

So the first solution for instance would become:

grep --color='auto' -P -n '[^\x00-\x7F]' file.xml

(which basically greps for any character outside of the hexadecimal ASCII range: from \x00 up to \x7F)

On Mountain Lion that won't work (due to the lack of PCRE support in BSD grep), but with pcre installed via Homebrew, the following will work just as well:

pcregrep --color='auto' -n '[^\x00-\x7F]' file.xml

Any pros or cons that anyone can think of?


This actually worked for me where the above solutions failed. Finding M$ Word apostrophes has never been easier!
If you have a bash-compatible shell but pcregrep isn't working, LC_COLLATE=C grep $'[^\1-\177]' works (for files without null bytes).
This solution seems to work more consistently than the ones above.
I had to use this to pickup Kanji, Cyrillic and Traditional Chinese in my UTF8 file, using "[\x80-\xFF]" missed all of these.
Pro: this worked excellently where the other options fell slightly short. No cons found so far.
Thelema

The following works for me:

grep -P "[\x80-\xFF]" file.xml

Non-ASCII characters start at 0x80 and go to 0xFF when looking at bytes. Grep (and family) don't do Unicode processing to merge multi-byte characters into a single entity for regex matching as you seem to want. The -P option in my grep allows the use of \xdd escapes in character classes to accomplish what you want.


For those who might not immediately know how to run this over multiple files, just run: find . -name '*.xml' | xargs grep -P "[\x80-\xFF]"
This does return a match, but there is no indication of what the character is and where it is. How does one see what the character is, and where it is?
Adding "-n" will give the line number; additionally, non-visible chars will show as a block in the terminal: grep -n -P "[\x80-\xFF]" file.xml
I'm having a problem with Hangul Korean: echo '소녀시대' | grep -P "[\x80-\xFF]" returns nothing for me -- can anyone else confirm? (GNU grep 2.21)
@frabjous Same here, but grepping the inverse works: echo '소녀시대' | grep -P "[^\x00-\x7F]". Or just use the_silver_searcher as pointed out by @slf: echo '소녀시대' | ag "[\x80-\xFF]"
Gilles 'SO- stop being evil'

The easy way is to define a non-ASCII character... as a character that is not an ASCII character.

LC_ALL=C grep '[^ -~]' file.xml

Add a tab after the ^ if necessary.

Setting LC_COLLATE=C avoids nasty surprises about the meaning of character ranges in many locales. Setting LC_CTYPE=C is necessary to match single-byte characters — otherwise the command would miss invalid byte sequences in the current encoding. Setting LC_ALL=C avoids locale-dependent effects altogether.
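
For instance, to allow tabs as well (a sketch assuming a bash-compatible shell, where $'...' turns \t into a literal tab inside the bracket expression):

LC_ALL=C grep -n $'[^\t -~]' file.xml

This flags lines containing anything other than tab and printable ASCII (space through ~).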


On RedHat 6.4 with tcsh, I had to use env LC_COLLATE=C grep -n '[^ -~]' file.xml. I added -n to get the line number.
For me echo "A" | LC_COLLATE=C grep '[^ -~]' returns a match
@frabjous If you have LC_ALL=en_US.UTF-8, that trumps the LC_COLLATE setting. You shouldn't have this in your environment! LC_ALL is only to force a specific task to use a particular locale, usually C. To set the default locale for all categories, set LANG.
At first I didn't add LC_ALL=C, and it behaved differently on Mac OS X and Ubuntu. After adding this setting, they give the same result.
This works on a Mac, while the other grep-based solutions don't.
noquery

In perl

perl -ane '{ if(m/[[:^ascii:]]/) { print  } }' fileName > newFile

On OSX 10.11 I had to try several grep+regex solutions before finding this one, which actually works.
Care to share that OSX solution @sg?!
The perl script above is the solution that I'm talking about.
perl -lne 'print if /[^[:ascii:]]/' file.xml
ryanm

Here is another variant I found that produced completely different results from the grep search for [\x80-\xFF] in the accepted answer. Perhaps it will be useful to someone for finding additional non-ASCII characters:

grep --color='auto' -P -n "[^[:ascii:]]" myfile.txt

Note: my computer's grep (a Mac) did not have the -P option, so I did brew install grep and started the call above with ggrep instead of grep.


This is by far the best answer, as it works for Mac as well as Linux.
Depends on the locale. It didn't work for me until I set LC_ALL=C like LC_ALL=C grep --color='auto' -P -n "[^[:ascii:]]" myfile.txt
CarenRose

Searching for non-printable chars. TL;DR / executive summary:

Search for control chars AND extended Unicode. A locale setting such as LC_ALL=C is needed to make grep do what you might expect with extended Unicode.

So the preferred non-ASCII char finders:

$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test

As in the top answer, the inverse grep:

$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

As in the top answer but WITH LC_ALL=C:

$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test

More excruciating detail on this:

I agree with Harvey above, buried in the comments: it is often more useful to search for non-printable characters, OR it is easy to think non-ASCII when you really should be thinking non-printable. Harvey suggests: use "[^\n -~]", and add \r for DOS text files. That translates to "[^\x0A\x20-\x7E]", plus \x0D for CR.
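
Harvey's pattern as an actual grep call might look like this (my own assembly of his suggestion, not from his comment; the file name is hypothetical):

LC_ALL=C grep -n -P '[^\x0A\x0D\x20-\x7E]' file.txt

i.e. flag any line holding something besides LF, CR and printable ASCII.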

Also, adding -c (show count of patterns matched) to grep is useful when searching for non-printable chars, as the strings matched can mess up the terminal.

I found adding range 0-8 and 0x0e-0x1f (to the 0x80-0xff range) is a useful pattern. This excludes TAB, CR and LF and one or two more uncommon printable chars. So IMHO quite a useful (albeit crude) grep pattern is THIS one:

grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

ACTUALLY, generally you will need to do this:

LC_ALL=C grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" *

breakdown:

LC_ALL=C - set locale to C, otherwise many extended chars will not match (even though they look like they are encoded > 0x80)
\x00-\x08 - non-printable control chars 0 - 8 decimal
\x0E-\x1F - more non-printable control chars 14 - 31 decimal
\x80-\xFF - chars 128 - 255 decimal
-c - print count of matching lines instead of lines
-P - perl style regexps

Instead of -c you may prefer to use -n (and optionally -b) or -l
-n, --line-number
-b, --byte-offset
-l, --files-with-matches

E.g. a practical example of using find to grep all files under the current directory:

LC_ALL=C find . -type f -exec grep -c -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" {} + 

You may wish to adjust the grep at times, e.g. the BS (0x08, backspace) char is used in some printable files, or you may want to exclude the VT (0x0B, vertical tab). The BEL (0x07) and ESC (0x1B) chars can also be deemed printable in some cases.
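
For example, a variant that tolerates BEL and BS as well (a sketch only; adjust the ranges to your own data):

LC_ALL=C grep -c -P -n "[\x00-\x06\x0E-\x1F\x80-\xFF]" *

Here \x00-\x06 stops short of BEL (0x07) and BS (0x08), so files using those chars are no longer flagged for them.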

Non-printable ASCII chars (** marks PRINTABLE but CONTROL chars that are useful to exclude sometimes):

Dec Hex Ctrl Char description            Dec Hex Ctrl Char description
  0  00  ^@  NULL                         16  10  ^P  DATA LINK ESCAPE (DLE)
  1  01  ^A  START OF HEADING (SOH)       17  11  ^Q  DEVICE CONTROL 1 (DC1)
  2  02  ^B  START OF TEXT (STX)          18  12  ^R  DEVICE CONTROL 2 (DC2)
  3  03  ^C  END OF TEXT (ETX)            19  13  ^S  DEVICE CONTROL 3 (DC3)
  4  04  ^D  END OF TRANSMISSION (EOT)    20  14  ^T  DEVICE CONTROL 4 (DC4)
  5  05  ^E  END OF QUERY (ENQ)           21  15  ^U  NEGATIVE ACKNOWLEDGEMENT (NAK)
  6  06  ^F  ACKNOWLEDGE (ACK)            22  16  ^V  SYNCHRONIZE (SYN)
  7  07  ^G  BEEP (BEL)                   23  17  ^W  END OF TRANSMISSION BLOCK (ETB)
  8  08  ^H  BACKSPACE (BS)**             24  18  ^X  CANCEL (CAN)
  9  09  ^I  HORIZONTAL TAB (HT)**        25  19  ^Y  END OF MEDIUM (EM)
 10  0A  ^J  LINE FEED (LF)**             26  1A  ^Z  SUBSTITUTE (SUB)
 11  0B  ^K  VERTICAL TAB (VT)**          27  1B  ^[  ESCAPE (ESC)
 12  0C  ^L  FORM FEED (FF)**             28  1C  ^\  FILE SEPARATOR (FS) RIGHT ARROW
 13  0D  ^M  CARRIAGE RETURN (CR)**       29  1D  ^]  GROUP SEPARATOR (GS) LEFT ARROW
 14  0E  ^N  SHIFT OUT (SO)               30  1E  ^^  RECORD SEPARATOR (RS) UP ARROW
 15  0F  ^O  SHIFT IN (SI)                31  1F  ^_  UNIT SEPARATOR (US) DOWN ARROW

UPDATE: I had to revisit this recently. And, YMMV depending on terminal settings/solar weather forecast, BUT . . I noticed that grep was not finding many Unicode or extended characters. Even though intuitively they should match the range 0x80 to 0xff, 3- and 4-byte Unicode characters were not matched. ??? Can anyone explain this? YES. @frabjous asked and @calandoa explained that LC_ALL=C should be used to set the locale for the command to make grep match.

E.g. my locale, with LC_ALL empty:

$ locale
LANG=en_IE.UTF-8
LC_CTYPE="en_IE.UTF-8"
.
.
LC_ALL=

grep with LC_ALL empty matches 2-byte encoded chars but not 3- and 4-byte encoded ones:

$ grep -P -n "[\x00-\x08\x0E-\x1F\x80-\xFF]" notes_unicode_emoji_test
5:© copyright c2a9
7:call  underscore c2a0
9:CTRL
31:5 © copyright
32:7 call  underscore

grep with LC_ALL=C does seem to match all extended characters that you would want:

$ LC_ALL=C grep --color='auto' -P -n "[\x80-\xFF]" notes_unicode_emoji_test  
1:���� unicode dashes e28090
3:��� Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5:� copyright c2a9
7:call� underscore c2a0
11:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29:1 ���� unicode dashes
30:3 ��� Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31:5 � copyright
32:7 call� underscore
33:11 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other
34:52 LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other
81:LIVE��E! ���������� ���� ���������� ���� �� �� ���� ����  YEOW, mix of japanese and chars from other

THIS perl match (partially found elsewhere on stackoverflow) OR the inverse grep in the top answer DO seem to find ALL the ~weird~ and ~wonderful~ "non-ASCII" characters without setting the locale:

$ grep --color='auto' -P -n "[^\x00-\x7F]" notes_unicode_emoji_test

$ perl -ne 'print "$. $_" if m/[\x00-\x08\x0E-\x1F\x80-\xFF]/' notes_unicode_emoji_test  

1 ‐‐ unicode dashes e28090
3 💘 Heart With Arrow Emoji - Emojipedia == UTF8? f09f9298
5 © copyright c2a9
7 call  underscore c2a0
9 CTRL-H CHARS URK URK URK 
11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ  YEOW, mix of japanese and chars from other e38182 e38184 . . e0a487
29 1 ‐‐ unicode dashes
30 3 💘 Heart With Arrow Emoji - Emojipedia == UTF8 e28090
31 5 © copyright
32 7 call  underscore
33 11 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ  YEOW, mix of japanese and chars from other
34 52 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ  YEOW, mix of japanese and chars from other
73 LIVE‐E! あいうえお かが アイウエオ カガ ᚊ ᚋ ซฌ आइ  YEOW, mix of japanese and chars from other


The answer to why grep doesn't match characters encoded in more than 2 bytes is thanks to @calandoa and @frabjous in the comments above on the question: use LC_ALL=C before the grep command.
Thanks so much for bothering to post an answer buried under 800 other upvotes! My problem was a 0x02 character. You may want to put that "practical example of use" near the top, since you really don't need to read the whole post to just see if that's your problem.
I know, really old answer and excruciating detail, but correct and useful for me and, I hope, others too. You are right, I added the TL;DR at the top.
bfontaine

The following code works:

find /tmp | perl -ne 'print if /[^[:ascii:]]/'

Replace /tmp with the name of the directory you want to search through. Note that this tests the file names that find prints, not the file contents.
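
If you want to scan file contents rather than file names, a similar one-liner can be pointed at the files themselves (a sketch; the path and options are up to you):

find /tmp -type f -exec perl -ne 'print "$ARGV:$_" if /[^[:ascii:]]/' {} +

In Perl, $ARGV holds the name of the file currently being read, so each matching line is prefixed with its file.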


On a Mac, this works, while most of the grep-based ones don't.
Kajukenbo

This method should work with any POSIX-compliant version of awk and iconv. We can take advantage of file and tr as well.

curl is not POSIX, of course.

Solutions above may be better in some cases, but they seem to depend on GNU/Linux implementations or additional tools.

Get a sample file:

$ curl -LsO http://gutenberg.org/files/84/84-0.txt

$ file 84-0.txt

84-0.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

Search for UTF-8 characters:

$ awk '/[\x80-\xFF]/ { print }' 84-0.txt

or non-ASCII

$ awk '/[^[:ascii:]]/ { print }' 84-0.txt

Convert UTF-8 to ASCII, removing problematic characters:

$ iconv -c -t ASCII 84-0.txt > 84-ascii.txt

Check it:

$ file 84-ascii.txt

84-ascii.txt: ASCII text, with CRLF line terminators

Tweak it:

$ tr -d '\015' < 84-ascii.txt | file -

/dev/stdin: ASCII text

YMMV


The awk solution works on BSD.
dma_k

Strangely, I had to do this today! I ended up using Perl because I couldn't get grep/egrep to work (even in -P mode). Something like:

cat blah | perl -ne '/\xCA\xFE\xBA\xBE/ && print "found"'

For Unicode characters (like \u2212 in the example below), use this:

find . ... -exec perl -CA -e '$ARGV = $ARGV[0]; open IN, $ARGV; binmode(IN, ":utf8"); binmode(STDOUT, ":utf8"); while (<IN>) { next unless /\N{U+2212}/; print "$ARGV: $&: $_"; exit }' '{}' \;
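
A shorter sketch of the same idea, assuming a reasonably recent perl: -CSD turns on UTF-8 for the standard streams and for files opened by the implicit -n loop, which replaces the manual open/binmode calls (the file glob here is hypothetical):

find . -name '*.xml' -exec perl -CSD -ne 'if (/\N{U+2212}/) { print "$ARGV: $&: $_"; exit }' '{}' \;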

In this scenario you probably need to check the locales as mentioned in stackoverflow.com/a/3208902/7809404
miken32

It could be interesting to know how to search for one Unicode character. This command can help; you only need to know its code point:

grep -v $'\u200d'
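
Note that $'\u200d' is shell syntax, not grep syntax: bash 4.2 and newer expands \u200d to that character's bytes before grep ever sees them. Dropping the -v selects matching lines instead of excluding them, e.g. (hypothetical file):

grep -n $'\u00e9' file.txt

This prints lines containing é (U+00E9), with line numbers.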

I'm not really an expert, but I know enough to know that's not a UTF-8 representation; it's UTF-16, or maybe UTF-32, or UCS-2. For a 2-byte codepoint those three might all be the same.
noabody

Finding all non-ASCII characters gives the impression that one is either looking for Unicode strings or intends to strip said characters individually.

For the former, try one of these (variable file is used for automation):

 file=file.txt ; LC_ALL=C grep -Piao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8

 file=file.txt ; pcregrep -iao '[\x80-\xFF\x20]{7,}' $file | iconv -f $(uchardet $file) -t utf-8

 file=file.txt ; pcregrep -iao '[^\x00-\x19\x21-\x7F]{7,}' $file | iconv -f $(uchardet $file) -t utf-8

Vanilla grep doesn't work correctly without LC_ALL=C as noted in the previous answers.

The ASCII range is \x00-\x7F and space is \x20; since strings contain spaces, the negated class leaves \x20 out so that spaces still match.

The non-ASCII range is \x80-\xFF; since strings contain spaces, the positive class adds \x20 so that spaces match too.

A string is presumed to be at least 7 consecutive characters within the range, hence the {7,}.

For shell-readable output, uchardet $file returns a guess at the file encoding, which is passed to iconv for automatic conversion.
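
A hypothetical end-to-end run, assuming uchardet guesses WINDOWS-1252 for the file:

$ uchardet file.txt
WINDOWS-1252
$ LC_ALL=C grep -Piao '[\x80-\xFF\x20]{7,}' file.txt | iconv -f WINDOWS-1252 -t utf-8

The last command prints the extracted byte strings re-encoded as UTF-8.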


This is very useful due to the mention of the uchardet command. Thanks for that heads-up!
RARE Kpop Manifesto

If you're trying to grep UTF-8-compliant multibyte characters, use this:

(                     [\302-\337][\200-\277]|
                [\340][\240-\277][\200-\277]|
                [\355][\200-\237][\200-\277]|
  [\341-\354\356-\357][\200-\277][\200-\277]|
     [\360][\220-\277][\200-\277][\200-\277]|
[\361-\363][\200-\277][\200-\277][\200-\277]|
     [\364][\200-\217][\200-\277][\200-\277]  ) 

 * Please delete all newlines, spaces, and tabs in between (..) before use.

 * Feel free to use interval quantifiers like {1,3} to shorten
   the redundant listings of [\200-\277], but don't loosen them
   to [\200-\277]+, as that might match invalid encodings
   due to either insufficient or too many continuation bytes.

 * Although some historical UTF-8 references consider 5- and
   6-byte encodings to be valid, as of Unicode 13 only
   encodings of up to 4 bytes are considered valid.

I've tested this pattern even against random binary files, and it reports the same multibyte character count as GNU wc.

Add another [\000-\177]| at the front, just after the opening (, if you need to match full UTF-8 strings including ASCII.

Yes, this regex is truly hideous, but it's also POSIX-compliant, cross-language and cross-platform compatible (it doesn't depend on any special regex notation), (should be) fully UTF-8 compliant (Unicode 13), and completely independent of locale settings.

If you're running grep with this, please use grep -P.
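
For example, assembled onto one line to count multibyte UTF-8 sequences in a file (my own assembly of the alternation above; the file name is hypothetical):

LC_ALL=C grep -oaP '([\302-\337][\200-\277]|[\340][\240-\277][\200-\277]|[\355][\200-\237][\200-\277]|[\341-\354\356-\357][\200-\277][\200-\277]|[\360][\220-\277][\200-\277][\200-\277]|[\361-\363][\200-\277][\200-\277][\200-\277]|[\364][\200-\217][\200-\277][\200-\277])' file.txt | wc -l

Each output line is one decoded multibyte sequence, so wc -l gives the count.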

If you just need the other bytes, then others have suggested patterns already.

If you need the 11,172 characters of NFC-composed Korean hangul, it's

(([\352][\260-\277]|[\353\354][\200-\277]|
 [\355][\200-\235])[\200-\277]|[\355][\236][\200-\243])

And if you need Japanese hiragana+katakana, it's

([\343]([\201-\203][\200-\277]|[\207][\260-\277]))
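
A quick byte-mode check of the kana pattern (a sketch; assumes a UTF-8 terminal and GNU grep):

echo 'あア' | LC_ALL=C grep -oaP '[\343]([\201-\203][\200-\277]|[\207][\260-\277])' | wc -l

This should print 2: the hiragana あ and the katakana ア each match once.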