On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find
and then do a grep to print the non-ASCII characters, and then do a wc -l
to find the number. It doesn't have to be grep; I can use any standard Unix tool that supports regular expressions, such as Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?
/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]
No, [^\x20-\x7E]
is not the complement of ASCII; it only excludes the printable range.
This matches everything outside the full ASCII range:
[^\x00-\x7F]
Otherwise, it will also catch newlines and other control characters that are part of the ASCII table!
You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
You can use this regex:
[^\w \xC0-\xFF]
If asked for options, use Multiline.
[^\x00-\x7F]
and [^[:ascii:]]
do not match ASCII control bytes, so strings can sometimes be the better option. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g'
will do odd things to your terminal, whereas strings test.torrent
will behave.
To validate that a text box accepts ASCII only, use this pattern:
[\x00-\x7F]+
I use [^\t\r\n\x20-\x7E]+
and that seems to be working fine.
You don't really need a regex.
printf "%s\n" *[!\ -~]*
This will show file names with control characters in their names, too, but I consider that a feature.
If you don't have any matching files, the glob will expand to just itself, unless you have nullglob
set. (The expression does not match itself, so technically, this output is unambiguous.)
shopt -s nullglob dotglob globasciiranges
skips the non-matching patterns, includes dotted filenames like .tmp§
and does not depend on the current locale. I mean setting it temporarily just for this particular command; otherwise the default settings are fine.
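One way to scope those options to a single command is to run the glob in a child bash process, so the calling shell's settings stay untouched. A minimal sketch, assuming bash 4.3 or later (for globasciiranges); list_non_ascii_names is a hypothetical name:

```shell
# Hypothetical helper: list names in a directory that contain a character
# outside the space..tilde range (non-ASCII and control characters).
list_non_ascii_names() {
  bash -c '
    shopt -s nullglob dotglob globasciiranges
    cd -- "$1" || exit 1
    names=( *[!\ -~]* )
    if (( ${#names[@]} > 0 )); then
      printf "%s\n" "${names[@]}"
    fi
  ' _ "${1:-.}"
}
# Count them: list_non_ascii_names /some/dir | wc -l
```

The explicit length check avoids the empty line that printf would print when nullglob leaves no matches. Counting with wc -l is off for names that themselves contain a newline.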
This turned out to be very flexible and extensible. $field =~ s/[^\x00-\x7F]//g ; # thus all non ASCII or specific items in question could be cleaned. Very nice either in selection or pre-processing of items that will eventually become hash keys.
^
is valid in PCRE.
:print:
won't work in a UTF8 terminal? This works for me in pry in a UTF8 terminal:
27.chr =~ /[^[:print:]]/
rename 's/[^\x00-\x7F]//g' *
(you can use -n
to check the renames are ok first).