
Regex, every non-alphanumeric character except white space or colon

How can I do this?

Basically, I am trying to match all kinds of miscellaneous characters such as ampersands, semicolons, dollar signs, etc.

/[^a-zA-Z0-9\s\:]*/

Tudor Constantin
[^a-zA-Z\d\s:]

\d - numeric class

\s - whitespace

a-zA-Z - matches all the letters

^ - negates them all, so you match anything that is not a letter, a digit, whitespace, or a colon
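
As a quick check of the class described above, here is a minimal JavaScript sketch. The sample string and variable names are made up for illustration, and the g flag is an addition so that every match is returned rather than only the first.

```javascript
// Illustrative only: the sample string and names are assumptions, not from the question.
const special = /[^a-zA-Z\d\s:]/g;
const sample = "Tom & Jerry: $5; 100%";

// Letters, digits, whitespace and the colon are skipped; everything else is matched.
const matches = sample.match(special);
console.log(matches); // ["&", "$", ";", "%"]
```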


That's what I was looking at also :)) - I have to promote your perfect answer
The only thing that I found is that this removes special characters like é or ã. I would prefer [^\w\d\s:].
Downvoted because this will not catch non-Latin characters, nor "special" Latin characters.
\d and \s are Perl extensions which are typically not supported by older tools like grep, sed, tr, lex, etc.
Peter Mortensen

This should do it:

[^a-zA-Z\d\s:]

The rest either check for space but not whitespace or have the negation in the wrong spot to actually negate.
\w catches underscores also - which is a non-alphanumeric character
Aha! I shall modify -- I didn't know that. I expect it works differently for different engines, but might as well give the OP the safe answer.
Downvoted because this will not catch non-Latin characters, nor "special" Latin characters.
Nick F

If you want to treat accented Latin characters (e.g. à Ñ) as normal letters (i.e. avoid matching them too), you'll also need to include the appropriate Unicode range (\u00C0-\u00FF) in your regex, so it would look like this:

/[^a-zA-Z\d\s:\u00C0-\u00FF]/g

^ negates what follows

a-zA-Z matches upper and lower case letters

\d matches digits

\s matches white space (if you only want to match spaces, replace this with a space)

: matches a colon

\u00C0-\u00FF matches the Unicode range for accented Latin characters.

nb. Unicode range matching might not work in all regex engines, but the above certainly works in JavaScript (as seen in this pen on CodePen).

nb2. If you're not bothered about matching underscores, you could replace a-zA-Z\d with \w, which matches letters, digits, and underscores.
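
To see what the \u00C0-\u00FF range does and does not cover, here is a small sketch. The sample strings are invented for this example and written with \u escapes so the code points are unambiguous. Treat it as a rough check rather than a complete test.

```javascript
// Sketch only: sample strings are invented; \u escapes keep the code points explicit.
const specialLatin1 = /[^a-zA-Z\d\s:\u00C0-\u00FF]/g;

// "Café & Ñandú: 50%": é (U+00E9), Ñ (U+00D1) and ú (U+00FA) fall inside the range.
const cleaned = "Caf\u00E9 & \u00D1and\u00FA: 50%".replace(specialLatin1, "");
console.log(cleaned); // "Café  Ñandú: 50" (only & and % are removed)

// "Łódź": Ł (U+0141) and ź (U+017A) lie outside the range, so they are matched.
const outside = "\u0141\u00F3d\u017A".match(specialLatin1);
console.log(outside); // ["Ł", "ź"]

// "2×3": the multiplication sign (U+00D7) lies inside the range, so nothing matches.
const timesSign = "2\u00D73".match(specialLatin1);
console.log(timesSign); // null
```

So the range is a reasonable shortcut for Western European text, but it is neither limited to letters nor complete for other Latin-script languages.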


This range contains some characters which are not alphanumeric (U+00D7 and U+00F7), and excludes a lot of valid accented characters from non-Western languages like Polish, Czech, Vietnamese etc.
Upvoted for the description of each part of the RegEx.
Peter Mortensen

Try this:

[^a-zA-Z0-9 :]

JavaScript example:

"!@#$%* ABC def:123".replace(/[^a-zA-Z0-9 :]/g, ".") // "...... ABC def:123"

See an online example:

http://jsfiddle.net/vhMy8/


Downvoted because this will not catch non-Latin characters, nor "special" Latin characters.
It is easy to downvote an answer, yet more difficult to provide constructive information to the board, e.g. how does one then catch non-Latin characters or "special" Latin characters? By my count, you have so far downvoted 3 answers for the same reason, and in my opinion for a rather minor tweak. For example, I am here to find a regex for exactly what is discussed in these answers. I don't care about character sets that will not be used in my application. Law of diminishing returns.
Aaron: it might be a "minor tweak" to a US citizen, but highly relevant for... the rest of this planet.
[^a-zA-Z0-9 :] can be replaced with [^\w:]
\w includes underscores also, so keep an eye on that
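
The two classes in the comments above are not quite interchangeable, which a short sketch makes visible. The sample string is made up for illustration.

```javascript
// Illustrative comparison; the sample string is an assumption.
const explicitClass = /[^a-zA-Z0-9 :]/g;
const shorthandClass = /[^\w:]/g;
const text = "a_b c:d!";

// The explicit class matches the underscore but leaves the space alone.
const explicitHits = text.match(explicitClass);
console.log(explicitHits); // ["_", "!"]

// The \w version leaves the underscore alone (it is a word character)
// and matches the space, because the space is no longer listed in the class.
const shorthandHits = text.match(shorthandClass);
console.log(shorthandHits); // [" ", "!"]
```

Writing [^\w :] would restore the space, but the underscore would still be skipped.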
Vasyl Gutnyk

Matches anything that is not alphanumeric or whitespace, and also matches '_':

var reg = /[^\w\s]|_/g;

Peter Mortensen

If you mean "non-alphanumeric characters", try to use this:

var reg = /[^a-zA-Z0-9]/g; // negated character class, same form as [^abc]

Chris Halcrow

In JavaScript:

/[^\w_]/g

^ negation, i.e. select anything not in the following set

\w any word character (i.e. any alphanumeric character, plus underscore)

_ negate the underscore, as it's considered a 'word' character

Usage example - const nonAlphaNumericChars = /[^\w_]/g;


[^\w_] is the same as [^\w] (as _ is a word char), and it is equal to \W.
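
The point in the comment above can be checked with a short sketch. The sample string is invented, and [\W_] is shown only as one possible way to include the underscore among the matches, not as the answer's own pattern.

```javascript
// Illustrative only; the sample string is an assumption.
const input = "snake_case & more!";

// Listing _ inside the negated class changes nothing, since \w already covers it.
const negatedWordAndUnderscore = input.match(/[^\w_]/g);
const nonWord = input.match(/\W/g);
console.log(negatedWordAndUnderscore); // [" ", "&", " ", "!"]
console.log(nonWord);                  // [" ", "&", " ", "!"]

// To match the underscore as well, put it in a positive class next to \W.
const nonWordOrUnderscore = input.match(/[\W_]/g);
console.log(nonWordOrUnderscore); // ["_", " ", "&", " ", "!"]
```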
Peter Mortensen

This regex works for C#, PCRE and Go to name a few.

According to RegexBuddy, it doesn't work in JavaScript on Chrome, but there's already an example for that here.

The main part of this is:

\p{L}

which, also written \p{Letter}, matches any kind of letter from any language.

The full regex itself: [^\w\d\s:\p{L}]

Example: https://regex101.com/r/K59PrA/2
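
For JavaScript specifically, engines that implement the ECMAScript 2018 Unicode property escapes accept \p{L} when the u flag is set. The following is a minimal sketch under that assumption (a recent Node.js or current browser); the sample string is invented and written with \u escapes to keep the code points explicit.

```javascript
// Assumes an engine with ES2018 Unicode property escapes; the u flag is required for \p{L}.
const specialUnicode = /[^\w\d\s:\p{L}]/gu;

// "Łódź & São Paulo: 100%!" contains letters outside ASCII and outside U+00C0-U+00FF.
const phrase = "\u0141\u00F3d\u017A & S\u00E3o Paulo: 100%!";

// Every letter in any script is skipped, so only the punctuation is matched.
const unicodeHits = phrase.match(specialUnicode);
console.log(unicodeHits); // ["&", "%", "!"]
```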


This is the only answer here which deals correctly with Unicode accented alphabetics in a proper way. Sadly, not all regex engines support this facility (even Python lacks it, as of 3.8, even though its regex engine is ostensibly PCRE-based).
I'll remove Python from the answer, I thought I tested that but apparently not. Thanks for pointing that out.
Peter Mortensen

Try this:

^[^a-zA-Z\d\s:]*$

This has worked for me... :)


This seems to repeat the accepted answer from 2011. The ^ and $ anchors confine it to matching entire lines, and the * quantifier means it also matches empty lines.
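
The effect of the anchors and the * quantifier can be seen in a short sketch. The test strings are made up for illustration, and no m flag is used, so ^ and $ refer to the whole string here.

```javascript
// Illustrative only; the test strings are assumptions.
const anchored = /^[^a-zA-Z\d\s:]*$/;

const matchesEmpty = anchored.test("");       // true: * allows zero characters
const matchesSymbols = anchored.test("&$;");  // true: the whole string is special characters
const matchesMixed = anchored.test("a & b");  // false: the string also contains letters and spaces
console.log(matchesEmpty, matchesSymbols, matchesMixed);

// Without anchors or *, the class finds special characters anywhere in the string.
const found = "a & b".match(/[^a-zA-Z\d\s:]/g);
console.log(found); // ["&"]
```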
its_ me

[^\w\s:]

A character set matching any character which is not:

Alphanumeric

Whitespace

Colon