How to remove non-alphanumeric characters?

I need to remove every character from a string that is not in the set a-z, A-Z, 0-9 and is not a space.

Does anyone have a function to do this?


Louis

Sounds like you almost knew what you wanted to do already; you basically defined it as a regex.

preg_replace("/[^A-Za-z0-9 ]/", '', $string);
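
As a minimal sketch of how the line above could be wrapped for reuse (the function name strip_non_alnum is hypothetical and not part of the answer; the null fallback is my own assumption about how errors should be handled):

```php
<?php
// Hypothetical helper: keeps only ASCII letters, digits, and spaces.
function strip_non_alnum(string $string): string
{
    // [^...] is a negated character class: it matches any single character
    // that is NOT A-Z, a-z, 0-9, or a space. Each match is replaced with ''.
    // preg_replace() returns null on error, so fall back to an empty string.
    return preg_replace('/[^A-Za-z0-9 ]/', '', $string) ?? '';
}

echo strip_non_alnum('Hello, World! 123_abc'), "\n"; // Hello World 123abc
```

The pattern is written in single quotes here. For this particular pattern the result is the same with double quotes, since it contains no $ and no PHP escape sequences.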

zuk1: regexbuddy is a great help with that
Here's an example if you want to include the hyphen as an allowed character. I needed this because I needed to strip out disallowed characters from a Moodle username, based on email addresses: preg_replace("/[^a-z0-9_.@\-]/", '', $string);
Would this work exactly the same with apostrophes (single-quotes) around the regular expression, instead of quotation marks (double-quotes)? E.g: preg_replace('/[^A-Za-z0-9 ]/', '', $string);
We would like an explanation of this :). People come here to see why it is the way it is. Please consider explaining the regex too. Thanks
What if we want to keep accented characters?
voondo

For Unicode characters, use:

preg_replace("/[^[:alnum:][:space:]]/u", '', $string);

Hi voondo, what's with the /ui thing? What do you call it? Can anyone please shed some light on it? Thank you.
For clarification, they're called flags. They're put after the closing delimiter (in this case it's "/", but it could be "~" or "@" or whatever character you want to use as long as the opening and closing delimiters are the same) and change the behavior of the expression.
Btw, \w includes \d and so the \d is unnecessary. Also, this is wrong because it will also leave underscores in the resulting string (which is also included in \w).
There's still an error in this, the character classes need to be terminated with ':]' so the correct line would be: preg_replace("/[^[:alnum:][:space:]]/ui", '', $string);
Is the i flag really necessary here since [:alnum:] already covers both cases?
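
To illustrate the comments above about delimiters and flags, here is a small sketch. The sample input is made up, and it assumes the source file and the string are UTF-8 encoded. It compares the same pattern with two different delimiters, and with and without the i flag:

```php
<?php
// Assumption: this file is saved as UTF-8.
$input = 'Crème brûlée: 2 for £5!';

// Same pattern with two different delimiters; the flags follow the closing one.
$withSlash = preg_replace('/[^[:alnum:][:space:]]/u', '', $input);
$withTilde = preg_replace('~[^[:alnum:][:space:]]~u', '', $input);

// Adding the i flag makes no difference here, because [:alnum:]
// already covers both uppercase and lowercase letters.
$withI = preg_replace('/[^[:alnum:][:space:]]/ui', '', $input);

var_dump($withSlash === $withTilde); // bool(true)
var_dump($withSlash === $withI);     // bool(true)
```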
topher

A regular expression is your answer.

$str = preg_replace('/[^a-z\d ]/i', '', $str);

The i stands for case insensitive.

^ at the start of a character class negates it, so [^...] matches any character that is not listed inside the brackets.

\d matches any digit.

a-z matches all characters between a and z. Because of the i parameter you don't have to specify a-z and A-Z.

After \d there is a space, so spaces are allowed in this regex.
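
A short sketch of the pieces described above, using a made-up input string. The second call drops the i flag only to show what that flag contributes:

```php
<?php
$str = 'Order #42: Ready!';

// [^...] negates the class, a-z with the i flag covers both cases,
// \d matches digits, and the literal space keeps spaces.
$clean = preg_replace('/[^a-z\d ]/i', '', $str);
echo $clean, "\n"; // Order 42 Ready

// Without the i flag, uppercase letters are no longer in the class,
// so they are removed too.
$lowerOnly = preg_replace('/[^a-z\d ]/', '', $str);
echo $lowerOnly, "\n"; // rder 42 eady
```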


We would like an explanation of this :). People come here to see why it is the way it is. Please consider explaining the regex too! Not everyone is advanced enough to know what you wrote there without an explanation. Thanks
@PratikCJoshi The i stands for case insensitive. ^ at the start of a character class negates it, so the class matches any character that is not listed. \d matches any digit. a-z matches all characters between a and z. Because of the i parameter you don't have to specify a-z and A-Z. After \d there is a space, so spaces are allowed in this regex.
People don't read comments as answer. Please update answer!
Jonathon

If you need to support other languages, instead of the typical A-Z, you can use the following:

preg_replace('/[^\p{L}\p{N} ]+/u', '', $string);

[^\p{L}\p{N} ] defines a negated character class (it matches any character that is not listed), made up of:

\p{L}: a letter from any language.

\p{N}: a numeric character in any script.

' ' (a literal space): a space character.

+ greedily matches the character class between 1 and unlimited times.

This will preserve letters and numbers from other languages and scripts as well as A-Z:

preg_replace('/[^\p{L}\p{N} ]+/u', '', 'hello-world'); // helloworld
preg_replace('/[^\p{L}\p{N} ]+/u', '', 'abc@~#123-+=öäå'); // abc123öäå
preg_replace('/[^\p{L}\p{N} ]+/u', '', '你好世界!@£$%^&*()'); // 你好世界

Note: This is a very old, but still relevant question. I am answering purely to provide supplementary information that may be useful to future visitors.


Works for me if I add unicode u flag at the end of the regex -- /[^\p{L}\p{N} ]+/u
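
The following sketch checks the answer's three sample strings against the pattern with the u flag, as the comment above suggests. It assumes the file and the strings are UTF-8 encoded; the variable names are illustrative only.

```php
<?php
// Assumption: this file is saved as UTF-8.
$pattern = '/[^\p{L}\p{N} ]+/u';

// Input => expected output, taken from the examples in the answer.
$samples = [
    'hello-world'       => 'helloworld',
    'abc@~#123-+=öäå'   => 'abc123öäå',
    '你好世界!@£$%^&*()' => '你好世界',
];

$results = [];
foreach ($samples as $input => $expected) {
    $actual = preg_replace($pattern, '', $input);
    $results[$input] = $actual;
    printf("%s => %s (%s)\n", $input, $actual, $actual === $expected ? 'ok' : 'mismatch');
}

// Without the u flag the subject is processed byte by byte, so multi-byte
// UTF-8 characters can be split apart. Compare this against the result above.
$withoutFlag = preg_replace('/[^\p{L}\p{N} ]+/', '', '你好世界!');
var_dump($withoutFlag === '你好世界');
```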
scrollup

Here's a really simple regex for that:

\W|_

Use it wherever you need it, wrapped in forward-slash (/) delimiters:

preg_replace("/\W|_/", '', $string);

Test it here with this great tool that explains what the regex is doing:

http://www.regexr.com/


You still need the /u flag, otherwise non-ASCII letters are also removed.
Neat, but this would also match spaces. If that is wanted, you could probably double the performance by using a character class with a one-or-more quantifier: [\W_]+
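
As a rough illustration of the difference between the two forms (not a benchmark; the input string is made up), the fifth argument of preg_replace() reports how many replacements were made:

```php
<?php
$input = 'a--b__c d';

// One match per character: every '-', '_' and ' ' is replaced separately.
$single = preg_replace('/\W|_/', '', $input, -1, $countSingle);

// One match per run of consecutive non-word characters or underscores.
$runs = preg_replace('/[\W_]+/', '', $input, -1, $countRuns);

echo "$single ($countSingle replacements)\n"; // abcd (5 replacements)
echo "$runs ($countRuns replacements)\n";     // abcd (3 replacements)
```

Both forms produce the same string, and both remove the space, which the original question wanted to keep.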
Intacto
[\W_]+

$string = preg_replace("/[\W_]+/u", '', $string);

It selects every character that is not A-Z, a-z, or 0-9 and deletes it.

See example here: https://regexr.com/3h1rj


What does this regex /[\W_]+/u mean?
\W is the inverse of \w, which matches the characters A-Za-z0-9_. So \W matches any character that is not A-Za-z0-9_, and preg_replace removes those matches. The [] delimits a character class, which here adds the underscore to the characters being matched. The + means one or more characters; it does not change the result when replacing with an empty string, but it lets a whole run of characters be removed in a single match. The u flag enables Unicode support, meaning the expression will not remove non-ASCII letters and digits (beyond character code 127) such as ª²³µ. Example of various usages: 3v4l.org/hSVV5 with Unicode and ASCII characters.
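
A small sketch of the u flag's effect on this pattern, assuming a UTF-8 source file and a made-up input string. The result of the call without the flag can depend on the locale, so no output is shown for it:

```php
<?php
// Assumption: this file is saved as UTF-8.
$input = 'naïve_café #1';

// With u, \W treats Unicode letters and digits as word characters,
// so ï and é are kept while '_', ' ' and '#' are removed.
$unicode = preg_replace('/[\W_]+/u', '', $input);
echo $unicode, "\n"; // naïvecafé1

// Without u, the bytes of ï and é are examined one at a time and are
// typically treated as non-word characters (this can vary with the locale).
$bytewise = preg_replace('/[\W_]+/', '', $input);
var_dump($bytewise === $unicode);
```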
PASTAGA
preg_replace("/\W+/", '', $string)

You can test it here: http://regexr.com/


Per @Alex Stevens' answer, this doesn't catch underscores ("_").
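
The point about underscores can be seen in a quick sketch (the input string and variable names are made up for illustration):

```php
<?php
$input = 'snake_case-and spaces';

// \w includes the underscore, so \W+ leaves it in place.
$keepsUnderscore = preg_replace('/\W+/', '', $input);
echo $keepsUnderscore, "\n"; // snake_caseandspaces

// Adding _ to the character class removes it as well.
$dropsUnderscore = preg_replace('/[\W_]+/', '', $input);
echo $dropsUnderscore, "\n"; // snakecaseandspaces
```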