Sounds like you almost knew what you wanted to do already; you basically defined it as a regex.
preg_replace("/[^A-Za-z0-9 ]/", '', $string);
For Unicode characters, it is:
preg_replace("/[^[:alnum:][:space:]]/u", '', $string);
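To see how the two patterns above differ, here is a minimal sketch. The sample string is made up for illustration, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical sample input: an accented letter, a tab, and punctuation.
$string = "Café\tno. 9!";

// ASCII-only class: the é and the tab are removed, the plain space is kept.
$ascii = preg_replace("/[^A-Za-z0-9 ]/", '', $string);

// Unicode-aware class: é is kept, and [:space:] keeps the tab as well.
$unicode = preg_replace("/[^[:alnum:][:space:]]/u", '', $string);

var_dump($ascii);   // "Cafno 9"
var_dump($unicode); // "Café\tno 9"
```

Note that [:space:] keeps every kind of whitespace (tabs, newlines), not just the space character, so use a literal space in the class if that is all you want to allow.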
\w includes \d, so the \d is unnecessary. Also, this is wrong because it will leave underscores in the resulting string (the underscore is also included in \w).
Is the i flag really necessary here, since [:alnum:] already covers both cases?
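Both comments can be checked with a small sketch. The pattern /[^\w ]/ is only my guess at the kind of pattern being criticised, and the input string is made up.

```php
<?php
// Hypothetical input for illustration.
$string = 'snake_case-Name 42';

// \w already covers digits and the underscore, so the underscore survives.
$withWord = preg_replace('/[^\w ]/', '', $string);

// [:alnum:] covers upper and lower case without the i flag,
// and it does not include the underscore.
$withAlnum = preg_replace('/[^[:alnum:] ]/', '', $string);

var_dump($withWord);  // "snake_caseName 42"
var_dump($withAlnum); // "snakecaseName 42"
```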
A regular expression is your answer.
$str = preg_replace('/[^a-z\d ]/i', '', $str);
The i stands for case insensitive.
^ at the start of a character class negates it, so the class matches any character that is not listed.
\d matches any digit.
a-z matches all characters between a and z. Because of the i parameter you don't have to specify a-z and A-Z.
After \d there is a space, so spaces are allowed in this regex.
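Here is a short sketch of that pattern in action, with and without the i flag. The input string is just an example I made up.

```php
<?php
// Hypothetical input for illustration.
$str = 'Hello, World! Room #42_b';

// With the i flag, a-z also covers A-Z.
$clean = preg_replace('/[^a-z\d ]/i', '', $str);

// Without the i flag, uppercase letters are outside the class and get removed.
$lowerOnly = preg_replace('/[^a-z\d ]/', '', $str);

var_dump($clean);     // "Hello World Room 42b"
var_dump($lowerOnly); // "ello orld oom 42b"
```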
If you need to support other languages, instead of the typical A-Z, you can use the following:
preg_replace('/[^\p{L}\p{N} ]+/u', '', $string);
[^\p{L}\p{N} ] defines a negated character class (it matches any character that is not listed) of:
\p{L}: a letter from any language.
\p{N}: a numeric character in any script.
 : a space character.
+ greedily matches the character class between 1 and unlimited times.
u treats the pattern and the subject as UTF-8; without it, multibyte characters are matched byte by byte and can be mangled.
This will preserve letters and numbers from other languages and scripts as well as A-Z:
preg_replace('/[^\p{L}\p{N} ]+/u', '', 'hello-world'); // helloworld
preg_replace('/[^\p{L}\p{N} ]+/u', '', 'abc@~#123-+=öäå'); // abc123öäå
preg_replace('/[^\p{L}\p{N} ]+/u', '', '你好世界!@£$%^&*()'); // 你好世界
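If you want to reuse this, one way is to wrap it in a small function. This is only a sketch: the function name is hypothetical, it assumes the input is meant to be UTF-8, and it uses the u flag so that multibyte characters are matched as whole characters.

```php
<?php
// Hypothetical helper name; assumes $text is supposed to be valid UTF-8.
function keepLettersDigitsSpaces(string $text): string
{
    $clean = preg_replace('/[^\p{L}\p{N} ]+/u', '', $text);

    // preg_replace() returns null on error, e.g. malformed UTF-8 with the u flag.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

var_dump(keepLettersDigitsSpaces('abc@~#123-+=öäå'));      // "abc123öäå"
var_dump(keepLettersDigitsSpaces('你好世界!@£$%^&*()')); // "你好世界"
```

Checking for null matters because with the u flag an invalid byte sequence makes preg_replace() fail instead of returning a cleaned string.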
Note: This is a very old, but still relevant question. I am answering purely to provide supplementary information that may be useful to future visitors.
You need the u flag at the end of the regex: /[^\p{L}\p{N} ]+/u
Here's a really simple regex for that:
\W|_
and use it as you need (with a forward slash / delimiter).
preg_replace("/\W|_/", '', $string);
Test it here with this great tool that explains what the regex is doing:
You need the /u flag, otherwise non-ASCII letters are also removed.
[\W_]+
$string = preg_replace("/[\W_]+/u", '', $string);
It selects every character that is not A-Z, a-z, or 0-9 and deletes it.
See example here: https://regexr.com/3h1rj
\W is the inverse of \w, which matches the characters A-Za-z0-9_. So \W will match any character that is not A-Za-z0-9_, and [\W_] additionally matches the underscore, so everything except letters and digits is removed. The [] delimits a character set. The + means one or more characters; it does not change the result here, but it lets a run of unwanted characters be removed in a single match. The u flag adds Unicode support, meaning letters and digits outside the ASCII range, such as ª²³µ, are not removed. Example of various usages with Unicode and ASCII characters: 3v4l.org/hSVV5
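A minimal sketch of the difference the u flag makes with [\W_]+, using a made-up input string and assuming PHP's default "C" locale:

```php
<?php
// Hypothetical input mixing ASCII, accented letters, and a superscript digit.
$string = 'foo_bar-ñandú 2³';

// Without u: only ASCII letters and digits count as word characters.
$ascii = preg_replace('/[\W_]+/', '', $string);

// With u: Unicode letters and digits are word characters too.
$unicode = preg_replace('/[\W_]+/u', '', $string);

var_dump($ascii);   // "foobarand2"
var_dump($unicode); // "foobarñandú2³"
```

Neither variant keeps spaces; add a space inside a negated class (as in the other answers) if you need them.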
preg_replace("/\W+/", '', $string)
You can test it here: http://regexr.com/
preg_replace('/[^A-Za-z0-9 ]/', '', $string);