I'm looking for a neat regex solution to replace:
All non-alphanumeric characters
All newlines
All multiple instances of whitespace
with a single space.
For those playing at home (the following does work)
text.replace(/[^a-z0-9]/gmi, " ").replace(/\s+/g, " ");
My thinking is that regex is probably powerful enough to achieve this in one statement. The components I think I'd need are:
[^a-z0-9] - to match non-alphanumeric characters
\s+ - to match any run of whitespace
\r?\n|\r - to match all newlines
/gmi - global, multi-line, case insensitive
However, I can't seem to combine them in the right way (the following doesn't work):
text.replace(/[^a-z0-9]|\s+|\r?\n|\r/gmi, " ");
Input
234&^%,Me,2 2013 1080p x264 5 1 BluRay
S01(*&asd 05
S1E5
1x05
1x5
Desired Output
234 Me 2 2013 1080p x264 5 1 BluRay S01 asd 05 S1E5 1x05 1x5
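To see why the one-statement attempt falls short, here is a small sketch run against the sample input above (the variable names are illustrative and not part of the original post). Without a quantifier, each unwanted character is swapped for its own space, so a run of punctuation leaves a run of spaces behind:

```javascript
// Sample input from the question, with real newlines between the lines.
const text = "234&^%,Me,2 2013 1080p x264 5 1 BluRay\nS01(*&asd 05\nS1E5\n1x05\n1x5";

// Two-step version that works: blank out non-alphanumerics, then collapse whitespace.
const twoStep = text.replace(/[^a-z0-9]/gmi, " ").replace(/\s+/g, " ");

// One-statement attempt: [^a-z0-9] is tried first at every position and
// matches a single character, so "&^%," becomes four separate spaces.
const oneStep = text.replace(/[^a-z0-9]|\s+|\r?\n|\r/gmi, " ");

console.log(twoStep);
console.log(oneStep.includes("    ")); // true: the four-space run survives
```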
Be aware that \W leaves the underscore. A short equivalent for [^a-zA-Z0-9] would be [\W_]:
text.replace(/[\W_]+/g," ");
\W is the negation of the shorthand \w, which stands for [A-Za-z0-9_]: word characters (including the underscore).
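A quick illustration of the underscore difference, using a made-up sample string (the names here are only for demonstration):

```javascript
const sample = "snake_case-and kebab";

// \W does not match "_", so the underscore is kept.
const keepsUnderscore = sample.replace(/\W+/g, " ");

// Adding "_" to the class replaces it as well.
const dropsUnderscore = sample.replace(/[\W_]+/g, " ");

console.log(keepsUnderscore); // "snake_case and kebab"
console.log(dropsUnderscore); // "snake case and kebab"
```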
Jonny 5 beat me to it. I was going to suggest using \W+ without the \s, as in text.replace(/\W+/g, " "). This covers whitespace as well.
\W+, not [W+]
Well, happy new year all!
& and -. Any tips?
Since the [^a-z0-9] character class contains everything that is not alphanumeric, it contains whitespace characters too!
text.replace(/[^a-z0-9]+/gi, " ");
Well I think you just need to add a quantifier to each pattern. Also the carriage-return thing is a little funny:
text.replace(/[^a-z0-9]+|\s+/gmi, " ");
Edit: \s matches \r and \n too.
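As a rough check of the point above, the sketch below (with an invented test string) shows that \s already covers carriage returns and line feeds, and that the \s+ alternative adds nothing once [^a-z0-9]+ has a quantifier, because every whitespace character is also non-alphanumeric:

```javascript
const raw = "a,\r\nb   c\rd";

// Quantified alternation, as suggested in the answer above.
const withAlternation = raw.replace(/[^a-z0-9]+|\s+/gmi, " ");

// The negated class alone gives the same result.
const classOnly = raw.replace(/[^a-z0-9]+/gi, " ");

console.log(withAlternation); // "a b c d"
console.log(classOnly);       // "a b c d"
console.log(/^\s+$/.test("\r\n \t")); // true: \s matches CR, LF, space and tab
```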
Update
Please be aware that the browser landscape changes rapidly; these benchmarks are likely to be woefully out of date, and possibly misleading, by the time you are reading this.
This is an old post of mine, and the other answers are good for the most part. However, I decided to benchmark each solution and another obvious one (just for fun). I wondered whether there was a difference between the regex patterns on different browsers with different sized strings.
So basically I used jsPerf on:
Testing in Chrome 65.0.3325 / Windows 10 0.0.0
Testing in Edge 16.16299.0 / Windows 10 0.0.0
The regex patterns I tested were
/[\W_]+/g
/[^a-z0-9]+/gi
/[^a-zA-Z0-9]+/g
I loaded them up with strings of random characters of the following lengths:
length 5000
length 1000
length 200
Example JavaScript I used: var newstr = str.replace(/[\W_]+/g," ");
Each run consisted of 50 or more samples on each regex, and I ran them 5 times on each browser.
Let's race our horses!
Results
Chars  Pattern            Chrome Ops/Sec  Chrome Deviation  Edge Ops/Sec  Edge Deviation
----------------------------------------------------------------------------------------
5,000  /[\W_]+/g          19,977.80       1.09              10,820.40     1.32
5,000  /[^a-z0-9]+/gi     19,901.60       1.49              10,902.00     1.20
5,000  /[^a-zA-Z0-9]+/g   19,559.40       1.96              10,916.80     1.13
----------------------------------------------------------------------------------------
1,000  /[\W_]+/g          96,239.00       1.65              52,358.80     1.41
1,000  /[^a-z0-9]+/gi     97,584.40       1.18              52,105.00     1.60
1,000  /[^a-zA-Z0-9]+/g   96,965.80       1.10              51,864.60     1.76
----------------------------------------------------------------------------------------
200    /[\W_]+/g          480,318.60      1.70              261,030.40    1.80
200    /[^a-z0-9]+/gi     476,177.80      2.01              261,751.60    1.96
200    /[^a-zA-Z0-9]+/g   486,423.00      0.80              258,774.20    2.15
Truth be told, in both browsers the regex patterns were nearly indistinguishable (taking deviation into consideration). However, I think that if I ran this even more times, the results would become a little clearer (but not by much).
Theoretical scaling for 1 character
Chars  Pattern            Chrome Ops/Sec  Chrome Scaled  Edge Ops/Sec  Edge Scaled
----------------------------------------------------------------------------------
5,000  /[\W_]+/g          19,977.80       99,889,000     10,820.40     54,102,000
5,000  /[^a-z0-9]+/gi     19,901.60       99,508,000     10,902.00     54,510,000
5,000  /[^a-zA-Z0-9]+/g   19,559.40       97,797,000     10,916.80     54,584,000
----------------------------------------------------------------------------------
1,000  /[\W_]+/g          96,239.00       96,239,000     52,358.80     52,358,800
1,000  /[^a-z0-9]+/gi     97,584.40       97,584,400     52,105.00     52,105,000
1,000  /[^a-zA-Z0-9]+/g   96,965.80       96,965,800     51,864.60     51,864,600
----------------------------------------------------------------------------------
200    /[\W_]+/g          480,318.60      96,063,720     261,030.40    52,206,080
200    /[^a-z0-9]+/gi     476,177.80      95,235,560     261,751.60    52,350,320
200    /[^a-zA-Z0-9]+/g   486,423.00      97,284,600     258,774.20    51,754,840
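The Scaled column appears to be the Ops/Sec figure multiplied by the string length, which gives a rough characters-per-second number that can be compared across string sizes. A minimal sketch of that arithmetic (the function name is hypothetical):

```javascript
// Scaled value = operations per second * characters handled per operation.
// Math.round guards against tiny floating-point error in the product.
function scaledCharsPerSecond(opsPerSecond, chars) {
  return Math.round(opsPerSecond * chars);
}

console.log(scaledCharsPerSecond(19977.80, 5000)); // 99889000
console.log(scaledCharsPerSecond(480318.60, 200)); // 96063720
```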
I wouldn't read too much into these results, as there are no really significant differences; all we can really tell is that Edge is slower :o . Additionally, that I was super bored.
Anyway, you can run the benchmark for yourself.
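For anyone who wants a rough local stand-in for the jsPerf run, here is a minimal timing sketch. It is not the original benchmark: the helper names, the iteration count, and the use of Date.now() are all assumptions, and a simple loop like this is far less rigorous than a real benchmarking harness, so treat the numbers as indicative only.

```javascript
// Hypothetical helper: build a string of random printable ASCII characters.
function randomString(length) {
  let out = "";
  for (let i = 0; i < length; i++) {
    // Code points 32..126 cover space through "~".
    out += String.fromCharCode(32 + Math.floor(Math.random() * 95));
  }
  return out;
}

// Hypothetical helper: run the replace `iterations` times and estimate ops/sec.
function opsPerSecond(pattern, str, iterations) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) {
    str.replace(pattern, " ");
  }
  const elapsedMs = Math.max(Date.now() - start, 1); // avoid dividing by zero
  return (iterations * 1000) / elapsedMs;
}

const patterns = [/[\W_]+/g, /[^a-z0-9]+/gi, /[^a-zA-Z0-9]+/g];
for (const length of [5000, 1000, 200]) {
  const str = randomString(length);
  for (const pattern of patterns) {
    console.log(length, String(pattern), Math.round(opsPerSecond(pattern, str, 2000)));
  }
}
```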
I saw a different post that also handled diacritical marks, which is great:
s.replace(/[^a-zA-Z0-9À-ž\s]/g, "")
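A small sketch of what this pattern keeps and drops, using an invented sample string. Note that this version removes the unwanted characters instead of replacing them with a space, and that the À-ž range (U+00C0 to U+017E) is contiguous, so it also lets a couple of non-letter symbols through:

```javascript
const keepAccents = /[^a-zA-Z0-9À-ž\s]/g;

// Punctuation is removed; accented letters and whitespace are kept.
const cleaned = "Crème brûlée, s'il vous plaît!".replace(keepAccents, "");
console.log(cleaned); // "Crème brûlée sil vous plaît"

// The multiplication sign (U+00D7) and division sign (U+00F7) sit inside
// the À-ž range, so they are kept too.
const symbols = "2×3÷4".replace(keepAccents, "");
console.log(symbols); // "2×3÷4"
```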
To replace dashes as well, do the following (\W already matches the dash, so listing it explicitly is redundant but harmless):
text.replace(/[\W_-]/g,' ');
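A small sketch of how this pattern treats dashes, using a made-up sample string. Since \W already matches "-", the class behaves like [\W_]; if the goal is instead to use a dash as the replacement character, that is controlled by the second argument:

```javascript
const title = "my_file-name v2";

// Both classes match the same characters: "_", "-" and the space.
const withDashInClass = title.replace(/[\W_-]/g, " ");
const withoutDashInClass = title.replace(/[\W_]/g, " ");
console.log(withDashInClass);    // "my file name v2"
console.log(withoutDashInClass); // "my file name v2"

// Using a dash as the replacement instead of a space.
const dashed = title.replace(/[\W_]+/g, "-");
console.log(dashed); // "my-file-name-v2"
```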
For anyone still struggling (like me...) after the above more expert replies, this works in Visual Studio 2019:
outputString = Regex.Replace(inputString, @"\W", "_");
Remember to add
using System.Text.RegularExpressions;
When Unicode comes into play, use
text.replace(/[^\p{L}\p{N}]+/gu," ");
EXPLANATION
Node              Explanation
--------------------------------------------------------------------------------
[^\p{L}\p{N}]+    Any character except Unicode letters and digits
                  (1 or more times, matching as many as possible)
JavaScript code snippet:
const text = `234&^%,Me,2 2013 1080p x264 5 1 BluRąy S01(*&aśd 05 S1E5 1x05 1x5`;
console.log(text.replace(/[^\p{L}\p{N}]+/gu, ` `));
\W will also recognize non-Latin characters as non-word chars.
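To make that concrete, here is a short comparison on an invented mixed-script string. It assumes an engine that supports Unicode property escapes with the u flag (the variable names are illustrative):

```javascript
const mixed = "Привет, мир! BluRąy 123";

// \W treats every non-ASCII letter as a non-word character, so the Cyrillic
// text collapses into one space and "ą" splits "BluRąy" in two.
const asciiOnly = mixed.replace(/\W+/g, " ");

// \p{L} and \p{N} keep letters and digits from any script.
const unicodeAware = mixed.replace(/[^\p{L}\p{N}]+/gu, " ");

console.log(asciiOnly);    // " BluR y 123"
console.log(unicodeAware); // "Привет мир BluRąy 123"
```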