Non-greedy (reluctant) regex matching in sed?

I'm trying to use sed to clean up lines of URLs to extract just the domain.

So from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(either with or without the trailing slash, it doesn't matter)

I have tried:

 sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non-greedy quantifier)

sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I cannot seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.

A side-note: if you delimit your regexes with "|", you needn't escape the "/"s. In fact, most people delimit with "|" instead of "/"s to avoid the "picket fences".
@AttishOculus The first character after the 's' in a substitute expression in sed is the delimiter. Hence 's^foo^bar^' or 's!foo!bar!' also work
For extended regex, use sed -E 's.... Still, no reluctant operator.
Not answer to the question title but in this specific case simple cut -d'/' -f1-3 works.

chaos

Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier; you need a later regex flavor. Fortunately, Perl regex for this context is pretty easy to get:

perl -pe 's|(http://.*?/).*|$1|'

To edit a file in place, use the options -pi -e.
Holy smokes I can't believe that worked :-) Only thing that sucks is now my script has a Perl dependency :-( On the plus side, virtually every Linux distro has Perl already so probably not an issue :-)
@Freedom_Ben: IIRC perl is required by POSIX
@dolphus333: "Neither basic nor extended Posix/GNU regex recognizes the non-greedy quantifier" means "you can't use the non-greedy quantifier in sed".
@Sérgio it's how you do the thing requested, which is impossible in sed, using a syntax basically identical to that of sed
Trevor Boyd Smith

In this specific case, you can get the job done without a non-greedy quantifier.

Use the negated character class [^/]* instead of .*?:

sed 's|\(http://[^/]*/\).*|\1|g'

How to make sed match non greedy a phrase using this technique?
Unfortunately you can’t; see chaos’s answer.
Many thanks ... since perl is no longer in the default installation base in many linux distros!
@DanielH In fact it is possible to match phrases non-greedily using this technique as requested. It just might take some pain to write either pattern with sufficient precision. E.g. when parsing a key-value assignment in a URL's query, it might require searching for the assignment using ([^&=#]+)=([^&#]*). There are cases that don't work this way for sure, e.g. when parsing a URL for its host part and pathname, with an optional final slash to be excluded from capturing: ^(http:\/\/.+?)/?$
Alan Moore

With sed, I usually implement non-greedy search by searching for anything except the separator, up to the separator:

echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'

Output:

http://www.suon.co.uk

How this works:

-n suppresses the automatic printing of each line

s///p searches for a pattern, replaces the match, and prints the result

; is used as the delimiter of the s command instead of /, which makes the expression easier to type: s;;;p

\( ... \) remembers whatever matches between the brackets; it is later accessible as \1, \2, ...

http:// matches itself literally

[] matches any one of the characters listed inside it, so [ab/] would mean either a or b or /

a leading ^ inside [] means "not", so the match is any character except the ones listed

so [^/] means any character except /

* repeats the previous item zero or more times, so [^/]* means any run of characters other than /

so far, sed -n 's;\(http://[^/]*\) means: search for http:// followed by any characters except /, and remember what was found

we want to search until the end of the domain, so we stop at the next / by adding another / at the end: sed -n 's;\(http://[^/]*\)/'. We also want to match the rest of the line after the domain, so we add .*

now the match remembered in group 1 (\1) is the domain, so replace the matched line with the contents of \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'

If you want to include the slash after the domain as well, move that slash inside the remembered group:

echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'

output:

http://www.suon.co.uk/
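The expression built up above can be wrapped in a small shell function. This is only a sketch: url_origin is a hypothetical helper name, and the optional "s" (so that https URLs are accepted too) is an addition that is not part of the answer.

```shell
# url_origin is a hypothetical helper, not part of the answer above.
# It keeps the scheme and host, and drops the first "/" after the host
# together with everything that follows it.
url_origin() {
  # s\{0,1\} makes the "s" optional in a basic regex (http or https).
  # Without -n and p, lines that do not match are printed unchanged.
  printf '%s\n' "$1" | sed 's;\(https\{0,1\}://[^/]*\)/.*;\1;'
}

url_origin "http://www.suon.co.uk/product/1/7/3/"   # prints http://www.suon.co.uk
url_origin "https://example.org/a/b?c=d"            # prints https://example.org
```

Because [^/]* cannot cross a slash, the match has to stop at the first / after the host, which is the effect a non-greedy .*? would have had.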

Regarding the recent edits: Parentheses are a kind of bracketing character, so it's not incorrect to call them brackets, especially if you follow the word with the actual characters, as the author did. Also, it's the preferred usage in some cultures, so replacing it with the preferred usage in your own culture seems a bit rude, though I'm sure that's not what the editor intended. Personally, I think it's best to use purely descriptive names like round brackets, square brackets, and angle brackets.
Community

Simulating lazy (un-greedy) quantifier in sed

And all other regex flavors!

Finding first occurrence of an expression:

POSIX ERE (using -r option)

Regex:

(EXPRESSION).*|.

Sed:

sed -r 's/(EXPRESSION).*|./\1/g' # Global `g` modifier should be on

Example (finding first sequence of digits):

$ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
12

How does it work?

This regex benefits from an alternation |. At each position the engine tries to pick the longest match (this is a POSIX standard which is followed by a couple of other engines as well), which means it goes with . until a match is found for ([0-9]+).*. But order is important too.

Since the global flag is set, the engine tries to continue matching character by character up to the end of the input string or our target. As soon as the first and only capturing group on the left side of the alternation is matched (EXPRESSION), the rest of the line is consumed immediately as well by .*. We now hold our value in the first capturing group.

POSIX BRE

Regex:

\(\(\(EXPRESSION\).*\)*.\)*

Sed:

sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'

Example (finding first sequence of digits):

$ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
12

This one is like the ERE version but with no alternation involved. That's all. At each single position the engine tries to match a digit. If one is found, the following digits are consumed and captured and the rest of the line is matched immediately. Otherwise, since * means zero or more, it skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at a dot . to match a single character, and this process continues.

Finding first occurrence of a delimited expression:

This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.

sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'

Input string:

foobar start block #1 end barfoo start block #2 end

-EDE: end

-SDE: start

$ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'

Output:

start block #1 end

The first regex \(end\).* matches and captures the first end delimiter end and substitutes the whole match with the captured characters, which is the end delimiter. At this stage our output is: foobar start block #1 end.

Then the result is passed to the second regex \(\(start.*\)*.\)*, which is the same as the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.

Directly answering your question

Using approach #2 (delimited expression) you should select two appropriate expressions:

EDE: [^:/]\/

SDE: http:

Usage:

$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'

Output:

http://www.suepearson.co.uk/

Note: this will not work with identical delimiters.


3) while suggesting sites like regex101 for demo, please add a note that it is not always suitable for cli tools because of syntax and feature differences
@Sundeep Thank you. I turned all those quotes to single quotes. Also I considered the leftmost longest match rule to be mentioned. However in sed and all other engines following the same standard order does matter when it comes to equality. So echo 'foo 1' | sed -r 's/.|([0-9]+).*/\1/g' doesn't have a match but echo 'foo 1' | sed -r 's/([0-9]+).*|./\1/g' does.
@Sundeep also the workaround for delimited expressions didn't work for identical start and end delimiters which I added a note for.
great point about what happens when different alternations start from same location and have same length, guess that'll follow left-right order like other engines.. need to look up if that is described in manual
there's a weird case here though: stackoverflow.com/questions/59683820/…
andcoz

sed does not support a "non-greedy" operator.

You have to use a negated bracket expression, [^/], to exclude "/" from the match.

sed 's,\(http://[^/]*\)/.*,\1,'

P.S. there is no need to backslash "/".


not really. if the delimiter could be one of many possible characters (say a string of numbers only) your negation match might get more and more complex. that is fine but it would certainly be nice to have an option to make .* non greedy
The question was more general. These solutions work for URLs but not (e.g.) for my use case of stripping trailing zeros. s/([[:digit:]]\.[[1-9]]*)0*/\1/ would obviously not work well for 1.20300. Since the original question was about URLs, though, they should be mentioned in the accepted answer.
gresolio

sed - non greedy matching by Christoph Sieghart

The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:

Greedy matching

% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar

Non greedy matching

% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar
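The same idea also works when you want to capture the first delimited piece rather than delete it. A minimal sketch, assuming a double quote as the terminating character; first_quoted is a hypothetical helper name, not something from the quoted post:

```shell
# first_quoted is a hypothetical helper used only for illustration.
# It prints the contents of the first double-quoted string on a line.
first_quoted() {
  # [^"]*"        skip up to and including the first quote
  # \([^"]*\)"    capture everything up to the next quote (cannot overshoot)
  # .*            swallow the rest of the line
  # -n with p prints only lines where the substitution happened.
  printf '%s\n' "$1" | sed -n 's/[^"]*"\([^"]*\)".*/\1/p'
}

first_quoted 'a "one" b "two" c'   # prints one
```

With a greedy "\(.*\)" in place of the negated class, the capture would run on to the last quote on the line and yield: one" b "two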

markasoftware

Non-greedy solution for more than a single character

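This thread is really old, but I assume people still need it. Let's say you want to kill everything up to the very first occurrence of HELLO. You cannot say [^HELLO]... because a bracket expression matches a single character, so that would mean any one character other than H, E, L, or O.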

So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit.

In this case we can:

s/HELLO/top_sekrit/     #will only replace the very first occurrence
s/.*top_sekrit//        #kill everything till end of the first HELLO

Of course, with a simpler input you could use a smaller word, or maybe even a single character.

HTH!
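Put together as one command, the two steps might look like the sketch below. It assumes, as the answer does, that top_sekrit never appears in the input; after_first_hello is a hypothetical function name.

```shell
# after_first_hello is a hypothetical helper used only for illustration.
# Assumption: the placeholder word top_sekrit never occurs in the input.
after_first_hello() {
  # Step 1: no g flag, so only the first HELLO becomes the placeholder.
  # Step 2: the greedy .* can only reach that single placeholder, so
  #         everything up to the end of the first HELLO is removed.
  printf '%s\n' "$1" | sed -e 's/HELLO/top_sekrit/' -e 's/.*top_sekrit//'
}

after_first_hello 'aaa HELLO bbb HELLO ccc'   # prints " bbb HELLO ccc"
```

The second HELLO survives because it was never turned into the placeholder.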


To make it even better, useful in situation when you cannot expect not-used character: 1. replace that special character with really unused WORD, 2. replace ending sequence with the special character, 3. do the search ending with special character, 4. replace special character back, 5. replace special WORD back. For example, you want a greedy operator between and :
Here example: echo "Find:fir~st
yes
sec~ond" | sed -e "s,~,VERYSPECIAL,g" -e "s,,~,g" -e "s,.*Find:([^~]*).*,\1," -e "s,\~,," -e "s,VERYSPECIAL,~,"
I agree. nice solution. I would rephrase the comment into saying: if you cannot rely on ~ being unused, replace its current occurrences first using s/~/VERYspeciaL/g, then do the above trick, then return the original ~ using s/VERYspeciaL/~/g
I tend to like using rarer "variables" for this kind of thing, so instead of `, I'd use <$$> (since $$ expands to your process ID in the shell, though you'd have to use double quotes rather than single quotes, and that might break other parts of your regex) or, if unicode is available, something like <∈∋>.
At some point you have to ask yourself why you're not just using perl or python or some other language instead. perl does this in a less fragile manner in a single line...
Lambda Fairy

This can be done using cut:

echo "http://www.suepearson.co.uk/product/174/71/3816/" | cut -d'/' -f1-3

ghostdog74

Another way, not using regex, is to use the field/delimiter method, e.g.

string="http://www.suepearson.co.uk/product/174/71/3816/"
echo "$string" | awk -F"/" '{print $1,$2,$3}' OFS="/"

peterh

sed certainly has its place, but this is not one of them!

As Dee has pointed out: just use cut. It is far simpler and much safer in this case. Here's an example where we extract various components from the URL using Bash syntax:

url="http://www.suepearson.co.uk/product/174/71/3816/"

protocol=$(echo "$url" | cut -d':' -f1)
host=$(echo "$url" | cut -d'/' -f3)
urlhost=$(echo "$url" | cut -d'/' -f1-3)
urlpath=$(echo "$url" | cut -d'/' -f4-)

gives you:

protocol = "http"
host = "www.suepearson.co.uk"
urlhost = "http://www.suepearson.co.uk"
urlpath = "product/174/71/3816/"

As you can see, this is a much more flexible approach.

(all credit to Dee)


mTUX

There is still hope to solve this using pure (GNU) sed. Although this is not a generic solution, in some cases you can use "loops" to eliminate all the unnecessary parts of the string, like this:

sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"

-r: Use extended regex (for + and unescaped parenthesis)

":loop": Define a new label named "loop"

-e: add commands to sed

"t loop": Jump back to label "loop" if there was a successful substitution

The only problem here is that it will also cut the last separator character ('/'), but if you really need it you can simply put it back after the loop has finished; just append this additional command at the end of the previous command line:

-e "s,$,/,"
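For reference, the loop and the final append can be combined into one command as sketched below. strip_to_host is a hypothetical function name, and -E is used in place of -r on the assumption that the local sed accepts it.

```shell
# strip_to_host is a hypothetical helper used only for illustration.
# Assumption: this sed accepts -E for extended regular expressions.
strip_to_host() {
  # Each pass of the loop removes the last "/segment" (greedy .+ runs to
  # the last slash). "t loop" repeats while a substitution succeeded.
  # Once no slash is left after the host, the trailing "/" is re-added.
  printf '%s\n' "$1" |
    sed -E -e ':loop' -e 's|(http://.+)/.*|\1|' -e 't loop' -e 's,$,/,'
}

strip_to_host 'http://www.suepearson.co.uk/product/174/71/3816/'
# prints http://www.suepearson.co.uk/
```

Note that the final s,$,/, runs on every line, so a line that never matched the pattern also gets a slash appended.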

Lucero
sed -E 's|(http://[^/]+/).*|\1|'

If you use "|" as your separator, there is no need to escape "/".
stepancheg

sed -E interprets regular expressions as extended (modern) regular expressions

Update: -E on MacOS X, -r in GNU sed.


No it doesn't... At least not GNU sed.
More broadly, -E is unique to BSD sed and therefore OS X. Links to man pages. -r does bring extended regular expressions to GNU sed as noted in @stephancheg's correction. Beware when using a command of known variability across 'nix distributions. I learned that the hard way.
This is the correct answer if you want to use sed, and is the most applicable to the initial question.
GNU sed's -r option only changes the escaping rules, according to Appendix A Extended regular expressions of the info file and some quick tests; it doesn't actually add a non-greedy qualifier (as of GNU sed version 4.2.1 at least.)
GNU sed recognized -E as an undocumented option for a while, but in release 4.2.2.177, the documentation has been updated to reflect that, so -E is fine for both now.
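Given the -E versus -r discussion above, a script that has to run on both older GNU sed and BSD sed could probe for the flag before using it. This is only a sketch; ere_flag is a hypothetical variable name, and the URL is the one from the question.

```shell
# Probe which extended-regex flag this sed accepts.
# ere_flag is a hypothetical variable name used only for illustration.
if printf 'a\n' | sed -E 's/(a)/\1/' >/dev/null 2>&1; then
  ere_flag=-E
else
  ere_flag=-r
fi

# Extended syntax only changes escaping: there is still no non-greedy
# quantifier, so the negated class [^/]+ does the real work here.
printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' |
  sed "$ere_flag" 's|(http://[^/]+/).*|\1|'
# prints http://www.suepearson.co.uk/
```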
Community

Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This circumvents the non-greedy identifier potentially not being recognized. The first group is the protocol (i.e. 'http://', 'https://', 'tcp://', etc). The second group is the domain:

echo "http://www.suon.co.uk/product/1/7/3/" | sed "s|^\(.*//\)\([^/]*\).*$|\1\2|"

If you're not familiar with grouping, start here.


Iain Henderson

I realize this is an old entry, but someone may find it useful. As the full domain name may not exceed a total length of 253 characters, replace .* with .\{1,255\}


Ed Morton

This is how to robustly do non-greedy matching of multi-character strings using sed. Let's say you want to change every foo...bar to <foo...bar>, so for example this input:

$ cat file
ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV

should become this output:

ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV

To do that you convert foo and bar to individual characters and then use the negation of those characters between them:

$ sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g' file
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV

In the above:

s/@/@A/g; s/{/@B/g; s/}/@C/g converts { and } to placeholder strings that cannot exist in the input, so those chars are then available to convert foo and bar to. (@ itself is escaped as @A first, so a literal @B or @C in the input cannot be mistaken for a placeholder.)

s/foo/{/g; s/bar/}/g converts foo and bar to { and } respectively.

s/{[^{}]*}/<&>/g performs the op we want: converting foo...bar to <foo...bar>.

s/}/bar/g; s/{/foo/g converts { and } back to foo and bar.

s/@C/}/g; s/@B/{/g; s/@A/@/g converts the placeholder strings back to their original characters.

Note that the above does not rely on any particular string being absent from the input, since it manufactures such strings in the first step. Nor does it care which occurrence of any particular regexp you want to match, since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want, and/or use sed's numeric match operator, e.g. to only replace the 2nd occurrence:

$ sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g' file
ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
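To see that the placeholder step really protects literal @, { and } characters, the same script can be run on input that already contains them. A small sketch; wrap_foo_bar is a hypothetical wrapper name around the exact sed script from above.

```shell
# wrap_foo_bar is a hypothetical wrapper around the sed script shown above.
# It reads lines on stdin and wraps every shortest foo...bar span in <>.
wrap_foo_bar() {
  sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g'
}

# The input deliberately contains @, { and } to exercise the escaping.
printf '%s\n' 'a@b {x} foo 1 bar foo 2 bar' | wrap_foo_bar
# prints a@b {x} <foo 1 bar> <foo 2 bar>
```

The literal {x} is left alone because, by the time the {[^{}]*} substitution runs, it has already been rewritten to @Bx@C.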

Luke Davis

Have not yet seen this answer, so here's how you can do this with vi or vim:

vi -c '%s/\(http:\/\/.\{-}\/\).*/\1/ge | wq' file &>/dev/null

This runs the vi :%s substitution globally (the trailing g), refrains from raising an error if the pattern is not found (e), then saves the resulting changes to disk and quits. The &>/dev/null prevents the GUI from briefly flashing on screen, which can be annoying.

I like using vi sometimes for super complicated regexes, because (1) perl is dying, (2) vim has a very advanced regex engine, and (3) I'm already intimately familiar with vi regexes in my day-to-day usage editing documents.


RavinderSingh13

Since PCRE is also tagged here, we could use GNU grep with the lazy (non-greedy) match .*?, which stops at the first, nearest match, as opposed to .* (which is greedy and goes to the last occurrence of the match).

grep -oP '^http[s]?:\/\/.*?/' Input_file

Explanation: grep's -o and -P options are used here. -P enables PCRE regex, and -o prints only the matched part of each line. The regex matches a leading http or https followed by ://, and because .*? is lazy, the match ends at the first / after http:// or https://.


Tim Cooper
echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'

Don't bother, I got it on another forum :)


so you get greedy match: /home/one/two/three/, if you add another / like /home/one/two/three/four/myfile.txt you will greedily match four as well: /home/one/two/three/four, the question is about non-greedy
GL2014

sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1|' works too


jmeinlschmidt

Here is something you can do with a two step approach and awk:

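A=http://www.suepearson.co.uk/product/174/71/3816/
echo "$A" | awk '
{
  # gensub is a gawk extension: replace only the 3rd "/" with "||"
  var = gensub(/\//, "||", 3, $0);
  # then drop the marker and everything after it
  sub(/\|\|.*/, "", var);
  print var
}'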

Output: http://www.suepearson.co.uk

Hope that helps!


Ports

Another sed version:

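sed 's|/[[:alnum:]].*||' file.txt

It matches / followed by an alphanumeric character (so not another forward slash) as well as the rest of the characters till the end of the line. It then replaces the match with nothing (i.e. deletes it). Note that in a URL that includes the scheme, the second slash of :// is itself followed by an alphanumeric character, so this suits input such as www.suepearson.co.uk/product/174/71/3816/ rather than the full URL from the question.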


I guess it should be "[[:alnum:]]", not "[:alphanum:]".
Volker

@Daniel H (concerning your comment on andcoz' answer, although long time ago): deleting trailing zeros works with

s,([[:digit:]]\.[[:digit:]]*[1-9])[0]*$,\1,g

it's about clearly defining the matching conditions ...


Markus Linnala

You should also think about the case where there are no matching delimiters: do you want to output the line or not? My examples here do not output anything if there is no match.

You need the prefix up to the 3rd /, so select, two times, a string of any length not containing / followed by a /, then a string of any length not containing /; then match a / followed by any string, and print the selection. This idea works with any single-character delimiter.

echo http://www.suepearson.co.uk/product/174/71/3816/ | \
  sed -nr 's,(([^/]*/){2}[^/]*)/.*,\1,p'

Using sed commands you can do fast prefix dropping or delim selection, like:

echo 'aaa @cee: { "foo":" @cee: " }' | \
  sed -r 't x;s/ @cee: /\n/;D;:x'

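This is a lot faster than eating one char at a time.

How it works: t x jumps to the label x if a previous substitution succeeded. The s command replaces the first delimiter with \n. D deletes everything up to the first \n and restarts the cycle without reading new input. On that second pass the substitution flag is still set, so t x jumps to the end and the remainder is printed.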

If there is start and end delims, it is just easy to remove end delims until you reach the nth-2 element you want and then do D trick, remove after end delim, jump to delete if no match, remove before start delim and and print. This only works if start/end delims occur in pairs.

echo 'foobar start block #1 end barfoo start block #2 end bazfoo start block #3 end goo start block #4 end faa' | \
  sed -r 't x;s/end//;s/end/\n/;D;:x;s/(end).*/\1/;T y;s/.*(start)/\1/;p;:y;d'

laur

If you have access to GNU grep, you can use Perl regex:

grep -Po '^https?://([^/]+)(?=)' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
http://www.suepearson.co.uk

Alternatively, to get everything after the domain use

grep -Po '^https?://([^/]+)\K.*' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
/product/174/71/3816/

Victoria Stuart

The following solution works for matching / working with multiply present (chained; tandem; compound) HTML or other tags. For example, I wanted to edit HTML code to remove <span> tags, that appeared in tandem.

Issue: regular sed regex expressions greedily matched over all the tags from the first to the last.

Solution: non-greedy pattern matching (per discussions elsewhere in this thread; e.g. https://stackoverflow.com/a/46719361/1904943).

Example:

echo '<span>Will</span>This <span>remove</span>will <span>this.</span>remain.' | \
sed 's/<span>[^>]*>//g' ; echo

This will remain.

Explanation:

s/<span> : find <span>

[^>] : followed by anything that is not >

*> : until you find >

//g : replace any such strings present with nothing.

Addendum

I was trying to clean up URLs, but I was running into difficulty matching / excluding a word - href - using the approach above. I briefly looked at negative lookarounds (Regular expression to match a line that doesn't contain a word) but that approach seemed overly complex and did not provide a satisfactory solution.

I decided to replace href with ` (backtick), do the regex substitutions, then replace ` with href.

Example (formatted here for readability):

printf '\n
<a aaa h href="apple">apple</a>
<a bbb "c=ccc" href="banana">banana</a>
<a class="gtm-content-click"
   data-vars-link-text="nope"
   data-vars-click-url="https://blablabla"
   data-vars-event-category="story"
   data-vars-sub-category="story"
   data-vars-item="in_content_link"
   data-vars-link-text
   href="https:example.com">Example.com</a>\n\n' |
sed 's/href/`/g ;
     s/<a[^`]*`/\n<a href/g'

<a href="apple">apple</a> 
<a href="banana">banana</a> 
<a href="https:example.com">Example.com</a>

Explanation: basically as above. Here,

s/href/` : replace href with ` (backtick)

s/<a : find <a

[^`] : followed by anything that is not ` (backtick)

*` : until you find a `

/\n<a href/g : replace each such match with a newline followed by <a href


user2679290

Unfortunately, as mentioned, this is not supported in sed. To overcome this, I suggest using the next best thing (actually even better): vim's sed-like capabilities.

Define in .bash_profile:

vimdo() { vim $2 --not-a-term -c "$1"  -es +"w >> /dev/stdout" -cq!  ; }

That will create headless vim to execute a command.

Now you can do for example:

echo $PATH | vimdo "%s_\c:[a-zA-Z0-9\\/]\{-}python[a-zA-Z0-9\\/]\{-}:__g" -

to filter out python in $PATH.

Use - to have input from pipe in vimdo.

Most of the syntax is the same, but Vim offers more advanced features, and \{-} is its standard non-greedy match. See :help regexp.