I am trying to find a simple way to extract an unknown substring (could be anything) that appears between two known substrings. For example, I have a string:
a<-" anything goes here, STR1 GET_ME STR2, anything goes here"
I need to extract the string GET_ME, which is between STR1 and STR2 (without the white spaces).
I am trying str_extract(a, "STR1 (.+) STR2"), but I am getting the entire match:
[1] "STR1 GET_ME STR2"
I can of course strip the known strings, to isolate the substring I need, but I think there should be a cleaner way to do it by using a correct regular expression.
regcapturedmatches(test, gregexpr('STR1 (.+?) STR2', test, perl = TRUE))
You may use str_match with STR1 (.*?) STR2 (note the spaces are "meaningful": if you want to just match anything in between STR1 and STR2, use STR1(.*?)STR2, or use STR1\\s*(.*?)\\s*STR2 to trim the value you need). If you have multiple occurrences, use str_match_all.
Also, if you need to match strings that span line breaks/newlines, add (?s) at the start of the pattern: (?s)STR1(.*?)STR2 or (?s)STR1\\s*(.*?)\\s*STR2.
library(stringr)
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
res <- str_match(a, "STR1\\s*(.*?)\\s*STR2")
res[,2]
[1] "GET_ME"
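For the multiple-occurrence and multi-line cases mentioned above, a minimal sketch along the same lines might look like this. The input strings and the variable names (all_matches, multiline) are made up for illustration; only str_match_all and the (?s) flag come from the answer itself.

```r
library(stringr)

# Two occurrences: str_match_all returns a list with one matrix per input string.
a2 <- "x STR1 GET_ME STR2 y STR1 GET_ME2 STR2"
all_matches <- str_match_all(a2, "STR1\\s*(.*?)\\s*STR2")[[1]][, 2]
all_matches

# Value spanning a line break: (?s) lets . match newlines as well.
a3 <- "STR1\nline one\nline two\nSTR2"
multiline <- str_match(a3, "(?s)STR1\\s*(.*?)\\s*STR2")[, 2]
multiline
```

Column 1 of each matrix holds the whole match and column 2 the first capture group, which is why [, 2] is used in both calls.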
Another way using base R regexec (to get the first match):
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1\\s*(.*?)\\s*STR2"
result <- regmatches(test, regexec(pattern, test))
result[[1]][2]
[1] "GET_ME"
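Since regexec stops at the first match, here is one possible base R sketch for collecting every occurrence. It assumes PCRE (perl = TRUE) so that \\K and a lookahead can be used to keep only the inner text; the variable name all_inner is illustrative.

```r
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"

# \\K drops "STR1" and the following whitespace from the reported match;
# the lookahead requires optional whitespace plus "STR2" without consuming it.
pattern_all <- "STR1\\s*\\K.*?(?=\\s*STR2)"
all_inner <- regmatches(test, gregexpr(pattern_all, test, perl = TRUE))[[1]]
all_inner
```

This returns a plain character vector, so no matrix indexing is needed.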
Here's another way by using base R
a<-" anything goes here, STR1 GET_ME STR2, anything goes here"
gsub(".*STR1 (.+) STR2.*", "\\1", a)
Output:
[1] "GET_ME"
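In the replacement string, \\1 refers to whatever the first parenthesised group captured, so the whole input is replaced by just that group. One caveat worth sketching: when the pattern does not match, gsub/sub return the input unchanged rather than a missing value. The helper below (extract_between is a hypothetical name, not part of base R) guards against that; it also uses perl = TRUE and a lazy quantifier, which is an assumption on top of the original answer.

```r
# Hypothetical helper: return the captured group, or NA when there is no match.
extract_between <- function(x, pattern) {
  out <- rep(NA_character_, length(x))
  hit <- grepl(pattern, x, perl = TRUE)
  out[hit] <- sub(pattern, "\\1", x[hit], perl = TRUE)
  out
}

inputs <- c(" anything goes here, STR1 GET_ME STR2, anything goes here",
            "no markers in this one")
extracted <- extract_between(inputs, ".*STR1 (.+?) STR2.*")
extracted
```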
Another option is to use qdapRegex::ex_between to extract strings between left and right boundaries:
qdapRegex::ex_between(a, "STR1", "STR2")[[1]]
#[1] "GET_ME"
It also works with multiple occurrences
a <- "anything STR1 GET_ME STR2, anything goes here, STR1 again get me STR2"
qdapRegex::ex_between(a, "STR1", "STR2")[[1]]
#[1] "GET_ME" "again get me"
Or multiple left and right boundaries
a <- "anything STR1 GET_ME STR2, anything goes here, STR4 again get me STR5"
qdapRegex::ex_between(a, c("STR1", "STR4"), c("STR2", "STR5"))[[1]]
#[1] "GET_ME" "again get me"
The first capture is between "STR1" and "STR2", whereas the second is between "STR4" and "STR5".
We could use {unglue}; in that case we don't need regex at all:
library(unglue)
unglue::unglue_vec(
" anything goes here, STR1 GET_ME STR2, anything goes here",
"{}STR1 {x} STR2{}")
#> [1] "GET_ME"
{} matches anything without keeping it, and {x} captures its match (any variable name other than x could be used). The syntax "{}STR1 {x} STR2{}" is short for "{=.*?}STR1 {x=.*?} STR2{=.*?}".
If you wanted to extract the sides too you could do:
unglue::unglue_data(
" anything goes here, STR1 GET_ME STR2, anything goes here",
"{left}, STR1 {x} STR2, {right}")
#>                  left      x              right
#> 1  anything goes here GET_ME anything goes here
Instead of the literal pattern "{left}, STR1 {x} STR2, {right}", you could build it from variables a and b holding the two boundary strings with sprintf("{left}, %s {x} %s, {right}", a, b), or paste0("{left}, ", a, " {x} ", b, ", {right}").
The ? here is part of a lazy (non-greedy) quantifier. It matches as few characters as possible, while * will match as many as possible. So the STR1 .*? STR2 regex matches STR1 xx STR2, and STR1 .* STR2 will match STR1 xx STR2 zzz STR2. If you expect multiple matches in your input, a lazy quantifier is a must here. Also, FYI: if the part of the string between STR1 and STR2 may contain newlines, you need to prepend the pattern with (?s): "(?s)STR1 (.*?) STR2".
str_match output is in a matrix? It seems so inconvenient, particularly when the only output most people ever want is [,2].
If all someone wants is [,2], they should use a mere regmatches(a, regexpr("STR1\\s*\\K.*?(?=\\s*STR2)", a, perl=TRUE)). With stringr, it is also possible to use a pattern like str_extract_all(a, "(?s)(?<=STR1\\s{0,1000}).*?(?=\\s*STR2)") (though for some reason the space is still included in the match, and it is rather hacky). str_match is a lifesaver when you need to return all matches and captures. Also, the pattern that can be used with str_match is much more efficient.
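The lazy versus greedy difference described in the comments can be checked with a short base R sketch. The sample string mirrors the one in the comment; perl = TRUE is an assumption made here so that the quantifier semantics are those of PCRE.

```r
x <- "STR1 xx STR2 zzz STR2"

# Lazy: stops at the first " STR2" that completes the match.
lazy <- regmatches(x, regexpr("STR1 .*? STR2", x, perl = TRUE))

# Greedy: runs on to the last " STR2" in the string.
greedy <- regmatches(x, regexpr("STR1 .* STR2", x, perl = TRUE))

lazy
greedy
```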