grep -Po
— Kaushal Modi
Using grep to do substring extraction in shell scripts.
I like regular expressions as they allow me to be concise and specific about what I need to search. (I recommend using https://regex101.com/ to practice regular expressions of different flavors (PCRE2, PCRE, Python, etc.), whether or not you are new to using regex.)
And I have liked using regular expressions for many years, ever since
I learned Perl about fifteen years back. I am writing this post as I
am remembering the delight I felt when I realized that I can use the
familiar Perl regular expressions to do string parsing in shell
scripts. I am not exactly sure, but I probably learned about this
grep -Po trick from Stack Exchange (camh, 2011).
Problem statement
I could be parsing a log file with a line like web report: https://foo.bar/detail.html and I need to extract the https://foo.bar part to a shell script variable.
Solution using grep -Po
This solution requires a GNU grep version that supports -P, that is, one compiled with libpcre. GNU grep gained the PCRE (-P) feature back in 2000, and I have never come across or used a system that did not have such a grep version installed.
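If a script might run on a system with a different grep, it can check for -P support before relying on it. This is only a minimal sketch: it assumes that a grep without PCRE support exits with a non-zero status when given -P, and the function name grep_supports_pcre is made up for illustration.

```shell
# Hypothetical helper: succeeds only if the local grep accepts -P.
grep_supports_pcre() {
    # \K is PCRE-only syntax, so this match can only succeed with PCRE support.
    printf 'ab\n' | grep -qP 'a\Kb' 2>/dev/null
}

if grep_supports_pcre; then
    grep_p_status="available"
else
    grep_p_status="missing"
fi
echo "grep -P is ${grep_p_status}"
```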
I’ll throw the solution out here and then dig into the details.
echo "def\nabc" | grep -Po 'a\K.(?=c)' # => b
The grep switches used here are:
-P
- Use (P)erl regular expressions. This allows us to use the lookaround regex syntax like (?=..) and special constructs like \K (“perlre - Perl regular expressions,” n.d.).
-o
- Print only the matched portion to the (o)utput.
Arriving at this solution
Now I’ll start with a basic example and build up to the above solution.
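Problem
- Let’s say I have this text with two lines “def” and “abc” and I want to output whatever character is between “a” and “c”.
Below, the regular expression for matching any character between “a” and “c” ('a.c') is correct, but plain grep prints the whole line that contains the match, not just the matched part. In shells where echo leaves \n as two literal characters (bash's builtin echo does so by default), the input is a single line and is printed in full; where echo expands \n into a newline, only the “abc” line is printed.
echo "def\nabc" | grep 'a.c' # => def\nabc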
Now we add the grep -o switch so that it outputs only the matched portion. As the regex is 'a.c', the -o switch will output every part of the input that matched that. So the output is “abc”. It’s still not what we wanted.
echo "def\nabc" | grep -o 'a.c' # => abc
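To make "every part of the input that matched" concrete, here is a small sketch with an input line I made up that contains two matches. With -o, grep prints each match on its own output line.

```shell
# One input line containing two matches of 'a.c' ("abc" and "axc").
matches=$(printf 'abc and axc\n' | grep -o 'a.c')
echo "${matches}"
# Prints:
# abc
# axc
```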
Now we bring in the powerful Perl regex feature positive lookahead. Positive lookahead is used when you want to match something only if it’s followed by something else. Its syntax looks like q(?=u), where that expression matches a q that is followed by a u, without making the u part of the match. But this is still not exactly what we want because “a” is still considered part of the match. Now the output is “ab”.
echo "abc" | grep -Po 'a.(?=c)' # => ab
Now we only need a special construct that marks a point in the regex and says “don’t consider anything before this as part of the match”. The \K special construct is described in the Perl regular expressions doc as:
There is a special form of this construct, called \K (available since Perl 5.10.0), which causes the regex engine to “keep” everything it had matched prior to the \K and not include it in matched string. This effectively provides non-experimental variable-length lookbehind of any length.
And thus we have the final solution:
echo "abc" | grep -Po 'a\K.(?=c)' # => b
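As a rough comparison, for a fixed-length prefix like “a” a positive lookbehind, (?<=a), gives the same result as \K. The variable-length case mentioned in the quote is where \K is the simpler choice. The third command below uses a made-up input with a varying number of spaces after “a” to sketch that.

```shell
# Fixed-length prefix: lookbehind and \K give the same result.
with_lookbehind=$(echo "abc" | grep -Po '(?<=a).(?=c)')
with_keep=$(echo "abc" | grep -Po 'a\K.(?=c)')

# Variable-length prefix ("a" followed by any number of spaces): use \K.
variable_prefix=$(echo "a   bc" | grep -Po 'a\s*\K.(?=c)')

echo "${with_lookbehind} ${with_keep} ${variable_prefix}" # => b b b
```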
Summary
Taking the example from the problem statement, this will work:
string="web report: https://foo.bar/detail.html"
substring=$(grep -Po 'web report:\s*\K.*?(?=/detail\.html)' <<< "${string}")
echo "${substring}"
https://foo.bar
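The same pattern can be applied to several lines at once, and grep's exit status can guard against lines that do not match. The following is a sketch under assumptions: the log content and the variable names (log, pattern, urls, result) are invented for illustration, and printf is piped into grep instead of using <<< only to keep the sketch portable across shells.

```shell
# Hypothetical log content; only the first and third lines are "web report" lines.
log='web report: https://foo.bar/detail.html
some unrelated line
web report:   https://baz.example/detail.html'

pattern='web report:\s*\K.*?(?=/detail\.html)'

# grep prints one extracted substring per matching line.
urls=$(printf '%s\n' "${log}" | grep -Po "${pattern}")
echo "${urls}"

# grep exits with a non-zero status when nothing matches, so the
# extraction can be checked before the variable is used.
if substring=$(printf '%s\n' 'no report here' | grep -Po "${pattern}"); then
    result="found: ${substring}"
else
    result="no match"
fi
echo "${result}"
```

With the made-up input above, urls holds https://foo.bar and https://baz.example on separate lines, and the second check takes the “no match” branch.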