Emacs, scripting and anything text oriented.

grep -Po

Kaushal Modi

Using grep to do substring extraction in shell scripts.

I like regular expressions, as they allow me to be concise and specific about what I need to search. Whether or not you are new to regex, I recommend using https://regex101.com/ to practice regular expressions of different flavors (PCRE2, PCRE, Python, etc.).

And I have liked using regular expressions for many years, ever since I learned Perl about fifteen years back. I am writing this post as I remember the delight I felt when I realized that I could use the familiar Perl regular expressions to do string parsing in shell scripts. I am not exactly sure, but I probably learned about this grep -Po trick from Stack Exchange (camh, 2011).

Problem statement #

Say I am parsing a log file with a line like web report: https://foo.bar/detail.html, and I need to extract the https://foo.bar part into a shell script variable.

Solution using grep -Po #

This solution requires a GNU grep version that supports -P, i.e. one compiled with libpcre. GNU grep gained the PCRE (-P) feature back in 2000, and I have never come across or used a system that did not have such a grep version installed.
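
If a script has to run on machines where the grep build is unknown, a quick probe can tell whether -P is usable. This is only a minimal sketch: the function name grep_supports_pcre is made up for illustration, and it relies on grep exiting with a non-zero status when it does not accept or support -P.

```shell
# Hypothetical helper: succeeds only if this grep accepts -P and can match with it.
grep_supports_pcre() {
    # \K is Perl-regex syntax, so this probe only matches with a PCRE-enabled grep.
    printf 'ab\n' | grep -qP 'a\Kb' 2>/dev/null
}

if grep_supports_pcre; then
    pcre_status="grep -P is available"
else
    pcre_status="grep -P is not available"
fi
echo "${pcre_status}"
```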

I’ll throw the solution out here and then dig into the details.

printf 'def\nabc\n' | grep -Po 'a\K.(?=c)' # => b
Code Snippet 1: Extracting "b" from "abc" using grep -Po

The grep switches used here are:

-P
Use (P)erl regular expressions. This allows us to use the lookaround regex syntax like (?=..) and special constructs like \K (“perlre - Perl regular expressions,” n.d.).
-o
Print only the matched portion to the (o)utput.

Arriving at this solution #

Now I’ll start with a basic example and build up to the above solution.

Problem
Let’s say I have this text with two lines “def” and “abc” and I want to output whatever character is between “a” and “c”.
  • Below, the regular expression for matching any character between “a” and “c” ( 'a.c' ) is correct, but grep prints every matching line in full, so the output is the whole “abc” line rather than just the character we want.

    printf 'def\nabc\n' | grep 'a.c' # => abc
    
  • Now we add the grep -o switch so that it outputs only the matched portion. As the regex is 'a.c', the -o switch will output every part of the input that matched it. So the output is “abc”. It’s still not what we wanted.

    printf 'def\nabc\n' | grep -o 'a.c' # => abc
    
  • Now we bring in the powerful Perl regex feature called positive lookahead. Positive lookahead is used when you want to match something only if it’s followed by something else. Its syntax looks like q(?=u), which matches a q that is followed by a u without making the u part of the match. This is still not exactly what we want, because “a” is still considered part of the match, so the output is now “ab”.

    echo "abc" | grep -Po 'a.(?=c)' # => ab
    
  • Now we only need a construct that marks a point in the regex and says “don’t consider anything before this as part of the match”. That is the \K special construct, described in the Perl regular expressions doc as:

    There is a special form of this construct, called \K (available since Perl 5.10.0), which causes the regex engine to “keep” everything it had matched prior to the \K and not include it in matched string. This effectively provides non-experimental variable-length lookbehind of any length.

    And thus we have the final solution:

    echo "abc" | grep -Po 'a\K.(?=c)' # => b
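
The four steps can be replayed on a longer line, where the difference between printing the matching line and printing only the match is easier to see. This is a sketch under assumptions: the sample input is made up, and printf is used instead of echo because whether echo expands \n depends on the shell.

```shell
# Hypothetical two-line input; the second line has extra text around "abc".
input=$(printf 'def\nxx abc yy\n')

step1=$(printf '%s\n' "${input}" | grep 'a.c')           # whole matching line
step2=$(printf '%s\n' "${input}" | grep -o 'a.c')        # only the matched text
step3=$(printf '%s\n' "${input}" | grep -Po 'a.(?=c)')   # lookahead leaves out the "c"
step4=$(printf '%s\n' "${input}" | grep -Po 'a\K.(?=c)') # \K leaves out the "a" as well

printf '%s\n' "${step1}" "${step2}" "${step3}" "${step4}"
```

With this input the four results are “xx abc yy”, “abc”, “ab” and “b”, in that order.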
    

Summary #

Taking the example from the problem statement, this will work:

string="web report: https://foo.bar/detail.html"
substring=$(grep -Po 'web report:\s*\K.*?(?=/detail\.html)' <<< "${string}")
echo "${substring}" # => https://foo.bar
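
In a real script the line usually comes from a file, and the pattern may not match at all. Here is a hedged sketch of how that could be handled; the function name extract_report_url and the sample log contents are hypothetical. It relies on grep exiting with status 1 when nothing matches, and on GNU grep's -m 1 to stop after the first matching line.

```shell
# Hypothetical helper: print the base URL from the first "web report:" line on stdin.
extract_report_url() {
    grep -m 1 -Po 'web report:\s*\K.*?(?=/detail\.html)'
}

# Made-up log contents for illustration.
log=$(printf 'starting job\nweb report: https://foo.bar/detail.html\ndone\n')

# The if branch is taken only when grep found a match (exit status 0).
if url=$(printf '%s\n' "${log}" | extract_report_url); then
    echo "report URL: ${url}"
else
    echo "no web report line found" >&2
fi
```

Checking the exit status this way avoids silently carrying an empty variable forward when the log has no such line.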

References #

camh. (2011). Can grep output only specified groupings that match? [Website]. In Unix stackexchange. https://unix.stackexchange.com/a/13472/57923
perlre - Perl regular expressions. (n.d.). [Website]. In Perldoc 5.34.0. Retrieved February 16, 2022, from https://perldoc.perl.org/perlre