<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-us"><generator uri="https://gohugo.io/" version="0.101.0">Hugo</generator><title type="html">perl on A Scripter's Notes</title><subtitle type="html">Emacs, scripting and anything text oriented.</subtitle><link href="https://scripter.co/tags/perl/" rel="alternate" type="text/html" title="HTML"/><link href="https://scripter.co/tags/perl/index.xml" rel="alternate" type="application/rss+xml" title="RSS"/><link href="https://scripter.co/tags/perl/atom.xml" rel="self" type="application/atom+xml" title="Atom"/><link href="https://scripter.co/tags/perl/jf2feed.json" rel="alternate" type="application/jf2feed+json" title="jf2feed"/><updated>2026-04-22T08:24:58-04:00</updated><author><name>Kaushal Modi</name><email>kaushal.modi@gmail.com</email></author><id>https://scripter.co/tags/perl/</id><entry><title type="html">grep -Po</title><link href="https://scripter.co/grep-po/?utm_source=atom_feed" rel="alternate" type="text/html"/><link href="https://scripter.co/golang-quirk-number-strings-starting-with-0-are-octals/?utm_source=atom_feed" rel="related" type="text/html" title='  Golang Quirk: Number-strings starting with "0" are Octals   '/><link href="https://scripter.co/generics-not-exactly-in-systemverilog/?utm_source=atom_feed" rel="related" type="text/html" title="Generics (not exactly) in SystemVerilog"/><link href="https://scripter.co/sidenotes-using-ox-hugo/?utm_source=atom_feed" rel="related" type="text/html" title="Sidenotes using ox-hugo"/><link href="https://scripter.co/sidenotes-using-only-css/?utm_source=atom_feed" rel="related" type="text/html" title="Sidenotes using only CSS"/><link href="https://scripter.co/notes/string-fns-nim-vs-python/?utm_source=atom_feed" rel="related" type="text/html" title="String Functions: Nim vs Python"/><id>https://scripter.co/grep-po/</id><author><name>Kaushal 
Modi</name></author><published>2022-02-16T21:34:00-05:00</published><updated>2022-02-16T21:34:00-05:00</updated><content type="html"><![CDATA[<blockquote>Using <code>grep</code> to do substring extraction in shell scripts.</blockquote><div class="ox-hugo-toc toc">
<div class="heading">Table of Contents</div>
<ul>
<li><a href="#grep-po-problem-statement">Problem statement</a></li>
<li><a href="#solution-using-grep-po">Solution using <code>grep -Po</code></a></li>
<li><a href="#arriving-to-this-solution">Arriving at this solution</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
</div>
<!--endtoc-->
<p>I like <a href="https://en.wikipedia.org/wiki/Regular_expression">regular expressions</a>
<span class="sidenote-number"><small class="sidenote">
I recommend using <a href="https://regex101.com/">https://regex101.com/</a> to practice regular
expressions of different flavors (PCRE2, PCRE, Python, etc.) whether
or not you are new to using <abbr aria-label=" regular expression" tabindex=0>regex</abbr>.
</small></span>
as they allow me to be concise and specific about what I need to
search.</p>
<p>I have liked using regular expressions for many years, ever since
I learned Perl about fifteen years ago. I am writing this post because
I still remember the delight I felt when I realized that I could use the
familiar Perl regular expressions to do string parsing in shell
scripts. I am not exactly sure, but I probably learned this
<code>grep -Po</code> trick from <em>stackexchange</em> (<a href="#citeproc_bib_item_1">camh, 2011</a>).</p>

<h2 id="grep-po-problem-statement">Problem statement&nbsp;<a class="headline-hash no-text-decoration" href="#grep-po-problem-statement">#</a></h2>


<p>Say I am parsing a log file that has a line like <code>web report: https://foo.bar/detail.html</code>, and I need to extract the
<code>https://foo.bar</code> part into a shell script variable.</p>

<h2 id="solution-using-grep-po">Solution using <code>grep -Po</code>&nbsp;<a class="headline-hash no-text-decoration" href="#solution-using-grep-po">#</a></h2>


<div class="note">
<p>This solution requires a GNU <code>grep</code> version supporting <code>-P</code>, that&rsquo;s
compiled with <code>libpcre</code>.
<span class="sidenote-number"><small class="sidenote">
<em>GNU grep</em> gained the PCRE (<code>-P</code>) feature back <a href="https://git.savannah.gnu.org/cgit/grep.git/commit/?id=05860b2d966701a5a9f70a650d32b30ae2612eeb">in 2000</a>.
</small></span>
That said, I have yet to come across a system that did not have such
a <code>grep</code> version installed.</p>
</div>
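<p>If a script should fail gracefully on a system whose <code>grep</code> lacks PCRE support, one possible probe is sketched below. It assumes that such a <code>grep</code> either rejects the <code>-P</code> switch or fails to match the lookahead pattern; the variable name <code>pcre_grep</code> is made up for this illustration.</p>

```shell
# Probe for -P support by running a trivial match that needs a lookahead.
# A grep without PCRE support exits non-zero here (error or no match).
if printf 'ab\n' | grep -Pq 'a(?=b)' 2>/dev/null; then
    pcre_grep=yes
else
    pcre_grep=no
fi
echo "grep -P available: ${pcre_grep}"
```

<p>A script could then bail out early, or fall back to another tool, when <code>pcre_grep</code> is <code>no</code>.</p>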
<p>I&rsquo;ll throw the solution out here and then dig into the details.</p>
<p><a id="code-snippet--grepPo-example"></a></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="nb">printf</span> <span class="s1">&#39;def\nabc\n&#39;</span> <span class="p">|</span> grep -Po <span class="s1">&#39;a\K.(?=c)&#39;</span> <span class="c1"># =&gt; b</span>
</span></span></code></pre></div><div class="src-block-caption">
  <span class="src-block-number"><a href="#code-snippet--grepPo-example">Code Snippet 1</a>:</span>
  Extracting "b" from "abc" using <code>grep -Po</code>
</div>
<p>The <em>grep</em> switches used here are:</p>
<dl>
<dt><code>-P</code></dt>
<dd>Use (P)erl regular expressions. This allows us to use the
<a href="https://www.regular-expressions.info/lookaround.html"><em>look around</em> regex</a> syntax like <code>(?=..)</code> and special characters like
<code>\K</code> (<a href="#citeproc_bib_item_2">“perlre - Perl regular expressions,” n.d.</a>).</dd>
<dt><code>-o</code></dt>
<dd>Print (o)nly the matched portion of each matching line, with each match on its own output line.</dd>
</dl>
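<p>One detail of <code>-o</code> is easier to see than to describe: when a line contains several matches, each one is printed on its own line. Below is a small sketch of that, assuming GNU <code>grep</code> with <code>-P</code> support; the input <code>abc aXc</code> and the variable name <code>matches</code> are made up for this illustration.</p>

```shell
# One input line holding two "a?c" groups.
# With -o, grep emits one output line per match.
matches=$(printf 'abc aXc\n' | grep -Po 'a\K.(?=c)')
printf '%s\n' "$matches"
# => b
# => X
```

<p>This matters when capturing the output in a variable: the variable may end up holding more than one line.</p>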

<h2 id="arriving-to-this-solution">Arriving at this solution&nbsp;<a class="headline-hash no-text-decoration" href="#arriving-to-this-solution">#</a></h2>


<p>Now I&rsquo;ll start with a basic example and build up to the <a href="#code-snippet--grepPo-example">above
solution</a>.</p>
<dl>
<dt>Problem</dt>
<dd>Let&rsquo;s say I have this text with two lines &ldquo;def&rdquo; and &ldquo;abc&rdquo;
and I want<span class="org-target" id="org-target--wanted-grep-output"></span> to output whatever character is between &ldquo;a&rdquo; and &ldquo;c&rdquo;.</dd>
</dl>
<!--listend-->
<ul>
<li>
<p>Below, the regular expression for matching any character between &ldquo;a&rdquo;
and &ldquo;c&rdquo; ( <code>'a.c'</code> ) is correct, but plain <em>grep</em> prints every
matching line in full, so we get the whole &ldquo;abc&rdquo; line instead of just the
character we want.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="nb">printf</span> <span class="s1">&#39;def\nabc\n&#39;</span> <span class="p">|</span> grep <span class="s1">&#39;a.c&#39;</span> <span class="c1"># =&gt; abc</span>
</span></span></code></pre></div></li>
<li>
<p>Now we add the <em>grep</em> <code>-o</code> switch so that it outputs only the
matched portion. As the regex is <code>'a.c'</code>, the matched portion
includes the surrounding &ldquo;a&rdquo; and &ldquo;c&rdquo;, so the output is
&ldquo;abc&rdquo;. It&rsquo;s still not what we <a href="#org-target--wanted-grep-output">wanted</a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="nb">printf</span> <span class="s1">&#39;def\nabc\n&#39;</span> <span class="p">|</span> grep -o <span class="s1">&#39;a.c&#39;</span> <span class="c1"># =&gt; abc</span>
</span></span></code></pre></div></li>
<li>
<p>Now we bring in the powerful Perl regex feature <em>positive
lookahead</em>.
<span class="sidenote-number"><small class="sidenote">
Positive lookahead is used when you want to match something <span class="underline">only
if</span> it&rsquo;s followed by something else. Its syntax looks like <code>q(?=u)</code>,
where that expression matches if a <code>q</code> is followed by a <code>u</code>, without
making the <code>u</code> part of the match &ndash; <a href="https://www.regular-expressions.info/lookaround.html">reference</a>.
</small></span>
The lookahead keeps &ldquo;c&rdquo; out of the match, but this is still not exactly
what we want because &ldquo;a&rdquo; is still part of the match. Now the output is &ldquo;ab&rdquo;.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;abc&#34;</span> <span class="p">|</span> grep -Po <span class="s1">&#39;a.(?=c)&#39;</span> <span class="c1"># =&gt; ab</span>
</span></span></code></pre></div></li>
<li>
<p>All we need now is a construct that marks a point in the regex and
says &ldquo;don&rsquo;t consider anything before this as part of the
match&rdquo;. That is the <code>\K</code> construct, described in the <a href="https://perldoc.perl.org/perlre#Lookaround-Assertions">Perl regular
expressions doc</a> as:</p>
<blockquote>
<p>There is a special form of this construct, called <code>\K</code> (available
since Perl 5.10.0), which causes the regex engine to &ldquo;keep&rdquo;
everything it had matched prior to the <code>\K</code> and not include it in
matched string. This effectively provides non-experimental
variable-length lookbehind of any length.</p>
</blockquote>
<p>And thus we have the final solution:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-shell" data-lang="shell"><span class="line"><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;abc&#34;</span> <span class="p">|</span> grep -Po <span class="s1">&#39;a\K.(?=c)&#39;</span> <span class="c1"># =&gt; b</span>
</span></span></code></pre></div></li>
</ul>
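<p>The four steps above can be replayed side by side. This is a minimal sketch, assuming GNU <code>grep</code> with <code>-P</code> support; the variable names <code>input</code> and <code>step1</code> to <code>step4</code> are made up for this illustration. It uses <code>printf</code> rather than <code>echo</code> so that the input really contains two lines regardless of the shell, which means the first step prints only the matching &ldquo;abc&rdquo; line.</p>

```shell
# Two-line input: "def" on the first line, "abc" on the second.
input='def
abc'

step1=$(printf '%s\n' "$input" | grep 'a.c')            # whole matching line
step2=$(printf '%s\n' "$input" | grep -o 'a.c')         # matched portion only
step3=$(printf '%s\n' "$input" | grep -Po 'a.(?=c)')    # lookahead drops "c"
step4=$(printf '%s\n' "$input" | grep -Po 'a\K.(?=c)')  # \K also drops "a"

printf '%s\n' "$step1" "$step2" "$step3" "$step4"
# => abc
# => abc
# => ab
# => b
```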

<h2 id="summary">Summary&nbsp;<a class="headline-hash no-text-decoration" href="#summary">#</a></h2>


<p>Taking the example from the <a href="#grep-po-problem-statement">problem statement</a>, this will work:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="nv">string</span><span class="o">=</span><span class="s2">&#34;web report: https://foo.bar/detail.html&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nv">substring</span><span class="o">=</span><span class="k">$(</span>grep -Po <span class="s1">&#39;web report:\s*\K.*?(?=/detail\.html)&#39;</span> <span class="o">&lt;&lt;&lt;</span> <span class="s2">&#34;</span><span class="si">${</span><span class="nv">string</span><span class="si">}</span><span class="s2">&#34;</span><span class="k">)</span>
</span></span><span class="line"><span class="cl"><span class="nb">echo</span> <span class="s2">&#34;</span><span class="si">${</span><span class="nv">substring</span><span class="si">}</span><span class="s2">&#34;</span>
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">https://foo.bar
</span></span></code></pre></div>
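<p>In practice the line of interest usually sits among other log lines. The sketch below shows one way to handle that, assuming GNU <code>grep</code> with <code>-P</code> support. The surrounding log lines and the variable names <code>log</code> and <code>report_url</code> are hypothetical; only the &ldquo;web report:&rdquo; line comes from the example above.</p>

```shell
# Hypothetical log contents; only the "web report:" line is from the post.
log='build started
web report: https://foo.bar/detail.html
build finished'

# -m 1 stops after the first matching line, so the variable holds one URL.
report_url=$(printf '%s\n' "$log" |
    grep -m 1 -Po 'web report:\s*\K.*?(?=/detail\.html)')

if [ -n "$report_url" ]; then
    echo "report base URL: ${report_url}"
else
    echo "no web report line found" >&2
fi
# => report base URL: https://foo.bar
```

<p>Checking for an empty result is a cheap guard, because <code>grep</code> prints nothing when no line matches.</p>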
<h2 id="references">References&nbsp;<a class="headline-hash no-text-decoration" href="#references">#</a></h2>


<div class="csl-bib-body">
  <div class="csl-entry"><a id="citeproc_bib_item_1"></a>camh. (2011). Can grep output only specified groupings that match? [Website]. In <i>Unix stackexchange</i>. <a href="https://unix.stackexchange.com/a/13472/57923">https://unix.stackexchange.com/a/13472/57923</a></div>
  <div class="csl-entry"><a id="citeproc_bib_item_2"></a>perlre - Perl regular expressions. (n.d.). [Website]. In <i>Perldoc 5.34.0</i>. Retrieved February 16, 2022, from <a href="https://perldoc.perl.org/perlre">https://perldoc.perl.org/perlre</a></div>
</div>
]]></content><category scheme="https://scripter.co/categories/unix" term="unix" label="unix"/><category scheme="https://scripter.co/categories/shell" term="shell" label="shell"/><category scheme="https://scripter.co/tags/grep" term="grep" label="grep"/><category scheme="https://scripter.co/tags/regex" term="regex" label="regex"/><category scheme="https://scripter.co/tags/string" term="string" label="string"/><category scheme="https://scripter.co/tags/perl" term="perl" label="perl"/><category scheme="https://scripter.co/tags/100daystooffload" term="100daystooffload" label="100DaysToOffload"/></entry></feed>