This post is about using grep to do substring extraction in shell scripts.

I like regular expressions, as they allow me to be concise and specific about what I need to search. I recommend using https://regex101.com/ to practice regular expressions of different flavors (PCRE2, PCRE, Python, etc.), whether or not you are new to regex.

And I have liked using regular expressions for many years, ever since I learned Perl about fifteen years back. I am writing this post as I am remembering the delight I felt when I realized that I can use the familiar Perl regular expressions to do string parsing in shell scripts. I am not exactly sure, but I probably learned about this grep -Po trick from Stack Exchange (camh, 2011).
I could be parsing a log file with a line like "web report: https://foo.bar/detail.html", and I need to extract the https://foo.bar part into a shell script variable.
grep -Po

This solution requires a GNU grep version supporting -P, that is, one compiled with libpcre. GNU grep gained the PCRE (-P) feature back in 2000, and I have never come across or used a system that did not have such a grep version installed.
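If you want to check whether the grep on a given machine was built with PCRE, a quick probe works (a minimal sketch; the variable name and message are mine):

```shell
# Probe for PCRE support: a grep built without libpcre will reject -P.
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
    have_pcre=yes
else
    have_pcre=no
fi
echo "grep -P support: ${have_pcre}"
```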
I’ll throw the solution out here and then dig into the details.
echo "def\nabc" | grep -Po 'a\K.(?=c)' # => b
The grep switches used here are -P and -o, together with Perl regex constructs like the (?=..) lookahead and the \K special character (“perlre - Perl regular expressions,” n.d.).
Now I’ll start with a basic example and build up to the above solution.
Below, the regular expression for matching any character between “a” and “c” ('a.c') is correct, but grep will output the whole input, because grep prints every line on which the regex matched (and since bash’s echo does not expand \n, the input here is a single line).
echo "def\nabc" | grep 'a.c' # => def\nabc
Now we add the grep -o switch so that it outputs only the matched portion. As the regex is 'a.c', the -o switch will output every part of the input that matched it. So the output is “abc”. It’s still not what we wanted.
echo "def\nabc" | grep -o 'a.c' # => abc
Now we bring in the powerful Perl regex feature positive lookahead. Positive lookahead is used when you want to match something only if it’s followed by something else. Its syntax looks like q(?=u), where that expression matches if a q is followed by a u, without making the u part of the match (reference).
But this is still not exactly what we want because “a” is still
considered as part of the match. Now the output is “ab”.
echo "abc" | grep -Po 'a.(?=c)' # => ab
We only need a special character that marks a point in the regex and says “don’t consider anything before this as part of the match”. The \K special construct is described in the Perl regular expressions doc as:
There is a special form of this construct, called \K (available since Perl 5.10.0), which causes the regex engine to “keep” everything it had matched prior to the \K and not include it in the matched string. This effectively provides non-experimental variable-length lookbehind of any length.
And, thus we have the final solution:
echo "abc" | grep -Po 'a\K.(?=c)' # => b
Taking the example from the problem statement, this will work:
string="web report: https://foo.bar/detail.html"
substring=$(grep -Po 'web report:\s*\K.*?(?=/detail\.html)' <<< "${string}")
echo "${substring}"
https://foo.bar
The aim of this post is to make a Golang quirk more common knowledge, with an ulterior motive of eventually getting it fixed upstream, somehow..
Disclaimer: I don’t code in Go. So I could very well be wrong in framing this as a problem with Golang rather than a problem with specifically the strconv package.
From my perspective, I see strconv as a core Go package that any (most?) Go coder would use to do string → int conversions. If so, I don’t grasp the rationale behind why the strconv developers would make this strange decision.. strange because in languages like Python, int("010") returns 10. (Python’s int(s, 0) does apply similar prefix rules, but its default base is 10.)
I learned about this issue for the first time from this Hugo Discourse thread. The synopsis is that someone is retrieving US city zip-codes from a Hugo front-matter variable, and then using some conditional logic based on the last 2 digits.
So the code was:
<!-- Value of .Params.cityZipCode is "75009" -->
{{ if (int (last 2 .Params.cityZipCode)) eq 1 }}er{{ else }}e{{ end }}
The logic is simple.. Get the last two characters of .Params.cityZipCode, which would be "09", convert that string to a number (int), and check if it is 1.
But of course, that didn’t work:
unable to cast “09” of type string to int
Later, I learned that’s because of the ParseInt function from the strconv package. The documentation there says (emphasis mine):
func ParseInt(s string, base int, bitSize int) (i int64, err error)

ParseInt interprets a string s in the given base (0, 2 to 36) and bit size (0 to 64) and returns the corresponding value i. If base == 0, the base is implied by the string’s prefix: base 16 for "0x", base 8 for "0", and base 10 otherwise. For bases 1, below 0 or above 36 an error is returned.
Again.. What was the Golang team thinking?!
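For what it’s worth, this prefix rule is not a Go invention; it echoes C’s strtol with base 0, and shell arithmetic follows the same convention (a quick bash illustration):

```shell
# The C-family convention: a leading 0 means octal, 0x means hex.
echo $((010))     # => 8
echo $((0x1f))    # => 31
# In bash, an explicit radix prefix forces base 10.
echo $((10#010))  # => 10
```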
This led me to update the int function documentation for Hugo with an ugly workaround:
{{ int ("00987" | strings.TrimLeft "0") }}
This problem of number-strings beginning with “0” considered as octals resurfaced recently in Hugo issue #4628.
Though, the reported error did not make it evident that that was the problem:
INFO 2018/04/15 18:49:36 found taxonomies: map[string]string{"category":"categories", "manufacturerletter":"manufacturerletters", "manufacturer":"manufacturers", "featured":"featured", "tag":"tags"}
panic: interface conversion: interface {} is float64, not int
goroutine 50 [running]:
github.com/gohugoio/hugo/hugolib.(*Site).assembleTaxonomies(0xc4204ce2c0)
/go/src/github.com/gohugoio/hugo/hugolib/site.go:1545 +0xee2
The issue reporter had 1500+ content files, and one or more of those files caused this uncaught exception (which is a separate issue, and is planned to be fixed in Hugo) to happen. So I had to spend quite some time doing “forensic debug”1 to understand what caused that “interface conversion: interface {} is float64, not int”.
The exception was thrown at the weight.(int) type assertion on the last line below:
for _, p := range s.Pages {
vals := p.getParam(plural, !s.Info.preserveTaxonomyNames)
weight := p.getParamToLower(plural + "_weight")
if weight == nil {
weight = 0
}
if vals != nil {
if v, ok := vals.([]string); ok {
for _, idx := range v {
x := WeightedPage{weight.(int), p}
So it was evident that one of the taxonomy weight values (manufacturers_weight in this case) wasn’t getting cast to int.
So I grepped for anything non-integer in those values, like “.”, “,”, “e”, or “E”, but found nothing.
Then, running rg ':\s[0-9]{7,}(\.[0-9]+)*$' on the content files, I saw that there were 4 files with oddly high weight values like 4611000, and wondered if that was somehow the problem. But that wasn’t it either.
When I deleted all the manufacturers_weight lines in those 1500+ files, the error went away.
find . -name "*.md" -print0 | xargs -0 sed -i '/manufacturers_weight:.*/d'
So then I restored all of those deleted lines, and started deleting them again, this time in progression ..
I had almost given up on debugging this further when I decided to give the git diff one last glance.. and I found the pattern.. the freaking leading 0’s in some of those manufacturers_weight values!
I had a strong gut feeling that those zeros were the problem. So I once again restored the deleted lines in all the content files, typed out the below2 with confidence ..
find . -name "*.md" -exec grep -P 'manufacturers_weight: 0[0-9]+' -l {} \; -exec sed -r -i 's/(manufacturers_weight: )0([0-9]+)/\1\2/' {} \;
.. and that error was of course gone! 🎉
This ended up with just 16 modified files with a diff like this:
...
modified content/movements/b/buren/buren-04.en.md
@@ -12,7 +12,7 @@ image: "Buren_04.jpg"
movementlistkey: "buren"
caliberkey: "04"
manufacturers: ["buren"]
-manufacturers_weight: 04
+manufacturers_weight: 4
categories: ["movements","movements_b","movements_b_buren_en"]
widgets:
relatedmovements: true
modified content/movements/c/citizen/citizen-0153.de.md
@@ -12,7 +12,7 @@ image: "Citizen_0153.jpg"
movementlistkey: "citizen"
caliberkey: "0153"
manufacturers: ["citizen"]
-manufacturers_weight: 0153
+manufacturers_weight: 153
categories: ["movements","movements_c","movements_c_citizen"]
widgets:
relatedmovements: true
...
That gave the issue reporter a workaround so that they could at least get their site built.
But I hope that this 0-leading octal absurdity gets fixed at the root level. People should once again be able to say with confidence, as they learned as kids, that “010” is the same thing as “10”.
I hope that Hugo fixes this issue (4628) on its end by not letting this exception go uncaught, and instead letting the user know that they magically added a non-int-castable octal value in their content, in file X on line Y.

And I hope that the Golang team gives some serious thought to this stupid (sorry about that) annoying decision:

If base == 0, the base is implied by the string’s prefix: base 16 for "0x", base 8 for "0"..
I call this “forensic debug” because I don’t know Go, or how and where to add debug statements within the hugo source code. So my approach was to figure out which content file/line caused that error. ↩︎
That command finds all the .md files in the current directory, uses grep to list the file names wherein the manufacturers_weight value begins with 0, and then uses sed to surgically remove the leading zeros just in those short-listed files. ↩︎