Using grep to do substring extraction in shell scripts.
I like regular expressions as they allow me to be concise and specific about what I need to search. And I have liked using regular expressions for many years, ever since I learned Perl about fifteen years back. I am writing this post as I am remembering the delight I felt when I realized that I can use the familiar Perl regular expressions to do string parsing in shell scripts. I am not exactly sure, but I probably learned about this grep -Po trick from stackexchange (camh, 2011).
I recommend using https://regex101.com/ to practice regular expressions of different flavors (PCRE2, PCRE, Python, etc.), whether or not you are new to using regex.
I could be parsing a log file with a line like "web report: https://foo.bar/detail.html", and I need to extract the https://foo.bar part to a shell script variable.
grep -Po
This solution requires a GNU grep version supporting -P, that’s compiled with libpcre. GNU grep gained the PCRE (-P) feature back in 2000. Also, I have never come across or used a system that did not have such a grep version installed.
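If a script might run where grep lacks PCRE support, a quick probe can check for it up front. This is only a sketch: the variable name grep_has_pcre is made up for illustration, and it assumes that a grep without -P exits with a non-zero status when given that switch.

```shell
#!/usr/bin/env bash
# Probe for PCRE support by running a tiny pattern that needs -P (\K).
# 'grep_has_pcre' is a hypothetical variable name used only in this sketch.
if printf 'abc\n' | grep -Pq 'a\Kb' 2>/dev/null; then
    grep_has_pcre=yes
else
    grep_has_pcre=no
fi
echo "grep -P support: ${grep_has_pcre}"
```

The -q switch keeps grep quiet, so only its exit status is used.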
I’ll throw the solution out here and then dig into the details.
echo "def\nabc" | grep -Po 'a\K.(?=c)' # => b
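The grep switches used here are:
-P: interprets the pattern as a Perl-compatible regular expression (PCRE). This enables the lookahead syntax (?=..) and special characters like \K (“perlre - Perl regular expressions,” n.d.).
-o: prints only the matched part of the input instead of the whole matching line.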
Now I’ll start with a basic example and build up to the above solution.
Below, the regular expression for matching any character between “a” and “c” ('a.c') is correct, but grep prints the whole input line because that regex matched somewhere in it.
echo "def\nabc" | grep 'a.c' # => def\nabc
Now we add the grep -o switch so that it outputs only the matched portion. As the regex is 'a.c', the -o switch will output every part of the input that matched that. So the output is “abc”. It’s still not what we wanted.
echo "def\nabc" | grep -o 'a.c' # => abc
Now we bring in the powerful Perl regex feature, positive lookahead.
Positive lookahead is used when you want to match something only if it’s followed by something else. Its syntax looks like q(?=u), where that expression matches if a q is followed by a u, without making the u part of the match (reference).
But this is still not exactly what we want because “a” is still
considered as part of the match. Now the output is “ab”.
echo "abc" | grep -Po 'a.(?=c)' # => ab
We now only need a special construct that marks a point in the regex and tells the engine “don’t consider anything before this as part of the match”. The \K special construct does that; it is described in the Perl regular expressions doc as:
There is a special form of this construct, called \K (available since Perl 5.10.0), which causes the regex engine to “keep” everything it had matched prior to the \K and not include it in matched string. This effectively provides non-experimental variable-length lookbehind of any length.
And, thus we have the final solution:
echo "abc" | grep -Po 'a\K.(?=c)' # => b
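As a side-by-side sketch (not taken from the references above), the same character can also be extracted with a fixed-length lookbehind, (?<=a), while \K keeps working when the prefix has a variable length. The input strings and variable names are made up for illustration, and a GNU grep with -P support is assumed.

```shell
#!/usr/bin/env bash
# Fixed-length prefix: a lookbehind and \K give the same result.
with_lookbehind=$(echo "abc" | grep -Po '(?<=a).(?=c)')
with_keep=$(echo "abc" | grep -Po 'a\K.(?=c)')
echo "${with_lookbehind} ${with_keep}" # => b b

# Variable-length prefix (one or more "a"): \K still drops all of it.
variable_prefix=$(echo "aaabc" | grep -Po 'a+\K.(?=c)')
echo "${variable_prefix}" # => b
```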
Taking the example from the problem statement, this will work:
string="web report: https://foo.bar/detail.html"
substring=$(grep -Po 'web report:\s*\K.*?(?=/detail\.html)' <<< "${string}")
echo "${substring}"
https://foo.bar
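In a real script the line may not match at all, so it can help to check grep's exit status, which is non-zero when nothing matched. Below is a minimal sketch of that, reusing the same regex; the function name extract_url and the second input string are hypothetical.

```shell
#!/usr/bin/env bash
# extract_url is a hypothetical helper wrapping the grep -Po call above.
# Its exit status is grep's: 0 on a match, non-zero otherwise.
extract_url() {
    printf '%s\n' "$1" | grep -Po 'web report:\s*\K.*?(?=/detail\.html)'
}

if url=$(extract_url "web report: https://foo.bar/detail.html"); then
    echo "extracted: ${url}" # => extracted: https://foo.bar
fi

# A made-up line without the expected pattern: grep fails, nothing is captured.
if ! missing=$(extract_url "nothing to see here"); then
    echo "no match found"
fi
```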
I was working on a tcsh script that did some cool stuff. But if a user ran that script not knowing its true impact, it could make some bad irreversible changes.
While I could simply echo a warning statement and put a sleep 10, I wanted the wait time to be shown live.
So here’s what worked pretty nicely: the warning message is shown to the user, and the actual wait time countdown is also displayed.
#!/usr/bin/env tcsh
set wait_time = 10 # seconds
echo "Are you sure you meant to run this script?"
echo "This script does something drastic that you would severely regret if you happened to run this script by mistake!"
echo ""
set temp_cnt = ${wait_time}
# https://www.cyberciti.biz/faq/csh-shell-scripting-loop-example/
while ( ${temp_cnt} >= 1 )
printf "\rYou have %2d second(s) remaining to hit Ctrl+C to cancel that operation!" ${temp_cnt}
sleep 1
@ temp_cnt--
end
echo ""
The while loop runs for $wait_time times; each time waiting for a second (sleep 1) and then decrementing the temporary counter $temp_cnt.
printf is chosen instead of echo -n because I wanted to have the seconds number always hold 2 character places (%2d).
The \r character in printf makes the magic here. It represents carriage return, i.e. the cursor will return to the beginning of the line, and then print the following string, overwriting whatever there was on that line earlier.
printf acts like echo -n, i.e. a newline is not inserted automatically at the end of the printed message. In order to add a newline at the end for printf, you need to do so explicitly by adding a \n character.
Click here to see the animation on asciinema.org.
Below is a re-implementation of the above in bash.
#!/usr/bin/env bash
wait_time=10 # seconds
echo "Are you sure you meant to run this script?"
echo "This script does something drastic that you would severely regret if you happened to run this script by mistake!"
echo ""
temp_cnt=${wait_time}
while [[ ${temp_cnt} -gt 0 ]];
do
printf "\rYou have %2d second(s) remaining to hit Ctrl+C to cancel that operation!" ${temp_cnt}
sleep 1
((temp_cnt--))
done
echo ""
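If the same countdown is needed in more than one script, the loop could be wrapped in a function. The sketch below is one possible way to do that; the function name countdown and its two arguments are assumptions for illustration, not part of the scripts above.

```shell
#!/usr/bin/env bash
# countdown SECONDS [WHAT] (hypothetical helper)
# Prints a live countdown on one line, then moves to a fresh line.
countdown() {
    local remaining=$1
    local what=${2:-that operation}
    while [ "${remaining}" -gt 0 ]; do
        # \r returns to the start of the line so each message overwrites the last.
        printf '\rYou have %2d second(s) remaining to hit Ctrl+C to cancel %s!' "${remaining}" "${what}"
        sleep 1
        remaining=$((remaining - 1))
    done
    # printf does not add a newline on its own, so end the line explicitly.
    printf '\n'
}

countdown 2
```

A destructive script would call something like countdown 10 right after printing its warning.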
Checking if an executable is in PATH.
I often need to check if a particular executable is present in the PATH before I can proceed with what I am doing in a shell script. Also, I need to work with both tcsh and bash scripts. Below are the solutions that have worked for me in these shell scripts.
The below solution using hash was with the help of this SO solution.
if ! hash some_exec 2>/dev/null
then
echo "'some_exec' was not found in PATH"
fi
Here is the tl;dr from the above SO solution:
Where bash is your shell/hashbang, consistently use hash (for commands) or type (to consider built-ins & keywords). When writing a POSIX script, use command -v.
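For the POSIX case mentioned in that quote, a check with command -v could look like the sketch below. The function name have_cmd is made up for illustration, and some_exec is the same placeholder name as in the hash example.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: succeeds if the name resolves to a command.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if ! have_cmd some_exec; then
    echo "'some_exec' was not found in PATH"
fi
```

Note that command -v also succeeds for shell built-ins and functions, not only for executables found in PATH.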
As it turns out, the tcsh shell does not have the same hash command as the bash shell.
But the below solution using where, which I found with the help of this SO solution, works fine.
if ( `where some_exec` == "" ) then
echo "'some_exec' was not found in PATH"
endif
Getting the current directory name using awk or rev+cut or the boring basename.
pwd | awk -F/ '{print $NF}'
pwd | rev | cut -d/ -f 1 | rev
basename `pwd`
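In bash and other POSIX shells, the same last path component can also be taken with parameter expansion, without starting another process. This is an additional sketch rather than one of the approaches above; the path and variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# ${var##*/} removes the longest prefix ending in "/", leaving the last component.
dir="/home/user/project" # hypothetical path
name=${dir##*/}
echo "${name}" # => project

# The same expansion applied to the current directory:
echo "${PWD##*/}"
```

One difference from basename: for the root directory "/" this expansion gives an empty string, while basename prints "/".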