Understanding Regular Expressions

Introduction

Learning Regular Expressions may not be as straightforward as learning the ls command. However, using them effectively in your daily work will reward the effort with greater efficiency and time savings. Regular Expressions are a topic that can easily fill a 1000-page book. In this article, we only try to explain the basics of Regular Expressions in a concise, non-geeky and example-driven manner. So if you have ever wanted to learn the basics of Regular Expressions, now is your chance.

The intention of this tutorial is to cover the fundamental core of Basic Regular Expressions and Extended Regular Expressions. For this, we will use a single tool: the GNU grep command. GNU grep recognizes three different types of Regular Expressions:

  • Basic Regular Expressions (BRE)
  • Extended Regular Expressions (ERE)
  • Perl-compatible Regular Expressions (PCRE)

The difference between Basic Regular Expressions and Extended Regular Expressions will be explained momentarily.

What is a Regular Expression

A Regular Expression provides the ability to match a “string of text” in a very flexible and concise manner. A “string of text” can be a single character, a word, a sentence or a particular pattern of characters. Well-known abbreviations for “Regular Expression” include regex and regexp.

Simple Regular Expression example

The simplest building block of any regular expression is a character. We can use grep to search for any particular character in any non-binary file. For example, here is the content of our regex.txt sample file:

$ cat regex.txt 
grep stands for:                                                                   
global                                                                             
regular                                                                            
expression                                                                         
print

Now we can use grep to search for any character by providing it with a regular expression. Let’s use grep to search for a character “e”:

$ grep e regex.txt                                          
grep stands for:
regular
expression

As you can see from the example above, grep printed all lines containing at least one “e” character. We can now combine multiple characters to form the string “regu” and use grep to search for that string in the text:

$ grep regu regex.txt 
regular

To unleash the real power of regular expressions, though, we need to build them from non-alphabetic characters (meta-characters), or from a combination of alphabetic and non-alphabetic characters. For example, what if you want to find all lines which begin with the character “g”? For this we can use the caret symbol “^”:

$ grep ^g regex.txt 
grep stands for:
global

This was just a basic example of a more sophisticated regular expression. The rest of this article explains techniques like this one in more detail.
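The searches above can be reproduced without touching your own files. The following is a minimal sketch, assuming a POSIX shell with mktemp available; the variable names are only illustrative. It rebuilds the sample file in a temporary location and quotes each pattern, so that the shell hands characters such as “^” to grep unchanged:

```shell
# Recreate the sample file in a temporary location (the path is arbitrary).
sample=$(mktemp)
printf '%s\n' 'grep stands for:' 'global' 'regular' 'expression' 'print' > "$sample"

# Single quotes keep the shell from interpreting any character of the pattern.
starts_with_g=$(grep '^g' "$sample")

# -c prints the number of matching lines instead of the lines themselves.
count_with_e=$(grep -c 'e' "$sample")

echo "$starts_with_g"
echo "$count_with_e"
rm -f "$sample"
```

Quoting makes no difference for a simple pattern such as ^g, but it becomes important once a pattern contains characters like “*”, “[”, “|” or “$”, which the shell may otherwise interpret itself.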

Concatenation

As the preceding examples show, the simplest regular expression consists of an individual character: a regular expression made of a single non-special character matches any string containing that character. Regular expressions can also be concatenated. A sequence of characters such as “press” therefore matches any string that contains the substring formed by concatenating the regular expressions “p”, “r”, “e”, “s” and “s”.

$ cat regex.txt 
grep stands for:                                                                   
global                                                                             
regular                                                                            
expression                                                                         
print
$ grep press regex.txt 
expression
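To see exactly which part of a line a concatenated expression matched, the -o option (also used in the examples at the end of this article) prints only the matched text. A small sketch, with the sample file recreated in a temporary location:

```shell
sample=$(mktemp)
printf '%s\n' 'grep stands for:' 'global' 'regular' 'expression' 'print' > "$sample"

# -o prints only the part of each line that matched, one match per output line.
matched=$(grep -o 'press' "$sample")

echo "$matched"
rm -f "$sample"
```

Instead of the whole line “expression”, only the substring “press” is printed.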

Basic vs Extended Regular Expressions

GNU grep understands both basic and extended regular expressions. The prime difference is that in basic regular expressions the meta-characters ?, +, {, |, ( and ) lose their special meaning and are treated as literal characters. To give these meta-characters their special meaning, they need to be escaped with a backslash. Of these escaped forms, \?, \+ and \| are GNU extensions to basic regular expressions. Consider the following example.

Our regex.txt file now contains the following:

$ cat regex.txt 
global|regular|expression|print
Global Regular Expression Print

The grep command assumes basic regular expressions by default. Therefore, the following command prints only the first line, because it is the only one containing the literal substring “n|p”:

$ grep "n|p" regex.txt 
global|regular|expression|print

The “|” alternation operator has its own special meaning, and that is logical OR. However, this special meaning was suppressed in the previous example, since grep by default treats any regular expression as a basic regular expression. To make grep read extended regular expressions, we need to use the -E option or use egrep instead of grep. Note that recent GNU grep releases (3.8 and later) print a warning that egrep is obsolescent, so grep -E is the safer choice.

$ grep -E "n|p" regex.txt 
global|regular|expression|print
Global Regular Expression Print
OR
$ egrep "n|p" regex.txt 
global|regular|expression|print
Global Regular Expression Print

In the preceding example, we used grep with extended regular expressions, and thus it displayed both lines, since each contains an n OR a p character. As said previously, these meta-characters lose their special meaning in basic regular expressions unless they are escaped with the “\” character. Let’s re-use our first example, but this time we escape the “|” character:

$ grep "n\|p" regex.txt 
global|regular|expression|print
Global Regular Expression Print

In this case the alternation operator “|” regains its special meaning and acts as logical OR even though we did not use the -E option or egrep.

We also said that with egrep or the -E option, grep expects Extended Regular Expressions. Because of that, if you escape a meta-character in the extended regular expression context, it loses its special meaning and behaves as a literal character, here “|”. If you have followed so far, you will notice that this is the exact opposite of basic regular expressions.

Example:

$ egrep "n\|p" regex.txt 
global|regular|expression|print
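The four combinations discussed in this section can be compared side by side. The sketch below assumes GNU grep, because “\|” in a basic regular expression is a GNU extension; it counts the matching lines for each combination, and the variable names are only illustrative:

```shell
sample=$(mktemp)
printf '%s\n' 'global|regular|expression|print' 'Global Regular Expression Print' > "$sample"

bre_plain=$(grep -c 'n|p' "$sample")        # BRE: | is a literal character
bre_escaped=$(grep -c 'n\|p' "$sample")     # BRE (GNU): \| means OR
ere_plain=$(grep -E -c 'n|p' "$sample")     # ERE: | means OR
ere_escaped=$(grep -E -c 'n\|p' "$sample")  # ERE: \| is a literal character

echo "BRE n|p: $bre_plain, BRE n\|p: $bre_escaped"
echo "ERE n|p: $ere_plain, ERE n\|p: $ere_escaped"
rm -f "$sample"
```

The literal forms match only the first line, while the OR forms match both lines.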

Bracket Expressions

Now that we are acquainted with the basics of regular expressions, we can explore their more powerful and more complex features. The first stop is the use of “[” and “]”, known as “Bracket Expressions”. A bracket expression matches any single character from the list enclosed by “[” and “]”. Let’s wrap the letter “e” with “[]” and see what happens:

$ cat regex.txt 
global|regular|expression|print
Global Regular Expression Print
$ grep "[e]xpression" regex.txt 
global|regular|expression|print

As you can see, nothing unusual happened here. Our regular expression merely matched the keyword “expression”, and grep therefore printed the respective line. For that reason, the following regular expression does the same thing:

$ grep expression regex.txt 
global|regular|expression|print

The power of a Bracket Expression shows when the list holds more than one character: the expression then matches any single one of them. This is demonstrated in the following example:

$ grep "[eE]xpression" regex.txt 
global|regular|expression|print
Global Regular Expression Print

Can you think of a way to write an alternative to the above regular expression without using “[ ]”? The technique has already been shown earlier!

Using a Bracket Expression it is also possible to express a logical NOT. For this we place a caret symbol “^” right after the opening bracket. In the following example, we use a regular expression to print all lines that contain at least one character other than “a” and “c”:

$ cat regex.txt 
a
b
c
d
$ grep "[^ac]" regex.txt 
b
d
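A negated bracket expression still has to match one character, so a line qualifies as soon as it contains any character outside the list. The sketch below adds two extra lines (“ab” and “ca”, made up for illustration) to the sample file to show the difference between “contains something other than a or c” and “contains neither a nor c”; for the latter it uses the standard -v option, which inverts the match:

```shell
sample=$(mktemp)
printf '%s\n' 'a' 'b' 'c' 'd' 'ab' 'ca' > "$sample"

# Lines with at least one character that is neither "a" nor "c".
has_other=$(grep '[^ac]' "$sample")

# Lines with no "a" and no "c" at all: invert a positive match with -v.
has_neither=$(grep -v '[ac]' "$sample")

echo "$has_other"
echo "---"
echo "$has_neither"
rm -f "$sample"
```

The line “ab” appears in the first result because of its “b”, but not in the second, while “ca” appears in neither.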

Expression Range

A bracket expression also allows you to specify an expression range. An expression range consists of two characters separated by a hyphen. This means that instead of [0123456789] we can simply use [0-9], or instead of [abc] we can use [a-c]. This is illustrated in the following regex example:

$ cat regex.txt 
a
b
c
d
$ grep "[^a-c]" regex.txt 
d
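Which characters fall inside a range such as [a-c] can depend on the current locale. The following sketch therefore sets LC_ALL=C for each command, which is an assumption made here to keep the result predictable; the sample lines are made up for illustration:

```shell
sample=$(mktemp)
printf '%s\n' 'a1' 'B2' 'c' '42' '-' > "$sample"

# Lines containing a digit.
with_digit=$(LC_ALL=C grep '[0-9]' "$sample")

# Lines starting with a lower-case letter from a to c.
starts_a_to_c=$(LC_ALL=C grep '^[a-c]' "$sample")

# Several ranges can share one bracket expression.
starts_any_case=$(LC_ALL=C grep '^[A-Ca-c]' "$sample")

echo "$with_digit"
echo "---"
echo "$starts_a_to_c"
echo "---"
echo "$starts_any_case"
rm -f "$sample"
```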

Character Classes

What follows are pre-defined classes for you to use within bracket expressions.

  • [:alnum:] – Alphanumeric characters
  • [:alpha:] – Alphabetic characters
  • [:cntrl:] – Control characters
  • [:digit:] – Digits: 0 1 2 3 4 5 6 7 8 9
  • [:graph:] – Graphical characters
  • [:lower:] – Lower-case letters
  • [:print:] – Printable characters
  • [:punct:] – Punctuation characters
  • [:space:] – Space characters
  • [:upper:] – Upper-case letters
  • [:xdigit:] – Hexadecimal digits

In the following regular expression example, we will use [:lower:] and [:space:] to print only lines, which contain lower-case letter(s) or space:

$ cat regex.txt 
1
2
3
A
b
c
,
<-- space
$ grep "[[:lower:][:space:]]" regex.txt
b
c
<-- space
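Note the double brackets in the example above: a class name such as [:lower:] is only valid inside a bracket expression, so the outer “[” and “]” are always required. A short sketch with a few more classes, using a sample file similar to the one above:

```shell
sample=$(mktemp)
printf '%s\n' '1' 'A' 'b' ',' ' ' > "$sample"

# Count lines that contain a digit or an upper-case letter.
digit_or_upper=$(grep -c '[[:digit:][:upper:]]' "$sample")

# Print lines that contain a punctuation character.
punct=$(grep '[[:punct:]]' "$sample")

# Count lines that consist of exactly one whitespace character.
only_space=$(grep -c '^[[:space:]]$' "$sample")

echo "$digit_or_upper"
echo "$punct"
echo "$only_space"
rm -f "$sample"
```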

Anchoring

Anchoring is a regular expression technique which uses the caret ^ symbol and the dollar sign $ as meta-characters to match the empty string at the beginning and at the end of the line, respectively.

Let’s find all lines in the /etc/services file which start with the string “ftp”:

$ grep ^ftp /etc/services 
ftp-data        20/tcp
ftp             21/tcp
ftps-data       989/tcp                         # FTP over SSL (data)
ftps            990/tcp

As an opposite example we can use regex anchoring to find all lines ending with ftp:

$ grep ftp$ /etc/services 
zope-ftp        8021/tcp

NOTE: Do not confuse this meaning of the caret ^ with the caret used inside a bracket expression; the two have quite distinct meanings in their respective contexts.
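Both anchors can be combined to match a whole line, and the two meanings of the caret can appear in the same expression. The following sketch uses a small made-up file instead of /etc/services, so that the result does not depend on your system:

```shell
sample=$(mktemp)
printf '%s\n' 'ftp' 'ftp-data' 'zope-ftp' '' 'sftp ftp' > "$sample"

# ^ and $ together: the line must be exactly "ftp".
whole_line=$(grep '^ftp$' "$sample")

# Nothing between the anchors: count empty lines.
empty_lines=$(grep -c '^$' "$sample")

# First ^ is an anchor, second ^ negates the bracket expression:
# lines whose first character is anything but "f".
not_f_first=$(grep '^[^f]' "$sample")

echo "$whole_line"
echo "$empty_lines"
echo "$not_f_first"
rm -f "$sample"
```

The empty line is not part of the last result, because a bracket expression, negated or not, always needs one character to match.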

The Backslash Character and Special Expressions

There are numerous system tools, including grep, which support “Special Expressions”. The ones listed first are also known as word boundaries. Here are some Special Expression symbols supported by grep and many other system utilities:

  • \< – match the empty string at the beginning of a word
  • \> – match the empty string at the end of a word
  • \b – match the empty string at the edge (beginning or end) of a word
  • \B – match the empty string anywhere except at the edge of a word

Let’s start with \<, which matches the empty string at the beginning of a word. Here is our test file:

$ cat regex.txt 
RegularExpressions
Regular ExpressionsRegular Expressions

The following Regular Expression matches both lines, because on each line “Regular” begins a word (here, at the start of the line):

$ grep "\<Regular" regex.txt 
RegularExpressions
Regular ExpressionsRegular Expressions

The next example displays only the second line, since we also use \> to match the empty string at the end of the word:

$ grep "\<Regular\>" regex.txt 
Regular ExpressionsRegular Expressions

The meaning of \b is similar, but it matches the empty string at either the beginning or the end of a word:

$ grep "\bExpressions\b" regex.txt 
Regular ExpressionsRegular Expressions

Whereas \B only matches when not at the beginning or end of a word. In the following example, the match is the “Expressions” that is directly followed by “Regular”:

$ grep "\bExpressions\B" regex.txt 
Regular ExpressionsRegular Expressions
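For a plain word, grep’s -w option gives the same result as wrapping the pattern in \< and \>. A small sketch, assuming GNU grep for the \<, \> and \b operators, that compares both forms on the test file from above:

```shell
sample=$(mktemp)
printf '%s\n' 'RegularExpressions' 'Regular ExpressionsRegular Expressions' > "$sample"

# Explicit word boundaries around the pattern.
with_anchors=$(grep '\<Regular\>' "$sample")

# -w asks grep to accept only matches that form a whole word.
with_w=$(grep -w 'Regular' "$sample")

# -o shows which text \b...\b actually matched.
whole_word=$(grep -o '\bExpressions\b' "$sample")

echo "$with_anchors"
echo "$with_w"
echo "$whole_word"
rm -f "$sample"
```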

For completeness, here are some other special expressions available in GNU grep. The following symbols are simply abbreviations of bracket expressions built from the above-mentioned Character Classes:

  • \s – Match any whitespace character (space, tab, etc.). Alias for [[:space:]]
  • \S – Match any character but whitespace (space, tab, etc.). Alias for [^[:space:]]
  • \w – Match any word character: 0 – 9, A – Z, a – z and the underscore. Alias for [_[:alnum:]]
  • \W – Match any character that is not a word character. Alias for [^_[:alnum:]]
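The underscore is the detail that is easiest to overlook: in GNU grep, \w matches it, whereas [[:alnum:]] does not. A short sketch on a made-up sample file that shows the difference:

```shell
sample=$(mktemp)
printf '%s\n' 'foo_bar' '_' '-' 'a1' > "$sample"

# \w matches letters, digits and the underscore.
word_chars=$(grep '\w' "$sample")

# [[:alnum:]] matches letters and digits only.
alnum_chars=$(grep '[[:alnum:]]' "$sample")

# Lines consisting of exactly one non-word character.
single_non_word=$(grep '^\W$' "$sample")

echo "$word_chars"
echo "---"
echo "$alnum_chars"
echo "---"
echo "$single_non_word"
rm -f "$sample"
```

The line holding only “_” shows up in the first result but not in the second.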

Here are some examples of these abbreviations. The first line of the sample file contains a single TAB character:

$ cat regex.txt 

abcd
1234
"

Match whitespace (here, the line holding the TAB):

$ grep "\s" regex.txt 
        

Match anything but white space:

$ grep "\S" regex.txt 
abcd
1234
"

Match lines containing word characters:

$ grep "\w" regex.txt 
abcd
1234

Match lines containing non-word characters (this includes whitespace):

$ grep "\W" regex.txt 
        
"

Repetition

A regular expression may be followed by one or several repetition quantifiers. Before you continue with this section, please take a look at the table below:

? – The preceding item is optional and matched at most once
* – The preceding item will be matched zero or more times.
+ – The preceding item will be matched one or more times.
{n} – The preceding item is matched exactly n times.
{n,} – The preceding item is matched n or more times.
{n,m} – The preceding item is matched at least n times, but not more than m times.

Let’s begin by creating our sample file regex.txt:

$ cat regex.txt 
Expressions
Expressssssions
Expresssions
Expresions
Expreions

The first repetition example uses “?”:

$ grep -E "Expres?ions" regex.txt 
Expresions
Expreions

As described in the table above, the “?” quantifier matches the preceding item at most once; in other words, it makes the preceding item optional. The preceding item in our case is the character “s”. Therefore, grep matched only strings with zero or one “s” between “Expre” and “ions”. The next quantifier we are going to look at is “*”, which by definition matches the preceding item zero or more times.

$ grep -E "Expres*ions" regex.txt 
Expressions
Expressssssions
Expresssions
Expresions
Expreions

As illustrated above, the “*” quantifier matches all strings in our test file. If you wonder why it also matched “Expreions”, keep in mind that the “*” quantifier makes the preceding item optional, as opposed to the “+” quantifier, which must match the preceding item at least once:

$ grep -E "Expres+ions" regex.txt 
Expressions
Expressssssions
Expresssions
Expresions

With the “{n}” quantifier you can specify precisely how many times the preceding item is matched. For example:

$ grep -E "Expres{3}ions" regex.txt 
Expresssions

This command matches a string which consists of “Expre” followed by 3 x “s” and followed by “ions”. To stretch the previous regular expression further, “{n,}” lets us specify the minimum number of times the preceding item is matched. As a result, the “{3,}” repetition matches 3 or more times:

$ grep -E "Expres{3,}ions" regex.txt 
Expressssssions
Expresssions

To extend the above regular expression even further, we can specify a range. If we replace “{3,}” with “{1,3}”, the regex matches the following:

$ grep -E "Expres{1,3}ions" regex.txt 
Expressions
Expresssions
Expresions

since the preceding item “s” is matched at least once but no more than three times.
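The short quantifiers can be seen as special cases of the brace form: “?” behaves like “{0,1}”, “*” like “{0,}” and “+” like “{1,}”. The sketch below checks this by counting matching lines for each pair, and also shows the basic regular expression spelling of a range, where the braces are escaped. The variable names are only illustrative:

```shell
sample=$(mktemp)
printf '%s\n' 'Expressions' 'Expressssssions' 'Expresssions' 'Expresions' 'Expreions' > "$sample"

optional=$(grep -E -c 'Expres?ions' "$sample")
optional_braces=$(grep -E -c 'Expres{0,1}ions' "$sample")

star=$(grep -E -c 'Expres*ions' "$sample")
star_braces=$(grep -E -c 'Expres{0,}ions' "$sample")

plus=$(grep -E -c 'Expres+ions' "$sample")
plus_braces=$(grep -E -c 'Expres{1,}ions' "$sample")

# Without -E the braces must be escaped to act as a quantifier.
bre_range=$(grep -c 'Expres\{1,3\}ions' "$sample")

echo "? $optional $optional_braces"
echo "* $star $star_braces"
echo "+ $plus $plus_braces"
echo "BRE {1,3}: $bre_range"
rm -f "$sample"
```

Each pair reports the same number of matching lines.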

Alternation

You can think of regex alternation as a logical OR operation, where regular expressions are joined together by one or more “|” alternation operators. The resulting regular expression matches any string that matches at least one of the alternatives.

$ cat regex.txt                                   
grep stands for:                                                                                  
global                                                                                            
regular                                                                                           
expression                                                                                        
print                                                                                             
$ grep -E "^r|^e" regex.txt                                                
regular                                                                            
expression
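grep offers a second way to express the same OR: every pattern given with the -e option is tried, and a line is printed if any of them matches. A sketch comparing the two forms on the sample file from above:

```shell
sample=$(mktemp)
printf '%s\n' 'grep stands for:' 'global' 'regular' 'expression' 'print' > "$sample"

# Alternation inside one extended regular expression.
with_pipe=$(grep -E '^r|^e' "$sample")

# Two separate patterns; a line matching either one is printed.
with_e_options=$(grep -e '^r' -e '^e' "$sample")

echo "$with_pipe"
echo "---"
echo "$with_e_options"
rm -f "$sample"
```

Both commands print the lines “regular” and “expression”.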

Precedence

When forming expressions, there is another property of Regular Expressions to consider, and that is precedence. As in arithmetic, regular expressions follow a predefined precedence. “Repetition” has the highest precedence, followed by “Concatenation”, and the lowest precedence belongs to “Alternation”. Consider the following example:

$ cat regex.txt 
regex
regexxx
$ grep -E "regex{3}" regex.txt 
regexxx

In the above regular expression, we can see both Concatenation “regex” and Repetition “x{3}”. Since repetition has higher precedence, “{3}” applies only to the preceding “x” and not to the whole of “regex”, so the regular expression matches “regexxx” but not “regex”.

Another example where precedence needs to be taken into account is the Alternation operator “|”, which has the lowest precedence of all regular expression operators. Consider the following example:

$ cat regex.txt 
regular expressions
regular
expressions
$ grep -E "^regular|expressions$" regex.txt 
regular expressions
regular
expressions

Since the alternation operator “|” has the lowest precedence, each side of it is a complete concatenated expression. In our case, these are “regular” with the start of line anchor “^” and “expressions” with the end of line anchor “$”, so a line matches if it starts with “regular” or ends with “expressions”. To override the default precedence we need to use “()”. In the following example, we use “()” to group the alternation, which makes a noticeable difference:

$ grep -E "^(regular|expressions)$" regex.txt 
regular
expressions

In this example, the alternation is evaluated first, because “()” turns it into a subexpression. As a result, the above regular expression only matches lines which match “^regular$” OR “^expressions$”.
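Grouping changes the scope of repetition in the same way. In the following sketch, on a made-up sample file, “{2}” first applies only to “b” and then, thanks to “()”, to the whole of “ab”:

```shell
sample=$(mktemp)
printf '%s\n' 'abab' 'abb' 'ab' > "$sample"

# Repetition binds to the single preceding character: "a" followed by "bb".
without_group=$(grep -E 'ab{2}' "$sample")

# Parentheses make "ab" one item, so it is "ab" that must occur twice.
with_group=$(grep -E '(ab){2}' "$sample")

echo "$without_group"
echo "$with_group"
rm -f "$sample"
```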

Back References and Subexpressions

Any part of a regular expression enclosed by “()” creates a subexpression, which can be used as a back reference later in the same regular expression. This is illustrated by the following example:

$ cat regex.txt 
regular expressions
$ grep -E "(re)gular exp\1ssions" regex.txt
regular expressions

The subexpression “re” is used as a back reference later in the regular expression by means of “\1”, which matches the same text that the subexpression matched. Subexpressions are numbered in the order in which they appear, and the back reference “\n” refers to the n-th subexpression:

$ grep -E "(r)(e)gular \2xp\1\2ssions" regex.txt
regular expressions
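Back references also work in basic regular expressions, where the parentheses have to be escaped. A back reference matches the text that its subexpression actually matched, which makes it possible to search for repeated text without knowing it in advance. The following sketch assumes GNU grep, since back references in extended regular expressions are an extension, and uses a made-up sample file:

```shell
sample=$(mktemp)
printf '%s\n' 'regular expressions' 'global' 'print' > "$sample"

# Basic regular expression form: \( \) instead of ( ).
bre_backref=$(grep '\(re\)gular exp\1ssions' "$sample")

# Any character immediately followed by itself, e.g. the "ss" in "expressions".
doubled=$(grep -E '(.)\1' "$sample")

echo "$bre_backref"
echo "$doubled"
rm -f "$sample"
```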

Conclusion

Regular expressions are a very powerful tool in the hands of any system admin, programmer (BASH, PHP, C#, Java and many more) or casual Linux/Unix command line user. This article attempted to describe, in simple and plain English, the basics of Regular Expressions, upon which you can further develop your Regular Expressions skills and thus save yourself from the tedious work which text processing can sometimes involve.

Regular Expressions Examples

Each command below is followed by a description of what its regular expression does.

grep -E '^([0-9]{4}[- ]?){3}[0-9]{4}$' credit-card.txt
Validating a credit card number. This regular expression matches any credit card number in the format xxxx-xxxx-xxxx-xxxx or xxxx xxxx xxxx xxxx. Since the separator is optional, it also matches 16 digits written without separators.

grep '^[[:space:]]*$' regex.txt
Using grep and a regular expression to find blank lines.

grep -E '\<([[:alpha:]]+)[[:space:]]+\1\>' regex.txt
Sometimes you make the mistake of typing the same word twice in a row. For example “grep and and regular expressions”. This regex spots this kind of typo.

grep -E  '^\$[0-9]+\.[0-9][0-9]$' regex.txt
Validating currency with 2 decimal places. This regular expression validates currency using the $ symbol and matches $12.46 but not €34.54 or $1.333.

df | grep -E "(([6-9][0-9])|(100))%"
Regex for finding all partitions on your system which use 60% or more of their disk space.

grep -E -o "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" emails.txt
This regex helps you to extract / find all email addresses in any text.

grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' IPs.txt
This regex helps you to extract / find all IP addresses in any input. It only checks the shape of the address, so it also accepts numbers above 255.

grep -oiE '\b(https?)://[-[[:alnum:]+&@#/%?=~_|!:,.;]*[[:alnum:]+&@#/%=~_|]' index.htm
Extract URLs from an HTML file.
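Any of the expressions above can be tried safely on a throwaway file before using it on real data. As a sketch, here is the currency expression applied to a few made-up values; the sample values and variable names are only illustrative:

```shell
sample=$(mktemp)
printf '%s\n' '$12.46' '$1.333' '12.46' '$5.00' > "$sample"

# \$ is a literal dollar sign; the final unescaped $ anchors the end of the line.
valid=$(grep -E '^\$[0-9]+\.[0-9][0-9]$' "$sample")

echo "$valid"
rm -f "$sample"
```

Only “$12.46” and “$5.00” are printed: “$1.333” has three decimal places and “12.46” lacks the $ symbol.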

