Introduction to grep and regular expressions

Objective

After reading this tutorial you should be able to understand how the grep command works, and how to use it with basic and extended regular expressions.

Difficulty

EASY

Introduction

Grep is one of the most useful tools we can use when administering a Unix-based machine: its job is to search for a given pattern inside one or more files and print the lines that match.

In this tutorial we will see how to use it, and we will also examine its variants: egrep and fgrep. We will save this famous excerpt from the book “The Lord of the Rings” to a file and use it as the target for our examples:

Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in their halls of stone,
Nine for Mortal Men doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all, and in the darkness bind them,
In the Land of Mordor where the Shadows lie.

The file will be called lotr.txt.
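
One possible way to create the file, sketched here under the assumption that a POSIX-style shell is in use and that writing lotr.txt to the current directory is acceptable, is a quoted here-document. The line_count variable is only an illustrative sanity check.

```shell
# Write the eight lines of the poem to lotr.txt in the current directory.
# Quoting the 'EOF' delimiter stops the shell from expanding anything in the text.
cat > lotr.txt << 'EOF'
Three Rings for the Elven-kings under the sky,
Seven for the Dwarf-lords in their halls of stone,
Nine for Mortal Men doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all, and in the darkness bind them,
In the Land of Mordor where the Shadows lie.
EOF

# Sanity check: count the lines just written (tr strips padding some wc versions add).
line_count=$(wc -l < lotr.txt | tr -d ' ')
echo "$line_count"
```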

Grep variants

In the introduction we mentioned two grep variants: egrep and fgrep. These variants are actually deprecated, since they are equivalent to running grep with the -E and -F options respectively. Before explaining how they differ from the original, we must examine the default grep behavior when using regular expressions.

The Basic regular expression mode

A regular expression is a pattern constructed following specific rules in order to match a string or multiple strings. By default grep uses what it calls BRE or basic regular expressions: in this mode only some meta-characters (characters with a special meaning inside a regular expression) are available.

As a first example we will try to use grep to match a very simple string, the word “mortal”. The grep syntax is very simple: we invoke the program providing the pattern to be matched as the first argument, and the target file as the second:

$ grep mortal lotr.txt


The command above returns no matches, although the word “mortal” does appear in the text: this is because grep is case-sensitive by default, so the capitalized word “Mortal” doesn’t match the pattern we provided. To overcome this problem and perform a more “generic” search, we can use the -i option (short for --ignore-case), which makes grep ignore case distinctions:

$ grep -i mortal lotr.txt

This time the command produces the following output (when colored output is enabled, the actual match is highlighted):

Nine for Mortal Men doomed to die,

One important thing to notice is that, by default, grep returns the entire line in which the match is found. This behavior, however, can be modified using the -o option, or its long version --only-matching. When using this option, only the match itself is printed:

$ grep -o -i mortal lotr.txt
Mortal

Another interesting switch we can use is -n, short for --line-number. When this option is used, each line of output is prefixed with its line number in the file. This command:

$ grep -n -i mortal lotr.txt

Produces the following output:

3:Nine for Mortal Men doomed to die,

Where 3 is the number of the line in which the match is found.

What if we just want to know how many lines contain a match, instead of seeing the matches themselves? Grep has a dedicated option to obtain this result: -c, or --count. Running the command above with this option returns the following output:

1

Which is, as expected, the number of matching lines found in the text.
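
One detail worth checking is that -c counts matching lines rather than individual occurrences. The sketch below feeds grep a single line from the poem that contains “Ring” twice; the variable names are illustrative only, and combining -o with wc -l is just one possible way to count occurrences.

```shell
# A single sample line (taken from the poem) that contains "Ring" twice.
line='One Ring to rule them all, One Ring to find them,'

# -c counts matching LINES: only one line matches, whatever the number of hits.
matching_lines=$(printf '%s\n' "$line" | grep -c 'Ring')

# -o prints every match on its own line, so counting those lines gives occurrences.
occurrences=$(printf '%s\n' "$line" | grep -o 'Ring' | wc -l | tr -d ' ')

echo "lines: $matching_lines, occurrences: $occurrences"
```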

Basic meta-characters

It’s time to perform a slightly more elaborate search. We now want to find all the lines starting with the letter “o”. Even when working with basic regular expressions we can use the ^ character to match the empty string at the beginning of a line:



$ grep -i '^o' lotr.txt

As expected, the result of the command is:

One for the Dark Lord on his dark throne
One Ring to rule them all, One Ring to find them,
One Ring to bring them all, and in the darkness bind them,

That was pretty easy. Now let’s suppose we want to further restrict our search, and find all the lines starting with an “o” and ending with a “,” character. We can use this example to introduce some other meta-characters we can use in basic regex mode:

$ grep -i '^o.*,$' lotr.txt

The command above returns exactly what we were searching for:


One Ring to rule them all, One Ring to find them,
One Ring to bring them all, and in the darkness bind them,

Let’s explain what we did above. First of all, we used the -i option to make our search case-insensitive, just like in the previous examples; then we used the ^ meta-character, followed by an “o”, to search for lines starting with this letter.

We then used two new meta-characters: . and *. What is their role in the regular expression? The . matches any single character, while the * is a repetition operator, which matches the preceding element zero or more times. Together, .* matches any sequence of characters, including an empty one. Finally we specified a comma, to be matched literally as the last character before the end of the line, which is itself matched by the $ meta-character.
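
Several of these meta-characters (*, $, and the square brackets introduced later) also mean something to the shell, so a cautious habit is to wrap the pattern in single quotes. The following is a minimal sketch using three sample lines instead of the whole file; the pattern and result variables are illustrative names.

```shell
# Single quotes pass the pattern to grep unchanged, so the shell cannot
# treat * or $ as its own special characters.
pattern='^o.*,$'

# Three sample lines: only the first both starts with "o" (in any case)
# and ends with a comma.
result=$(printf '%s\n' \
  'One Ring to find them,' \
  'One for the Dark Lord on his dark throne' \
  'Nine for Mortal Men doomed to die,' | grep -i "$pattern")

echo "$result"
```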

Matching a set of characters with square brackets

In the example above we used the dot, ., to specify a pattern that matches any single character. What if we wanted to match only a subset of characters? Say, for example, we wanted to find all lines starting with an “o” or an “i”: to obtain such a result, we can enclose the set of possible characters to be matched in square brackets:

$ grep -i '^[oi]' lotr.txt

The command will perform a case-insensitive search for an “o” or an “i” located at the beginning of a line. Note that the characters in the set are not separated by commas: a comma inside the brackets would simply become one more character to match. Here is the result:

One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all, and in the darkness bind them,
In the Land of Mordor where the Shadows lie.


For the pattern to match, the character at that position must be one of those listed within the brackets. When specifying characters inside square brackets we can also specify a range by using the - character. So, for example, to match digits we can write [0-9]. Back to our text, we can use this syntax to match lines starting with letters from “i” to “s” (case-insensitive):

$ grep -i '^[i-s]' lotr.txt

The output of the command:

Seven for the Dwarf-lords in their halls of stone,
Nine for Mortal Men doomed to die,
One for the Dark Lord on his dark throne
In the Land of Mordor where the Shadows lie.
One Ring to rule them all, One Ring to find them,
One Ring to bring them all, and in the darkness bind them,
In the Land of Mordor where the Shadows lie.

The above is almost the entire text of the poem: only the first line, which starts with the letter “T” (not included in the range we specified), has been excluded from the match.
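
Which characters a range such as i-s covers can depend on the current locale. As a hedged sketch, setting LC_ALL=C for the grep invocation gives the traditional byte-order interpretation, where the range contains only the lowercase letters from i to s. The sample words and variable names below are illustrative.

```shell
# Sample words: capitalized, lowercase, and one starting outside the i-s range.
words=$(printf '%s\n' 'Seven' 'seven' 'Three')

# In the C locale the range i-s covers only lowercase i..s, so only "seven" matches.
case_sensitive=$(printf '%s\n' "$words" | LC_ALL=C grep -c '^[i-s]')

# Adding -i lets the same range match "Seven" as well; "Three" stays excluded.
case_insensitive=$(printf '%s\n' "$words" | LC_ALL=C grep -i -c '^[i-s]')

echo "without -i: $case_sensitive, with -i: $case_insensitive"
```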

Within square brackets, we can also match specific classes of characters, using predefined character classes. Some examples are:

  • [:alnum:] – alphanumeric characters
  • [:digit:] – digits from 0 to 9
  • [:lower:] – lower case letters
  • [:upper:] – upper case letters
  • [:blank:] – spaces and tabs

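The list above is not complete, but you can easily find more character classes by consulting the grep manual. Keep in mind that these names are only valid inside a bracket expression, so in a pattern they appear with double brackets: [[:digit:]], for example.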
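
Because a class name belongs inside a bracket expression, it ends up wrapped in two pairs of brackets when used in a pattern. A small sketch with made-up sample lines (the sample text and variable names are assumptions for illustration):

```shell
# Three sample lines: one starts uppercase, one with a digit, one lowercase.
sample=$(printf '%s\n' 'Three Rings' '3 rings' 'nine')

# The outer brackets open the set; [:digit:] names the class inside it.
starts_with_digit=$(printf '%s\n' "$sample" | grep '^[[:digit:]]')

# [[:upper:]] matches a single uppercase letter.
starts_with_upper=$(printf '%s\n' "$sample" | grep '^[[:upper:]]')

echo "$starts_with_digit"
echo "$starts_with_upper"
```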

Inverting the result of a match

In the above examples we searched for every line starting with an “o” or an “i”, using a case-insensitive search. What if we wanted to obtain the opposite output, that is, only the lines that do not match?

Grep allows us to obtain this result using the -v option (short for --invert-match). The option, as suggested, instructs grep to return the inverted match. If we run the last command we used above providing this option, we should obtain only the first line of the poem as output. Let’s verify it:

$ grep -i -v '^[i-s]' lotr.txt

The result is, just as we expected, only the first line of the poem:

Three Rings for the Elven-kings under the sky,

In our example, we can obtain the same result by prefixing the list of characters between square brackets with the ^ character, which in this context assumes a different meaning, causing the pattern to match only characters not contained in the list. If we run:

$ grep -i '^[^i-s]' lotr.txt

We receive the same output as before:

Three Rings for the Elven-kings under the sky,
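
The two approaches agree on our poem, but they are not equivalent in general: -v selects every line that fails to match, including empty lines, while a negated set still needs one character to match against. The sketch below illustrates this with a three-line sample whose second line is empty; the sample and the variable names are illustrative.

```shell
# Three sample lines; the second one is empty.
sample=$(printf '%s\n' 'Three Rings' '' 'One Ring')

# -v keeps every line that does NOT start with o or i: "Three Rings" and the empty line.
inverted=$(printf '%s\n' "$sample" | grep -i -v -c '^[oi]')

# [^oi] must match an actual character, so the empty line is not selected.
negated=$(printf '%s\n' "$sample" | grep -i -c '^[^oi]')

echo "-v: $inverted lines, negated set: $negated line"
```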

Extended expression mode

By using egrep or grep with the -E option (the latter is the recommended way), we can access other meta-characters to be used in regular expressions. Let’s see them.



Advanced repetition operators

We already met the * repetition operator which is available also in basic regular expression mode. When using extended expressions, we have access to other operators of that kind:

  • ? – matches the item preceding it one or zero times
  • + – matches the preceding element one or more times

We can also specify more granular repetitions by using curly braces syntax. For example, the following pattern matches each occurrence of a double “l”:

$ grep -E 'l{2}' lotr.txt

The output of the command above is:

Seven for the Dwarf-lords in their halls of stone,
One Ring to rule them all, One Ring to find them,
One Ring to bring them all, and in the darkness bind them,

With the same syntax we can specify a minimum number of occurrences, by using {x,}, or an entire possible range, using {x,y}, where x and y represent, respectively, the minimum and the maximum number of repetitions of the preceding item.
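
To see the repetition operators side by side, the sketch below runs them against three made-up words (“al”, “all”, “alll”), anchoring each pattern with ^ and $ so the whole word has to match. The word list and variable names are assumptions chosen for illustration.

```shell
# Three sample words with one, two and three trailing "l" characters.
words=$(printf '%s\n' 'al' 'all' 'alll')

# ?     : zero or one "l" after the "a"
optional=$(printf '%s\n' "$words" | grep -E '^al?$')
# +     : one or more "l"
one_or_more=$(printf '%s\n' "$words" | grep -E '^al+$')
# {2}   : exactly two "l"
exactly_two=$(printf '%s\n' "$words" | grep -E '^al{2}$')
# {2,3} : between two and three "l"
two_to_three=$(printf '%s\n' "$words" | grep -E '^al{2,3}$')

printf '%s\n' "$optional" "$one_or_more" "$exactly_two" "$two_to_three"
```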

Alternation

When working with extended regular expressions, we also have access to the | meta-character, also called the infix operator. By using it we can join two regular expressions, producing an expression that matches any string matching either of the alternatives.

It’s important to notice that grep always tries to match both sides of the infix operator: it does not work like a short-circuit logical “or”, where the right side is evaluated only if the left side is false. This can be verified by observing the output of the following command:

$ grep -n -E '^O|l{2}' lotr.txt
2:Seven for the Dwarf-lords in their halls of stone,
4:One for the Dark Lord on his dark throne
6:One Ring to rule them all, One Ring to find them,
7:One Ring to bring them all, and in the darkness bind them,

Observe the output: each line starting with a capital “O”, or containing a double “l”, has been included. On lines 6 and 7, however, the expressions on both the left and the right side of the infix operator produced a match. As stated above, this means that both sides of the operator are evaluated, and if both produce a match, both matches are included.
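
Without colored output it is hard to see that both alternatives matched on the same line. One way to make it visible, sketched here on line 6 of the poem, is to add the -o option so that each match is printed on its own line; the matches variable is an illustrative name.

```shell
# Line 6 of the poem starts with "O" and also contains a double "l" (in "all").
line='One Ring to rule them all, One Ring to find them,'

# -o prints each match separately: first the "O" matched by ^O,
# then the "ll" matched by l{2}.
matches=$(printf '%s\n' "$line" | grep -o -E '^O|l{2}')

echo "$matches"
```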

Fgrep

If, by default, grep supports basic regular expression operators, and with the -E option or egrep we can use extended regular expressions, with the -F switch (short for --fixed-strings) or fgrep we can instruct the program to always interpret a pattern as a list of fixed strings.

This means that grep always tries to match the strings literally, and all the meta-characters lose their special meaning. This can be useful when searching for a string that contains many characters which would otherwise be interpreted as operators, since we don’t have to escape them manually.
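
The difference is easy to observe with a pattern containing a dot. In the sketch below, which uses two made-up sample lines and illustrative variable names, the default mode treats the dot as “any character” while -F treats it as a literal dot.

```shell
# Two sample lines: one contains a literal dot, the other does not.
sample=$(printf '%s\n' 'a.c' 'abc')

# Default (basic regex) mode: "." matches any character, so both lines match.
regex_count=$(printf '%s\n' "$sample" | grep -c 'a.c')

# Fixed-string mode: "." is just a dot, so only the first line matches.
fixed_count=$(printf '%s\n' "$sample" | grep -F -c 'a.c')

echo "regex: $regex_count, fixed: $fixed_count"
```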

Closing thoughts

In this tutorial we got to know the grep Unix command. We saw how we can use it to find matches in a text by using regular expressions, and we also examined the behavior of its variants: egrep and fgrep. We examined some very useful options like -i, which can be used to make case-insensitive searches.

Finally we took a tour of some of the most commonly used regular expression operators. Grep is definitely one of the most important system tools and has very exhaustive documentation: consulting it is always a good idea!


