How to Correctly Grep for Text in Bash Scripts

grep is a versatile Linux utility that can take a few years to master. Even seasoned Linux engineers may make the mistake of assuming that a given input text file will have a certain format. grep can also be combined directly with if statements to test for the presence of a string within a given text file. Discover how to correctly grep for text independent of character sets, how to use the -q option to test for string presence, and more!

In this tutorial you will learn:

  • How to do correct character set-independent text searches with grep
  • How to use advanced grep statements from within scripts or terminal one-liner commands
  • How to test for string presence using the -q option to grep
  • Examples highlighting grep usage for these use cases

Software requirements and conventions used

Software Requirements and Linux Command Line Conventions

Category    | Requirements, Conventions or Software Version Used
System      | Linux Distribution-independent
Software    | Bash command line, Linux based system
Other       | Any utility which is not included in the Bash shell by default can be installed using sudo apt-get install utility-name (or yum install for RedHat based systems)
Conventions | # – requires linux-commands to be executed with root privileges either directly as a root user or by use of sudo command
            | $ – requires linux-commands to be executed as a regular non-privileged user

Example 1: Correct Character Set-Independent Text Searches With Grep

What happens when you grep through a file which is text/character-based, but contains special characters outside of the normal range? This can happen when the file contains complex character sets or seems to contain binary-like content. To understand this better, we first need to understand what binary data is.

Most (but not all) computers use, at their most basic level, only two states: 0 and 1. Somewhat oversimplified, you can think of this as a switch: 0 is no voltage (no power), and 1 is “some level of voltage” (powered on). Modern computers are able to process millions of these 0s and 1s in a fraction of a second. This 0/1 state is called a ‘bit’, and counting with bits is a base-2 numerical system (just like our 0-9 decimal system is a base-10 numerical system). There are other ways of representing bit/binary based data, like octal (base-8: 0-7) and hexadecimal (base-16: 0-F).
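
To make the number bases concrete, here is a minimal sketch that prints one value in binary, octal and hexadecimal. The helper name to_binary and the sample value 65 are assumptions chosen for illustration only.

```shell
#!/usr/bin/env bash
# to_binary is a hypothetical helper: it converts a non-negative
# decimal integer to its base-2 representation by repeated division.
to_binary() {
  local n=$1 bits=""
  if (( n == 0 )); then echo 0; return; fi
  while (( n > 0 )); do
    bits="$(( n % 2 ))${bits}"   # prepend the lowest bit
    n=$(( n / 2 ))
  done
  echo "${bits}"
}

value=65   # the ASCII code of the letter 'A'
echo "binary:      $(to_binary "${value}")"
printf 'octal:       %o\n' "${value}"
printf 'hexadecimal: %x\n' "${value}"

# Bash can read other bases with its base#number arithmetic syntax.
echo "back to decimal: $(( 2#1000001 )) $(( 8#101 )) $(( 16#41 ))"
```

All three representations describe the same value; only the notation differs.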

Coming back to ‘binary’ (bin, dual), you can start to see how the term is commonly used to describe any type of data which cannot easily be recognized by humans, but can be understood by binary-based computers. It’s perhaps not the best analogy, as binary usually refers to two states (true/false), whereas in common IT jargon ‘binary data’ has come to mean any data which is not easily interpretable.

For example, a source code file compiled with a compiler contains binary data mostly unreadable by the human eye. Another example could be an encrypted file or a configuration file written in a proprietary format.

What does it look like when you try and view binary data?

Binary Data

Usually, when viewing binary data for executables, you will see some real binary data (all the odd-looking characters: your computer is displaying binary data within the limited output capabilities which your terminal supports), as well as some text-based output. In the case of ls as seen here, these seem to be function names within the ls code.

To view binary data correctly, you really do need a binary file viewer. Such viewers simply format data in its native format, alongside a text-based side column. This avoids the limitations of textual output and allows you to see the computer code for what it really is: 0’s and 1’s, though often formatted in hexadecimal (0-F or 0-f as shown below).

Let’s have a look at two sets of 4 lines of the binary code of ls to see what this looks like:

$ hexdump -C /bin/ls | head -n4; echo '...'; hexdump -C /bin/ls | tail -n131 | head -n4
00000000  7f 45 4c 46 02 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  03 00 3e 00 01 00 00 00  d0 67 00 00 00 00 00 00  |..>......g......|
00000020  40 00 00 00 00 00 00 00  c0 23 02 00 00 00 00 00  |@........#......|
00000030  00 00 00 00 40 00 38 00  0d 00 40 00 1e 00 1d 00  |....@.8...@.....|
...
00022300  75 2e 76 65 72 73 69 6f  6e 00 2e 67 6e 75 2e 76  |u.version..gnu.v|
00022310  65 72 73 69 6f 6e 5f 72  00 2e 72 65 6c 61 2e 64  |ersion_r..rela.d|
00022320  79 6e 00 2e 72 65 6c 61  2e 70 6c 74 00 2e 69 6e  |yn..rela.plt..in|
00022330  69 74 00 2e 70 6c 74 2e  67 6f 74 00 2e 70 6c 74  |it..plt.got..plt|


How does all of this (besides learning more about how computers work) help you to understand correct grep usage? Let’s come back to our original question: what happens when you grep through a file which is text/character-based, but contains special characters outside of the normal range?

We can now rightly reword this to ‘what happens when you grep through a binary file?’ Your first reaction may be: why would I want to search through a binary file? In part, the answer already shows in the above ls example; binary files often still contain text-based strings.
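
As a rough illustration of that point, the sketch below pulls readable strings out of a binary using grep itself. The helper name printable_runs and the minimum length of 8 are assumptions made for this example; where it is installed, the dedicated strings utility does this job more thoroughly.

```shell
#!/usr/bin/env bash
# printable_runs is a hypothetical helper: it lists runs of at least
# MIN printable ASCII characters found in a file, binary or not.
printable_runs() {
  local file=$1 min=${2:-6}
  # LC_ALL=C makes grep work byte by byte, so [[:print:]] means plain
  # printable ASCII regardless of the system locale.
  LC_ALL=C grep --binary-files=text -o "[[:print:]]\{${min},\}" "${file}"
}

# Show the first few readable strings inside the ls binary.
printable_runs "$(command -v ls)" 8 | head -n5
```

The exact strings printed depend on the ls binary installed on your system.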

And there is a much more important and primary reason: by default, grep will assume that a file contains binary data as soon as it has special characters in it, and perhaps when it contains certain binary escape sequences, even though the file in itself may be text based. What’s worse is that by default grep will stop printing matching lines for such a file as soon as this data is found, and will print only a ‘Binary file … matches’ notice instead:

$ head -n2 test_data.sql 
CREATE TABLE t1 (id int);
INSERT INTO t1 VALUES (1);
$ grep 'INSERT' test_data.sql | tail -n2
INSERT INTO t1 VALUES(1000);
Binary file test_data.sql matches

Two prominent examples come from personal experience with database work. The first is scanning database server error logs, which can easily contain such special characters: at times error messages, as well as database, table and field names, make it into the error log, and such messages are regularly in region-specific character sets.

Another example is test SQL obtained from database testing suites (shown in the example above). Such data often contains special characters for testing and stressing the server in a multitude of ways. The same would apply to most website testing data and other domain testing data sets. As grep fails by default against such data, it is important to ensure we add an option to grep to cover this.

The option is --binary-files=text. We can see how our grep now works correctly:

$ grep 'INSERT' test_data.sql | wc -l
7671
$ grep 'INSERT' test_data.sql | tail -n1
Binary file test_data.sql matches
$ grep --binary-files=text 'INSERT' test_data.sql | wc -l
690427

What a difference! You can imagine how many automated grep scripts throughout the world are failing to scan all the data they should be scanning. What is worse, and significantly compounds the issue, is that grep fails 100% silently when this happens. The exit code will be 0 (success) in both cases:

$ grep -q 'INSERT' test_data.sql; echo $?
0
$ grep --binary-files=text -q 'INSERT' test_data.sql; echo $?
0
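
If you do not have a file like test_data.sql at hand, the behavior can be reproduced on a small scale. The following is a minimal sketch that assumes GNU grep; the sample file content is invented for illustration, and the number of lines a plain grep prints depends on the grep version in use.

```shell
#!/usr/bin/env bash
# Build a small sample file: three INSERT lines plus one line that
# contains a NUL byte, which makes grep treat the file as binary.
sample=$(mktemp)
{
  printf 'INSERT INTO t1 VALUES (1);\n'
  printf 'binary byte follows: \000\n'
  printf 'INSERT INTO t1 VALUES (2);\n'
  printf 'INSERT INTO t1 VALUES (3);\n'
} > "${sample}"

# Lines printed on stdout by a plain grep (version dependent).
plain=$(grep 'INSERT' "${sample}" 2>/dev/null | wc -l)
# Lines printed when binary data is treated as text.
as_text=$(grep --binary-files=text 'INSERT' "${sample}" | wc -l)
echo "plain grep:          $(( plain )) line(s) on stdout"
echo "--binary-files=text: $(( as_text )) line(s) on stdout"

# Both commands report success, even though the plain output is incomplete.
grep -q 'INSERT' "${sample}"; echo "plain exit code: $?"
grep --binary-files=text -q 'INSERT' "${sample}"; echo "text exit code:  $?"
rm -f "${sample}"
```

With --binary-files=text all three INSERT lines are printed, while the plain grep should print fewer lines than that on GNU grep.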


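Compounding it even more, the error message is displayed on stdout, and not on stderr as one might expect. We can verify this by redirecting stderr to the null device /dev/null, so that only stdout output is displayed. The message remains. (This is the behavior of the grep version used here; GNU grep 3.5 and later send a reworded notice to stderr instead.)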

$ grep 'INSERT' test_data.sql 2>/dev/null | tail -n1 
Binary file test_data.sql matches

This also means that if you were to redirect your grep results to another file (> somefile.txt after the grep command), the ‘Binary file … matches’ line would become part of that file, in addition to the file missing all entries seen after the issue occurred.

Another issue is the security aspect: take an organization that has scripted access log greps to email reports to sysadmins whenever a rogue agent (like a hacker) tries to access unauthorized resources. If such a hacker is able to insert some binary data into the access log before their access attempt, and the grep is unprotected by --binary-files=text, no such emails will ever be sent.

Even if the script is developed well enough to check for the grep exit code, still no-one will ever notice a script error, as grep returns 0, or in other words: success. Success it ain’t though 🙂

There are two easy solutions: add --binary-files=text to all your grep statements, and consider scanning grep output (or the contents of a redirected output file) for the regular expression ‘^Binary file.*matches’. For more information on regular expressions, see Bash Regexps for Beginners with Examples and Advanced Bash Regex with Examples. Doing both, or only the first, is preferred, as the second option is not future-proof; the ‘Binary file…matches’ text may change.
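
The two solutions could be sketched as a pair of small shell functions. The names grep_text and output_is_truncated are hypothetical, as are the file names in the usage comment, and the notice pattern is simply the one quoted above, so it may not match the wording used by other grep versions.

```shell
#!/usr/bin/env bash
# grep_text is a hypothetical wrapper: grep with binary data treated
# as text. All arguments are passed through to grep unchanged.
grep_text() {
  grep --binary-files=text "$@"
}

# output_is_truncated is a hypothetical check: it succeeds (exit 0)
# when saved grep output contains the binary-file notice.
output_is_truncated() {
  grep --binary-files=text -q '^Binary file.*matches' "$1"
}

# Usage sketch (access.log and report.txt are placeholder names):
#   grep_text 'denied' access.log > report.txt
#   if output_is_truncated report.txt; then
#     echo "WARNING: report.txt may be incomplete" >&2
#   fi
```

Keeping the option inside one wrapper means a script cannot forget it on an individual grep call.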

Finally, note that when a text file becomes corrupted (disk failure, network failure, etc.), its contents may end up being part-text and part-binary. This is yet another reason to always protect your grep statements with the --binary-files=text option.

TL;DR: Use --binary-files=text for all your grep statements, even if they currently work fine. You never know when that binary data may hit your file.

Example 2: Test for the Presence of a Given String Within a Text File

We can use grep -q in combination with an if statement in order to test for the presence of a given string within a text file:

$ if grep --binary-files=text -qi "insert" test_data.sql; then echo "Found!"; else echo "Not Found!"; fi
Found!

Let’s break this down a little by first checking if the data truly exists:

$ grep --binary-files=text -i "insert" test_data.sql | head -n1
INSERT INTO t1 VALUES (1);

Here we dropped the -q (quiet) option to obtain output, and we see that the string ‘insert’, matched in a case-insensitive manner (by specifying the -i option to grep), exists in the file as ‘INSERT…’.

Note that the -q option is not specifically a testing option. It is rather an output modifier which tells grep to be ‘quiet’, i.e. not to output anything. So how does the if statement know whether the given string is present within the text file? This is done through the grep exit code:

$ grep --binary-files=text -i "INSERT" test_data.sql >/dev/null 2>&1; echo $?
0
$ grep --binary-files=text -i "THIS REALLY DOES NOT EXIST" test_data.sql >/dev/null 2>&1; echo $?
1


Here we did a manual redirect of all stdout and stderr output to /dev/null by first redirecting stdout to the null device (>/dev/null) and then redirecting stderr (2>) to wherever stdout now points (&1). The order matters: written the other way around (2>&1 >/dev/null), stderr would still go to the terminal. For the purpose of this test, this is basically equivalent to the -q (quiet) option to grep.

We next verified the exit code and established that when the string is found, 0 (success) is returned, whereas 1 (failure) is returned when the string is not found. The if statement can use these two exit codes to execute either the then or the else clause specified to it.
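
Besides 0 and 1, grep returns an exit code greater than 1 when an error occurs, for example when the file cannot be read. A plain if/else treats that case as ‘not found’. The sketch below separates the three outcomes; check_string is a hypothetical helper name and the sample data is invented for illustration.

```shell
#!/usr/bin/env bash
# check_string is a hypothetical helper. It reports one of three
# outcomes based on grep's exit code:
#   0 = found, 1 = not found, anything else = error.
check_string() {
  local term=$1 file=$2 rc=0
  # -e keeps a search term that starts with a dash from being parsed
  # as an option; stderr is silenced because we report errors ourselves.
  grep --binary-files=text -q -e "${term}" "${file}" 2>/dev/null || rc=$?
  case "${rc}" in
    0) echo "found" ;;
    1) echo "not found" ;;
    *) echo "error (grep exit code ${rc})" ;;
  esac
  return "${rc}"
}

sample=$(mktemp)
printf 'INSERT INTO t1 VALUES (1);\n' > "${sample}"
check_string "INSERT" "${sample}" || true
check_string "THIS REALLY DOES NOT EXIST" "${sample}" || true
check_string "INSERT" "${sample}.missing" || true
rm -f "${sample}"
```

The || true after each call only keeps the demonstration running if the script is executed with set -e; in a real script you would act on the return value instead.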

In summary, we can use if grep -q to test for the presence of a certain string within a text file. The fully correct syntax, as seen earlier in this article, is if grep --binary-files=text -qi "search_term" your_file.sql for case-insensitive searches, and if grep --binary-files=text -q "search_term" your_file.sql for case-sensitive searches.
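
For scripts that run this test in several places, the syntax above can be wrapped in a function. This is a minimal sketch: file_contains is a hypothetical name, its optional -i flag is a design choice made for this example, and the sample file stands in for your own data.

```shell
#!/usr/bin/env bash
# file_contains is a hypothetical helper wrapping the syntax above.
# Usage: file_contains [-i] SEARCH_TERM FILE
file_contains() {
  local opts=(--binary-files=text -q)
  if [[ $1 == "-i" ]]; then
    opts+=(-i)   # case-insensitive search requested
    shift
  fi
  grep "${opts[@]}" -e "$1" "$2"
}

sample=$(mktemp)
printf 'CREATE TABLE t1 (id int);\nINSERT INTO t1 VALUES (1);\n' > "${sample}"

if file_contains -i "insert" "${sample}"; then
  echo "Found!"
else
  echo "Not Found!"
fi
rm -f "${sample}"
```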

Conclusion

In this article, we saw the many reasons why it is important to use --binary-files=text on nearly all grep searches. We also explored using grep -q in combination with if statements to test for the presence of a given string within a text file. Enjoy using grep, and leave us a comment with your greatest grep discoveries!


