grep is a versatile Linux utility, which can take a few years to master well. Even seasoned Linux engineers may make the mistake of assuming a given input text file will have a certain format.
grep can also be used directly in combination with if-based searches to scan for the presence of a string within a given text file. Discover how to correctly grep for text independent of character sets, how to use the -q option to test for string presence, and more!
In this tutorial you will learn:
- How to do correct character set-independent text searches with grep
- How to use advanced grep statements from within scripts or terminal oneliner commands
- How to test for string presence using the -q option to grep
- Examples highlighting grep usage for these use cases
Software requirements and conventions used
|Category|Requirements, Conventions or Software Version Used|
|Software|Bash command line, Linux based system|
|Other|Any utility which is not included in the Bash shell by default can be installed using |
|Conventions|# - requires linux-commands to be executed with root privileges either directly as a root user or by use of |
||$ - requires linux-commands to be executed as a regular non-privileged user|
Example 1: Correct Character Set-Independent Text Searches With Grep
What happens when you grep through a file which is text/character-based, but contains special characters outside of the normal range? This can potentially happen when the file contains complex character sets or seems to contain binary-like content. To understand this better, we first need to understand what binary data is.
Most (but not all) computers use at their most basic level only two states: 0 and 1. Perhaps oversimplified, you can think about this like a switch: 0 is no volt, no power, and 1 is “some level of voltage” or powered-on. Modern computers are able to process millions of these 0’s and 1’s in a fraction of a second. This 0/1 state is called a ‘bit’, and bits form a base-2 numerical system (just like our 0-9 decimal system is a base-10 numerical system). There are other ways of representing bit/binary based data, like octal (base-8: 0-7) and hexadecimal (base-16: 0-F).
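As a small illustration of these number bases, the sketch below writes one and the same value as a hexadecimal constant and as an octal constant, then prints it back in decimal, octal and hexadecimal. The variable names are made up for this example; the value 255 corresponds to the bit pattern 11111111.

```shell
#!/usr/bin/env bash
# The same value written as a hexadecimal and as an octal constant.
from_hex=$((0xff))     # hexadecimal ff
from_octal=$((0377))   # octal 377 (a leading 0 marks an octal constant)

# Print the value in decimal, octal (base-8) and hexadecimal (base-16).
bases_line="$(printf '%d %o %x' "$from_hex" "$from_octal" "$from_hex")"
echo "$bases_line"     # prints: 255 377 ff
```

All three notations describe the same eight bits; octal and hexadecimal are simply more compact ways of writing them down.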
Coming back to ‘binary’ (bin, dual), you can start seeing how it is commonly used to describe any type of data which cannot easily be recognized by humans, but can be understood by binary-based computers. It’s perhaps not the best analogy, as binary usually refers to two states (true/false), whereas in common IT jargon ‘binary data’ has come to mean any data which is not easily interpretable.
For example, a source code file compiled with a compiler contains binary data mostly unreadable by the human eye. Another example could be an encrypted file or a configuration file written in a proprietary format.
What does it look like when you try and view binary data?
Usually, when viewing binary data for executables, you will see some real binary data (all the odd-looking characters: your computer is displaying binary data within the limited output format capabilities which your terminal supports), as well as some text-based output. In the case of ls, this text-based output seems to consist of function names.
To view binary data correctly, you really do need a binary file viewer. Such viewers simply format data in its native format, alongside a text-based side column. This avoids the limitations of textual output and allows you to see the computer code for what it really is: 0’s and 1’s, though usually displayed in hexadecimal (0-F or 0-f as shown below).
Let’s have a look at two sets of 4 lines of the binary code of ls to see what this looks like:
$ hexdump -C /bin/ls | head -n4; echo '...'; hexdump -C /bin/ls | tail -n131 | head -n4
00000000  7f 45 4c 46 02 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  03 00 3e 00 01 00 00 00  d0 67 00 00 00 00 00 00  |..>......g......|
00000020  40 00 00 00 00 00 00 00  c0 23 02 00 00 00 00 00  |@........#......|
00000030  00 00 00 00 40 00 38 00  0d 00 40 00 1e 00 1d 00  |....@.8...@.....|
...
00022300  75 2e 76 65 72 73 69 6f  6e 00 2e 67 6e 75 2e 76  |u.version..gnu.v|
00022310  65 72 73 69 6f 6e 5f 72  00 2e 72 65 6c 61 2e 64  |ersion_r..rela.d|
00022320  79 6e 00 2e 72 65 6c 61  2e 70 6c 74 00 2e 69 6e  |yn..rela.plt..in|
00022330  69 74 00 2e 70 6c 74 2e  67 6f 74 00 2e 70 6c 74  |it..plt.got..plt|
How does all of this (besides learning more about how computers work) help you to understand correct grep usage? Let’s come back to our original question: what happens when you grep through a file which is text/character-based, but contains special characters outside of the normal range?
We can now rightly reword this to ‘what happens when you grep through a binary file?’ Your first reaction may be: why would I want to search through a binary file? In part, the answer shows in the above ls example already: binary files often still contain text-based strings.
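To make the point about text strings inside binary files concrete, here is a minimal sketch. It assumes a grep that supports --binary-files=text and -o (GNU grep does). The sample file, its contents and the variable names are invented for this illustration.

```shell
#!/usr/bin/env bash
# Build a small sample "binary" file: readable strings surrounded by non-text bytes.
sample_bin="$(mktemp)"
printf '\001\002hello_world\000\003version_string\377\n' > "$sample_bin"

# Treat the file as text and print only runs of 6 or more printable characters.
# LC_ALL=C makes [[:print:]] match plain ASCII regardless of the system locale.
found_strings="$(LC_ALL=C grep --binary-files=text -o '[[:print:]]\{6,\}' "$sample_bin")"
echo "$found_strings"

rm -f "$sample_bin"
```

The same pattern can be pointed at a real executable such as /bin/ls, though the strings found there will differ per system.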
And there is a much more important and primary reason: by default, grep will assume a file contains binary data as soon as it has special characters in it (such as NUL bytes, or byte sequences that are not valid in the current locale’s character encoding), even though the file in itself may be text based. What’s worse is that by default grep will stop printing matching lines from such a file as soon as such data is found, and print only a ‘Binary file … matches’ notice instead:
$ head -n2 test_data.sql
CREATE TABLE t1 (id int);
INSERT INTO t1 VALUES (1);
$ grep 'INSERT' test_data.sql | tail -n2
INSERT INTO t1 VALUES(1000);
Binary file test_data.sql matches
Two prominent examples come from personal experience with database work. The first is scanning database server error logs, which can easily contain such special characters: at times error messages, as well as database, table and field names, make it into the error log, and such messages are regularly in region-specific character sets.
Another example is test SQL obtained from database testing suites (shown in the example above). Such data often contains special characters for testing and stressing the server in a multitude of ways. The same would apply to most website testing data and other domain testing data sets. As grep fails by default against such data, it is important to ensure we add an option to grep to cover this.
The option is --binary-files=text. We can see how our grep now works correctly:
$ grep 'INSERT' test_data.sql | wc -l
7671
$ grep 'INSERT' test_data.sql | tail -n1
Binary file test_data.sql matches
$ grep --binary-files=text 'INSERT' test_data.sql | wc -l
690427
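Since test_data.sql is not available to every reader, the effect can be reproduced on a tiny, made-up file. The sketch below assumes GNU grep and a system with mktemp; the file contents and variable names are hypothetical. It counts how many genuine INSERT lines reach stdout with and without --binary-files=text.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: two matching lines with a NUL byte in between.
demo_file="$(mktemp)"
printf 'INSERT INTO t1 VALUES (1);\n'  > "$demo_file"
printf 'binary\000noise\n'            >> "$demo_file"
printf 'INSERT INTO t1 VALUES (2);\n' >> "$demo_file"

# Count genuine INSERT lines on stdout in default mode
# ("|| true" keeps a zero count from being treated as a script failure).
default_count="$(grep 'INSERT' "$demo_file" 2>/dev/null | grep -c '^INSERT' || true)"

# Count them again with the file forced to be treated as text.
text_count="$(grep --binary-files=text 'INSERT' "$demo_file" | grep -c '^INSERT' || true)"

echo "default: $default_count line(s), --binary-files=text: $text_count line(s)"
rm -f "$demo_file"
```

With --binary-files=text both INSERT lines are reported. In default mode the count is expected to be lower on GNU grep, though the exact number can vary between grep versions and implementations.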
What a difference! You can imagine how many automated grep scripts throughout the world are failing to scan all the data they should be scanning. What is worse, and significantly compounds the issue, is that grep fails 100% silently when this happens: the exit code will be 0 (success) in both cases:
$ grep -q 'INSERT' test_data.sql; echo $?
0
$ grep --binary-files=text -q 'INSERT' test_data.sql; echo $?
0
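Compounding it even more, the message is displayed on stdout, and not on stderr as one might expect (at least with the grep version used here; according to the GNU grep release notes, versions 3.5 and later reword the message and send it to stderr instead). We can verify this by redirecting stderr to the null device /dev/null, only displaying stdout output. The output remains: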
$ grep 'INSERT' test_data.sql 2>/dev/null | tail -n1
Binary file test_data.sql matches
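Because the stream used for this notice depends on the grep version, it can be worth checking on your own system. The following is a rough sketch, with made-up file contents and variable names, that captures stdout and stderr separately and reports where the notice ended up.

```shell
#!/usr/bin/env bash
# Hypothetical sample: one matching line followed by a line containing a NUL byte.
sample="$(mktemp)"; out_file="$(mktemp)"; err_file="$(mktemp)"
printf 'INSERT INTO t1 VALUES (1);\nbinary\000noise\n' > "$sample"

# Capture stdout and stderr of a default-mode grep in separate files.
grep 'INSERT' "$sample" > "$out_file" 2> "$err_file"

# Look for the notice in each capture; -i covers differently worded messages.
if grep --binary-files=text -qi 'binary file' "$out_file"; then
  message_stream="stdout"
elif grep --binary-files=text -qi 'binary file' "$err_file"; then
  message_stream="stderr"
else
  message_stream="none"
fi

echo "Binary-file notice was written to: $message_stream"
rm -f "$sample" "$out_file" "$err_file"
```

A result of "none" would suggest that the grep in use does not print such a notice for this input at all.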
This also means that if you were to redirect your grep results to another file (> somefile.txt after the grep command), the ‘Binary file … matches’ line would now be part of that file, besides the file missing all entries seen after the issue occurred.
Another issue is the security aspect: take an organization that has scripted access log greps to email reports to sysadmins whenever a rogue agent (like a hacker) tries to access unauthorized resources. If such a hacker is able to insert some binary data into the access log before their access attempt, and the grep is unprotected by --binary-files=text, no such emails will ever be sent.
Even if the script is developed well enough to check the grep exit code, still no one will ever notice a script error, as grep returns 0, or in other words: success. Success it ain’t though :)
There are two easy solutions: add --binary-files=text to all your grep statements, and consider scanning grep output (or the contents of a redirected output file) for the regular expression ‘^Binary file.*matches’. For more information on regular expressions, see Bash Regexps for Beginners with Examples and Advanced Bash Regex with Examples. Doing both, or only the first, is preferred: the second option on its own is not future-proof, as the ‘Binary file…matches’ text may change.
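Both solutions can be sketched as two small shell functions. The names safe_grep and has_binary_notice are hypothetical helpers invented for this example, not existing commands. The notice pattern is the one given above, so it only detects that exact wording.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: always treat input as text so matches are never suppressed.
safe_grep() {
  grep --binary-files=text "$@"
}

# Hypothetical check: succeeds (exit 0) if a saved results file contains the notice.
# The wording of the notice depends on the grep version, so this is a best effort.
has_binary_notice() {
  grep --binary-files=text -q '^Binary file.*matches' "$1"
}

# Usage sketch with a throwaway sample file containing a NUL byte.
sample="$(mktemp)"; results="$(mktemp)"
printf 'INSERT 1\nbinary\000noise\nINSERT 2\n' > "$sample"
safe_grep 'INSERT' "$sample" > "$results"

if has_binary_notice "$results"; then
  echo "Warning: results may be incomplete"
else
  echo "Results look complete"
fi
rm -f "$sample" "$results"
```

The wrapper is the more robust half of the pair. The check is only a safety net for output produced by grep calls that were not protected.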
Finally, note that when a text file becomes corrupted (disk failure, network failure etc.), its contents may end up being part-text and part-binary. This is yet another reason to always protect your grep statements with the --binary-files=text option, even if they currently work fine. You never know when binary data may hit your file.
Example 2: Test for the Presence of a Given String Within a Text File
We can use grep -q in combination with an if statement in order to test for the presence of a given string within a text file:
$ if grep --binary-files=text -qi "insert" test_data.sql; then echo "Found!"; else echo "Not Found!"; fi
Found!
Let’s break this down a little by first checking if the data truly exists:
$ grep --binary-files=text -i "insert" test_data.sql | head -n1
INSERT INTO t1 VALUES (1);
Here we dropped the -q (quiet) option to obtain output and see that the string ‘insert’, taken in a case-insensitive manner (by specifying the -i option to grep), exists in the file as ‘INSERT…’.
Note that the -q option is not specifically a testing option. It is rather an output modifier which tells grep to be ‘quiet’, i.e. not to output anything. So how does the if statement know whether a given string is present within a text file? This is done through the grep exit code:
$ grep --binary-files=text -i "INSERT" test_data.sql 2>&1 >/dev/null; echo $?
0
$ grep --binary-files=text -i "THIS REALLY DOES NOT EXIST" test_data.sql 2>&1 >/dev/null; echo $?
1
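Here we manually discarded all stdout output by redirecting it to the null device (>/dev/null). Because 2>&1 is written before >/dev/null, stderr is pointed at wherever stdout pointed at that moment (the terminal), so any error messages would still be displayed. In terms of what is shown on screen, this is basically equivalent to the -q (quiet) option to grep.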
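The order of redirections matters here, which a short experiment can show. The emit function below is a made-up helper that writes one line to each stream. Command substitution captures whatever ends up on stdout.

```shell
#!/usr/bin/env bash
# Hypothetical helper that writes one line to stdout and one to stderr.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

# stderr is first pointed at the current stdout, then stdout is discarded:
# only the stderr line survives.
order_a="$(emit 2>&1 >/dev/null)"

# stdout is discarded first, then stderr follows it to /dev/null:
# nothing survives.
order_b="$(emit >/dev/null 2>&1)"

echo "2>&1 >/dev/null captured: '$order_a'"
echo ">/dev/null 2>&1 captured: '$order_b'"
```

To silence both streams of a grep call, the second form (>/dev/null 2>&1) is the one to use.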
We next verified the exit code and established that when the string is found, 0 (success) is returned, whereas 1 (failure) is returned when the string is not found. if can use these two exit codes to execute either the then or the else clause specified to it.
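One caveat: a plain if/else cannot tell "string not found" apart from "grep itself failed" (for example because the file does not exist), since grep reports errors with an exit code greater than 1. A minimal sketch that separates the three outcomes could look like this. check_string is a hypothetical helper name, and the sample file is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a search term occurs in a file.
# $1 = search term, $2 = file to search
check_string() {
  grep --binary-files=text -q -- "$1" "$2" 2>/dev/null
  case $? in
    0) echo "found" ;;       # at least one matching line
    1) echo "not found" ;;   # the file was searched, no match
    *) echo "error" ;;       # grep could not complete the search
  esac
}

# Usage sketch with a throwaway sample file.
sample="$(mktemp)"
printf 'INSERT INTO t1 VALUES (1);\n' > "$sample"
check_string 'INSERT' "$sample"
check_string 'THIS REALLY DOES NOT EXIST' "$sample"
check_string 'INSERT' "$sample.missing"
rm -f "$sample"
```

For simple presence tests the two-way if grep -q form is usually enough. The three-way form is mainly useful in unattended scripts where a missing or unreadable file should not be mistaken for "no match".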
In summary, we can use if grep -q to test for the presence of a certain string within a text file. The fully correct syntax, as seen earlier in this article, is if grep --binary-files=text -qi "search_term" your_file.sql for case-insensitive searches, and if grep --binary-files=text -q "search_term" your_file.sql for case-sensitive searches.
In this article, we saw the many reasons why it is important to use --binary-files=text on nearly all grep searches. We also explored using grep -q in combination with if statements to test for the presence of a given string within a text file. Enjoy using
grep, and leave us a comment with your greatest