Big Data Manipulation for Fun and Profit Part 2

In the first part of this big data manipulation series – which you may want to read first if you haven’t read it yet; Big Data Manipulation for Fun and Profit Part 1 – we discussed at some length the various terminologies and some of the ideas surrounding big data, or more specifically as they relate to handling, transforming, mangling, munging, parsing, wrangling and manipulating data. Often these terms are used interchangeably, and often their use overlaps. We also looked at the first set of Bash tools which may help us with work related to these terms.

This article will explore a further set of Bash tools which can help us when processing and manipulating text-based (or in some cases binary) big data. As mentioned in the previous article, data transformation in general is a semi-endless topic, as there are hundreds of tools for each particular text format. Remember that at times using Bash tools may not be the best solution, as an off-the-shelf tool may do a better job. That said, this series is specifically for all those (many) other times when no tool is available to get your data into the format of your choice.

And, if you want to learn why big data manipulation can be both profitable and fun… please read Part 1 first.

In this tutorial you will learn:

  • More big data wrangling / parsing / handling / manipulation / transformation techniques
  • What Bash tools are available to help you, specifically for text based applications
  • Examples showing different methods and approaches

Software requirements and conventions used

Software Requirements and Linux Command Line Conventions
Category       Requirements, Conventions or Software Version Used
System         Linux (distribution-independent)
Software       Bash command line, Linux-based system
Other          Any utility which is not included in the Bash shell by default can be installed using sudo apt-get install utility-name (or yum install for RedHat-based systems)
Conventions    # – requires given linux commands to be executed with root privileges, either directly as a root user or by use of the sudo command
               $ – requires given linux commands to be executed as a regular non-privileged user


Example 1: awk

Going back to the data we used in our first article in this series (a small downloaded part of the Wikipedia database), we can use awk to start manipulating the data:

$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
269019710:31197816:Linux Is My Friend
$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | awk '{print $2}'
Is

First we grepped for a specific item in the flat text database file. Once we had the output (269019710:31197816:Linux Is My Friend), we then tried to print the second column by passing the instruction {print $2} (print the second column) to awk, but this failed, rendering only Is. The reason for this is that the awk utility by default will use whitespace (space or tab) as its separator. We can confirm this by reading the manual (man awk), or simply by testing:

$ echo -e 'test1\ttest2'
test1   test2
$ echo -e 'test1\ttest2' | awk '{print $2}'
test2
$ echo -e 'test1 test2' | awk '{print $2}'
test2

In the first line we insert a tab (\t) into the output to be generated by echo, enabling backslash escape sequences by specifying -e to echo. If you would like to learn more about regular expressions in Bash and elsewhere, please see Bash Regexps for Beginners with Examples, Advanced Bash Regex with Examples and the semi-related Python Regular Expressions with Examples.

Subsequently we again use awk to print the second column ({print $2}) and see that the output this time is correct. Finally we test with a space (‘ ‘) between the fields and again see the output correctly as test2. We can also see in our former example that the text 269019710:31197816:Linux and Is are separated by a space – which matches how awk works. Knowing exactly how awk handles field separators is helpful here, as data is often formatted in various ways. You may see spaces, tabs, colons, semicolons and other symbols being used as field separators. And it gets even more complex when dealing with HTML, XML, JSON, MD etc. formats.
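
As a quick illustration (using a made-up, semicolon-separated example line), awk will treat the whole line as a single field when there is no whitespace to split on:

$ echo 'one;two;three' | awk '{print $1}'
one;two;three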

Let’s change the separator by using the -F option to awk:

$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | awk -F':' '{print $2}'
31197816

Exactly what we need. -F is described in the awk manual as the input field separator. You can see how easy it is to use awk to print various columns perceived in the data (you can simply swap the $2 for $3 to print the third column, etc.), so that we can process it further into the format we like. To round things up, let’s change the order of the fields and drop one field we don’t think we need:

$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | awk -F':' '{print $3"\t"$2}' > out
$ cat out
Linux Is My Friend  31197816


Great! We changed the order of columns 2 and 3, sent the output to a new file, and changed the separator to a tab (thanks to the "\t" insert in the print statement). If we now simply process the whole file:

$ awk -F':' '{print $3"\t"$2}' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 > out
$ 

The whole input data is structurally changed to the new format! Welcome to the fun world of big data manipulation. You can see how with a few simple Bash commands, we are able to substantially restructure/change the file as we deem fit. I have always found Bash to come the closest to the ideal toolset for big data manipulation, combined with some off-the-shelf tools and perhaps Python coding. One of the main reasons for this is the multitude of tools available in Bash which make big data manipulation easier.
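
As mentioned earlier, you can simply swap $2 for $3 (and so on) to select different columns; for example, to pull only the article title from the same record (a quick sketch reusing the grep from above):

$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | awk -F':' '{print $3}'
Linux Is My Friend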

Let’s next verify our work:

$ wc -l enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
329956 enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
$ wc -l out
329956 out
$ grep '31197816' out
Linux Is My Friend  31197816

Great – the same number of lines is present in both the original file and the modified file, and the specific example we used previously is still there. All good. If you like, you can dig a little further with commands like head and tail against both files to verify that the lines look correctly changed across the board.
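
If you would like to do so, here is one way (and similarly with tail; output omitted here, as it will depend on your exact copy of the data):

$ head -n3 enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
$ head -n3 out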

You could even try to open the file in your favorite text editor, though I would personally recommend vi, as the number of lines may be large and not all text editors deal well with this. vi takes a while to learn, but it’s a journey well worth taking. Once you get good with vi, you’ll never look back – it grows on you, so to speak.
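
A small optional safeguard if you do open it: view (which is simply vi in read-only mode on most systems) prevents you from accidentally modifying your freshly generated data:

$ view out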

Example 2: tr

We can use the tr utility to translate or delete some characters:

$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | tr ':' '\t'
269019710   31197816    Linux Is My Friend

Here we change our field separator colon (:) to tab (\t). Easy and straightforward, and the syntax speaks for itself.
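
As a small aside (using a made-up input string), tr can also translate whole character ranges, for example to uppercase text:

$ echo 'Linux Is My Friend' | tr '[:lower:]' '[:upper:]'
LINUX IS MY FRIEND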

You can also use tr to delete any character:

$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | tr -d ':' | tr -d '[0-9]'
Linux Is My Friend


You can see how we first removed : from the output by using the delete (-d) option to tr, and next we removed – using a character range – any number in the 0-9 range ([0-9]).
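
If you prefer, the two deletions can also be combined into a single tr invocation (a small variation on the same idea):

$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | tr -d ':0-9'
Linux Is My Friend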

Note how changing the : to \t still does not enable us to use awk without changing the field separator, as there are now both tabs (\t) and spaces in the output, and both are seen by default (in awk) as field separators. So printing $3 with awk leads to just the first word (before a space is seen):

$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | tr ':' '\t' | awk '{print $3}'
Linux
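
One way around this (a quick sketch using the same record) is to tell awk explicitly to use a tab as its field separator:

$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | tr ':' '\t' | awk -F'\t' '{print $3}'
Linux Is My Friend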

This also highlights why it is always very important to test, retest and test again all your regular expressions and data transforming/manipulating command statements.

Conclusion

The multitude of tools available in Bash makes big data manipulation fun and in some cases very easy. In this second article in the series, we continued to explore Bash tools which may help us with big data manipulation.

Enjoy the journey, but remember the warning given at the end of the first article… Big data can seem to have a mind of its own, and there are inherent dangers in working with a lot of data (or with input overload, as in daily life); these are (mainly) perception overload, perfection overreach, time lost and prefrontal cortex (and other brain areas) overuse. The more complex the project, source data or target format, the larger the risk. I am speaking from plenty of experience here.

A good way to counteract these dangers is to set strict time limits on working with complex and large data sets. For example, two hours (at most) per day. You’ll be surprised what you can achieve if you set your mind to a dedicated two hours, and consistently don’t go over it. Don’t say I didn’t warn you 🙂

Let us know your thoughts below – interesting large data sets, strategies (both technical and lifestyle/approach), and other ideas are welcome!


