These days everyone seems to be talking about Big Data, but what does it really mean? The term is used quite ambiguously in a variety of situations. For the purposes of this article, and the series, we will use ‘big data’ to mean ‘a large amount of textual data, in any format’ (for example plain ASCII text, XML, HTML, or any other human-readable or semi-human-readable format). Some of the techniques shown may also work well for binary data, when used with care and knowledge.
So, why fun (ref title)?
Handling gigabytes of raw textual data in a quick and efficient script, or even using a one-liner command (see Linux Complex Bash One Liner Examples to learn more about one-liners in general), can be quite fun, especially when you get things to work well and are able to automate things. We can never learn enough about how to handle big data; the next challenging text parse will always be around the corner.
And, why profit?
Much of the world’s data is stored in large textual flat files. For example, did you know you can download the full Wikipedia database? The problem is that this data often comes in a format other than the one you need, such as HTML, XML or JSON, or even a proprietary data format! How do you get it from one system to another? Knowing how to parse big data, and parse it well, puts all the power at your fingertips to change data from one format to another. Simple? Often the answer is ‘No’, and thus it helps if you know what you are doing. Straightforward? The same answer applies. Profitable? Regularly, yes, especially if you become good at handling and using big data.
Handling big data is also referred to as ‘data wrangling’. I started working with big data over 17 years ago, so hopefully there is a thing or two you can pick up from this series. In general, data transformation as a topic is semi-endless (hundreds of third-party tools are available for each particular text format), but I will focus on one specific aspect which applies to textual data parsing: using the Bash command line to parse any type of data. At times, this may not be the best solution (i.e. a pre-created tool may do a better job), but this series is specifically for all those (many) other times when no tool is available to get your data ‘just right’.
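As a small taste of what parsing text on the command line can look like, here is a minimal sketch. The sample records and the function names to_csv and count_cities are made up for illustration; they do not come from any real data set or tool, and a real flat file would simply be piped in instead of the sample variable.

```shell
#!/bin/sh
# Hypothetical sample data: three "name:city" records (an assumption for this sketch).
sample="$(printf 'alice:Sydney\nbob:Perth\ncarol:Sydney')"

# Convert colon-delimited records to comma-delimited ones, one record per line.
to_csv() {
  sed 's/:/,/'
}

# Count how many records mention each city, most frequent first,
# and print the result as "city=count".
count_cities() {
  cut -d':' -f2 | sort | uniq -c | sort -rn | awk '{print $2 "=" $1}'
}

csv_out="$(printf '%s\n' "$sample" | to_csv)"
city_out="$(printf '%s\n' "$sample" | count_cities)"

printf '%s\n' "$csv_out"
printf '%s\n' "$city_out"
```

The same pipelines work unchanged on a file of any size (for example, cat records.txt | count_cities, where records.txt is a hypothetical file name), because each tool processes its input as a stream rather than loading it into memory all at once.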
In this tutorial you will learn:
Big Data Manipulation for Fun and Profit Part 1