Whether you are an IT professional who needs to download 2000 online bug reports into a flat text file and parse them to see which ones need attention, or a mum who wants to download 20 recipes from a public domain website, you can benefit from knowing the tools which help you download webpages into a text based file. If you are interested in learning more about how to parse the pages you download, have a look at our Big Data Manipulation for Fun and Profit Part 1 article.
In this tutorial you will learn:
- How to retrieve/download webpages using wget, curl and lynx
- What the main differences between the wget, curl and lynx tools are
- Examples showing how to use wget, curl and lynx
Software requirements and conventions used
Category | Requirements, Conventions or Software Version Used
---|---
System | Linux distribution-independent
Software | Bash command line, Linux based system
Other | Any utility which is not included in the Bash shell by default can be installed using sudo apt-get install utility-name (or yum install for RedHat based systems)
Conventions | # – requires linux-commands to be executed with root privileges, either directly as a root user or by use of the sudo command; $ – requires linux-commands to be executed as a regular non-privileged user
Before we start, please install the 3 utilities using the following command (on Ubuntu or Mint), or use yum install instead of apt-get install if you are using a RedHat based Linux distribution.
$ sudo apt-get install wget curl lynx
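On a RedHat based distribution, the equivalent would be (note that recent Fedora and RHEL releases use dnf, which accepts the same syntax):

$ sudo yum install wget curl lynx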
Once done, let’s get started!
Example 1: wget
Using wget to retrieve a page is easy and straightforward:
$ wget https://linuxconfig.org/linux-complex-bash-one-liner-examples
--2020-10-03 15:30:12--  https://linuxconfig.org/linux-complex-bash-one-liner-examples
Resolving linuxconfig.org (linuxconfig.org)... 2606:4700:20::681a:20d, 2606:4700:20::681a:30d, 2606:4700:20::ac43:4b67, ...
Connecting to linuxconfig.org (linuxconfig.org)|2606:4700:20::681a:20d|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'linux-complex-bash-one-liner-examples'

linux-complex-bash-one-liner-examples     [ <=> ]  51.98K  --.-KB/s    in 0.005s

2020-10-03 15:30:12 (9.90 MB/s) - 'linux-complex-bash-one-liner-examples' saved [53229]

$
Here we downloaded an article from linuxconfig.org into a file, which by default is named after the last part of the URL.
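If you prefer a different filename, or want to suppress the progress output, wget provides the -O and -q options for this; a quick sketch (the filename my-article.html is just an example):

$ wget -q -O my-article.html https://linuxconfig.org/linux-complex-bash-one-liner-examples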
Let’s check out the file contents:
$ file linux-complex-bash-one-liner-examples
linux-complex-bash-one-liner-examples: HTML document, ASCII text, with very long lines, with CRLF, CR, LF line terminators
$ head -n5 linux-complex-bash-one-liner-examples
<!DOCTYPE html>
<html lang="en-gb" dir="ltr">
<head>
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="IE=edge" />
Great, file (the file classification utility) recognizes the downloaded file as HTML, and head confirms that the first 5 lines (-n5) look like HTML code and are text based.
Example 2: curl
$ curl https://linuxconfig.org/linux-complex-bash-one-liner-examples > linux-complex-bash-one-liner-examples
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 53045    0 53045    0     0  84601      0 --:--:-- --:--:-- --:--:-- 84466
$
This time we used curl to do the same as in our first example. By default, curl will output to standard output (stdout) and display the HTML page in your terminal! Thus, we instead redirect (using >) the output to the file linux-complex-bash-one-liner-examples.
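Note that curl can also write directly to a file and hide the progress meter itself; the -o option sets the output file and -s silences the progress meter (with -S still reporting any errors). For example:

$ curl -sS -o linux-complex-bash-one-liner-examples https://linuxconfig.org/linux-complex-bash-one-liner-examples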
We again confirm the contents:
$ file linux-complex-bash-one-liner-examples
linux-complex-bash-one-liner-examples: HTML document, ASCII text, with very long lines, with CRLF, CR, LF line terminators
$ head -n5 linux-complex-bash-one-liner-examples
<!DOCTYPE html>
<html lang="en-gb" dir="ltr">
<head>
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="IE=edge" />
Great, the same result!
One challenge, when we want to process this/these file(s) further, is that the format is HTML based. We could parse the output by using sed or awk and some semi-complex regular expressions to reduce the output to text only, but doing so is somewhat complex and often not sufficiently error-proof.
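For example, here is a minimal sed sketch which strips anything that looks like an HTML tag; it will break on multi-line tags, embedded JavaScript/CSS and HTML entities, so treat it as a demonstration of the problem rather than a reliable solution:

$ sed -e 's/<[^>]*>//g' linux-complex-bash-one-liner-examples | head -n5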
Instead, let's use a tool which was natively designed/programmed to dump pages in text format.
Example 3: lynx
Lynx is another tool which we can use to retrieve the same page. However, unlike wget and curl, lynx is meant to be a full (text-based) browser. Thus, the output from lynx will be text, and not HTML, based. We can use the lynx -dump command to output the webpage being accessed to standard output, instead of starting a fully interactive (text-based) browser in your Linux client.
$ lynx -dump https://linuxconfig.org/linux-complex-bash-one-liner-examples > linux-complex-bash-one-liner-examples
$
Let’s examine the contents of the created file once more:
$ file linux-complex-bash-one-liner-examples
linux-complex-bash-one-liner-examples: UTF-8 Unicode text
$ head -n5 linux-complex-bash-one-liner-examples
   * [1]Ubuntu
     + o [2]Back
       o [3]Ubuntu 20.04
       o [4]Ubuntu 18.04
As you can see, this time we have a UTF-8 Unicode text based file, unlike the previous wget and curl examples, and the head command confirms that the first 5 lines are text based (with references to the URLs in the form of [nr] markers). We can see the URLs towards the end of the file:
$ tail -n86 linux-complex-bash-one-liner-examples | head -n3
Visible links
 1. https://linuxconfig.org/ubuntu
 2. https://linuxconfig.org/linux-complex-bash-one-liner-examples
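If you do not need these link references, lynx can omit them; the -nolist option suppresses the numbered [nr] markers and the link list at the end of the dump:

$ lynx -dump -nolist https://linuxconfig.org/linux-complex-bash-one-liner-examples > linux-complex-bash-one-liner-examples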
Retrieving pages in this way gives us the great benefit of having HTML-free, text-based files which we can process further if so required.
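As a small example of such further processing, we can immediately run standard text tools on the dump; here we count how often the word 'bash' occurs (case-insensitively) in the retrieved page (the resulting count will naturally vary as the page changes):

$ grep -oi 'bash' linux-complex-bash-one-liner-examples | wc -l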
Conclusion
In this article, we had a short introduction to the wget, curl and lynx tools, and we discovered how the latter can be used to retrieve webpages in a textual format, dropping all HTML content.
Please always use the knowledge gained here responsibly: do not overload webservers, and only retrieve public domain, no-copyright, or CC-0 etc. data/pages. Also, always check whether there is a downloadable database/dataset of the data you are interested in; that is much preferred to retrieving webpages individually.
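If you do need to retrieve a handful of pages, a minimal sketch of a polite batch download could look as follows (urls.txt is a hypothetical file with one URL per line; the sleep keeps the request rate low so the webserver is not overloaded):

while read -r url; do
    lynx -dump "$url" > "$(basename "$url").txt"  # save each page as plain text, named after the URL's last component
    sleep 5                                       # hypothetical pause; be gentle on the webserver
done < urls.txt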
Enjoy your newfound knowledge, and, mum, we are looking forward to that cake for which you downloaded the recipe using lynx -dump! If you dive into any of these tools further, please leave us a comment with your discoveries.