Web scraping is the process of analyzing the structure of HTML pages and programmatically extracting data from them. In the past we saw how to scrape the web using the Python programming language and the “Beautiful Soup” library; in this tutorial, instead, we see how to perform the same operation using a command line tool written in Rust: htmlq.

In this tutorial you will learn:

How to install cargo and htmlq

How to add the ~/.cargo/bin directory to PATH

How to scrape a page with curl and htmlq

How to extract a specific tag

How to get the value of a specific tag attribute

How to add base URLs to links

How to use CSS selectors

How to get text between tags

Software requirements and conventions used

Software Requirements and Linux Command Line Conventions

Category: Requirements, Conventions or Software Version Used
System: Distribution-independent
Software: curl, cargo, htmlq
Other: None
Conventions:
# – requires given linux-commands to be executed with root privileges, either directly as the root user or by use of the sudo command
$ – requires given linux-commands to be executed as a regular non-privileged user

Installation

Htmlq is an application written using Rust, a general-purpose programming language, syntactically similar to C++. Cargo is the Rust package manager: it is basically what pip is for Python. In this tutorial we will use Cargo to install the htmlq tool, so the first thing we have to do is install Cargo on our system.

Installing cargo

The “cargo” package is available in the repositories of all the most commonly used Linux distributions. To install “Cargo” on Fedora, for example, we simply use the dnf package manager:

$ sudo dnf install cargo

On Debian and Debian-based distributions, instead, a modern way to perform the installation is to use the apt wrapper, which is designed to provide a more user-friendly interface to commands like apt-get and apt-cache. The command we need to run is the following:

$ sudo apt install cargo

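If Arch Linux is our favorite Linux distribution, all we have to do is install the rust package: Cargo is part of it. To achieve the task, we can use the pacman package manager (the -S option is used on its own here, since refreshing the package databases with -Sy without a full upgrade can leave the system partially upgraded):

$ sudo pacman -S rust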
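The three cases above can be folded into a small helper that only prints the matching command instead of running it. This is a sketch under the assumption that the presence of dnf, apt or pacman is enough to identify the distribution family; the function name suggest_cargo_install is made up for this example.

```shell
# Hypothetical helper: print (without running) the install command that
# matches the first supported package manager found on this system.
suggest_cargo_install() {
    if command -v dnf >/dev/null 2>&1; then
        echo "sudo dnf install cargo"
    elif command -v apt >/dev/null 2>&1; then
        echo "sudo apt install cargo"
    elif command -v pacman >/dev/null 2>&1; then
        echo "sudo pacman -S rust"
    else
        # None of the three package managers covered here was found.
        return 1
    fi
}

if install_cmd=$(suggest_cargo_install); then
    echo "Run: ${install_cmd}"
else
    echo "No supported package manager detected" >&2
fi
```

Printing the command rather than executing it keeps the helper safe to run as a regular user.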

Installing htmlq

Once Cargo is installed, we can use it to install the htmlq tool. We don’t need administrative privileges to perform the operation, since we will install the software only for our user. To install htmlq we run:

$ cargo install htmlq

Binaries installed with cargo are placed in the ~/.cargo/bin directory; therefore, to be able to invoke the tool from the command line without having to specify its full path each time, we need to add the directory to our PATH. In our ~/.bash_profile or ~/.profile file, we add the following line:

export PATH="${PATH}:${HOME}/.cargo/bin"

To make the modification effective we need to log out and log back in or, as a temporary solution, just re-source the file:

$ source ~/.bash_profile
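Since the profile file may be sourced more than once, a slightly more defensive variant of the export line appends the directory only when it is missing, then checks whether the shell can find the tool. This is a sketch, not a required step; the cargo_bin variable name is chosen for this example.

```shell
# Append ~/.cargo/bin to PATH only if it is not already listed, so that
# sourcing the profile repeatedly does not create duplicate entries.
cargo_bin="${HOME}/.cargo/bin"
case ":${PATH}:" in
    *":${cargo_bin}:"*) ;;                      # already present: nothing to do
    *) export PATH="${PATH}:${cargo_bin}" ;;    # missing: append it
esac

# Report whether the shell can now find htmlq.
if command -v htmlq >/dev/null 2>&1; then
    echo "htmlq found at: $(command -v htmlq)"
else
    echo "htmlq not found in PATH" >&2
fi
```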

Htmlq usage examples

At this point we should be able to invoke htmlq from our terminal. Let’s see some examples of its usage.

The most common way to use htmlq is to pass it the output of another very commonly used application: curl. For those of you who don’t know it, curl is a tool used to transfer data from or to a server. When run on a web page, it writes the page source to standard output; all we have to do is pipe it to htmlq.

Extracting a specific tag

Suppose we want to extract all the links contained in the homepage of “The New York Times” website. We know that in HTML links are created using the a tag, therefore the command we would run is the following:

$ curl --silent https://www.nytimes.com | htmlq a

In the example above, we invoked curl with the --silent option: this is to avoid the application showing the page download progress or other messages we don’t need in this case. With the | pipe operator we used the output produced by curl as htmlq input. We called the latter passing the name of the tag we are searching for as argument. Here is the (truncated) result of the command:

[...]
<a class="css-1wjnrbv" href="/section/world">World</a>
<a class="css-1wjnrbv" href="/section/us">U.S.</a>
<a class="css-1wjnrbv" href="/section/politics">Politics</a>
<a class="css-1wjnrbv" href="/section/nyregion">N.Y.</a>
<a class="css-1wjnrbv" href="/section/business">Business</a>
<a class="css-1wjnrbv" href="/section/opinion">Opinion</a>
<a class="css-1wjnrbv" href="/section/technology">Tech</a>
<a class="css-1wjnrbv" href="/section/science">Science</a>
<a class="css-1wjnrbv" href="/section/health">Health</a>
<a class="css-1wjnrbv" href="/section/sports">Sports</a>
<a class="css-1wjnrbv" href="/section/arts">Arts</a>
<a class="css-1wjnrbv" href="/section/books">Books</a>
<a class="css-1wjnrbv" href="/section/style">Style</a>
<a class="css-1wjnrbv" href="/section/food">Food</a>
<a class="css-1wjnrbv" href="/section/travel">Travel</a>
<a class="css-1wjnrbv" href="/section/magazine">Magazine</a>
<a class="css-1wjnrbv" href="/section/t-magazine">T Magazine</a>
<a class="css-1wjnrbv" href="/section/realestate">Real Estate</a>
[...]
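To experiment with tag selection without downloading a live page each time, the same invocation can be tried on a small document held in a shell variable. The markup below is a made-up sample, not taken from the website, and the variable names are chosen for this example.

```shell
# Hypothetical sample markup, used only for illustration.
sample_html='<html><body>
<nav>
  <a class="nav" href="/section/world">World</a>
  <a class="nav" href="/section/us">U.S.</a>
</nav>
<p>No links here.</p>
</body></html>'

# Select every <a> element; the <p> element is not part of the result.
anchors=$(printf '%s\n' "$sample_html" | htmlq a)
printf '%s\n' "$anchors"
```

Because htmlq reads from standard input, anything that can print HTML (curl, cat, printf) can take the place of the download.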

We truncated the output above for convenience; however, we can see that the entire <a> tags were returned. What if we want to obtain only the value of one of the tag attributes? In such cases we can simply invoke htmlq with the --attribute option, and pass the attribute whose value we want to retrieve as argument. Suppose, for example, we only want to get the value of the href attribute, which is the actual URL of the page the link points to. Here is what we would run:

$ curl --silent https://www.nytimes.com | htmlq a --attribute href

Here is the result we would obtain:

[...]
/section/world
/section/us
/section/politics
/section/nyregion
/section/business
/section/opinion
/section/technology
/section/science
/section/health
/section/sports
/section/arts
/section/books
/section/style
/section/food
/section/travel
/section/magazine
/section/t-magazine
/section/realestate
[...]
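When several attributes are needed from the same page, it may be preferable to save the page once and query the saved copy. The sketch below assumes the -f (--filename) option listed in the htmlq help output, which this tutorial does not otherwise cover; the file content is a made-up stand-in for a downloaded page.

```shell
# Hypothetical stand-in for a downloaded page. In real use the file could be
# created with something like:
#   curl --silent --output "$page" https://www.nytimes.com
page=$(mktemp)
cat > "$page" <<'EOF'
<html><body>
<a href="/section/world">World</a>
<a href="/section/us">U.S.</a>
<img src="/logo.png" alt="Logo">
</body></html>
EOF

# Query the saved copy twice without downloading anything again.
hrefs=$(htmlq --filename "$page" a --attribute href)
images=$(htmlq --filename "$page" img --attribute src)
printf 'links:\n%s\nimages:\n%s\n' "$hrefs" "$images"

# Remove the temporary file once we are done with it.
rm -f "$page"
```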

Obtaining complete links URLs

As you can see, links are returned as they appear in the page. What is missing from them is the “base” URL, which in this case is https://www.nytimes.com. Is there a way to add it on the fly? The answer is yes. What we have to do is use the -b (short for --base) option of htmlq, and pass the base URL we want to use as argument:

$ curl --silent https://www.nytimes.com | htmlq a --attribute href -b https://www.nytimes.com

The command above would return the following:

[...]
https://www.nytimes.com/section/world
https://www.nytimes.com/section/us
https://www.nytimes.com/section/politics
https://www.nytimes.com/section/nyregion
https://www.nytimes.com/section/business
https://www.nytimes.com/section/opinion
https://www.nytimes.com/section/technology
https://www.nytimes.com/section/science
https://www.nytimes.com/section/health
https://www.nytimes.com/section/sports
https://www.nytimes.com/section/arts
https://www.nytimes.com/section/books
https://www.nytimes.com/section/style
https://www.nytimes.com/section/food
https://www.nytimes.com/section/travel
https://www.nytimes.com/section/magazine
https://www.nytimes.com/section/t-magazine
https://www.nytimes.com/section/realestate
[...]
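A page often links to the same destination more than once, so the list of complete URLs may contain repetitions. One possible way to deal with them is to pass the htmlq output through sort -u. The sample markup below is made up for illustration and deliberately repeats one link.

```shell
# Hypothetical sample markup: the "world" link appears twice.
sample_html='<a href="/section/world">World</a>
<a href="/section/us">U.S.</a>
<a href="/section/world">World (footer)</a>'

# Resolve each href against the base URL, then drop duplicate lines.
urls=$(printf '%s\n' "$sample_html" \
    | htmlq a --attribute href --base https://www.nytimes.com \
    | sort -u)
printf '%s\n' "$urls"
```

The resulting list can then be fed to other tools, for example a while read loop that visits each URL in turn.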

Obtaining the text between tags

What if we want to “extract” the text contained between specific tags? Say, for example, we want to get only the text used for the links existing in the page. All we have to do is use the -t (--text) option of htmlq:

$ curl --silent https://www.nytimes.com | htmlq a --text

Here is the output returned by the command above:

[...]
World
U.S.
Politics
N.Y.
Business
Opinion
Tech
Science
Health
Sports
Arts
Books
Style
Food
Travel
Magazine
T Magazine
Real Estate
[...]

Using CSS selectors

When using htmlq, we are not limited to passing the name of the tag we want to retrieve as argument: we can use more complex CSS selectors. Here is an example. Of all the links existing in the page we used in the example above, suppose we want to retrieve only those with the css-jq1cx6 class. We would run:

$ curl --silent https://www.nytimes.com | htmlq a.css-jq1cx6

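Similarly, to select only the a tags where the data-testid attribute exists and has the “footer-link” value, we would run the following command. The selector is wrapped in single quotes so that the shell passes the brackets and the double quotes to htmlq unchanged:

$ curl --silent https://www.nytimes.com | htmlq 'a[data-testid="footer-link"]'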
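Both kinds of selector can be tried on a small local document, and combined with the options seen earlier. The markup below is a made-up sample (the nav class and the link targets are not taken from the website), and each selector is single-quoted to keep the shell from interpreting it.

```shell
# Hypothetical sample markup, used only for illustration.
sample_html='<nav><a class="nav" href="/section/world">World</a></nav>
<footer>
  <a data-testid="footer-link" href="/privacy">Privacy</a>
  <a data-testid="footer-link" href="/terms">Terms</a>
</footer>'

# Class selector: only the links carrying the "nav" class.
nav_links=$(printf '%s\n' "$sample_html" | htmlq 'a.nav' --attribute href)

# Attribute selector: only the links whose data-testid is "footer-link".
footer_text=$(printf '%s\n' "$sample_html" \
    | htmlq 'a[data-testid="footer-link"]' --text)

printf 'nav links:\n%s\nfooter link text:\n%s\n' "$nav_links" "$footer_text"
```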

Conclusions