Web scraping is the process of analyzing the structure of HTML pages and programmatically extracting data from them. In the past we saw how to scrape the web using the Python programming language and the “Beautiful Soup” library; in this tutorial, instead, we see how to perform the same operation using a command line tool written in Rust: htmlq.
In this tutorial you will learn:
- How to install cargo and htmlq
- How to add the ~/.cargo/bin directory to PATH
- How to scrape a page with curl and htmlq
- How to extract a specific tag
- How to get the value of a specific tag attribute
- How to add base URLs to links
- How to use CSS selectors
- How to get text between tags

Software requirements and conventions used
Category | Requirements, Conventions or Software Version Used
---|---
System | Distribution-independent
Software | curl, cargo, htmlq
Other | None
Conventions | # – requires given linux commands to be executed with root privileges either directly as a root user or by use of the sudo command; $ – requires given linux commands to be executed as a regular non-privileged user
Installation
Htmlq is an application written in Rust, a general-purpose programming language syntactically similar to C++. Cargo is the Rust package manager: it is basically what pip is for Python. In this tutorial we will use Cargo to install the htmlq tool, therefore the first thing we have to do is to install Cargo itself on our system.
Installing cargo
The “cargo” package is available in the repositories of all the most commonly used Linux distributions. To install “Cargo” on Fedora, for example, we simply use the dnf package manager:
$ sudo dnf install cargo
On Debian, and Debian-based distributions, instead, a modern way to perform the installation is to use the apt wrapper, which is designed to provide a more user-friendly interface to commands like apt-get and apt-cache. The command we need to run is the following:
$ sudo apt install cargo
If Arch Linux is our favorite Linux distribution, all we have to do is to install the rust package: Cargo is part of it. To accomplish the task, we can use the pacman package manager:
$ sudo pacman -Sy rust
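Whatever the distribution, once the installation is complete, we can verify Cargo is ready to be used by making it print its version:
$ cargo --version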
Installing htmlq
Once Cargo is installed, we can use it to install the htmlq tool. We don’t need administrative privileges to perform the operation, since we will install the software only for our user. To install htmlq we run:
$ cargo install htmlq
Binaries installed with cargo are placed in the ~/.cargo/bin directory; therefore, to be able to invoke the tool from the command line without having to specify its full path each time, we need to add the directory to our PATH. In our ~/.bash_profile or ~/.profile file, we add the following line:
export PATH="${PATH}:${HOME}/.cargo/bin"
To make the modification effective we need to log out and log back in, or, as a temporary solution, just re-source the file:
$ source ~/.bash_profile
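To verify the directory was correctly added to our PATH, we can ask the shell where it finds the htmlq binary (command -v is specified by POSIX, so it should be available in any shell):
$ command -v htmlq
The command above should print the path of the binary inside the ~/.cargo/bin directory. The tool should also be able to report its own version via the --version flag (the complete list of supported options can be obtained with htmlq --help):
$ htmlq --version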
At this point we should be able to invoke htmlq from our terminal. Let’s see some examples of its usage.
Htmlq usage examples
The most common way to use htmlq is to pass it the output of another very commonly used application: curl. For those of you who don’t know it, curl is a tool used to transfer data from or to a server. When run on a web page, it returns that page’s source on standard output; all we have to do is to pipe it to htmlq. Let’s see some examples.
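A little trick before we start: since in the following examples we will run several commands against the same page, we may want to avoid downloading it again at each attempt. We can make curl save the page source to a local file (page.html here is just an arbitrary name we chose for the sake of the example), and then feed it to htmlq via the shell input redirection operator:
$ curl --silent https://www.nytimes.com -o page.html
$ htmlq a < page.html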
Extracting a specific tag
Suppose we want to extract all the links contained in the homepage of “The New York Times” website. We know that in HTML links are created using the a tag, therefore the command we would run is the following:
$ curl --silent https://www.nytimes.com | htmlq a
In the example above, we invoked curl with the --silent option: this is to avoid the application showing the page download progress or other messages we don’t need in this case. With the | pipe operator we used the output produced by curl as the input of htmlq. We called the latter passing the name of the tag we are searching for as argument. Here is the (truncated) result of the command:
[...]
<a class="css-1wjnrbv" href="/section/world">World</a>
<a class="css-1wjnrbv" href="/section/us">U.S.</a>
<a class="css-1wjnrbv" href="/section/politics">Politics</a>
<a class="css-1wjnrbv" href="/section/nyregion">N.Y.</a>
<a class="css-1wjnrbv" href="/section/business">Business</a>
<a class="css-1wjnrbv" href="/section/opinion">Opinion</a>
<a class="css-1wjnrbv" href="/section/technology">Tech</a>
<a class="css-1wjnrbv" href="/section/science">Science</a>
<a class="css-1wjnrbv" href="/section/health">Health</a>
<a class="css-1wjnrbv" href="/section/sports">Sports</a>
<a class="css-1wjnrbv" href="/section/arts">Arts</a>
<a class="css-1wjnrbv" href="/section/books">Books</a>
<a class="css-1wjnrbv" href="/section/style">Style</a>
<a class="css-1wjnrbv" href="/section/food">Food</a>
<a class="css-1wjnrbv" href="/section/travel">Travel</a>
<a class="css-1wjnrbv" href="/section/magazine">Magazine</a>
<a class="css-1wjnrbv" href="/section/t-magazine">T Magazine</a>
<a class="css-1wjnrbv" href="/section/realestate">Real Estate</a>
[...]
We truncated the output above for convenience; however, we can see that the entire <a> tags were returned. What if we want to obtain only the value of one of the tag attributes? In such cases we can simply invoke htmlq with the --attribute option, and pass the attribute we want to retrieve the value of as argument. Suppose, for example, we only want to get the value of the href attribute, which contains the actual URL the link points to. Here is what we would run:
$ curl --silent https://www.nytimes.com | htmlq a --attribute href
Here is the result we would obtain:
[...]
/section/world
/section/us
/section/politics
/section/nyregion
/section/business
/section/opinion
/section/technology
/section/science
/section/health
/section/sports
/section/arts
/section/books
/section/style
/section/food
/section/travel
/section/magazine
/section/t-magazine
/section/realestate
[...]
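Since each value is printed on a line of its own, the output of htmlq integrates nicely with the standard Unix text-processing utilities. To obtain the list of links sorted and without duplicates, for example, all we have to do is to further pipe the output to the sort utility, invoked with the -u option:
$ curl --silent https://www.nytimes.com | htmlq a --attribute href | sort -u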
Obtaining complete link URLs
As you can see, links are returned as they appear in the page. What is missing from them is the “base” URL, which in this case is https://www.nytimes.com. Is there a way to add it on the fly? The answer is yes. What we have to do is to use the -b (short for --base) option of htmlq, and pass the base URL we want to use as argument:
$ curl --silent https://www.nytimes.com | htmlq a --attribute href -b https://www.nytimes.com
The command above would return the following:
[...]
https://www.nytimes.com/section/world
https://www.nytimes.com/section/us
https://www.nytimes.com/section/politics
https://www.nytimes.com/section/nyregion
https://www.nytimes.com/section/business
https://www.nytimes.com/section/opinion
https://www.nytimes.com/section/technology
https://www.nytimes.com/section/science
https://www.nytimes.com/section/health
https://www.nytimes.com/section/sports
https://www.nytimes.com/section/arts
https://www.nytimes.com/section/books
https://www.nytimes.com/section/style
https://www.nytimes.com/section/food
https://www.nytimes.com/section/travel
https://www.nytimes.com/section/magazine
https://www.nytimes.com/section/t-magazine
https://www.nytimes.com/section/realestate
[...]
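Complete URLs, too, can be conveniently processed by other command line tools; to count how many links exist in the page, for example, we can pipe the output of the command to the wc utility, invoked with the -l option:
$ curl --silent https://www.nytimes.com | htmlq a --attribute href -b https://www.nytimes.com | wc -l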
Obtaining the text between tags
What if we want to “extract” the text contained between specific tags? Say, for example, we want to get only the text used for the links existing in the page? All we have to do is to use the -t (--text) option of htmlq:
$ curl --silent https://www.nytimes.com | htmlq a --text
Here is the output returned by the command above:
[...]
World
U.S.
Politics
N.Y.
Business
Opinion
Tech
Science
Health
Sports
Arts
Books
Style
Food
Travel
Magazine
T Magazine
Real Estate
[...]
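The --text option, of course, is not limited to links: it works with any tag we pass as argument. To extract just the title of the page, for example, we could run:
$ curl --silent https://www.nytimes.com | htmlq title --text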
Using CSS selectors
When using htmlq, we are not limited to simply passing the name of the tag we want to retrieve as argument: we can use more complex CSS selectors. Here is an example. Of all the links existing in the page we used in the examples above, suppose we want to retrieve only those with the css-jq1cx6 class. We would run:
$ curl --silent https://www.nytimes.com | htmlq a.css-jq1cx6
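Selectors can be freely combined with the options we saw in the previous examples; to extract only the text of the links with that class, for instance, we would run:
$ curl --silent https://www.nytimes.com | htmlq a.css-jq1cx6 --text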
Similarly, to filter all the tags where the data-testid attribute exists and has the “footer-link” value, we would run (notice that we quote the selector, so that the square brackets and the inner double quotes are not interpreted by the shell):
$ curl --silent https://www.nytimes.com | htmlq 'a[data-testid="footer-link"]'
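Attribute selectors, too, can be combined with the other options; to obtain the complete URLs of only the footer links, for example, we would run:
$ curl --silent https://www.nytimes.com | htmlq 'a[data-testid="footer-link"]' --attribute href -b https://www.nytimes.com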
Conclusions
In this tutorial we learned how to use the htmlq application to perform the scraping of web pages from the command line. The tool is written in Rust, so we saw how to install it using the “Cargo” package manager, and how to add the default directory Cargo uses to store binaries to our PATH. We learned how to retrieve specific tags from a page, how to get the value of a specific tag attribute, how to pass a base URL to be added to partial links, how to use CSS selectors, and, finally, how to retrieve the text enclosed between tags.