PDF documents are commonly used to hold lengthy amounts of text, especially for formal matters like contracts or terms and conditions. These PDF documents can prove unwieldy in certain scenarios, since a PDF reader application is required to open them, and a PDF editor must be used for changing the contents.
In many cases, a plain text file is just easier to work with. Luckily, we can easily convert the text of a PDF into a normal plain text file on the Linux command line. In this tutorial, you will learn how to extract the text from a PDF document on a Linux system.
In this tutorial you will learn:
- How to install the pdftotext command on all major Linux distros
- How to use the pdftotext command to extract text from a PDF
|Category|Requirements, Conventions or Software Version Used|
|---|---|
|System|Any Linux distro|
|Other|Privileged access to your Linux system as root or via the sudo command|
|Conventions|# – requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command; $ – requires given Linux commands to be executed as a regular non-privileged user|
Install pdftotext command on major Linux distros
We can use the pdftotext command to extract the text from a PDF document. This command is normally installed by default, but if not, it is provided by the Poppler software package. You can use the appropriate command below to install pdftotext with your system’s package manager.
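To install pdftotext on Ubuntu, Debian, and Linux Mint:
$ sudo apt install poppler-utils
To install pdftotext on Fedora, CentOS, AlmaLinux, and Red Hat:
$ sudo dnf install poppler-utils
To install pdftotext on Arch Linux and Manjaro:
$ sudo pacman -S poppler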
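Before installing anything, it can be worth checking whether the command is already available. The following is a minimal sketch, assuming a POSIX-compatible shell; the status variable is only an illustrative name.

```shell
# Check whether pdftotext is already on the PATH.
if command -v pdftotext >/dev/null 2>&1; then
    status="installed"
else
    status="missing"
fi
echo "pdftotext is $status"
```

If the result is "missing", run the install command for your distro and check again.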
pdftotext Command Examples
Be aware that pdftotext will only extract text that has been stored as text. If your PDF document consists of scanned images (JPG files, for example), pdftotext will not be able to extract any text from them, because it does not support OCR.
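One rough way to tell whether a PDF has any extractable text is to send the output to the terminal instead of a file and check whether anything other than whitespace comes back. The sketch below assumes a POSIX-compatible shell; has_visible_text is a hypothetical helper name and document.pdf is a placeholder. A scanned document that still carries a little real text (a stamped header, for instance) would pass this check, so treat the result as a hint only.

```shell
# Succeeds if the text arriving on stdin contains at least one
# non-whitespace character (the form feeds pdftotext emits between
# pages count as whitespace).
has_visible_text() {
    [ -n "$(tr -d '[:space:]')" ]
}

# Usage (document.pdf is a placeholder). The trailing "-" asks pdftotext
# to write to standard output instead of a file.
if command -v pdftotext >/dev/null 2>&1 && [ -f document.pdf ]; then
    if pdftotext document.pdf - | has_visible_text; then
        echo "document.pdf has extractable text"
    else
        echo "document.pdf looks image-only; OCR would be needed"
    fi
fi
```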
- Use the pdftotext command followed by your PDF document file name as an argument.
$ pdftotext document.pdf
Your text file will be created with the same file name, just with a .txt extension. In other words, document.pdf would have its text extracted into the document.txt file.
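The same naming rule makes it straightforward to convert a whole directory of PDFs in one go. The sketch below assumes a POSIX-compatible shell and file names ending in a lowercase .pdf; default_txt_name and convert_all_pdfs are hypothetical helper names, not part of pdftotext.

```shell
# Name pdftotext uses by default for a lowercase .pdf input:
# the same path with .pdf swapped for .txt.
default_txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Convert every PDF in a directory (defaults to the current one).
convert_all_pdfs() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # the glob matched nothing
        pdftotext "$pdf" && echo "wrote $(default_txt_name "$pdf")"
    done
}

# To choose the output name yourself, pass it as a second argument:
#   pdftotext document.pdf notes.txt
```

Calling convert_all_pdfs with no argument processes the current directory; passing a path processes that directory instead.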
- If you only want to extract text from a certain range of pages, you can use the -f and -l options to specify the first page and the last page to extract, respectively (along with all pages in between). For example, to extract all text from page 3 to page 9:
$ pdftotext -f 3 -l 9 document.pdf
The document.txt plain text file will now contain only the text from pages 3 to 9.
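Setting the first and last page to the same number extracts a single page, which can be used to split a document into one text file per page. The sketch below assumes a POSIX-compatible shell and the pdfinfo command, which ships alongside pdftotext in the Poppler utilities; page_count_from_info, split_pages, and the page-N.txt naming are illustrative choices.

```shell
# Pull the page count out of pdfinfo output supplied on stdin.
page_count_from_info() {
    awk '/^Pages:/ { print $2 }'
}

# Write each page of a PDF to its own text file: page-1.txt, page-2.txt, ...
split_pages() {
    pdf="$1"
    total=$(pdfinfo "$pdf" | page_count_from_info)
    [ -n "$total" ] || return 1      # pdfinfo failed or reported no pages
    page=1
    while [ "$page" -le "$total" ]; do
        pdftotext -f "$page" -l "$page" "$pdf" "page-$page.txt"
        page=$((page + 1))
    done
}

# Usage (document.pdf is a placeholder):
#   split_pages document.pdf
```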
- If you find that your plain text file is structured oddly, you can tell pdftotext to maintain the original layout as much as possible by supplying the -layout option. By default, pdftotext will try to undo certain structuring like columns, which do not translate nicely to plain text.
$ pdftotext -layout document.pdf
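To judge which mode suits a particular document, it can help to extract it both ways and compare the results side by side. This is a minimal sketch, assuming a POSIX-compatible shell; extract_both_layouts and the .default.txt and .layout.txt suffixes are hypothetical names chosen for illustration.

```shell
# Extract the same PDF twice, with and without -layout, for comparison.
extract_both_layouts() {
    pdf="$1"
    [ -f "$pdf" ] || { echo "no such file: $pdf" >&2; return 1; }
    base="${pdf%.pdf}"
    pdftotext "$pdf" "$base.default.txt" &&
        pdftotext -layout "$pdf" "$base.layout.txt"
}

# Usage (document.pdf is a placeholder):
#   extract_both_layouts document.pdf
#   diff document.default.txt document.layout.txt | less
```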
- There are some other options available, but these would only cover niche scenarios and do not require much elaboration. See the help page for a full list with this command:
$ pdftotext -h
For a more detailed explanation of the available options, see the man page:
$ man pdftotext
In this tutorial, we saw how to extract text from a PDF document on a Linux system. This involved the installation of the pdftotext command, which is a must-have utility on Linux for a task like extracting text from PDF files. We also learned how to change the output structure in our text files, since retaining the same layout from PDF to plain text is often impossible.