How to extract text from a PDF document

PDF documents are commonly used to hold large amounts of text, especially for formal matters like contracts or terms and conditions. These PDF documents can prove unwieldy in certain scenarios, since a PDF reader application is required to open them, and a PDF editor must be used to change their contents.

In many cases, a plain text file is just easier to work with. Luckily, we can easily convert the text of a PDF into a normal plain text file on the Linux command line. In this tutorial, you will learn how to extract the text from a PDF document on a Linux system.

In this tutorial you will learn:

  • How to install the pdftotext command on all major Linux distros
  • How to use the pdftotext command to extract text from PDF
Software Requirements and Linux Command Line Conventions

| Category    | Requirements, Conventions or Software Version Used |
|-------------|-----------------------------------------------------|
| System      | Any Linux distro |
| Software    | pdftotext |
| Other       | Privileged access to your Linux system as root or via the sudo command. |
| Conventions | # – requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command |
|             | $ – requires given Linux commands to be executed as a regular non-privileged user |

Install pdftotext command on major Linux distros




We can use the pdftotext Linux command to extract the text from a PDF document. This command is often installed by default; if it is not, it is provided by the Poppler software package. Use the appropriate command below to install pdftotext with your system’s package manager.

To install pdftotext on Ubuntu, Debian, and Linux Mint:

$ sudo apt install poppler-utils

To install pdftotext on Fedora, CentOS, AlmaLinux, and Red Hat:

$ sudo dnf install poppler-utils

To install pdftotext on Arch Linux and Manjaro:

$ sudo pacman -S poppler
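
To confirm that the installation worked, a short check like the one below may help. This is only a sketch: have_cmd is a hypothetical helper name introduced here, and the exact version banner printed depends on your Poppler release.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command is found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd pdftotext; then
    echo "pdftotext is installed"
    # -v prints version information; 2>&1 captures it in case it goes to stderr.
    pdftotext -v 2>&1 | head -n 1
else
    echo "pdftotext not found; install the Poppler package for your distro"
fi
```

If the second message appears, run the install command for your distribution from the list above and try again.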

pdftotext Command Examples

NOTE
Be aware that pdftotext only extracts text that has been stored as text. It does not support OCR, so if your PDF document consists of scanned images (JPG files, for example), pdftotext will not be able to extract any text from them.
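
One rough way to tell whether a PDF holds real text is to extract it to standard output and look for any letters or digits. The sketch below assumes a file named document.pdf in the current directory; has_text is a hypothetical helper name, and the check is a heuristic rather than a definitive test.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if standard input contains a letter or digit.
has_text() {
    grep -q '[[:alnum:]]'
}

pdf="document.pdf"   # assumed file name, change as needed

if command -v pdftotext >/dev/null 2>&1 && [ -f "$pdf" ]; then
    # Passing "-" as the output file makes pdftotext write to standard output.
    if pdftotext "$pdf" - 2>/dev/null | has_text; then
        echo "$pdf contains extractable text"
    else
        echo "$pdf has no extractable text; it may be a scan that needs OCR"
    fi
fi
```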
  1. Use the pdftotext command followed by your PDF document file name as an argument.


    $ pdftotext document.pdf
    

    Your text file will be created with the same file name, but with a .txt extension. In other words, document.pdf would have its text extracted into the document.txt text file.

  2. If you only want to extract text from a certain range of pages, you can use the -f and -l options to specify the first page and the last page to extract, respectively (all pages in between are included). For example, to extract all text from page 3 to page 9:
    $ pdftotext -f 3 -l 9 document.pdf
    

    The document.txt plain text file will now contain only the text from pages 3 through 9.

  3. If you find that your plain text file is structured oddly, you can tell pdftotext to maintain the original layout as much as possible by supplying the -layout option. By default, pdftotext will try to undo certain structuring like columns, which do not translate nicely to plain text.
    $ pdftotext -layout document.pdf
    
  4. There are some other options available, but these would only cover niche scenarios and do not require much elaboration. See the help page for a full list with this command:
    $ pdftotext -h
    

    Or, for a more detailed explanation of the available options, see the man page:

    $ man pdftotext
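
    The commands above handle one file at a time. As a minimal sketch of how they might be combined, the loop below converts every PDF in the current directory. It assumes the optional second argument of pdftotext, which names the output text file, and txt_name is a hypothetical helper introduced here. Drop -layout or add -f and -l to suit your documents.

```shell
#!/bin/sh
# Hypothetical helper: print the output name for a PDF,
# which is the same path with the .pdf suffix replaced by .txt.
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

if command -v pdftotext >/dev/null 2>&1; then
    for pdf in ./*.pdf; do
        [ -f "$pdf" ] || continue          # skip if no PDF files matched
        out=$(txt_name "$pdf")
        # Keep the original layout and write to the explicit output file.
        pdftotext -layout "$pdf" "$out" && echo "wrote $out"
    done
fi
```

    Naming the output file explicitly makes no difference here, since the result matches the default name, but it makes it easy to redirect the text files to another directory later.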
    

Closing Thoughts




In this tutorial, we saw how to extract text from a PDF document on a Linux system. This involved installing the pdftotext command, a must-have utility on Linux for extracting text from PDF files. We also learned how to change the output structure of our text files, since retaining the exact layout when going from PDF to plain text is often impossible.