How to extract text from image

Although it is always ideal to have text stored as… well, text, sometimes an image or screenshot is the only option we have for the time being. The problem with having text stored as an image is that it can’t be easily copied and pasted, the font can’t be changed, and the file must be stored in an image format instead of something easily editable and small like a text file. Rather than going through the painstaking process of manually converting the image to text by typing it out, there are tools that can do the job for us on a Linux system.

Using optical character recognition (OCR) technology, various tools can read the text stored in an image and convert it to regular characters for storage in a text file or document. In this tutorial, we will go over a GUI method and a command line method for extracting text from an image on a Linux system.

In this tutorial you will learn:

  • How to install gImageReader on major Linux distros
  • How to install Tesseract OCR on major Linux distros
  • How to extract text from image via GUI
  • How to extract text from image via command line
Software Requirements and Linux Command Line Conventions

Category      Requirements, Conventions or Software Version Used
System        Any Linux distro
Software      gImageReader, Tesseract OCR
Other         Privileged access to your Linux system as root or via the sudo command.
Conventions   # – requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command
              $ – requires given Linux commands to be executed as a regular non-privileged user

Extract text from image with gImageReader – GUI method




gImageReader is a graphical frontend for Tesseract OCR, an engine originally developed by Hewlett Packard. The installation commands below install both gImageReader and its Tesseract OCR backend. Tesseract is open source and is regarded as one of the best free OCR tools available. Let’s see how to install and use gImageReader on Linux.

gImageReader Installation

First, let’s get started with installing gImageReader. You can use the appropriate command below to install gImageReader with your system’s package manager.

To install gImageReader on Ubuntu, Debian, and Linux Mint:

$ sudo apt update
$ sudo apt install gimagereader tesseract-ocr

To install gImageReader on Fedora, CentOS, AlmaLinux, and Red Hat:

$ sudo dnf install gimagereader-qt tesseract

To install gImageReader on Arch Linux and Manjaro:

$ sudo pacman -S gimagereader-qt tesseract

Using gImageReader

Now that gImageReader has been installed, follow the steps below to use the program to extract text from one or more image files:

  1. Start by opening gImageReader from your application launcher. In GNOME, this would be the Activities menu.

    Opening gImageReader GUI application



  2. Let’s add an image that we want to convert to text. To do so, click the green icon in the upper left corner of the gImageReader application. Refer to the screenshot below for the exact location.
    Clicking on the icon to add a new image source
  3. Locate the image (or images) that you want to add to gImageReader.
    Adding an image to gImageReader for conversion to text
  4. For the next step, we can either manually select the text sections that we want to convert, or click on the ‘recognize all’ button at the top to let gImageReader detect the text automatically and begin converting it. Be sure to select a different language if your screenshot is not in English.
    Click on the recognize all button to start converting the detected text
  5. The converted text now appears in the right pane. You can interact with the text to make corrections if you spot any errors, or copy the text to your clipboard. To save the text output as a file, click on the ‘Save output’ option as seen below.
    Save the output of the converted text
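
Recognition in a language other than English relies on the matching Tesseract language data being installed. As a rough sketch, the helper below checks whether a language code appears in the output of tesseract --list-langs. The name has_lang is hypothetical (it is not part of Tesseract or gImageReader), and deu (German) is only an example code.

```shell
# has_lang is a hypothetical helper: it succeeds only if the given
# language code is listed by `tesseract --list-langs`.
has_lang() {
    # Older releases print the list on stderr, so merge both streams.
    # grep -x matches whole lines only, so "deu" will not match "deu_frak".
    tesseract --list-langs 2>&1 | grep -qx -- "$1"
}

# Example: report whether German language data is available.
if has_lang deu; then
    echo "German language data is installed"
else
    echo "German language data is not installed"
fi
```

On Debian-based systems, language data is typically packaged as tesseract-ocr-<code>, for example tesseract-ocr-deu. Package names on other distros differ, so check your package manager.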

Extract text from image with Tesseract OCR – command line method

Tesseract OCR is a command line program and the backend engine for the gImageReader GUI covered above. In the sections below, we will show you how to install Tesseract OCR on major Linux distros and then use its command syntax to start extracting text from images.

Tesseract OCR Installation

First, let’s get started with installing Tesseract OCR. You can use the appropriate command below to install Tesseract OCR with your system’s package manager.

To install Tesseract OCR on Ubuntu, Debian, and Linux Mint:

$ sudo apt update
$ sudo apt install tesseract-ocr




To install Tesseract OCR on Fedora, CentOS, AlmaLinux, and Red Hat:

$ sudo dnf install tesseract

To install Tesseract OCR on Arch Linux and Manjaro:

$ sudo pacman -S tesseract
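
To confirm that the installation worked, one quick check is to ask tesseract for its version. This is a minimal sketch; the status variable is only illustrative.

```shell
# Check that the tesseract binary is on PATH before using it.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --version    # prints the engine version and linked libraries
    status="installed"
else
    echo "tesseract not found; install it with your package manager" >&2
    status="missing"
fi
```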

Using Tesseract OCR

Now that Tesseract OCR has been installed, follow the commands below to use the program to extract text from one or more image files:

  1. We supply the tesseract command with the name of our image and the desired base name of the output text file. In this example, our image file is named server-guide.png and the resulting text file will be named output.txt.
    $ tesseract server-guide.png output
    
    NOTE
    Do not specify a file extension for your output file. Tesseract always appends .txt to the name you give, so specifying output.txt would produce a file named output.txt.txt.
    Converting our image to text and then checking the results with cat command
  2. If the results are not what you expected, you can manually specify the DPI resolution of the image you are trying to convert. This may help tesseract do a better job if it is having trouble detecting the resolution on its own.
    $ tesseract server-guide.png output --dpi 300
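
Since tesseract converts one image per invocation, a small loop is one way to handle several files. The sketch below assumes the images are PNG files in a single directory. The function name ocr_dir is hypothetical and not part of Tesseract.

```shell
# ocr_dir is a hypothetical helper: it runs tesseract on every PNG in the
# given directory (default: current directory) and writes NAME.txt next to
# each NAME.png.
ocr_dir() {
    dir="${1:-.}"
    count=0
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue      # glob matched nothing; skip
        base="${img%.png}"             # drop .png; tesseract appends .txt
        if tesseract "$img" "$base"; then
            count=$((count + 1))
        else
            echo "failed: $img" >&2
        fi
    done
    echo "processed $count file(s)"
}
```

Calling ocr_dir ~/screenshots would then process every PNG in that directory. To print the recognized text to the terminal instead of writing a file, tesseract also accepts stdout as the output name, as in tesseract server-guide.png stdout.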
    


Closing Thoughts

In this tutorial, we saw how to extract text from an image file on a Linux system. This can be done via command line or GUI, with the same engine powering both methods: Tesseract OCR, originally developed by Hewlett Packard and later sponsored by Google. It is a capable OCR engine that does a great job of converting images to text. Nevertheless, for tricky images with varying fonts and sizes, you can expect to make some minor corrections in the output.


