HTTP is the protocol used by the World Wide Web, which is why being able to interact with it programmatically is essential: scraping a web page, communicating with a service API, or simply downloading a file are all tasks based on this interaction. Python makes such operations very easy: some useful functions are already provided in the standard library, and for more complex tasks it's possible (and even recommended) to use the external requests module. In this first article of the series we will focus on the built-in modules. We will use python3 and mostly work inside the python interactive shell; each library is imported only once to avoid repetition.

In this tutorial you will learn:
  • How to perform HTTP requests with python3 and the urllib.request library
  • How to work with server responses
  • How to download a file using the urlopen or the urlretrieve functions
HTTP request with python - Pt. I: The standard library

Software Requirements and Conventions Used

Software Requirements and Linux Command Line Conventions
Category: Requirements, Conventions or Software Version Used
System: OS-independent
Software: Python3
Other:
  • Knowledge of the basic concepts of Object Oriented Programming and of the Python programming language
  • Basic knowledge of the HTTP protocol and HTTP verbs
Conventions:
  • # - requires given linux commands to be executed with root privileges either directly as a root user or by use of sudo command
  • $ - requires given linux commands to be executed as a regular non-privileged user

Performing requests with the standard library

Let's start with a very easy GET request. The GET HTTP verb is used to retrieve data from a resource. When performing this type of request, it is possible to specify some parameters as variables: these variables, expressed as key-value pairs, form a query string which is "appended" to the URL of the resource. A GET request should always be idempotent (performing it once or many times should leave the server in the same state) and should never be used to change a state. Performing GET requests with python is really easy. For the sake of this tutorial we will take advantage of the open NASA API, which lets us retrieve the so-called "picture of the day":




>>> from urllib.request import urlopen
>>> with urlopen("https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY") as response:
...    response_content = response.read()

The first thing we did was to import the urlopen function from the urllib.request library: this function returns an http.client.HTTPResponse object which has some very useful methods. We used the function inside a with statement because the HTTPResponse object supports the context-management protocol: resources are immediately closed after the "with" statement is executed, even if an exception is raised.

The read method we used in the example above returns the body of the response as a bytes object and optionally takes an argument which represents the maximum number of bytes to read (we will see later how this is important in some cases, especially when downloading big files). If this argument is omitted, the body of the response is read in its entirety.

At this point we have the body of the response as a bytes object, referenced by the response_content variable. We may want to transform it into something else. To turn it into a string, for example, we use the decode method, providing the encoding type as argument, typically:

>>> response_content.decode('utf-8')

In the example above we used the utf-8 encoding. The API call we used, however, returns a response in JSON format, so we want to process it with the help of the json module:

>>> import json
>>> json_response = json.loads(response_content)

The json.loads function deserializes a str, bytes or bytearray instance containing a JSON document into a python object. The result of calling the function, in this case, is a dictionary:

>>> from pprint import pprint
>>> pprint(json_response)
{'date': '2019-04-14',
 'explanation': 'Sit back and watch two black holes merge. Inspired by the '
                'first direct detection of gravitational waves in 2015, this '
                'simulation video plays in slow motion but would take about '
                'one third of a second if run in real time. Set on a cosmic '
                'stage the black holes are posed in front of stars, gas, and '
                'dust. Their extreme gravity lenses the light from behind them '
                'into Einstein rings as they spiral closer and finally merge '
                'into one. The otherwise invisible gravitational waves '
                'generated as the massive objects rapidly coalesce cause the '
                'visible image to ripple and slosh both inside and outside the '
                'Einstein rings even after the black holes have merged. Dubbed '
                'GW150914, the gravitational waves detected by LIGO are '
                'consistent with the merger of 36 and 31 solar mass black '
                'holes at a distance of 1.3 billion light-years. The final, '
                'single black hole has 63 times the mass of the Sun, with the '
                'remaining 3 solar masses converted into energy in '
                'gravitational waves. Since then the LIGO and VIRGO '
                'gravitational wave observatories have reported several more '
                'detections of merging massive systems, while last week the '
                'Event Horizon Telescope reported the first horizon-scale '
                'image of a black hole.',
 'media_type': 'video',
 'service_version': 'v1',
 'title': 'Simulation: Two Black Holes Merge',
 'url': 'https://www.youtube.com/embed/I_88S8DWbcU?rel=0'}

As an alternative we could also use the json.load function (notice the missing trailing "s"). The function accepts a file-like object as argument: this means that we can use it directly on the HTTPResponse object:

>>> with urlopen("https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY") as response:
...    json_response = json.load(response)
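The difference between the two functions can also be checked without sending a request. The following is only a minimal sketch: the sample payload is made up, and io.BytesIO is used as a stand-in for the HTTPResponse object, on the assumption that all json.load needs is an object with a read method.

```python
import io
import json

# Hypothetical payload, shaped like a trimmed-down API response body
raw_body = b'{"media_type": "video", "service_version": "v1"}'

# json.loads accepts bytes directly (Python 3.6+) as well as str
from_bytes = json.loads(raw_body)
from_text = json.loads(raw_body.decode("utf-8"))

# json.load reads from a file-like object; io.BytesIO stands in
# for the HTTPResponse object here
from_file_like = json.load(io.BytesIO(raw_body))

print(from_bytes == from_text == from_file_like)  # True
```

All three calls produce the same dictionary, so the choice mostly depends on whether the body has already been read into a variable.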

Reading the response headers

Another very useful method of the HTTPResponse object is getheaders. This method returns the headers of the response as a list of tuples. Each tuple contains a header name and its corresponding value:



>>> pprint(response.getheaders())
[('Server', 'openresty'),
 ('Date', 'Sun, 14 Apr 2019 10:08:48 GMT'),
 ('Content-Type', 'application/json'),
 ('Content-Length', '1370'),
 ('Connection', 'close'),
 ('Vary', 'Accept-Encoding'),
 ('X-RateLimit-Limit', '40'),
 ('X-RateLimit-Remaining', '37'),
 ('Via', '1.1 vegur, http/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])'),
 ('Age', '1'),
 ('X-Cache', 'MISS'),
 ('Access-Control-Allow-Origin', '*'),
 ('Strict-Transport-Security', 'max-age=31536000; preload')]

Among the others you can notice the Content-Type header, which, as we said above, is application/json. If we want to retrieve only a specific header we can use the getheader method instead, passing the name of the header as argument (the lookup is case-insensitive, which is why 'Content-type' works here):

>>> response.getheader('Content-type')
'application/json'
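The header handling described above can be explored offline as well. The sketch below builds an http.client.HTTPMessage by hand (the class used for the headers attribute of the response) and fills it with assumed values, so nothing here comes from a real server.

```python
from http.client import HTTPMessage

# Stand-in for response.headers, filled by hand with assumed values
headers = HTTPMessage()
headers["Content-Type"] = "application/json; charset=utf-8"
headers["Content-Length"] = "1370"

# Header names are matched case-insensitively
same_value = headers["content-type"] == headers["CONTENT-TYPE"]

# The media type and the charset parameter can be read separately
media_type = headers.get_content_type()
charset = headers.get_content_charset()

# items() yields (name, value) pairs, the same shape getheaders() returns
as_dict = dict(headers.items())

print(same_value, media_type, charset)
```

When a response declares its charset this way, the value returned by get_content_charset can be passed to the decode method instead of hard-coding 'utf-8'.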

Getting the status of the response

Getting the status code and reason phrase returned by the server after an HTTP request is also very easy: all we have to do is access the status and reason attributes of the HTTPResponse object:

>>> response.status
200
>>> response.reason
'OK'
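The reason attribute holds whatever phrase the server sent. If only a numeric code is available, the standard phrase can be looked up with the http.HTTPStatus enumeration from the standard library. The describe_status helper below is a hypothetical name used only for this sketch.

```python
from http import HTTPStatus


def describe_status(code):
    """Return 'code phrase' for a numeric status code, using the standard table."""
    status = HTTPStatus(code)
    return "{} {}".format(status.value, status.phrase)


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

Members of HTTPStatus compare equal to plain integers, so a check such as response.status == HTTPStatus.OK is also possible.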

Including variables in the GET request

The URL of the request we sent above contained only one variable: api_key, and its value was "DEMO_KEY". If we want to pass multiple variables, instead of attaching them to the URL manually, we can provide them and their associated values as key-value pairs of a python dictionary (or as a sequence of two-element tuples); this dictionary will be passed to the urllib.parse.urlencode function, which will build and return the query string. The API call we used above allows us to specify an optional "date" variable, to retrieve the picture associated with a specific day. Here is how we could proceed:

>>> from urllib.parse import urlencode
>>> query_params = {
...     "api_key": "DEMO_KEY",
...     "date": "2019-04-11"
... }
>>> query_string = urlencode(query_params)
>>> query_string
'api_key=DEMO_KEY&date=2019-04-11'

First we defined each variable and its corresponding value as key-value pairs of a dictionary, then we passed said dictionary as an argument to the urlencode function, which returned a formatted query string. Now, when sending the request, all we have to do is attach it to the URL:

>>> url = "?".join(["https://api.nasa.gov/planetary/apod", query_string])

If we send the request using the URL above, we obtain a different response and a different image:


{'date': '2019-04-11',
 'explanation': 'What does a black hole look like? To find out, radio '
                'telescopes from around the Earth coordinated observations of '
                'black holes with the largest known event horizons on the '
                'sky.  Alone, black holes are just black, but these monster '
                'attractors are known to be surrounded by glowing gas.  The '
                'first image was released yesterday and resolved the area '
                'around the black hole at the center of galaxy M87 on a scale '
                'below that expected for its event horizon.  Pictured, the '
                'dark central region is not the event horizon, but rather the '
                "black hole's shadow -- the central region of emitting gas "
                "darkened by the central black hole's gravity. The size and "
                'shape of the shadow is determined by bright gas near the '
                'event horizon, by strong gravitational lensing deflections, '
                "and by the black hole's spin.  In resolving this black hole's "
                'shadow, the Event Horizon Telescope (EHT) bolstered evidence '
                "that Einstein's gravity works even in extreme regions, and "
                'gave clear evidence that M87 has a central spinning black '
                'hole of about 6 billion solar masses.  The EHT is not done -- '
                'future observations will be geared toward even higher '
                'resolution, better tracking of variability, and exploring the '
                'immediate vicinity of the black hole in the center of our '
                'Milky Way Galaxy.',
 'hdurl': 'https://apod.nasa.gov/apod/image/1904/M87bh_EHT_2629.jpg',
 'media_type': 'image',
 'service_version': 'v1',
 'title': 'First Horizon-Scale Image of a Black Hole',
 'url': 'https://apod.nasa.gov/apod/image/1904/M87bh_EHT_960.jpg'}


In case you didn't notice, the returned image URL points to the recently unveiled first picture of a black hole:

The picture returned by the API call - The First image of a black hole
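The query-string construction described in this section can be wrapped in a small helper. This is just a sketch: build_url is a hypothetical name, the "q" variable is made up to show escaping, and the helper assumes the base URL does not already contain a query string.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base_url, params):
    """Append the url-encoded params to base_url (assumes it has no query yet)."""
    return "{}?{}".format(base_url, urlencode(params))


# A dictionary and a sequence of two-element tuples give the same query string
from_dict = build_url(
    "https://api.nasa.gov/planetary/apod",
    {"api_key": "DEMO_KEY", "date": "2019-04-11"},
)
from_pairs = build_url(
    "https://api.nasa.gov/planetary/apod",
    [("api_key", "DEMO_KEY"), ("date", "2019-04-11")],
)

# Characters that are not safe in a URL are escaped automatically
escaped = urlencode({"q": "black hole & friends"})
print(escaped)  # q=black+hole+%26+friends

# parse_qs reverses the process; every value comes back as a list
parsed = parse_qs(urlsplit(from_dict).query)
print(parsed)  # {'api_key': ['DEMO_KEY'], 'date': ['2019-04-11']}
```

Letting urlencode do the escaping is the main reason to prefer it over concatenating the variables by hand.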

Sending a POST request

Sending a POST request, with variables 'contained' inside the request body, using the standard library, requires additional steps. First of all, as we did before, we construct the POST data in the form of a dictionary:

>>> data = {
...    "variable1": "value1",
...    "variable2": "value2"
... }

After we constructed our dictionary, we want to use the urlencode function as we did before, and additionally encode the resulting string in ascii:

>>> post_data = urlencode(data).encode('ascii')

Finally, we can send our request, passing the data as the second argument of the urlopen function. In this case we will use https://httpbin.org/post as destination URL (httpbin.org is a request & response service):

>>> with urlopen("https://httpbin.org/post", post_data) as response:
...    json_response = json.load(response)
>>> pprint(json_response)
{'args': {},
 'data': '',
 'files': {},
 'form': {'variable1': 'value1', 'variable2': 'value2'},
 'headers': {'Accept-Encoding': 'identity',
             'Content-Length': '33',
             'Content-Type': 'application/x-www-form-urlencoded',
             'Host': 'httpbin.org',
             'User-Agent': 'Python-urllib/3.7'},
 'json': None,
 'origin': 'xx.xx.xx.xx, xx.xx.xx.xx',
 'url': 'https://httpbin.org/post'}

The request was successful, and the server returned a JSON response which includes information about the request we made. As you can see, the variables we passed in the body of the request are reported as the value of the 'form' key in the response body. Reading the value of the 'headers' key, we can also see that the content type of the request was application/x-www-form-urlencoded and the user agent was 'Python-urllib/3.7'.
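The body built in the steps above can be inspected before anything is sent. The sketch below reuses the same dictionary and wraps the data in a Request object (introduced in the next section) only because creating one does not contact the server.

```python
from urllib.parse import urlencode
from urllib.request import Request

data = {"variable1": "value1", "variable2": "value2"}
post_data = urlencode(data).encode("ascii")

# The body is a bytes object; its length is what appears as Content-Length
print(post_data)       # b'variable1=value1&variable2=value2'
print(len(post_data))  # 33

# Creating a Request sends nothing, so the verb can be checked offline
req = Request("https://httpbin.org/post", data=post_data)
print(req.get_method())  # POST
```

The length of 33 bytes matches the 'Content-Length' value echoed back by httpbin.org in the response shown above.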

Sending JSON data in the request

What if we want to send a JSON representation of data with our request? First we define the structure of the data as a dictionary; we will convert it to JSON when we build the request:

>>> person = {
...     "firstname": "Luke",
...     "lastname": "Skywalker",
...     "title": "Jedi Knight"
... }

We also want to use a dictionary to define custom headers. In this case, for example, we want to specify that our request content is application/json:

>>> custom_headers = {
...    "Content-Type": "application/json"
... }

Finally, instead of sending the request directly, we create a Request object and we pass, in order: the destination URL, the request data and the request headers as arguments of its constructor:

>>> from urllib.request import Request
>>> req = Request(
...    "https://httpbin.org/post",
...    json.dumps(person).encode('ascii'),
...    custom_headers
... )

One important thing to notice is that we passed the dictionary containing our data to the json.dumps function: this function serializes an object into a JSON formatted string, which we then turned into bytes with the encode method.




At this point we can send our Request, passing it as the first argument of the urlopen function:

>>> with urlopen(req) as response:
...    json_response = json.load(response)

Let's check the content of the response:

>>> pprint(json_response)
{'args': {},
 'data': '{"firstname": "Luke", "lastname": "Skywalker", "title": "Jedi '
         'Knight"}',
 'files': {},
 'form': {},
 'headers': {'Accept-Encoding': 'identity',
             'Content-Length': '70',
             'Content-Type': 'application/json',
             'Host': 'httpbin.org',
             'User-Agent': 'Python-urllib/3.7'},
 'json': {'firstname': 'Luke', 'lastname': 'Skywalker', 'title': 'Jedi Knight'},
 'origin': 'xx.xx.xx.xx, xx.xx.xx.xx',
 'url': 'https://httpbin.org/post'}

This time we can see that the dictionary associated with the "form" key in the response body is empty, and the one associated with "json" key represents the data we sent as JSON. As you can observe, even the custom header parameter we sent has been received correctly.
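The steps of this section can be collected in a single function. What follows is a sketch under a few assumptions: build_json_request is a hypothetical helper name, and the body is encoded as UTF-8 instead of ASCII (with the default json.dumps settings the output contains only ASCII characters, so the resulting bytes are the same).

```python
import json
from urllib.request import Request


def build_json_request(url, payload, method=None):
    """Return a Request whose body is payload serialized as UTF-8 JSON."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method=method)


person = {"firstname": "Luke", "lastname": "Skywalker", "title": "Jedi Knight"}
req = build_json_request("https://httpbin.org/post", person)

# Nothing is sent until the Request is passed to urlopen, so it can be inspected
print(len(req.data))  # 70
# Request stores header names capitalized, hence "Content-type" in the lookup
print(req.get_header("Content-type"))  # application/json
print(json.loads(req.data) == person)  # True
```

The object returned by the helper can be passed to urlopen exactly like the Request built by hand above.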

Sending a request with an HTTP verb other than GET or POST

When interacting with APIs we may need to use HTTP verbs other than just GET or POST. To accomplish this task we must use the method parameter, the last one of the Request class constructor, and specify the verb we want to use. If it is omitted, the verb is GET when the data parameter is None, and POST otherwise. Suppose we want to send a PUT request:

>>> req = Request(
...    "https://httpbin.org/put",
...    json.dumps(person).encode('ascii'),
...    custom_headers,
...    method='PUT'
... )
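How the verb is chosen can be verified with the get_method method of the Request object, without sending anything. In this sketch the request body is made up for illustration.

```python
from urllib.request import Request

body = b'{"title": "Jedi Master"}'  # made-up request body

no_body = Request("https://httpbin.org/put")
with_body = Request("https://httpbin.org/put", data=body)
explicit = Request("https://httpbin.org/put", data=body, method="PUT")

print(no_body.get_method())    # GET
print(with_body.get_method())  # POST
print(explicit.get_method())   # PUT
```

Only the last object would actually result in a PUT request once passed to urlopen, as we did in the previous section.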

Downloading a file

Another very common operation we may want to perform is to download some kind of file from the web. Using the standard library there are two ways to do it. The first is to use the urlopen function, reading the response in chunks (especially if the file to download is big) and writing them to a local file "manually". The second is to use the urlretrieve function, which, as stated in the official documentation, is considered part of an old interface and might become deprecated in the future. Let's see an example of both strategies.

Downloading a file using urlopen

Say we want to download the tarball containing the latest version of the Linux kernel source code. Using the first method we mentioned above, we write:

>>> latest_kernel_tarball = "https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.7.tar.xz"
>>> with urlopen(latest_kernel_tarball) as response:
...    with open('latest-kernel.tar.xz', 'wb') as tarball:
...        while True:
...            chunk = response.read(16384)
...            if chunk:
...                tarball.write(chunk)
...            else:
...                break

In the example above we used both the urlopen function and the open function inside with statements, relying on the context-management protocol to ensure that resources are released as soon as the block of code where they are used has been executed. Inside the while loop, at each iteration, the chunk variable references the bytes read from the response (at most 16384 in this case, i.e. 16 kibibytes). If chunk is not empty, we write its content to the file object ("tarball"); if it is empty, we have consumed the whole response body, so we break the loop.

A more concise solution involves the shutil module and its copyfileobj function, which copies data from a file-like object (in this case "response") to another file-like object (in this case "tarball"). The buffer size can be specified using the third argument of the function; its default was 16384 bytes in the Python 3.7 release used here and may differ in other versions:

>>> import shutil
>>> with urlopen(latest_kernel_tarball) as response:
...    with open('latest-kernel.tar.xz', 'wb') as tarball:
...        shutil.copyfileobj(response, tarball)
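The two approaches of this section can be compared without downloading anything. The sketch below is only illustrative: copy_in_chunks is a hypothetical helper, and io.BytesIO objects with made-up data stand in for both the response and the local file.

```python
import io
import shutil


def copy_in_chunks(source, destination, chunk_size=16384):
    """Copy source to destination chunk_size bytes at a time; return the chunk count."""
    chunks = 0
    # iter() with a sentinel keeps calling read() until it returns b""
    for chunk in iter(lambda: source.read(chunk_size), b""):
        destination.write(chunk)
        chunks += 1
    return chunks


payload = b"x" * 40000  # made-up data standing in for a response body

manual_copy = io.BytesIO()
chunk_count = copy_in_chunks(io.BytesIO(payload), manual_copy)

shutil_copy = io.BytesIO()
shutil.copyfileobj(io.BytesIO(payload), shutil_copy, 16384)

print(chunk_count)  # 3
print(manual_copy.getvalue() == shutil_copy.getvalue() == payload)  # True
```

With 40000 bytes and a 16384-byte buffer the loop runs three times: two full chunks and a final, shorter one.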


Downloading a file using the urlretrieve function

The alternative and even more concise method to download a file using the standard library is the urllib.request.urlretrieve function. The function takes four arguments, but only the first two interest us now: the first is mandatory, and is the URL of the resource to download; the second is the name used to store the resource locally. If it is not given, the resource is stored as a temporary file with a generated name (under /tmp on a typical Linux system). The code becomes:

>>> from urllib.request import urlretrieve
>>> urlretrieve("https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.7.tar.xz", "latest-kernel.tar.xz")
('latest-kernel.tar.xz', <http.client.HTTPMessage object at 0x7f414a4c9358>)

Very simple, isn't it? The function returns a tuple which contains the name used to store the file (this is useful when the resource is stored as a temporary file, and the name is a randomly generated one), and the HTTPMessage object which holds the headers of the HTTP response.
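To try the function without downloading a large file, the returned tuple can be examined using a local file. This is a sketch based on one assumption worth stating: a file:// URL built from a temporary file takes the place of the remote resource, so the headers object returned here is not the one an HTTP server would produce. The file names are made up.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "source.bin"
    source.write_bytes(b"sample content")
    destination = Path(workdir) / "copy.bin"

    # file:// URL built from the local path; a real download would use https://
    saved_as, headers = urlretrieve(source.as_uri(), str(destination))

    # The first element of the tuple is the name we asked for
    same_name = saved_as == str(destination)
    copied = destination.read_bytes()

print(same_name)  # True
print(copied)     # b'sample content'
```

As described above, the first element of the tuple is simply the file name passed as second argument.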

Conclusions

In this first part of the series of articles dedicated to python and HTTP requests, we saw how to send various types of requests using only standard library functions, and how to work with responses. If you have doubts or want to explore things more in depth, please consult the official urllib.request documentation. The next part of the series will focus on the Python HTTP requests library.
