What is a Web Request

Posted on 10 April, 2020 at 12:53 CEST by Paul DiGian

What is a Web Request is part of Junior2Senior, a course to help you grow as a software engineer.

It is about passing on the knowledge I have accumulated over my career to younger engineers, to help speed up their own careers.

The web, and the internet at large, are fundamental to today’s world, but how do they work under the hood?

In this article, we will talk about HTTP requests, the fundamental building blocks that make the modern web possible.

We will explore how tools like Python's requests, Ruby's Net::HTTP, Go's net/http package, or any similar library in any other language work.

The Aha moment

Reasoning about the topic becomes extremely simple as soon as we figure out what an HTTP request is, and the reality is extremely simple.

An HTTP request is nothing more than a stream of bytes, almost always ASCII bytes, that can be read and interpreted. To put it even more simply, it is just a string formatted according to some rules.

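The one below is already a complete HTTP request, and this is the most important concept: an HTTP request is nothing more than a string of text similar to this one. (Strictly speaking, HTTP/2 encodes the same information in binary frames on the wire; what follows is the textual form that tools like curl display.)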

GET / HTTP/2
Host: www.google.com
User-Agent: curl/7.58.0
Accept: */*
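
To make the "it is just a string" idea concrete, here is a minimal sketch of the same request written as a Python string and encoded to bytes. It uses HTTP/1.1 instead of HTTP/2, which is our substitution: HTTP/1.1 is the version that is sent over the network in exactly this textual form. The variable names are only illustrative.

```python
# The request above, written out with explicit line endings.
# Every line ends with CRLF ("\r\n"); an empty line closes the headers.
# Assumption: HTTP/1.1 replaces HTTP/2, because HTTP/1.1 is sent as plain text.
raw_request = (
    "GET / HTTP/1.1\r\n"
    "Host: www.google.com\r\n"
    "User-Agent: curl/7.58.0\r\n"
    "Accept: */*\r\n"
    "\r\n"
)

# These ASCII bytes are what would travel over the network.
request_bytes = raw_request.encode("ascii")
print(request_bytes)
```

Nothing here talks to the network yet; the point is only that the whole request fits in an ordinary string.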

An HTTP response is not that different:

HTTP/2 200
date: Fri, 10 Apr 2020 11:25:36 GMT
expires: -1
cache-control: private, max-age=0
content-type: text/html; charset=ISO-8859-1
p3p: CP="This is not a P3P policy! See g.co/p3phelp for more info."
server: gws
x-xss-protection: 0
x-frame-options: SAMEORIGIN
set-cookie: 1P_JAR=2020-04-10-11; expires=Sun, 10-May-2020 11:25:36 GMT; path=/; domain=.google.com; Secure
set-cookie: NID=202=lXbpl7jzwsRDvcSyw84CtGB7NO3J2HziT0SjF24N4joVsoUzXNRdc03yTeckZu2zQXc8TJty73IYg9ktX3yrtSb59lC1-jxyTprH_wGly4D2RiFC4Ww1T2Om69YYjxDtkgEDmQbqoYYyzahBQowvSM-q5JpF6hoC-gzLRTnnn38; expires=Sat, 10-Oct-2020 11:25:36 GMT; path=/; domain=.google.com; HttpOnly
alt-svc: quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,h3-T050=":443"; ma=2592000
accept-ranges: none
vary: Accept-Encoding

<!doctype html><html> ...THE_HTML_OF_THE_GOOGLE_HOMEPAGE... </html>

All those libraries and frameworks are just ways to create, interpret, and send strings like these over the network.

The HTTP rules and protocol

The rules for creating a correct HTTP request are laid out in RFC 7230. It can be interesting to explore the document and understand at least some of the details, but for most developers that would be overkill. However, if your work revolves around web infrastructure, it may be necessary.

For instance, Section 3 of RFC 7230 formally describes what an HTTP message is.

HTTP-message = start-line
               *( header-field CRLF )
               CRLF
               [ message-body ]

This definition means that an HTTP-message is nothing more than a start-line, followed by zero or more header-fields, each followed by a CRLF (a new line, \r\n), then another CRLF and an optional message-body.
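
The grammar can be followed almost literally in code. The sketch below splits a raw HTTP/1.x message into its three parts; the function name split_http_message is hypothetical, and the code assumes a well-formed message with an ASCII head. A real parser has to handle many more edge cases.

```python
def split_http_message(raw: bytes):
    """Split a raw HTTP/1.x message into (start_line, headers, body)."""
    # The first empty line (CRLF CRLF) separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    start_line, header_lines = lines[0], lines[1:]

    # Each header-field has the form "name: value".
    headers = []
    for line in header_lines:
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return start_line, headers, body


raw = (
    b"GET / HTTP/1.1\r\n"
    b"Host: www.google.com\r\n"
    b"Accept: */*\r\n"
    b"\r\n"
)
start_line, headers, body = split_http_message(raw)
print(start_line)
print(headers)
print(body)
```

The same function works for a response, since requests and responses share this overall layout and differ only in the start-line.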

If we keep reading the RFC, we discover what a start-line is in Section 3.1, and so on.

It turns out that the start-line of a request specifies the method of the request: GET, HEAD, POST, PUT, DELETE, CONNECT, OPTIONS, or TRACE, all documented in RFC 7231 Section 4. Most developers just need to be aware of the possible methods; they won’t use all of them.

Then it specifies the target of the request, in our case the root path /.

And finally, it specifies what protocol version to use in the communication. In this case we use HTTP/2; another common one is HTTP/1.1.
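
As a small illustration, the three parts of a request start-line can be pulled apart with a single split. This is only a sketch: parse_request_line and KNOWN_METHODS are names we made up, and the set contains only the methods listed in this article.

```python
# Methods mentioned in the text (RFC 7231, Section 4).
KNOWN_METHODS = {"GET", "HEAD", "POST", "PUT", "DELETE", "CONNECT", "OPTIONS", "TRACE"}


def parse_request_line(start_line: str):
    """Split a request start-line into (method, target, version)."""
    parts = start_line.split(" ")
    if len(parts) != 3:
        raise ValueError(f"malformed start-line: {start_line!r}")
    method, target, version = parts
    if method not in KNOWN_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    return method, target, version


method, target, version = parse_request_line("GET / HTTP/2")
print(method)   # GET
print(target)   # /
print(version)  # HTTP/2
```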

Keeping state, authentication, and cookies

HTTP is mostly stateless. Nothing in the protocol forbids a client from requesting somebody else's private records on a social network; after all, an HTTP request is just a string formatted following some rules.

However, the server needs to reject those requests, which means it has to recognize who is making them.

To overcome the limitation imposed by being a stateless protocol, HTTP relies on the concept of cookies.

An HTTP response, generated by an HTTP server, can include the Set-Cookie header (the name is case-insensitive; the request above to google.com returned set-cookie). The Set-Cookie header instructs the user-agent (the client, with some level of approximation) to store the value of the cookie and to send it back to the server with any subsequent request using the Cookie header.

Another RFC (6265) covers this topic extensively.

For most developers, it is sufficient to know that the Set-Cookie header provides some options to tweak how the client manages the cookie.

For instance, the Path attribute indicates that the cookie should be sent only for requests against a specific URL path.

The Secure attribute indicates that the cookie should be sent only over secure connections.

Finally, the HttpOnly attribute indicates that the cookie should not be accessible outside of the HTTP protocol; for instance, it should not be readable through the JavaScript API in the browser.
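
Python's standard library can parse these headers for us. The sketch below uses http.cookies.SimpleCookie to read a Set-Cookie value and then builds the Cookie header a client would send back. The cookie name, value, and path are made up for the example, and a real client would also check the attributes before deciding whether to send the cookie.

```python
from http.cookies import SimpleCookie

# Hypothetical value of a Set-Cookie header, as a server might send it.
set_cookie_value = "session=abc123; Path=/account; Secure; HttpOnly"

jar = SimpleCookie()
jar.load(set_cookie_value)

morsel = jar["session"]
print(morsel.value)               # abc123
print(morsel["path"])             # /account
print(bool(morsel["secure"]))     # True
print(bool(morsel["httponly"]))   # True

# The header the client would attach to later requests.
cookie_header = "Cookie: " + "; ".join(
    f"{name}={m.value}" for name, m in jar.items()
)
print(cookie_header)              # Cookie: session=abc123
```

Note that only the name and value travel back to the server; the attributes are instructions for the client.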

Closing thoughts

While the protocol itself is rather simple (after all, it is just about concatenating strings and keeping track of cookies), the necessities of modern software forced us to build complex systems to extract as much performance as possible while keeping an ergonomic interface.

For instance, developers often want to make several requests in parallel while sharing some cookies, but not all of them.

All these real-world necessities and use cases make the software explode in complexity, but also in usefulness.

However, HTTP remains a simple protocol, and the CPython codebase offers a glimpse of this simplicity.

request = '%s %s %s' % (method, url, self._http_vsn_str)
self._output(self._encode_request(request))

This code creates the start-line of an HTTP message, just like we did at the very beginning of this post with GET / HTTP/2.

Exercises

Write a function that creates HTTP requests. As input, expect the method (GET, POST, etc.), the host, and the URL of the resource. Can you expand this little function to also support headers? And how would you manage cookies?

In the next section, we will discover DNS, sockets, and TCP, so that it will be possible to actually send a request and read back the response.

Bonus Trivia

According to the HTTP/1.1 standard, headers are strictly optional. Indeed, they are defined as *( header-field CRLF ), where the star (*) means zero or more repetitions. However, the same standard dictates that the Host header must be present.

This clear inconsistency was introduced to keep backward compatibility, so that every HTTP/1.1 request is also an HTTP/1.0 request.

This is a clear example of how the world evolves through mistakes and errors, and of how even something as widespread and used as "the internet" was designed with some initial mistakes. After all, we are human, and the success of HTTP was far from certain.
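
The gap between the grammar and the extra rule is easy to see in code. In the sketch below, both requests satisfy the message grammar, but only one carries a Host header. The helper has_host_header is a hypothetical name, and it assumes a well-formed ASCII request head.

```python
def has_host_header(raw_request: bytes) -> bool:
    """Return True if the request head contains a Host header field."""
    head = raw_request.partition(b"\r\n\r\n")[0].decode("ascii")
    header_lines = head.split("\r\n")[1:]  # skip the start-line
    # Header names are case-insensitive, so compare in lower case.
    return any(
        line.partition(":")[0].strip().lower() == "host"
        for line in header_lines
    )


# Zero headers: allowed by the grammar, but missing the required Host.
without_host = b"GET / HTTP/1.1\r\n\r\n"
with_host = b"GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n"

print(has_host_header(without_host))  # False
print(has_host_header(with_host))     # True
```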