What is a Web Request
Posted by Paul DiGian
What is a Web Request is part of Junior2Senior, a course to help you grow as a software engineer.
It is about passing on the knowledge I have accumulated over my career to younger engineers, to help speed up their own careers.
You can also follow me on Twitter: @DigianPaul
The book is currently on pre-sale at 50% off, for only $19.
The web, and the internet at large, are fundamental to today's world, but how do they work under the hood?
In this article, we will talk about HTTP requests, the fundamental building blocks that make the modern web possible.
We will explore how tools like Python requests, Ruby Net::HTTP, the Go package net/http, or any other library you may use in any other language work.
The Aha moment
Reasoning about the topic becomes much easier once we figure out what an HTTP request actually is, and the reality is extremely simple.
An HTTP request is nothing more than a stream of bytes, almost always ASCII bytes, that can be read and interpreted. To put it even more simply, it is just a string formatted according to some rules.
The one below is already a complete HTTP request, and this is the most important concept: an HTTP request is nothing more than lines of text similar to these.
GET / HTTP/2
Host: www.google.com
User-Agent: curl/7.58.0
Accept: */*
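To make the "just a string" idea concrete, here is a minimal Python sketch that writes the same request as a string literal and encodes it to bytes. Two details are assumptions not visible in the listing above: the explicit \r\n line endings (with an empty line closing the header section), and the HTTP/1.1 version, chosen because HTTP/1.1 is the version that puts this text on the wire as-is.

```python
# The request from the listing, written as a plain Python string.
# Assumption: HTTP/1.1 and explicit CRLF line endings (see the lead-in).
request_text = (
    "GET / HTTP/1.1\r\n"
    "Host: www.google.com\r\n"
    "User-Agent: curl/7.58.0\r\n"
    "Accept: */*\r\n"
    "\r\n"  # the empty line marks the end of the headers
)

# What actually travels over the network is the byte form of that string.
request_bytes = request_text.encode("ascii")
print(request_bytes)
```

Nothing else is needed to describe the request; sending it is a separate problem.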
An HTTP response is not that different:
HTTP/2 200
date: Fri, 10 Apr 2020 11:25:36 GMT
expires: -1
cache-control: private, max-age=0
content-type: text/html; charset=ISO-8859-1
p3p: CP="This is not a P3P policy! See g.co/p3phelp for more info."
server: gws
x-xss-protection: 0
x-frame-options: SAMEORIGIN
set-cookie: 1P_JAR=2020-04-10-11; expires=Sun, 10-May-2020 11:25:36 GMT; path=/; domain=.google.com; Secure
set-cookie: NID=202=lXbpl7jzwsRDvcSyw84CtGB7NO3J2HziT0SjF24N4joVsoUzXNRdc03yTeckZu2zQXc8TJty73IYg9ktX3yrtSb59lC1-jxyTprH_wGly4D2RiFC4Ww1T2Om69YYjxDtkgEDmQbqoYYyzahBQowvSM-q5JpF6hoC-gzLRTnnn38; expires=Sat, 10-Oct-2020 11:25:36 GMT; path=/; domain=.google.com; HttpOnly
alt-svc: quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,h3-T050=":443"; ma=2592000
accept-ranges: none
vary: Accept-Encoding
<!doctype html><html> ...THE_HTML_OF_THE_GOOGLE_HOMEPAGE... </html>
All those libraries and frameworks are just a way to create, interpret and send over the network strings that look like this.
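As a rough illustration of the "interpret" part, the sketch below splits a response into status line, headers, and body using nothing but string operations. The function name parse_response and the shortened sample response are made up for this example; real libraries handle many more cases (chunked bodies, malformed input, and so on).

```python
def parse_response(raw: bytes):
    """Split a raw text-form HTTP response into its parts (simplified sketch)."""
    # The header section ends at the first empty line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line: version, status code, and an optional reason phrase.
    version, status, *_reason = lines[0].split(" ", 2)

    # A list of pairs keeps repeated headers such as set-cookie.
    headers = []
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case insensitive, so normalise them to lower case.
        headers.append((name.strip().lower(), value.strip()))
    return version, int(status), headers, body


# Hypothetical, shortened response used only for this example.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=ISO-8859-1\r\n"
    b"Server: gws\r\n"
    b"\r\n"
    b"<!doctype html><html></html>"
)
version, status, headers, body = parse_response(sample)
print(version, status)
print(headers)
print(body)
```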
The HTTP rules and protocol
The rules for building a correct HTTP request are laid out in RFC 7230. It can be interesting to explore the document and understand at least some of the details, but for most developers that is overkill. However, if your work revolves around web infrastructure, it can be necessary.
For instance, Section 3 of RFC 7230 formally describes what an HTTP message is:
HTTP-message = start-line
*( header-field CRLF )
CRLF
[ message-body ]
This definition means that an HTTP-message is nothing more than a start-line, followed by zero or more header-field entries, each followed by a CRLF (a new line, \r\n), then another CRLF and an optional message-body.
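The grammar maps almost one-to-one onto string concatenation. Here is a minimal sketch, assuming a hypothetical helper named build_message that takes the start-line, a list of (name, value) header pairs, and an optional body:

```python
CRLF = "\r\n"


def build_message(start_line, header_fields=(), body=b""):
    """Assemble an HTTP-message following the grammar quoted above (sketch)."""
    # start-line (RFC 7230 Section 3.1 defines it as ending with its own CRLF)
    text = start_line + CRLF
    # *( header-field CRLF )
    for name, value in header_fields:
        text += f"{name}: {value}" + CRLF
    # CRLF: the empty line that closes the header section
    text += CRLF
    # [ message-body ]
    return text.encode("ascii") + body


message = build_message(
    "GET / HTTP/1.1",
    [("Host", "www.google.com"), ("Accept", "*/*")],
)
print(message)
```

Each line of the function corresponds to one line of the grammar, which is the whole point: there is no hidden machinery in the message format itself.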
If we keep reading the RFC, we discover what a start-line is in Section 3.1, and so on.
It turns out that the start-line of a request specifies the method of the request: GET, HEAD, POST, PUT, DELETE, CONNECT, OPTIONS, or TRACE, all documented in RFC 7231 Section 4.
Most developers just need to be aware of the possible methods; they won't use all of them.
Then it specifies the target of the request, in our case the root path /.
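Finally, it specifies which protocol version to use for the communication. In this case we use HTTP/2; another common one is HTTP/1.1. Strictly speaking, only HTTP/1.x sends this text as-is on the wire: HTTP/2 carries the same information in a binary framing, and tools such as curl display it in the familiar text form.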
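Taking a request start-line apart is just as mechanical. The sketch below, built around a hypothetical parse_request_line helper, splits the line on single spaces and checks the method against the eight names listed above. It is deliberately strict and is not meant to mirror how any particular server does it.

```python
KNOWN_METHODS = {
    "GET", "HEAD", "POST", "PUT", "DELETE", "CONNECT", "OPTIONS", "TRACE",
}


def parse_request_line(line):
    """Split a request start-line into (method, target, version)."""
    parts = line.split(" ")
    if len(parts) != 3:
        raise ValueError(f"expected 3 space-separated parts, got {len(parts)}")
    method, target, version = parts
    if method not in KNOWN_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    if not version.startswith("HTTP/"):
        raise ValueError(f"unexpected protocol version: {version!r}")
    return method, target, version


print(parse_request_line("GET / HTTP/2"))
```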
Keeping state, authentication, and cookie
The HTTP protocol is mostly stateless. Nothing in the protocol itself forbids a client from requesting somebody's private records on a social network; after all, an HTTP request is just a string formatted following some rules.
It is the server that needs to refuse those requests.
To overcome the limitations of being a stateless protocol, HTTP relies on the concept of cookies.
An HTTP response, generated by an HTTP server, can include the Set-Cookie header (the name is case insensitive; the request above to google.com returned set-cookie).
The Set-Cookie header instructs the user-agent (the client, with some level of approximation) to store the value of the cookie and to send it back to the server with any subsequent request, using the Cookie header.
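The round trip can be sketched in a few lines. In this simplified example, the names remember_cookie and cookie_header and the plain dictionary used as a cookie jar are hypothetical, and everything after the first name=value pair is ignored.

```python
def remember_cookie(jar, set_cookie_value):
    """Store the name=value pair from a Set-Cookie header value in jar."""
    # Only the part before the first ';' carries the cookie itself.
    pair = set_cookie_value.split(";", 1)[0]
    name, _, value = pair.partition("=")
    jar[name.strip()] = value.strip()


def cookie_header(jar):
    """Build the Cookie header line to send with the next request."""
    return "Cookie: " + "; ".join(f"{name}={value}" for name, value in jar.items())


jar = {}
remember_cookie(
    jar,
    "1P_JAR=2020-04-10-11; expires=Sun, 10-May-2020 11:25:36 GMT; "
    "path=/; domain=.google.com; Secure",
)
remember_cookie(jar, "theme=dark; path=/")  # hypothetical second cookie
print(cookie_header(jar))
```

The server never stores anything on the client directly; it only asks, and the client chooses to echo the value back.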
Another RFC (6265) covers this topic extensively.
For most developers, it is sufficient to know that the Set-Cookie header provides some options to tweak how the client manages the cookie.
For instance, the Path attribute indicates that the cookie should be sent only for requests against a specific URL path.
The Secure attribute indicates that the cookie should not be sent over insecure connections.
Finally, the HttpOnly attribute indicates that the cookie should not be accessible outside the HTTP protocol; for instance, it should not be readable through the JS API in the browser.
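How a client might honour these three attributes can be sketched as follows. The helper names are hypothetical, the path comparison follows my reading of the path-match rule in RFC 6265, the default path of / is a simplification, and other attributes such as Domain and Expires are left out entirely.

```python
def parse_attributes(set_cookie_value):
    """Return the attributes of a Set-Cookie value as a dict with lower-case keys."""
    attributes = {}
    for part in set_cookie_value.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key:
            # Flag attributes such as Secure and HttpOnly carry no value.
            attributes[key.lower()] = value if value else True
    return attributes


def should_send(attributes, request_path, connection_is_secure):
    """Decide whether a cookie goes into a request (Path and Secure only)."""
    if attributes.get("secure") and not connection_is_secure:
        return False
    cookie_path = attributes.get("path")
    if not isinstance(cookie_path, str):
        cookie_path = "/"  # simplification: treat a missing Path as the root
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # "/account" must match "/account/x" but not "/accounting".
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


def readable_from_scripts(attributes):
    """HttpOnly cookies should stay hidden from the browser's JS API."""
    return not attributes.get("httponly", False)


attrs = parse_attributes("id=1; Path=/account; Secure; HttpOnly")
print(should_send(attrs, "/account/settings", connection_is_secure=True))
print(should_send(attrs, "/account/settings", connection_is_secure=False))
print(readable_from_scripts(attrs))
```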
Closing thoughts
While the protocol itself is rather simple (after all, it is just about concatenating strings and keeping track of cookies), the needs of modern software forced us to build complex systems to extract as much performance as possible while keeping an ergonomic interface.
Developers will most likely want to make several requests in parallel, perhaps sharing some cookies but not all.
All these real-world needs and use cases make the software explode in complexity, but also in usefulness.
However, HTTP remains a simple protocol, and the CPython codebase offers a glimpse of this simplicity.
request = '%s %s %s' % (method, url, self._http_vsn_str)
self._output(self._encode_request(request))
It is creating the start-line of an HTTP message, just like we did at the very beginning of this post with GET / HTTP/2.
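The same formatting can be tried in isolation. In this sketch the three values are plain local variables standing in for the ones used in the excerpt; 'HTTP/1.1' is, to my knowledge, the version string http.client uses, and the ASCII encoding step is an assumption about what happens before the bytes are sent.

```python
method = "GET"
url = "/"
http_vsn_str = "HTTP/1.1"  # stand-in for self._http_vsn_str

# Same format string as in the excerpt: three fields separated by spaces.
request = '%s %s %s' % (method, url, http_vsn_str)
print(request)

# The text has to become bytes before it can go over a socket.
encoded = request.encode("ascii")
print(encoded)
```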
Exercises
Create a function to create HTTP requests.
As input, you can expect the method (GET, POST, etc.), the host, and the URL of the resource.
Can you expand the little function to also support headers?
And how would you manage cookies?
In the next section, we will discover DNS, sockets, and TCP, so it will be possible to actually send a request and read back the response.
Bonus Trivia
According to the HTTP/1.1 standard, the headers are strictly optional; indeed, they are defined as *( header-field CRLF ), where the star (*) means zero or more repetitions.
However, the same standard dictates that the Host header must be present.
This clear inconsistency was introduced to keep backward compatibility, so that every HTTP/1.1 request is also an HTTP/1.0 request.
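The gap between the grammar and the rule can be shown with a small check. In this sketch the function name has_required_host is hypothetical and headers are given as (name, value) pairs. A request with no headers at all satisfies the grammar, yet fails the Host requirement when the version is HTTP/1.1.

```python
def has_required_host(version, header_fields):
    """Check the Host rule discussed above (simplified sketch)."""
    if version != "HTTP/1.1":
        # The requirement discussed here is specific to HTTP/1.1.
        return True
    # Header names are case insensitive.
    return any(name.lower() == "host" for name, _ in header_fields)


print(has_required_host("HTTP/1.1", []))
print(has_required_host("HTTP/1.1", [("Host", "www.google.com")]))
print(has_required_host("HTTP/1.0", []))
```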
This is a clear example of how the world evolves through mistakes and errors, and of how even something as widespread and widely used as "the internet" was designed with some initial mistakes. After all, we are human, and the success of HTTP was not guaranteed.