
Web Application from Scratch, Part I

This is the first in a series of posts in which I’m going to go through the process of building a web application (and its web server) from scratch in Python. For the purposes of this series, I’m going to solely rely on the Python standard library and I’m going to ignore the WSGI standard.

Without further ado, let’s get to it!

The web server

To begin with, we’re going to write the HTTP server that will power our web app. But first, we need to spend a little time looking into how the HTTP protocol works.

How HTTP works

Simply put, HTTP clients connect to HTTP servers over the network and send them a string of data representing the request. The server then interprets that request and sends the client back a response. The entire protocol and the formats of those requests and responses are described in RFC 2616, but I’m going to informally describe them below so you don’t have to read the whole thing.

Request format

Requests are represented by a series of \r\n-separated lines, the first of which is called the “request line”. The request line is made up of an HTTP method, followed by a space, followed by the path of the file being requested, followed by another space, followed by the HTTP protocol version the client speaks and, finally, followed by a carriage return (\r) and a line feed (\n) character:

GET /some-path HTTP/1.1\r\n

After the request line come zero or more header lines. Each header line is made up of the header name, followed by a colon, followed by an optional value, followed by \r\n:

Host: example.com\r\n
Accept: text/html\r\n

The end of the headers section is signalled by an empty line:

\r\n

Finally, the request may contain a “body” — an arbitrary payload that is sent to the server with the request.

Putting it all together, here’s a simple GET request:

GET / HTTP/1.1\r\n
Host: example.com\r\n
Accept: text/html\r\n
\r\n

and here’s a simple POST request with a body:

POST / HTTP/1.1\r\n
Host: example.com\r\n
Accept: application/json\r\n
Content-type: application/json\r\n
Content-length: 2\r\n
\r\n
{}
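To make the format concrete, here is a minimal sketch that splits the raw bytes of the POST request above into its request line, headers and body. The helper name split_request is made up for illustration and is not part of the server we build later.

```python
def split_request(raw: bytes):
    """Split raw request bytes into (request_line, headers, body)."""
    # The empty line (\r\n\r\n) separates the head from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    request_line, *header_lines = head.decode("ascii").split("\r\n")

    headers = {}
    for line in header_lines:
        # Everything before the first colon is the name, the rest is the value.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    return request_line, headers, body


raw = (
    b"POST / HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Accept: application/json\r\n"
    b"Content-type: application/json\r\n"
    b"Content-length: 2\r\n"
    b"\r\n"
    b"{}"
)
request_line, headers, body = split_request(raw)
print(request_line)               # POST / HTTP/1.1
print(headers["content-length"])  # 2
print(body)                       # b'{}'
```

Note that the Content-length header matches the number of bytes in the body.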

Response format

Responses, like requests, are made up of a series of \r\n-separated lines. The first line in the response is called the “status line” and it is made up of the HTTP protocol version, followed by a space, followed by the response status code, followed by another space, then the status code reason, followed by \r\n:

HTTP/1.1 200 OK\r\n

After the status line come the response headers, then an empty line and then an optional response body:

HTTP/1.1 200 OK\r\n
Content-type: text/html\r\n
Content-length: 15\r\n
\r\n
<h1>Hello!</h1>
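As a quick illustration, a response like the one above can be assembled from its parts. This is only a sketch: build_response is a hypothetical helper, and it computes Content-length from the body rather than hard-coding it.

```python
def build_response(status: str, content_type: str, body: bytes) -> bytes:
    """Join the status line, headers, empty line and body into response bytes."""
    head = (
        f"HTTP/1.1 {status}\r\n"
        f"Content-type: {content_type}\r\n"
        f"Content-length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


response = build_response("200 OK", "text/html", b"<h1>Hello!</h1>")
print(response)
```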

A simple server

Based on what we know so far about the protocol, let’s write a server that sends the same response regardless of the incoming request.

To start out, we need to create a socket, bind it to an address and then start listening for connections.
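A minimal sketch of this step could look like the following. The names HOST, PORT, create_server_socket and main are my own choices for illustration, and the address matches the one used in the rest of this post.

```python
import socket

HOST = "127.0.0.1"
PORT = 9000


def create_server_socket(host: str = HOST, port: int = PORT) -> socket.socket:
    """Create a TCP socket, bind it to (host, port) and start listening."""
    # socket.socket() creates an IPv4 TCP socket by default.
    server_sock = socket.socket()
    # Allow the address to be reused while old connections sit in TIME_WAIT,
    # so that restarting the server right away doesn't fail.
    server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server_sock.bind((host, port))
    # Ask for a small backlog; this server handles one connection at a time.
    server_sock.listen(1)
    return server_sock


def main() -> None:
    # Leaving the with block closes the socket again.
    with create_server_socket() as server_sock:
        host, port = server_sock.getsockname()
        print(f"Listening on {host}:{port}...")
```

Calling main(), for example from an `if __name__ == "__main__":` guard at the bottom of server.py, is what actually binds the socket.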

If you try to run this code now, it’ll print to standard out that it’s listening on 127.0.0.1:9000 and then exit. In order to actually process incoming connections we need to call the accept method on our socket. Doing so will block the process until a client connects to our server.

Once we have a socket connection to the client, we can start to communicate with it. Using the sendall method, let’s send the connecting client an example response:
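Here’s a sketch of accept and sendall working together, assuming the same address as before. RESPONSE is the example response from the “Response format” section, and serve_one and main are hypothetical names.

```python
import socket

HOST = "127.0.0.1"
PORT = 9000

RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-type: text/html\r\n"
    b"Content-length: 15\r\n"
    b"\r\n"
    b"<h1>Hello!</h1>"
)


def serve_one(server_sock: socket.socket) -> None:
    """Block until one client connects, send it RESPONSE, then hang up."""
    # accept returns a new socket for this client, plus the client's address.
    client_sock, client_addr = server_sock.accept()
    print(f"Received connection from {client_addr}...")
    with client_sock:
        # sendall keeps writing until every byte has been handed to the kernel.
        client_sock.sendall(RESPONSE)


def main() -> None:
    with socket.socket() as server_sock:
        server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server_sock.bind((HOST, PORT))
        server_sock.listen(1)
        print(f"Listening on {HOST}:{PORT}...")
        serve_one(server_sock)
```

Calling main() starts the server and handles exactly one connection.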

If you run the code now and then visit http://127.0.0.1:9000 in your favourite browser, it should render the string “Hello!”. Unfortunately, the server will exit after it sends the response so refreshing the page will fail. Let’s fix that:
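One way to do that is to wrap the accept call in an infinite loop. The sketch below assumes the same address and example response as before; serve_forever and main are hypothetical names.

```python
import socket

HOST = "127.0.0.1"
PORT = 9000

RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-type: text/html\r\n"
    b"Content-length: 15\r\n"
    b"\r\n"
    b"<h1>Hello!</h1>"
)


def serve_forever(server_sock: socket.socket) -> None:
    """Accept clients one after another, sending RESPONSE to each."""
    while True:
        client_sock, client_addr = server_sock.accept()
        print(f"Received connection from {client_addr}...")
        # The with block closes the client socket before the next accept.
        with client_sock:
            client_sock.sendall(RESPONSE)


def main() -> None:
    with socket.socket() as server_sock:
        server_sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server_sock.bind((HOST, PORT))
        server_sock.listen(1)
        print(f"Listening on {HOST}:{PORT}...")
        serve_forever(server_sock)
```

Calling main() now keeps the server running until you interrupt it with Ctrl-C.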

At this point we have a web server that can serve a simple HTML web page on every request, all in about 25 lines of code. That’s not too bad!

A file server

Let’s extend the HTTP server so that it can serve files off of disk.

Request abstraction

Before we can do that, we have to be able to read and parse incoming request data from the client. Since we know that request data is represented by a series of lines, each separated by \r\n characters, let’s write a generator function that reads data from a socket and yields each individual line:
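A sketch of such a generator, under the assumption that lines are always terminated by \r\n. It uses bytes.find to locate each line ending, and the default bufsize is an arbitrary choice.

```python
import socket
import typing


def iter_lines(
    sock: socket.socket, bufsize: int = 16_384
) -> typing.Generator[bytes, None, bytes]:
    """Yield each CRLF-terminated line read from sock until an empty
    line is found, then return whatever was read past that empty line.
    """
    buff = b""
    while True:
        data = sock.recv(bufsize)
        if not data:
            # The client hung up before sending an empty line.
            return b""

        buff += data
        while True:
            i = buff.find(b"\r\n")
            if i == -1:
                # No complete line in the buffer yet; go read some more.
                break

            line, buff = buff[:i], buff[i + 2:]
            if not line:
                # Empty line: the headers are over, the rest is body data.
                return buff

            yield line
```

The value returned by the generator is attached to the StopIteration exception that ends the iteration, so a plain for loop over iter_lines simply ignores it.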

This may look a bit daunting, but essentially it reads as much data as it can from the socket (in bufsize chunks), accumulates that data in a buffer (buff) and repeatedly splits the buffer into individual lines, yielding one at a time. Once it finds an empty line, it returns the extra data that it read.

Using iter_lines, we can begin printing the requests we get from our clients:

If you run the server now and visit http://127.0.0.1:9000, you should see something like this in your console:

Received connection from ('127.0.0.1', 62086)...
b'GET / HTTP/1.1'
b'Host: localhost:9000'
b'Connection: keep-alive'
b'Cache-Control: max-age=0'
b'Upgrade-Insecure-Requests: 1'
b'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'
b'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'
b'Accept-Encoding: gzip, deflate, br'
b'Accept-Language: en-US,en;q=0.9,ro;q=0.8'

Pretty neat! Let’s abstract over that data by defining a Request class:

For now, the request class is only going to know about methods, paths and request headers. We’ll leave parsing query string parameters and reading request bodies for later.

To encapsulate the logic needed to build up a request, we’ll add a class method to Request called from_socket:
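Here’s a sketch of what the Request class and its from_socket class method could look like. I’m assuming a typing.NamedTuple for the class, and I’ve repeated a compact version of the iter_lines generator described earlier so that the example runs on its own.

```python
import socket
import typing


def iter_lines(sock: socket.socket, bufsize: int = 16_384):
    """Yield CRLF-terminated lines from sock until an empty line is found."""
    buff = b""
    while True:
        data = sock.recv(bufsize)
        if not data:
            return b""
        buff += data
        while True:
            i = buff.find(b"\r\n")
            if i == -1:
                break
            line, buff = buff[:i], buff[i + 2:]
            if not line:
                return buff
            yield line


class Request(typing.NamedTuple):
    method: str
    path: str
    headers: typing.Mapping[str, str]

    @classmethod
    def from_socket(cls, sock: socket.socket) -> "Request":
        """Read and parse the request line and headers from sock.

        Raises ValueError if the request line is missing or malformed.
        """
        lines = iter_lines(sock)

        try:
            request_line = next(lines).decode("ascii")
        except StopIteration:
            raise ValueError("Request line missing.")

        try:
            # e.g. "GET /some-path HTTP/1.1"; the version is ignored for now.
            method, path, _ = request_line.split(" ")
        except ValueError:
            raise ValueError(f"Malformed request line {request_line!r}.")

        headers = {}
        for line in lines:
            name, _, value = line.decode("ascii").partition(":")
            # Header names are case-insensitive, so store them lowercased.
            headers[name.lower()] = value.lstrip()

        return cls(method=method.upper(), path=path, headers=headers)
```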

It uses the iter_lines function we defined earlier to read the request line, which is where it gets the method and the path. It then reads each individual header line and parses it. Finally, it builds the Request object and returns it. If we plug that into our server loop, it should look something like this:

If you connect to the server now, you should see lines like this one get printed out:

Request(method='GET', path='/', headers={'host': 'localhost:9000', 'user-agent': 'curl/7.54.0', 'accept': '*/*'})

Because from_socket can raise an exception under certain circumstances, the server might crash if given an invalid request right now. To simulate this, you can use telnet to connect to the server and send it some bogus data:

~> telnet 127.0.0.1 9000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
Connection closed by foreign host.

Sure enough, the server crashed:

Received connection from ('127.0.0.1', 62404)...
Traceback (most recent call last):
  File "server.py", line 53, in parse
    request_line = next(lines).decode("ascii")
ValueError: not enough values to unpack (expected 3, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "server.py", line 82, in <module>
    with client_sock:
  File "server.py", line 55, in parse
    raise ValueError("Request line missing.")
ValueError: Malformed request line 'hello'.

To handle these kinds of issues a little more gracefully, let’s wrap the call to from_socket in a try-except block and send the client a “400 Bad Request” response when we get a malformed request:
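A sketch of that error handling follows. To keep the example short, the parser is passed in as an argument (parse_request), standing in for Request.from_socket; parse_or_reject is a hypothetical helper name.

```python
import socket

BAD_REQUEST_RESPONSE = (
    b"HTTP/1.1 400 Bad Request\r\n"
    b"Content-type: text/plain\r\n"
    b"Content-length: 11\r\n"
    b"\r\n"
    b"Bad Request"
)


def parse_or_reject(client_sock: socket.socket, parse_request):
    """Return the parsed request, or send a 400 response and return None."""
    try:
        return parse_request(client_sock)
    except Exception as e:
        # Log the problem on the server and tell the client what went wrong.
        print(f"Failed to parse request: {e}")
        client_sock.sendall(BAD_REQUEST_RESPONSE)
        return None
```

Inside the server loop this would be called as parse_or_reject(client_sock, Request.from_socket), moving on to the next connection whenever it returns None.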

If we try to break it now, our client will get a response back and the server will stay up:

~> telnet 127.0.0.1 9000
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
hello
HTTP/1.1 400 Bad Request
Content-type: text/plain
Content-length: 11

Bad RequestConnection closed by foreign host.

At this point we’re ready to start implementing the file serving part, but first let’s make our default response a “404 Not Found” response:

Additionally, let’s add a “405 Method Not Allowed” response. We’re going to need it for when we get anything other than a GET request.
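Both responses can be written out as byte strings, in the same style as the example response from earlier. The constant names and the default_response helper below are assumptions made for this sketch.

```python
NOT_FOUND_RESPONSE = (
    b"HTTP/1.1 404 Not Found\r\n"
    b"Content-type: text/plain\r\n"
    b"Content-length: 9\r\n"
    b"\r\n"
    b"Not Found"
)

METHOD_NOT_ALLOWED_RESPONSE = (
    b"HTTP/1.1 405 Method Not Allowed\r\n"
    b"Content-type: text/plain\r\n"
    b"Content-length: 18\r\n"
    b"\r\n"
    b"Method Not Allowed"
)


def default_response(method: str) -> bytes:
    """Pick the canned response for a request we can't serve yet."""
    if method != "GET":
        return METHOD_NOT_ALLOWED_RESPONSE
    return NOT_FOUND_RESPONSE
```

In the server loop, the result of default_response(request.method) would be passed to client_sock.sendall.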

Let’s define a SERVER_ROOT constant to represent where the server should serve files from and a serve_file function.

serve_file takes the client socket and a path to a file. It first tries to resolve that path to a real file inside SERVER_ROOT and sends a “not found” response if the path resolves outside of the server root. It then opens the file, works out its MIME type and its size (using os.fstat), constructs the response headers and uses the sendfile system call to write the file to the socket. If it can’t find the file on disk, it sends a “not found” response.
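Here is a sketch of serve_file along those lines. A few things in it are my own assumptions: the extra root parameter (which defaults to SERVER_ROOT and makes the function easy to try against a temporary directory), mapping "/" to "/index.html", the use of os.path.realpath and os.path.commonpath for the containment check, and treating directories as “not found”.

```python
import mimetypes
import os
import socket

SERVER_ROOT = os.path.abspath("www")

NOT_FOUND_RESPONSE = (
    b"HTTP/1.1 404 Not Found\r\n"
    b"Content-type: text/plain\r\n"
    b"Content-length: 9\r\n"
    b"\r\n"
    b"Not Found"
)


def serve_file(sock: socket.socket, path: str, root: str = SERVER_ROOT) -> None:
    """Send the file at path (relative to root) to sock, or a 404 response
    if it doesn't exist or lies outside of root.
    """
    if path == "/":
        path = "/index.html"

    # Resolve "..", "." and symlinks, then make sure we are still under root.
    root = os.path.realpath(root)
    abspath = os.path.realpath(os.path.join(root, path.lstrip("/")))
    if os.path.commonpath([root, abspath]) != root:
        sock.sendall(NOT_FOUND_RESPONSE)
        return

    try:
        with open(abspath, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            content_type, _ = mimetypes.guess_type(abspath)
            if content_type is None:
                content_type = "application/octet-stream"

            headers = (
                "HTTP/1.1 200 OK\r\n"
                f"Content-type: {content_type}\r\n"
                f"Content-length: {size}\r\n"
                "\r\n"
            ).encode("ascii")
            sock.sendall(headers)
            # Uses os.sendfile where available, plain send otherwise.
            sock.sendfile(f)
    except (FileNotFoundError, IsADirectoryError):
        sock.sendall(NOT_FOUND_RESPONSE)
```

With this in place, a request for /../server.py resolves to a path outside of the server root and gets a “not found” response instead of the file.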

If we add serve_file into the mix, our server loop should now look like this:
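The overall shape of that loop can be sketched as follows. To keep the example self-contained, the request parser and the file-serving function are passed in as arguments (parse_request and serve_file), standing in for Request.from_socket and the serve_file function described above; handle_client and serve_forever are hypothetical names.

```python
import socket

BAD_REQUEST_RESPONSE = (
    b"HTTP/1.1 400 Bad Request\r\n"
    b"Content-type: text/plain\r\n"
    b"Content-length: 11\r\n"
    b"\r\n"
    b"Bad Request"
)

METHOD_NOT_ALLOWED_RESPONSE = (
    b"HTTP/1.1 405 Method Not Allowed\r\n"
    b"Content-type: text/plain\r\n"
    b"Content-length: 18\r\n"
    b"\r\n"
    b"Method Not Allowed"
)


def handle_client(client_sock: socket.socket, parse_request, serve_file) -> None:
    """Parse one request from client_sock and answer it."""
    try:
        request = parse_request(client_sock)
    except Exception as e:
        print(f"Failed to parse request: {e}")
        client_sock.sendall(BAD_REQUEST_RESPONSE)
        return

    # Only GET makes sense for a read-only file server.
    if request.method != "GET":
        client_sock.sendall(METHOD_NOT_ALLOWED_RESPONSE)
        return

    serve_file(client_sock, request.path)


def serve_forever(server_sock: socket.socket, parse_request, serve_file) -> None:
    """Accept clients one at a time and handle a single request from each."""
    while True:
        client_sock, client_addr = server_sock.accept()
        print(f"Received connection from {client_addr}...")
        with client_sock:
            handle_client(client_sock, parse_request, serve_file)
```

With the pieces from this post, the loop would be started with serve_forever(server_sock, Request.from_socket, serve_file).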

If you add a file called www/index.html next to your server.py file and visit http://localhost:9000 you should see the contents of that file. Cool, eh?

Winding down

That’s it for part 1. In part 2 we’re going to cover extracting Server and Response abstractions as well as making the server handle multiple concurrent connections. If you’d like to check out the full source code and follow along, you can find it here.

See ya next time!