By Howard Osborne
In this deep dive, we are going to take a closer look at compression. Why? Because enabling or disabling compression is a matter of judgement. Get it right and you have a performance edge over slower competitors; get it wrong and your site could be exposed to security vulnerabilities.
CRIME and BREACH
In late 2012, the CRIME security vulnerability was announced, leading to the wholesale removal of compression from the TLS and SPDY protocols. When, in August 2013, the attack was shown to work against HTTP response compression too, under the name BREACH, there were widespread fears about the safety of the internet as a whole. The dip in the use of compression for the top 100 companies can be seen in the HTTP Archive data towards the end of that year.
And whilst mitigation strategies are now available, such as preventing cross-site request forgery, the exploit still exists.
The urge to simply switch off compression is understandable but without it, you are risking lagging behind your competitors.
Background to compression
There’s nothing new about trying to save space; in fact, it was something of an obsession in earlier times. Before Y2K, even four-digit years were a luxury.
With the phenomenal improvements in hard drive, memory and processor technologies, most of those old preoccupations went out the window and as far as memory and hard drive space went, greed was good. That was until we started moving these bulky new bits of data around the world over old telephone cables.
And in the new commercial internet age, trying a customer’s patience meant them surfing to your competitor. So how could rich content be delivered speedily?
A neat solution was HTTP compression, which took the old, tried-and-tested technologies for reducing file sizes and applied them specifically to webservers and their clients.
The process is relatively simple. If a client can handle compressed files, it says so by adding a header to its requests: Accept-Encoding: gzip, deflate
When the server sees the header, it sends a compressed version of the file and labels it with a response header: Content-Encoding: gzip
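The negotiation can be sketched in a few lines of Python. This is an illustrative function of my own, not any particular server’s API, and it ignores real-world details such as q-values in the Accept-Encoding header:

```python
import gzip

def build_response(body: bytes, request_headers: dict) -> tuple[dict, bytes]:
    """Compress the body only if the client advertised gzip support."""
    accept = request_headers.get("Accept-Encoding", "")
    # Simplified parsing: split on commas, ignore q-values
    if "gzip" in [enc.strip() for enc in accept.split(",")]:
        return {"Content-Encoding": "gzip"}, gzip.compress(body)
    return {}, body

# A client that sent "Accept-Encoding: gzip, deflate" gets a compressed body...
headers, body = build_response(b"Hello, world! " * 100,
                               {"Accept-Encoding": "gzip, deflate"})
# ...and a client that sent no such header gets the file as-is.
plain_headers, plain_body = build_response(b"Hello, world! " * 100, {})
```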
The most common forms of file compression implement a version of the deflate algorithm, which works in two stages:
1. Find duplicate strings and replace them with pointers
During this process, the file is scanned for strings, or series of characters, that appear more than once. Each later instance is then replaced with a pointer back to the first occurrence. This works well where there is lots of repetition and is ideal for Phil Collins songs.
Let’s take an example of this in action. Suppose we have two files recording the same timings for an activity: timings.txt, which spells out the unit of duration (seconds) on every row, and timings_lean.txt, which states it once in the heading instead. Because of all that per-row repetition, timings.txt is a much bigger file:
timings.txt       14007 bytes
timings_lean.txt   6016 bytes
However, because file compression works well with repetition, when our files are compressed, they end up occupying a similar amount of space:
timings.zip        2979 bytes
timings_lean.zip   2748 bytes
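We can reproduce the effect with Python’s zlib module. The file contents below are invented, so the byte counts will not match the ones above, but the pattern does: the raw sizes differ widely, while the compressed sizes end up close together.

```python
import zlib

# timings.txt style: the unit is repeated on every row
verbose = "".join(f"Task {i}: took {i % 60} seconds to complete\n"
                  for i in range(500))
# timings_lean.txt style: the unit appears once, in the heading
lean = "Duration (seconds)\n" + "".join(f"Task {i}: {i % 60}\n"
                                        for i in range(500))

raw_verbose, raw_lean = len(verbose), len(lean)
zip_verbose = len(zlib.compress(verbose.encode()))
zip_lean = len(zlib.compress(lean.encode()))

print(raw_verbose, raw_lean)   # uncompressed: a large gap
print(zip_verbose, zip_lean)   # compressed: the gap largely disappears
```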
2. Apply Huffman coding
The second process, Huffman coding, was invented in those space-conscious 1950s. It involves changing how common symbols are represented, replacing them with something smaller.
In an ASCII file, each character is represented by eight bits. The great advantage of this system is that it is very simple to write a file reader: read the file one byte at a time, look up each byte’s ASCII value, and print the corresponding character. For example, when you read the byte 01100001, you print the letter ‘a’.
However, this is a little wasteful, as every character takes a full eight bits whether it is common or rare. Huffman coding builds a tree of characters weighted by their frequencies, and the most frequent characters get shorter bit representations.
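Here is a minimal sketch of that idea in Python, using a heap to repeatedly merge the two least frequent subtrees. The function name and representation are my own, and a real implementation (such as deflate’s) differs in detail:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a Huffman code table: frequent characters get shorter bit strings."""
    # Each heap entry: (frequency, unique tiebreaker, {char: code-so-far})
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent subtree
        f2, _, right = heapq.heappop(heap)   # next least frequent
        # Prefix the two subtrees' codes with 0 and 1, then merge them
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes("aaaaaaaabbbcd")
# 'a' dominates the text, so it gets the shortest code
```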
Not all files are equal
Our timings files compressed very nicely because they were plain text with lots of duplication; other file formats will not produce such great results. Image formats such as JPEG have already undergone a form of compression and, as a result, gain little if anything from the process. Still, as a rule of thumb, we can take any file and send it compressed irrespective of what it is; the cost is some wasted CPU time on formats that are already compressed.
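The point is easy to demonstrate: compress some text once, then compress the result again. The second pass operates on already-compressed bytes, standing in here for a JPEG, and saves essentially nothing:

```python
import gzip

text = ("the quick brown fox jumps over the lazy dog " * 200).encode()

once = gzip.compress(text)    # plain text compresses very well
twice = gzip.compress(once)   # compressing compressed data gains nothing

print(len(text), len(once), len(twice))
```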
Now that we have seen the role compression plays in optimising performance, our next deep dive will be on not sending files unnecessarily at all, and for that we will look at caching.