
Good Performance Testing - Compression

By Howard Osborne

In this deep dive, we are going to take a closer look at compression. Why? Because enabling or disabling compression is a matter of judgement. Get it right and you have a performance edge over your slower competitors; get it wrong and your site could be exposed to security vulnerabilities.

CRIME and BREACH
In late 2012, the CRIME security vulnerability was announced, demonstrating that compression could be exploited to steal secrets and leading to the wholesale removal of compression from the TLS and SPDY protocols. When, in August 2013, the same class of attack was shown to work against HTTP compression under the name BREACH, there were widespread fears about the safety of the internet as a whole. The dip in the use of compression among the top sites can be seen in the HTTP Archive data towards the end of 2012.

[Figure: HTTP Archive data showing the dip in compression use towards the end of 2012]

And whilst mitigation strategies are now available, such as protecting against cross-site request forgery, the exploit still exists.

The urge to simply switch off compression is understandable, but without it you risk lagging behind your competitors.

Background to compression
There’s nothing new about trying to save space; in fact, it was something of an obsession in earlier times. Before Y2K, even four-digit years were a luxury.

With the phenomenal improvements in hard drive, memory and processor technologies, most of those old preoccupations went out the window, and as far as memory and hard drive space went, greed was good. That was until we started moving these bulky new bits of data around the world over old telephone cables.

And in the new commercial internet age, trying a customer’s patience meant watching them surf over to a competitor. So how could rich content be delivered speedily?

A neat solution was HTTP compression, which took the old, tried-and-tested technologies for reducing the size of files and applied them specifically to web servers and their clients.

The process is relatively simple. If a client is capable of handling compressed files, it says so by adding a header to its requests: Accept-Encoding: gzip, deflate

When the server sees the header, it sends a compressed version of the file, along with a Content-Encoding: gzip response header stating which compression it applied.
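Here is a sketch of what that exchange looks like on the wire (the host, file and sizes are placeholders for illustration):

GET /timings.txt HTTP/1.1
Host: www.example.com
Accept-Encoding: gzip, deflate

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Encoding: gzip
Content-Length: 2979

<gzip-compressed body>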

The most common forms of file compression implement a version of the deflate algorithm, which has two stages:

1. Find duplicate strings and replace them with pointers
During this process, the file is scanned for strings, or series of characters, which appear more than once. Each subsequent instance is then replaced with a pointer back to the first occurrence. This works well where there is lots of repetition and is ideal for Phil Collins songs.

Let’s take an example of this in action. Here are two files recording similar information about how long it takes to do an activity.

timings.txt

Times

9850 seconds

6090 seconds

6467 seconds

And here is our second file with the same timings but with the unit of duration (seconds) specified in the heading rather than in each row

timings_lean.txt

Times (seconds)

9850

6090

6467

As a result, timings.txt is a much bigger file:

timings.txt        14007 bytes
timings_lean.txt    6016 bytes

However, because file compression works well with repetition, when our files are compressed, they end up occupying a similar amount of space:

timings.zip        2979 bytes
timings_lean.zip   2748 bytes
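You can reproduce this kind of comparison yourself. Below is a minimal Python sketch that gzips both files and prints the before-and-after sizes; the file names are the ones used above, and your exact byte counts will depend on the file contents and compression settings.

import gzip

# Compare the raw and gzip-compressed sizes of our two example files.
for name in ("timings.txt", "timings_lean.txt"):
    with open(name, "rb") as f:
        data = f.read()
    compressed = gzip.compress(data)  # gzip wraps the same deflate algorithm HTTP uses
    print(f"{name}: {len(data)} bytes -> {len(compressed)} bytes")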

2. Apply Huffman coding
The second stage, Huffman coding, was invented in those space-conscious 1950s. It changes how symbols are represented, replacing the most common ones with something smaller.

In an ASCII file, each character is represented by eight bits (one byte). The great advantage of this system is that it is very simple to make a file reader: you read the file one byte at a time, look up each byte’s ASCII value and print the character. For example, when you read the byte 01100001, you print the letter ‘a’.
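You can check that byte-for-character mapping in a Python session:

>>> chr(0b01100001)  # the byte 01100001 is decimal 97, which is ASCII 'a'
'a'
>>> ord('a')
97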

However, this can be a little wasteful, as you always need a full eight bits for every character. Huffman coding builds a tree of the characters weighted by their frequencies, and the most frequent characters get a shorter bit representation, as the sketch below illustrates.
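To make the idea concrete, here is a minimal Huffman-coding sketch in Python. It illustrates the principle of giving frequent characters shorter codes; deflate itself specifies its own, more constrained code construction.

import heapq
from collections import Counter

def huffman_codes(text):
    # Build a priority queue of (frequency, tiebreak, tree) entries;
    # a tree is either a single character or a pair of subtrees.
    heap = [(freq, i, char) for i, (char, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent trees into one.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, str):        # leaf: a single character
            codes[tree] = prefix or "0"  # handle the one-symbol edge case
        else:
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("this is an example of huffman coding")
for char, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(repr(char), code)

Running this prints the shortest bit patterns first; the most frequent symbols, such as the space character, receive the shortest codes.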

Not all files are equal
Our timings files compressed very nicely as they were plain text files with lots of duplication; other file formats will not produce such great results. Image files such as JPEGs have already undergone a form of compression and, as a result, gain little if anything from the process. Technically, any file can be sent compressed irrespective of what it is, but recompressing already-compressed data burns CPU time for little or no saving, which is why servers are usually configured to compress only text-based formats such as HTML, CSS, JavaScript and JSON.
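A quick sketch of why: below, gzip is applied both to repetitive text and to random bytes (standing in for already-compressed data such as a JPEG). The text shrinks dramatically, while the random data does not shrink at all; it actually grows slightly because of the gzip header and framing.

import gzip
import os

text = b"9850 seconds\n" * 1000   # repetitive text, like our timings file
noise = os.urandom(13000)         # incompressible bytes, like JPEG data

for label, data in (("text", text), ("random", noise)):
    print(f"{label}: {len(data)} bytes -> {len(gzip.compress(data))} bytes")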

What next?
Now that we have seen the role compression plays in optimising performance, our next deep dive will be on avoiding sending files unnecessarily, and for that we will look at caching.

Ready for Part 3? Read about Caching Here >>

Missed Part 1? Read about Minifying Here >>
