For this month’s Geospatial Frequently Asked Question (G-FAQ), I explore a topic that perhaps some of us have given little consideration to: file compression. Specifically, I look at file compression of TIFFs, which are commonly large datasets that we have to store for both short and long time periods. Take, for instance, the minimum order of 25 square kilometers (sq km) of 50-centimeter (cm) WorldView-2 imagery with 8-bit depth: this dataset alone is 400 megabytes (mb). Now consider the enormity of a 1-foot aerial collection over an entire county, and surely the topic of file compression makes a bit more sense. I have compressed my archival copies of client orders for some time now to reduce the amount of hard drive space needed for long-term storage and thus to save a bit of money.
With this in mind, the January G-FAQ will address this core set of questions:
- What is the best compression format for TIFFs?
- What freeware applications are offered for file compression?
- Is there a format that is better for short versus long-term storage?
In past G-FAQs, I have spent some time introducing the principles behind the topics discussed; this month, however, I will leave the principles of file compression to the experts – if you want to read more about the topic, check out this thorough summary. Suffice it to say that compression saves space by looking for repeating patterns in words, numbers and images, and then replacing them with shorter data strings in the compressed file. Some compressed file formats you might be familiar with are .bin (Mac only), .gz, .rar, .tar (which does not actually compress; it just bundles multiple files together into one) and .zip – with .zip far and away the most commonly used.
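To make the idea of repeating patterns concrete, here is a minimal Python sketch using the standard zlib module (deflate). Python and the synthetic test data are assumptions for illustration only; the tests in this G-FAQ were run with 7-Zip. The sketch compresses a highly repetitive buffer and a random buffer of the same length.

```python
import os
import zlib

# Highly repetitive data: the same 16-byte pattern repeated 4,096 times.
repetitive = b"0123456789ABCDEF" * 4096  # 65,536 bytes

# Random data of the same length: no repeating patterns to exploit.
random_bytes = os.urandom(len(repetitive))

packed_repetitive = zlib.compress(repetitive, 6)
packed_random = zlib.compress(random_bytes, 6)

print("repetitive:", len(repetitive), "->", len(packed_repetitive), "bytes")
print("random:    ", len(random_bytes), "->", len(packed_random), "bytes")

# Compression is lossless: decompressing returns the original bytes.
assert zlib.decompress(packed_repetitive) == repetitive
assert zlib.decompress(packed_random) == random_bytes
```

The repetitive buffer shrinks to a small fraction of its size, while the random buffer barely changes. Real imagery falls somewhere between these two extremes.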
Freeware File Compression Software
While there are a plethora of freeware programs that can create compressed files, here is a list of some good options:
But my favorite freeware program is 7-Zip, which is also featured in this month’s Free For All. You can read more about 7-Zip’s features here. On the 7-Zip website, they show test results where .7z appears to be the best format for file compression, and that is part of the inspiration for this G-FAQ.
The compression tests that you see below were completed on a Sony Vaio laptop with 12 gb of RAM, a 64-bit Windows 7 Professional operating system and an Intel Core i7-3520M CPU running at 2.90 GHz. During the tests, my laptop was plugged in and all programs were shut down except for those that normally run in the background, such as anti-virus software and NVIDIA settings. The TIFF file used for the tests was a 16-bit depth 50-cm natural color image collected by WorldView-2 and covering 5 sq km. The file’s original size was 122,611,083 kb and I had 7-Zip version 9.20 installed for the tests.
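If you want to run a similar comparison yourself, the sketch below shows one way to record compressed size and elapsed time for a single run. It is written in Python with the standard zlib and lzma modules rather than 7-Zip, and the helper name `time_compression` and the stand-in data are hypothetical, so treat it as an outline of the measurement and not as the procedure used for the results in this article.

```python
import lzma
import time
import zlib


def time_compression(compress, data):
    """Return (compressed size in bytes, elapsed seconds) for one run."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return len(compressed), elapsed


# Stand-in data; in practice this would be the bytes read from the TIFF.
sample = bytes(range(256)) * 2048  # 524,288 bytes

for name, compress in [("deflate", zlib.compress), ("lzma", lzma.compress)]:
    size, seconds = time_compression(compress, sample)
    print(f"{name}: {size} bytes in {seconds:.3f} s")
```

Timings from a single run vary with whatever else the machine is doing, which is why shutting down other programs before testing matters.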
In 7-Zip, you have the ability to tweak a variety of options when you create a compressed file. Here is a short summary of each:
- Archive Format: this is the type of compressed file you will create, for example .7z or .zip
- Compression Level: this sets how much effort is spent on compression, on a scale from fastest to ultra
- Compression Method: specifies the algorithm used to compress your data
- Dictionary Size: a larger dictionary usually increases compression ratios but makes the process slower
- Word Size: specifies the length of words used to identify patterns during compression; a larger word size usually increases compression ratios
- Solid Block Size: only valid for .7z format, a solid block usually increases compression ratios
- Number of CPU Threads: specifies the number of computing requests to send to your processor at once, more threads usually means faster compression and more memory use
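To see why dictionary size matters, here is a small Python sketch using the standard lzma module, which exposes similar knobs to 7-Zip. It is an illustration under assumptions, not a reproduction of 7-Zip's settings: the helper `compress_xz` is hypothetical, `dict_size` plays the role of Dictionary Size, and I am assuming that `nice_len` corresponds to what 7-Zip calls Word Size. The test data repeats a 200,000-byte random block three times, so the repeats are only visible to a dictionary larger than 200,000 bytes.

```python
import lzma
import os


def compress_xz(data, dict_size, word_size, preset=9):
    """Compress data to .xz with an explicit dictionary and word size."""
    filters = [{
        "id": lzma.FILTER_LZMA2,
        "preset": preset,        # supplies defaults for unspecified options
        "dict_size": dict_size,  # how far back the compressor can look
        "nice_len": word_size,   # assumed equivalent of 7-Zip's word size
    }]
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)


# A 200,000-byte random block repeated three times (600,000 bytes in total).
block = os.urandom(200_000)
data = block * 3

small_dict = compress_xz(data, dict_size=64 * 1024, word_size=273)
large_dict = compress_xz(data, dict_size=1024 * 1024, word_size=273)

print("64 KiB dictionary:", len(small_dict), "bytes")
print("1 MiB dictionary: ", len(large_dict), "bytes")

# Both archives decompress back to the original data.
assert lzma.decompress(small_dict) == data
assert lzma.decompress(large_dict) == data
```

With the small dictionary the compressor cannot reach back far enough to notice the repeated block, so the output stays close to the input size; with the larger dictionary the second and third copies cost almost nothing. The larger dictionary also needs more memory, which is the same trade-off 7-Zip presents.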
TIFF File Compression Results and Conclusions
I tweaked these options and tested 12 different combinations to get a sense of the best performing compression format. The results of these tests are presented in the chart below.
|Compress Specs||Memory Used (mb)||Compress Size (kb)||% Original*||% Baseline||Compress Time (s)||% Baseline Time||Addl Time (s)||Addl Compress (kb)||Compress / s||Addl Compress / Addl s|
|BASELINE - zip, normal, deflate, 32kb, 32, N/A, 2||67||61476727||50.1396%||N/A||13||N/A||N/A||N/A||4728979.0||N/A|
|Zip, normal, LZMA, 512mb, 273, N/A, 2||5384||49195420||40.1231%||80.0228%||93||715.3846%||80||12281307||528983.0||153516.3|
|Zip, ultra, deflate, 32kb, 258, N/A, 2||68||61374356||50.0561%||99.8335%||88||676.9231%||75||102371||697435.9||1364.9|
|Zip, ultra, LZMA, 512mb, 273, N/A, 2||5384||49195420||40.1231%||80.0228%||94||723.0769%||81||12281307||523355.5||151621.1|
|Gzip, normal, deflate, 32kb, 32, N/A, N/A||3||61476569||50.1395%||99.9997%||13||100.0000%||0||158||4728966.8||N/A|
|Gzip, normal, deflate, 32kb, 8, N/A, N/A||3||62022495||50.5847%||100.8878%||12||92.3077%||-1||-545768||5168541.3||N/A|
|Gzip, normal, deflate, 32kb, 258, N/A, N/A||3||61481422||50.1434%||100.0076%||14||107.6923%||1||-4695||4391530.1||N/A|
|Xz, normal, LZMA2, 16mb, 32, N/A, 2||192||49413336||40.3009%||80.3773%||73||561.5385%||60||12063391||676895.0||201056.5|
|7z, fastest, LZMA, 512mb, 273, solid, 2||6660||53261239||43.4392%||86.6364%||152||1169.2308%||139||8215488||350402.9||59104.2|
|7z, ultra, LZMA, 64kb, 8, non-solid, 2||38||52077107||42.4734%||84.7103%||16||123.0769%||3||9399620||3254819.2||3133206.7|
|7z, ultra, LZMA, 512mb, 273, solid, 2||5413||49195379||40.1231%||80.0228%||94||723.0769%||81||12281348||523355.1||151621.6|
|7z, ultra, LZMA2, 512mb, 273, solid, 2||5413||49208758||40.1340%||80.0445%||94||723.0769%||81||12267969||523497.4||151456.4|
Before I present you with my conclusions following from the compression tests, let me explain the meaning of each column (or variable) in the chart above:
- Compress Specs: this details the compression options used in each test, i.e. archive format, compression level, compression method, dictionary size, word size, solid block size and number of CPU threads; the baseline compression method is a .zip file with the default options in 7-Zip
- Memory Used: the amount of RAM used to compress the TIFF file, in megabytes (mb)
- Compress Size: the final size of the compressed file, in kilobytes (kb)
- % Original: calculated as, (Compressed File Size) / (Original File Size)
- % Baseline: calculated as, (Compressed File Size) / (Baseline Compressed File Size)
- Compress Time: time to compress the TIFF file, in seconds (s)
- % Baseline Time: calculated as, (Time to Compress) / (Time to Compress Baseline)
- Addl Time: additional time calculated as, (Time to Compress) – (Time to Compress Baseline), in seconds (s)
- Addl Compress: additional compression calculated as, (Baseline Compressed File Size) – (Compressed File Size), in kilobytes (kb)
- Compress / s: compressed size per second of compression time calculated as, (Compress Size) / (Compress Time), in kilobytes per second (kb/s)
- Addl Compress / Addl s: additional compression per additional second of compression time versus baseline calculated as, (Addl Compress) / (Addl Time), in kilobytes per second (kb/s); N/A means that this calculation was not possible for a given entry
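The column definitions can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name `derive_metrics` and the dictionary keys are hypothetical, sizes are entered exactly as they appear in the table, Addl Time is taken as the test time minus the baseline time (which is what the table values show), and the last column is returned as None (N/A) unless both the additional time and the additional compression are positive, which matches where N/A appears in the table.

```python
# Values taken from the article: original file size and the baseline row.
ORIGINAL_SIZE = 122_611_083
BASELINE_SIZE = 61_476_727
BASELINE_TIME = 13  # seconds


def derive_metrics(size, seconds):
    """Compute the derived table columns for one compression test."""
    addl_time = seconds - BASELINE_TIME
    addl_compress = BASELINE_SIZE - size
    # Assumption: the ratio is only meaningful when both parts are positive.
    if addl_time > 0 and addl_compress > 0:
        addl_per_addl_s = addl_compress / addl_time
    else:
        addl_per_addl_s = None  # shown as N/A in the table
    return {
        "pct_original": 100 * size / ORIGINAL_SIZE,
        "pct_baseline": 100 * size / BASELINE_SIZE,
        "pct_baseline_time": 100 * seconds / BASELINE_TIME,
        "addl_time": addl_time,
        "addl_compress": addl_compress,
        "compress_per_s": size / seconds,
        "addl_per_addl_s": addl_per_addl_s,
    }


# The "7z, ultra, LZMA, 512mb, 273, solid, 2" row: size 49195379, 94 seconds.
row = derive_metrics(49_195_379, 94)
for name, value in row.items():
    print(name, value)
```

Running this for the 7z ultra LZMA row gives values that agree with the table to the precision shown there.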
Now that I have explained the variables used in this chart, let’s turn our attention to some conclusions that can be drawn from this data.
Best choice for long-term compression
- When you are storing data for a long time, the compression ratio is usually the most important consideration, as smaller files mean less storage space and thus less money spent on storage.
- As such, the clear winner for long-term compression is .7z format at the ultra compression level with the LZMA method. Increasing word, dictionary and block size will also increase compression ratios.
- Increased compression comes with the drawback of extra time to compress and uncompress files.
Best choice for short-term compression
- For short-term compression, file size is important, but so is compression speed, as you will not keep the data around for the long haul.
- The best balance between speed and compression ratio is achieved by using .7z format at the ultra level with LZMA compression, a 64kb dictionary, a word size of 8 and a non-solid block. This option created a file that was about 15% smaller than the baseline ZIP while taking only 3 additional seconds to compress.
- One drawback is that .7z format is not recognized by all software if you plan to share the compressed data. That said, 7-Zip is a freeware program so perhaps that is not a big issue!
- The gzip (.gz) format yielded the worst (lowest) compression ratios.
- The traditional .zip format was surprisingly robust in these tests as even the most compressed file was only about 20% smaller than the baseline.
- The most important influence on compression ratio was not the file format but rather the compression method. LZMA is by far the best performer, saving an additional 7 to 10% of the original file size compared with deflate.
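The point that the method matters more than the container can be sketched in Python, whose standard zipfile module can write the same .zip format with either deflate or LZMA. This is an illustration only: the helper `zip_size` is hypothetical, and the synthetic data (a 100,000-byte random block repeated three times) is deliberately built so that the repeats fall outside deflate's 32kb window, so it exaggerates the gap compared with real imagery.

```python
import io
import os
import zipfile


def zip_size(data, method):
    """Write data into an in-memory .zip and return the archive size."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=method) as archive:
        archive.writestr("sample.bin", data)
    return len(buffer.getvalue())


# Repeats are 100,000 bytes apart, beyond the reach of deflate's window.
sample = os.urandom(100_000) * 3

deflate_size = zip_size(sample, zipfile.ZIP_DEFLATED)
lzma_size = zip_size(sample, zipfile.ZIP_LZMA)

print("zip + deflate:", deflate_size, "bytes")
print("zip + LZMA:   ", lzma_size, "bytes")
```

Both archives are ordinary .zip files; only the method differs. Note that not every unzip tool can read LZMA-compressed .zip files, so check your recipient's software before sharing one.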
Until our next edition of G-FAQ, happy GIS-ing!
Do you have an idea for a future G-FAQ? If so, let me know by email.
Brock Adam McCarty