A subject I’ve recently taken some interest in is data compression. As far as the compression algorithms themselves go, you’ll learn nothing in this post that you can’t find out in a few minutes on Wikipedia. Nevertheless, I think there are two things worth pointing out, which are perhaps not known even to many I.T. professionals:
1. There is no lossless compression algorithm which is optimal for all types of data.
2. With Free Software programs at least, you have considerable freedom in choosing your compression algorithm and related parameters.
For those not in the know, lossless compression is compression that allows you to compress without losing any information. In contrast, the lossy compression algorithm used in JPEG basically discards some of the image data, sacrificing some fine resolution in such a way that hopefully you won’t notice the difference. PNG compression, on the other hand, is lossless, allowing you to view the image exactly the way you got it originally.
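To make "lossless" concrete, here is a minimal round trip using Python's zlib module, which implements DEFLATE, the same algorithm gzip uses:

```python
import zlib

# Any data will do; repetitive text compresses well.
data = ("All work and no play makes Jack a dull boy. " * 100).encode()

compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

# Lossless: the round trip reproduces the input byte for byte.
assert restored == data
print(len(data), "->", len(compressed), "bytes")
```

A lossy codec like JPEG's offers no such guarantee: decoding gives back something close to the original, not the original itself.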
Point (1) above is true because of the mathematical pigeonhole principle. Applied to compression algorithms, this means that any algorithm that is good at compressing one type of data will necessarily be bad at compressing at least one other type of data.
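The counting argument is short enough to check by hand: a lossless compressor must map distinct inputs to distinct outputs, but there are fewer bit strings shorter than n than there are bit strings of length exactly n.

```python
n = 16
inputs_of_length_n = 2 ** n
# All strings of length 0 through n-1: 1 + 2 + ... + 2**(n-1) = 2**n - 1
strings_shorter_than_n = sum(2 ** k for k in range(n))

assert strings_shorter_than_n < inputs_of_length_n
# So at least one length-n input cannot be mapped to a shorter output:
# any compressor that shrinks some inputs must expand (or leave alone) others.
```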
Regarding point (2): on a GNU/Linux system, it is likely that, without having to install anything, you already have at least three compression tools available: gzip, bzip2, and 7z.
My version of gzip uses LZ77 (Lempel-Ziv) coding.
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream.
https://en.wikipedia.org/wiki/LZ77_and_LZ78#LZ77
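The idea can be sketched in a few lines of Python. This is a toy illustration only, nothing like gzip's actual implementation: it greedily finds the longest earlier match within a window and emits (offset, length, next character) triples.

```python
def lz77_compress(data, window=255):
    """Toy LZ77: emit (offset, length, next_char) triples."""
    out = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            # Matches may run past i (overlap), which decompression handles.
            while (i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])   # copy from earlier in the output
        out.append(ch)
    return "".join(out)

triples = lz77_compress("abcabcabcabcx")
print(triples)                       # the long repeat collapses to one reference
assert lz77_decompress(triples) == "abcabcabcabcx"
```

Real implementations add entropy coding (Huffman, in gzip's case) on top of the references, and use hash tables rather than a linear scan to find matches.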
christopher@nightshade:~/Scratch$ gzip --version | head -n1
gzip 1.6
This is a simple idea that works decently applied to some text data:
christopher@nightshade:~/Scratch$ wget https://www.gnu.org/licenses/gpl-3.0.txt
--2019-03-07 16:34:50-- https://www.gnu.org/licenses/gpl-3.0.txt
Resolving www.gnu.org (www.gnu.org)… 2001:470:142:3::a, 209.51.188.148
Connecting to www.gnu.org (www.gnu.org)|2001:470:142:3::a|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 35149 (34K) [text/plain]
Saving to: ‘gpl-3.0.txt’
gpl-3.0.txt 100%[================================================================>] 34.33K --.-KB/s in 0.1s
2019-03-07 16:34:51 (280 KB/s) - ‘gpl-3.0.txt’ saved [35149/35149]
christopher@nightshade:~/Scratch$ gzip -vc gpl-3.0.txt > gpl-3.0.txt.default.gz
gpl-3.0.txt: 65.5%
Now you can (in principle) improve your compression ratio by searching harder for repeats, which is to say, spending more CPU cycles. With gzip you control this by specifying a compression level: the default is 6, and you can go up to 9. In this particular case, though, it barely makes a difference:
christopher@nightshade:~/Scratch$ gzip -v9c gpl-3.0.txt > gpl-3.0.txt.9.gz
gpl-3.0.txt: 65.6%
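The same trade-off is visible from Python's zlib, which exposes the DEFLATE compression level directly (1 = fastest, 9 = most thorough):

```python
import zlib

data = ("The quick brown fox jumps over the lazy dog. " * 200).encode()

fast = zlib.compress(data, level=1)
best = zlib.compress(data, level=9)

# Both shrink the input dramatically; on simple repetitive text the
# gap between levels is often tiny, just as with the GPL text above.
print(len(data), len(fast), len(best))
assert zlib.decompress(best) == data
```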
I haven’t had a chance to spend any quality time with bzip2 yet, but 7z (which technically does both archiving and compressing) uses the LZMA2 algorithm by default. In this case (text data) we get only slightly better results, even with the compression level set high:
christopher@nightshade:~/Scratch$ 7z a gpl-3.0.txt.default.7z gpl-3.0.txt
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.utf8,Utf16=on,HugeFiles=on,64 bits,3 CPUs AMD Athlon(tm) II X3 455 Processor (100F53),ASM)
Scanning the drive:
1 file, 35149 bytes (35 KiB)
Creating archive: gpl-3.0.txt.default.7z
Items to compress: 1
Files read from disk: 1
Archive size: 11487 bytes (12 KiB)
Everything is Ok
christopher@nightshade:~/Scratch$ 7z a gpl-3.0.txt.9.7z gpl-3.0.txt -mx=9
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.utf8,Utf16=on,HugeFiles=on,64 bits,3 CPUs AMD Athlon(tm) II X3 455 Processor (100F53),ASM)
Scanning the drive:
1 file, 35149 bytes (35 KiB)
Creating archive: gpl-3.0.txt.9.7z
Items to compress: 1
Files read from disk: 1
Archive size: 11495 bytes (12 KiB)
Everything is Ok
However, 7z allows us to choose from several different compression algorithms. One is PPMd:
Predictions are usually reduced to symbol rankings. Each symbol (a letter, bit or any other amount of data) is ranked before it is compressed and, the ranking system determines the corresponding codeword (and therefore the compression rate). In many compression algorithms, the ranking is equivalent to probability mass function estimation. Given the previous letters (or given a context), each symbol is assigned with a probability. For instance, in arithmetic coding the symbols are ranked by their probabilities to appear after previous symbols and the whole sequence is compressed into a single fraction that is computed according to these probabilities.
https://en.wikipedia.org/wiki/Prediction_by_partial_matching#Theory
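The flavor of "prediction by context" can be sketched with an order-1 model, a hypothetical toy far simpler than real PPMd: estimate each character's probability from the single character before it.

```python
from collections import Counter, defaultdict

def order1_model(text):
    """For each context character, count which characters follow it."""
    following = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        following[prev][cur] += 1
    return following

model = order1_model("the theory of the thing")
counts = model["t"]
total = sum(counts.values())
probs = {ch: n / total for ch, n in counts.items()}

# In this sample every "t" is followed by "h", so given context "t"
# an arithmetic coder could encode "h" in very few bits.
print(probs)
```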
This algorithm is especially good with text data, and gives us much better compression in this case:
christopher@nightshade:~/Scratch$ 7z a gpl-3.0.txt.ppm.7z gpl-3.0.txt -m0=PPMd
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.utf8,Utf16=on,HugeFiles=on,64 bits,3 CPUs AMD Athlon(tm) II X3 455 Processor (100F53),ASM)
Scanning the drive:
1 file, 35149 bytes (35 KiB)
Creating archive: gpl-3.0.txt.ppm.7z
Items to compress: 1
Files read from disk: 1
Archive size: 9617 bytes (10 KiB)
Everything is Ok
On a Debian GNU/Linux system, you can see the algorithms supported by 7z by pointing your Web browser at file:///usr/share/doc/p7zip/DOC/MANUAL/cmdline/switches/method.htm. Our results so far:
christopher@nightshade:~/Scratch$ ls -l
total 96
-rw-r--r-- 1 christopher parents 35149 Sep 29 2017 gpl-3.0.txt
-rw-r--r-- 1 christopher parents 11495 Mar 7 16:41 gpl-3.0.txt.9.7z
-rw-r--r-- 1 christopher parents 12136 Mar 7 16:36 gpl-3.0.txt.9.gz
-rw-r--r-- 1 christopher parents 11487 Mar 7 16:37 gpl-3.0.txt.default.7z
-rw-r--r-- 1 christopher parents 12142 Mar 7 16:35 gpl-3.0.txt.default.gz
-rw-r--r-- 1 christopher parents 9617 Mar 7 16:37 gpl-3.0.txt.ppm.7z
Compression of images and audio is a different animal, not least because the storage format itself usually builds in the compression. Perhaps that will be the subject of another post.