IntroductionAccording to Wikipedia:
Arithmetic coding is a method for lossless data compression. Normally, a string of characters such as the words ‘hello there’ is represented using a fixed number of bits per character, as in the ASCII code. Like Huffman coding, arithmetic coding is a form of variable-length entropy encoding that converts a string into another form that represents frequently used characters using more bits, with the goal of using fewer bits in total. As opposed to other entropy encoding techniques that separate the input message into its component symbols and replace each symbol with a code word, arithmetic coding encodes the entire message into a single number, a fraction ne, where (0.0 ≤ n < 1.0).
Some of the properties of arithmetic coding are:
1. When applied to independent and identically distributed (IID) sources, the compression of each symbol is provably optimal.
2. It is effective in a wide range of situations and compression ratios. The same arithmetic coding implementation can effectively code all the diverse data created by different processes such as modeling parameters, transform coefficients, signaling etc.
3. It simplifies the modeling of complex sources, yielding near optimal or significantly improved compression for sources that are not IID.
4. Its main process is arithmetic, which is supported with ever-increasing efficiency by all general purpose or digital processors.
5. It is suited for use as a compression black-box by those who are not coding experts or do not want to implement the coding algorithm themselves.
In this article, I present an implementation of an arithmetic coder in C# (coder.cs in the source code). The coding algorithm is adapted from an algorithm originally presented in C by Mark Nelson in his paper available here (
http://dogma.net/markn/articles/arith/part1.htm). The coding algorithm has been separated from the statistical model and encapsulated as a C# object.
DescriptionThe ‘coder’ object in the source code attached to this article can work as a black box implementation of an arithmetic encoder/decoder. However, to use it, we need to understand the basics of statistical modeling for arithmetic coding and a few of the basic concepts of arithmetic coding.
In general, using arithmetic coding depends on creating a statistical model of the data. In this example, I will assume that we are trying to encode the words “HELLO WORLD”.
Creating the statistical model of the data proceeds as follows:
1. Taking the number of independent characters in the words to be encoded (HELLO WORLD), we obtain, in alphabetical order:
1. D
2. E
3. H
4. L
5. O
6. R
7. W
The total number of characters to be encoded is 10. For convenience and clarity, I have ignored the space between the words HELLO and WORLD.
2. We can arrange these characters into a table with their corresponding frequency, as below:
| Character | Frequency |
| D | 1 |
| E | 1 |
| H | 1 |
| L | 3 |
| O | 2 |
| R | 1 |
| W | 1 |
3. Each of the characters will be assigned a range based on its frequency/ probability of occurrence. This range will be between 0 and 1, as below (note that I have not used any optimizations for the frequency and probability model; for the most optimal compression, this is necessary; however, for the purposes of our example, this should do).
| Character | Frequency | Probability | Range |
| D | 1 | 1/10 | 0.0 – 0.1 |
| E | 1 | 1/10 | 0.1 – 0.2 |
| H | 1 | 1/10 | 0.2 – 0.3 |
| L | 3 | 3/10 | 0.3 – 0.6 |
| O | 2 | 2/10 | 0.6 – 0.8 |
| R | 1 | 1/10 | 0.8 – 0.9 |
| W | 1 | 1/10 | 0.9 – 1.0 |
4. The algorithm for encoding is as below:
- Set low to 0.0
- Set high to 1.0
- While there are still input symbols do
- get an input symbol
- code_range = high - low.
- high = low + code_range * high_range of the symbol being coded
- low = low + code_range * low_range of the symbol being coded
- End of While
- output low
复制代码Applying this to our input (HELLO WORLD), we obtain:
- encoding H (Hs range is 0.2 - 0.3) Range(or code_range above) = 1 - 0 = 1
- low = 0 + (1 * 0.2) = 0.2
- high = 0 + (1 * 0.3) = 0.3
- no output
- encoding E (Es range is 0.1 - 0.2) Range = 0.3 - 0.2 = 0.1
- low = 0.2 + (0.1 * 0.1) = 0.21
- high = 0.2 + (0.1 * 0.2) = 0.22
- output 0.2
- encoding L (Ls range is 0.3 - 0.6) Range = 0.22 - 0.21 = 0.01
- low = 0.21 + (0.01 * 0.3) = 0.213
- high = 0.21 + (0.01 * 0.6) = 0.216
- output 0.21
- encoding the next L (Ls range is 0.3 - 0.6) Range = 0.216 - 0.213 = 0.003
- low = 0.213 + (0.003 * 0.3) = 0.2139
- high = 0.213 + (0.003 * 0.6) = 0.2148
- no output
- encoding O (Os range is 0.6 - 0.8) Range = 0.2148 - 0.2139 = 0.0009
- low = 0.2139 + (0.0009 * 0.6) = 0.21444
- high = 0.2139 + (0.0009 * 0.8) = 0.21462
- output 0.214
- encoding W (Ws range is 0.9 - 1.0) Range = 0.21462 - 0.21444 = 0.00018
- low = 0.21444 + (0.00018 * 0.9) = 0.214602
- high = 0.21444 + (0.00018 * 1.0) = 0.21462
- output 0.2146
- encoding O (Os range is 0.6 - 0.8) Range = 0.21462 - 0.214602 = 0.000018
- low = 0.214602 + (0.000018 * 0.6) = 0.2146128
- high = 0.214602 + (0.000018 * 0.8) = 0.2146164
- output 0.21461
- encoding R (Rs range is 0.8 - 0.9) Range = 0.2146164 - 0.2146128 = 0.0000036
- low = 0.2146128 + (0.0000036 * 0.8) = 0.21461568
- high = 0.2146128 + (0.0000036 * 0.9) = 0.21461604
- no output
- encoding L (Ls range is 0.3 - 0.6) Range = 0.21461604 - 0.21461568 = 0.00000036
- low = 0.21461568 + (0.00000036 * 0.3) = 0.214615788
- high = 0.21461568 + (0.00000036 * 0.6) = 0.214615896
- output 0.214615
- encoding D (Ds range is 0.0 - 0.1) Range = 0.214615896 - 0.214615788 = 0.000000108
- low = 0.214615788 + (0.000000108 * 0.0) = 0.214615788
- high = 0.214615788 + (0.000000108 * 0.1) = 0.2146157988
- output 0.2146157 and 88 from low
- ...
复制代码Giving the table below:
| Character | Frequency | Probability | Range | Low | High |
| H | 1 | 1/10 | 0.2 – 0.3 | 0.2 | 0.3 |
| E | 1 | 1/10 | 0.1 – 0.2 | 0.21 | 0.22 |
| L | 3 | 3/10 | 0.3 – 0.6 | 0.213 | 0.216 |
| L | 3 | 3/10 | 0.3 – 0.6 | 0.2139 | 0.2148 |
| O | 2 | 3/10 | 0.6 – 0.8 | 0.21444 | 0.21462 |
| W | 1 | 1/10 | 0.9 – 1.0 | 0.214602 | 0.214620 |
| O | 2 | 2/10 | 0.6 – 0.8 | 0.2146128 | 0.2146164 |
| R | 1 | 1/10 | 0.8 – 0.9 | 0.21461568 | 0.21461604 |
| L | 3 | 3/10 | 0.3 – 0.6 | 0.214615788 | 0.214615896 |
| D | 1 | 1/10 | 0.0 – 0.1 | 0.214615788 | 0.214615806 |
So the final low value 0.214615788 will encode the string “HELLOWORLD” (without the space, which was omitted for clarity).