Introduction

According to Wikipedia:

    Arithmetic coding is a method for lossless data compression. Normally, a string of characters such as the words ‘hello there’ is represented using a fixed number of bits per character, as in the ASCII code. Like Huffman coding, arithmetic coding is a form of variable-length entropy encoding that converts a string into another form that represents frequently used characters using more bits, with the goal of using fewer bits in total. As opposed to other entropy encoding techniques that separate the input message into its component symbols and replace each symbol with a code word, arithmetic coding encodes the entire message into a single number, a fraction ne, where (0.0 ≤ n < 1.0).

Some of the properties of arithmetic coding are:

  1. When applied to independent and identically distributed (IID) sources, the compression of each symbol is provably optimal.
  2. It is effective in a wide range of situations and compression ratios. The same arithmetic coding implementation can effectively code all the diverse data created by different processes such as modeling parameters, transform coefficients, signaling etc.
  3. It simplifies the modeling of complex sources, yielding near optimal or significantly improved compression for sources that are not IID.
  4. Its main process is arithmetic, which is supported with ever-increasing efficiency by all general purpose or digital processors.
  5. It is suited for use as a compression black-box by those who are not coding experts or do not want to implement the coding algorithm themselves.

In this article, I present an implementation of an arithmetic coder in C# (coder.cs in the source code). The coding algorithm is adapted from an algorithm originally presented in C by Mark Nelson in his paper available here (http://dogma.net/markn/articles/arith/part1.htm). The coding algorithm has been separated from the statistical model and encapsulated as a C# object.

Description


The ‘coder’ object in the source code attached to this article can work as a black box implementation of an arithmetic encoder/decoder. However, to use it, we need to understand the basics of statistical modeling for arithmetic coding and a few of the basic concepts of arithmetic coding.

In general, using arithmetic coding depends on creating a statistical model of the data. In this example, I will assume that we are trying to encode the words “HELLO WORLD”.

Creating the statistical model of the data proceeds as follows:

  1. Taking the number of independent characters in the words to be encoded (HELLO WORLD), we obtain, in alphabetical order:
        1. D
        2. E
        3. H
        4. L
        5. O
        6. R
        7. W

      The total number of characters to be encoded is 10. For convenience and clarity, I have ignored the space between the words HELLO and WORLD.
  2. We can arrange these characters into a table with their corresponding frequency, as below:
     
CharacterFrequency
D1
E1
H1
L3
O2
R1
W1

  3. Each of the characters will be assigned a range based on its frequency/ probability of occurrence. This range will be between 0 and 1, as below (note that I have not used any optimizations for the frequency and probability model; for the most optimal compression, this is necessary; however, for the purposes of our example, this should do).
     
CharacterFrequencyProbabilityRange
D11/100.0 – 0.1
E11/100.1 – 0.2
H11/100.2 – 0.3
L33/100.3 – 0.6
O22/100.6 – 0.8
R11/100.8 – 0.9
W11/100.9 – 1.0

  4. The algorithm for encoding is as below:
  1.       Set low to 0.0
  2.       Set high to 1.0
  3.       While there are still input symbols do
  4.           get an input symbol
  5.           code_range = high - low.
  6.                 high = low + code_range *  high_range of the symbol being coded
  7.               low = low + code_range * low_range of the symbol being coded
  8.       End of While
  9.       output low
复制代码
Applying this to our input (HELLO WORLD), we obtain:
  1.       encoding H (Hs range is 0.2 - 0.3) Range(or code_range above) = 1 - 0 = 1
  2.       low =  0 + (1 * 0.2) = 0.2
  3.       high = 0 + (1 * 0.3) = 0.3
  4.       no output
  5.       encoding E (Es range is 0.1 - 0.2) Range = 0.3 - 0.2 = 0.1

  6.       low =  0.2 + (0.1 * 0.1) = 0.21
  7.       high = 0.2 + (0.1 * 0.2) = 0.22
  8.       output 0.2
  9.       encoding L (Ls range is 0.3 - 0.6) Range = 0.22 - 0.21 = 0.01

  10.       low =  0.21 + (0.01 * 0.3) = 0.213
  11.       high = 0.21 + (0.01 * 0.6) = 0.216
  12.       output 0.21
  13.       encoding the next L (Ls range is 0.3 - 0.6) Range = 0.216 - 0.213 = 0.003

  14.       low =  0.213 + (0.003 * 0.3) = 0.2139
  15.       high = 0.213 + (0.003 * 0.6) = 0.2148
  16.       no output
  17.       encoding O (Os range is 0.6 - 0.8) Range = 0.2148 - 0.2139 = 0.0009
  18.       low =  0.2139 + (0.0009 * 0.6) = 0.21444
  19.       high = 0.2139 + (0.0009 * 0.8) = 0.21462
  20.       output 0.214
  21.       encoding W (Ws range is 0.9 - 1.0) Range = 0.21462 - 0.21444 = 0.00018
  22.       low =  0.21444 + (0.00018 * 0.9) = 0.214602
  23.       high = 0.21444 + (0.00018 * 1.0) = 0.21462
  24.       output 0.2146
  25.       encoding O (Os range is 0.6 - 0.8) Range = 0.21462 - 0.214602 = 0.000018
  26.       low =  0.214602 + (0.000018 * 0.6) = 0.2146128
  27.       high = 0.214602 + (0.000018 * 0.8) = 0.2146164
  28.       output 0.21461
  29.       encoding R (Rs range is 0.8 - 0.9) Range = 0.2146164 - 0.2146128 = 0.0000036
  30.       low =  0.2146128 + (0.0000036 * 0.8) = 0.21461568
  31.       high = 0.2146128 + (0.0000036 * 0.9) = 0.21461604
  32.       no output
  33.       encoding L (Ls range is 0.3 - 0.6) Range = 0.21461604 - 0.21461568 = 0.00000036
  34.       low =  0.21461568 + (0.00000036 * 0.3) = 0.214615788
  35.       high = 0.21461568 + (0.00000036 * 0.6) = 0.214615896
  36.       output 0.214615
  37.       encoding D (Ds range is 0.0 - 0.1) Range = 0.214615896 - 0.214615788 = 0.000000108
  38.       low =  0.214615788 + (0.000000108 * 0.0) = 0.214615788
  39.       high = 0.214615788 + (0.000000108 * 0.1) = 0.2146157988
  40.       output 0.2146157 and 88 from low
  41.       ...
复制代码
Giving the table below:

CharacterFrequencyProbabilityRangeLowHigh
H11/100.2 – 0.30.20.3
E11/100.1 – 0.20.210.22
L33/100.3 – 0.60.2130.216
L33/100.3 – 0.60.21390.2148
O23/100.6 – 0.80.214440.21462
W11/100.9 – 1.00.2146020.214620
O22/100.6 – 0.80.21461280.2146164
R11/100.8 – 0.90.214615680.21461604
L33/100.3 – 0.60.2146157880.214615896
D11/100.0 – 0.10.2146157880.214615806

      So the final low value 0.214615788 will encode the string “HELLOWORLD” (without the space, which was omitted for clarity).
TOP