
Jan 23, 1995, 6:28:05 PM

Judging by my mail box, there was a *lot* of interest in higher order

compressors (no surprise). The comp.compression FAQ is also rather vague

in that area. So I'm going to extend it a little.

First, arithmetic compression was very well described by Witten, Neal, and

Cleary in "Arithmetic Coding for Data Compression", _Communications of the

ACM_, June 1987, Vol. 30, #6, p. 520-540. I can't possibly do justice to

arithmetic compression here compared to that article. The article includes source

in C as well. Source is available for ftp from ftp.cpsc.ucalgary.ca in

directory /pub/projects/arithmetic.coding. For others, see the FAQ.

The general idea is this: We have an input string of symbols. We want to

generate an encoded string as a result. We have a model which tells how to

encode the input string.

The basic order 0 model is described and source is given in that article.

Compression is achieved by assigning fewer bits to more probable symbols.

We will assume an adaptive model here, so all the symbol probabilities are

based on the portion of the string processed so far. This is nothing

revolutionary.

What arithmetic compression will give you is that the actual encoding process

is a purely mechanical function. Within some arbitrarily small bounds,

arithmetic encoding will automatically encode the symbol at the lowest entropy

it can based on the information from your model. The tough part is designing

a good model of the input string.

Okay, now here's the change. Suppose we are dealing with English text and

you are doing the encoding. If you see the string "tio", you will almost

always assume the next letter is "n" rather than "e" or "s" or a space or

some other more common symbol.

Why? Because when you look at the letter following the sequence "tio" you

can rule out most of the other combinations as improbable. The sequence

"tio" is a context. The probability of a particular symbol changes depending

on the context you are assuming.

The order 0 model assumes no context at all. Every symbol is encoded strictly

based on the frequencies of all symbols seen so far, ignoring what immediately
precedes it. Higher order models are

based on a past history of a few symbols (usually 1-5). The higher order

models will increase their predictive power up to a certain point, then they

will drop off. For instance, after seeing "tion ", what is the next letter?

It could be almost anything. There is a practical limit to the order of the

model: beyond a certain point, the gain in coding no longer justifies the

memory cost (high order models take up lots of memory) or the speed cost

(they also take a while to search).

DMC and PPM and various other "higher order compressors" are all based on the

simple extension of arithmetic encoding where you look at the previous

context of the source string to come up with your symbol table probabilities.

The original article on PPM is "Data Compression using adaptive coding and

partial string matching," by J. Cleary and I. Witten, IEEE Transactions on

Communications, Vol. COM-32, p. 396-402, Apr. 1984.

Cleary and Witten proposed this general idea:

Store a table. Context 0 contains the basic frequencies. Context 1 contains

frequency tables for the symbol following each symbol from context 0. Each

higher level branches out further. This is the generic model; if just

raw arrays were used, the frequency tables for an order 3 model (orders 0

through 3) would take 256+256^2+256^3+256^4 frequencies, which is why the "perfect"

model is never actually used.

So, just store the symbol combinations that have occurred so far (some sort

of garbage collection will be necessary just like in regular arithmetic

compression). When you run into a symbol that has not been seen before,

encode an "escape" and try encoding it with the context one order down.

Keep escaping until at least one context manages to make a match (the

context 0 model should always match).

One of the basic problems is what probability should be assigned to the

escape symbol. Three suggestions have been made. They are all detailed

in "Implementing the PPM Data Compression Scheme", by Alistair Moffat,

_IEEE Transactions on Communications_, Vol. 38, #11, November 1990,

p. 1917-1921. I will use his notation for describing them.

In "Algorithm A", each symbol occurrence has a count, and the escape count

is always "1". So the probability of "escape" is the inverse of the number

of times the context has been used, with that count inflated by 1 to make

room for the extra escape count.

In "Algorithm B", the number of times the context has been used remains

normal, but the count for each symbol is decremented by one. The escape

count is equal to the number of distinct symbols seen in the context.

The problem is that a symbol has to be "double counted" for algorithm B before

the symbol is recognized as in the context. The count of "1" for algorithm A

tended to mean the escape was given an artificially low probability when the

model was first being built up. Thus, the escape mechanism of Algorithm C.

The escape count is added into the total count, and is equal to the

number of distinct symbols seen in the context.

These methods are still not really satisfactory. "HA", "Ashford", and others

have also tuned their escape probabilities but they haven't published the

exact details.

The usual table-flushing problems that Lempel-Ziv derived algorithms have

also occur with Markov modelling systems. As with LZW, either throwing away

the table and rebuilding it from a buffer of recent data, or LRU replacement,

seems to be the most common technique.

So far, this discussion has mostly considered byte-sized symbols. Cleary

and Witten do some work with arbitrary alphabets, but some specific examples

exist for encoding at the bit level.

DMC is one version. It is described in "Data Compression using Dynamic

Markov Modelling," by Gordon Cormack and Nigel Horspool, _Computer Journal_,

v. 30, #6 , p. 541-550, December 1987. Cormack and Horspool will tell you

that DMC is totally different from other Markov modellers because they used

Finite Automata for their model, but Bell and Moffat ("A note on the DMC data

compression scheme", _The Computer Journal_, V. 32, p. 16-20) went back and

rederived the original work and showed that the models are identical based

on the implementation Cormack and Horspool used (although leaving the door

open for some innovative work).

Urban is another version. It was one of the entries to the Dr. Dobbs data

compression contest. See the comp.compression FAQ for details because I

don't have any.

Both of these programs show better relative performance on object code. The

prevailing theory is that they can capitalize on machine instructions sharing

common formats, with bit patterns broken up into various "fields" that the

byte-level compressors miss, and on addresses in a particular code fragment

showing similarity to other addresses in the same fragment. Obviously,

these algorithms should do even better against byte-level systems if they

are used on some types of image data (such as bitmaps).

So, if someone is planning on extending the zip engine using one of these

monsters, I suggest that you implement both a bit-level and a byte-level

compressor. Also make absolutely sure that they can be turned OFF! These

things are notoriously slow, but the memory requirements can be lowered

to a tolerable level with a little work (though nothing you're going to see

in a Doublespace or other related compressing file system any time soon).
