
QUESTION

Is there an efficient way to encode words into bits, using only a list of probabilities, without having to draw any graphs?

EXAMPLE

This list of words:

  • Pr('Apple') = 0.5
  • Pr('Banana') = 0.17
  • Pr('Orange') = 0.17
  • Pr('Pineapple') = 0.033

Will get assigned these bits:

[image: the Huffman codes assigned to each word]

So this list of words: Apple Banana Banana Pineapple ...

Becomes this list of bits: 0 100 100 110100

Problem: I used a graph, and this is time-consuming. So I'd like to know if there is a neat algorithm that lets me encode the words into bits directly from the probabilities, without having to draw graphs first.

  • A Huffman encoding always represents a tree. What do you mean by "having to draw" that graph?
    – Bergi
    Commented Jul 1 at 11:56
  • @Bergi Like… drawing it on a pen and paper. You know, I'm sort of a retard and when I want to think about Huffman encoding, I pull a pen and a paper, draw the tree, then think which bits go where. Like, there is a hierarchy that my historic brain cannot get unless I draw graphs using a pen and a paper. So I'm asking my more evolved relatives to see if they know a better way.
    – caveman
    Commented 2 days ago
  • I assume you're only drawing it on pen and paper for learning purposes. A computer algorithm will handle all the encoding and decoding for you?
    – qwr
    Commented 2 days ago
  • @qwr - Yes, accepted answer solved it. But no, I was using pen and paper because I'm retarded, as I said earlier. Not sure how many times I should repeat this. Maybe I'm not alone ;).
    – caveman
    Commented 2 days ago

1 Answer


If you're allowed to reorder the symbols, then the answer is yes. We'll approach this in three stages.

Linear-time Huffman tree construction

First things first. If the symbols are sorted into order of increasing probability, Huffman trees can be constructed in linear time and space.

The algorithm is extremely simple:

Push all symbols onto a queue q1, in order of increasing probability.
Let q2 be an empty queue.

while there is more than one node on q1 and q2 combined:
    Dequeue the two lowest-probability nodes, x1 and x2.
    (You only need to consult the first two elements on q1 and the first
    two elements on q2.)

    Combine these into a single tree node, and push that onto q2.

The remaining node is the root of the Huffman tree.

To prove that this algorithm is correct, you only need to show that the probabilities in q2 are in increasing (well, nondecreasing) order.
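If a concrete version helps, here is a minimal Python sketch of the two-queue construction; it assumes the probabilities are already sorted in ascending order and returns each symbol's code length rather than an explicit tree (the function and variable names are my own, not part of any library):

from collections import deque

def huffman_code_lengths(probs):
    """Two-queue Huffman construction; probs must be sorted in ascending order."""
    n = len(probs)
    if n == 1:
        return [1]  # a lone symbol still needs one bit to be decodable

    # q1 holds leaves, q2 holds merged nodes; both stay in nondecreasing order,
    # so only the queue fronts ever need to be compared (no heap required).
    # A node is (probability, payload): a leaf's payload is its symbol index,
    # a merged node's payload is a pair of child nodes.
    q1 = deque((p, i) for i, p in enumerate(probs))
    q2 = deque()

    def pop_smallest():
        if not q2 or (q1 and q1[0][0] <= q2[0][0]):
            return q1.popleft()
        return q2.popleft()

    while len(q1) + len(q2) > 1:
        a = pop_smallest()
        b = pop_smallest()
        q2.append((a[0] + b[0], (a, b)))

    # Walk the finished tree once to read off each leaf's depth.
    depths = [0] * n
    stack = [(q2[0], 0)]
    while stack:
        (_, payload), depth = stack.pop()
        if isinstance(payload, int):
            depths[payload] = depth
        else:
            left, right = payload
            stack.append((left, depth + 1))
            stack.append((right, depth + 1))
    return depths

# Example (sorted ascending): the most probable symbol gets the shortest code.
print(huffman_code_lengths([0.05, 0.1, 0.15, 0.2, 0.5]))  # -> [4, 4, 3, 2, 1]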

Canonical Huffman coding

The next thing to observe is that if we are free to reorder symbols, we don't need to construct a tree at all. We only need the length of each code.

Let's suppose we have the following symbols and their corresponding code length, sorted into decreasing order of code length:

Symbol   Length
a        6
b        6
c        5
d        4
e        4
f        4
g        3
h        3
i        3
j        3
k        2

You can verify that this is a valid Huffman code by checking that the implicit probabilities corresponding to the code lengths sum to 1:

$$\displaystyle 2^{-2} + 4\times 2^{-3} + 3\times 2^{-4} + 2^{-5} + 2 \times 2^{-6} = 1$$
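In Python, this check is a one-liner (the powers of two here are exact in binary floating point, so the comparison is exact):

lengths = [6, 6, 5, 4, 4, 4, 3, 3, 3, 3, 2]
assert sum(2.0 ** -length for length in lengths) == 1.0   # Kraft equality holds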

We can define a canonical Huffman code by starting at zero, and giving each symbol the value of the previous code plus one, truncated (by dropping low-order bits) to the appropriate code length:

Symbol   Length    Code
a        6         000000
b        6         000001
c        5         00001
d        4         0001
e        4         0010
f        4         0011
g        3         010
h        3         011
i        3         100
j        3         101
k        2         11

You can see that this is a valid Huffman code by constructing the tree if you like.
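Here is a short Python sketch of that assignment rule, assuming the code lengths arrive already sorted in decreasing order as in the table above (the helper name is mine):

def canonical_codes(lengths):
    """Assign canonical codes to code lengths sorted in decreasing order."""
    codes = []
    code = 0
    prev_len = lengths[0]
    for length in lengths:
        # "Previous code plus one, then truncate": since lengths never increase,
        # truncation is just a right shift by the difference in lengths.
        code = (code + (1 if codes else 0)) >> (prev_len - length)
        codes.append(format(code, f"0{length}b"))
        prev_len = length
    return codes

lengths = [6, 6, 5, 4, 4, 4, 3, 3, 3, 3, 2]
for symbol, code in zip("abcdefghijk", canonical_codes(lengths)):
    print(symbol, code)

Running this reproduces the table above, from 'a 000000' down to 'k 11'.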

But we don't actually need to store this entire table; you only need to store a table with one entry per code length:

length   first_code    first_symbol
1        ∞             ?
2        11    = 3     k
3        010   = 2     g
4        0001  = 1     d
5        00001 = 1     c
6        00000 = 0     a

Ignore the entry for code length 1 for now; we'll show how to deal with the infinity soon. The important point is that it is a value that is greater than any possible code of that length.

Now, you can encode a Huffman code using this algorithm:

while there are more symbols {
    symbol := get_symbol
    length := min_code_length (2 in this example)
    while symbol < first_symbol[length] {
        length := length + 1
    }
    code := first_code[length] + symbol - first_symbol[length]
    output(code, using length bits)
}

And you can decode a Huffman code using this algorithm:

while there are more bits {
    code := 0
    length := 0
    do {
        code := 2*code + get_bit()
        length := length + 1
    } while code < first_code[length]
    symbol := code - first_code[length] + first_symbol[length]
    output(symbol)
}

No tree traversal is required; you only need to iterate over the bits of the code.
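Here is a small, self-contained Python sketch of both loops, hard-coding the tables from the example. Symbols are represented by their indices 0 through 10 in the order a..k, and the length-1 entry of first_code uses the finite stand-in value derived below rather than infinity:

# Tables from the example above.  Symbols are numbered a=0 ... k=10, i.e. in
# decreasing order of code length, which is what makes the comparisons work.
MIN_LEN = 2
first_code   = {1: 2, 2: 3, 3: 2, 4: 1, 5: 1, 6: 0}   # 2 stands in for "infinity"
first_symbol = {2: 10, 3: 6, 4: 3, 5: 2, 6: 0}         # k=10, g=6, d=3, c=2, a=0

def encode(symbols):
    bits = []
    for symbol in symbols:
        length = MIN_LEN
        while symbol < first_symbol[length]:
            length += 1
        code = first_code[length] + symbol - first_symbol[length]
        bits.append(format(code, f"0{length}b"))
    return "".join(bits)

def decode(bits):
    symbols, pos = [], 0
    while pos < len(bits):
        code, length = 0, 0
        while True:                       # the do-while loop from the pseudocode
            code = 2 * code + int(bits[pos])
            pos += 1
            length += 1
            if code >= first_code[length]:
                break
        symbols.append(code - first_code[length] + first_symbol[length])
    return symbols

message = [10, 4, 0, 7]            # k, e, a, h
encoded = encode(message)
print(encoded)                      # 11 0010 000000 011, without the spaces
assert decode(encoded) == message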

There are a few complications to this method.

The first is that maybe you can't reorder the symbols, or you need the symbols in a specific form. No problem; just add a layer of indirection to rename the symbols. An array of length $n$ (where $n$ is the number of symbols) will do the trick, at least for decoding.

Second, what about that infinity? You can deal with this by computing the first_code array in a slightly different, and much more efficient, way. Suppose you have the number of symbols of each length stored in an array:

length    number_of_symbols
1         0
2         1
3         4
4         3
5         1
6         2

You can now compute the first_code entries using this algorithm (the division is integer division, so adding 1 before halving rounds the result up):

first_code[max_length] := 0
k := max_length - 1
while k > 0 {
    first_code[k] := (first_code[k+1] + number_of_symbols[k+1] + 1) / 2
    k := k - 1
}

This computes the table:

length   first_code    first_symbol
1        10    = 2     ?
2        11    = 3     k
3        010   = 2     g
4        0001  = 1     d
5        00001 = 1     c
6        00000 = 0     a

You can verify for yourself that this works, because the value 2 is greater than any code of length 1. Note that the "question mark" first_symbol entry is never consulted during decoding; for encoding, this needs a little more care.
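A direct Python transcription of this computation is below. It also fills in first_symbol as a running total of the per-length counts, which the tables above imply but the pseudocode doesn't spell out; // is integer division:

def build_tables(number_of_symbols, max_length):
    """Compute first_code and first_symbol from per-length symbol counts."""
    first_code = [0] * (max_length + 1)    # index 0 unused
    first_symbol = [0] * (max_length + 1)
    for k in range(max_length - 1, 0, -1):
        # Adding 1 before halving rounds (first_code + count) / 2 upward.
        first_code[k] = (first_code[k + 1] + number_of_symbols[k + 1] + 1) // 2
        # Symbols are numbered starting from the longest-code symbol, as in the
        # tables above, so first_symbol is a running total of the counts.
        first_symbol[k] = first_symbol[k + 1] + number_of_symbols[k + 1]
    return first_code, first_symbol

counts = [0, 0, 1, 4, 3, 1, 2]   # counts[k] = number of symbols with code length k
first_code, first_symbol = build_tables(counts, max_length=6)
for k in range(1, 7):
    print(k, format(first_code[k], f"0{k}b"), first_symbol[k])

This prints the table above (the length-1 row shows the stand-in value 2 and an out-of-range first_symbol, which is never used).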

There is one final problem. This algorithm only works if all Huffman codes fit in a machine word. That is, if $w$ is the machine word size (e.g. 32 or 64 on most modern CPUs), then all code words must be that length or less.

Note that this is essentially true if the smallest probability is greater than or equal to $2^{-w}$ (in the worst case, Huffman code lengths can reach roughly $1.44\log_2(1/p_{\min})$, but such Fibonacci-like distributions are pathological). So if the probabilities come from counting real-world items, it is almost certainly true; on a 64-bit machine, you can only "address" $2^{64}$ items, and therefore no symbol could ever have a probability of less than $2^{-64}$, unless it is zero.

Nonetheless, we're computer scientists, and we would like a guarantee. Or maybe this is a file format, and you'd like to limit code lengths to 32 bits for compatibility. And so, for the final piece of the puzzle...

Length-limited Huffman coding

What if we wanted to produce an optimal Huffman code on $n$ symbols, subject to the restriction that there is a maximum code length $L$?

It turns out that this can be solved in $O(nL)$ time and $O(n)$ space by recasting it as a binary coin collector's problem.

The details are a bit too complicated for this post, but the algorithm is detailed in this paper:

Lawrence L. Larmore and Daniel S. Hirschberg. 1990. A fast algorithm for optimal length-limited Huffman codes. J. ACM 37, 3 (July 1990), 464–473.
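Still, to give a flavour of the idea, here is a compact, deliberately unoptimized Python sketch of package-merge. It produces optimal length-limited code lengths, but it keeps explicit leaf sets and re-sorts at every level, so it does not achieve the O(nL) time and O(n) space bounds of the paper:

def length_limited_lengths(weights, L):
    """Package-merge: optimal code lengths with every length <= L.

    weights must be sorted in ascending order and satisfy len(weights) <= 2**L.
    """
    n = len(weights)
    if n == 1:
        return [1]
    coins = [(w, (i,)) for i, w in enumerate(weights)]   # (weight, leaves covered)

    packages = []
    for _ in range(L - 1):
        # Merge this level's coins with the packages carried up from the level
        # below, then pair adjacent items; each pair becomes one package.
        items = sorted(coins + packages)
        packages = [
            (items[i][0] + items[i + 1][0], items[i][1] + items[i + 1][1])
            for i in range(0, len(items) - 1, 2)
        ]

    # At the top level we need total face value n - 1, i.e. the 2(n - 1)
    # cheapest items; a symbol's code length is how many of them contain it.
    lengths = [0] * n
    for _, leaves in sorted(coins + packages)[: 2 * (n - 1)]:
        for i in leaves:
            lengths[i] += 1
    return lengths

print(length_limited_lengths([1, 1, 2, 4], L=3))   # -> [3, 3, 2, 1], same as plain Huffman
print(length_limited_lengths([1, 1, 2, 4], L=2))   # -> [2, 2, 2, 2], forced by the limit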

  • It seems that length := length - 1 in the loop where you start with min_code_length should actually be length := length + 1
    – Bergi
    Commented Jul 1 at 11:54
  • @Bergi Well spotted! Thanks.
    – Pseudonym
    Commented Jul 1 at 13:23
  • Thanks! What does code length mean here? Is it like Apple has a length of $5$ letters? Or is it the entropy of Apple given its probability and the probabilities of other symbols? If it's entropy, any idea how to calculate it? E.g. if my symbols' probabilities are [.5, .17, .17, 0.032, 0.032, 0.032, 0.032, 0.032], then the entropy of revealing which of them I chose is 2.1636992240442248 bits. Not sure I understand "code length".
    – caveman
    Commented Jul 1 at 17:41
  • @caveman It's the length of the Huffman code. The point is that you don't need a Huffman tree, you only need the length of the code for each symbol.
    – Pseudonym
    Commented Jul 1 at 23:47
