If you're allowed to reorder the symbols, then the answer is yes. We'll approach this in three stages.
Linear-time Huffman tree construction
First things first. If the symbols are sorted into order of increasing probability, Huffman trees can be constructed in linear time and space.
The algorithm is extremely simple:
Push all symbols onto a queue q1, in order of increasing probability.
Let q2 be an empty queue.
while there is more than one node on q1 and q2 combined {
    Remove the two lowest-probability nodes, x1 and x2.
    (You need only consult the front two elements of q1 and the
    front two elements of q2.)
    Combine these into a single tree node, and push that onto q2.
}
The remaining node is the root of the Huffman tree.
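Here's a minimal Python sketch of this two-queue construction, assuming the input arrives as (probability, symbol) pairs already sorted by increasing probability (the function name and node representation are just illustrative choices):

```python
from collections import deque

def huffman_tree(sorted_symbols):
    # sorted_symbols: (probability, symbol) pairs, sorted by
    # increasing probability. Leaves are these pairs; internal
    # nodes are (probability, left, right) triples.
    q1 = deque(sorted_symbols)
    q2 = deque()

    def pop_lowest():
        # The lowest remaining probability is always at the
        # front of q1 or the front of q2.
        if not q2 or (q1 and q1[0][0] <= q2[0][0]):
            return q1.popleft()
        return q2.popleft()

    while len(q1) + len(q2) > 1:
        x1 = pop_lowest()
        x2 = pop_lowest()
        q2.append((x1[0] + x2[0], x1, x2))

    return (q1 or q2)[0]  # the root of the Huffman tree
```

The loop runs $n - 1$ times and each queue operation is constant time, so the whole construction is linear.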
To prove that this algorithm is correct, you need only show that the probabilities on q2 are in increasing (well, nondecreasing) order. This holds because each node pushed onto q2 is the sum of the two smallest remaining probabilities at that moment, and that sum can never decrease from one iteration to the next.
Canonical Huffman coding
The next thing to observe is that if we are free to reorder symbols, we don't need to construct a tree at all. We only need the length of each code.
Let's suppose we have the following symbols and their corresponding code lengths, sorted into decreasing order of code length:
Symbol  Length
a       6
b       6
c       5
d       4
e       4
f       4
g       3
h       3
i       3
j       3
k       2
You can verify that this is a valid Huffman code by summing the implicit probabilities $2^{-\mathrm{length}}$ corresponding to the code lengths and checking that they add up to exactly 1:
$$\displaystyle 2^{-2} + 4\times 2^{-3} + 3\times 2^{-4} + 2^{-5} + 2 \times 2^{-6} = 1$$
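If you want to check this mechanically, here is the same sum in Python (the comparison is exact here, since every term is a power of two and so is represented exactly in binary floating point):

```python
lengths = [6, 6, 5, 4, 4, 4, 3, 3, 3, 3, 2]
assert sum(2.0 ** -l for l in lengths) == 1.0
```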
We can define a canonical Huffman code by giving the first (longest) symbol the all-zeros code, and then giving each subsequent symbol the value of the previous code plus one, truncated (by dropping low-order bits) to its own code length:
Symbol  Length  Code
a       6       000000
b       6       000001
c       5       00001
d       4       0001
e       4       0010
f       4       0011
g       3       010
h       3       011
i       3       100
j       3       101
k       2       11
You can see that this is a valid Huffman code by constructing the tree if you like.
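Here is a short Python sketch of the assignment rule; it reproduces the table above from the (symbol, length) pairs, still sorted by decreasing code length (the function name is my own):

```python
def canonical_codes(symbol_lengths):
    # symbol_lengths: (symbol, code_length) pairs, sorted by
    # decreasing code length.
    codes = {}
    code = 0
    prev_length = symbol_lengths[0][1]
    for symbol, length in symbol_lengths:
        code >>= prev_length - length   # truncate to the new length
        codes[symbol] = format(code, '0%db' % length)
        code += 1                       # next code = this code plus one
        prev_length = length
    return codes

table = [('a', 6), ('b', 6), ('c', 5), ('d', 4), ('e', 4), ('f', 4),
         ('g', 3), ('h', 3), ('i', 3), ('j', 3), ('k', 2)]
assert canonical_codes(table)['i'] == '100'
assert canonical_codes(table)['k'] == '11'
```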
But now, we don't actually need to store this entire table. You need only store a table with one entry per code length:
length  first_code    first_symbol
1       ∞             ?
2       11 = 3        k
3       010 = 2       g
4       0001 = 1      d
5       00001 = 1     c
6       000000 = 0    a
Ignore the entry for code length 1 for now; we'll show how to deal with the infinity soon. The important point is that it is a value that is greater than any possible code of that length.
Now, you can encode a symbol stream using this algorithm (where symbols are numbered in canonical order, a = 0 through k = 10):
while there are more symbols {
    symbol := get_symbol()
    length := min_code_length (2 in this example)
    while symbol < first_symbol[length] {
        length := length + 1
    }
    code := first_code[length] + symbol - first_symbol[length]
    output(code, using length bits)
}
And you can decode a Huffman code using this algorithm:
while there are more bits {
    code := 0
    length := 0
    do {
        code := 2*code + get_bit()
        length := length + 1
    } while code < first_code[length]
    symbol := code - first_code[length] + first_symbol[length]
    output(symbol)
}
No tree traversal is required; you need only do one step per bit of the code.
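To make this concrete, here is a runnable Python sketch of both loops for the example code, with symbols numbered a = 0 through k = 10. The table values are the ones derived in this answer; first_code[1] = 2 stands in for the ∞ entry (any value greater than any 1-bit code works, and this particular value is the one computed below):

```python
# Tables indexed by code length; index 0 is unused.
first_code   = [None, 2, 3, 2, 1, 1, 0]      # first_code[1] = 2 stands in for ∞
first_symbol = [None, None, 10, 6, 3, 2, 0]  # symbols numbered a=0, ..., k=10
MIN_CODE_LENGTH = 2

def encode(symbol):
    # Find the symbol's code length, then its offset from the
    # first code of that length. Returns (code, length).
    length = MIN_CODE_LENGTH
    while symbol < first_symbol[length]:
        length += 1
    return first_code[length] + symbol - first_symbol[length], length

def decode(bits):
    # Consume a list of bits, yielding symbol numbers.
    i = 0
    while i < len(bits):
        code, length = 0, 0
        while True:
            code = 2 * code + bits[i]
            i += 1
            length += 1
            if code >= first_code[length]:
                break
        yield code - first_code[length] + first_symbol[length]

assert encode(8) == (4, 3)                        # 'i' -> 100
assert list(decode([1, 0, 0, 1, 1])) == [8, 10]   # 'i', then 'k'
```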
There are a few complications to this method.
The first is that maybe you can't reorder the symbols, or you need the symbols in a specific form. No problem; just add a layer of indirection to rename the symbols. An array of length $n$ (where $n$ is the number of symbols) will do the trick, at least for decoding.
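For instance (with a made-up alphabet in place of a, …, k), an array maps decoded symbol numbers back to the original symbols, and a dictionary provides the inverse map for encoding:

```python
# Hypothetical original alphabet, listed in canonical order.
real_symbol = ['apple', 'pear', 'plum', 'fig', 'date', 'lime',
               'kiwi', 'mango', 'peach', 'grape', 'melon']
rank = {s: i for i, s in enumerate(real_symbol)}  # inverse, for encoding

assert real_symbol[8] == 'peach'
assert rank['melon'] == 10
```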
Second, what about that infinity? You can deal with this by computing the first_code array in a slightly different, and much more efficient, way. Suppose you have the number of symbols of each code length stored in an array:
length  number_of_symbols
1       0
2       1
3       4
4       3
5       1
6       2
You can now compute the first_code entries using this algorithm:
first_code[max_length] := 0
k := max_length - 1
while k > 0 {
    first_code[k] := (first_code[k+1] + number_of_symbols[k+1] + 1) / 2
    k := k - 1
}
(The division here is integer division; adding 1 to the numerator first makes it round up.)
This computes the table:
length  first_code    first_symbol
1       10 = 2        ?
2       11 = 3        k
3       010 = 2       g
4       0001 = 1      d
5       00001 = 1     c
6       000000 = 0    a
You can verify for yourself that this works, because the value 2 is greater than any code of length 1 (the only 1-bit codes are 0 and 1). Note that the "question mark" first_symbol entry is never consulted during decoding; for encoding, this needs a little more care.
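In Python, under the same conventions as before (arrays indexed by code length, index 0 unused), the computation might look like this:

```python
def compute_first_code(number_of_symbols, max_length):
    # number_of_symbols[k] = how many symbols have code length k.
    first_code = [None] * (max_length + 1)
    first_code[max_length] = 0
    for k in range(max_length - 1, 0, -1):
        # // is integer division; the +1 makes it round up, so that
        # first_code[k] is greater than the k-bit prefix of any
        # longer code.
        first_code[k] = (first_code[k + 1] + number_of_symbols[k + 1] + 1) // 2
    return first_code

counts = [None, 0, 1, 4, 3, 1, 2]  # lengths 1..6, from the table above
assert compute_first_code(counts, 6) == [None, 2, 3, 2, 1, 1, 0]
```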
There is one final problem. This algorithm only works if all Huffman codes fit in a machine word. That is, if $w$ is the machine word size (e.g. 32 or 64 on most modern CPUs), then every code word must be at most $w$ bits long.
In practice, this holds whenever the smallest probability is not dramatically smaller than $2^{-w}$: a Huffman codeword for a symbol of probability $p$ can be somewhat longer than $\log_2(1/p)$ bits, but only for contrived, Fibonacci-like probability distributions. So if the probabilities come from counting real-world items, it is almost certainly true; on a 64-bit machine, you can only "address" $2^{64}$ items, and therefore no symbol could ever have a probability of less than $2^{-64}$, unless it is zero.
Nonetheless, we're computer scientists, and we would like a guarantee. Or maybe this is a file format, and you'd like to limit code lengths to 32 bits for compatibility. And so, for the final piece of the puzzle...
Length-limited Huffman coding
What if we wanted to produce an optimal Huffman code on $n$ symbols, subject to the restriction that there is a maximum code length $L$?
It turns out that this can be solved in $O(nL)$ time and $O(n)$ space by recasting it as a binary coin collector's problem, which the package-merge algorithm solves.
The details are a bit too complicated for this post, but the algorithm is detailed in this paper:
Lawrence L. Larmore and Daniel S. Hirschberg. 1990. A fast algorithm for optimal length-limited Huffman codes. J. ACM 37, 3 (July 1990), 464–473.