Huffword

Huffword is a term used in Managing Gigabytes by Ian H. Witten, Alistair Moffat, and Timothy C. Bell, for Huffman coding of text where each symbol represents a word instead of a single character. The codec is useful for compressing text with random access even on machines with limited work RAM, such as text in an electronic book or character dialogue in a video game.

Huffword
All languages have inherent redundancy to protect against misunderstandings due to noise. This manifests in the written word as correlation among characters in a text. Assuming a memory with no noise, a text can be made to fit in a smaller memory using lossless data compression.

Word-based compression segments a text into words, defined as sequences of letters and numbers, and nonwords, defined as sequences of other characters. This improves on coding a character at a time because it captures context between characters (e.g. q appears most often before u) that a character-based Huffman code cannot capture alone. This definition of a word fails for languages written in scripts without word dividers or for highly synthetic languages with more hapax legomena (words that occur only once in a collection of text) than a far less synthetic language such as English, in the same way that word-based full-text search fails. Thus it may not perform so well on polysynthetic languages like Mapudungun where a word is used only four times on average, compared to 50 for a mildly fusional language like Spanish. But it should work well for English and other languages of western Europe.

The LZ77 family of lossless codecs also captures context. LZ77 captures context not only within a word but also between words, which Huffword alone cannot capture without some sort of Markov chain or higher-order dictionary. But LZ77's sliding window approach requires much more work RAM than a method using a static dictionary. Seeking within a compressed document is also much more time-consuming with LZ77, as the decoder has to restart from the beginning of a compressed block.

Making a Huffword table

 * 1) Scan the text, make the sequence of words and nonwords, and collect the multiset of all words and the multiset of all nonwords.
 * 2) Optional: Replace the least frequent words and nonwords with an escape code.
 * 3) Make canonical Huffman codes for each.
 * 4) Concatenate all the words into one string and all the nonwords into another string.
 * 5) Make canonical Huffman codes for these.

You'll end up with the following items:
 * Code length table for words
 * Sorted list of words
 * Code length table for nonwords
 * Sorted list of nonwords
 * Code length table for word characters
 * Sorted list of word characters
 * Code length table for nonword characters
 * Sorted list of nonword characters

Encoding

 * 1) For each document, store a bit for whether it begins with a word or nonword.
 * 2) Encode each word or each nonword using the respective word code. If the word does not appear in the document, encode the escape code, the length, and the individual characters using the character code.
 * 3) Encode the dictionaries using the respective character codes, and keep pointers to the start of each word.

Implementing it on an 8-bit machine
Section to be completed later.

Storing the dictionary
There is one "entry point" to the dictionary per distinct word, and these words are short enough that storing such entry points becomes a problem. One solution is to break the word list into groups of 16 and then for each group, store a single pointer to the start of the group holding these words along with the length of compressed data from the start of the block in bits for each of the 16 words. Provided each word can be encoded in 255 bits or fewer, this costs 148 bits per block or 9.125 bits per word: one 16-bit pointer to the start of the block, sixteen 8-bit lengths in bits, and on average 4 bits of slack space in the last byte of the group's data.

Given that words of the same code length are already sorted, one might think to store the common prefix of such words separately. This "front compression" method works well in cases where consecutive words share enough letters that storing the number of letters repeated from the previous word is a substantial win over repeating those letters in each dictionary entry. The words would be broken into blocks of 4 or 16 or so, and something like "general", "generic", "genesis" would become "general", 5, "ic", 4, "sis". To decode a single word, one would start at the beginning of a block and start decoding consecutive words until reaching the desired word. But this turns out not to save much space in Huffword, as words of similar frequency (and thus in the same code-length bin) don't share enough initial letters. One can, however, shrink the dictionary by cutting out the least frequent words. For example, instead of storing any hapax legomenon in the dictionary, store in the text a single escape code for a hapax followed by the spelling (in character codes) of that word.

Storing the text
A collection is divided into "documents", which may in turn be divided into "pages". For example, a role-playing game's script may have one page for each dialogue cue. Except in deliberately restricted languages such as Sonja Lang's Toki Pona, C. K. Ogden's Basic English, or the word-per-page presentation of English in Gray and Sharp's Dick and Jane, there are generally far fewer pages than distinct words. So we don't have to worry near as much about slack space within a byte or pointer overhead to an offset within a byte. A 24-bit pointer for each page, including a bank number, should be small enough.

Paging behavior
An 8-bit machine may have a 1 to 2 KiB work RAM, a larger video RAM, and a large nonvolatile memory divided into 16 KiB banks: one always accessible and one that can be paged out. Some memory configurations may impose a substantial penalty in cycles to switch to another bank, such as 64 CPU cycles on the Nintendo MMC1, comparable to a TLB miss. Therefore, we want to work in one bank as much as possible, storing intermediate results to the little work RAM we have.

One trick for this is to make three passes over the text, switching in a different bank each time:
 * 1) Decode several dozen indices to the word and nonword dictionaries.
 * 2) Decode characters from these dictionaries and perform word wrapping.
 * 3) Draw glyphs for a line's worth of characters to the frame buffer using a proportional font.

Another trick is to decode word and nonword indices as variable-length quantities in the intermediate buffer. Byte values $00-$7F represent indices 0 through 127, which are more common in the text due to the canonical Huffman code construction, while values $80-$FF are the first byte of a 2-byte quantity. This allows more words to be kept in a smaller buffer.

Test cases
I have been using two e-books published by Project Gutenberg as test cases: The Adventures of Pinocchio by Carlo Collodi, translated by Carol Della Chiesa, and The Secret Garden, by Frances Hodgson Burnett. My early results predict roughly 60 to 66 percent reduction in size compared to coding each character as 8 bits.