Algorithms for Data Compression

This set of exercises accompanies Lecture of Week 4 of the course Algorithm Design and Analysis.

Short comprehension exercises

Determine the Huffman codes for encoding the folowing input text: YABBADABBADABBADOO

Determine the LZW encoding for the following input text: YABBADABBADABBADOO

Implementation exercises

Implement a simple compression/decompression tool, in one of the variants:

A - Huffman codes

B - LZW compression

Variant A - Compression using Huffman codes

Minimal requirements for variant A

Implement a program that reads a string and builds and displays the Huffman codes for its characters.

Complete requirements for variant A

Implement a FileCompresser tool based on Huffman codes. The tool takes following arguments in the command line:

FileCompresser mode inputfile outputfile

where mode can be -c or -d, meaning compression or decompression.

The tool must accept for compression files of any type (text or binary). After using the tool for a sequence of compression followed by decompression, the obtained file must be identical with the original file.

Implementation: For doing compression, the program reads the inputfile as a binary file, each byte being a character of an alphabet of 256 symbols, and counts the frequencies of the characters. Huffman codes are built. After this, encoding (writing the outputfile) can begin. Information about the used codes (frequencies of characters) has to be also written in the outputfile, in order to enable later decoding. The inputfile is then read again and each character replaced by its Huffman code. For writing in the output file (the compressed file), the Huffman codes must be constructed at bit-level using bitwise operations ! If you are not working at bitlevel, but just output characters '0' and '1', then a code such as 10011 that was meant to be 5 bits will take 5 bytes and the tool will actually expand the size of the input file instead of compressing it ! Also, if the total number of bits used in the encoding of the input file is not a multiple of 8, some "padding" bits have to be added at the end of the output file.

For doing decompression, the program reads from the file the table of character frequencies and builds the Huffman codes (the Huffman tree), using the same algorithm as for the compression. Then, it reads the compressed data and using the Huffman tree decodes the original characters and writes them into the decompressed file.

Variant B - LZW compression

Minimal requirements for variant B

Implement a program that reads a string and displays the coded string and the computed dictionary.

Complete requirements for variant B

Implement a FileCompresser tool based on the LZW algorithm. The tool takes following arguments in the command line:

FileCompresser mode inputfile outputfile

where mode can be -c or -d, meaning compression or decompression.

The tool must accept for compression files of any type(text or binary). After using the tool for a sequence of compression followed by decompression, the obtained file must be identical with the original file.

Implementation:For doing compression, the program reads the inputfile as a binary file, each byte being a symbol of an alphabet of 256 symbols. For every symbol read, it updates the dictionary and maybe outputs some encoding (an index in the current dictionary). Use for the length of the index-word 12 bits in your implementation. If the size of the dictionary increases such that it exceeds the addressable size, exit with an error code.

For doing decompression, the program reads the encoded file as a binary file, where every 12 bits represent an index into the dictionary. For every read index, the program outputs the corrsponding string from the dictionary. The dictionary is continuously updated according to the LZW algorithm.