Compression

Post by byuu »

Sorry to post this here, but I'd like to get some intelligent programmer feedback and I know of few other forums with highly skilled programmers.

I've been thinking recently about the issues I have with data compression, and about what I'd prefer. This is what I favor, personally:
- fast decompression: compression is performed only once, so decompression speed is much more important, though compression speed should be considered as well
- simple algorithm: preferably under 20 lines of C code. The more complex the algorithm, the less likely it is to be ported, especially to older architectures such as the SNES; the harder it is for someone to extract data in said format; and the more complex compression libraries for the format will be
- licensing: a good compression algorithm, along with its decoder, should be public domain, or at most carry the single requirement that the format not be changed, to prevent incompatibilities
- compression ratio: this seems the least important to me; you must consider the relative file sizes here. If a lossless compressor tailored to said format gives a 35-50% ratio, and 7-Zip provides a 45-60% ratio at the cost of all of the above conditions failing, the former is the better all-around algorithm, as both significantly reduce file size
- internationalization: zones tells me that ZIP, GZ, and JMA do not clearly define filename encoding; ideally, a format that mandates UTF-8 would be preferable
- adaptability: solid-state archiving should be an optional feature, as should compressionless storage, for when all that is desired is a language-agnostic container with a clean library interface

Now, what I'm curious about is what other people think is most important in a general-purpose compression algorithm; Nach especially, as he undoubtedly has the most experience of anyone here.

Would indexing into a specific file position be important, e.g. for lossless audio compression? That would require an entropy coder, e.g. Huffman, but would lose even more compression power. Or would something generic like LZSS be preferred? It makes solid-state archiving possible and gains additional compression power, but it makes seeking impossible, since you must know all the preceding data in order to decompress.
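
To put the "20 lines of C" goal in perspective, here is a minimal sketch of such an LZSS decoder. The field widths (12-bit offset, 4-bit length, one flag byte per eight items) are illustrative choices, not a defined format, and it assumes well-formed input:

Code: Select all

#include <stdint.h>
#include <stddef.h>

/* LZSS decoder sketch: each flag byte announces eight items; a set bit
   means a literal byte, a clear bit means a 16-bit (offset:12, length:4)
   back-reference into already-decoded output. Assumes well-formed input. */
size_t lzss_decompress(const uint8_t *in, size_t in_len, uint8_t *out) {
    size_t ip = 0, op = 0;
    while (ip < in_len) {
        uint8_t flags = in[ip++];
        for (int bit = 0; bit < 8 && ip < in_len; bit++) {
            if (flags & (1 << bit)) {
                out[op++] = in[ip++];               /* literal */
            } else {
                uint16_t pair = (uint16_t)(in[ip] << 8 | in[ip + 1]);
                ip += 2;
                size_t offset = pair >> 4;          /* distance back */
                size_t length = (pair & 15) + 3;    /* minimum match of 3 */
                while (length--) { out[op] = out[op - offset]; op++; }
            }
        }
    }
    return op;  /* decompressed size */
}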

Any other major issues I'm missing? I'm not suggesting actually making a big public format; this is really more for storing my own work in. I hate having to use zlib for, say, GZ/ZIP decompression. I'd also like to share this as a library for other developers to use in their own projects if desired. Or would a compromise be best, e.g. implementing both a simple entropy coder and a simple solid-state encoder, and letting the user decide? That could make things difficult for, say, audio players, when the user compressing the data doesn't choose the right option when making the archive.

One idea would be to store a list of "block indexes" before the file data with LZSS, say one per megabyte, and then simply flush the window buffer at each point. This would only marginally affect the LZSS compression ratio, but would allow arbitrary seeking. With really large windows, though, it would defeat most of the advantage of solid-state archiving; in fact, the block size would effectively become the limit on the useful window size.
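
A rough sketch of that layout, with hypothetical names: an index of compressed offsets, one per fixed-size block of uncompressed data, where the window is flushed at each boundary so any block can be decoded on its own:

Code: Select all

#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE (1 << 20)  /* 1 MiB per block, as suggested above */

struct seek_index {
    uint32_t  block_count;
    uint64_t *offsets;  /* offsets[i] = where block i starts in the compressed stream */
};

/* Stand-in for any self-contained block decoder (e.g. the LZSS routine above). */
size_t lzss_decompress_block(const uint8_t *in, size_t in_len, uint8_t *out);

/* Decompress only the block containing uncompressed position `pos`. */
size_t read_block_at(const struct seek_index *idx, const uint8_t *compressed,
                     size_t compressed_len, uint64_t pos, uint8_t *out) {
    uint32_t block = (uint32_t)(pos / BLOCK_SIZE);
    if (block >= idx->block_count) return 0;
    uint64_t begin = idx->offsets[block];
    uint64_t end   = (block + 1 < idx->block_count)
                   ? idx->offsets[block + 1] : compressed_len;
    return lzss_decompress_block(compressed + begin, (size_t)(end - begin), out);
}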

Post by Nach »

An existing file format doesn't matter much if you want to do your own thing. Just take something good and come up with a simple format around it to match your needs.

Based on what you wrote, I think you want something LZO-based: it has extremely fast decompression, and the code is smallish too. You also get the benefit of bragging that you now use outer space technology.
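
For anyone curious what that looks like in practice, here's a small round-trip using miniLZO's documented API (lzo_init(), lzo1x_1_compress(), lzo1x_decompress()); the output buffer follows the worst-case expansion formula from the LZO documentation:

Code: Select all

#include <stdio.h>
#include <string.h>
#include "minilzo.h"  /* miniLZO: the small self-contained subset of LZO */

int main(void) {
    static unsigned char in[4096], out[4096 + 4096 / 16 + 64 + 3], back[4096];
    static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];  /* compressor scratch */
    lzo_uint in_len = sizeof(in), out_len, back_len = sizeof(back);

    memset(in, 'A', in_len);  /* trivially compressible test data */
    if (lzo_init() != LZO_E_OK) return 1;

    /* LZO1X-1: the fast default variant */
    if (lzo1x_1_compress(in, in_len, out, &out_len, wrkmem) != LZO_E_OK) return 1;
    printf("%lu -> %lu bytes\n", (unsigned long)in_len, (unsigned long)out_len);

    /* decompression needs no work memory at all */
    if (lzo1x_decompress(out, out_len, back, &back_len, NULL) != LZO_E_OK) return 1;
    return memcmp(in, back, in_len) != 0;
}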

If you want to work on a simple format based off of LZO together, I'm interested.
Post by byuu »

Nach wrote: If you want to work on a simple format based off of LZO together, I'm interested.
I'm not familiar with LZO, but I'll research it, and I certainly don't mind working with you on another format.

Ok then, so the goals are:
- as small and simple as possible
- support UTF-8 and subdirectories
- as fast as possible
- public domain algorithm + source code library
- attempt to get somewhere close to ZIP in terms of compression ratio, if possible
- allow granular blocks for the purposes of streaming data

Sound good? Any other suggestions? Once we know what we want, the command-line tools shouldn't take more than two or three days to create.

EDIT:
A free software tool which implements it is lzop. The original library was written in ANSI C, and it has been made available under the GNU General Public License. Versions of LZO are available for the Perl, Python and Java languages. The copyright for the code is owned by Markus F. X. J. Oberhumer.
GPL is unacceptable. We may need to create our own, or choose an LZ variant whose code isn't encumbered by copyright or patents.

Post by Nach »

If you want pure PD, there aren't too many options available.
If you have nothing against zlib, that can be used in a different container without too much difficulty. And for just decompressing a deflate stream, there isn't that much code.
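
For what it's worth, the one-shot zlib calls really are minimal; a sketch of a full round-trip with the standard compress()/uncompress() API:

Code: Select all

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    /* out is sized comfortably above the worst case; compressBound()
       computes the exact bound at runtime if you need it */
    static unsigned char in[4096], out[8192], back[4096];
    uLongf out_len = sizeof(out), back_len = sizeof(back);

    memset(in, 'A', sizeof(in));
    if (compress(out, &out_len, in, sizeof(in)) != Z_OK) return 1;   /* deflate */
    printf("%u -> %lu bytes\n", (unsigned)sizeof(in), (unsigned long)out_len);

    if (uncompress(back, &back_len, out, out_len) != Z_OK) return 1; /* inflate */
    return memcmp(in, back, sizeof(in)) != 0;
}
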
Post by byuu »

Well, there's always making our own algorithms, then.

Deflate is a little complicated: you have to support three separate compression modes, involving Huffman, LZ, and an interesting hybrid combination of the two. It's got a good compression ratio, sure, but still.
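
For reference, the three block types deflate defines (RFC 1951) are stored, fixed Huffman, and dynamic Huffman; every block opens with a 1-bit BFINAL flag and a 2-bit BTYPE field, read least-significant-bit first. A minimal header reader:

Code: Select all

#include <stdint.h>

enum { BTYPE_STORED = 0, BTYPE_FIXED_HUFFMAN = 1, BTYPE_DYNAMIC_HUFFMAN = 2 };

struct bitreader { const uint8_t *data; uint32_t bitpos; };

/* Pull n bits from the stream, LSB-first, as deflate requires. */
static uint32_t get_bits(struct bitreader *br, int n) {
    uint32_t v = 0;
    for (int i = 0; i < n; i++, br->bitpos++)
        v |= (uint32_t)((br->data[br->bitpos >> 3] >> (br->bitpos & 7)) & 1) << i;
    return v;
}

/* Read one block header; returns BTYPE (3 is reserved/invalid),
   and sets *final to 1 on the stream's last block. */
int read_block_header(struct bitreader *br, int *final) {
    *final = (int)get_bits(br, 1);  /* BFINAL */
    return (int)get_bits(br, 2);    /* BTYPE  */
}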

I've been thinking about compression, and I have an idea for what I'll call "block huffman": essentially, adding multi-byte combinations to the tree alongside the single bytes. It's nice because if your archive gets partially corrupted, you can restore all data except for a byte or two before and after the corrupted section of the file; with LZ, one bad bit at the start could corrupt even the end of the file. And if you extend it and make it adaptive, it could very well be a much better compressor than LZ+Huffman alone, though it would be slower to both compress and decompress...
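
A sketch of just the alphabet-building stage of that idea, with illustrative parameters: alongside the 256 single-byte symbols, count every adjacent byte pair and promote the most frequent ones to extra symbols; the Huffman tree would then be built over this extended alphabet:

Code: Select all

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define EXTRA_SYMBOLS 64  /* illustrative: how many pair symbols to add */

/* Fill pairs[] with the most frequent adjacent byte pairs in the input;
   these become Huffman symbols 256, 257, ... Returns how many were found. */
size_t best_pairs(const uint8_t *data, size_t len, uint16_t pairs[EXTRA_SYMBOLS]) {
    static uint32_t freq[65536];  /* one counter per possible byte pair */
    memset(freq, 0, sizeof(freq));
    for (size_t i = 0; i + 1 < len; i++)
        freq[(uint16_t)(data[i] << 8 | data[i + 1])]++;

    size_t found = 0;  /* repeatedly pick the current maximum */
    for (; found < EXTRA_SYMBOLS; found++) {
        uint32_t best = 0; uint16_t best_pair = 0;
        for (uint32_t p = 0; p < 65536; p++)
            if (freq[p] > best) { best = freq[p]; best_pair = (uint16_t)p; }
        if (best < 2) break;  /* a pair seen once isn't worth a code word */
        pairs[found] = best_pair;
        freq[best_pair] = 0;
    }
    return found;
}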

Post by Nach »

byuu wrote: Deflate is a little complicated: you have to support three separate compression modes, involving Huffman, LZ, and an interesting hybrid combination of the two. It's got a good compression ratio, sure, but still.
And you've got zlib's compress and decompress, which is tiny code, so you don't have to worry about writing any of the compression or decompression code yourself.