Squish - compression module


muddy_shoes(Posted 2011) [#1]
I wanted to make my JSON files smaller and somewhat encrypted for release, so I created a Monkey implementation of the LZW algorithm. I'm likely to need more compression/encoding tools down the road, and it seems likely to be of use to others, so I've created a new module and put it up on Google Code: http://code.google.com/p/monkey-squish/

There's very little to it right now, just an LZW class with two functions, CompressString and DecompressString. Example usage (ignore the PerfTimer stuff):



Compression of test json level definition:
uncompressed length: 3980
compressed length: 1525

Times(ms):

Chrome/HTML5:
From init
time to compress: 15
time to decompress: 1
Repeat run
time to compress: 9
time to decompress: 1

Android - ZTE Blade
From init
time to compress: 403
time to decompress: 357
Repeat run
time to compress: 361
time to decompress: 134
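
For readers who haven't met LZW before, here is a minimal sketch of the round trip in Python rather than Monkey. It is not the squish source: the function names `lzw_compress` and `lzw_decompress` are made up for illustration, the output is a plain list of integer codes rather than a string, and the input is run through UTF-8 first so the dictionary can start from the 256 byte values.

```python
def lzw_compress(text):
    """Encode a string as a list of integer LZW codes."""
    data = text.encode("utf-8")
    # Start with one dictionary entry per possible byte value.
    table = {bytes([i]): i for i in range(256)}
    codes = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            # Keep extending the phrase while it is still known.
            current = candidate
        else:
            # Emit the longest known phrase and learn the new one.
            codes.append(table[current])
            table[candidate] = len(table)
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the original string from a list of LZW codes."""
    if not codes:
        return ""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the phrase that is about to be added.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code: %d" % code)
        out.append(entry)
        # Mirror the entry the compressor added one step earlier.
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out).decode("utf-8")


sample = '{"tiles": [1, 1, 1, 1], "name": "level"}' * 20
codes = lzw_compress(sample)
restored = lzw_decompress(codes)
```

Repetitive input such as a JSON level definition should produce far fewer codes than it has characters, which is where the size saving comes from.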


muddy_shoes(Posted 2011) [#2]
Bah... it looks like I've neglected to test with extended characters, so this currently won't work with Unicode strings or with the encoded data strings used in some modules. I'll take a look at that tomorrow.


muddy_shoes(Posted 2011) [#3]
Well, the LZW compression is working. Now I'm left with the problem of Monkey's lack of binary file access. The in-memory stuff is all fine, but saving and reloading puts the strings through various encoding mincers and the data doesn't survive.

I can encode the data in plain text, but as the output seems to use 2 bytes per char, that renders the compression pointless for smaller files: it just bloats everything back out with 50% unused space.
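
To put rough numbers on that bloat, here is a small Python sketch. It assumes base64 as the plain-text encoding (the post doesn't say which encoding was used) and a made-up 1500-byte payload; `stored_size` is a hypothetical helper, not part of squish.

```python
import base64


def stored_size(payload):
    """Compare a binary payload with its text form saved as UTF-16."""
    # Assumption: base64 stands in for "plain text" encoding here.
    text = base64.b64encode(payload).decode("ascii")
    # Python's "utf-16" codec writes a 2-byte BOM, then 2 bytes per char.
    on_disk = text.encode("utf-16")
    return {
        "binary_bytes": len(payload),
        "text_chars": len(text),
        "utf16_bytes": len(on_disk),
        # For ASCII text, one byte of every 2-byte unit is zero.
        "zero_bytes": on_disk.count(0),
    }


sizes = stored_size(bytes(1500))
# 1500 binary bytes -> 2000 base64 chars -> 4002 bytes on disk
```

Under those assumptions the saved file ends up more than two and a half times the size of the binary data, and half of what is written after the BOM is padding.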


muddy_shoes(Posted 2011) [#4]
I've managed to find a subset of values that don't upset XNA's desire to be able to resolve a character for each value. As LZW can be restricted to an arbitrary dictionary size it's easy enough to reduce that output range at the cost of some compression efficiency.
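
A rough Python sketch of capping the dictionary so that no output code reaches a chosen limit. It is not the squish implementation: the names and the `max_codes` parameter are hypothetical, and starting again from the single-byte entries once the dictionary is full is just one possible policy.

```python
def _fresh_encoder_table():
    return {bytes([i]): i for i in range(256)}


def _fresh_decoder_table():
    return {i: bytes([i]) for i in range(256)}


def compress_capped(data, max_codes):
    """LZW over bytes where every emitted code is below max_codes."""
    if max_codes < 256:
        raise ValueError("max_codes must cover the 256 byte values")
    table = _fresh_encoder_table()
    codes = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate
            continue
        codes.append(table[current])
        if len(table) < max_codes:
            table[candidate] = len(table)
        else:
            # Out of code values: start again from the single bytes.
            table = _fresh_encoder_table()
        current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def decompress_capped(codes, max_codes):
    """Inverse of compress_capped; must be given the same max_codes."""
    if not codes:
        return b""
    table = _fresh_decoder_table()
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if len(table) >= max_codes:
            # The compressor reset at this point, so do the same.
            table = _fresh_decoder_table()
            if code not in table:
                raise ValueError("invalid code after reset: %d" % code)
            entry = table[code]
        else:
            if code in table:
                entry = table[code]
            elif code == len(table):
                entry = previous + previous[:1]
            else:
                raise ValueError("invalid LZW code: %d" % code)
            table[len(table)] = previous + entry[:1]
        out.append(entry)
        previous = entry
    return b"".join(out)


data = bytes(97 + (i * i) % 26 for i in range(5000))
codes = compress_capped(data, 300)
```

A smaller `max_codes` narrows the range of values that has to survive the save/load path, at the cost of forgetting learned phrases more often.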

The next problem is that os.SaveString writes out UTF-16 with a BOM. GLFW, XNA, HTML5 and Flash skip that BOM and don't include it in the output string, but Android leaves it in. I assume this is a bug.

If I can skip the BOM then I at least have loading of compressed files via the App.LoadString function. Save/LoadState is something else though.


muddy_shoes(Posted 2011) [#5]
Actually, the Android loader doesn't just include the BOM, it also completely ignores it and spits out 8-bit char encoding values. In other words, Monkey's own save format isn't compatible with itself across all targets. Great.
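
The mismatch can be illustrated in Python. This is only a model of the behaviour described in this thread, not code from any Monkey target: `decode_saved_string` and `decode_as_8bit` are hypothetical names, and treating the 8-bit read as Latin-1 is an assumption.

```python
import codecs


def decode_saved_string(raw):
    """Honour a UTF-16 BOM if present and drop it from the result."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Assumption: files without a BOM are treated as 8-bit text.
    return raw.decode("latin-1")


def decode_as_8bit(raw):
    """Ignore the BOM and turn every byte into one character."""
    return raw.decode("latin-1")


# A two-character string saved as UTF-16LE with a BOM.
saved = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

good = decode_saved_string(saved)  # the two original characters
bad = decode_as_8bit(saved)        # BOM bytes plus a NUL after each letter
```

The second reader returns six characters for a two-character string, so any compressed data passed through it comes back as different values from the ones that were saved.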


therevills(Posted 2011) [#6]
This looks great muddy_shoes :)

"In other words, Monkey's own save format isn't compatible with itself across all targets"

Sounds like a bug to me...


muddy_shoes(Posted 2011) [#7]
It's a bug, but one that reflects a larger hole in the design thinking around what a Monkey String is. Not only does the GLFW target save strings in a format that the Android target doesn't read correctly, but Strings are inconsistently handled across the Save/LoadState and Save/LoadString interfaces as well as across targets. In some places they're UTF-16LE, in others they're treated as ASCII byte streams, and elsewhere they're read as 16-bit integers and then truncated to single bytes.


skid(Posted 2011) [#8]
IMHO LoadString and SaveString should be renamed LoadAscii and SaveAscii, with only 7-bit support and 0 meaning terminate, i.e. illegal.

Muddy, you may have missed some earlier code I wrote to manage such things (mime64, utf etc.)

http://monkeycoder.co.nz/Community/posts.php?topic=56

I would look at wrapping your binary in mime64 as your file format and offering users both binary and string-based interfaces to your API.

checking out squish....


muddy_shoes(Posted 2011) [#9]
Thanks, but I've got text encoding functions. As I said above, encoding the compressed strings into ASCII text is a bit pointless when the text is then saved out as UTF-16.

As far as I'm concerned, if Monkey's intention is to provide a consistent code-base, the String representation should be consistent across all targets and supported consistently in all interfaces that take or return Strings. With the need to support multiple languages, plain ASCII is a no-go unless it's only being used as a lowest-common-denominator encoding with some form of Unicode support layered on top.

Edit: And if you're looking at squish I think the version currently on Google has a bug where it fails to reset the dictionary when it runs out of index values. I've fixed it but I need to unravel some of the test code I've been putting in to diagnose the various string read/write issues before committing it.