Compression


ImaginaryHuman(Posted 2013) [#1]
One of the areas of programming I find interesting is compression algorithms, and the holy grail of compressing `random` data. It's very hard to compress random data - some say it's impossible. I think it is possible, just very, very hard, but I try anyway. I think I might have got something that works, but it's really slow and needs multiple passes, because each pass only shaves off around 1-2% of the file... but it seems to work - albeit pending a proper decompression test.

Anyone else dabbling with compression algorithms or made any interesting progress in this area?


xlsior(Posted 2013) [#2]
Truly random data is supposed to be incompressible, since there are no patterns, which is what pretty much all compression algorithms depend on.

If you CAN shave off a few percent per iteration, then in all likelihood the data isn't truly random. (Rnd and the like only fake randomness; statistically their output isn't completely random. There is a reason that true random number generators, such as those used for hardware encryption devices, are very expensive - IIRC those are based on things like atomic decay rates.)


ImaginaryHuman(Posted 2013) [#3]
lol yeah, I've done numerous experiments on compression, and usually when it looks like it's compressing well and showing signs of being able to be applied multiple times, there is inevitably some tiny little oversight or bug or one wrong number that is throwing the whole thing off and leaking data. Been down that road and bought the whole wardrobe many times, lol.

There has to be some point at which the data can't be squeezed any smaller - I'd find that natural - but I don't think that means `random values` can't be squeezed at all. Certainly, like you say, it's very hard to use traditional approaches like matching repeated patterns. That said, you'd be surprised to find that a block of random data actually does contain some percentage of repeated 2-byte strings, and even a small handful of 3-byte strings. They may only represent a few % of the total file, and the problem is that it's very hard to encode their positions without producing more code than you save, especially because they're short and randomly located. It's fun trying though.
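(To make that last point concrete, here's a minimal Python sketch - purely illustrative, with an arbitrary 64 KB block size - that counts how many adjacent 2-byte pairs turn up more than once in random data. Repeats do exist; the catch, as discussed below, is encoding where they are cheaply enough.)

```python
import os
from collections import Counter

data = os.urandom(64 * 1024)                     # 64 KB of OS-supplied random bytes

# Tally every adjacent 2-byte pair in the block.
pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))

repeated = sum(1 for count in pairs.values() if count > 1)
print("distinct 2-byte pairs seen:", len(pairs))
print("pairs occurring more than once:", repeated)
```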

If nothing else it's interesting trying to come up with something better than PNG or lossless JPEG2000 for images.

A while ago I was trying to work with a Fast Fourier Transform using integer math that could transform the image and do something to its values to make it more compressible... but I could never get that working.

Instead of focusing mainly on the back-end repetition encoder / Huffman-type variable-length coding, I've become more interested in the front-end `filters` (like PNG filters) that do things like sorting the data in a reversible way. Pretty difficult to better what's already been done, though. Makes for a good intellectual exercise.
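(For anyone unfamiliar with what such a front-end filter looks like, here's a minimal sketch of a PNG-style `Sub` filter in Python. It is purely illustrative - real PNG filtering works per colour channel and per scanline - but it shows the key property: each byte is replaced by its difference from the previous one, which is exactly reversible and tends to make smooth data more repetitive for the back-end coder.)

```python
def sub_filter(row: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    out = bytearray(len(row))
    prev = 0
    for i, b in enumerate(row):
        out[i] = (b - prev) & 0xFF
        prev = b
    return bytes(out)

def unsub_filter(filtered: bytes) -> bytes:
    """Exact inverse of sub_filter: a running sum mod 256."""
    out = bytearray(len(filtered))
    prev = 0
    for i, d in enumerate(filtered):
        prev = (prev + d) & 0xFF
        out[i] = prev
    return bytes(out)

row = bytes([10, 12, 13, 13, 14, 200, 201, 201])
assert unsub_filter(sub_filter(row)) == row      # round-trips exactly
```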



Russell(Posted 2013) [#4]
This reminds me of a so-called 'fractal compressor' I downloaded many years ago that could supposedly compress a file by 99%. Well, imagine my surprise when I tested it on a 1.2M jpeg and it compressed it down to less than 2k! It seemed too good to be true, especially when I was able to successfully uncompress the 2k file back to its 'original' size. So, I viewed the 'compressed' file with a hex reader and discovered that all it was doing was moving the original file to a different location and copying that location information into the small output file... Had me going there for a while.

Anyway, I had an idea that even 'random' or highly compressed files could be compressed further if a 3D matrix was set up in which the input file's data would be analyzed in planes, so that patterns would emerge like in a 3D tic-tac-toe (noughts and crosses to you blokes across the pond) game.

Finding useful patterns this way would take a LONG, LONG time, as many different matrix sizes would have to be tried to get the best fit. Because of this, compressing would take a very, very long time, but decompressing would be much quicker, as that information would be part of the output file.

Unfortunately, my 3D math skills are not up to the task. :(

Anyone want to tackle this idea? If it works, all I ask is complete credit! LOL. But seriously, is this idea viable at all?

Russell
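(A rough sketch of what Russell describes, just to make the idea concrete. The cube dimensions are arbitrary, and this only looks for trivial runs along the depth axis rather than doing any real pattern search across different plane sizes.)

```python
import os

# Treat a flat byte string as a W x H x D cube and look for equal bytes
# stacked in adjacent planes, i.e. patterns that cross planes rather than rows.
W, H, D = 64, 64, 16
data = os.urandom(W * H * D)

def at(x: int, y: int, z: int) -> int:
    return data[x + W * (y + H * z)]

stacked_pairs = 0
for y in range(H):
    for x in range(W):
        for z in range(D - 1):
            if at(x, y, z) == at(x, y, z + 1):
                stacked_pairs += 1

print("equal bytes in adjacent planes:", stacked_pairs)
```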


xlsior(Posted 2013) [#5]
but I don't think that means `random values` can't be squeezed at all


Well, it does of course depend on the type of data: random alphanumerical data can be compressed significantly, since you're only using a fraction of the 256 possible byte values to represent those characters.

And of course there are other exceptions, because a small enough data set -can- randomly consist of the same character over and over. Extremely unlikely, but possible.

But when the dataset is large and encompasses the entire available byte range? Good luck with that!
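(To put a number on the alphanumeric case: with an alphabet of 62 characters you only need about log2(62) ≈ 5.95 bits per symbol instead of 8, so even before exploiting any repetition an ideal coder could get down to roughly 74% of the original size. A quick illustrative check in Python:)

```python
import math

alphabet_size = 62                              # a-z, A-Z, 0-9
bits_per_symbol = math.log2(alphabet_size)
print("bits per symbol:", round(bits_per_symbol, 2))            # ~5.95 instead of 8
print("best-case size ratio:", round(bits_per_symbol / 8, 3))   # ~0.744
```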


*(Posted 2013) [#6]
I did a random compressor many, many years ago in C. It generated a string according to a number; if the generated string matched the data read from the file, it wrote the number instead.

It worked after a fashion BUT was incredibly slow. I actually managed to get a 1MB file down to 12 bytes, BUT that took two days to do :s


ImaginaryHuman(Posted 2013) [#7]
How did that string-matching thing work... did it decompress properly, and could it be applied to any file, etc.?

In other news I found out today that due to a tiny little bug in a part of my algorithm my latest effort was not working after all. lol I was quite convinced otherwise. Man it's such a difficult thing to do.


ImaginaryHuman(Posted 2013) [#8]
I gave up on my previous effort, then thought of another idea, tried it, thought it was working again, and then found out that it wasn't. lol... I should probably become a more accurate programmer first. But at least I'm coming (slowly) to the conclusion that compressing random-like stuff is extremely hard, if not next to impossible. Or at least, I'm just not smart enough. Probably there's some kind of super math-laden way to do it. What's with the attraction to something you can never attain, anyway?


xlsior(Posted 2013) [#9]
Compressing a random 1MB file to 12 bytes sounds more like you generated a hash code (like MD5).
The biggest problem is not that it takes long, but that the values are NOT UNIQUE. Many different files will lead to the exact same hash code, meaning that you can't tell with certainty that the (re)generated file is the same as the one you started out with in the first place.


Floyd(Posted 2013) [#10]
While you are at it you should invent a perpetual motion machine. Then you will have free electricity to power the computer doing the compressing.

A simple counting argument shows that any compression algorithm will fail on nearly all possible files.

Consider trying to compress 2-byte files down to a single byte. There are 256*256 possible files of size 2. But there are only 256 files of size 1, which could then be decompressed to the original size of 2. Thus at most 256 such files are compressible with any given algorithm. More than 99% of files are not compressible.

The same arithmetic applies to compressing size n down to n-1. And when you compress by more than one byte the numbers rapidly get even more lopsided.
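(Floyd's counting argument in a few lines of Python, just to make the numbers explicit - no compressor involved, it's purely the pigeonhole count:)

```python
# Pigeonhole counting: how many n-byte files exist vs. how many shorter
# outputs are available to represent them losslessly.
def files_of_size(n_bytes: int) -> int:
    return 256 ** n_bytes

n = 2
inputs = files_of_size(n)          # 65,536 possible 2-byte files
outputs = files_of_size(n - 1)     # only 256 possible 1-byte files
print(f"at most {outputs} of {inputs} inputs can shrink by a byte "
      f"({100 * outputs / inputs:.2f}%)")

# Shrinking by k bytes leaves only 1 in 256**k inputs representable at all.
```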


ImaginaryHuman(Posted 2013) [#11]
Understood.

However, when you think about it, the counting argument would have to also apply to compression of data that isn't quite random (where do you draw the line?), or compression of data that is fairly non-random or highly uniform. It makes it sound as though it's impossible to compress ANY file, because by making the file smaller you somehow must be representing multiple different files with the same data. Even a PNG image, once compressed, must then surely represent multiple files. But it doesn't. Or a zip file - surely if it's half the size of the original then it must represent two files? It doesn't really make sense to me in the bigger picture. The key here is what the data MEANS and how it is interpreted, not just what the raw data is.

You CAN compress a given amount of data to smaller than its original size and not produce any overlap or flaws in decompression. So evidently the compressed file is moving more towards representing a permutation of a number, or something like a hash code - something that represents that version of the original data but in less space. So I don't see why this couldn't also apply to data that happens to be `hard to compress`. It maybe just means you hit a limit sooner, or can't compress it by as much. I DO believe there is a limit somewhere to how much it can compress, but I don't think that limit is at the same size as the random data itself. It should be compressible unless it perhaps has absolutely maximum entropy or something.



xlsior(Posted 2013) [#12]
However, when you think about it, the counting argument would have to also apply to compression of data that isn't quite random (where do you draw the line?) ... The key here is what the data MEANS and how it is interpreted, not just what the raw data is.


It doesn't matter what the data is, just how uniform or random it is.

The more uniform/non-random it is, the easier it is to significantly compress; the more random it is, the less you can compress it. There's a curve of diminishing returns.


ImaginaryHuman(Posted 2013) [#13]
Been tinkering away at this again the past few days. I'm coming to the conclusion that because random data is so... random.... the ability to compress such data happens only randomly... lol ... In a few tests with certain files, for example, I was able to `accidentally` compress them a little, yet with many other files there was no improvement or the data got bigger. Also the nature of what worked to produce the compression was kind of a shot in the dark, guessing and trying stuff and hoping to get lucky. So I think you can compress random data, randomly, using a random algorithm that is chosen randomly. Which makes it not very reliable. I think somehow the algorithm itself needs to be very inconsistent and varying in order to stand any chance at all.

So anyway. As an aside, I did manage to come up with a new filter/transform which helps to compress 24-bit images better... around 10-20% better than PNG files... but not quite as good as lossless JPEG2000, which is still in a league of its own. As a side note, though... if you compress an image to a really low quality level with JPEG2000 and use that as a predictor, and then find how far those values are from the actual image values to produce `error` data, the error plus the mini JP2 file together compress to slightly less than the lossless JPEG2000 file. Kinda funny, but not particularly useful.
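(The predictor-plus-residual trick described here looks roughly like the sketch below. `lossy_encode` / `lossy_decode` are hypothetical stand-ins for whatever lossy codec is used as the predictor - they are not a real API - and the mod-256 residual keeps the whole thing exactly reversible.)

```python
def residual(original: bytes, predicted: bytes) -> bytes:
    """Per-byte difference between the image and its lossy prediction (mod 256)."""
    return bytes((o - p) & 0xFF for o, p in zip(original, predicted))

def reconstruct(predicted: bytes, res: bytes) -> bytes:
    """Exact inverse: prediction plus residual gives back the original bytes."""
    return bytes((p + r) & 0xFF for p, r in zip(predicted, res))

# Hypothetical usage, with lossy_encode/lossy_decode standing in for a codec:
#   small   = lossy_encode(image_bytes, quality=5)    # tiny lossy version
#   pred    = lossy_decode(small)                     # decoded prediction
#   res     = residual(image_bytes, pred)             # mostly near-zero values
#   payload = compress(small) + compress(res)         # store both parts
# Decompression reverses it: reconstruct(lossy_decode(small), res).
```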



xlsior(Posted 2013) [#14]
10-20% better than PNG is pretty good though -- is that also lossless?

If you can read/write them quickly, that very well could be an interesting file format.


ImaginaryHuman(Posted 2013) [#15]
Yes, it's lossless. I'm not interested in lossy algorithms; they're way above my head ;-) It really only works well for 24-bit images. Stuff that's been color-reduced or dithered etc. is not so good, so on some small files PNG comes out better. But on bigger images mine is definitely better. It's pretty fast, I'd say. The back end is LZMA, which is faster than the zlib/deflate used in PNG (not trying to totally reinvent the wheel here). The front end is my pre-processing stuff, which is quite quick. I think once optimized, for loading/saving an image etc. you probably couldn't tell the difference in speed between PNG and this; it might even be a bit faster.

Here are some compression results. These are 1024x768 24-bit images of differing complexity, so 2359296 bytes uncompressed. The first number is my format, the second is PNG, the third is lossless JPEG2000.
Pic1.png: mine 1247580, PNG 1499917, JP2 1145169
Pic2.png: mine 379901, PNG 405200, JP2 451624
Pic3.png: mine 1430370, PNG 1775196, JP2 1313982
Pic4.png: mine 1611037, PNG 1795204, JP2 1518795

What I'd really like is to be a good percentage below JPEG2000, but that's harder on most images. JPEG2000 appears to be good for real-world stuff with gradations or sustained areas of texture, but not very good at highly specific detail. I managed to beat it on a highly blurred image (Pic2), where even PNG does better than JPEG2000.

When you compress an image only with gzip or only with LZMA with no filters/pre-processing the numbers are higher than all of these formats. So it goes to show you, you can come up with something better.



xlsior(Posted 2013) [#16]
Don't know how you do your compression, but are you subdividing the image into smaller parts as well? IIRC JPEG supports a multitude of different compression subsets, and when compressing it can try various algorithms for each sub-section to find the one with the best result. That's why a picture with very high compression starts looking 'blocky' - those are the subdivided parts of the image that degrade noticeably differently from their neighbors.


ImaginaryHuman(Posted 2013) [#17]
I don't do that at the moment. I have tried using blocks, i.e. reading a square of 4 bytes or pixels, or even a 4x4 block, figuring that there is likely to be a close-knit relevance between values within that block which might otherwise be split up by dealing with separate rows, but I found it only made matters worse. In fact, I got fancy and did a kind of flood-fill extraction algorithm which used bitcodes to mark each byte to say whether you can flood below or to the right, and wherever the flood proceeded within a threshold it would extract those bytes. Even with a threshold of 0, where the bytes have to be the exact same value, it resulted in worse compression for me, at least when used in combination with my current method. But it's a cool idea that I want to revisit, maybe in combination with some other compression passes. The trouble is it generates lots of codes. One of the worst issues with any compression method is when it has to store codes to decompress, because often that eats up all the savings.

One idea I have had is to work on every other byte in a hash (checkerboard) pattern, i.e. on the next row I alternate which of the two bytes I look at. This provides four pixels surrounding every unpredicted pixel, plus you get the upper-left and upper-right. So overall this gives you 6 pixels to use as a predictor instead of 3, which should give a more accurate prediction. Actually, I haven't totally implemented this yet, but the one test I did again came out with worse performance than much simpler methods. Sometimes the ideas that sound like they should work better just don't when you actually try them. But I think the 6-pixel thing might work if the predictor is better, like Paeth or something using splines.
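(For reference, the Paeth predictor mentioned here is the one from the PNG spec: it guesses each byte from its left, above, and upper-left neighbours by picking whichever neighbour is closest to their linear estimate. A minimal Python version:)

```python
def paeth(a: int, b: int, c: int) -> int:
    """PNG Paeth predictor: a = left, b = above, c = upper-left neighbour."""
    p = a + b - c                                   # initial linear estimate
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    if pb <= pc:
        return b
    return c

# The value actually stored is (raw - paeth(a, b, c)) & 0xFF; the decoder can
# reverse this because it has already reconstructed a, b and c by that point.
```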

I agree that adapting the prediction method as you go along gives better results, which is why PNG lets you choose a filter on a per-row basis. I tried choosing filters on a per-smaller-block basis and it gave slightly better performance, but not a huge leap, plus it generates more codes to flag which filter to use. That said, maybe I can use the history of which filters worked best on the data just compressed to decide which one to use on the next byte.

I paid a visit to some of the compression forums online. There are image compressors out there that perform 10-15% better than JPEG2000, which is pretty impressive, and I have no idea how to get that kind of performance. It's kind of humbling - there are some really proficient boffins working on this stuff, so it makes my `oh, I did better than PNG` seem not quite so amazing, lol... but I'm still happy with that.

Also I get the sense that you have to work at the bit level not the byte level if you really want to scrape up every last bit that you can save.



ImaginaryHuman(Posted 2013) [#18]
Hey. I need an algorithm which can skew the count of how many times each byte value occurs in a file. So let's say I have a file where every byte value occurs 100 times - not necessarily in any order; in fact the positions of the bytes are totally randomly scattered. But I want to end up with a file that has, say, several times more occurrences of some higher (or lower) byte values and several times fewer occurrences of the rest... same file size, and preferably without needing any extra data to reverse it - and yes, it must be exactly reversible.

Has anyone got any ideas for how I can do this? I thought of `mapping` several byte values to one byte value, but then there is no apparent way to undo it because one value represents multiple values.


xlsior(Posted 2013) [#19]
Given that you don't want to change the filesize and don't want to supply any extra data to reverse it, you're going to be pretty limited in what you'll be able to do...

The only approach I can think of is something really basic, e.g. doing an XOR or some other reversible operation on every 10th byte (or 5th, or whatever) in the file to flip it to a different value.
That way 10% (or 20%, or...) of the values in your file will change, and since they aren't in order, the ratios (each byte value occurring 100 times) will likely be skewed a bit too when you're done.
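(A minimal sketch of that suggestion. XORing every Nth byte with a fixed mask is its own inverse, so applying the same function twice restores the original file exactly and no side data is needed; the stride and mask below are arbitrary.)

```python
def xor_every_nth(data: bytes, n: int = 10, mask: int = 0xA5) -> bytes:
    """XOR every n-th byte with a fixed mask; applying it twice is a no-op."""
    out = bytearray(data)
    for i in range(0, len(out), n):
        out[i] ^= mask
    return bytes(out)

blob = bytes(range(256)) * 4
once = xor_every_nth(blob)
twice = xor_every_nth(once)
assert twice == blob            # exactly reversible, no extra data needed
```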


ImaginaryHuman(Posted 2013) [#20]
Mmm, that's an idea... although because the positions of the bytes are totally random, it will work out that `on average` bytes of every value will have been XOR'd by the time you're done. ... mm, although if I perhaps apply this on already partially skewed data it might work. I'll try it.


xlsior(Posted 2013) [#21]
.. mm, although if I perhaps apply this on already partially skewed data it might work. I'll try it.


Even if it's not skewed: if the data is more or less randomly arranged in the file and you're changing bytes at non-random intervals, I'd expect the distribution of values in the end result to follow a bell curve rather than remain identical?


ImaginaryHuman(Posted 2013) [#22]
Why would it make a bell curve? I tried it and the distribution is still just as random. Like I said, the positions of the bytes are totally randomized, so even if I modify every 10th byte, the chances are that over the course of the file those 10th bytes will take any of the 256 possible values, so all values get modified approximately equally.

I can see how I could, for example, stretch out the `dynamic range` of the data, say doubling it, by choosing to repeat all bytes with a value >127 more than once... but this creates more data. I'd rather somehow increase the occurrence of some bytes without expanding the data.


SpaceAce(Posted 2013) [#23]
This is obviously useless from a practical standpoint, but it sounds kind of fun... What about building tables of two-, three-, four-, etc.-byte patterns which the compression and decompression software can refer to? Maybe data analysis will reveal that machine instruction sets and human foibles lead to 100,000,000,000 or so patterns showing up more than their share of the time. Of course, the combination of characters used to represent the patterns would have to be shorter than the patterns themselves.

ImaginaryHuman, would you like to share any of your compression code? I'd like to have a look at it.


ImaginaryHuman(Posted 2013) [#24]
There is great benefit to being able to make a given byte value occur more often - it leads to more repetition which can be compressed. The hard part is finding a reliable way to do this and make it reversible without eating up all the savings in extra information.

I found that skewing the data in a way that required too much overhead did in fact allow the data to compress better, but then the overhead was more than was saved. So if there is a way to skew it reversibly with less data needed (or none) that would be ideal.

Your idea about building tables of matches sounds like a basic dictionary compressor, which has been done before in many different ways and doesn't work on random data because there aren't enough repetitions. There are SOME, but nowhere near enough.


xlsior(Posted 2013) [#25]
...because keep in mind that each and every time you substitute one string with another you need a marker telling you that the following data is substituted, which all adds overhead and potentially negates the savings.


ImaginaryHuman(Posted 2013) [#26]
Yup. It's very hard to encode the position of something in a highly random file using fewer bytes than you save, because usually what you can chop out is so small that the code to identify its position is longer. Say you find a repetition of a byte, like 255 255, but its position is the 100,000th byte in the file, there were no other matches before it, and the next one isn't until the 250,000th byte... there is just no way to code that in less than 2 bytes.
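(Putting numbers on that example: an offset that can be anywhere up to around 250,000 needs ceil(log2(250000)) = 18 bits, i.e. more than 2 bytes, just to say where the match is, while removing the repeated pair saves at most 2 bytes - before you even add a marker flag. A quick illustrative check:)

```python
import math

offset = 250_000                                  # where the next "255 255" match sits
bits_for_offset = math.ceil(math.log2(offset))
print("bits just for the offset:", bits_for_offset)        # 18
print("bytes just for the offset:", bits_for_offset / 8)   # 2.25
# The pointer alone already costs more than the 2 bytes the match could save.
```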


zzz(Posted 2013) [#27]
I missed this thread completely :) Hoping there still is some interest in this.

As people have already said, putting effort into random data compression is pretty much a waste of time, because a) there already exist compression algorithms that get (very!) close to the ideal value distribution and entropy, and b) while that is a theoretical measure of compression, we just can't make any assumptions about random data.

You need those assumptions to succeed, because they give you alternative ways to analyse and reorder your data.
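(The `ideal` here is the Shannon entropy of the byte distribution: for uniformly random bytes it comes out at essentially 8 bits per byte, leaving an order-0 coder no room at all. A minimal estimator in Python, purely illustrative:)

```python
import math
import os
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte-value distribution (order-0 model)."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy_bits_per_byte(os.urandom(1 << 20)))              # ~7.9998 bits/byte
print(entropy_bits_per_byte(b"aaaaabbbbbcccccddddd" * 1000))   # exactly 2.0 bits/byte
```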


ImaginaryHuman(Posted 2013) [#28]
I think it basically goes like this: if you have access to the original creator of the image/data, then you can completely recreate that image/data with very little information - e.g. the settings for an animation plus some 3D models. If you don't have that and you're working off the `end results` only, then compression is already harder, because you now have to try to predict what the original intelligence was that created the data. Worse still, if you take the compressed `random` output of such compression methods and try to compress it further, it's really difficult, because now you need to know not only the nature of the previous compression step but also of the original data creation. It's very hard. So really the answer to `how to compress better` is to avoid compression altogether, stop wasting your time, and focus on real-time generation of original data. There's only so much you can compress something once you're no longer really in touch with its source.