Tile-Based Image Compression

I’m not any kind of data compression expert, but I find the topic quite interesting. There’s something fascinating about the idea of data compression and its relationship to machine learning. What I mean to say is that to compress a data stream efficiently, you should ideally capture something fundamental about the structure of the data being encoded. The ideal compression algorithm should somehow be able to “learn” and “understand” the core nature of the data being compressed, so that it can encode it in a very compact way, with as little redundancy as possible. The compression algorithms in use today do this at a rather superficial level, seeking to shorten the encoding of repeated patterns of bits. LZW, for example, builds a dictionary (a simple grammar) as it encodes a binary data stream, and reuses pieces from this dictionary, when useful, to shorten the encoding. This simple yet clever strategy is remarkably effective, and quite difficult to improve upon.
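As a toy illustration of that dictionary-building idea (not a production codec, and the output codes are not even bit-packed), an LZW-style encoder fits in a few dozen lines of D:

```d
import std.stdio;

// Toy LZW-style encoder: the dictionary starts with all single bytes and
// grows as longer patterns are seen, so repeated patterns collapse into
// reused dictionary codes.
ushort[] lzwEncode(const(ubyte)[] input)
{
    ushort[string] dict;
    foreach (c; 0 .. 256)
        dict["" ~ cast(char)c] = cast(ushort)c;

    ushort[] output;
    string current;
    ushort nextCode = 256;

    foreach (b; input)
    {
        string extended = current ~ cast(char)b;
        if (extended in dict)
        {
            current = extended;              // keep growing the match
        }
        else
        {
            output ~= dict[current];         // emit code for the longest known prefix
            if (nextCode < ushort.max)
                dict[extended] = nextCode++; // learn the new pattern
            current = "" ~ cast(char)b;
        }
    }
    if (current.length)
        output ~= dict[current];
    return output;
}

void main()
{
    auto codes = lzwEncode(cast(const(ubyte)[])"abababababab");
    writeln(codes); // the repeated "ab" pattern is encoded via reused codes
}
```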
Image compression is something I find particularly interesting. Images we typically see around, photos in particular, are not at all random. There’s structure: edges, shapes, visible patterns and repetitions. Real-world objects, letters, people, curves, textures, all of these things could in theory be mathematically approximated in a way that’s much more compact than an array of millions of pixel samples. Our brain is somehow able to recover the underlying structure of the world from images, and transform millions of samples into high-level semantic entities. I’ve seen people use genetic algorithms to approximate images using colored triangles. This got me thinking: what if you had something more expressive than triangles? Could you use genetic algorithms to evolve boolean functions to encode the pixels of an image as a function of their x/y coordinates?
I did a number of experiments trying to come up with boolean functions to approximate one-bit-per-pixel black and white images using mutations and hill climbing, but got surprisingly poor results. There are a number of issues. For one, evaluating complex boolean functions on millions of pixels is very time-consuming. Another problem is that there are a great many complex functions to evaluate, and very, very few of them actually resemble the output we want. It’s like looking for a needle in a galaxy. I think an approach using neural networks working with floating-point values would be more likely to pan out. There are training algorithms for neural networks, which would be a better starting point than completely random mutations. Furthermore, floating-point outputs are probably better for learning than boolean outputs, as there are multiple gradations of how wrong a given output is, rather than all or nothing, but this is a topic for another blog post.
After the horrible failure of my boolean image approximation attempt, I sobbed while sipping on one ounce of whisky, but then, I had an idea. Simple approaches that exploit fundamental properties of a problem space tend to do quite well in practice. Maybe something resembling the LZW algorithm would work better. Maybe I could exploit the self-similarity of images (recurring patterns) in a very simple way. I started thinking about an algorithm to encode images using tiles: split the image into NxN tiles (e.g.: 4×4 or 8×8) and encode each tile either as a direct listing of pixel values, or as a reference to another part of the image. The idea is to save space by reusing image information when a similar enough tile already exists. I used the Mean Squared Error (MSE) of tile pixel values to decide when tiles are similar enough to allow reuse. Someone has probably already come up with a similar image compression algorithm, but my google-fu has not revealed anything obvious.
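To make this concrete, here is roughly what the per-tile decision could look like, simplified to a single grayscale channel with an arbitrary error threshold (the actual code works on color, also allows the rescaling described next, and searches all previously compressed positions):

```d
enum TILE = 4;
enum THRESHOLD = 100.0;   // assumed acceptable MSE per tile, not the real value

struct TileRef { bool isLiteral; int srcX, srcY; }

// Mean squared error between the TILE x TILE blocks at (ax, ay) and (bx, by).
double tileMSE(const(ubyte)[] img, int w, int ax, int ay, int bx, int by)
{
    double sum = 0;
    foreach (dy; 0 .. TILE)
        foreach (dx; 0 .. TILE)
        {
            double d = cast(double)img[(ay + dy) * w + ax + dx]
                     - cast(double)img[(by + dy) * w + bx + dx];
            sum += d * d;
        }
    return sum / (TILE * TILE);
}

TileRef encodeTile(const(ubyte)[] img, int w, int tx, int ty)
{
    double best = double.max;
    TileRef bestRef;

    // Brute-force search: try every pixel position whose tile lies entirely
    // above the current tile row (a simplification of "already compressed").
    foreach (sy; 0 .. ty - TILE + 1)
        foreach (sx; 0 .. w - TILE + 1)
        {
            double e = tileMSE(img, w, tx, ty, sx, sy);
            if (e < best) { best = e; bestRef = TileRef(false, sx, sy); }
        }

    // Fall back to a literal tile when nothing previous is similar enough.
    return best <= THRESHOLD ? bestRef : TileRef(true, tx, ty);
}
```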
The tile-based algorithm, which my friend Tommy has codenamed übersmüsh (USM for short), performs surprisingly well. It’s not comparable to JPEG, which is a very sophisticated format layering many levels of clever tricks to get an encoding as compact as possible, but it does much better than I would have expected. Using 4×4 tiles to compress a 512×512 version of the classic Lenna image, I was able to generate something visually quite similar to the original by supplying only 610 tiles of source pixel data, out of 16384. This means that only 3.7% of the tiles in the compressed image come from the source image; the rest are all backreferences to x/y regions of the compressed image. I allowed backreferences to rescale the color values of the tiles being referenced for additional flexibility, which greatly reduces the number of source tiles needed.
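One simple way to pick such a rescaling factor is a per-channel least-squares fit, sketched here for a single channel (this is just one possible choice, with names of my own):

```d
// One possible choice of scale factor: the s minimizing the squared error of
// s * candidate against the target tile, which has a closed-form solution.
double fitScale(const(double)[] candidate, const(double)[] target)
{
    double num = 0, den = 0;
    foreach (i; 0 .. candidate.length)
    {
        num += candidate[i] * target[i];
        den += candidate[i] * candidate[i];
    }
    return den > 0 ? num / den : 1.0;   // fall back to no rescaling
}
```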
There are some issues, such as the fact that we probably use too few tiles for the background and too many in some detailed regions. This can probably be improved by tweaking or replacing the heuristic deciding when to use source tiles or backreferences. I can think of several other ideas to improve on the algorithm as a whole:
- Pre-filtering of random noise to make tiles more reusable. The Lenna image has quite a bit of high-frequency noise, as most photographs probably do, due to the nature of image sensors in cameras.
- Reducing the size of source pixel data using chroma subsampling. This may also have the benefit of making tiles more similar to each other, and thus more reusable.
- Avoiding supplying whole source tiles and instead supplying only enough source pixels to get an acceptable error threshold. The rest of the pixels would always come from a backreference.
- Using bigger tiles when possible (e.g.: 16×16 or 8×8) and resorting to a recursive subdivision scheme only when necessary. This would reduce the number of tiles we need to encode.
An important weakness right now might be the difficulty of encoding the tile reference table compactly. The code I wrote is also unoptimized and very slow. It’s a brute-force search that looks at all possible previous pixel coordinates to find the best-matching 4×4 tile in what has already been compressed. This currently takes over an hour on my Core 2 Quad. The algorithm could probably be made much faster with a few simple tweaks to reduce the candidate set size. For example, the tiles could be sorted or hashed based on some metric, and only a few candidate tiles examined, instead of hundreds of thousands. The source code is written in D. It’s not the cleanest, but if there’s interest I might just clean it up and make it available.
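For instance, candidates could be bucketed by some cheap statistic and only nearby buckets searched; the statistic and bucket count below are arbitrary illustration choices, not what my code does:

```d
enum BUCKETS = 64;

// Quantize a cheap per-tile statistic (here, the mean intensity of a
// grayscale tile) into a small number of buckets; only tiles in the same or
// neighbouring buckets get compared with the full MSE.
int tileBucket(const(ubyte)[] img, int w, int x, int y, int tile)
{
    int sum = 0;
    foreach (dy; 0 .. tile)
        foreach (dx; 0 .. tile)
            sum += img[(y + dy) * w + x + dx];

    int mean = sum / (tile * tile);   // 0 .. 255
    return mean * BUCKETS / 256;      // 0 .. BUCKETS - 1
}

// The encoder would keep, per bucket, the positions of tiles seen so far,
// and search only the matching list for each new tile.
struct Pos { int x, y; }
Pos[][BUCKETS] candidateLists;
```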
Happy new year to all :)
Edit: Challenge Accepted
A friend suggested I try übersmüsh on a different kind of image: a screenshot of Facebook containing text and large flat areas. I was curious to see how well it would do, and so I accepted his challenge. I quickly hacked the algorithm to look for matches in restricted size windows (close in position to the current tile). This probably reduces its effectiveness somewhat, but was needed to get a reasonable execution time (minutes instead of multiple hours).
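In code, the restriction is just a clamped window of candidate positions around the current tile; the radius below is a placeholder, not the value actually used:

```d
import std.algorithm : max, min;

enum WINDOW = 64;   // assumed search radius in pixels

struct SearchArea { int x0, x1, y0, y1; }   // inclusive candidate bounds

// Candidate positions are limited to a WINDOW-pixel neighbourhood of the
// current tile at (tx, ty), clamped to the image.
SearchArea searchWindow(int tx, int ty, int w, int h, int tile)
{
    return SearchArea(max(0, tx - WINDOW), min(w - tile, tx + WINDOW),
                      max(0, ty - WINDOW), min(h - tile, ty + WINDOW));
}
```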
A total of 3354 source tiles were used out of 49000 tiles total, which is about 6.8% of the source image data. Predictably, the algorithm does very well in flat areas. There is some amount of reuse in text, but this could be better if I hadn’t reduced the search window size, and had instead implemented a better search heuristic. Some of the ad pictures, namely the ones in the bottom-right corner, show very poor reuse, probably because my mean-squared-error heuristic is a poor approximation of what error is visually acceptable.
Fractal image compression does something very similar. In the simplest case, for each tile it stores the location of another (bigger) tile, along with the coefficients of an affine transformation to apply to each color channel. Then it repeatedly reconstructs all tiles by copying the referenced tile and transforming its colors. Because each copy comes from a bigger tile, sooner or later the image converges to a fixed image. Of course, you can use additional transformations (like rotating or shearing tiles) to improve quality.
The cool thing about fractal image compression is that decompression can happen at any resolution, even bigger than the original. It will generate artificial details, sometimes really good details.
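A minimal grayscale sketch of that reconstruction step, with illustrative names, a fixed 2x domain-to-range ratio and no rotation/shear variants, could look like this:

```d
enum R = 4;   // range tile size; domain tiles are 2R x 2R

struct RangeMap { int domX, domY; double scale, offset; }

// One pass of the reconstruction: every R x R range tile is rewritten from
// its (downsampled) 2R x 2R domain tile using the stored contrast scale and
// brightness offset.
void applyMaps(ref double[] img, int w, const(RangeMap)[] maps)
{
    auto next = img.dup;
    int tilesPerRow = w / R;

    foreach (t, m; maps)
    {
        int rx = (cast(int)t % tilesPerRow) * R;
        int ry = (cast(int)t / tilesPerRow) * R;

        foreach (dy; 0 .. R)
            foreach (dx; 0 .. R)
            {
                // Average a 2x2 block to downsample the domain tile to R x R.
                double a = (img[(m.domY + 2*dy)     * w + m.domX + 2*dx]
                          + img[(m.domY + 2*dy)     * w + m.domX + 2*dx + 1]
                          + img[(m.domY + 2*dy + 1) * w + m.domX + 2*dx]
                          + img[(m.domY + 2*dy + 1) * w + m.domX + 2*dx + 1]) / 4;
                next[(ry + dy) * w + rx + dx] = m.scale * a + m.offset;
            }
    }
    img = next;
}
// Decompression: start from any image (e.g. flat grey) and call applyMaps
// repeatedly; with contrast scales below 1 the result converges to a fixed image.
```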
Some papers on the topic: fractal-compression.pdf, barnsley.pdf, bpra065.pdf
Your method could probably be made faster by hashing the tiles in order to quickly discard tiles that are not a match. Let’s say something like an aggregate of %red-%green-%blue over all pixels of a tile. Since you seem to reuse tiles, a simple hash function like this would group the most likely tiles to search, without having to search all tiles for a match.
I did mention using a hashing scheme. I’m not sure what you mean by “%red-%green-%blue”. Ideally, whatever metric is used, it should probably use some normalized representation because I allow re-coloring of the tiles when reusing them. Using absolute color values in the hashing scheme could miss good candidates. Possibly, some sort of color variance measure would work best.
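As an illustration of what such a normalized representation could be (just a sketch of the idea, not something implemented), the tile could be divided by its RMS before hashing, so that two tiles differing only by a uniform rescale produce the same signature:

```d
import std.math : sqrt;

// Scale-invariant tile signature: normalize the pixels by their root mean
// square; a few of these values could then be quantized to form a hash key.
double[] normalizedSignature(const(double)[] tilePixels)
{
    double energy = 0;
    foreach (p; tilePixels)
        energy += p * p;

    double norm = sqrt(energy / tilePixels.length);   // root mean square
    auto sig = new double[](tilePixels.length);
    foreach (i, p; tilePixels)
        sig[i] = norm > 0 ? p / norm : 0;
    return sig;
}
```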
This is extremely similar to fractal compression.
https://en.wikipedia.org/wiki/Fractal_compression
Your results look visually very good.
I wanted to complete my first impression:
Initially, I read “only 3.7% of the tiles in the compressed image come from the source image” and thought: “wow, only 3.7% of the squares as source, that’s so impressive, that promises an excellent compression ratio”.
Then I remembered a common pitfall I have experienced in the past.
Take an extreme example: let’s imagine you compress a very large file, and you emit a back-reference as soon as 2 bytes match.
Following that logic, you would seem to end up with a compressed file of only 64KB, whatever the size of the source file: “the rest are all backreferences”.
As can be guessed, “An important weakness right now might be the difficulty of encoding the (…) reference table compactly.”
Indeed. Especially if 2 bytes are required to reference 2 bytes.
In LZ compression (such as zip or 7zip), a huge majority of the cost of the compressed file *is* the back-reference descriptions (offset & match length). The “remaining” part (called “literals”) is very likely to represent a very small fraction of the original size.
Obviously, the smaller the back-referenced items, the more likely you are to find them already registered somewhere (hence 4×4 tiles are much better than 8×8 tiles).
And the more costly the reference table.
Back-reference compactness is not an auxiliary problem, it is the main problem a dictionary-based compressor must solve for its efficiency.
It is an issue, but not necessarily as deadly as you might imagine.
For 4×4 tiles in an RGB888 image, each literal tile is 4x4x3 = 48 bytes. Hence, if we use a few bytes to describe a backreference (e.g.: 3), we can still get a pretty good compression ratio.
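As a back-of-envelope check under those assumptions (raw 48-byte literals, 3-byte references, ignoring the rescale parameters and any header or entropy coding), the Lenna numbers above come out to roughly 10:1:

```d
import std.stdio;

void main()
{
    // Rough size estimate for the Lenna result under the assumptions above.
    enum totalTiles   = 16_384;      // 512x512 image, 4x4 tiles
    enum literalTiles = 610;         // source tiles from the experiment
    enum literalBytes = 4 * 4 * 3;   // 48 bytes of raw RGB888 per literal tile
    enum backrefBytes = 3;           // assumed size of one backreference

    auto size = literalTiles * literalBytes
              + (totalTiles - literalTiles) * backrefBytes;
    auto raw  = 512 * 512 * 3;

    writefln("estimate: %d bytes vs %d raw (%.1f:1)",
             size, raw, cast(double)raw / size);
}
```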
I haven’t had time to play more with the algorithm since this post as I’m busy with Higgs, but I did think of a few possibilities to improve upon it. One would be to use an adaptive scheme that is recursive. For example, we use 8×8 or even 16×16 tiles whenever we can get away with it. When we can’t, we break this down into smaller tiles and try that. Each large tile that you can fit in there saves you the encoding of several small tiles.
This could possibly be improved by working at the 8×8 level, and then using 4×4 or 2×2 subtiles to “cover up” regions that don’t match in the parent tile. That way you can get away with a large tile that only partially matches, so long as it’s only a small region that doesn’t match up well, and you reduce the number of backreferences that need to be encoded.
A simpler variant could be to just use variable-width tiles to find a back reference as large as we can get away with. I think this may work quite well in some cases, especially if we have large uniform-ish regions. Then there’s the possibility of informing the matching algorithm with some better heuristic measures of what is a “visually-acceptable error”.
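A sketch of how that adaptive subdivision could be structured, with stubs standing in for the search and output steps described earlier and an arbitrary threshold:

```d
enum MIN_TILE = 4;
enum THRESHOLD = 100.0;   // assumed acceptable error, not a tuned value

// Stubs standing in for the backreference search and the encoder output.
double errorOfBestMatch(int x, int y, int size) { return double.max; }
void emitReference(int x, int y, int size) { /* record a backreference */ }
void emitLiteral(int x, int y, int size)   { /* record raw pixel data  */ }

// Try to cover a big square with one reference; if the error is too high,
// split it into four quadrants and recurse down to the minimum tile size.
void encodeQuad(int x, int y, int size)
{
    if (errorOfBestMatch(x, y, size) <= THRESHOLD)
    {
        emitReference(x, y, size);      // one big reference covers the region
    }
    else if (size > MIN_TILE)
    {
        int half = size / 2;            // split into four and try again
        encodeQuad(x,        y,        half);
        encodeQuad(x + half, y,        half);
        encodeQuad(x,        y + half, half);
        encodeQuad(x + half, y + half, half);
    }
    else
    {
        emitLiteral(x, y, size);        // smallest tiles fall back to literals
    }
}
```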
I think there are many tricks that could be tried to reduce the number of backreferences. For now, those are left as an exercise for the reader. I think I did at least demonstrate that you can get a fairly good compression ratio and decent image quality with a surprisingly simple algorithm. It’s very far from beating jpeg, but it’s surprisingly good for something so simple, in my opinion.
:)
That’s 3 times better than I initially thought.
Sorry, for some reason I was reading your results as applying only to the Y channel, following a classic YUV decomposition. No real reason, maybe a bias from JPEG…
Your method is in fact complementary to JPEG, since the DCT transform is mostly in charge of compressing/decompressing individual squares (the remaining literal tiles, in your case).
3 bytes might seem like a lot, and it is if it only encodes a distance; however, any additional parameter must also be counted. For fractal compression, that is typically a rescale factor, a luminosity offset and an orientation. In your case, you “rescale the color values of the tiles being referenced for additional flexibility”, which is also a parameter that must be transported.
Anyway, it’s an interesting methodology you described, I might have a deeper look at it when I get more free time…
I have implemented an advanced version of “tile picking” using mutation, blending and genetic/hill-climbing algorithms; see the manual for further information and the demo of its features.