Gzip for compressing planets
Posted by balrog-kun on 18 June 2009 in English. Last updated on 20 March 2010.

I finally set up my own live mirror of the planet file today and needed to write the update scripts etc. (I could probably have downloaded somebody else's from somewhere, but nobody answered quickly enough on IRC.)
I first tried to have osmosis take the bzipped planet, patch it with the daily diff and generate a new bzipped planet (yes, I made sure the bzipping and unbzipping happened in separate processes so they could run on the second core). The decompression speed was OK and the processing speed was surprisingly OK too (despite Java), but compression was taking 80% of the three processes' total CPU time; the ratio was about 10/15/100 (bzcat/osmosis/bzip2). It would likely have taken more than 20h for this to complete, and the result would have been about 6GB.
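The update step can be scripted roughly as below. The file names are placeholders, and the osmosis task ordering and the use of "file=-" for stdin/stdout are from memory, so treat this as a sketch and check osmosis --help before copying it:

    import subprocess

    # Placeholder file names for the mirror's update step.
    OLD_PLANET = "planet-old.osm.bz2"
    DAILY_DIFF = "daily.osc"
    NEW_PLANET = "planet-new.osm.bz2"

    # Decompress, patch and recompress in three separate processes so the
    # (un)bzipping can run on the second core while osmosis does its work.
    bzcat = subprocess.Popen(["bzcat", OLD_PLANET], stdout=subprocess.PIPE)
    osmosis = subprocess.Popen(
        ["osmosis",
         "--read-xml-change", "file=" + DAILY_DIFF,
         "--read-xml", "file=-",      # planet XML arrives on stdin
         "--apply-change",
         "--write-xml", "file=-"],    # patched planet goes to stdout
        stdin=bzcat.stdout, stdout=subprocess.PIPE)
    with open(NEW_PLANET, "wb") as out:
        bzip2 = subprocess.Popen(["bzip2", "-c"],
                                 stdin=osmosis.stdout, stdout=out)
        # Close our copies of the pipe ends so EOF propagates correctly.
        bzcat.stdout.close()
        osmosis.stdout.close()
        bzip2.wait()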
Then I tried saving uncompressed and the whole thing became I/O-bound on my SATA disk, I think; it might have finished in perhaps 5h and would have occupied about 150GB.
Now I thought the best option would be a compromise using a really cheap algorithm, which should still be useful considering the planet is all text. I considered gzip and lzma; the benchmark here: http://tukaani.org/lzma/benchmarks made it pretty clear that lzma was even heavier on CPU time than bzip2, and that gzip -1 (a.k.a. --fast, the lowest compression-ratio setting) is clearly what I wanted. In --fast mode both compression and decompression are multiple times faster than with bzip2 or lzma. The compression ratio is still in the same order of magnitude as bzip2 and lzma (not more than 2x worse), enough to pretty much guarantee that the whole conversion won't be I/O-bound on an average 1TB SATA disk and an Athlon 64 X2. The ratio of CPU cycles consumed (bzcat/osmosis/gzip) is quite sane now: 35/100/15. If the benchmark I linked is right, then reading the new planet.osm for further processing should also be multiple times faster than any of: uncompressed, bzip2 or lzma.
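If you want to see the trade-off on your own hardware, the effect of the gzip level is easy to measure on a slice of the planet; a minimal sketch (the sample file name is made up):

    import gzip
    import time

    # A slice cut from an uncompressed planet.osm; the path is hypothetical.
    data = open("planet-sample.osm", "rb").read()

    for level in (1, 6, 9):
        start = time.time()
        compressed = gzip.compress(data, compresslevel=level)
        elapsed = time.time() - start
        print("gzip -%d: %.1f s, ratio %.2f"
              % (level, elapsed, len(data) / float(len(compressed))))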
EDIT: the gunzip/osmosis/gzip process takes just over 3h on the above setup and the planets are 10GB in size; the CPU usage proportions were about 7/100/13 on average.
Discussion
Comment from Mungewell on 18 June 2009 at 23:14
Interesting... there's another compressor you could try, 'LZO', which is intended to be fast on the decompression side. After all, the planets are a compress-once, decompress-many scheme.
Most of these compression algorithms spend some time computing the best 'dictionary' to use. Since the planets contain pretty much the same data, is it possible to force a pre-defined 'dictionary' into the compressor to speed it up?
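For what it's worth, zlib does expose exactly this mechanism as a "preset dictionary"; a minimal Python sketch, with a made-up dictionary of common OSM XML fragments, just to show how it is wired in:

    import zlib

    # Made-up preset dictionary: byte strings that occur all over planet.osm.
    # zlib favours the end of the dictionary, so put the most common text last.
    OSM_DICT = b'<tag k="highway" v="residential"/><node id=" lat=" lon=" version="'

    sample = b'<node id="123" lat="52.1" lon="21.0" version="1"/>'

    comp = zlib.compressobj(level=1, zdict=OSM_DICT)
    packed = comp.compress(sample) + comp.flush()

    # The decompressor must be given the same dictionary.
    decomp = zlib.decompressobj(zdict=OSM_DICT)
    assert decomp.decompress(packed) == sample

(A preset dictionary mostly buys compression ratio on short, repetitive chunks rather than raw speed, but the plumbing is there.)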
Comment from Mungewell on 18 June 2009 at 23:42
Oh, I should add... you did recompile the compressors with optimisations for your machine, didn't you?
Comment from Gavin on 19 June 2009 at 08:26
The optimum would seem to be a compression format that could be patched without decompressing. I would guess that the number of compression frames that actually change on a daily basis is quite small, so some sort of post-compression diff could be used. How this would work in practice (how many bytes of the file change) I don't know, and I would assume that the algorithm would have to be specifically designed to allow it.
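One existing hook for something like this: the gzip format allows independently compressed "members" to be concatenated into a single valid .gz file, so a planet compressed block by block would only need the changed blocks recompressed. A rough Python sketch of the idea, not anything the planet dumps actually do:

    import gzip

    BLOCK = 1 << 20  # 1 MiB per block; the size is arbitrary

    def block_compress(src_path, dst_path):
        """Compress src into independent gzip members, one per block.

        Concatenated members are still a single valid .gz file, so a
        hypothetical patch tool would only have to recompress the members
        whose underlying blocks changed.
        """
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while True:
                chunk = src.read(BLOCK)
                if not chunk:
                    break
                dst.write(gzip.compress(chunk, compresslevel=1))

    # gzip, zcat and Python's gzip module all read multi-member files
    # transparently, e.g. gzip.open("planet.osm.gz", "rb").read().

Since an OSM diff inserts and deletes bytes, the blocks would have to follow the data (ranges of element IDs, say) rather than fixed offsets for the patching to stay local.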
Comment from Firefishy on 19 June 2009 at 11:33
planet.osm is compressed using pbzip2; you will likely be able to get slightly faster decompression using pbzip2 instead of bzip2.
Comment from balrog-kun on 19 June 2009 at 15:03
Thanks for the suggestions. As for pbzip2: on a dual-core CPU it could give at most a 2x speedup, which would still make compression the bottleneck in the whole process (and note that the second core isn't idle either, so probably less than 2x). If the benchmark I linked is to be trusted, you would need to run pbzip2 on an 8-or-so-core system for it to compete with gzip in both compression and decompression.
(I haven't recompiled bzip2 or gzip specifically, but I'm running this on a Gentoo system and let it "build itself" with what it thought was good; I set the CFLAGS quite aggressively, including -O3, -march=athlon64, -msse and -m3dnow.)
And as for LZO: oooh I think that's what I was really looking for when I googled LZMA yesterday (didn't remember the exact name). I'll try it on one of the planets this week and report back.
Comment from amapanda ᚛ᚐᚋᚐᚅᚇᚐ᚜ 🏳️⚧️ on 20 July 2009 at 20:45
re: using a custom dictionary. If you want to get that specific, you could just make an OSM binary file format and use that. It should give the most bang for the buck.