Concatenating compressed files

Bismillaahir Rahmaanir Raheem

I have my new server (aalimraan.hidayahonline.net – the one hosting Audio Islam) set up to log each day’s web accesses to its own file, which is then bzip2-compressed.  This is convenient for a variety of reasons.

  • First, I can easily access statistics for each day by processing the appropriate file.
  • Second, I can see the relative activity on each day at a glance from the size of the file, keeping in mind that it’s only a rough estimate, since compression can skew the results (e.g., many requests for the same exact URL may compress much better than fewer, disparate URL requests, resulting in a smaller file and a seemingly less active day).
  • Third, being compressed text files, they take up very, very little space.
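To make the caveat in the second point concrete, here is a minimal sketch that prints the compressed size next to the actual number of requests for each daily file. The log lines and file names are made up for illustration (they are not my real logs), and it assumes bzip2, bzcat and awk are installed.

```shell
# Hypothetical sample data: two made-up "daily logs" in a scratch directory.
workdir=$(mktemp -d)

# Day A: 200 requests, all for the same URL.
awk 'BEGIN { for (i = 1; i <= 200; i++) print "192.0.2.1 - - \"GET /same.mp3 HTTP/1.1\" 200" }' \
  | bzip2 -c > "$workdir/access_log-a.bz2"

# Day B: only 50 requests, but each for a different URL from a different address.
awk 'BEGIN { for (i = 1; i <= 50; i++) print "192.0.2." i " - - \"GET /file-" i ".mp3 HTTP/1.1\" 200" }' \
  | bzip2 -c > "$workdir/access_log-b.bz2"

# Show compressed size and real request count side by side for each file.
for f in "$workdir"/access_log-*.bz2; do
  bytes=$(wc -c < "$f" | tr -d ' ')
  requests=$(bzcat "$f" | wc -l | tr -d ' ')
  echo "$(basename "$f"): $bytes bytes compressed, $requests requests"
done
```

If the busier day comes out with the smaller file, that is exactly the skew described above: file size is a hint, and the line count is the real figure.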

The fact that the files are compressed using bzip2 means they are relatively tiny (about an order of magnitude smaller than their uncompressed forms).  On top of this, I can easily access their contents using bzcat, which simply decompresses the files on the fly, allowing me to pipe the decompressed text stream to whatever utility I want to process it with, such as grep or wc.
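As a rough illustration of that on-the-fly processing, the sketch below builds a tiny stand-in for one day’s log and then counts requests with bzcat, wc and grep. The file name, addresses and URLs are invented for the example; only the bzcat-into-a-pipe pattern is the point.

```shell
# Hypothetical sample data standing in for one day's access log.
workdir=$(mktemp -d)
printf '%s\n' \
  '192.0.2.1 - - [25/Dec/2008:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512' \
  '192.0.2.2 - - [25/Dec/2008:10:00:05 +0000] "GET /audio/lecture.mp3 HTTP/1.1" 200 90210' \
  '192.0.2.3 - - [25/Dec/2008:10:00:09 +0000] "GET /index.html HTTP/1.1" 200 512' \
  | bzip2 -c > "$workdir/access_log-20081225.bz2"

# bzcat writes the decompressed text to stdout; the file on disk stays compressed.
total=$(bzcat "$workdir/access_log-20081225.bz2" | wc -l | tr -d ' ')

# Count only the requests for one particular URL.
hits=$(bzcat "$workdir/access_log-20081225.bz2" | grep -c 'GET /index.html')

echo "total requests: $total, requests for /index.html: $hits"
```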

I wanted to download the files in bulk to my laptop at home so I could write a PHP script to process the access logs and store them in a database.  From there I could extract all kinds of goodies, such as which sites link to Audio Islam the most, which URLs are the most popular, and so on.  Rather than download each file individually, I wanted to combine the log files created thus far into one larger file that I could download in one go.  Additionally, several text files compressed together as a single stream would naturally yield a smaller file than the separate daily files put together (at least, so one would assume).

So, I ran the following command (or something like it):
cat access_log*.bz2 | bzip2 -c > ~/aalimraan.hidayahonline.org-access_log-20081226.bz2
The intention was to create one large bzip2-compressed file out of all the smaller daily files by first decompressing them into a continuous text stream (as if it were one large log file) and then recompressing that stream into a single file.  But when I got around to processing that file on my local machine, the decompressed output was garbage!  I was surprised, and then an idea hit me: I decided to run the decompressed output through bzcat once again (in effect decompressing it twice).  Lo and behold, the output of that invocation was something that looked astonishingly like a web server access log!  So what happened?
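For anyone who wants to see the symptom for themselves, here is a small sketch that reproduces it with two made-up daily files (the names and contents are placeholders, not my real logs). It relies on the fact that every bzip2 stream starts with the bytes "BZh", so one round of decompression producing data that begins with "BZh" means there is still compressed data inside.

```shell
# Hypothetical sample data: two tiny "daily logs" in a scratch directory.
workdir=$(mktemp -d)
printf 'day one, request 1\n' | bzip2 -c > "$workdir/access_log-20081224.bz2"
printf 'day two, request 1\n' | bzip2 -c > "$workdir/access_log-20081225.bz2"

# The same kind of pipeline as above: the daily files go into bzip2 as they are.
cat "$workdir"/access_log-*.bz2 | bzip2 -c > "$workdir/double.bz2"

# One round of decompression does not produce text; it produces bzip2 data again.
magic=$(bzcat "$workdir/double.bz2" | head -c 3)
echo "after one bzcat the data starts with: $magic"

# Only a second round of decompression gets back to readable log lines.
first_line=$(bzcat "$workdir/double.bz2" | bzcat | head -n 1)
echo "after two bzcats the first line is: $first_line"
```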

Look closely at my command, and you’ll see that I made the quite silly mistake of using plain old cat instead of bzcat.  Thus, I was recompressing a stream of already bzip2-compressed files, which is, to say the least, quite pointless.  In fact, the resulting file is all but useless to me, because I cannot really tell where one file begins or ends, so I only get output from the first bzip2 file.
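For comparison, here is a sketch of what I meant to run, again using made-up daily files and a placeholder output name rather than my real ones. The only changes that matter are bzcat at the front, so that plain text is what gets fed to bzip2, and a `>` redirect to write the result to a file.

```shell
# Hypothetical sample data: two tiny "daily logs" in a scratch directory.
workdir=$(mktemp -d)
printf 'day one, request 1\nday one, request 2\n' | bzip2 -c > "$workdir/access_log-20081224.bz2"
printf 'day two, request 1\n' | bzip2 -c > "$workdir/access_log-20081225.bz2"

# bzcat decompresses each daily file in turn into one continuous text stream;
# bzip2 -c compresses that stream to stdout, which is redirected into the combined file.
bzcat "$workdir"/access_log-*.bz2 | bzip2 -c > "$workdir/combined-access_log.bz2"

# A single round of decompression now yields every line from every day.
bzcat "$workdir/combined-access_log.bz2"
```

The last command prints all three sample lines in date order, since the shell expands the wildcard in sorted order.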

Needless to say, let this be a lesson to make sure, when piping data around, you know what you’re doing.

This has been a public service announcement from your local system admin.  Thank you for listening!
