The perfect compression algorithm

Jan 09 2010

A story from my college days.

I stormed in with full excitement to the room of two of my friends. I had found the perfect compression algorithm.

“I found a compression algorithm that can reduce the size of any file!”

“That is impossible. What is your algorithm?”

“I cannot share the algorithm right now, but I will tell you once I write a program to compress files.”

“Don’t you know that it is impossible to write an algorithm to compress all files?”

“Yes, but… now I have this algorithm!”

So what was my algorithm?

1. Consider the text to compress as a large number.

2. Find the integer square root of the number.

3. The square root and the error form the compressed text.

For example, let us compress the number 98653784578.

Integer square root of  98653784578 = 314092

Error = 114 (i.e. 98653784578 – 3140922)

314092 and 114 form the compressed text. We just saved 2 characters using our algorithm!

If the original number is of length n, then the square root and the error will be of maximum lengths n/2 each. So this will mean that the final compressed text will be of a maximum length n and possibly even smaller.

Can you find the flaw in the logic?

Looks like I was not the only one who invented a perfect compression algorithm. At least I did not spend 12 years to understand that perfect compression is not possible and I did not get publicly humiliated (not until this post, but who is reading this anyway ;) ). There are various proofs for why perfect compression cannot be achieved.

18 responses so far

  • Joyce Babu says:

    I think you will face the following problem with this logic. In order to differentiate between different integral-error pairs, you can follow two approaches.

    1. You can insert a separator between the integral part and the error. Also you need an separator between two different pairs. Now the problem is that the separator has to be unique and should not be confused as part of the integral part or the error. For that the separator should have a minimum width of n/2+1.

    2. You can set a fixed width for both integral part and error. In which case both integral and error requires a minimum length of n/2.

  • Faiz says:

    I used to think the same way. Instead I was thinking about the storing the combination sequence number (which will be larger than the number itself, if number is greater than 21). Finally my brother helped me understand the concept. If you need to store 1 bit of information, you need 1 bit. There is no other way.

  • Hugo says:

    correct me if I’m wrong, but to me the flaw seems to be in this statement:

    “If the original number is of length n, then the square root and the error will be of maximum lengths n/2 each”

    for numbers where n is uneven, the maximum length of square root and error are (n+1)/2. The same algorithm on the number 10,200:

    int_sqrt(10,200) = 100
    error = 10,200 – 100^2 = 200

    our resulting numbers are six digits long together, one longer than the original number.

  • Fazil says:

    1. Sometime length n/2 is neither correct for root nor error. Consider number 7, needs 3 bits to store. Root is 2, need 2 bits to store. Error is 3 need another 2 bits. Total of 4 bits.

    2. Second challenge is where to store length of root and error.

  • Ben says:

    This doesn’t work for all inputs. You can’t compress single bit input.

  • Ali says:

    Well, not in square but I think there is a way to save some bits if we knew some additionaly logical information.
    e.g. in single precision floating point the significand (mantissa) is 23-bits but bit-precision is 24-bits because you know that the significand always begin with 1. Thats the way it should go.

  • Ali says:

    We need logical information that does not have to be written, we should know it in advance. I actualy don’t know how, it has to be invented, but I image it in that way –> Compress a file as good as possible, than invert every second byte and compress it again. So we don’t need to save the information about byte-inversion, we just know it.
    (inverting the bytes is to easy, i wished I knew the right way)

  • hohums says:

    I think what you do is use string theory to compress the data into another dimension. Since the data is uncompressed in another dimension and doesn’t exist in your dimension its zero bytes… but there is a problem… how do you get the data back when you need it? You need to store something about where it was stored and what dimension.

    So what you do is store it one dimension over in the same spot on the same harddrive. Then you can always retrieve it. The problems is that inevitably some other dimension will try to use your HD to store some sort of useless information on it. Therefore you need to make sure that the HD is properly shielded from inter-dimensional-data-hijacking.

    There you go… zero byte compression. So simple.

  • Ali says:

    hohums, I see you watched too long the films like matrix or the-one.
    But I’ve got some news for you, the earth is a globe!

  • Who took 12 years to figure it out? Not sure if that person was working on the same dimension.

    I think like Ali said, you may have watched too much matrix or science fiction stuff. Maybe this is breakthrough but I have yet to confirm that it is. Neat concept though.

  • Sridni says:

    Can I know the name of the one responsible for perfect compression algorithm please.. “sir can i know your name?”

  • Albert says:

    If you think about compression, you’ll have to think the whole way, not stop halfway. It’s nice to have 2 separate numbers with less characters than the single number started with, but how do you store these two numbers such that a de-compressor can still separate them? Either you need fixed length blocks or you need an algorithm that can perform a 1 to 1 transformation, back and forth, so from one number to two numbers and vice versa. This seems to be a bottleneck.

  • 5yrsTryingAndCounting says:

    Well, I gotta admit, I’m still churning this one over… From what I can determine, the most likely way to ‘solve’ this would be to create some sort of algorithm that _describes_ the original data… don’t try to think of it in any other way. That Magic Algorithm would probably do one thing on the compression end: create another algorithm that could be ‘executed’ and would spit out the original file… I don’t know enough math to do this, so there’s a large level of reasonable doubt, but has anyone ever thought of charting a binary sequence on a graph? 1 goes up, 0 goes down (y-axis) as position through the sequence goes from 0-n on the x-axis where n is the length of the sequence… once that is done… can an equation be made to describe those intersection points? if so… then possibly use some higher math functions (analog concepts, not digital) like a Fourier Analysis or something to break out its component parts..?

    This would sort of ‘transition’ this digital problem into the realm of analog signals processing… where a lot of ‘tricks’ can be played on the “data” … just a thought for any of you geniuses out there who might have some tricks up your sleeves… if you do figure a way out… please keep it OpenSource under GNU Licensing.

    Thanks.

    • 5yrsTryingAndCounting says:

      Just thought I’d comment back here. After talking with someone who’s familiar with Fourier Series, this won’t work. Fourier Series require repeating patterns and even then, only on pre-determined harmonics… Still might be possible to represent a bit sequence as a harmonic frequency, but it’s most likely to cause a data-explosion such that the outcome is a larger file than the file itself.. :-(

      • 5yrsTryingAndCounting says:

        Although the original idea of (charting the “position” of a bit and finding a mathematical 2-D function to represent the data, such that each “whole number” in the formula corresponds to a point on the graph) might still work… but it bears a whole lot more thinking…. also, if that can be shown to have even marginal success, then how about trying to chart everything in 3-D or 4-D (or n-D) and creating a function that “hits” all those spots in ‘sequence’ …

        Best of luck to anyone out there w/ enough time/experience/knowledge/interest to pursue this approach!

  • 5yrsTryingAndCounting says:

    Here’s a fun thought for you all…. in the int_sqrt scenario… why store the error? so you end up w/ a set of “candidate” bit sequences… if you hash the original bit sequence beforehand (say MD5) and also record the original bit sequences’ length then you’ll confine the possible permutations reasonably enough to brute-force MD5 each of the “candidate” bit sequences.

    In the case of MD5, your limit case would be 16 bytes (the ‘new’ MD5) + int_sqrt( previous iteration and its MD5)

    w/o doing the math on that limit case, I’d guess it to be around 32bytes… not that big…

    Trade-off: since the “compression” is lossy, you trade off filesize for computation time to pick the correct “candidate” bit-sequence.

    (we have really fast computers now ;-) )

    • 5yrsTryingAndCounting says:

      In the scenario above of 10200:

      assume you have a file, when read as 1′s and 0′s it magically happens to correspond to the decimal number 100…

      100*100 = 10000

      the next possible number would be 101.

      101*101 = 10201

      10201 – 10000 = 201 possibilities…

      so starting w/ 10000, md5 it.
      no good
      iterate so you have 10001.
      md5 it
      no good
      … .etc etc etc….

      10200, md5 it -> BINGO!

      as it would pertain to file compressions… start w/ an original file (the one to be compressed). MD5 it and store that key in a meta-data file. interpret it as a bit sequence (a really really really really big “number”). int_sqrt it. store that new (much smaller number) in a “compressed” file.

      when you need to de-compress, read “compressed” file and interpret as a really really really big number (1 less ‘really’ ;-) ) . iterate it to find the upper bound on what it can’t be. then:

      square = bigNumber * bigNumber
      md5 square.
      no good
      iterate square
      md5 square.
      no good.

      etc.. etc.. etc…
      iterate square
      md5 square -> BINGO! YAY!

Leave a Reply