And it doesn't matter if all of Wikipedia is embedded in the compressor — if you can reconstruct the original using only the decompressor and the compressed data, the process can be as expensive as you like.
That's the exciting part.
There's also a vague connection to what's exciting about LLMs: the amount of information they can reproduce relative to the comparatively small size of their weights. But since decompression must be lossless, it's unclear to me whether this approach could really help here.
> And it doesn't matter if all of Wikipedia is embedded in the compressor
That's not what I said. I said you can embed a better result. As in, make an algorithm that performs OK in the general case, but bake in a super-optimised result that specifically applies to Wikipedia, even if it takes months to precompute.
> Why is a time limit needed?
Would you like to judge my compressor against other ones? It will take 2000 years to get the result, but it's totally worth it, I promise. It's a random search over binaries that produce the Wikipedia file, and it will stop once I'm over the winning threshold.
Thanks for replying. I see a bit more of your point now I guess.
I still tend to consider the complexity needed to produce that binary irrelevant, because the challenge isn't interested in practical compression, only in space complexity and the relation between information entropy and intelligence, no?
If such a highly-specialized binary could be theoretically produced for other corpuses of the same size as well, that would be very interesting! Regardless of how practical their computation is.
And if it's not possible for other corpuses, the complexity has to be somehow embedded in the decompressor, not the compressor, or am I missing something? The only alternative would be that the compression exploits some data inherent to the machine architecture that is not part of the executable. Still, this only shifts the problem around.
The idea reminds me of an optimized ordering of Gödel numbers.
The pigeonhole principle limits what a general lossless compressor can do, but we don't know how the minimum size of a compressed file scales with the output size, because this problem is related to the halting problem. Even if we assume infinite compute, we can't compute a function |c(n)| that specifies the minimum length of a program producing an arbitrary output of size |n|. If I'm not totally ignorant and lacking information here.
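The counting argument behind the pigeonhole limit is easy to check directly: there are strictly fewer bitstrings shorter than n than there are bitstrings of length n, so no lossless compressor can shrink every input. A minimal sketch:

```python
# There are 2**n bitstrings of length n, but only 2**n - 1 bitstrings
# of length strictly less than n (lengths 0 through n-1 combined).
# So any compressor that shrinks some inputs must expand (or leave
# unchanged) at least one other input -- it can't be injective otherwise.
n = 16
inputs = 2 ** n
shorter_outputs = sum(2 ** k for k in range(n))  # lengths 0..n-1

assert shorter_outputs == inputs - 1
assert shorter_outputs < inputs
```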
>since decompression must be lossless, it's unclear to me if this approach could really help here.
All lossy compression can be made lossless by storing a residual. A modern language transformer is well suited to compression because they output a ranked list of probabilities for every token, which is straightforward to feed into an entropy encoder.[0]
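A minimal sketch of the residual idea, assuming a trivial "predict the previous byte" model in place of a language model, with zlib standing in for a proper entropy coder:

```python
import zlib

def compress(data: bytes) -> bytes:
    # "Lossy" model: predict each byte equals the previous one.
    # Storing the residual (difference mod 256) corrects every
    # prediction error exactly, so the scheme is lossless overall.
    prev = 0
    residual = bytearray()
    for b in data:
        residual.append((b - prev) % 256)
        prev = b
    return zlib.compress(bytes(residual))

def decompress(blob: bytes) -> bytes:
    residual = zlib.decompress(blob)
    prev = 0
    out = bytearray()
    for r in residual:
        prev = (prev + r) % 256
        out.append(prev)
    return bytes(out)

# Smooth data -> mostly-constant residuals -> very compressible.
data = bytes(range(200)) * 50
assert decompress(compress(data)) == data
```

The better the model predicts, the more skewed the residual distribution becomes, and the fewer bits the entropy stage needs; a transformer's per-token probability ranking plays the same role as the crude predictor here.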
On the other hand, LLMs expose the flaw in the Hutter prize - Wikipedia is just too damn small, and the ratio of "relevant knowledge" to "incidental syntactic noise" too poor, for the overhead of putting "all human knowledge" in the encoder to be worth it. A very simple language model can achieve very good compression already, predicting the next word almost as well as a human can in absolute terms. The difference between "phone autocomplete" and "plausible AGI" is far into the long tail of text prediction accuracy.
Probably a state of the art Hutter prize winning strategy would simply be to train a language model on Wikipedia only, with some carefully optimal selection of parameter count. The resulting model would be useless for anything except compressing Wikipedia though, due to overfit.
[0] A practical difficulty is that large language models often struggle with determinism, even disregarding the random sampling used for ChatGPT etc., when large numbers of floating-point calculations are sharded across a GPU and performed in an arbitrary order. But this can easily be overcome at the expense of efficiency.
> On the other hand, LLMs expose the flaw in the Hutter prize
I'm not sure why it would be a flaw — isn't this more a sign of how universally interesting the rules are?
Out of curiosity — I didn't dive deeply into the previous winners' implementations, but are there NNs (+ error correction matrix, which is why I mentioned the FAQ in another comment) among them?
Edit: all of this is already addressed in the FAQ at http://prize.hutter1.net/
Especially the paragraph "Why lossless compression?" is interesting to read.