And it doesn't matter if all of Wikipedia is embedded in the compressor — if you can reconstruct the original using only the decompressor and the compressed data, the process can be as expensive as you like.
That's the exciting part.
There's also a vague connection to what's exciting about LLMs: the amount of information they can reproduce relative to the comparatively small size of their weights. But since decompression must be lossless, it's unclear to me whether this approach could really help here.
> And it doesn't matter if all of Wikipedia is embedded in the compressor
That's not what I said. I said you can embed a better result. As in, make an algorithm that performs OK in the general case, but bake in a super-optimised result that specifically applies to Wikipedia, even if it takes months to precompute.
> Why is a time limit needed?
Would you like to judge my compressor against other ones? It will take 2000 years to get the result, but it's totally worth it, I promise. It's a random search over binaries that produce the Wikipedia file, and it will stop once I'm over the winning threshold.
Thanks for replying. I see a bit more of your point now I guess.
I still tend to consider the complexity needed to produce that binary irrelevant, because the challenge isn't interested in practical compression, only in space complexity and the relation between information entropy and intelligence, no?
If such a highly-specialized binary could be theoretically produced for other corpuses of the same size as well, that would be very interesting! Regardless of how practical their computation is.
And if it's not possible for other corpuses, the complexity has to be somehow embedded in the decompressor, not the compressor, or am I missing something? The only alternative would be that the compression exploits some data inherent to the machine architecture that is not part of the executable. Still, this only shifts the problem around.
The idea reminds me of an optimized ordering of Gödel numbers.
The pigeonhole principle limits what a general lossless compressor can do, but we don't know how the minimum size of a compressed file scales with the output size, because this problem is related to the halting problem. Even if we assume infinite compute, we can't compute a function |c(n)| that specifies the minimum length of a program producing an arbitrary output of size |n|. If I'm not totally ignorant and lacking information here.
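The counting argument behind the pigeonhole limit is easy to check directly: there are strictly fewer bitstrings shorter than n than there are bitstrings of length n, so no lossless compressor can shrink every input. A minimal sketch:

```python
# There are 2**n bitstrings of length n, but only 2**n - 1 bitstrings
# of length strictly less than n (lengths 0 through n-1 combined).
# So any compressor that shrinks some inputs must expand (or leave
# unchanged) at least one other input -- it can't be injective otherwise.
n = 16
inputs = 2 ** n
shorter_outputs = sum(2 ** k for k in range(n))  # lengths 0..n-1

assert shorter_outputs == inputs - 1
assert shorter_outputs < inputs
```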
>since decompression must be lossless, it's unclear to me if this approach could really help here.
All lossy compression can be made lossless by storing a residual. A modern language transformer is well suited to compression because they output a ranked list of probabilities for every token, which is straightforward to feed into an entropy encoder.[0]
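A minimal sketch of the residual idea, assuming a trivial "predict the previous byte" model in place of a language model, with zlib standing in for a proper entropy coder:

```python
import zlib

def compress(data: bytes) -> bytes:
    # "Lossy" model: predict each byte equals the previous one.
    # Storing the residual (difference mod 256) corrects every
    # prediction error exactly, so the scheme is lossless overall.
    prev = 0
    residual = bytearray()
    for b in data:
        residual.append((b - prev) % 256)
        prev = b
    return zlib.compress(bytes(residual))

def decompress(blob: bytes) -> bytes:
    residual = zlib.decompress(blob)
    prev = 0
    out = bytearray()
    for r in residual:
        prev = (prev + r) % 256
        out.append(prev)
    return bytes(out)

# Smooth data -> mostly-constant residuals -> very compressible.
data = bytes(range(200)) * 50
assert decompress(compress(data)) == data
```

The better the model predicts, the more skewed the residual distribution becomes, and the fewer bits the entropy stage needs; a transformer's per-token probability ranking plays the same role as the crude predictor here.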
On the other hand, LLMs expose the flaw in the Hutter prize - Wikipedia is just too damn small, and the ratio of "relevant knowledge" to "incidental syntactic noise" too poor, for the overhead of putting "all human knowledge" in the encoder to be worth it. A very simple language model can achieve very good compression already, predicting the next word almost as well as a human can in absolute terms. The difference between "phone autocomplete" and "plausible AGI" is far into the long tail of text prediction accuracy.
Probably a state of the art Hutter prize winning strategy would simply be to train a language model on Wikipedia only, with some carefully optimal selection of parameter count. The resulting model would be useless for anything except compressing Wikipedia though, due to overfit.
[0] A practical difficulty is that large language models often struggle with determinism, even disregarding the random sampling used for ChatGPT etc., when large numbers of floating-point calculations are sharded across a GPU and performed in an arbitrary order. But this can easily be overcome at the expense of efficiency.
> On the other hand, LLMs expose the flaw in the Hutter prize
I'm not sure why it would be a flaw — isn't this more a sign of how universally interesting the rules are?
Out of curiosity — I didn't dive deeply into the previous winners' implementations, but are there NNs (+ error correction matrix, which is why I mentioned the FAQ in another comment) among them?
Edit: all of this is already addressed in the FAQ at http://prize.hutter1.net/
Especially the paragraph "Why lossless compression?" is interesting to read.