They sorted the results by speed of 1st run. For a language like Julia, which is JIT-compiled, that's not a fair comparison, considering that you compile once and run millions of times.
Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.
EDIT: Also... let's think critically about some of the entries there. Most of them are languages, but then you have things like Arrow, which is a data format; Spark, which is an engine; and ClickHouse and DuckDB, which are databases. The databases (and Spark) will have to read from disk. They have no chance of competing with anything that's reading from RAM, no matter how slow it is. They were built for different purposes. These are borderline meaningless comparisons.
- On advanced query: 3rd, 6th, 6th, 4th (up 1), - (out of memory).
> The databases (and Spark) will have to read from disk. They have no chance of competing with anything that's reading from RAM, no matter how slow it is.
Not true. From a quick peek at the benchmark code, ClickHouse and Spark use in-memory tables. I assume the other engines do too.
Note that the compile times of Julia are not included in the benchmarks. If you read the website, you'd have seen that the graphs show the first run (excluding compilation) and the second run (with hot cache).
Also, in the second run Julia is not the fastest. Julia would not be faster than Rust; it's got a garbage collector. This is what you see in the join benchmarks, which really push the allocator.
On top of that, the databases run in in-memory mode, so there is no disk overhead. Spark is slower because of the JVM plus row-wise data.
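To make the row-wise vs. column-wise point concrete, here is a minimal Python sketch. The table, its column names, and the helper functions are all hypothetical; it only illustrates the two layouts and is not how Spark or any of the benchmarked engines actually store data.

```python
from typing import Dict, List, Tuple

# Hypothetical three-column table: (id, group, value).
Row = Tuple[int, str, float]


def sum_value_row_wise(rows: List[Row]) -> float:
    """Row layout: every whole record is visited just to read one field."""
    total = 0.0
    for _id, _group, value in rows:
        total += value
    return total


def to_columns(rows: List[Row]) -> Dict[str, list]:
    """Convert the row layout into one list per column."""
    return {
        "id": [r[0] for r in rows],
        "group": [r[1] for r in rows],
        "value": [r[2] for r in rows],
    }


def sum_value_column_wise(columns: Dict[str, list]) -> float:
    """Column layout: only the 'value' column is touched."""
    return sum(columns["value"])


if __name__ == "__main__":
    rows = [(1, "a", 1.5), (2, "b", 2.5), (3, "a", 4.0)]
    columns = to_columns(rows)
    # Both layouts give the same answer; they differ in how much data is scanned.
    print(sum_value_row_wise(rows), sum_value_column_wise(columns))
```

Both functions return the same sum. The difference is that an aggregation over one column only needs to scan that column in the columnar layout, which is the usual argument for why columnar engines do well on this kind of query.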
> Note that the compile times of Julia are not included in the benchmarks. If you read the website, you'd have seen that the graphs show the first run (excluding compilation) and the second run (with hot cache).
Here's my view: the author of that page has commented here on HN; if my claim were as outrageously wrong as you say, he would've corrected it.
As mentioned in that thread, GC and strings, or especially a combination of the two, can be very much a downer in terms of Julia performance. That's actually pretty surprising, since strings are often as important as, if not more important than, numbers for a lot of data processing needs.
I'd also say in terms of compilation time, some autocaching layer outside of precompilation would do wonders.
Agreed, and I was looking for an option to sort by second run.
One trick I've tried to some effect is to run the Julia code on a smaller data size first, so the compilation gets done, and then repeat on the large one so it doesn't get interrupted by compilation. Not sure if this is a recommended approach. Benchmarking Julia is a pain for this reason: compilation always gets mixed up with runtime. But it hasn't prevented me from using it interactively. Pretty happy with it, actually.
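The warm-up trick can be sketched as a small timing harness. This is a Python illustration of the pattern only, not the benchmark's own code; `time_after_warmup` and `group_sum` are hypothetical names, and `group_sum` is just a stand-in for a real query.

```python
import time
from typing import Callable, List, Sequence


def time_after_warmup(fn: Callable[[Sequence[int]], object],
                      small_input: Sequence[int],
                      large_input: Sequence[int],
                      repeats: int = 2) -> List[float]:
    """Run fn once on a small input (untimed), then time it on the large one."""
    # Warm-up call: pays any one-time cost (JIT compilation, cache fills)
    # before the clock starts.
    fn(small_input)

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(large_input)
        timings.append(time.perf_counter() - start)
    return timings


def group_sum(values: Sequence[int]) -> dict:
    """Hypothetical stand-in for a benchmark query: sum values grouped by value % 10."""
    totals: dict = {}
    for v in values:
        key = v % 10
        totals[key] = totals.get(key, 0) + v
    return totals


if __name__ == "__main__":
    timings = time_after_warmup(group_sum, range(1_000), range(1_000_000))
    print([f"{t:.4f}s" for t in timings])
```

One caveat if you carry this over to Julia: methods are compiled per combination of argument types, so the small input needs the same types as the large one, otherwise the big run will still trigger compilation.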
Not really. They are designed to showcase a common use case across multiple technologies.
The beauty of this benchmark is that there is a hardware limit included so that it forces you to create novel solutions to perform well.
> Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.
Not sure where you're getting that, but even on the second run Julia doesn't really compete with DT/Polars.
It's obvious that you're promoting duck eggs at the expense of, say, chicken eggs or quail eggs or even ostrich eggs. Maybe you could tone that down a bit.
> considering that you compile once and run millions of times.
If you’re writing data pipelines then yes, but a lot of Pandas users use it interactively. As much as I’d rather use Julia, the last time I tried it I found myself waiting for computation far more often than with a Jupyter/Python workflow.
Give it another try. They've improved the first run times quite a bit over the last few versions. Package precompilation has gotten way better as well.