The benchmarks speak volumes of dishonesty. They sorted the results by speed of ...

apd_ · on Dec 17, 2021

> Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...).

Not true. If we ranked them by second run, Julia would be:

- On simple query: 1st, 1st, 4th, 1st, 5th (down 1).

- On advanced query: 3rd, 6th, 6th, 4th (up 1), - (out of memory).

> The databases (and spark) will have to read from disk. They have no chance of competing with anything that's reading from ram, no matter how slow it is.

Not true. From a quick peek at the benchmark code, ClickHouse and Spark use in-memory tables. I assume the other engines do too.

ritchie46 · on Dec 17, 2021

Note that the compile times of Julia are not included in the benchmarks. If you read the website, you'd have seen that the graphs show the first run (excluding compilation) and the second run (with a hot cache).

Also, in the second run, Julia is not the fastest. Julia would not be faster than Rust; it's got a garbage collector. This is what you see in the join benchmarks, which really push the allocator.

Next to that, the databases run in in-memory mode, so there is no disk overhead. Spark is slower because of the JVM plus row-wise data.

sdfgsdf · on Dec 17, 2021

> Note that the compile times of Julia are not included in the benchmarks. If you read the website, you'd have seen that the graphs show the first run (excluding compilation) and the second run (with a hot cache).

Here's my view: the author of that page has commented here on HN; if my claim were as outrageously wrong as you say, he would've corrected it.

fault1 · on Dec 18, 2021

yeah, but your claim was "Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run"

notice this isn't even a language vs language benchmark. it's libraries and frameworks.

plus I don't think even the author of the julia library in question would agree with your statement: https://discourse.julialang.org/t/the-state-of-dataframes-jl...

as mentioned in that thread, GC and strings, or especially a combination of the two, can be very much a downer in terms of julia performance. That's actually pretty surprising since strings are often as important if not more important than numbers for a lot of data processing needs.

I'd also say in terms of compilation time, some autocaching layer outside of precompilation would do wonders.

rscho · on Dec 17, 2021

> Julia would not be faster than Rust, it's got a garbage collector.

Having a garbage collector does not intrinsically make things slower. Especially so outside of the benchmarking microcosm.

adgjlsfhk1 · on Dec 17, 2021

that said, Julia currently has a slow GC so it does hurt. GC performance is being worked on though. I have high hopes for a year or 2.

sriku · on Dec 17, 2021

Agree .. and I was looking for an option to sort by second run.

One trick I've tried to some effect is to run the jl code on a smaller data size first so the compilation gets done, and then repeat on the large one so it doesn't get interrupted by compilation. Not sure if this is a recommended approach. Benchmarking Julia is a pain for this reason: compilation always gets mixed up with runtime. But it hasn't prevented me from using it interactively. Pretty happy with it, actually.
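
A minimal sketch of that warm-up-then-measure pattern, written in Python only to show the shape of the harness. The names `timed`, `warm_then_time`, and `group_sum` are hypothetical and not taken from the benchmark code. CPython has no JIT, so the small-input call here merely stands in for the one-time compilation cost; in Julia the warm-up call would need the same argument types as the real run, since methods are compiled per type specialization.

```python
import time


def timed(fn, *args):
    """Call fn(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def warm_then_time(fn, small_input, large_input):
    """Run fn on a small input to absorb one-time costs, then time the large input."""
    _, warmup_s = timed(fn, small_input)    # first call: pays any one-time cost
    result, run_s = timed(fn, large_input)  # second call: the number to report
    return result, warmup_s, run_s


def group_sum(rows):
    """Toy group-by workload: sum the values for each key."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


# Same element types in both inputs; only the size differs.
small = [(i % 3, i) for i in range(10)]
large = [(i % 3, i) for i in range(100_000)]

result, warmup_s, run_s = warm_then_time(group_sum, small, large)
print(f"warm-up: {warmup_s:.6f}s, measured run: {run_s:.6f}s")
```

Reporting the two numbers separately keeps the one-time cost visible instead of folding it into the measured run.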

nojito · on Dec 17, 2021

>The benchmarks speak volumes of dishonesty.

Not really. They are designed to showcase a common use case across multiple technologies.

The beauty of this benchmark is that there is a hardware limit included so that it forces you to create novel solutions to perform well.

>Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.

Not sure where you're getting that, but even on the second run Julia doesn't really compete with DT/Polars.

adgjlsfhk1 · on Dec 17, 2021

the benchmarks are a bit out of date (missing DataFrames 1.2/1.3, Julia 1.7, CSV 0.9). I'm planning on running an updated version this weekend.

1egg0myegg0 · on Dec 17, 2021

If you wouldn't mind, please update DuckDB as well!

adgjlsfhk1 · on Dec 17, 2021

Can you make a PR to https://github.com/oscardssmith/db-benchmark? I don't know DuckDB, so I don't know what the change would be.

throwawaybutwhy · on Dec 18, 2021

It's obvious that you're promoting duck eggs at the expense of, say, chicken eggs or quail eggs or even ostrich eggs. Maybe you could tone that down a bit.

prionassembly · on Dec 17, 2021

Julia doesn't really compete with anything, despite having some cool tech behind it.

It's like -- Julia is the Rory Gilmore of programming languages.

paulgb · on Dec 17, 2021

> considering that you compile once and run millions of times.

If you’re writing data pipelines then yes, but a lot of Pandas users use it interactively. As much as I’d rather use Julia, the last time I tried it I found myself waiting for computation far more often than with a Jupyter/Python workflow.

queuebert · on Dec 17, 2021

Give it another try. They've improved the first run times quite a bit over the last few versions. Package precompilation has gotten way better as well.

paulgb · on Dec 17, 2021

Glad to hear it, I will!

adgjlsfhk1 · on Dec 17, 2021

DataFrames 1.3 is a lot faster specifically.

rscho · on Dec 17, 2021

Maybe you should hop on the website of duckdb before commenting...