
The benchmarks speak volumes of dishonesty.

They sorted the results by speed of 1st run. For a language like Julia, which is JIT-compiled, that's not a fair comparison, considering that you compile once and run millions of times.

Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.

EDIT: Also, let's think critically about some of the entries there. Most of them are languages, but then you have things like Arrow, which is a data format, Spark, which is an engine, and ClickHouse and DuckDB, which are databases. The databases (and Spark) will have to read from disk. They have no chance of competing with anything that's reading from RAM, no matter how slow it is. They were built for different purposes. These are borderline meaningless comparisons.



> Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...).

Not true. If we ranked them by second run, Julia would be:

- On simple query: 1st, 1st, 4th, 1st, 5th (down 1).

- On advanced query: 3rd, 6th, 6th, 4th (up 1), - (out of memory).
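
To make the "rank by first run vs. second run" point concrete, here is a minimal sketch. The tool names and timings are placeholders invented for illustration (they are not taken from the benchmark), and `rank_by` is a hypothetical helper:

```python
def rank_by(results, run):
    """Return tool names sorted fastest-first by the given run key."""
    return sorted(results, key=lambda name: results[name][run])


# Placeholder timings in seconds, invented for illustration only.
results = {
    "tool_a": {"first": 3.0, "second": 0.5},
    "tool_b": {"first": 1.0, "second": 0.9},
    "tool_c": {"first": 2.0, "second": 0.7},
}

by_first = rank_by(results, "first")
by_second = rank_by(results, "second")
print("by first run: ", by_first)
print("by second run:", by_second)
```

With numbers like these the order flips between the two runs, which is why the choice of sort column matters when reading the charts.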

> The databases (and Spark) will have to read from disk. They have no chance of competing with anything that's reading from RAM, no matter how slow it is.

Not true. From a quick look at the benchmark code, ClickHouse and Spark use in-memory tables. I assume the other engines do too.


Note that Julia's compile times are not included in the benchmarks. If you had read the website, you'd have seen that the graphs show the first run (excluding compilation) and the second run (with a hot cache).

Also, in the second run, Julia is not the fastest. Julia would not be faster than Rust; it's got a garbage collector. This is what you see in the join benchmarks, which really push the allocator.

Next to that, the databases run in in-memory mode, so there is no disk overhead. Spark is slower because of the JVM plus row-wise data.


> Note that Julia's compile times are not included in the benchmarks. If you had read the website, you'd have seen that the graphs show the first run (excluding compilation) and the second run (with a hot cache).

Here's my view: the author of that page has commented here on HN; if my claim were as outrageously wrong as you say, he would've corrected it.


yeah, but your claim was "Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run"

notice this isn't even a language vs language benchmark. it's libraries and frameworks.

plus I don't think even the author of the julia library in question would agree with your statement: https://discourse.julialang.org/t/the-state-of-dataframes-jl...

as mentioned in that thread, GC and strings, or especially a combination of the two, can really hurt Julia performance. That's actually pretty surprising, since strings are often as important as numbers, if not more so, for a lot of data processing needs.

I'd also say in terms of compilation time, some autocaching layer outside of precompilation would do wonders.


> Julia would not be faster than Rust; it's got a garbage collector.

Having a garbage collector does not intrinsically make things slower. Especially so outside of the benchmarking microcosm.


that said, Julia currently has a slow GC, so it does hurt. GC performance is being worked on though. I have high hopes for the next year or two.


Agreed, and I was looking for an option to sort by second run.

One trick I've tried to some effect is to run the Julia code on a smaller data size first so the compilation gets done, and then repeat on the large one so it doesn't get interrupted by compilation. Not sure if this is a recommended approach. Benchmarking Julia is a pain for this reason: compilation always gets mixed up with runtime. But it hasn't prevented me from using it interactively. Pretty happy with it actually.
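
The warm-up trick above can be sketched roughly like this. It's written in Python only to show the shape of the harness; Python has no JIT, so the warm-up call matters far less here than it would in Julia. `time_runs`, `warm_then_time` and `group_sum` are hypothetical helpers, not part of any benchmark suite:

```python
import time


def time_runs(fn, data, runs=2):
    """Return wall-clock seconds for each of `runs` consecutive calls of fn(data)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(data)
        timings.append(time.perf_counter() - start)
    return timings


def warm_then_time(fn, small, large, runs=2):
    """Call fn once on small input (untimed warm-up), then time it on large input."""
    fn(small)  # warm-up: in a JIT-compiled language this triggers compilation
    return time_runs(fn, large, runs)


def group_sum(rows):
    """Toy group-by: sum the values for each key."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


small = [(i % 3, i) for i in range(100)]
large = [(i % 100, i) for i in range(200_000)]
timings = warm_then_time(group_sum, small, large)
print("timed runs on large input (s):", timings)
```

One caveat, as an assumption about how JITs generally behave: the warm-up input has to have the same types as the real data, otherwise the large run may still trigger fresh compilation.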


>The benchmarks speak volumes of dishonesty.

Not really. They are designed to showcase a common use case across multiple technologies.

The beauty of this benchmark is that there is a hardware limit included so that it forces you to create novel solutions to perform well.

>Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.

Not sure where you're getting that, but even on second run Julia doesn't really compete with DT/Polars.


the benchmarks are a bit out of date (missing DataFrames 1.2/1.3, Julia 1.7, CSV 0.9). I'm planning on running an updated version this weekend.


If you wouldn't mind, please update DuckDB as well!


Can you make a PR to https://github.com/oscardssmith/db-benchmark? I don't know DuckDB, so I don't know what the change would be.


It's obvious that you're promoting duck eggs at the expense of, say, chicken eggs or quail eggs or even ostrich eggs. Maybe you could tone that down a bit.


Julia doesn't really compete with anything, despite having some cool tech behind it.

It's like -- Julia is the Rory Gilmore of programming languages.


> considering that you compile once and run millions of times.

If you’re writing data pipelines then yes, but a lot of Pandas users use it interactively. As much as I’d rather use Julia, the last time I tried it I found myself waiting for computation far more often than with a Jupyter/Python workflow.


Give it another try. They've improved the first run times quite a bit over the last few versions. Package precompilation has gotten way better as well.


Glad to hear it, I will!


DataFrames 1.3 specifically is a lot faster.


Maybe you should hop on the website of duckdb before commenting...



