I can help! Emailing you now :) Our company (scratchdata.com, open source) is li...

fock · on May 8, 2024

Coming from an HPC background, I'm wondering quite a bit how 15GB files came to be considered large data. I'm no heavy parquet user, but:

- Does this decompress to giant sizes?
- Can't you split the file easily, because it includes row-based segments?
- Why does it take months to solve this for one file?

semi-extrinsic · on May 8, 2024

As a fellow HPC user, I tried a couple of years ago to do a tricky data format conversion using these newfangled tools. I was essentially just taking a huge (multi-terabyte) 3D dataset, transposing it and changing the endianness.

The solutions I was able to put together using Dask and Spark and such were all insanely slow; they just got killed by Slurm without getting anywhere. In the end I went back to good ol' shell scripting with xxd to handle most of the heavy lifting. Finished in under an hour.

The appeal of these newfangled tools is that you can work with data sizes that are infeasible to people who only know Excel, yet you don't need to understand a single thing about how your data is actually stored.

If you can be bothered to read the file format specification, open up some files in a hex editor to understand the layout, and write low-level code to parse the data - then you can achieve several orders of magnitude higher performance.
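
To make the low-level approach concrete, here is a minimal sketch of just the byte-order half of such a conversion, using only the Python standard library. It is not the xxd pipeline described above: the function name `swap_endianness_stream`, the 4-byte float item type, and the chunk size are all assumptions for illustration, and the transposition step is not shown.

```python
import array
import io
import struct

def swap_endianness_stream(src, dst, typecode="f", chunk_items=1 << 20):
    """Copy src to dst, reversing the byte order of each item, chunk by chunk.

    Hypothetical helper: assumes src holds a whole number of fixed-size items.
    """
    itemsize = array.array(typecode).itemsize
    while True:
        raw = src.read(chunk_items * itemsize)
        if not raw:
            break
        values = array.array(typecode)
        values.frombytes(raw)  # raises ValueError on a partial trailing item
        values.byteswap()      # in-place byte-order reversal of every item
        dst.write(values.tobytes())

# Usage: four big-endian float32 values become little-endian.
big_endian = struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)
out = io.BytesIO()
swap_endianness_stream(io.BytesIO(big_endian), out, chunk_items=2)
converted = struct.unpack("<4f", out.getvalue())
```

Because only one chunk is held in memory at a time, the same loop would work on files opened in binary mode that are far larger than RAM.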

fifilura · on May 9, 2024

I think command line tools are going to be fine if all you do is process one row at a time. Or if your data has a known order.

But if you want to do some kind of grouping or for example pivoting rows to columns, I think you will still benefit from a distributed tool like Spark or Trino. That can do the map/reduce job for you in a distributed way.
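
The map/reduce pattern behind such a grouping can be sketched in a single process. This is only an illustration of the idea, not how Spark or Trino implement it; the function names `map_partition` and `reduce_partials` and the sample partitions are hypothetical.

```python
from collections import defaultdict

def map_partition(rows):
    """Map step: aggregate one partition of (key, value) rows locally."""
    partial = defaultdict(float)
    for key, value in rows:
        partial[key] += value
    return dict(partial)

def reduce_partials(partials):
    """Reduce step: merge the per-partition sums into one result."""
    merged = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)

# Usage: two partitions that could live on different workers.
partitions = [
    [("a", 1.0), ("b", 2.0)],
    [("a", 3.0), ("c", 5.0)],
]
totals = reduce_partials(map_partition(p) for p in partitions)
```

Each partition is reduced to a small dictionary before anything is merged, which is why the work can be spread across machines when the keys arrive in no particular order.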

memset · on May 9, 2024

Because most people don’t have an HPC background, aren’t familiar with parquet internals, don’t know how to make their language stream data instead of buffering it all in memory, have slow internet connections at home, are running out of disk space on their laptops, and only have 4 GB of ram to work with after Chrome and Slack take up the other 12 GB.

15 GB is a real drag to do anything with. So it’s a real pain when someone says “I’ll just give you 1 TB worth of parquet in S3”, the equivalent of dropping a billion dollars on someone’s doorstep in $1 bills.
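
On the point about streaming rather than buffering, a minimal sketch of the difference in Python follows. It uses CSV instead of parquet so that it runs with the standard library alone; the function name `column_sum` and the sample data are assumptions for illustration.

```python
import csv
import io

def column_sum(lines, column):
    """Sum one numeric column while holding only one row in memory at a time."""
    total = 0.0
    for row in csv.DictReader(lines):  # iterates lazily over the input
        total += float(row[column])
    return total

# Usage: an in-memory stand-in for a file that would be too large to load whole.
sample = io.StringIO("id,amount\n1,2.5\n2,4.0\n")
result = column_sum(sample, "amount")
```

With a real file, passing `open(path, newline="")` in place of the `StringIO` object keeps memory use flat regardless of file size.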

vladsanchez · on May 9, 2024

Funny analogy! I loved it. I'm ready to start with ScratchData, which, btw and respectfully, I had never heard of.

Thanks again for sharing your tool and insightful knowledge.

fifilura · on May 9, 2024

How do you see the competition from Trino and Athena in your case?

Depends a lot on what you want to do with the data of course, but if you want to filter and slice/dice it, my experience is that it is really fast and stable. And if you already have it on s3, the threshold for using it is extremely small.

fock · on May 9, 2024

What is your point? They talked about 15GB of parquet - what does this have to do with 1TB of parquet?

Also: How does the tool you sell here solve the problem - the data is already there and can't be processed (15GB - funny that seems to be the scale of YC startups?)? How does a tool to transfer the data into a new database help here?

wodenokoto · on May 10, 2024

> How does a tool to transfer the data into a new database help here?

Maybe because the problem literally is "how to transfer this data into a database"

jjtheblunt · on May 9, 2024

Parquet is column-oriented, so row-based manipulation can be inefficient.

fock · on May 9, 2024

> Hierarchically, a file consists of one or more row groups.

https://parquet.apache.org/docs/concepts/

Maybe the file in question only has one row group. Which would be weird, because the creator had to go out of their way to make it happen.
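
For anyone who wants to check this on a given file: per the Parquet format specification, the row-group list lives in the file metadata at the end of the file, which is followed by a 4-byte little-endian length and the magic bytes `PAR1`. The sketch below only locates that footer for an unencrypted file; decoding the Thrift-encoded metadata to count row groups is left to a Parquet library. The function name `footer_length` and the fake file contents are hypothetical.

```python
import io
import struct

MAGIC = b"PAR1"

def footer_length(f):
    """Return the byte length of the footer metadata of an unencrypted Parquet file."""
    f.seek(-8, io.SEEK_END)  # last 8 bytes: 4-byte length + 4-byte magic
    tail = f.read(8)
    if len(tail) != 8 or tail[4:] != MAGIC:
        raise ValueError("not a Parquet file (missing trailing PAR1 magic)")
    (length,) = struct.unpack("<I", tail[:4])
    return length

# Usage on a fake file: the body and metadata bytes are placeholders, not real Parquet.
fake = io.BytesIO(MAGIC + b"x" * 10 + b"m" * 5 + struct.pack("<I", 5) + MAGIC)
length = footer_length(fake)
```

Reading only the footer this way costs a few bytes of I/O, so the layout of a 15GB file can be inspected without reading the data pages.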

jjtheblunt · on May 9, 2024

Yep. I use it all the time. But, as you said, it depends on the specific layout, so you can't expect it to be row-convenient.