It's fairly surprising to me how naive/early the techniques we use here still are.
Anthropic's post on the Claude Agent SDK (formerly Claude Code SDK) talks about how the agent "gathers context", and is fairly accurate as to how people do it today.
1. Agentic Search (give the agent tools and let it run its own search trajectory): specifically, the industry seems to have made really strong advances towards giving the agents POSIX filesystems and UNIX utilities (grep/sed/awk/jq/head etc) for navigating data. MCP for data retrieval also falls into this category, since the agent can choose to invoke tools to hit MCP servers for required data. But because coding agents know filesystems really well, it seems like that is outperforming everything else today ("bash is all you need").
2. Semantic Search (essentially chunking + embedding, a la RAG in 2022/2023): I've definitely noticed a growing trend amongst leading AI companies to move away from this. Especially if your data is easily represented as a filesystem, (1) seems to be the winning approach.
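As a sketch of what approach (1) looks like under the hood (the names here are hypothetical, not any particular SDK's API), here's a tool that mimics `grep -rn` over a directory tree, capped so the results fit in a context window:

```python
# Hypothetical sketch of approach (1): expose a grep-like tool that an agent
# can call over a plain directory tree. `grep_tool` is a made-up name for
# illustration, not a real SDK function.
import re
from pathlib import Path

def grep_tool(root: str, pattern: str, max_hits: int = 20) -> list[str]:
    """Return 'path:lineno:line' hits, like `grep -rn`, capped at max_hits."""
    hits = []
    rx = re.compile(pattern)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary/unreadable files, as grep-like tools do
        for lineno, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append(f"{path}:{lineno}:{line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits
```

The agent decides what to search for and iterates on its own trajectory; the tool just gives it the same filesystem affordances a human would use.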
Interestingly though, both approaches share a pretty glaring flaw: they only hand the agent raw, unprocessed data. That means a ton of recomputation on the same raw data! An agent that has already sifted through it once (say, reading v1, v2 and v_final of a design document) has to do the same work all over again in the next session.
I have a strong thesis that this will change in 2026 (Knowledge Curation, not search, is the next data problem for AI) https://www.daft.ai/blog/knowledge-curation-not-search-is-th... and we're building towards this future as well. Related ideas with anecdotal evidence of benefits, but which haven't really stuck in practice yet, include: agentic memory, processing agent trajectory logs, continuous learning, persistent note-taking, etc.
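One hedged sketch of what curation could look like: cache whatever an agent distilled from a file, keyed by content hash, so the next session reuses the notes instead of re-reading the raw data. The cache layout and the `distill` hook are made up for illustration:

```python
# Hypothetical sketch of knowledge curation: persist distilled notes keyed by
# content hash so later sessions skip recomputation on unchanged files.
# `curate` and `distill` are illustrative names, not a real API.
import hashlib
import json
from pathlib import Path

def curate(path: Path, cache: Path, distill) -> str:
    """Return distilled notes for `path`, recomputing only when content changes."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    entry = cache / f"{digest}.json"
    if entry.exists():  # a previous session already processed this exact content
        return json.loads(entry.read_text())["notes"]
    notes = distill(path.read_text())  # e.g. an LLM call that summarizes the doc
    entry.write_text(json.dumps({"source": str(path), "notes": notes}))
    return notes
```

Content-hash keying means an edited v2 of the document naturally invalidates the v1 notes, while unchanged files never get reprocessed.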
Check out Daft (www.getdaft.io) - we've been working really hard on our Iceberg support. It supports full reads/writes (including partitioned writes), and our SQL support is also coming along quite well!
Also: no cluster, no JVM. Just `pip install daft` and go. It runs locally (as fast as DuckDB for a lot of workloads; faster if your data lives in S3) and also runs distributed if you have a Ray cluster to point it at.
Thanks for the feedback on marketing! Daft is indeed distributed using Ray, but to do so involves Daft being architected very carefully for distributed computing (e.g. using map/reduce paradigms).
Ray fills an almost Kubernetes-like role for us in terms of orchestration/scheduling (admittedly it does quite a bit more as well, especially in the area of data movement). But yes, the technologies are very complementary!
Just dug through the datachain codebase to understand a little more. I think while both projects have a Dataframe interface, they're very different projects!
Datachain seems to operate more on the orchestration layer, running Python libraries such as PIL and requests (for making API calls) and relying on an external database engine (SQLite or BigQuery/Clickhouse) for the actual compute.
Daft is an actual data engine. Essentially, it's "multimodal BigQuery/Clickhouse". We've built out a lot of our own data system functionality such as custom Rust-defined multimodal data structures, kernels to work on multimodal types, a query optimizer, distributed joins etc.
In non-technical terms, I think this means that Datachain really is more of a "dbt" which orchestrates compute over an existing engine, whereas Daft is the actual compute/data engine that runs the workload. A project such as Datachain could actually run on top of Daft, which can handle the compute and I/O operations necessary to execute the requested workload.
I work on Daft and we’ve been collaborating with the team at Amazon to make this happen for about a year now!
We love Ray, and are excited about the awesome ecosystem of useful + scalable tools that run on it for model training and serving. We hope that Daft can complement the rest of the Ray ecosystem to enable large scale ETL/analytics to also run on your existing Ray clusters. If you have an existing Ray cluster setup, you absolutely should have access to best-in-class ETL/analytics without having to run a separate Spark cluster.
Also, on the nerdier side of things - the primitives that Ray provides give us a real opportunity to build a solid non-JVM-based, vectorized distributed query engine. We're already seeing extremely good performance improvements here vs Spark, and are really excited about some of the upcoming work to get even better performance and memory stability.
This collaboration with Amazon really battle-tested our framework :) happy to answer any questions if folks have them.
Good to see you here! It's been great working with Daft to further improve data processing on Ray, and the early results of incorporating Daft into the compactor have been very impressive. Also agree with the overall sentiment here that Ray clusters should be able to run best-in-class ETL without requiring a separate cluster maintained by another framework (Spark or otherwise). This also creates an opportunity to avoid many inefficient, high-latency cross-cluster data exchange ops often run out of necessity today (e.g., through an intermediate cloud storage layer like S3).
There’s a lot of interesting work happening in this area (see: XTable).
We are building a Python distributed query engine, and share a lot of the same frustrations… in fact, until quite recently most of the table formats only had JVM client libraries, so integrating them natively with Daft was really difficult.
We finally managed to get read integrations across Iceberg/DeltaLake/Hudi recently as all 3 now have Python/Rust-facing APIs. Funny enough, the only non-JVM implementation of Hudi was contributed by the Hudi team and currently still lives in our repo :D (https://github.com/Eventual-Inc/Daft/tree/main/daft/hudi/pyh...)
These libraries still lag behind their JVM counterparts though, so it's going to be a while before we see support for the full featureset of each table format. But we're definitely seeing a large appetite for working with table formats outside of the JVM ecosystem (e.g. in Python and Rust).
Interesting. Daft currently does validation on types/names only at runtime. The flow looks like:
1. Construct a dataframe (performs schema inference)
2. Access (now well-typed) columns and operations on those columns in the dataframe, with associated validations.
Unfortunately step (1) can only happen at runtime and not at type-checking-time since it requires running some schema inference logic, and step (2) relies on step (1) because the expressions of computation are "resolved" against those inferred types.
However, if we can fix (1) to happen at type-checking time using user-provided type-hints in place of the schema inference, we can maybe figure out a way to propagate this information through to mypy.
Would love to continue the discussion further as an Issue/Discussion on our GitHub!