There is no reason to believe that the foods humans have historically eaten are safer/healthier than "industrially processed/extracted/refined" food simply because we have historically eaten them. Evolution does not select for avoiding the health problems facing modern-day humans such as cancer or heart disease.
Uhh I don't think that financial incentives are a valid reason to believe something is healthier or safer than an alternative. Unless I have missed some sarcasm.
I mean there is a financial incentive to use otherwise-wasted byproducts of industrial processes as food ingredients, and since there is no requirement in the US to rigorously show that new ingredients are safe to consume, this happens all the time and makes up a big portion of the average modern US diet.
But the list of allegedly questionable foods above are all foods we already eat, just with some things removed (e.g., avocado oil is just avocado with the flesh removed; pea protein is peas with the carbs removed). It is not obvious to me how you would conclude these are unhealthy.
I'm not saying they're healthier simply because we've historically eaten them.
But there are many reasons to believe natural/traditional foods may be safer and healthier than new industrial foods. To name a few:
1) There's reason to believe our bodies may be more adapted to eating natural or traditional foods, having eaten them for hundreds of thousands of years rather than one or two generations.
2) Many highly processed foods have been found, within decades of their introduction to our diet, to be really bad for us. Refined sugars, refined oils, refined flours, artificial sweeteners, many of the weird additives, many synthetic compounds like methylcellulose (someone close to me is extremely sensitive to this one), on and on.
3) These new ingredients, new kinds of refining and processing, and even synthetic food compounds do not have to undergo rigorous testing to show they're safe before being added to food. Even when some studies are done for some of them, how would you really know an ingredient isn't causing serious long-term problems for, say, 1% of people? Or even 10%? A study with the size and duration needed to establish safety would be expensive, and manufacturers generally don't run one, since they're not required to.
4) These new ingredients often introduce novel molecules to the body that the body may not be adapted to. I hope I don't need to explain how many novel molecules that were invented and widely used in recent decades have proved to be highly toxic.
5) We have a huge increase in severe chronic disease in recent decades. I won't claim here that this is primarily because of the changes to our diet from industrially processed foods, but diet is a top contender given that it's one of the biggest things that has changed in the human lifestyle, along with all the other novel substances our bodies come in contact with now.
6) We know of tons of people who were healthy to age 80, 90, 100, eating primarily/entirely natural foods. We don't yet have any examples of this with people eating a large portion of modern industrial foods that didn't exist 80 years ago. This is not proof that they're dangerous, I'm just saying we don't know and have reason to be cautious.
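To make the study-size point in (3) concrete: detecting a 1-percentage-point rise in the incidence of some condition (from a hypothetical 5% baseline to 6%) with a standard two-proportion z-test approximation already requires thousands of participants per arm. The numbers below are illustrative, not from any real trial:

```python
import math

# Hedged sketch: approximate per-group sample size for a two-sided
# two-proportion z-test (normal approximation), alpha = 0.05, 80% power.
def sample_size_two_proportions(p1, p2):
    z_alpha = 1.96    # critical value for two-sided 5% significance
    z_beta = 0.8416   # critical value for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Detecting a bump from 5% to 6% incidence: roughly 8,000 per arm.
print(sample_size_two_proportions(0.05, 0.06))
```

And that's for a single endpoint over the study's duration; effects that take decades to show up push the required cost far higher.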
> There's reason to believe our bodies may be more adapted to eating natural or traditional foods, having eaten them for hundreds of thousands of years rather than one or two generations.
This is an argument that no white people should be eating pineapples, mangos, bananas, kiwifruit, etc. Hell, probably not even apples.
Yes; you can phone it in post-tenure. But just because it is possible doesn't mean (in my experience) it is common; and I don't think it's helpful (as TFA claims) to equate this possibility with "a total scam." To get tenure anywhere doesn't just require a huge amount of work as an Assistant Professor; it also requires a huge amount of work as a PhD student and potentially multiple rounds of post-doc'ing or other non-tenure-line work. In my experience, tenured professors have spent nearly two decades distorting their work-life balance beyond all recognition to the point that grinding insanely hard in pursuit of publications just feels normal.
Worth noting that the filtering implementation is quite restrictive if you want to avoid post-filtering: filters must be expressible as discrete smallints (ruling out continuous variables like timestamps or high cardinality filters like ids); filters must always be denormalized onto the table you're indexing (no filtering on attributes of parent documents, for example); and filters must be declared at index creation time (lots of time spent on expensive index builds if you want to add filters). Personally I would consider these caveats pretty big deal-breakers if the intent is scale and you do a lot of filtering.
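For contrast, post-filtering (the fallback those restrictions push you toward) looks roughly like this sketch; `ann_search`, the over-fetch factor, and the toy documents are all hypothetical:

```python
# Hypothetical sketch of post-filtering: over-fetch ANN candidates,
# then drop the ones failing predicates the index couldn't express
# (timestamps, high-cardinality ids, parent-document attributes).
def post_filter_search(ann_search, predicate, k, overfetch=4):
    """ann_search(n) returns the top-n candidates by similarity.
    May still return fewer than k results if the filter is selective."""
    candidates = ann_search(k * overfetch)
    return [c for c in candidates if predicate(c)][:k]

# Toy stand-in for an ANN index: candidates pre-sorted by similarity.
docs = [{"id": i, "ts": 1700000000 + i} for i in range(100)]
hits = post_filter_search(lambda n: docs[:n],
                          lambda d: d["ts"] % 2 == 0, k=5)
print([d["id"] for d in hits])
```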
> Most of the time you don't need a different Python version from the system one.
Except for literally anytime you’re collaborating with anyone, ever? I can’t even begin to imagine working on a project where folks just use whatever python version their OS happens to ship with. Do you also just ship the latest version of whatever container because most of the time nothing has changed?
If you're writing Python tools to support OS operations in prod, you need to target the system Python. It's wildly impractical to deploy venvs for more than one or two apps, especially if they're relatively small. Developing in a local venv can help with that targeting, but there's no substitute for doing that directly on the OS you're deploying to.
This is why you DON'T write system tools in Python in the first place. Use a real language that compiles to a native self-contained binary that doesn't need dependency installing. Or use a container. This has been a solved problem for decades. Python users have been trying to drag the entire computing world backwards this whole time because of their insistence on using a toy language, invented to be the JavaScript of the server, as an actual production-grade bare-metal system language.
I don't really understand the point around error handling. Sure, with structured outputs you need to be explicit about what errors you're handling and how you're handling them. But if you ask the model to return pure text, you now have a universe of possible errors that you still need to handle explicitly (you're using structured outputs, so your LLM response is presumably being consumed programmatically?), including a whole bunch of new errors that structured outputs help you avoid.
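A minimal sketch of what handling even the simplest slice of that universe looks like, assuming a hypothetical LLM reply that is supposed to be JSON with an `answer` field:

```python
import json

# Hedged sketch: validating a (hypothetical) LLM response consumed
# programmatically. With free text you must defend against every
# failure mode yourself; structured outputs narrow this universe.
def parse_llm_reply(raw: str) -> dict:
    try:
        data = json.loads(raw)            # may not be JSON at all
    except json.JSONDecodeError as e:
        raise ValueError(f"not JSON: {e}") from e
    if not isinstance(data, dict) or "answer" not in data:
        raise ValueError("missing required 'answer' field")
    return data

print(parse_llm_reply('{"answer": "42"}')["answer"])  # prints 42
```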
Also, meta gripe: this article felt like a total bait-and-switch in that it only became clear that it was promoting a product right at the end.
In my experience the semantic/lexical search problem is better understood as a precision/recall tradeoff. Lexical search (along with boolean operators, exact phrase matching, etc.) has very high precision at the expense of lower recall, whereas semantic search sits at a higher recall/lower precision point on the curve.
Yeah, that sounds about right to me. The most effective approach does appear to be a hybrid of embeddings and BM25, which is worth exploring if you have the capacity to do so.
For most cases though sticking with BM25 is likely to be "good enough" and a whole lot cheaper to build and run.
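For reference, the hybrid can be as simple as reciprocal rank fusion (RRF) over the two ranked lists; the doc ids and the conventional `k = 60` constant below are illustrative:

```python
# Minimal reciprocal rank fusion (RRF): a common way to combine a
# BM25 ranking with an embedding ranking without tuning score scales.
def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists. Returns fused order."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2", "d3"]
semantic = ["d3", "d1", "d4"]
print(rrf([bm25, semantic]))  # docs ranked by both lists rise to the top
```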
Depends on the app and how often you need to change your embeddings, but I run my own hybrid semantic/bm25 search on my MacBook Pro across millions of documents without too much trouble.
Doesn't this depend on your data to a large extent? In a very dense graph "far" results (in terms of the effort spent searching) that match the filters might actually be quite similar?
The "far" here means "with vectors having a very low cosine similarity / very high distance". So in vector use cases where you want near vectors matching a given set of filters, far vectors matching those filters are useless. For this reason, Redis Vector Sets have another "EF" (effort) parameter just for filters, and you can decide, in case not enough results have been collected so far, how much effort you want to spend. If you want to scan the whole graph, that's fine, but by default Redis will do the sane thing and stop early once the remaining vectors are already far anyway.
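A rough sketch of that early-stopping idea (names and thresholds are illustrative, not the actual Redis implementation):

```python
# Keep consuming graph neighbors in ascending distance order, skipping
# nodes that fail the filter, but stop once a fixed "filter effort"
# budget is spent or the candidates are already too far to be useful.
def filtered_search(neighbors_by_dist, passes_filter, k,
                    filter_ef=100, max_dist=0.5):
    results, effort = [], 0
    for dist, node in neighbors_by_dist:   # ascending distance
        if effort >= filter_ef or dist > max_dist:
            break                          # early stop: budget spent,
        effort += 1                        # or vectors already too far
        if passes_filter(node):
            results.append(node)
            if len(results) == k:
                break
    return results

cands = [(0.1, "a"), (0.2, "b"), (0.3, "c"), (0.9, "d")]
print(filtered_search(cands, lambda n: n != "b", k=2))  # ['a', 'c']
```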
HNSW indices are big. Let's suppose I have an HNSW index which fits in a few hundred gigabytes of memory, or perhaps a few terabytes. How do I reasonably rebuild this using maintenance_work_mem? Double the size of my database for a week? What about the knock-on impacts on the performance for the rest of my database-stuff - presumably I'm relying on this memory for shared_buffers and caching? This seems like the type of workload that is being discussed here, not a toy 20GB index or something.
> You use REINDEX CONCURRENTLY.
Even with a bunch of worker processes, how do I do this within a reasonable timeframe?
> How do you think a B+tree gets updated?
Sure, the computational complexity of insertion into an HNSW index is sublinear, but the constant factors are significant and do actually add up. That being said, I do find this the weakest of the author's arguments.
Interested to hear more about your experience here. At Halcyon, we have trillions of embeddings and found Postgres to be unsuitable at several orders of magnitude less than we currently have.
On the iterative scan side, how do you prevent this from becoming too computationally intensive with a restrictive pre-filter, or simply not working at all? We use Vespa, which means effectively doing a map-reduce across all of our nodes; the effective number of graph traversals to do is smaller, and the computational burden mostly involves scanning posting lists on a per-node basis. I imagine to do something similar in postgres, you'd need sharded tables, and complicated application logic to control what you're actually searching.
How do you deal with re-indexing and/or denormalizing metadata for filtering? Do you simply accept that it'll take hours or days?
I agree with you, however, that vector databases are not a panacea (although they do remove a huge amount of devops work, which is worth a lot!). Vespa supports filtering across parent-child relationships (like a relational database) which means we don't have to reindex a trillion things every time we want to add a new type of filter, which with a previous vector database vendor we used took us almost a week.
We host thousands of forums but each one has its own database, which means we get a sort of free sharding of the data where each instance has less than a million topics on average.
I can totally see that at a trillion scale for a single shard you want a specialized dedicated service, but that is also true for most things in tech when you get to the extreme scale .
Thanks for the reply! This makes much more sense now. To preface, I think pgvector is incredibly awesome software, and I have to give huge kudos to the folks working on it. Super cool. That being said, I do think the author isn't being unreasonable in that the limitations of pgvector are very real when you're talking about indices that grow beyond millions of things, and the "just use pgvector" crowd in general doesn't have a lot of experience with scaling things beyond toy examples. Folks should take a hard look at what size they expect their indices to grow to in the near-to-medium-term future.
Ok, but what was the cost of labor put into curation of the training dataset and performing the fine-tuning? Hasn’t the paper’s conclusion been repeatedly demonstrated - that it is possible to get really good task-specific performance out of fine-tuned smaller models? There just remains the massive caveat that closed-source models are pretty cheap and so the ROI isn’t there in a lot of cases.