Hacker Timesnew | past | comments | ask | show | jobs | submitlogin

Any_Value is a strange addition. I'm struggling to think of a scenario where I need a value from a group of records, but don't care which one.

Why not just use min, max or even row_number to specify an order? If the value isn't important, do you really need to present it?



Here a common example:

Given the following tables:

  CREATE TABLE person (
    id INT PRIMARY KEY,
    name VARCHAR NOT NULL
  );

  CREATE TABLE pet (
    id INT PRIMARY KEY,
    name VARCHAR NOT NULL,
    owner_id INT NOT NULL
  );
It is common to want to join them, like so:

  SELECT owner.id, owner.name, COUNT(pet.id) AS numberOfPets 
  FROM person AS owner
  LEFT OUTER JOIN pet ON owner.id = pet.owner_id
  GROUP BY owner.id
This doesn't work in standard SQL, because all columns in the SELECT list have to either be aggregated or included in the GROUP BY. owner.name is neither. That is a bit silly though because we know each result row will only have one unambiguous value for the owner name, since the GROUP BY is on a unique column from the same table as the owner name.

We can solve this with ANY_VALUE:

  SELECT owner.id, ANY_VALUE(owner.name) AS name, COUNT(pet.id) AS numberOfPets 
  FROM person AS owner
  LEFT OUTER JOIN pet ON owner.id = pet.owner_id
  GROUP BY owner.id


I run into the equivalent problem in Kotlin sometimes. I want to group things by a key, but that results in type `Map<K, List<V>>` because the compiler can't be sure that there won't be more than one `V` for the field being used for grouping... which is why there's also `associateBy` which does return `Map<K, V>`, but will throw if there happens to be more than one value for each K.

Not relevant, but I thought it may help other readers understand the problem.


The common solution in this case is simply to also GROUP BY owner.name.

That said, if owner.name has different values and you didn’t actually care which one you’d get ANY_VALUE could maybe be useful.

Though in that case you could also just use MIN or MAX, which already exist.


Yes, but all these solutions require the optimizer to be smart enough to be smart enough to optimize them away: "oh, owner.id is the primary key, so any value from owner will appear at most once, I don't have to group by them or track the maximum value, I can pick and store any value".

I think all optimizers are smart enough already, as this is a super common pattern. I honestly would have preferred this logic to be made more explicit: if you group by the primary key (or by the table name, as someone suggested), you can use all columns from the table without aggregate functions.

This kinda breaks down when you introduce subqueries and CTEs, which is where myopic optimizers can screw the pooch as well, so any_value can be of use there.


We use it to select any value from an aggregated list of values which we know are identical. It requires fewer operations than e.g. max.


What I'd really want in that example is a function that errored if they weren't all the same.


Any circumstance where it would be needed should have given a syntax error irrespective of the data present (all matching or with differences) – everything in the projection list needing to be in the grouping clause or an aggregate.

For a very simple but slightly odd (why would you ask exactly that?) example:

> SELECT FirstName, WorkEmail FROM Person GROUP BY WorkEmail

should give an error in current SQL implementations. Despite the fact that in the data in question there is no ambiguity as the email addresses are never the same between two people, the database has no way of knowing that. The error message in SQL Server in this case is “Column 'Person.FirstName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause”. In this case you could declare WorkEmail to be unique, but that should still error as the standard doesn't take this into account. The new ANY_VALUE us counted as an aggregate so “fixes” the syntax error like so:

> SELECT ANY_VALUE(FirstName), WorkEmail FROM Person GROUP BY WorkEmail

Basically it is a way of telling the query planner that you know your data better than it does, and it shouldn't worry its little head and take the first thing it sees to project for that column.

mySQL actually does this by default, or at least it used to, using the first thing it sees for columns that are neither aggregates nor part of the grouping clause, and you'll find any number of people asking why this doesn't work in other databases (SQL Server, postgres, …) when they try porting things over.

--

ANY_VALUE() feels icky to me, similar to how the related mysql behaviour always did. It is explicitly giving an opening for the output to change arbitrarily for a given query and that makes me think it is going to be the source of difficult to find bugs, for instance when some code relies on a particular output but changes in data balance hit a tipping point and make the query planner take a different route meaning ANY_VALUE() returns something other than it has in that place for months/years.

I can think of examples where it could be useful or at least makes things no worse (you need to get a little more complex than my single table example above), i.e. where trying to disambiguate with another aggregate like MAX still gives effectively arbitrary results, trying with ROW_NUMBER() tricks adds complexity (both for the human reading the code and the query planner), and adding the column to the GROUP BY would cause duplicate rows otherwise, but I'd rather not have it because of the issues that it could easily create in the wrong hands.


It's actually super useful if you've got a table with two columns like "sku" and "product_name", where the value in both columns has one unique value from the other column paired with it. You can group by sku, and then grab any value for product_name, since they're all the same. Using min/max/etc involves extra computations that you don't need to do.

You could group by the combination of sku+product_name, but it's kind of unwieldy. You could also normalize it to not have this issue, but when you're getting this data from someone else, there's not much you can do.


I guess SQL is so widely used that there are always some niche cases. I guess it is useful if you just want to show one example value where there is no numerical order, and you want fetching this value to be superfast.

Like a movie database, and you want to show an example title for every genre.

But in a real-world scenario, I would want a popular movie, not literally any. If you have an Academy Award winners table, I guess, then everyone is popular.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: