Any_Value is a strange addition. I'm struggling to think of a scenario where I n...

Sonata · on April 14, 2023

Here a common example:

Given the following tables:

  CREATE TABLE person (
    id INT PRIMARY KEY,
    name VARCHAR NOT NULL
  );

  CREATE TABLE pet (
    id INT PRIMARY KEY,
    name VARCHAR NOT NULL,
    owner_id INT NOT NULL
  );

It is common to want to join them, like so:

  SELECT owner.id, owner.name, COUNT(pet.id) AS numberOfPets 
  FROM person AS owner
  LEFT OUTER JOIN pet ON owner.id = pet.owner_id
  GROUP BY owner.id

This doesn't work in standard SQL, because all columns in the SELECT list have to either be aggregated or included in the GROUP BY. owner.name is neither. That is a bit silly though because we know each result row will only have one unambiguous value for the owner name, since the GROUP BY is on a unique column from the same table as the owner name.

We can solve this with ANY_VALUE:

  SELECT owner.id, ANY_VALUE(owner.name) AS name, COUNT(pet.id) AS numberOfPets 
  FROM person AS owner
  LEFT OUTER JOIN pet ON owner.id = pet.owner_id
  GROUP BY owner.id

brabel · on April 14, 2023

I run into the equivalent problem in Kotlin sometimes. I want to group things by a key, but that results in type `Map<K, List<V>>` because the compiler can't be sure that there won't be more than one `V` for the field being used for grouping... which is why there's also `associateBy` which does return `Map<K, V>`, but will throw if there happens to be more than one value for each K.

Not relevant, but I thought it may help other readers understand the problem.

mtts · on April 14, 2023

The common solution in this case is simply to also GROUP BY owner.name.

That said, if owner.name has different values and you didn’t actually care which one you’d get ANY_VALUE could maybe be useful.

Though in that case you could also just use MIN or MAX, which already exist.

orthoxerox · on April 15, 2023

Yes, but all these solutions require the optimizer to be smart enough to be smart enough to optimize them away: "oh, owner.id is the primary key, so any value from owner will appear at most once, I don't have to group by them or track the maximum value, I can pick and store any value".

I think all optimizers are smart enough already, as this is a super common pattern. I honestly would have preferred this logic to be made more explicit: if you group by the primary key (or by the table name, as someone suggested), you can use all columns from the table without aggregate functions.

This kinda breaks down when you introduce subqueries and CTEs, which is where myopic optimizers can screw the pooch as well, so any_value can be of use there.

tomtom1337 · on April 14, 2023

We use it to select any value from an aggregated list of values which we know are identical. It requires fewer operations than e.g. max.

OscarCunningham · on April 14, 2023

What I'd really want in that example is a function that errored if they weren't all the same.

dspillett · on April 14, 2023

Any circumstance where it would be needed should have given a syntax error irrespective of the data present (all matching or with differences) – everything in the projection list needing to be in the grouping clause or an aggregate.

For a very simple but slightly odd (why would you ask exactly that?) example:

> SELECT FirstName, WorkEmail FROM Person GROUP BY WorkEmail

should give an error in current SQL implementations. Despite the fact that in the data in question there is no ambiguity as the email addresses are never the same between two people, the database has no way of knowing that. The error message in SQL Server in this case is “Column 'Person.FirstName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause”. In this case you could declare WorkEmail to be unique, but that should still error as the standard doesn't take this into account. The new ANY_VALUE us counted as an aggregate so “fixes” the syntax error like so:

> SELECT ANY_VALUE(FirstName), WorkEmail FROM Person GROUP BY WorkEmail

Basically it is a way of telling the query planner that you know your data better than it does, and it shouldn't worry its little head and take the first thing it sees to project for that column.

mySQL actually does this by default, or at least it used to, using the first thing it sees for columns that are neither aggregates nor part of the grouping clause, and you'll find any number of people asking why this doesn't work in other databases (SQL Server, postgres, …) when they try porting things over.

--

ANY_VALUE() feels icky to me, similar to how the related mysql behaviour always did. It is explicitly giving an opening for the output to change arbitrarily for a given query and that makes me think it is going to be the source of difficult to find bugs, for instance when some code relies on a particular output but changes in data balance hit a tipping point and make the query planner take a different route meaning ANY_VALUE() returns something other than it has in that place for months/years.

I can think of examples where it could be useful or at least makes things no worse (you need to get a little more complex than my single table example above), i.e. where trying to disambiguate with another aggregate like MAX still gives effectively arbitrary results, trying with ROW_NUMBER() tricks adds complexity (both for the human reading the code and the query planner), and adding the column to the GROUP BY would cause duplicate rows otherwise, but I'd rather not have it because of the issues that it could easily create in the wrong hands.

EnergyAmy · on April 14, 2023

It's actually super useful if you've got a table with two columns like "sku" and "product_name", where the value in both columns has one unique value from the other column paired with it. You can group by sku, and then grab any value for product_name, since they're all the same. Using min/max/etc involves extra computations that you don't need to do.

You could group by the combination of sku+product_name, but it's kind of unwieldy. You could also normalize it to not have this issue, but when you're getting this data from someone else, there's not much you can do.

phartenfeller · on April 14, 2023

I guess SQL is so widely used that there are always some niche cases. I guess it is useful if you just want to show one example value where there is no numerical order, and you want fetching this value to be superfast.

Like a movie database, and you want to show an example title for every genre.

But in a real-world scenario, I would want a popular movie, not literally any. If you have an Academy Award winners table, I guess, then everyone is popular.