Gcam's comments

As part of our benchmarking of Groq, we asked Groq about quantization and they assured us they are running models at full FP16. It's a good point and important to check.

Link to benchmarking: https://artificialanalysis.ai/ (Note: the question was regarding the API rather than their chat demo)


Groq's API comes close to this level of performance as well. We've benchmarked performance over time and >400 tokens/s has been sustained - you can see here https://artificialanalysis.ai/models/mixtral-8x7b-instruct (bottom of page for the over-time view)


Hi, we have this if you take a look at the models page (https://artificialanalysis.ai/models) and scroll down to 'Latency', and also on the API host comparison pages for each model (e.g. https://artificialanalysis.ai/models/llama-2-chat-70b)


Ah so you do!

Your latency numbers for OpenAI (and Azure's equivalents) seem really high. I run time-to-first-token tests and I see much better numbers!

(Also, are those numbers averages, p50, p99, etc.? I'd honestly expect a box plot to really see what is going on!)
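
For what it's worth, here's a minimal sketch of the kind of TTFT test I run - this assumes the OpenAI Python client with streaming; the model name and sample counts are just placeholders:

    import time
    import statistics

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def time_to_first_token(model="gpt-3.5-turbo"):
        # Start the clock, open a streaming request, stop at the first content chunk.
        start = time.perf_counter()
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Say hi"}],
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                return time.perf_counter() - start

    samples = sorted(time_to_first_token() for _ in range(50))
    print("p50:", statistics.median(samples))
    print("p99:", samples[int(len(samples) * 0.99)])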


Hey com2kid - if you're still there, we did end up adding box plots to show variance. They can be seen on the models page https://artificialanalysis.ai/models and on each model's page, where you can view hosts by clicking one of the models. They are toward the end of the page under 'Detailed performance metrics'


We have Claude Instant on the models page: https://artificialanalysis.ai/models You can add it via the selector at the top right of each card where it says '9 Selected' (below the highlight charts)


Ah cool, I was on mobile and didn't see the selector


Definitely agree with your point on Claude Instant though. Much less than half the price and much higher throughput/speed for a relatively small quality decrease (varying by how 'quality' is measured and by use case)


Model quality index methodology is as per this comment (you can add perplexity using the dropdown): https://hackertimes.com/item?id=39014985#39017632

It's a combination of different quality metrics, across which Perplexity does not, overall, perform as well. That said, I think we are in the very early stages of model quality scoring/ranking, and (for closed-source models) we are seeing frequent changes. It will be interesting to see how measures evolve and how model rankings change


Thanks for the feedback and glad it is useful! Yes, agree that might be more representative of future use. I think a view of variance would be a good idea; it's currently just shown in over-time views - maybe a histogram of response times or a box-and-whisker plot. We have a newsletter subscribe form on the website, or Twitter (https://twitter.com/ArtificialAnlys), if you want to follow future updates


Variance would be good, and I've also seen significant variance on "cold" request patterns, which may correspond to resources scaling up on the providers' backends.

It would be interesting to see request latency and throughput when API calls occur cold (first data point), and at once-per-hour, once-per-minute, and once-per-second cadences, with the first N samples dropped.
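
Roughly something like this (a sketch only - request_latency is a stand-in for whatever timed call the harness actually makes):

    import time

    def request_latency() -> float:
        # Stand-in for a real timed API call (e.g. a TTFT measurement);
        # replace the body with an actual request to the provider under test.
        start = time.perf_counter()
        # ... make the API call here ...
        return time.perf_counter() - start

    def sample_at_interval(interval_s, n_samples, n_warmup):
        # The first call lands "cold"; later calls arrive at a fixed cadence.
        latencies = []
        for i in range(n_warmup + n_samples):
            t = request_latency()
            if i >= n_warmup:  # drop the first N samples
                latencies.append(t)
            time.sleep(interval_s)
        return latencies

    # The hourly cadence is obviously slow to collect - shown for completeness.
    for label, interval in [("hourly", 3600), ("per-minute", 60), ("per-second", 1)]:
        print(label, sample_at_interval(interval, n_samples=20, n_warmup=5))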

Also, at least with Azure OpenAI, the AI safety features (filtering & annotations) make a significant difference in time to first token.


Hi HN, thanks for checking this out! The goal of this project is to provide objective benchmarks and analysis of LLM AI models and API hosting providers so you can compare which to use in your next (or current) project. Benchmark comparisons include quality, price, and technical performance (e.g. throughput, latency).

Twitter thread with initial insights: https://twitter.com/ArtificialAnlys/status/17472648324397343...

All feedback is welcome


Any chance of including some of the better fine-tunes, e.g. Wizard or Tulu? (Worse than Mixtral, but I assume other fine-tunes will be better, just like Wizard and Tulu are better than Llama 2.)

I guess their cost is the same as the base model, although it would affect performance.


Hey, yeah, the bar for adding fine-tunes will probably be that they're hosted by ~3 supported hosting providers. Very much open to it!


Can a quality score be added for each inference provider for the same model? Many of them use different quantization and approximations, so it's not just price and throughput that matter. Especially for a model like Mixtral.


I'd love to see replicate.com (pay per sip) on there. And lambdalabs.com

[edit: And also MPS]


We've been waiting on Replicate to launch per-token pricing for LLMs because their previous pay-per-second model was uncompetitive - but it looks like they might have just turned it on with no big announcement! They'll go straight to the top of the priority list.

Do Lambda have a serverless inference API? Not aware of them playing in this space yet.

Presume you mean MPT not MPS - yep we'll look into MosaicML soon.


We have this (and other more detailed metrics) on the models page https://artificialanalysis.ai/models if you scroll down, and for individual hosts if you click into a model (via the nav, or by clicking one of the model bars/bubbles) :)

There are some interesting views of throughput vs. latency: some models are slower to the first chunk but faster for subsequent chunks, and vice versa, so they suit different use cases (e.g. if you just want a true/false answer vs. a more detailed model response)
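
As a toy illustration of the crossover (all numbers made up): expected total response time is roughly time-to-first-chunk plus tokens divided by streaming throughput.

    # Model A: fast first chunk, slow streaming. Model B: the reverse.
    def total_time(ttft_s, tokens_per_s, n_tokens):
        return ttft_s + n_tokens / tokens_per_s

    for n in (1, 24, 200):
        a = total_time(0.2, 30, n)
        b = total_time(0.8, 120, n)
        print(f"{n:>3} tokens: A={a:.2f}s  B={b:.2f}s")
    # A wins for very short outputs (true/false answers); B wins for long ones.
    # With these made-up numbers the crossover is at 24 tokens.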


Thanks!


Thanks! For Claude Instant, select the dropdown on the top right of the card where it says '8 Selected' and you can add it to the graphs. Thanks for the suggestions of adding Phi 2 and Model.com as a host - we can look into these!


The quality index is an equally-weighted average of normalized values of the Chatbot Arena Elo score, MMLU, and MT-Bench.

We have a bit more information in the FAQ: https://artificialanalysis.ai/faq but thanks for the feedback, will look into expanding more on how the normalization works. We are thinking of ways to improve this generalized metric.
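
For illustration, here is a minimal sketch of one way the equal weighting could work, assuming simple min-max normalization (the exact method may differ, and the scores below are made up):

    # Hypothetical (Elo, MMLU, MT-Bench) scores for three models - not real data.
    scores = {
        "model-a": (1200, 72.0, 8.3),
        "model-b": (1100, 65.0, 7.1),
        "model-c": (1050, 60.0, 6.5),
    }

    def min_max(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    # Normalize each metric across models, then average equally per model.
    columns = [min_max(col) for col in zip(*scores.values())]
    index = {name: sum(vals) / len(vals)
             for name, vals in zip(scores, zip(*columns))}
    print(index)  # {'model-a': 1.0, 'model-b': ~0.36, 'model-c': 0.0}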

A sticking point is that quality can of course be thought of from different perspectives: reasoning, knowledge (retrieval), use-case specific (coding, math, readability), etc. This is why we show individual scores on the home page and models page: https://artificialanalysis.ai/models

