More

Barathkanna · 2026-03-12T06:21:45 1773296505

Sounds like a plan, But what if you can just pay a fixed cost every month and not worry about anything?

thiago_fm · 2026-03-13T12:25:46 1773404746

What you suggested is the best way in my opinion, but given what OP asked, I gave my answer.

Barathkanna · 2026-03-12T06:19:44 1773296384

That’s true, but AI is interesting because consumption-based pricing introduces a lot more variance than typical SaaS infrastructure. One user action can trigger dozens of model calls in an agent workflow. That’s partly why we started experimenting with models like https://oxlo.ai where the pricing flips back to a fixed subscription and we absorb the usage spikes.

Barathkanna · 2026-03-12T06:14:28 1773296068

Local models help remove token cost uncertainty, but they shift the problem to infrastructure and ops. GPUs, scaling, maintenance, and latency can add up quickly depending on the workload. For many builders it ends up being a tradeoff between predictable infra cost and flexible API usage.

Barathkanna · 2026-03-12T06:13:34 1773296014

That’s great. Real-time tracking is a big step already. The tricky part we kept running into was the variance itself, especially with retries and agent loops. That’s partly why we started experimenting with Oxlo.ai (https://oxlo.ai) where the pricing model absorbs that variance so builders don’t have to constantly model token risk.

Barathkanna · 2026-03-12T06:11:40 1773295900

One underlooked source of variance is retries from formatting failures. In many agent systems the loops dominate the cost, not the raw token length.

We ran into the same issue building agent workflows, which is why we started building https://oxlo.ai — experimenting with a flat subscription model where we absorb the token variance so builders don’t have to constantly model token risk.

Barathkanna · 2026-03-12T06:08:26 1773295706

Agreed. The real cost unit becomes the whole agent workflow, not a single LLM call. One user action can trigger dozens of calls.

We ran into the same issue and ended up building https://oxlo.ai to make the cost side more predictable for agent workloads.

Barathkanna · 2026-03-11T06:50:16 1773211816

True, but for early stage builders it’s harder to design those guardrails upfront. A lot of the time you only discover the retry patterns and cost spikes once real users start hitting the system.

Lazy_Player82 · 2026-03-11T06:55:29 1773212129

Fair point. And honestly, with more non-technical builders shipping agent-based products these days, that's probably where a service like this makes the most sense – for people who don't yet have the experience to know what guardrails to put in place.

Barathkanna · 2026-03-11T06:58:09 1773212289

Exactly. That’s actually why we started building Oxlo.ai. Early stage builders usually just want to experiment without worrying too much about token cost spikes.

Barathkanna · 2026-03-11T06:46:38 1773211598

Local models solve the marginal cost problem, but they move the complexity into infrastructure and throughput planning instead.

clearloop · 2026-03-11T09:01:03 1773219663

makes sense, it really depends on the use cases, I'm building my version of claw openwalrus for the local LLMs first goal, I think myself will use local models for daily tasks that heavily depend on tool callings, but for coding or doing research, I'll keep using remote models

and this topic actually inspires me that I can introduce a builtin gas meter for tokens

Barathkanna · 2026-01-27T15:35:04 1769528104

Agreed. Self-hosting gives the cleanest fixed cost, but you pay for it in ops and capacity planning. I’m mainly curious whether there’s a middle ground that gives early teams more predictable spend without immediately taking on full infra overhead.

storystarling · 2026-01-27T19:30:19 1769542219

Serverless GPU providers like Modal or RunPod are probably the closest thing. You pay for execution time rather than tokens so the unit economics are deterministic, and you don't have to manage the underlying capacity or OS. It is still variable billing but you avoid the token markup and the headache of keeping a cluster alive.

Barathkanna · 2026-01-27T09:55:30 1769507730

A realistic setup for this would be a 16× H100 80GB with NVLink. That comfortably handles the active 32B experts plus KV cache without extreme quantization. Cost-wise we are looking at roughly $500k–$700k upfront or $40–60/hr on-demand, which makes it clear this model is aimed at serious infra teams, not casual single-GPU deployments. I’m curious how API providers will price tokens on top of that hardware reality.

wongarsu · 2026-01-27T11:40:54 1769514054

The weights are int4, so you'd only need 8xH100

a2128 · 2026-01-27T13:05:50 1769519150

You don't need to wait and see, Kimi K2 has the same hardware requirements and has several providers on OpenRouter:

https://openrouter.ai/moonshotai/kimi-k2-thinking https://openrouter.ai/moonshotai/kimi-k2-0905 https://openrouter.ai/moonshotai/kimi-k2-0905:exacto https://openrouter.ai/moonshotai/kimi-k2

Generally it seems to be in the neighborhood of $0.50/1M for input and $2.50/1M for output

reissbaker · 2026-01-27T10:37:21 1769510241

Generally speaking, 8xH200s will be a lot cheaper than 16xH100s, and faster too. But both should technically work.

pama · 2026-01-27T14:52:26 1769525546

You can do it and may be ok for single user with idle waiting times, but performance/throughput will be roughly halved (closer to 2/3) and free context will be more limited with 8xH200 vs 16xH100 (assuming decent interconnect). Depending a bit on usecase and workload 16xH100 (or 16xB200) may be a better config for cost optimization. Often there is a huge economy of scale with such large mixture of expert models so that it would even be cheaper to use 96 GPU instead of just 8 or 16. The reasons are complicatet and involve better prefill cache, less memory transfer per node.

bertili · 2026-01-27T10:04:59 1769508299

The other realistic setup is $20k, for a small company that needs a private AI for coding or other internal agentic use with two Mac Studios connected over thunderbolt 5 RMDA.

Barathkanna · 2026-01-27T10:14:13 1769508853

That won’t realistically work for this model. Even with only ~32B active params, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of GB to TBs of weights resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and bandwidth becomes the bottleneck immediately. You could maybe load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload than private coding models.

bertili · 2026-01-27T10:26:07 1769509567

People are running the previous Kimi K2 on 2 Mac Studios at 21tokens/s or 4 Macs at 30tokens/s. Its still premature, but not a completely crazy proposition for the near future, giving the rate of progress.

NitpickLawyer · 2026-01-27T10:48:08 1769510888

> 2 Mac Studios at 21tokens/s or 4 Macs at 30tokens/s

Keep in mind that most people posting speed benchmarks try them with basically 0 context. Those speeds will not hold at 32/64/128k context length.

zozbot234 · 2026-01-27T10:17:27 1769509047

If "fast" routing is per-token, the experts can just reside on SSD's. the performance is good enough these days. You don't need to globally share unified memory across the nodes, you'd just run distributed inference.

Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.

YetAnotherNick · 2026-01-27T12:10:53 1769515853

Depends on if you are using tensor parallelism or pipeline parallelism, in the second case you don't need any sharing.

omneity · 2026-01-27T13:32:56 1769520776

RDMA over Thunderbolt is a thing now.

embedding-shape · 2026-01-27T10:12:20 1769508740

I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.

zozbot234 · 2026-01-27T10:19:14 1769509154

Prompt processing/prefill can even get some speedup from local NPU use most likely: when you're ultimately limited by thermal/power limit throttling, having more efficient compute available means more headroom.

Barathkanna · 2026-01-27T10:20:36 1769509236

I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192 token input. • 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s • 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s

These are order-of-magnitude numbers, but the takeaway is that multi H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.

ffsm8 · 2026-01-27T12:25:42 1769516742

You do realize that's entirely made up, right?

Could be true, could be fake - the only thing we can be sure of is that it's made up with no basis in reality.

This is not how you use llms effectively, that's how you give everyone that's using them a bad name from association

zozbot234 · 2026-01-27T10:12:08 1769508728

That's great for affordable local use but it'll be slow: even with the proper multi-node inference setup, the thunderbolt link will be a comparative bottleneck.