Well, yes, but it would've been better if support had also been trained that this sort of issue gets a refund escalation. One should not have to rant on Twitter AND HN to get good service for a very reasonable failure.
I'm currently stuck in their support hell and have been told:
1. My issue is not real
2. Okay, your issue is real, but because you're not paying $$$ we're going to ignore you
3. I should do free work for Vercel and poll their community forums to see how widespread the issue is
4. Their support is only trained to handle frontend issues and because this is an issue with their CDN it's expected that they'll respond incompetently
5. They'll escalate with their CDN team and respond in one week (that was over a month ago, no follow up whatsoever)
It's hard to take Vercel seriously. As a toy, it's probably fine. But I'll ultimately move this project off of their CDN product as soon as it reaches costly volume.
Fair points. Hopefully this brings about the needed changes in these billing systems to prevent it happening in the future. It happens all the time on other providers as well. I'm very critical of the actual necessity of these infinitely scalable systems; see my other comment in this thread.
I wonder what the aversion is to using a plain old server / VPS. It's really not that difficult to deploy nowadays [0][1][2][3] and I'd rather get an $8 bill every month as insurance than ever worry about shit like OP just went through. It'll probably be more performant anyway due to cold starts and "edge" still having to hit us-east-1 for data. Cache your static files with Cloudflare/CloudFront. People are always surprised by how much traffic a single VPS can take[4] and believe it all has to be serverless to be web scale. I believe HN still runs on a single core or something.
There's a ton of places to get cloud credits as well, too many to link, so just Bing™ it
Vercel is sooo easy. Hook it to a repo and you’re done. It’s also cheap depending on use case. Edge functions are nice. Zero complaints from me as a customer for a little over a year.
I work with a small crew and having no server maintenance lets us ship more code. Really as simple as that.
You're right, I used the terms interchangeably. AWS Lightsail would have been a good service comparison, and those do start at $3.50[0]. The $8 figure came from running the smallest Fargate task at 0.25 vCPU and 512 MB, excluding the load balancer etc. [1]
It was announced today that App Runner now allows smaller instances[2], and the price/mo for this instance would be roughly a quarter of the previous default. So ~$14 [3]
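The $8 Fargate figure roughly checks out with a back-of-the-envelope calculation. The rates below are illustrative us-east-1 numbers, not authoritative; check current pricing:

```javascript
// Approximate Fargate on-demand rates (assumed, not authoritative).
const VCPU_PER_HOUR = 0.04048;  // USD per vCPU-hour
const GB_PER_HOUR = 0.004445;   // USD per GB of memory per hour
const HOURS_PER_MONTH = 730;

// Monthly cost of a task running 24/7 at the given size.
const monthlyCost = (vcpu, memoryGb) =>
  (vcpu * VCPU_PER_HOUR + memoryGb * GB_PER_HOUR) * HOURS_PER_MONTH;

// Smallest task: 0.25 vCPU, 512 MB => roughly $9/month, before the
// load balancer and other fixed costs.
```

At these assumed rates the smallest always-on task lands around $9/month, in the same ballpark as the $8 quoted above.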
Took me less than a minute to set up Next.js on AWS App Runner with auto deploys on commits; the rest was waiting for it to finish, which I'll admit Vercel is much faster at.
Bursty workloads. E.g. people use my app from 9am to 11am and then nothing for the rest of the day. I'm paying $8 but only around $2 of that compute is actually useful. Setting spend limits on Lambda / Vercel / cloud functions means the entire $8 goes to my bursty workload.
I've never been able to understand why so many usage-based hosting platforms don't give you the option to say "if my bill goes above X, shut down the service instead of continuing to charge me". It seems so easy and obvious, and I'll never ever use a platform for personal projects that doesn't let you have a failsafe
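The failsafe described here is conceptually simple. Here's a minimal sketch of a hard spend cap, with an entirely hypothetical interface (no real provider API implied):

```javascript
// Tracks accumulated spend and refuses further work past a hard limit,
// instead of silently continuing to bill.
function makeSpendGuard(limitUsd) {
  let spentUsd = 0;
  return {
    charge(amountUsd) {
      if (spentUsd + amountUsd > limitUsd) {
        // The "shut down the service" branch the comment asks for.
        throw new Error("spend cap reached: service suspended");
      }
      spentUsd += amountUsd;
      return spentUsd;
    },
    remaining: () => limitUsd - spentUsd,
  };
}
```

The hard part in practice is that usage metering is distributed and delayed, so a provider can only enforce a cap approximately; but even an approximate cap bounds the damage.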
> I've never been able to understand why so many usage-based hosting platforms don't give you the option to say "if my bill goes above X, shut down the service instead of continuing to charge me".
Large, valuable customers wouldn't use this option. It would only help the hobbyist types who aren't doing anything mission-critical and aren't the platform's most profitable (not big spenders, not particularly loyal because they aren't big enough to tailor their infra to that platform, etc.)
I'm not sure that's true. It's probably a lower risk for them, but still, a runaway service can rack up unlimited bills. At some point it matters to everybody. Even if they set it to $CURRENT_CASH_RESERVES, that has non-zero value
And when we talk about the in-between - startups - I'm pretty sure I read at least one anecdote on here about being put out of business by surprise hosting bills
> Large, valuable customers wouldn't use this option
Unless they prefer to rely on being large and valuable as leverage, they just have much bigger limits. And even in such an autoscale-type company there may be a separate department that wants to limit their budget, because they have one. Imagine accidentally spending an extra $100k at your company while the whole budget is around $10M: now you can't simply reach support with "that billion was clearly a mistake," because $10.1M seems normal.
With self-serve developer tooling like vercel they are hoping that devs bring the tools to work that they have been prototyping themselves. That’s why there’s a free tier.
From what I remember, they don't offer limits, and neither does GCP. They offer billing alerts and that's it. Both cloud platforms operate on what they call a "shared responsibility" model, aka you bill it you buy it.
The big clouds usually have something you can wrap around their usage-tracking APIs for alerting, but that's often very "eventually consistent" and might be delayed by days. That's understandable in some cases, and still useful in others, but it doesn't reliably protect you from "oh crap, a broken config launched way too many VMs and now I've spent $10k in a day".
It might be really hard to add in retroactively. Otherwise there’s no excuse, it’s such an annoying vulnerability for a customer, even if you can email support and get a refund. Where I work, modal.com, we have enforced budget limits for accounts.
Sure, it gets more complicated with storage. But I think we're mostly talking about stateless services here
And for storage or storage-having services you could easily have a "shut it down but preserve the data for a comparatively-small recurring fixed storage charge"
There’s a reason that services like AWS and GCP tend to be fairly forgiving of billing due to legitimate mistakes by first-time users (and others).
And that reason is perfectly summarized by this post: anyone reading this post is likely to put Vercel on a mental do-not-use list.
This is also the reason that cloud providers tend to have default quotas that, among other things, limit runaway usage.
Admittedly without knowing much about it, it sounds like Vercel may be pretty immature as a business.
Edit: after commenting I saw the Vercel response. I’m leaving my comment up because that response still seems to focus on a refund being conditional on some technical justification. That’s how engineers tend to think. It’s not how successful businesspeople need to think. It seems like both the refund policy and quota/limit management may need to be reviewed.
I hear this claim that AWS is forgiving of legitimate mistakes, but I don't think it's true.
It's true if you have friends in Amazon, or you manage to appear on ycombinator or have a few thousand Twitter followers. I've known at least three students who ended up with bills of around $100, which to them are huge, and which weren't waived.
The problem there is if you can’t afford $100, then under no circumstances should you consider using AWS or any major cloud provider. They simply don’t cater to that market.
They also charge more, sometimes around double, for domains: .studio is a nice even $50/yr on Vercel, but some odd cheaper price like $26 on Namecheap.
And it's not like that part of the service is any better; they still have to pass on weirdness from registrars. I tried registering a .md domain and it kept failing. Probably not on their end, but they didn't do anything extra like warn me it could happen or provide a helpful error message.
(I got excited about Zeit when it was going to be similar to Google Cloud Run or fly.io and am less interested in serverless that means things like database bouncers or using old versions of Node.js)
Earlier this week I was in a call with a colleague, an architect.
He was explaining his plan to add 2 new methods to an existing API and ended with a cost calculation where based on projected usage, it would cost the company $13/month, and growing over time.
I was shocked at the amount, as this is a tiny piece of a massive project, but equally shocked that we now need to think like this: deciphering real monetary cost, line by line.
I guess that's why there's a new profession: cloud cost optimization engineer. Not for me though, I stand by my god given right to ship shitty code without consequences.
If that doesn't work, I'll become a cloud cost optimization engineer imposter. I start with intentionally expensive code, erase any trace that I had any hand in it, then come in to fix it. I take a 50% commission from the savings.
Yeah, unfortunately I had a similar experience back when Vercel was Now. It seems like the optimal strategy is to stick with their free plan until you really need the paid features, then monitor it like a snake stalking a mouse.
The flip side is that they’ve been rock solid for years on their free plan. Super reliable and nothing but positive things to say about that tier.
Kinda surprising they wouldn’t forgive the $3k bill. That’s the cost of a nice MacBook Pro for a runaway experiment. You’d expect this sort of thing in ML training, not webdev…
I have never used functions before, but this seems very expensive compared to every other cloud service provider like DigitalOcean, AWS, GCP, and Azure.
I might be way off, but this would cost under $100 on all of them and you would most likely get a refund if you talked to support.
So why is Vercel so expensive? Do they not have pricing limits that you can set? It seems like a very bad idea to run functions that are billed per usage on a service that gives you no way to set limits on that usage.
After learning about this I will stay away until they implement obvious limits on usage. It's not cool to have to go viral publicly to get an unfair bill waived.
In a serverless environment, you're only billed for how long your function is running. In this case, however, the function never terminated, even after the request it was handling completed.
I suspect the user made the mistake of deploying a traditional web app (where a main process routes requests and waits indefinitely for the next one) instead of deploying individual "functions" that terminate gracefully after handling a request. That seems an honest mistake for a first-time user to make. On other platforms, the process would be killed by the timeout limit (serverless platforms usually have strict timeout limits), but for some reason, on Vercel it kept running forever and racked up a huge bill.
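The distinction drawn above can be sketched as two shapes of the same app. This is hypothetical, framework-agnostic code, not Vercel's actual API:

```javascript
// Serverless shape: a handler that does its work and RETURNS, letting
// the platform freeze or recycle the instance between requests.
function handler(request) {
  return { status: 200, body: `hello ${request.path}` };
}

// Traditional shape (shown as comments): a main process that listens
// and waits indefinitely for the next request. On a platform that
// bills for wall-clock execution time and lacks a hard timeout, this
// "never terminates" by design.
//   const server = http.createServer(route);
//   server.listen(3000); // blocks forever
```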
I am glad the issue was resolved for OP. I've had a great experience with Vercel and never had anything of this sort happen to me, fingers crossed I don't run into it any time soon :D
I had a very similar use case when I deployed Databricks through AWS. It went haywire and issued a trillion PUT requests to S3. I ran up a huge bill and was told to pound sand when I asked for a refund.
We've concluded our analysis.
1. We're refunding the overages
2. We identified the root cause
The root cause is that the Astro bundle handed to the deployment process is monolithic. There was a top-level `await` for an RSS endpoint which called an API with `fetch`. The issue is that these two (and the rest of the app) were bundled together!
Therefore, any time the function was invoked, that top-level `await` ran for all endpoints. It never yielded. And it's fully autonomous, meaning it would keep running even without a browser open once the chain reaction started.
This is a Swiss Cheese[2] kind of failure. It required the top-level `await`, the monolithic bundle, and the RSS function using `fetch` (i.e. over the network) rather than `import`-ing the data layer API directly.
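A minimal, synchronous simulation of that chain reaction (all names hypothetical; the real code awaited a network `fetch`, fired from a top-level `await`, back into the same deployment):

```javascript
let invocations = 0;
const MAX = 5; // cap so the sketch terminates; the real chain did not

// Stands in for the RSS endpoint's top-level `await fetch(OWN_API_URL)`:
// a request back into the same deployment, i.e. a fresh invocation of
// the same monolithic bundle.
function runModuleTopLevel() {
  if (invocations < MAX) invokeFunction("/api/posts");
}

// Because the bundle is monolithic, the top-level code runs for EVERY
// endpoint invocation, not just /rss.
function invokeFunction(endpoint) {
  invocations += 1;
  runModuleTopLevel();
  return endpoint;
}

invokeFunction("/rss"); // one request fans out into MAX invocations
```

Had the RSS endpoint `import`-ed the data layer directly instead of using `fetch`, the top-level work would have been a local call that spawns no new invocation, and the chain never starts.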
Most importantly, here's what we're doing: we are going to deploy a fix to ensure this doesn't happen again, across frameworks. I really appreciate Mike raising this and hopping on a Zoom call with me while our team investigated.
As an addendum for HN: we're continuing to refine the tools and patterns to best "harness" the practically-infinite ability for Serverless and Edge functions to scale horizontally. It's an awesome property, but it's taught us valuable lessons. We've come a long way in adding guardrails and alerts, and this will be another value-added protection that future customers will enjoy.
[1] https://twitter.com/rauchg/status/1644099739959590912
[2] https://en.wikipedia.org/wiki/Swiss_cheese_model