Hi, I'm Daniel - the author of this article. Hope this gives some more insight into Amazon's infamous CoE process. Happy to field any questions folks may have.
Thanks for re-sharing. I was quite disappointed to see the negative community reaction to the original Amazon article. I felt that those knocking it didn't analyze the circumstances that led to the serverless architecture's gaps; they just saw some problems with it and assumed "serverless bad".
In reality, the series of decisions made a lot of sense. Take something that exists, try using it, learn from it, and make better decisions as a result of it. It just so happened that in this case, serverless architecture was the wrong tool for the job and dedicated machines were a better fit.
Anyways, happy to see my take being shared. If you enjoy this kind of content, I have a newsletter and YouTube channel where I discuss / make videos about more AWS & cloud concepts:
I think serverless could be the right fit for parts of this; something like the detectors seems ideal, since they're somewhat isolated from one another.
I kind of want to take a crack at what Prime Video did, except serverless from a green-field perspective, and I'm willing to bet I wind up provisioning a RedPanda cluster or similar and having it feed detector lambdas. The bottleneck and cost here appeared to be storing intermediate results and the orchestration. Orchestration is solved by using something like SQS or what have you. Intermediate results are somewhat harder depending on size, but I think you could get there and it'd be an interesting exercise.
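To make the orchestration idea concrete, here's a minimal sketch of what an SQS-driven detector Lambda might look like. This is purely illustrative: `detect_block_corruption` is a hypothetical detector stub, the queue message format is assumed to be a JSON body listing frame references, and only the SQS event envelope (`Records` / `body`) follows the real AWS shape.

```python
import json


def detect_block_corruption(frame_ref: str) -> dict:
    # Hypothetical detector stub; a real one would fetch the frame
    # (e.g. from S3) and run the actual defect analysis on it.
    return {"frame": frame_ref, "defect": None}


def handler(event, context=None):
    """Lambda entry point: each SQS message carries a batch of frame refs."""
    results = []
    for record in event["Records"]:
        body = json.loads(record["body"])
        for frame_ref in body["frames"]:
            results.append(detect_block_corruption(frame_ref))
    # In a real pipeline you'd write results to S3 or a results queue
    # rather than returning them, to sidestep the intermediate-storage
    # bottleneck the article describes. Here we just report a count.
    return {"processed": len(results)}


# Example invocation with a fake SQS event:
fake_event = {
    "Records": [
        {"body": json.dumps({"frames": ["frame-001", "frame-002"]})}
    ]
}
print(handler(fake_event))  # {'processed': 2}
```

Whether this beats provisioned infrastructure still hinges on where the intermediate frame data lives, which is exactly the hard part the article ran into.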
But isn't this an example of even AWS developers using cloud services wrong?
They used a serverless approach (which should have excellent scalability -- that's one of its most common marketing lines!) then had to switch architectures because they realised it wasn't scalable.
> Secondly, the original architecture was based on a tool that was not designed for scale. Based on the article, it read as if this tool was used for diagnostics to perform ad-hoc assessments of stream quality. This means it likely wasn’t designed for scale or put through the pressure tests of a formal design document / review. If that were to happen, any run of the mill Software Engineer would have been able to recognize the obvious scaling / cost bottlenecks that would be encountered.
> Finally, I think the PV team made all the right moves by switching to provisioned infrastructure. They identified an existing tool that could have been a potential solution. They attempted to leverage it before realizing it had some real scaling / cost bottlenecks. After assessing the situation, they pivoted and re-designed for a more robust solution that happened to use provisioned infrastructure. This is what the software development process is all about.