
FWIW I’m not a big cloud fan, but I was tasked with finding the “least worst” cloud provider in terms of predictable throughput and total cost (including dev/ops time), and GCP came out as a clear winner despite heavy financial incentives from both Azure and AWS. (We’re a very large corp.)

Due to our new allegiance to Google Cloud I’ve been given a little more privileged access to engineers (after the fact), and I can tell that I definitely made the right choice. The people I spoke to favoured having a clean/clear backend with actual quality-of-life features; but since those don’t fit nicely into feature comparison charts, people often think that GCP is less mature or featureful.

I’m a real convert, and our company uses all three of the big cloud providers in some fashion, but my team only deals with GCP and we’ve had the least headaches.

What I’m trying to say is: you guys are doing great. I’m really happy with the product for my use-cases.



I have a 240+ core cluster running on GCE now with an additional App Engine frontend for some API items, and I loathe the environment. I've used Heroku, AWS, Azure, Digital Ocean, and IBM SoftLayer in the past. I worked on service readiness and billing on Azure at Microsoft.

If it works for you that's good, but it wouldn't be my first choice even if I found it to cost less than the alternatives.


Interesting; where does it fall down for you?


Yeah bump. I rarely read about people not liking GCP (aside from their customer service of course) but would love to hear any issues you may be having.


As someone who's worked with multiple cloud providers (at very large scales), I would not recommend Google Cloud, simply because of support issues I've had and clearly misrepresented service capabilities and limits.

I will agree that GCP has some nice quality of life features, and is certainly favorable for a small project compared to some other providers, but I find it hard to trust google's ability to keep up with the demands of a large organization.


Same here. I'm the tech lead responsible for cloud infrastructure at a company with a large cloud presence, and have been doing this for years. Google has better technology in a lot of ways, but awful customer support and even customer treatment. A billing glitch on their side tore down all of our infrastructure and data at a previous employer, without so much as a "we're sorry". With AWS, if such a thing happens, you have an email from your account rep.


In what way(s) do you find GCP better than AWS? My company was using both for a while but has migrated more towards AWS lately. But I'm just learning my way around cloud development now so I don't have insight into the major differences between providers.


GCP has the fastest, simplest, and cheapest primitives for building. Organization and project hierarchy combined with IAM permissions (integrated with G Suite if you use it) make security and access easy. Every project has its own namespace and can be transferred to different owners or billed separately.

VMs don't have a mess of instance types, just standard but customizable cpu/ram with any disks and local SSDs attached. Live-migrated, billed to the second, and automatically grouped per cpu/ram increments for billing discounts.
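To illustrate (with made-up prices, not Google's actual rates), that billing model reduces to: pick any vCPU/RAM shape, pay per unit of each, and let the sustained-use discount come off automatically. A minimal sketch:

```python
# Hypothetical cost sketch for GCP-style custom machine billing: you pay
# per vCPU and per GB of RAM, per second, with an automatic sustained-use
# discount as monthly usage grows. Prices below are invented for
# illustration only.

VCPU_HOUR = 0.033  # assumed $/vCPU-hour (not a real rate)
GB_HOUR = 0.0045   # assumed $/GB-hour (not a real rate)

def monthly_cost(vcpus: int, ram_gb: float, hours: float,
                 sustained_discount: float = 0.0) -> float:
    """Cost of a custom vCPU/RAM shape for `hours` of usage in a month."""
    base = hours * (vcpus * VCPU_HOUR + ram_gb * GB_HOUR)
    return round(base * (1.0 - sustained_discount), 2)

# A custom 6 vCPU / 20 GB shape running a full month (730 h)
# with a 30% sustained-use discount applied:
full_month = monthly_cost(6, 20, 730, sustained_discount=0.30)
print(full_month)  # 147.17
```

The point is that cost is a pure function of vCPU and RAM quantities, not of a catalogue of instance types.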

Networking is 2 Gbps/core up to 16 Gbps/instance with low latency and high throughput regardless of zone, and doesn't need placement groups. VPCs are globally connected and can be peered and shared easily across projects so that networking is maintained in one place across multiple teams. Fast global load balancing with a single IP across many protocols and immediate scaling.
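That egress rule is simple enough to state directly; here's a one-liner under the comment's rule of thumb (actual limits vary by machine type, so treat this as a sketch):

```python
# Rough per-instance network cap under the "2 Gbps per vCPU, capped at
# 16 Gbps per instance" rule of thumb described above. Real limits
# depend on machine type; this is only the stated heuristic.

def egress_cap_gbps(vcpus: int) -> int:
    """Approximate network cap in Gbps for an instance with `vcpus` cores."""
    return min(2 * vcpus, 16)

print(egress_cap_gbps(4))   # 8
print(egress_cap_gbps(16))  # 16
```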

Storage is fast with uncapped bandwidth and has strongly consistent listings. BigQuery, BigTable and lots of other services have "no-ops" so there's no overhead other than getting your work done. Support is also flat rate and not a ripoff percentage.

The major downsides to GCP are the lack of managed services, limited and outdated versions for what they do offer, poor documentation and broken SDKs, and dealing with the opinionated approach of their core in-house products. This usually means as a startup that you are more productive on AWS or Azure because you can get going in a few clicks with a vast ecosystem, while GCP is a better home for larger companies that have everything built and just need strong core offerings with operational simplicity.


Your reasoning and conclusion resonate a lot with my own findings for a project that we tried to bootstrap. GCP's GKE and Istio integration makes them more compelling to consider for containerized workloads; EKS isn't quite there. One of the things we are struggling with is finding a qualified partner who can help us build our new project on GCP. We don't have a mature infrastructure team, and while we try to bootstrap one, we would like to rely on partners to help us move the needle. Even partners on GCP's premier list don't have case studies of migrating monoliths to GCP; most of them talk about G Suite migration, which isn't quite the same. AWS wins this battle. They have far more mature partners with a better track record and thought leadership. I wonder if you see this the same way. Maybe partnership is not crucial for you? Some insights from the community would help.


Yes. AWS's biggest advantage now is the giant marketplace of vendors and partners, so you can get help and managed services for just about anything.

In my experience, most startups just want managed options they can run themselves rather than engaging partners but if that's what you need then AWS will have more companies to offer, although GCP does have qualified partners. I recommend contacting one of the GCP developer advocates either in this thread or on twitter for help, or email me separately and I'll put you in touch.

Also, I haven't worked with them, but https://shinesolutions.com/ has put out plenty of articles and case studies that seem to show they're pretty capable; that might work for you.


Just curious, which support tier are you using from GCP?

For an early-stage startup/indie developer, GCP's support model is not appealing. $100 per user for the development role is unacceptable (at least to us, at the stage where we are now).


We use the production roles, but you don't have to sign up every single person, just those who file tickets and interact with support.

If you're very early then you can also try the older support pricing: https://cloud.google.com/support/premium/

If that's still too expensive then you can probably rely on the free support and forums until you have more spend and revenue.


Depending where you're based, there are some very good partners that can help here. It really depends on what level of cooperation you're looking for from "here, you do it" to "just give us guidance along the way".

https://cloud.google.com/solutions/migration-center/ is a reasonable jumping off point. Sorry if that didn't come up more clearly.


We have talked to Velostrata and some other GCP partners. It's hard to assess their usefulness. Unfortunately, they don't have much open-source credibility. Picking by case studies seems like picking out a car by reading advertising material.


I did some perf testing of VMs across Azure, GCP and AWS.

Gcp was always exactly the same perf, on the button exactly.

Azure was all over the place, fast then slow then fast then slow.

Aws was in the middle.

If you want guaranteed perf then go with gcp 100%


> The major downsides to GCP are the lack of managed services, limited and outdated versions for what they do offer, poor documentation and broken SDKs, and dealing with the opinionated approach of their core in-house products. This usually means as a startup that you are more productive on AWS or Azure because you can get going in a few clicks with a vast ecosystem, while GCP is a better home for larger companies that have everything built and just need strong core offerings with operational simplicity.

That is spot on. After a year of research and testing on GCP and AWS, I concluded with the same thing. AWS is much more startup/indie-friendly.


Our blocker to using GCP is that they do not offer a managed Oracle database. AWS does with RDS.

We hate Oracle as much as the next person, but we're locked in. And there's no way we're going to the "Oracle Cloud".


That's more on Oracle than Google, and probably unlikely to ever happen at this point. They've already increased licensing costs to make it more expensive to run in other clouds.


You know, that's kind of weird. You'd think they'd be really worried about the general trend and would be introducing smaller, cheaper versions of cloud Oracle. Are they just that convinced that people will stick with Oracle? I wonder what makes them think that. I'm not familiar enough with Oracle to take a guess.


What about running Oracle on a Google instance with a regional disk? Sure, it's not as sexy as a managed service, but functionally it would be very similar.


Not only is it not sexy, it's probably the most work. You're now administering the server, storage and the database; you're the DBA for your Oracle instance and databases; Oracle loves to hammer this scenario for licensing (they'd rather you pay for their "cloud"); and you're paying a premium for underlying hardware that needs to be managed remotely.

I can understand why this is a no-go for the GP.


Yeah, I understand it too; we're currently being squeezed by MS licensing in GCP (but it's basically free in Azure!).

This is the cost of lock-in.

Personally I'd rather have the expertise on staff than pay Oracle (and Microsoft) for this shitty behaviour.


What are you talking about? Google support absolutely is "some rip-off percentage". If you're saying it isn't, you clearly aren't working with it at any sort of scale.

Edit: I'm being downvoted, but it's true: if you need enterprise support, you're paying the same percentages you would on any other cloud service.


https://cloud.google.com/support/#support-options

The role-based support is flat rate, but you're right that the Enterprise tier is $15k or a percentage of spend. From my experience, most companies are fine with the 1-hour production tier.

I haven't worked with GCP beyond 6-figure scale, but I don't think that matters. If you need the 15-min response times and the TAM guidance then that's the fee, but at least you can opt out if you don't.


Hey! I'm actually glad you asked. :)

So, obviously my experience is based on my use-case so things that are important to me are predictability of resources and consistency of data.

From a resource perspective (GCP: Compute) there seems to be a tendency to set CPU affinity for certain cores, often in the same NUMA zone on the host. This means my tickrate doesn't get hijacked on certain cores randomly. That's quite nice honestly, and it's something that on-prem hyperconverged VM solutions don't manage. The NUMA-zone / same-zone affinity affects our performance quite drastically, as certain applications (such as in-memory databases or gameworlds) allocate large chunks of memory and then do little updates millions of times a second. This means memory bandwidth is very important.
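For what it's worth, the core-pinning half of this can be sketched on plain Linux with the scheduler affinity API — a rough userspace illustration of what the hypervisor is doing, not GCP's actual mechanism:

```python
# Pin the current process to one core so the scheduler can't bounce it
# around -- a userspace approximation of the fixed CPU affinity described
# above. os.sched_setaffinity is Linux-only.
import os

original = os.sched_getaffinity(0)   # cores this process may run on now
target = {min(original)}             # choose a single core to pin to
os.sched_setaffinity(0, target)      # restrict the process to that core
assert os.sched_getaffinity(0) == target

os.sched_setaffinity(0, original)    # restore the original mask
```

For guest VMs the same pinning has to happen on the host side (vCPU threads pinned to physical cores), which is exactly what you can't control on a shared hypervisor.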

GCP also has live migration of instances, which is 100% transparent to the application (even the CPU clock seems to be moved over). Unlike my colleagues on AWS, who get 24h notifications of host machine maintenance, I've never had to "deal" with even a single instance outage since starting on GCP 1y ago. But I would assume that if AWS hasn't solved this problem yet, they will soon.

Things like VPC networking being inter-zone by default is something that just makes sense and solves some headaches that I hear other teams having.

From a storage performance perspective, on AWS we were hard-capped at 5 Gbps per instance to S3; no such hard limit exists for GCP instances to Cloud Storage, so we're able to make use of it. (In fact, I can often saturate my 10 Gbit interface limit even to non-Google instances.) An AWS sales rep told us that this hard limit can't be removed no matter how much we pay.

I can't attest to the difference in APIs, since I use terraform and that abstracts away anything I would be using.

Regarding storage again: we had reps from AWS talking to us at length, and they would not guarantee that fsync() would be honoured on elastic storage. They stated that "it should never be needed", but honestly I'm not comfortable with that answer. I mean, my bare-metal database instances haven't gone down in 3 years, but that doesn't mean I'm going to start trading consistency for raw throughput.
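The fsync() concern, in miniature (a generic POSIX sketch, nothing GCP- or AWS-specific): a write() that returns is not necessarily on stable storage until fsync() succeeds, and for a newly created file the containing directory should be synced too. Database write-ahead logs depend on exactly this ordering for crash consistency.

```python
# Durable write pattern: write, fsync the file, then fsync the directory
# so the new directory entry is also on stable storage.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wal.log")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"commit-record\n")
os.fsync(fd)            # data durable (on storage that honours flush)
os.close(fd)

dfd = os.open(os.path.dirname(path), os.O_RDONLY)
os.fsync(dfd)           # directory entry durable too
os.close(dfd)

with open(path, "rb") as f:
    print(f.read())     # b'commit-record\n'
```

If the storage layer silently ignores the flush, this pattern gives no durability at all — which is why "it should never be needed" is an uncomfortable answer.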

There's a lot more but these are the most important things that I can think of right now.


Hi! I'm from the EC2 team.

The fixed performance EC2 instances (in contrast to instances with burstable performance) all have dedicated physical CPU allocations, and 1:1 affinity from vCPU to the underlying CPU. Similarly, memory allocations are fixed, pre-allocated, and are aligned to the NUMA properties of the host system. When combined with both our Xen hypervisor and Nitro hypervisor, this provides practically all of the performance of the host system with regard to memory latency and bandwidth.

See: https://twitter.com/_msw_/status/1045032259189760000

With our latest 25 Gbps and 100 Gbps capable instances, EC2 to S3 traffic can make use of all of the bandwidth provided to the instance.

Please send me more information about the interaction you had on the EBS side, msw at amazon dot com. All writes to EBS are durably recorded to nonvolatile storage before the write is acknowledged to the operating system running in the EC2 instance. fsync() does not necessarily force unit access, and explicit FUA / flush / barriers are not required for data durability (unlike some storage devices or cloud storage systems). Perhaps there was confusion about the question that was asked.


Can attest to very little CPU performance loss via OpenMP applications over bare metal. Maybe as little as 5% for our use case (numerical modeling of weather impacts).


Re: NUMA, I had thought it was fairly straightforward to peg VMs to particular NUMA nodes (or declare CPU or memory affinity) in VMware. I agree GCP does it well (better than the other public clouds!), but I've also seen this done well on-prem.


I have too, which is why I singled out hyper-converged.

In hyperconverged environments the storage and the compute live together on a service mesh, so some processing is done on the hypervisor to manage storage. Those processes don't have an affinity (in the case of VMware, for example), so a guest VM core sometimes gets suspended for a couple of hundred clock ticks.


Hi! I'm from the EC2 engineering team.

This is a super challenging problem for general purpose virtualization stacks. EC2 has been working to avoid any guest VM interruptions. With the Nitro hypervisor, there are no management tasks that share CPUs with guest workloads.

See more here, including data from a real customer with real-time requirements running on EC2 under the Nitro hypervisor: https://youtu.be/e8DVmwj3OEs?t=1796


That's a good point, I have seen cases where we've separated a portion of nodes for VSAN or ScaleIO out to a separate cluster to avoid this behaviour on latency sensitive workloads. These nodes don't need a ton of RAM and you'd better have good E-W bandwidth for them, but it's always a tradeoff...


Hi! I'm from the EC2 team.

This is a great tool to help uncover if CPU and memory NUMA within a VM aligns to the physical host or not:

https://www.cl.cam.ac.uk/research/srg/netos/projects/ipc-ben...

Paper from the USENIX 2012 conference: http://anil.recoil.org/papers/drafts/2012-usenix-ipc-draft1....

EC2 instances have been NUMA optimized since we launched our CC1 instances in 2010. I encourage you to try ipc-bench on your cloud provider of choice, if CPU and NUMA affinity is important to your workload.


Regarding the 5 Gbps cap, they said at re:Invent 2017 that they increased the cap to 25 Gbps. Did that never happen?

https://m.youtube.com/watch?v=9x8hz1oRWbE


Yes it did: https://aws.amazon.com/blogs/aws/the-floodgates-are-open-inc...

However, it may depend on instance type and other factors. This complexity is a problem with AWS.


What 24-hour host machine maintenance? Maybe once a year I get an email 1 month in advance that a server will be rebooted, or I can do it myself... I stop/start it that week and don’t get another email for another year...


Three years on GCP here... I’ve not once received one of these notifications. In my prior 7 years on AWS, these happened all the time, especially when you run a lot of VMs in different zones/regions (probably less visible if you are all in one zone).

Live migration is amazing.

Now if only Google paid more attention to certain features of their GLB... amazing tech with showstopper bugs means amazing tech I can’t use. (Also means it’s not so amazing)


It depends on a lot, like region, instance type, security issues, etc. Getting no notices is still better than getting some, and live migration also helps with reliability by moving your VM if possible instead of it going down with the host.


I think he means the notice is sent 24h before the scheduled maintenance. They don’t always provide a lot of advance notice.


That's why I'm confused, 7 years on AWS and never had less than 1 month notice.


Sounds like you've never seen an actual hardware failure; you've only seen scheduled maintenance. You don't get any notice at all with host failures. You get an email like this:

"We have important news about your account (AWS Account ID: XXXX). EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance (instance-ID: i-XXXX) in the us-east-1 region. Due to this degradation, your instance could already be unreachable. After 2018-12-28 19:00 UTC your instance, which has an EBS volume as the root device, will be stopped."

The stop date is 2 weeks after the email is sent, but as the message states, the instance is likely already dead. It was already dead in this case. GCP will live migrate your instance to another host.


That's the email I get. I had 1 in 2018, but it was 4 weeks from the received date, not 2. Only a couple of our servers are older than 12 months though; we usually generate new AMIs and roll them out to keep them up to date.


I wouldn't jinx it.

Like the parent said, we used to receive these notices quite frequently, and the amount of notice varied wildly. I should ask the teams who currently/primarily use AWS; perhaps the situation has changed?

But it was not uncommon to have 24 hrs of notice, especially for instances in us-east-2 for some reason.


We use us-west-2, but I'm touching wood nonetheless.


Can you explain more about that 5 Gbps per instance to S3? Is that documented anywhere? How many prefixes were you using?


This was 5Gbps until January of last year, when it was increased to 25Gbps:

https://aws.amazon.com/blogs/aws/the-floodgates-are-open-inc...



Aside from pricing (different workloads yield different results), I'd say that:

GCP has much more modern APIs (they got to do it "right" the first time around, while AWS had to learn from their "mistakes").

GCP is less mature with documentation, but still excellent (AWS is first class). I've found bugs, and others I know have run into straight-up wrong docs - a quick message to support yielded the correct information and a docs update within the week.

GCP is far more "opinionated" about how you should run a service. AWS is opinionated as well, but less so. What this means is that while GCP will sell you resources in the traditional way, going GCP-native or from scratch really requires buying into the GCP "way" of doing things more than AWS does. Basically, K8s or go home.

We use GCP and I enjoy it quite a bit. I have extensive AWS experience as well, but no strong preference.


GCP is not done "right the first time". Most of the interesting things you see in GCP are 3rd-5th generations of internal products.


How is that a contradiction? You are saying that by the time the products became external APIs, they were already battle-tested, right?


That's true, but I wasn't trying to contradict, only to explain why some of Google's products feel so polished from early access. It also explains why some designs may feel awkward: they were built assuming the users would be Google product teams, with all the ballast of building for a billion users.


We are struggling to maintain price parity on Google Cloud versus both Azure and AWS.

AWS now has the new ARM machines in production, which are cheaper. AWS also allows yearly prepaid managed databases, while Google still charges per-second billing (with some monthly discount). My AWS TCO comes out much, much lower (almost 30%) than GCP if I take into account committed-use/prepayment discounts.


Interesting; maybe it's not solving the same use-cases for you then :)

For us: on AWS/Azure we had to benchmark each instance when it came up and if it performed poorly it had to be reaped and redeployed constantly. The increased overhead in dealing with things "the AWS way" was enough by itself to offset the pricing difference with the steep AWS discount we were getting.

Understanding administrative overhead is complicated but I'm sure you are taking into account the cost of humans. (or, you've already absorbed the cost of automating it?) :)

As with all things; use what works for you. I'm just very happy with GCP over the others coming from bare metal.


It depends on use cases - I run a data-heavy use case. We deploy using Docker Swarm, so our redeployment is fairly well taken care of.

In terms of benchmarking - this is not something we care about (since we don't have much CPU utilization), so it's probably easier for me.

For my use case, AWS has better administrative tools than Google. I use Route 53, SES, and S3 - in all three, no other products come close (Transfer Acceleration... I'm looking at you).

The one place that I agree with you is Dataproc. That's far superior to EMR in terms of spin-up/down.


I dislike the concept of committing to use. It creates a disincentive to optimize jobs, because "we've already paid for three years of this".


It's up to you to see it the other way around: if you optimize jobs you can fit future workloads in the infrastructure you've already paid for.

I prefer to think of reserved instances as a purely financial trick.


The pay-as-you-go reductions give you more flexibility about when you make optimisations to your code though and help you handle uncertainty around this.

For example, if I have a new workload that won't fit on existing hardware with AWS, I can either: a) reserve the new instances, or b) not reserve on the assumption I will be able to optimise in the 'near' future.

In practice this is hard to know, but AWS forces you to make the decision up front, whereas Google Cloud lets you defer it without being punished later for making the 'wrong' choice.
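The commit-or-not trade-off above can be put in back-of-envelope form. Assume (hypothetically) that a 1-year reservation costs 60% of the on-demand price for the same capacity; then reserving only pays off if you still need the capacity for more than 60% of the year, and optimising the job away early turns the commitment into sunk cost:

```python
# Break-even sketch for reserving vs pay-as-you-go. The 60% figure is an
# illustrative assumption, not any provider's actual pricing.

def break_even_months(reserved_fraction: float = 0.60) -> float:
    """Months of usage after which the reservation beats on-demand."""
    return round(12 * reserved_fraction, 2)

def cheaper_to_reserve(expected_months: float,
                       reserved_fraction: float = 0.60) -> bool:
    """True if the reserved price beats on-demand for the expected usage."""
    return expected_months > 12 * reserved_fraction

print(break_even_months())     # 7.2
print(cheaper_to_reserve(9))   # True  (workload sticks around)
print(cheaper_to_reserve(4))   # False (optimised away early)
```

The uncertainty in `expected_months` is exactly what makes the up-front decision hard.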


Right, but that requires your new workloads to fit well with the resources you have committed to. If you've committed to many small machines, they may not be well suited to an especially heavy workload. If you've skewed your commitment towards a large amount of memory per compute (or vice versa), you can find yourself with lots of RAM (or compute) and no natural use-case.

At the bottom line, commitment impedes change.


Your point is valid, thanks for the answer.


From a startup POV, I'll give you the other perspective.

I now have significantly reduced opex that saves me a lot of money. If I grow very fast (and outgrow it)... I really don't care, because I will most likely have a financing event.

If I dont grow fast enough to justify the spend, I have bigger problems.

In addition, I have to commend Google and Azure here - the way they do committed use is very flexible. They price it on number of units (cores, RAM, whatever) - so if you outgrow it, you still get the discount + full cost of additional units. On AWS, if you outgrow, you have to frikking sell off your machines on their auctions and buy new ones.

The only problem is that Google doesn't do this for databases (which are very expensive).


> On AWS, if you outgrow, you have to frikking sell off your machines on their auctions and buy new ones.

I don't believe this is entirely true. The AWS reserved credits are good within the machine family. So a t2 credit is good for all t2 instance sizes. Not as flexible as CPU/RAM credits, but more so than it was in the past.


True, but in multiples of instances: 2 x t2.xlarge = 1 x t2.2xlarge. But if you have 3 x t2.xlarge, the swap can't be done partially into t2.2xlarge. So you have to sell one of the t2.xlarge.

It's not very nice; that's why they have a reserved instance marketplace - https://aws.amazon.com/ec2/purchasing-options/reserved-insta...

What is new is the convertible reserved instances. I haven't used them - and you may be right there. But I still maintain the GCP/Azure way is light years better.
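The instance-size arithmetic in this subthread follows from the normalization factors AWS documents for size-flexible reserved instances within a family: a reservation is a pool of normalized units, and running instances consume units from it. A tiny sketch (treat it as illustration, not billing logic):

```python
# Normalization factors for size-flexible RIs within one instance family.
# These match AWS's documented table (nano=0.25 ... 2xlarge=16), but this
# is a simplified model, not real billing behaviour.

FACTORS = {"nano": 0.25, "micro": 0.5, "small": 1, "medium": 2,
           "large": 4, "xlarge": 8, "2xlarge": 16}

def covered_units(reservations: dict) -> float:
    """Total normalized units a set of reservations provides."""
    return sum(FACTORS[size] * count for size, count in reservations.items())

# 2 x t2.xlarge reserved == 1 x t2.2xlarge of capacity:
assert covered_units({"xlarge": 2}) == FACTORS["2xlarge"]

# 3 x t2.xlarge (24 units) covers one 2xlarge (16 units) with 8 left over,
# i.e. exactly one more xlarge -- the awkward partial fit described above.
leftover = covered_units({"xlarge": 3}) - FACTORS["2xlarge"]
print(leftover)  # 8
```

This also shows why small denominations (smalls/nanos) pack more flexibly: they divide evenly into any larger size.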


Which is why Amazon’s own tool suggests buying smalls or nanos.

I do think gcp and azure do it better though.


I like Google's implementation of it, since you commit to a certain amount of CPU and RAM, not instance types. So you can commit to a 4 CPU/15 GB RAM system and, a year later, double the size of the VM, and you still get the cheaper price on the first 4 CPU/15 GB of RAM. And it's per project. (I would love it if I could purchase per organization.)


Yes, GCP has the best billing model. You purchase CPU/RAM capacity and it works the same whether on-demand, sustained use discount, or commitments.

Separating capacity from instance type makes it much more natural and easy to use for your actual requirements.



