Andrew Jassy
SVP, Amazon Web Services
Hi Andy,
I am not going to ask you how you are doing. For everyone in the Amazon Web Services ecosystem, the last 24 hours have been brutal. But I’d like to share my perspective with you, and offer a couple of suggestions:
I believe that in the long run this will be a positive day for the cloud computing movement. Naysayers seeking evidence to avoid the cloud have new ammunition, those hyping the cloud are experiencing its limitations, and the leading cloud provider, your company, is learning from the major outage the importance of being humble and cooperative.
I also believe that the way AWS behaves needs to change. You built the leading infrastructure-as-a-service provider with a level of secrecy typical of a stealth startup or a dominant enterprise software platform vendor. It works for Apple – they deliver a complete integrated value chain. But it is not your position in the cloud ecosystem. Today’s outage shows that secrecy doesn’t and won’t work for an IaaS provider. Compete on scale and enterprise readiness, and part of readiness is being open about your internal architectures, technologies and processes.
Our dev-ops people can’t read from the tea-leaves how to organize our systems for performance, scalability and most importantly disaster recovery. The difference between “reasonable” SLAs and “five-9s” is the difference between improvisation and the complete alignment of our respective operational processes. My ops people were ready at 1:00 am PT to start our own disaster recovery, but status updates completely failed to indicate the severity of the situation. We relied on AWS to fix the problem. Had we had more information, we would have made a different choice.
This brings me to my last point: communication. Your customers need a fundamentally different level of information about your platform. There are some very popular web sites that try to reverse-engineer the way AWS operates. These secondary sources – based on reverse engineering and conjecture – provide a higher level of communication than we get directly from the AWS pages. We live in the Twitter, Facebook, Wikipedia and Wikileaks days! There should not be communication walls between IaaS, PaaS, SaaS and customer layers of the cloud infrastructure.
Tear that wall of secrecy down, Mr. Jassy. Tear it down!
Respectfully,
Roman Stanek
CEO and Founder, GoodData (2009 AWS Startup Challenge winner)
@romanstanek
roman@gooddata.com
P.S. I am publishing this letter on my blog. It’s part of open communication between our companies.
Amen! The white elephant is on the living room table now at least….
Roman, as always, well balanced and well put. GoodData has more operational experience with AWS and computing on the public cloud than nearly all ISVs. The cloud will move forward and this incident brings to light issues with Amazon that have been well known for many years. The publicity surrounding this incident levels the playing field for cloud computing providers – and more competition is always a good thing for a market like this. If Amazon wants to retain the leadership position they have achieved from a bold first-mover advantage, then Mr. Jassy would be well advised to listen carefully to your counsel.
Was GoodData’s data replicated out of the cloud?
We keep data backups in several locations and this issue was not about access to data backups.
To be able to recover from failure we keep a hot standby, but the recommended disaster recovery procedures failed on AWS and we had no information about what was going on. No more trust without transparency!
In my opinion, your perception here is wrong. It is always your (our) fault when the app breaks. When your site fails to work, the customer doesn’t care about whether it’s hosted on provider A or provider B, or whether it’s using language X or language Y with bindings to runtime Z. It’s 100% your problem that you are using something that sometimes breaks and do nothing about it. Prepare a backup site (in AWS, Rackspace, your PC, whatever) that can handle some of the traffic but still keep the service online. If you are using libfoo in your app’s code, and something breaks because of a bug in it, will you display a message saying “Oh, it looks like libfoo is broken, try again after foocorp releases a patch”? No, you obviously won’t. You’ll fix it somehow. Whether it’s the code, the hardware or the platform that fails, it doesn’t matter. It’s your site and your responsibility. The customer will always complain to you. Obviously, the world is not perfect, and using a single platform is a great idea (with the right SLA), but _you_ chose Heroku, not the customer. So you ought to give the answers, not Heroku.