
Why we upgraded from Amazon Web Services
Mark Marsiglio

As you may know, we provide our clients with enterprise-grade hosting on a multi-server architecture in which each server is dedicated to an individual function: database hosting, file serving, load balancing, and front-end web serving.

For many years, our servers were located in the Amazon Web Services US-East Region. It seemed to be a perfect platform for this, allowing our array of servers to increase and decrease in capacity (and cost) as our usage patterns demanded. When we implemented our scalable array of servers within AWS, it offered many significant advantages over our previous dedicated-server hosting platform.

However, there are two inherent flaws in the design of AWS that prevent it from being the best choice for us today.

1) Downtime

First, there have been several high-profile failures of the AWS infrastructure resulting in downtime. I say "high profile" because they frequently make the mainstream news. I believe these failures are so well known because the very nature of AWS's popularity prevents it from being able to handle them effectively.

If you read Amazon's epic 6,000-word explanation of their April 2011 downtime, they identified the design flaw and indicated they would be adding reserve capacity:

As a result, many users who wrote their applications to take advantage of multiple Availability Zones did not have significant availability impact as a result of this event.

and:

We now understand the amount of capacity needed for large recovery events and will be modifying our capacity planning and alarming so that we carry the additional safety capacity that is needed for large scale failures. We have already increased our capacity buffer significantly, and expect to have the requisite new capacity in place in a few weeks.

At the same time, AWS continues to drive down the cost of their service with regular price cuts. They could be cutting prices and making the system more stable at the same time, but those goals seem at odds with each other. Does AWS have enough reserve capacity in their infrastructure to handle a massive increase in utilization that would result from the failure of an availability zone? 

Any competent system administrator will plan for the failure of single instances or services within their cloud deployment, or even for multiple simultaneous instance failures. But AWS has had incidents where instance failures were accompanied by API failures that prevented recovery.

I would argue that as more customers deploy larger cloud-hosted platforms with AWS, the impact of an infrastructure failure in one AZ is likely to disrupt service in other AZs or even regions.

With the release of the new EBS Snapshot Copy feature, AWS has made it easier to maintain a warm copy of your data in another AWS region. I suppose using a separate region is safer than a separate availability zone in the same region for failover, but why stop there?

Our choice - two completely separate providers

For the sake of argument, let's accept the premise that the best failover solution is a live backup isolated in a separate region (AWS US-West for instance) with failover handled by DNS. Then I would propose that there is still risk in depending on the same provider for both your primary and secondary systems.

The more AWS enables cross-region data transfer features, the more likely it is that they will not have the reserve capacity to spin up the recovery resources their customer base requires. If an AZ fails, every AWS customer will scramble to spin up replacement instances at the same time, and this is likely to overwhelm the infrastructure just as it has multiple times in the past.

Our solution depends on a primary data center and backup data center that are completely independent of each other. Separate companies, different technologies, thousands of miles apart geographically. We use DNS failover to monitor the primary system and switch to the backup if there is a problem. We sync the data daily and can run indefinitely on the backup platform if necessary.
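As a sketch of how that failover logic works (the health endpoint and the threshold are illustrative placeholders, not our actual tooling; the record swap itself is handled by the DNS failover provider), the monitor amounts to a health check plus a consecutive-failure counter, so a single blip doesn't flip DNS:

```python
import urllib.request

def is_healthy(url, timeout=5):
    """True if the primary's health endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def should_fail_over(results, threshold=3):
    """Given successive health-check results (True = healthy), return True once
    `threshold` consecutive failures are seen -- the point at which the DNS
    failover service would repoint the record at the backup data center."""
    failures = 0
    for ok in results:
        failures = 0 if ok else failures + 1
        if failures >= threshold:
            return True
    return False
```

Requiring consecutive failures before switching trades a few minutes of detection latency for protection against flapping between data centers on transient errors.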

2) I/O Performance

While AWS offers an incredible array of services, our primary use was built around their EC2 servers and EBS storage volumes. We host enterprise web content management systems on this platform.

We are not building our own web app, architecting the system from scratch to overcome the performance limitations of EBS. We are optimizing the performance of an off-the-shelf CMS, so we can't do much about its dependence on I/O performance.

With the release of High I/O instances and EBS-Optimized instances and volumes, AWS has attempted to address this limitation. The former adds 2TB of SSD storage and the latter improves the performance consistency of standard EBS volumes. But both add significantly to the cost.

Consider the costs of a typical component in the platform:

Instance Type                     | Instance Monthly Cost | 140GB Storage | Total
AWS XLarge Instance               | $374                  | $14           | $388
AWS EBS-Optimized XLarge Instance | $410                  | $217*         | $627
AWS High I/O Instance with SSD    | $2232                 | n/a**         | $2232
Non-AWS SSD Instance              | $300                  | Included      | $300

* Configured for the maximum 2,000 IOPS
** Includes 2TB of SSD, currently the minimum configuration at AWS
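For what it's worth, the $217 storage figure is consistent with provisioned-IOPS EBS being billed per GB-month of storage plus per provisioned IOPS-month; the specific rates below are my assumption for that period, not quoted from AWS:

```python
# Assumed provisioned-IOPS EBS rates (illustrative, not quoted from AWS pricing):
PER_GB_MONTH = 0.125    # $ per GB-month of provisioned storage
PER_IOPS_MONTH = 0.10   # $ per provisioned IOPS-month

storage_gb = 140
provisioned_iops = 2000  # the maximum available at the time, per the table

monthly = storage_gb * PER_GB_MONTH + provisioned_iops * PER_IOPS_MONTH
print(f"${monthly:.2f}/month")  # $217.50/month, matching the $217* line
```

Note that at these rates the IOPS charge dwarfs the storage charge, which is why the EBS-Optimized row costs so much more than the plain EBS row for the same 140GB.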

While the AWS SSD instances are very performant, the associated resources (CPU, RAM, disk size) are wrong for our system design. The EBS-Optimized instances in the example are priced at the maximum available capacity of 2,000 IOPS. The SSD performance at competing providers is at least 10-50x that, depending on whose benchmarks you believe.

Our choice - cloud-based SSD servers

Our platform uses SSD for all primary services except our web servers' boot volume. Database, file serving (NFS) and search indexing services all run 100% SSD in production and in our backup platform. 

The improvement in performance and stability is significant and noticeable. The bottleneck at AWS was always I/O wait, and it was often so bad that it led to downtime. At AWS we striped four EBS volumes in a RAID 0 configuration to improve throughput, but in practice we were roughly 4x as likely to have the volume hang because of inconsistent I/O performance: all four volumes now had to perform consistently instead of just one.
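The "4x as likely" intuition is just the no-redundancy arithmetic: a RAID 0 stripe stalls if any member volume stalls, so with n independent volumes the hang probability is 1 - (1 - p)^n, which is approximately n*p when p is small. A quick sketch (the per-volume probability here is illustrative, not a measured figure):

```python
def array_hang_probability(p, n=4):
    """Probability that a RAID 0 stripe of n independent volumes stalls,
    given each volume hangs with probability p in some interval.
    RAID 0 has no redundancy, so one slow member stalls the whole array."""
    return 1 - (1 - p) ** n

p = 0.01  # illustrative per-volume hang probability
ratio = array_hang_probability(p, n=4) / p
print(round(ratio, 2))  # just under 4 for small p
```

The approximation degrades as p grows (the stripes' failures start to overlap), but for the rare-hang regime that matters here the risk scales almost linearly with the number of volumes in the stripe.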

Summary

We used AWS exclusively for our enterprise web content management system cloud hosting platform from 2007 until 2012. Over that period our average uptime was 99.67%, far from great. The majority of the downtime was caused by infrastructure and performance problems at AWS.

Since we switched to SSD, our uptime has been 99.97%, with the primary downtime being a planned maintenance event. With our new system design we would expect a maximum of 10 minutes of downtime before requests are automatically rerouted to our backup platform.

Performance and uptime are constantly constrained by cost and we continue to look for the best combination for our clients. 

Filed under: Hosting, CMS, Web Development