Adventures in AWS: How to break the EBS Snapshot feature

In this series, I explore some of the everyday challenges facing an AWS developer/sysadmin. This week: how does the EBS Snapshot API stand up under heavy load?

UPDATE 6/9: Turns out my original fix wasn’t quite enough to overcome the bug 🙂 See the revised solution below.

The Background
If you keep important data on EBS volumes, you’ll want to back them up using EBS’s nice Snapshot feature, which allows you to grab a point-in-time backup of a volume and dump it in S3. It’s an incremental backup, meaning you are only snapping–and being charged for–the blocks changed since the last snap. If you are using EBS for something fast-changing and critical like transaction logs, it’s not a bad idea to snap as often as possible. If you’re snapping many volumes many times a day, you’ll also want to tag them with an identifier or two for sanity’s sake. AWS allows you to place up to ten tags on a resource, so what happens if you have, say, fifty thousand snapshots in an account, each with four to six tags?
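For reference, creating a snapshot and tagging it with boto looks roughly like this (just a sketch; the volume ID and tag values are placeholders):

import boto.ec2
conn = boto.ec2.connect_to_region('us-east-1')
# 'vol-12345678' and the tag values below are placeholders for illustration
snap = conn.create_snapshot('vol-12345678', description='hourly txlog backup')
snap.add_tag('my_tag', 'txlog')
snap.add_tag('backup-window', 'hourly')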

The Problem
What happens is not good. The snapshot enumeration feature, both through the console and the API, gets REALLY slow and error-prone. You’ll start to see frequent errors in the console when trying to load your existing snapshots:

An error occurred fetching snapshot data: Unable to execute HTTP request: Read timed out

If you do get the snapshots to load, you may be performing what amounts to a denial-of-service attack on your own account; other users or jobs may not be able to access the snapshot feature while you are listing the snapshots. And you likely won’t be able to filter snapshots based on tags at all. Yes, you read that right: even with the ten-tag limit, it’s currently quite possible to outpace the number of tags the AWS API can index in the time it allots to processing a request.

I’ve reproduced this issue multiple times over the past few weeks. Here’s a Python/boto snippet I ran in an account with over 100k snapshot tags:

import boto.ec2
conn = boto.ec2.connect_to_region('us-east-1')
# config['my_tag'] holds the tag key we stamp on our snapshots
conn.get_all_snapshots(filters={'tag-key': config['my_tag']})

This code, which is supposed to return all snapshots tagged with ‘my_tag’, instead hung for a couple of minutes before boto raised an HTTP 500 error.
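If you’d rather not stare at a hung shell to see it, you can time the call and catch the error boto eventually raises; a quick sketch, assuming the failure surfaces as an EC2ResponseError the way it did for me:

import time
import boto.ec2
from boto.exception import EC2ResponseError
conn = boto.ec2.connect_to_region('us-east-1')
start = time.time()
try:
    # 'my_tag' is a placeholder for whatever tag key you filter on
    snaps = conn.get_all_snapshots(filters={'tag-key': 'my_tag'})
    print("got %d snapshots in %.1f seconds" % (len(snaps), time.time() - start))
except EC2ResponseError as e:
    # for me this came back as an HTTP 500 after a couple of minutes
    print("failed after %.1f seconds: HTTP %s (%s)" % (time.time() - start, e.status, e.reason))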

I alerted AWS and they confirmed that they’re aware of the problem, but don’t yet have a timeline on a fix. In the meantime, they suggested a workaround that I had already implemented.

The Solution
It turns out that if you eliminate tag filters from your API request entirely, you get a much more reliable response. This code succeeds even with nearly 100k snapshots requested:

import boto.ec2
conn = boto.ec2.connect_to_region('us-east-1')
snapshots = conn.get_all_snapshots()

Once you get all your snapshots in memory, you can do the tag filtering on the objects yourself:

for snapshot in snapshots:
    # one DescribeTags call per snapshot, scoped to this resource and our tag key
    tags = conn.get_all_tags(filters={"resource-id": snapshot.id, "key": config['my_tag']})
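Putting it together, one way to wrap this up is a small helper that returns only the snapshots carrying your tag; a sketch, with 'my_tag' standing in for whatever key lives in your config:

import boto.ec2

def snapshots_with_tag(conn, tag_key):
    # the unfiltered call is the reliable one; filter client-side afterwards
    matching = []
    for snapshot in conn.get_all_snapshots():
        # one DescribeTags call per snapshot, scoped to this resource and key
        if conn.get_all_tags(filters={"resource-id": snapshot.id, "key": tag_key}):
            matching.append(snapshot)
    return matching

conn = boto.ec2.connect_to_region('us-east-1')
tagged = snapshots_with_tag(conn, 'my_tag')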

Obviously, you’ll use more memory to store an object containing all snapshots than a filtered subset, not to mention the overhead of the additional API calls for tagging info, but until AWS fixes the underlying issue, this is probably the best option. Let me know if you have run into issues with EC2 Snapshots or other components of the AWS API!
