Serverless Workflows on AWS: My Journey From SWF to Step Functions

AWS Lambda functions can only run for a maximum of five minutes. This must be distinctly understood, or nothing wonderful can come of the story you are about to hear.

The Problem

This past summer, my team and I set out to build an internal software system used for deployment testing on AWS. The application would run a large number of workflow executions in parallel each night and might perform a few one-off executions during the day – maybe six hours total use out of every twenty-four, with only a small fraction of that time spent doing actual compute tasks. Trying to scale, manage and spend money on EC2 instances for that workload didn’t interest us. We wanted to run our whole workflow process end-to-end on AWS Lambda.

And we did. Heaven help us, we did. This is our story.

Helpful Background: Workflows on AWS

(You can skip this section if you know SWF already) 

If your cloud application has lots of moving parts and does asynchronous processing, you probably need some kind of distributed workflow service: an orchestration system that will help you hand out tasks to worker processes and manage central state.

If your cloud is AWS, you may already be familiar with their well-established workflow service, SWF. The “S” stands for “Simple”, which is a bit of a misnomer, but using SWF is at least easier than building a fully managed, highly available workflow engine on your own. SWF’s fundamental abstractions are the “decider” – a piece of compute logic that figures out what task to perform next – and “workers”, ancillary processes that do the decider’s bidding. Implementing these functions is up to you. SWF makes sure each task gets assigned exactly once, maintains a record of task state throughout the workflow, and generally acts like a helpful parent getting children out the door on time for school.

The SWF FAQ says that workers and the decider “can run on cloud infrastructure, such as Amazon EC2, or on machines behind firewalls.” SWF also provides a mechanism for invoking AWS Lambda functions as worker tasks. But the implicit assumption is that you will use a long-running, persistent server for your decider.

That last point did not sit well with my team. For the reasons mentioned above, we much preferred to put that decider on Lambda too.

Serverless Workflows on SWF

AWS Lambda functions, as I said, can only run for a maximum of five minutes. But the SWF architecture requires the decider to poll SWF regularly for what are called “decision tasks”: notifications that an activity worker has completed its duties and handed control back to the workflow. Our decider logic, when coordinating a more complicated deployment test, might have to do this polling for a couple of hours. How do you perform this long-running task on a very short-running Lambda function?

Our initial answer: CloudWatch Events. Each time we started a workflow, we set up a CloudWatch rule that invoked the decider Lambda function every couple of minutes. The decider would wake up, check SWF, and go back to sleep if nothing interesting was happening.
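
For the curious, here is a minimal sketch of what one "tick" of such a decider could look like. It is not our production code: the `decide` helper and the `domain` and `task_list` event keys are hypothetical names made up for illustration, and the optional `swf` argument exists only so the function can be exercised with a stub instead of a real AWS client.

```python
def decide(events):
    """Hypothetical placeholder: turn workflow history into SWF decisions."""
    # A real decider would inspect `events` and schedule the next activity.
    return []


def handler(event, context, swf=None):
    """Poll SWF once, answer any pending decision task, then exit."""
    if swf is None:
        import boto3  # imported lazily so the sketch can run against a stub

        swf = boto3.client("swf")

    # This call long-polls; SWF answers with an empty taskToken when
    # nothing is waiting on the task list.
    task = swf.poll_for_decision_task(
        domain=event["domain"],                 # assumed key in the rule's input
        taskList={"name": event["task_list"]},  # assumed key in the rule's input
    )
    if not task.get("taskToken"):
        return {"handled": False}  # nothing to do; go back to sleep

    swf.respond_decision_task_completed(
        taskToken=task["taskToken"],
        decisions=decide(task.get("events", [])),
    )
    return {"handled": True}
```

Every invocation pays for the poll whether or not there is work to do, which is where the trouble described below begins.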

The design looked something like this:

Figure 1. Our serverless SWF design, at PowerPoint levels of naivety

 

While our implementation of this design worked well enough for the first few months of the project’s life, its limitations quickly became clear:

  • Latency. CloudWatch rules cannot run more frequently than once a minute, and SWF scheduling delays made it advisable to invoke the decider even less frequently than that. This situation created significant latency between workflow actions, leading to inflated workflow times that were especially noticeable for workflows involving lots of short tasks.
  • Cost. Having to run the decider on a two-minute loop throughout the life of the workflow somewhat negated the cost advantages we hoped to get from using Lambda instead of EC2 in the first place, especially as our number of workflows scaled up.
  • Runtime State. Every time the Lambda decider function ran, it had to figure out where it was in the workflow process. SWF is supposed to make stateless execution easy, and it provides the complete deployment workflow history as a JSON blob when handing tasks to the decider, but the blob quickly becomes unmanageably large and filled with superfluous data. To keep track of what was going on in the workflows, we resorted to maintaining ephemeral state in a DynamoDB table, adding more latency and cost.
  • Retries/Error Handling. SWF, I regret to say, has bugs. Sometimes SWF completely fails to schedule a task, or does it so late that the workflow’s task timeout expires. Finding and catching these errors required even more state maintained outside of SWF.
  • Debugging. The SWF console’s workflow event views are difficult to read, oddly paginated and don’t provide much information, leading to a rabbit trail of log searches anytime something went wrong.
  • Code Maintainability. The combination of multiple state sources and ramifying failure scenarios, not to mention the central “hack” of running the SWF decider in a loop, led to a mess of one-off fixes and hacky workarounds in our codebase.
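
To put rough numbers on the latency point in the list above, here is a back-of-the-envelope model. The step count and polling interval are illustrative assumptions, not measurements from our system, and the model ignores the time the tasks themselves take.

```python
def polling_overhead_seconds(steps, poll_interval_s):
    """Worst case: every step transition waits one full polling interval."""
    return steps * poll_interval_s


# Hypothetical workflow: 15 short steps, decider invoked every 2 minutes.
worst_case = polling_overhead_seconds(steps=15, poll_interval_s=120)
print(worst_case / 60)      # 30.0 minutes of waiting in the worst case
print(worst_case / 2 / 60)  # 15.0 minutes if each wait averages half an interval
```

The overhead grows with the number of steps rather than with the amount of work, which is why workflows made of many short tasks suffer most.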

All in all, this first design was a good (or rather, very bad) example of what I’ll call “serverless for serverless’s sake”. In retrospect, if I’m being completely honest with myself, we would have done just as well to forgo the whole Lambda idea and run the decider on a set of EC2 instances.

But once that sweet serverless stuff hits your bloodstream, it’s pretty hard to quit. As use of our application scaled up and our back end got more and more unwieldy, we clung to a rumor we’d heard from our AWS liaison that some sort of “serverless-first” update to SWF was in the works. What we needed was a true serverless workflow system: one that didn’t require constant polling, could pass minimal state between Lambda functions without requiring additional database infrastructure and had no long-running components whatsoever. Though we didn’t quite realize it, what we needed was a state machine.

Enter Step Functions

Lo and behold, AWS announced a new service called Step Functions (SFN) during re:Invent in November 2016. Yep, it’s state machines – and a whole lot more. A Step Functions state machine processes JSON the way a sea cucumber processes sand. It hooks together the input and output of Lambda functions in one giant tube of state, all wrapped up in a JSON template definition that is grandiosely called the “Amazon States Language”. Here are a few of the best features of SFN:

  • Visual workflows

Step Functions ingests your JSON template and turns it into a real-time graphical view, helping you make sense of your state machine’s current, well, state.

Figure 2. The visual representation for one of our early workflow tests. Green means the step succeeded.

 

  • No need to talk to the orchestrator

SWF requires deciders to ask the system for tasks and workers to let the system know when they’re done. The beauty of SFN is that your business code doesn’t need to know anything about SFN. Each Lambda function is completely self-contained. It accepts input and returns output. The state machine just passes the baton.
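
A minimal sketch of that baton pass, assuming two hypothetical functions (`build_package` and `deploy_package`) and placeholder ARNs. Neither handler imports or calls anything from SFN. The state machine definition, written here as a Python dict and dumped to JSON, is the only place the two are wired together.

```python
import json


def build_package(event, context):
    """Plain Lambda handler: takes a dict, returns a dict. No SFN calls."""
    return {"artifact": f"{event['app']}-{event['version']}.zip"}


def deploy_package(event, context):
    """Receives the previous state's output as its input."""
    return {"deployed": event["artifact"]}


# Amazon States Language definition chaining the two functions.
# The Resource ARNs are placeholders, not real functions.
definition = {
    "StartAt": "Build",
    "States": {
        "Build": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:build_package",
            "Next": "Deploy",
        },
        "Deploy": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:deploy_package",
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```

Because the handlers are ordinary functions, the chain can be checked locally by feeding the output of one into the other, with no state machine involved.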

  • Easy implementation of dynamic backoff for async tasks

If your workflow has I/O-bound periods where it’s waiting on some external task, you don’t want to burn any more compute time than necessary. Not only does SFN let you include “wait states” that sit there counting sheep before calling your next Lambda function, but you can parameterize the duration of the wait period to build your own dynamic exponential backoff.
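
One way this could look, as a sketch under assumptions: the `wait_seconds` key, the starting value, the cap and the `CheckStatus` state name are all hypothetical. The Wait state reads its duration from the state data through `SecondsPath`, and a small Lambda function doubles that value on each pass.

```python
MAX_WAIT_SECONDS = 300  # assumed cap so the backoff cannot grow without bound


def next_wait(event, context):
    """Double the wait carried in the state data, capped at MAX_WAIT_SECONDS."""
    current = event.get("wait_seconds", 5)
    return {**event, "wait_seconds": min(current * 2, MAX_WAIT_SECONDS)}


# Wait state (Amazon States Language, written as a Python dict) that sleeps
# for however many seconds the previous function left in the state data.
wait_state = {
    "Type": "Wait",
    "SecondsPath": "$.wait_seconds",
    "Next": "CheckStatus",  # hypothetical name of the state that polls again
}
```

Starting from 5 seconds, successive passes would wait 10, 20, 40 seconds and so on until the cap is reached.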

Needless to say, we took one look at Step Functions and knew that our poor, hardworking SWF solution was history. A little over a month later, we’ve rewritten our entire application to use SFN, and we’re loving the results.

 

SFN vs SWF

SFN state machines have already demonstrated improvement over our old SWF solution in the following areas:

  • Latency. One of our workflows with several short deployment steps took thirty minutes to run under the old system. It takes — no kidding — thirty seconds to run to completion on SFN.
  • Cost. No SWF decider means no Lambda functions treated like persistent compute, which radically lowers cost. Moreover, the state machine structure of SFN means no calls to a DynamoDB state table.
  • Runtime State. The state machine seamlessly transfers the minimal amount of data needed between Lambda functions, eliminating the need for other logs or database tables to hold runtime state.
  • Retries/Error Handling. SFN retries failed executions and easily reroutes errors.
  • Debugging. Visual workflows help quickly pinpoint solutions to state machine problems.
  • Code Maintainability. Our codebase has now been completely refactored to take advantage of SFN’s better abstractions and streamlined requirements. I think there’s about half as much code as there was before.
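
As an illustration of the retry and error-handling point, here is roughly what a Task state with `Retry` and `Catch` fields looks like in the Amazon States Language, again written as a Python dict. The state names, the ARN and the specific numbers are assumptions chosen for the example, not our actual configuration.

```python
deploy_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:deploy",  # placeholder
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,  # delay before the first retry
            "MaxAttempts": 3,      # number of retries, not total attempts
            "BackoffRate": 2.0,    # multiplier applied to each later delay
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",  # keep the original input, attach the error
            "Next": "NotifyFailure",  # hypothetical failure-handling state
        }
    ],
    "Next": "RunTests",  # hypothetical next state on success
}


def retry_delays(retrier):
    """Delays in seconds before each retry implied by one Retry entry."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


print(retry_delays(deploy_state["Retry"][0]))  # [2.0, 4.0, 8.0]
```

Once the retries are exhausted, the `Catch` entry routes the execution to another state instead of failing the whole workflow.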

Of course, SFN is still a new service with plenty of quirks and feature gaps. A couple of current things to keep in mind:

  • As of this writing, CloudFormation does not yet support Step Functions, so all SFN automation must happen in code.
  • SFN has had some issues with properly deleting state machines. The AWS folks appear to be actively working on the issue.
  • SFN is only available in a few regions so far, so make sure to check this list before getting started. However, AWS has already hinted that Step Functions will displace SWF as their go-to workflow solution in the future, so I wouldn’t worry about widespread availability for long.

Lessons from the journey

The world of serverless changes fast. Keep in mind that AWS Lambda itself is barely two years old, and most of the ecosystem around it is quite a bit more recent. Just because a service you need isn’t around today doesn’t mean it won’t exist tomorrow. In the meantime, living a bit past the bleeding edge certainly has drawbacks, but there are advantages too. Without the knowledge we gained from several months of trying to fit the round peg of serverless workflows into the square hole of SWF, we wouldn’t have been able to make such a fast cutover to Step Functions.

And I’m more than happy to share that knowledge. So please hit me up if you have questions about any of the technologies discussed in this post. I’d love to chat with you about serverless workflows on AWS!

 

11 thoughts on “Serverless Workflows on AWS: My Journey From SWF to Step Functions”

  1. Daniel Kibler says:

    For some time now it seemed to me that SWF felt orphaned. Your post validates that. I use SWF, along with Lambda and other services, for what I call a process scheduler. It essentially runs processes in a proprietary system running on EC2. Do you see some way in the SFN architecture to replace the SWF worker for running code on EC2 instances?

    1. forrestbrazeal says:

      A Step Functions activity task can be a process running on EC2, just like in the SWF paradigm. From the SFN FAQ:

      “Q: How does AWS Step Functions work with Amazon EC2 and other compute resources?

      All work in your state machine is done by tasks. A task may be an Activity, which can consist of any code in any language. Activities can be hosted on Amazon EC2, Amazon ECS, mobile devices—basically any computer that can communicate with the AWS Step Functions API. Activities long-poll Step Functions using API calls to request work, receive input data, do the work, and return a result.”

      Essentially, SFN can do anything that SWF can do, and more. That’s why AWS is recommending it for all new workflows.

  2. pagemedias says:

    Can you help me understand how you ‘execute’ your step functions?

    Is it possible to set up a schedule within SFN to run a step function every 1 or 10 mins? Or do you need to build something outside of SFN to manage the scheduling/orchestration of step functions?

    Thanks!

    1. forrestbrazeal says:

      Good question!

      You will need to trigger your state machine via an API call from an external source. We run the state machine execution call from a Lambda function that can be triggered by a couple different sources:

      1. A CloudWatch Events rule (for scheduled executions, like you are asking about)
      2. A user-initiated event from our application’s front end, delivered through API Gateway.
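
      A minimal sketch of such a trigger function, assuming the state machine ARN and the payload arrive in the triggering event under the hypothetical keys `state_machine_arn` and `payload`. The optional `sfn` argument is there only so the function can be tried with a stub client.

```python
import json
import uuid


def handler(event, context, sfn=None):
    """Start one state machine execution and return its ARN."""
    if sfn is None:
        import boto3  # imported lazily so the sketch can run against a stub

        sfn = boto3.client("stepfunctions")

    response = sfn.start_execution(
        stateMachineArn=event["state_machine_arn"],    # assumed event key
        name=f"run-{uuid.uuid4()}",                    # a unique execution name
        input=json.dumps(event.get("payload", {})),    # input must be a JSON string
    )
    return {"executionArn": response["executionArn"]}
```

      The same handler can sit behind both a scheduled rule and an API Gateway route, as long as each source supplies the two keys.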

  3. Kyle says:

    Thanks for sharing this. The docs around this stuff seem to be a little lacking at the moment, so this was very helpful for me.

    Do you ever find an execution getting stuck and requiring some kind of manual intervention?

    I’m looking at using this to do a new customer onboarding, which requires talking to a bunch of disparate systems that may or may not be online. I’m wondering how I would handle a scenario where an execution gets stuck and runs out of retries. Can you manually invoke a retry somehow?

    1. forrestbrazeal says:

      Hi Kyle,

      Thanks for reading!

      Not quite sure what you mean by an execution getting “stuck”. Do you mean that the execution of a step fails? If it fails and you run out of retries, you can fall through to a “catch” state that redirects your flow of logic.

      I have also seen a couple of cases where Step Functions literally gets stuck and hangs during state machine execution, usually because Lambda itself fails to invoke a requested function. To handle something like that, you should be able to set a timeout on the step itself that will trigger a failure after a specified period of time. This behavior is similar to the old SWF “activity timeouts”.
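
      As a sketch, a step-level timeout could look like this in the state machine definition (shown as a Python dict). The five-minute value, the ARN and the state names are assumptions for illustration.

```python
onboarding_step = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:call_system",  # placeholder
    "TimeoutSeconds": 300,  # fail the step if it runs longer than this
    "Catch": [
        {
            "ErrorEquals": ["States.Timeout"],
            "Next": "HandleStuckStep",  # hypothetical recovery state
        }
    ],
    "Next": "NextSystem",  # hypothetical next state on success
}
```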

      Does that help?

  4. Thanks for sharing. I am excited about Step Functions as well. One of my use cases is integration testing. I want to launch a CFT, create an approval workflow for which I can use an activity, run a test, then destroy the stack. One of the activities waits until the CFT is complete and the instance is ready after executing some user data. I can hopefully use wait conditions/cfn-signal to declare readiness. I am thinking of verifying CFT creation with a Retry task and letting it pass based on the CREATE_COMPLETE status of the CFT. I’m not sure whether the state machine language supports any kind of while-type construct. If you have better ideas, please suggest them.

    1. forrestbrazeal says:

      Loops are easy to implement in SFN. Write a state that checks your end condition (in this case, the status of the CF stack) and send the output to a “choice” state. If your condition is met, continue on to the next step in the workflow. If the stack creation is still in progress, go to a “wait” state (you can parameterize the length of the wait) and then call your check-progress state again at the end of the wait.
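
      Here is a sketch of that loop. The state names, the `status` and `wait_seconds` keys and the ARN are hypothetical, and `RunTest` is a bare Succeed state standing in for the rest of the workflow. The small `walk` function is a toy written only for this definition, not an Amazon States Language interpreter. It skips the actual sleeping so the loop can be traced locally.

```python
definition = {
    "StartAt": "CheckStack",
    "States": {
        "CheckStack": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:check_stack",
            "Next": "IsStackReady",
        },
        "IsStackReady": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.status",
                    "StringEquals": "CREATE_COMPLETE",
                    "Next": "RunTest",
                }
            ],
            "Default": "WaitForStack",  # anything else: wait and check again
        },
        "WaitForStack": {
            "Type": "Wait",
            "SecondsPath": "$.wait_seconds",
            "Next": "CheckStack",
        },
        "RunTest": {"Type": "Succeed"},
    },
}


def walk(definition, tasks, data):
    """Toy local walker for this definition only; Wait states do not sleep."""
    name, visited = definition["StartAt"], []
    while True:
        state = definition["States"][name]
        visited.append(name)
        if state["Type"] == "Succeed":
            return visited, data
        if state["Type"] == "Task":
            data = tasks[name](data)  # call the local stand-in for the Lambda
            name = state["Next"]
        elif state["Type"] == "Choice":
            rule = state["Choices"][0]
            key = rule["Variable"][2:]  # strip the leading "$."
            matched = data.get(key) == rule["StringEquals"]
            name = rule["Next"] if matched else state["Default"]
        else:  # Wait
            name = state["Next"]


def fake_check_stack(data):
    """Stand-in for the status check: reports completion on the third poll."""
    polls = data.get("polls", 0) + 1
    status = "CREATE_COMPLETE" if polls >= 3 else "CREATE_IN_PROGRESS"
    return {**data, "polls": polls, "status": status}


visited, result = walk(definition, {"CheckStack": fake_check_stack}, {"wait_seconds": 30})
```

      In this trace the walker passes through `WaitForStack` twice before the choice state lets it continue to `RunTest`.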
