Resilience on AWS

Explore how to build resilient applications on AWS

What is Resilience?

Resilience is the ability of a system to withstand and recover from failures, disruptions, or unexpected events. For software in the cloud, it encompasses a range of characteristics that maintain the functionality, integrity, and robustness of applications even under challenging conditions. For example, keeping your application functional through faults and load spikes requires a continuous process of design decisions, observability, and assessment. The resilience resources on this page provide techniques, tutorials, and examples to help you build applications on AWS that are ready for your customers.

  • Discover essential concepts for building resilient applications in the cloud on AWS, along with plenty of links to enable you to go deeper on these topics.

    1. What is resilience - It is about resisting faults and load spikes and remaining available.
    2. How to prevent faults from becoming failures - Faults happen all the time. How do you prevent them from becoming failures in your application in the cloud?
    3. How to think about resilience - We like to think about it as a three-part model. This helps you understand the different strategies used to mitigate different types of faults.
    4. How does the cloud help you build resilient applications - Learn about the tools and automation offered by the cloud to implement resilience best practices.
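Item 2 above asks how to keep faults from becoming failures. One common technique is to retry transient faults with capped exponential backoff and jitter. The sketch below is a minimal illustration and is not taken from this page: `call_with_retries`, `make_flaky`, and all parameter values are hypothetical, and `ConnectionError` stands in for whatever transient error your client raises.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call `operation`, retrying transient faults with capped backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # the fault persisted, so it surfaces as a failure
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


def make_flaky(fail_times):
    """Hypothetical dependency that raises `fail_times` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise ConnectionError("transient fault")
        return "ok"

    return operation


# Two transient faults are absorbed; the caller only sees the final result.
result = call_with_retries(make_flaky(2), sleep=lambda seconds: None)
```

Retries only suit operations that are safe to repeat, and the attempt cap keeps a persistent fault from turning into an unbounded retry storm.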

Resilience Foundations

Building resilient applications on AWS involves a holistic approach that integrates key resources and frameworks. Starting with the "Reliability Pillar of the AWS Well-Architected Framework," builders can learn the best practices for creating resilient cloud workloads. The "Resilience Analysis Framework" deepens this understanding by highlighting crucial failure modes and the trade-offs involved in implementing mitigations. To maintain and enhance resilience, the "Resilience Lifecycle Framework" presents a continuous improvement strategy across five stages. Finally, the AWS Resilience Hub helps developers assess and refine their applications' resilience using AWS best practices and automated solutions. Together, these resources provide a comprehensive path to achieving and sustaining resilient applications on AWS.

  • The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building workloads on AWS. This pillar whitepaper documents the best practices you need to build resilient applications on the cloud.

  • Use resilience analysis to understand which failure modes are most important to protect your application against. This whitepaper introduces the SEEMS model, covering five common failure categories, with each letter in SEEMS standing for one of these failure modes: single points of failure, excessive load, excessive latency, misconfigurations and bugs, and shared fate.
  • A continuous lifecycle enables you to always improve the resilience of your application. Based on years of working with customers and internal teams, this framework outlines five key stages and the activities in each to keep your application resilient.

    AWS re:Invent 2023 - Resilience lifecycle: A mental model for resilience on AWS (48 mins)
  • The AWS Resilience Hub offers several capabilities to improve the resilience of your applications on AWS. It assesses your application against AWS Well-Architected best practices for resilience and gives specific guidance on how to improve your resilience posture. It also gives you templates to easily deploy new CloudWatch alarms, Fault Injection Service experiments, and automated runbooks in Systems Manager. With these, you can monitor and test your resilience, as well as automate actions that are part of a resilience strategy.
    Workshop

    AWS Resilience Hub Workshop

    The goal of this workshop is to walk through the various functionalities of AWS Resilience Hub. By the end of the workshop you should have an understanding of the different service components and how to use the service to assess your workload resiliency.
    Blog

    Building Resilient Well-Architected Workloads Using AWS Resilience Hub

    AWS Resilience Hub is a new service that helps you understand and improve the resiliency of your workloads using AWS Well-Architected best practices.
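The SEEMS categories from the Resilience Analysis Framework described above can be used as a simple review checklist. The following is a minimal sketch of that idea, not part of the whitepaper: the `FailureCategory` enum, the `uncovered_categories` helper, and the sample review notes are all hypothetical.

```python
from enum import Enum


class FailureCategory(Enum):
    """The five SEEMS failure categories named in the text."""
    SINGLE_POINT_OF_FAILURE = "single points of failure"
    EXCESSIVE_LOAD = "excessive load"
    EXCESSIVE_LATENCY = "excessive latency"
    MISCONFIGURATION_AND_BUGS = "misconfigurations and bugs"
    SHARED_FATE = "shared fate"


def uncovered_categories(mitigations):
    """Return the SEEMS categories that have no recorded mitigation."""
    return [category for category in FailureCategory
            if not mitigations.get(category)]


# Hypothetical review notes for one component of a workload.
notes = {
    FailureCategory.SINGLE_POINT_OF_FAILURE: ["run instances in several AZs"],
    FailureCategory.EXCESSIVE_LOAD: ["rate limiting in front of the service"],
}
gaps = uncovered_categories(notes)
for category in gaps:
    print(f"No mitigation recorded for: {category.value}")
```

Running the review per component makes it easier to see which failure modes still lack a mitigation and where the trade-offs need discussion.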

High Availability (HA)

High Availability (HA) takes a proactive approach to resilience. It's about designing your systems in such a way that they can automatically recover from common failures without human intervention. This could mean duplicating critical components, balancing loads across multiple servers, or using cloud services that can reroute traffic in the event of a network blip. High Availability (HA) is all about reducing the probability of a significant impact on your services due to small, frequent issues.
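To see why duplicating critical components reduces the probability of impact, consider a rough availability calculation. The sketch below assumes that copies fail independently and that any single healthy copy can serve all traffic; both are simplifying assumptions, and the function names and numbers are illustrative only.

```python
def parallel_availability(component_availability, copies):
    """Availability of `copies` redundant components.

    Assumes independent failures and that one healthy copy is enough.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


def expected_downtime_minutes_per_year(availability):
    """Convert an availability fraction into expected minutes of downtime per year."""
    return (1.0 - availability) * 365 * 24 * 60


single = parallel_availability(0.99, 1)
redundant = parallel_availability(0.99, 2)
print(f"one copy:   {expected_downtime_minutes_per_year(single):.0f} min/year")
print(f"two copies: {expected_downtime_minutes_per_year(redundant):.0f} min/year")
```

In practice, copies that share a dependency do not fail independently, so real gains are smaller than this calculation suggests.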

Whitepaper

Availability and Beyond: Understanding and Improving the Resilience of Distributed Systems on AWS

This paper outlines a common understanding of availability as a measure of resilience, establishes rules for building highly available workloads, and offers guidance on how to improve workload availability.

Gray Failures

Binary failure events are typified by a resource either working or not working. Detecting and mitigating them can be straightforward. Gray failures are a different story: the system may sometimes fail and sometimes not, and the symptoms can be subtle and defy quick, definitive detection. The following resources can help you detect and mitigate gray failures:

  • Advanced Multi-AZ Resilience Patterns - Detecting and Mitigating Gray Failures
  • What Happened to My Car? Understanding Gray Failures
  • Fix Gray Failures Fast Using Automation and Route 53 ARC Zonal Shift
  • Detecting gray failures with outlier detection in Amazon CloudWatch Contributor Insights
  • Rapidly recover from application failures in a single AZ
  • Advanced Multi-AZ Resilience Patterns
  • ARC310 | Detecting and mitigating gray failures
  • COP343 | Building Observability to increase resiliency

ARC309 | Build applications that recover from an Availability Zone impairment: This session and ARC301 pair well together. In this breakout, you'll learn about Amazon Route 53 Application Recovery Controller zonal shift. The name is a mouthful, but the service is powerful: it gives you control over which AZs are in or out for your application (that is, which ones receive traffic). Using the monitoring techniques covered in this session, you'll be able to detect when an AZ needs to be taken out of service, learn how to take it out, and keep healthy AZs online to serve your customer traffic. (Video)
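The detect-then-shift idea described above can be sketched in a few lines. This is a simplified illustration, not the CloudWatch or Route 53 ARC API: the function names, the thresholds, and the zone names are hypothetical. A zone is flagged when its error rate is well above the median of the other zones, and traffic weights are then recomputed so that only the remaining zones receive traffic.

```python
from statistics import median


def find_outlier_zones(error_rates, ratio=3.0, min_rate=0.01):
    """Return zones whose error rate stands out from the other zones.

    A zone is an outlier when its rate is at least `min_rate` and more than
    `ratio` times the median rate of all other zones.
    """
    outliers = []
    for zone, rate in error_rates.items():
        others = [r for z, r in error_rates.items() if z != zone]
        if not others:
            continue  # a single zone has nothing to be compared against
        if rate >= min_rate and rate > ratio * median(others):
            outliers.append(zone)
    return sorted(outliers)


def shift_away(zone_weights, impaired_zone):
    """Return new traffic weights with `impaired_zone` set to 0.

    The remaining zones keep their relative proportions and sum to 1.
    """
    if impaired_zone not in zone_weights:
        raise KeyError(impaired_zone)
    healthy = {z: w for z, w in zone_weights.items()
               if z != impaired_zone and w > 0}
    if not healthy:
        raise RuntimeError("no healthy zone left to receive traffic")
    total = sum(healthy.values())
    shifted = {z: 0.0 for z in zone_weights}
    for zone, weight in healthy.items():
        shifted[zone] = weight / total
    return shifted


# Hypothetical per-zone error rates: one zone fails far more often than its peers.
rates = {"az1": 0.002, "az2": 0.003, "az3": 0.04}
weights = {"az1": 1, "az2": 1, "az3": 1}
for zone in find_outlier_zones(rates):
    weights = shift_away(weights, zone)
```

Comparing each zone against its peers, rather than against a fixed threshold, avoids flagging a single zone when the whole Region is degraded. Shifting away only helps if the remaining zones have enough capacity to absorb the extra traffic.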

Disaster Recovery (DR)

Disaster Recovery (DR) is your safety net. It's the process and policies you put in place to recover from catastrophic events that can cause extended outages, such as natural disasters, cyberattacks, or significant technical failures. The goal here is to minimize downtime and data loss by having a robust backup and restore strategy. This involves not just backing up data but ensuring you can quickly restore operations, possibly in a different geographic location if necessary.
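One way to reason about minimizing downtime and data loss is to compare a backup schedule against explicit targets. The sketch below is a minimal illustration under stated assumptions: the targets `max_data_loss` and `max_downtime`, the function names, and all durations are hypothetical and do not come from this page.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, disaster_time):
    """Data written after the last backup is what a restore cannot recover."""
    if disaster_time < last_backup:
        raise ValueError("disaster_time is earlier than last_backup")
    return disaster_time - last_backup


def meets_objectives(backup_interval, restore_duration,
                     max_data_loss, max_downtime):
    """Check a backup and restore plan against data loss and downtime targets.

    Worst case, a disaster strikes just before the next backup, so up to one
    full backup interval of data is lost.
    """
    return backup_interval <= max_data_loss and restore_duration <= max_downtime


# Hypothetical plan: hourly backups and a 30-minute restore, measured against
# targets of at most 4 hours of lost data and 1 hour of downtime.
plan_ok = meets_objectives(
    backup_interval=timedelta(hours=1),
    restore_duration=timedelta(minutes=30),
    max_data_loss=timedelta(hours=4),
    max_downtime=timedelta(hours=1),
)
lost = data_loss_window(datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 45))
```

The restore duration is only trustworthy if it has been measured in an actual restore test, ideally into the alternate location you would use in a real disaster.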

Continuous Improvement

Resilience is not a set-it-and-forget-it feature. It requires a commitment to continuous improvement. This means regularly testing your systems' ability to recover from failures, often by deliberately injecting faults, a practice known as chaos engineering. It also involves monitoring your systems in real time to quickly identify and address issues. By making resilience testing a part of your continuous deployment pipeline, you ensure that your architecture can adapt to new challenges and remain robust against unforeseen threats.
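A resilience test in a deployment pipeline can be as small as injecting a fault into a dependency and checking that the application degrades gracefully and then recovers. The sketch below is a self-contained illustration of that pattern; `FaultyDependency`, `PriceService`, and `run_experiment` are hypothetical names, and the example does not use AWS Fault Injection Service.

```python
class FaultyDependency:
    """Hypothetical downstream client with a switch for injecting faults."""

    def __init__(self):
        self.inject_fault = False

    def fetch_price(self, item):
        if self.inject_fault:
            raise TimeoutError("injected fault")
        return 42.0


class PriceService:
    """Serves live prices and falls back to the last known value on a fault."""

    def __init__(self, dependency):
        self.dependency = dependency
        self.cache = {}

    def get_price(self, item):
        try:
            price = self.dependency.fetch_price(item)
            self.cache[item] = price
            return price, "live"
        except TimeoutError:
            if item in self.cache:
                return self.cache[item], "cached"
            raise  # no fallback available, so the fault becomes a failure


def run_experiment():
    """Observe steady state, inject a fault, then verify recovery."""
    dependency = FaultyDependency()
    service = PriceService(dependency)
    steady = service.get_price("widget")
    dependency.inject_fault = True
    during_fault = service.get_price("widget")
    dependency.inject_fault = False
    recovered = service.get_price("widget")
    return steady, during_fault, recovered


steady, during_fault, recovered = run_experiment()
assert during_fault[1] == "cached", "service should degrade gracefully"
assert recovered[1] == "live", "service should return to normal"
```

Running an experiment like this on every deployment catches regressions in fallback behavior before they reach customers.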
