AWS Public Sector Blog

How Pearson improves its resilience with AWS Fault Injection Service


Introduction

Application owners treat the rapid development of new applications and the enhancement of existing ones as a top business priority. However, for mission-critical applications, whose availability is vital for business operations, owners must also dedicate time to designing for resilience and testing it rigorously. To tackle this challenge, you need to design your systems with a comprehensive understanding of how your workloads will perform in real-world scenarios. The complexity of distributed systems, with numerous interacting components, makes validating workload behavior challenging.

This is where resilience testing comes into play. Chaos engineering is one component of resilience testing that helps assess and enhance the robustness of systems. Although chaos engineering is often misunderstood as intentionally breaking the production environment, it aligns with the Amazon Web Services (AWS) Well-Architected Reliability pillar. Its purpose is to methodically simulate real-world disruptions in a controlled manner, spanning service providers, infrastructure, workloads, and individual components. The goal is to learn from faults, observe and measure system behavior, enhance workload resilience, validate alerts, and ensure timely incident notifications. Consistent application of chaos engineering tests reveals vulnerabilities in your workloads, allowing teams to address the weaknesses that are crucial for maintaining system availability, performance, and overall operational integrity.

In this blog post we show how Pearson PLC, an AWS education technology (EdTech) customer, successfully implemented resilient architectures through chaos engineering using AWS Fault Injection Service (FIS). Pearson is a British multinational publishing and education company whose mission is to help people make progress in their lives through learning. Pearson’s higher education (HigherEd) business unit is a market leader in producing digital learning material for colleges and universities in the US and across other major markets. Given the complexity of their distributed system architecture, Pearson recognized the necessity of establishing a comprehensive chaos engineering strategy to ensure system reliability.

Pearson’s commitment to enhancing customer experiences in the face of critical system failures drove their adoption of chaos engineering. They understood that without it, identifying the root causes of production failures could be a time-consuming process, impacting both system stability and user satisfaction. Pearson’s foremost objective was to guarantee safety and controlled execution. This required the implementation of safeguarding techniques, well-defined risk tolerance levels, and robust rollback mechanisms. Additionally, they prioritized seamless observability through comprehensive logging and monitoring, to leverage valuable data-driven insights and drive continuous improvements in system performance. Pearson HigherEd automated chaos engineering by integrating FIS into their continuous integration and continuous delivery (CI/CD) pipelines, establishing a cohesive system encompassing a diverse range of components.

Solution overview

Chaos engineering aims to improve system reliability and resilience by simulating controlled failures: faults are introduced and quickly reverted while the system’s behavior during the experiment is analyzed. The HigherEd team chose FIS because it aligns well with the AWS services in their environment. The team carefully considered execution methods to prevent unintended disruptions, including options such as a sidecar or AWS Systems Manager documents to accomplish experiments with actions that are not natively supported by the service today.

Integration with observability through Amazon CloudWatch allows them to monitor metrics such as performance, response times, and error rates during experiments. The ability to add validations and stop conditions in FIS, tied to CloudWatch alarms, to assert on system behavior and end experiments automatically helped Pearson achieve greater control over the automation process.
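
As an illustration of how a stop condition ties an experiment to a CloudWatch alarm, the fragment below is a minimal sketch in Python with boto3; the alarm name and ARN are hypothetical placeholders, not Pearson's actual configuration.

```python
# Minimal sketch: a stop condition that halts an AWS FIS experiment when a
# CloudWatch alarm (for example, on error rate or p99 latency) enters ALARM.
# The alarm ARN below is a hypothetical placeholder.
stop_conditions = [
    {
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:api-error-rate-high",
    }
]
# This list is passed as the stopConditions argument of
# fis.create_experiment_template(); if the alarm fires while the experiment is
# running, FIS stops injecting the fault and moves the experiment to "stopped".
```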

To measure chaos engineering’s effectiveness, Pearson established metrics and benchmarks to quantify improvements and identify areas needing attention in their distributed system architecture. They created a flexible configuration framework for each service, recognizing metric variations across use cases. Pearson also adopted AWS Step Functions, due to its alignment with their use case and its automation potential.

“Our architecture and platform engineering team for HigherEd is committed to delivering the most dependable and efficient services and applications to our customers,” said Shridhar Navanageri, Pearson’s vice president of architecture and platform engineering. “To achieve this, we integrated FIS into our CI/CD pipelines in GitLab, enabling us to conduct chaos experiments seamlessly and automatically. FIS is a powerful tool that helps us simulate faults and inject failures in our systems, which in turn allows us to identify and rectify any underlying weaknesses — before they affect our customers.”

Solution architecture

Pearson uses AWS services such as Amazon Elastic Container Service (Amazon ECS) with an AWS Fargate compute environment, Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3) buckets, and Amazon ElastiCache, among others. Their teams run all their CI/CD pipelines using GitLab and integrate those pipelines with FIS to provide teams the ability to perform chaos engineering on their workloads before a new deployment.

The team created templates for common experiments to be carried out by FIS and defined parameters for teams to fill in when needed. Once the user or developer chooses the experiment parameters, the GitLab pipeline initiates two jobs: a "Create Experiment" job that uses AWS CloudFormation to configure and set up the experiment in FIS, and a "Start Experiment" job that injects the fault into the application.
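
To make the two jobs concrete, the sketch below shows an equivalent flow at the API level using Python and boto3. Pearson's pipeline performs the first step with CloudFormation rather than direct API calls, and every name, ARN, and tag here is a placeholder.

```python
# Illustrative sketch of the two pipeline stages using the AWS FIS API.
# Pearson's "Create Experiment" job does this via CloudFormation; this is an
# equivalent API-level flow with placeholder names and ARNs.
import uuid

import boto3

fis = boto3.client("fis")


def create_experiment(actions: dict, targets: dict,
                      stop_conditions: list, role_arn: str) -> str:
    """'Create Experiment' stage: register an experiment template in FIS."""
    response = fis.create_experiment_template(
        clientToken=str(uuid.uuid4()),
        description="Chaos experiment created from the CI pipeline",
        roleArn=role_arn,  # IAM role that FIS assumes to inject the fault
        actions=actions,
        targets=targets,
        stopConditions=stop_conditions,
        tags={"pipeline": "gitlab-ci"},
    )
    return response["experimentTemplate"]["id"]


def start_experiment(template_id: str) -> str:
    """'Start Experiment' stage: inject the fault into the application."""
    response = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    return response["experiment"]["id"]
```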

In the performance testing environment, production load is replicated by executing a JMeter script on an Amazon Elastic Compute Cloud (Amazon EC2) instance. This script generates load on the Application Load Balancer that uses Amazon ECS service endpoints as targets. When the load stabilizes at its baseline, users can select the specific experiment type and its associated attributes to integrate into the GitLab CI pipeline. The goal is to embed chaos engineering testing as an automated test like functional and performance tests.

When the experiment finishes, the built-in integration sends email notifications with a report of the findings. Today, this architecture is built for testing (pre-production) workloads, but the Pearson team is aiming to run it in a production or production-like environment, where they would need to set stop conditions during those experiments to avoid significant impact on their customers.
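
As one way for a pipeline to know that an experiment has reached a terminal state before reporting on it, the following sketch polls the FIS GetExperiment API; it is an assumed pattern for illustration, not Pearson's notification integration.

```python
# Illustrative sketch: poll an AWS FIS experiment until it reaches a terminal
# state so a pipeline stage can attach the outcome to its report.
import time

import boto3

fis = boto3.client("fis")

TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(experiment_id: str, poll_seconds: int = 30) -> dict:
    """Return the final state of the experiment, for example
    {'status': 'stopped', 'reason': 'Stop condition triggered'}."""
    while True:
        state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
```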

Figure 1 shows the architectural diagram for the solution described in the preceding paragraphs.

Figure 1. Architectural diagram of the solution described in this blog post. The major components are Amazon ECS using Fargate tasks, FIS, DynamoDB, Amazon S3, and an Amazon EC2 instance using JMeter.

Using this solution, Pearson empowered their teams to autonomously integrate the fault experiments into their applications, independent of reliance on other teams, and significantly expedited their ability to iterate and enhance their solutions. This agility is critical for achieving desired outcomes, especially when dealing with real dependencies and scenarios in production environments.

Figure 2 shows the flow of Pearson’s GitLab pipeline. The developer pushes the code; it is built and deployed, then functional testing runs, followed by performance testing and, finally, the chaos experiments. If everything succeeds, the artifact is certified.

Figure 2. Diagram of the CI/CD pipeline Pearson uses in GitLab. The phases are shown in order: build, deploy, functional testing, performance testing, and chaos engineering before the artifact is finalized.

Pearson now has several different experiments ranging from CPU or memory stress to DynamoDB connectivity disruption or even loss of communication within a virtual private cloud (VPC) to a specific Availability Zone (AZ). For instance, one experiment type might involve stopping Amazon ECS tasks, and the associated attribute could specify the percentage of tasks to stop. Let us look at a few of the experiments.

1. CPU or memory stress experiment

For this type of experiment on Fargate tasks, Pearson uses Systems Manager documents that induce load on the Amazon ECS tasks, and they check the metrics in CloudWatch for the resulting impact.

Parameters (CPU stress):
Fault injection type: CPU stress
Fault injection attributes: cluster name, service name, load amount, duration, account number

Parameters (memory stress):
Fault injection type: memory stress
Fault injection attributes: cluster name, service name, load amount, duration, account number

In Figure 3 we show how the experiment is run. Building on the base architecture, the Amazon EC2 instance running JMeter generates load that is injected into the Fargate tasks running the workload. In parallel, FIS runs the Systems Manager document for CPU/memory stress.

Figure 3. Architectural diagram of the solution described when a CPU/memory stress is injected into Fargate tasks using FIS.
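
One way FIS supports this kind of stress is through its ECS task actions, which run a stress document via the SSM agent inside the task. The sketch below shows roughly how the actions and targets for a CPU stress experiment can be defined; the cluster name, service name, and percentages are placeholders, and Pearson's own Systems Manager documents and templates may differ.

```python
# Sketch of an actions/targets definition for CPU stress on Fargate tasks,
# using the aws:ecs:task-cpu-stress action. All names and values are
# illustrative placeholders.
cpu_stress_actions = {
    "stress-cpu": {
        "actionId": "aws:ecs:task-cpu-stress",
        "parameters": {
            "duration": "PT5M",  # how long to keep the CPU loaded
            "percent": "80",     # target CPU load per task
        },
        "targets": {"Tasks": "app-tasks"},
    }
}
cpu_stress_targets = {
    "app-tasks": {
        "resourceType": "aws:ecs:task",
        "parameters": {
            "cluster": "highered-cluster",  # placeholder cluster name
            "service": "catalog-service",   # placeholder service name
        },
        "selectionMode": "PERCENT(50)",  # stress half of the running tasks
    }
}
# These maps are passed as the actions and targets arguments of
# create_experiment_template() shown earlier.
```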

2. Fargate task interruption experiment

The Fargate task interruption experiment uses the aws:ecs:stop-task action within FIS to stop target tasks within their cluster. The Pearson team then uses CloudWatch to check how the application performed during the experiment.

Parameters:
Fault injection type: Stop_Task
Fault injection attributes: cluster name, service name, selection mode, account number

In Figure 4 we show how the experiment is run. As in the diagram before, we start with the base architecture, and JMeter generates load that is injected into the Fargate tasks running the workload. This time, FIS runs the stop task experiment and stops the specified percentage of the tasks running.

Figure 4. Architectural diagram of the solution described when the stop task action is injected using FIS. Only one of the tasks shown is stopped without impacting the rest.
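
A minimal sketch of the action and target definition for this experiment follows; the cluster name, service name, and percentage are placeholders chosen for illustration rather than Pearson's exact values.

```python
# Sketch of the Fargate task interruption experiment: aws:ecs:stop-task stops
# a share of the running tasks so the team can observe how the service
# recovers. Names and the percentage are illustrative placeholders.
stop_task_actions = {
    "stop-tasks": {
        "actionId": "aws:ecs:stop-task",
        "targets": {"Tasks": "service-tasks"},
    }
}
stop_task_targets = {
    "service-tasks": {
        "resourceType": "aws:ecs:task",
        "parameters": {
            "cluster": "highered-cluster",  # placeholder cluster name
            "service": "catalog-service",   # placeholder service name
        },
        "selectionMode": "PERCENT(25)",  # stop a quarter of the tasks
    }
}
```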

3. DynamoDB disrupt network connectivity experiment

For experiments targeting DynamoDB, Pearson uses the aws:network:disrupt-connectivity action with the DynamoDB scope, which denies traffic to and from the Regional endpoint for DynamoDB.

Parameters:
Fault injection type: disrupt DynamoDB connectivity
Fault injection attributes: selection mode, duration, account number

In Figure 5, we show again how the experiment is run. As in the diagram before, we start with the base architecture, and JMeter generates load that is injected into the Fargate tasks running the workload. This time, FIS runs the connectivity disruption experiment for DynamoDB, severing the connection to the service and creating a network partition between the application and DynamoDB.

Figure 5. Architectural diagram of the solution when the DynamoDB connectivity action is injected by FIS. Fargate tasks cannot communicate with DynamoDB.

4. Amazon S3 disrupt network connectivity experiment

This experiment is similar to the DynamoDB experiment. The team targets the Amazon S3 scope of the disrupt connectivity action within FIS, and then asserts whether the application continues to function after the connectivity disruption occurs.

Parameters:
Fault injection type: disrupt Amazon S3 connectivity
Fault injection attributes: selection mode, duration, account number

In Figure 6 we show again how the experiment is run. The behavior is the same as in the previous diagram for DynamoDB, but this time the experiment targets Amazon S3 connectivity. In this scenario, the connection to Amazon S3 is severed.

Figure 6. Architectural diagram of the solution when the Amazon S3 connectivity action is injected by FIS. Fargate tasks cannot communicate with Amazon S3.

5. Availability Zone interruption experiment

To experiment at the AZ level, they use the same action but with the scope "availability-zone". As shown in Figure 7, this action denies traffic to and from specified subnets within the VPC. They then check the application metrics to understand the impact of the disruption.

Figure 7. Architectural diagram of the solution when the disrupt AZ connectivity action is injected by FIS. Traffic cannot be routed from other AZs inside the same VPC towards the one affected.
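
The three connectivity experiments (DynamoDB, Amazon S3, and Availability Zone) use the same aws:network:disrupt-connectivity action; only the scope changes. The sketch below shows roughly how the action and its subnet targets can be defined; the scope value, subnet tag, and duration are placeholders for illustration, not Pearson's exact templates.

```python
# Sketch of a connectivity disruption: aws:network:disrupt-connectivity with
# scope "dynamodb" blocks traffic between the targeted subnets and the
# Regional DynamoDB endpoint for the given duration. Tags and values are
# illustrative placeholders.
disrupt_connectivity_actions = {
    "deny-connectivity": {
        "actionId": "aws:network:disrupt-connectivity",
        "parameters": {
            "scope": "dynamodb",  # "s3" or "availability-zone" for the other experiments
            "duration": "PT5M",
        },
        "targets": {"Subnets": "app-subnets"},
    }
}
disrupt_connectivity_targets = {
    "app-subnets": {
        "resourceType": "aws:ec2:subnet",
        "resourceTags": {"workload": "highered-app"},  # placeholder subnet tag
        "selectionMode": "ALL",
    }
}
```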

Conclusion

Pearson’s ultimate vision is to achieve complete automation of chaos engineering in their development process. They aspire to have their systems recover seamlessly and autonomously based on dashboards and alert settings. This approach significantly reduces the overall recovery time, eliminating the need for manual intervention to evaluate the situation and decide on an outcome. Chaos engineering helps Pearson understand how their workloads behave in real-world failure scenarios and be better prepared for them, without having to experience those failures for the first time in production.

In this post, we described how Pearson innovated and built a solution to automate chaos engineering in a controlled manner using their GitLab CI pipeline. We also illustrated how they are implementing FIS across a variety of AWS services and scenarios, including Amazon S3, Application Load Balancer, DynamoDB, and Multi-AZ communication. With these architectures, Pearson successfully simulated controlled failures to identify potential weaknesses and vulnerabilities, improving the overall reliability and resilience of their systems.

To learn more, contact us or leave a comment below.

Agustin Calatayud

Agustin is a senior solutions architect for Amazon Web Services (AWS) and is based in Argentina. He is part of a group focused on resilience inside AWS, specifically chaos engineering and continuous resilience. He covers digital native businesses in Argentina, Uruguay, and Paraguay.

Chirag Bhavsar

Chirag Bhavsar is an enterprise solution architect at Pearson Higher Education. He plays a pivotal role in developing sophisticated solutions in observability, chaos engineering, and other non-functional requirements (NFRs) on Amazon Web Services (AWS). With an extensive background in technology, he applies his deep technical expertise and strategic vision to deliver innovative, cutting-edge solutions.

Pranusha Manchala

Pranusha is a senior solutions architect at Amazon Web Services (AWS) focused on education technology (EdTech) companies. As a technical lead, she provides strategic customers with architectural guidance for building next-generation, highly scalable, reliable, and cost-optimized solutions on cloud. She is also part of the artificial intelligence/machine learning subject matter expert community at AWS and loves helping customers in this space.

Ratnakar Chintalapati

Ratnakar is a technical account manager at Amazon Web Services (AWS) supporting worldwide public sector customers. As a technical advisor, he helps customers innovate and operate their workloads on AWS at scale. Ratnakar is passionate about cloud financial management, cloud operations, and improving resiliency of AWS workloads. When not helping customers, he enjoys spending time with family, traveling, and cooking.

Shridhar Navanageri

Shridhar Navanageri is vice president of architecture and platform engineering at Pearson Higher Education. With more than two decades of tech expertise, Shridhar thrives on turning tech complexity into impactful solutions. He leads Pearson as they build secure, scalable learning platforms that power learning experiences for millions of students globally.