AWS Outage: Cloud Redundancy, E2E Testing, and CI/CD Pipeline Recovery

Summary: The AWS US-EAST-1 outage began with a DNS failure that exposed hidden dependencies across Europe. What can custom software teams take away from it, beyond the standard call for cloud redundancy? We highlight end-to-end testing, CI/CD pipeline integrity, and the team roles required to maintain stability and accelerate recovery.

Only last week we discussed platform risk, and now Amazon Web Services has experienced a global disruption centered on US-EAST-1 that degraded over a dozen core services and impacted major consumer and enterprise apps. The Associated Press reported DNS issues, with the first alerts appearing at 12:11 AM. Even after the DNS issue was “fully mitigated” by 2:24 AM, cascading failures at Amazon’s Elastic Compute Cloud (EC2) were noted around 3:35 AM. A second major issue, failing Network Load Balancer (NLB) health checks, appeared around 7-8 AM. Even though Amazon brought all services back within their SLAs, many websites and services reported 12-15 hours of downtime.

Worldwide Implications and Panic in Europe

The reaction in Europe was instant and unusually public. “My robot vacuum stopped working and can someone explain why a robot in Paris depends on servers in the U.S.?” wrote Ulrike Franke, senior fellow at the European Council on Foreign Relations. Her post went viral because it captured absurdity and dependency in just two sentences.

European policymakers echoed the sentiment. Politico framed the outage as a warning about Europe’s autonomy: “Outages like this show how the concentration of computing power makes the internet fragile and turns technical failures into economic risks.”

You did not buy AWS directly. Your supplier did not buy AWS directly. However, one of your supplier’s suppliers did purchase AWS.

The interconnected nature of this reality was exposed the moment US-EAST-1 faltered, and a “European” service inherited a remote failure. The invisible connection from that robot in Paris traveled to Data Center Alley, Northern Virginia.

The AI Blame Game: Who Coded This or Who Tested This?

A Reddit software engineering post summed it up: “The junior AI developer committed the change, the senior AI developer code-reviewed it, the AI test lead passed it, and the AI DevOps deployed it to prod.”

AI developers are now doing everything! They write code with Claude and Codex CLI, and probably generate the unit tests with AI as well. According to Sundar Pichai, AI produces more than a quarter of Google’s new code. Satya Nadella put Microsoft’s share even higher. Andy Jassy has stayed quiet on Amazon’s share, but the AWS CEO was implicated in the AI code scandal all the same!

Do the Hyperscalers Still Hire the Best Engineers?

Amazon, Google, Microsoft, and more recently Meta, all share a fascination with software all-stars. World-class talent and processes are a matter of prestige, and they are willing to pay for it. At Google or Amazon, senior engineers earn as much as $520K, with staff and principal levels often exceeding $700K. This talent premium reflects the complexity of hyperscaler systems, but it also places immense pressure on individual engineers to operate at that level.

Creating all-star teams works for the NBA and hyperscalers, but most software development and outsourcing companies thrive on cross-functional collaboration instead. Teams built around complementary expertise, such as system architects, front-end and back-end developers, QA automation, and DevOps engineers, deliver consistent value without the costs associated with all-star salaries.

The Curious Case of the Disappearing QA Engineer

The other effect of all-star software has been the steady disappearance of dedicated QA roles across major cloud companies. Rising costs for H-1B visas, historically granted to QA engineers from India, only accelerate the decline of this already endangered species. Quality has been folded into developer workflows and automation, and many teams now boast of having zero QA. The professional skeptic is gone, and that is not good news for software quality.

Culturally, the belief is that an all-star engineer does not make mistakes, just as Messi does not score own goals. Manual testing is treated as exploratory work, or worse, as an anachronism. After all, what self-respecting engineer would add “manual testing” to their résumé?

Why Do Even the Best CI/CD Pipelines Fail?

CI/CD pipelines fail when their assumptions collapse in sequence. Many of these cascading failures start with a small change. In July 2024, CrowdStrike shipped a defective content update that propagated to millions of Windows machines and resulted in the blue screen of death (BSOD). Bounds checking and verification were insufficient, the rollout was not staggered, and the blast radius was immense. Different stacks, different pipelines, same result: millions of systems gone offline.
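The two missing safeguards named above, validation before deployment and a staggered rollout, can be sketched in a few lines. This is a minimal illustration with invented names, not any vendor's actual pipeline: a malformed update is blocked outright, and an unhealthy canary halts promotion at the first ring, capping the blast radius.

```python
# Sketch of a staged rollout gate (illustrative names): a defective content
# update should be blocked by validation, and a failing canary should halt
# the rollout at the first ring instead of reaching the whole fleet.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Update:
    payload: bytes
    expected_fields: int

def validate(update: Update) -> bool:
    # Bounds check before any deployment: reject payloads whose field count
    # does not match what the parser on the endpoint expects.
    return len(update.payload.split(b",")) == update.expected_fields

def staged_rollout(update: Update,
                   rings: list[list[str]],
                   healthy: Callable[[str], bool]) -> list[str]:
    """Deploy ring by ring; stop as soon as validation or canary health fails."""
    deployed: list[str] = []
    if not validate(update):
        return deployed            # blocked before any host sees the update
    for ring in rings:
        deployed.extend(ring)      # in reality: push, then watch telemetry
        if not all(healthy(host) for host in ring):
            break                  # halt here; blast radius = rings so far
    return deployed

rings = [["canary-1"], ["ring1-a", "ring1-b"], ["fleet-a", "fleet-b"]]
bad = Update(payload=b"a,b", expected_fields=3)      # malformed update
good = Update(payload=b"a,b,c", expected_fields=3)
assert staged_rollout(bad, rings, lambda h: True) == []       # zero hosts hit
assert len(staged_rollout(good, rings, lambda h: True)) == 5  # full rollout
assert staged_rollout(good, rings, lambda h: h != "canary-1") == ["canary-1"]
```

The point of the sketch is the ordering: validation runs before any host is touched, and each ring's health gates the next, so a defect that slips past validation still only reaches a handful of machines.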

New Commandment: Know Thy CI/CD Pipeline

We work in insurance and other regulated industries. Claims, onboarding, and customer service are obligations with legal and reputational weight. Outages are not abstract. The approach below reflects that reality and centers DevOps, with QA integrated as part of delivery rather than a separate gate.

1) CI/CD Pipelines You Can Explain, Operate, and Recover

In insurance, pipelines are the real production control system. We build pipelines that remain transparent and operable under stress. Each stage, from build to release, is mapped, owned, and recoverable. We create promotion paths that work even when regions fail, and we maintain compliant credential storage and audit trails. The result is simple: both your teams and ours can see what runs, fix what breaks, and recover fast.
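"Mapped, owned, and recoverable" can be made concrete with a simple pipeline registry. The stage names, teams, and recovery actions below are hypothetical, but the idea is real: an on-call engineer should be able to answer "who owns this stage and how do we recover it" from one place.

```python
# Minimal sketch of an explainable pipeline map: every stage has a named
# owner and a documented recovery action. All names are illustrative.
PIPELINE = {
    "build":   {"owner": "platform-team", "recover": "rerun from last green commit"},
    "test":    {"owner": "qa-automation", "recover": "quarantine flaky suite, rerun"},
    "staging": {"owner": "devops",        "recover": "redeploy previous artifact"},
    "release": {"owner": "devops",        "recover": "promote from secondary region"},
}

def runbook(stage: str) -> str:
    """Return the recovery instruction for a failed stage, or fail loudly."""
    info = PIPELINE.get(stage)
    if info is None:
        # An unmapped stage is itself a finding: nothing in the pipeline
        # should run without an owner and a recovery path.
        raise KeyError(f"unmapped stage: {stage!r} — every stage must be owned")
    return f"{stage}: page {info['owner']}, then {info['recover']}"

assert runbook("release") == "release: page devops, then promote from secondary region"
```

In practice this map lives next to the pipeline definition and is validated in CI, so the documentation cannot drift from what actually runs.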

[Animation: three dots orbit an infinity path, suggesting feedback flowing from production to integration and back. The synced CI/CD loops pair the TINQIN loop (DevSecOps pipeline: execution and delivery, regular releases, unified codebase) with the client loop (production pipeline: validation and discovery, automated tests, QA and security).]

2) End-to-End Testing That Assumes Parts of the Platform Can Fail

End-to-end testing has value only when it mirrors real operating conditions. The goal is to confirm that essential user flows hold up when the platform slows, returns stale data, or drops connections.

We create testing frameworks that simulate these stress scenarios to keep critical paths intact when dependencies misbehave. We design telemetry that stays independent of the region under test, preserving visibility during outages and ensuring decision-makers can see what users experience, not just what the platform reports.
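One way to simulate those stress scenarios is to wrap a dependency in a fault-injecting test double. The sketch below is illustrative (the quote service and timeout are invented): the critical flow is asserted to degrade gracefully when the dependency is slow or down, rather than crash.

```python
# Sketch of fault-injecting end-to-end checks: a dependency double that can
# be slow, stale, or down, and a critical flow that must survive all three.
# Service names, values, and thresholds are hypothetical.
import time

class FlakyDependency:
    def __init__(self, mode: str = "ok"):
        self.mode = mode

    def fetch(self) -> str:
        if self.mode == "slow":
            time.sleep(0.05)                    # injected latency
        if self.mode == "down":
            raise ConnectionError("dependency unreachable")
        if self.mode == "stale":
            return "cached-quote"               # region serving old data
        return "live-quote"

def get_quote(dep: FlakyDependency, timeout: float = 0.01) -> tuple[str, str]:
    """Critical user flow: prefer live data, fall back to cache, never crash."""
    start = time.monotonic()
    try:
        value = dep.fetch()
        if time.monotonic() - start > timeout:
            return ("degraded", value)          # slow but still usable
        return ("ok", value)
    except ConnectionError:
        return ("fallback", "cached-quote")

assert get_quote(FlakyDependency("ok")) == ("ok", "live-quote")
assert get_quote(FlakyDependency("down")) == ("fallback", "cached-quote")
assert get_quote(FlakyDependency("slow"))[0] == "degraded"
```

Each `mode` corresponds to a failure the AWS outage actually produced: elevated latency, stale reads, and dropped connections. An E2E suite that only exercises the happy path would pass right up until the region faltered.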

3) Budget for Cloud Redundancy Is the Elephant in This Room

The AWS outage showed how a single regional dependency can ripple across services and continents. True redundancy means more than backups. It means separating regions, accounts, and identity systems so that critical workloads keep running when a control plane falters. Not every component needs multi-region or multi-cloud coverage, but the critical ones do.

We encourage clients to think critically about infrastructure resilience early in the consulting and architecture phase. Our teams map dependencies, isolate critical workloads, and design failover strategies that meet both performance and regulatory demands. The result is real operational redundancy, not just a diagram that looks good framed on a wall.
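A failover strategy of the kind described above reduces, at its core, to a routing decision driven by independent health checks. This is a deliberately minimal sketch with invented region names and a static priority order; real failover also has to handle identity, data replication, and control-plane isolation.

```python
# Minimal failover sketch: route to the first healthy region in priority
# order; if none is healthy, fail loudly so the DR plan is engaged.
# Region names and the health-check source are illustrative.
def pick_region(health: dict[str, bool],
                priority: tuple[str, ...] = ("eu-west-1", "us-east-1")) -> str:
    for region in priority:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region — page on-call, engage the DR plan")

# Normal operations: the preferred region wins.
assert pick_region({"eu-west-1": True, "us-east-1": True}) == "eu-west-1"
# Primary control plane falters: traffic shifts to the secondary.
assert pick_region({"eu-west-1": False, "us-east-1": True}) == "us-east-1"
```

The crucial design choice is that the health signal feeding `pick_region` must come from outside the region being judged; a region that cannot report its own failure is exactly the scenario US-EAST-1 demonstrated.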

Five Years Out: Software Engineering Roles of the Future?

In the wake of CrowdStrike, everyone focused on “the file that broke Windows.” After US-EAST-1, attention shifted to overreliance on a single region and the claim that “AWS broke the internet.” The first incident looked like a clear QA miss. The second raises harder questions about the resilience of every external link in a CI/CD pipeline, which today can include almost all of them.

The Division of Human Labor

Hyperscalers increased development velocity by relying on all-star engineers and embedding quality into automation. No one can argue with the all-star model when profits and market caps are at all-time highs. Yet world-class talent is rare by definition. Every regular software development company has to rely on experienced, talented engineers, even if they would never win a Nobel Prize like Demis Hassabis.

The Division of AI Labor

If AI capabilities increase on their current trajectory, routine work will depend on agent frameworks running on clouds and managed data stores. The AWS outage was a foretaste of a full banquet of disaster in five years, when most code is written and maintained by AI. Work would stop because many services would rely on an AI factory powered by Nvidia chips.

By then, we may work on an NVIDIA DGX Spark that can already handle 200-billion-parameter models. Even with that capability on our desk, the question about responsibility in software development remains open.

The Operational Mandate: SDLC Principles for Cloud Resilience

Preaching about software engineering best practices to a hyperscaler is not the goal. The goal is to learn from others’ mistakes and apply them where they count. We went over the SDLC principles informing our work and found these to be especially relevant:

  • Consulting and discovery. Engage stakeholders to define the value proposition, critical workflows, and budget constraints prior to design.
  • Insurance industry expertise. Deep understanding of the operational, legal, and data compliance requirements in insurance.
  • European regulatory mapping. Ensure strict alignment with GDPR, DORA, and data residency mandates from project inception.
  • Architecture for resilience. Structure applications for business continuity and fault tolerance.
  • Structured CI/CD. Segment deployment pipelines for critical and peripheral systems.
  • End-to-end testing. Prove core user journeys work, including under stress/load.
  • QA automation. Use AutoQA to transform acceptance criteria into executable checks.
  • Cloud-agnostic DevOps teams. Expertly deploy all systems that require multi-cloud redundancy.
  • Cybersecurity and monitoring. Leverage our ISO 27001-certified Operations Centers to enhance observability and to monitor threats and system performance across the entire cloud footprint.
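The AutoQA principle above, turning acceptance criteria into executable checks, can be illustrated with a single Given/When/Then example. The claim workflow, threshold, and function names below are invented for illustration and are not a description of any specific AutoQA tooling.

```python
# Sketch: an acceptance criterion rewritten as an executable check.
# Hypothetical business rule: active policies get claims under 10,000
# auto-approved; everything else goes to manual review.
def submit_claim(amount: float, policy_active: bool) -> str:
    if policy_active and amount < 10_000:
        return "approved"
    return "manual-review"

# Given an active policy, When a claim under 10,000 is filed,
# Then it is auto-approved.
assert submit_claim(5_000, policy_active=True) == "approved"
# Given a lapsed policy, Then any claim goes to manual review.
assert submit_claim(5_000, policy_active=False) == "manual-review"
# Given an active policy, When the claim is at or above the threshold,
# Then it goes to manual review.
assert submit_claim(25_000, policy_active=True) == "manual-review"
```

The value is traceability: each assertion maps one-to-one to a sentence a business stakeholder signed off on, so a failing check points directly at the violated requirement.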

Recap: The AWS outage revived questions about AWS EC2 costs and dependency tradeoffs. It was a reminder that even the most advanced infrastructures can stumble. Yet it also showed that recovery at scale is possible when automation, visibility, and coordination work as intended. Since our teams already operate across all major cloud environments, we help our clients design multi-cloud or cloud-agnostic systems that balance performance, cost, and resilience. Our AWS services also cover monitoring and observability, in addition to our SOC services.