Thought Leadership

Jun 23, 2026

What a Disaster Recovery Gameday Taught Us About Resilience

How Controlled Failure Simulations Build Engineering Confidence and Process Reliability

Resilience is not a document or runbook. It is the ability to face difficult circumstances head-on and achieve the best possible outcome. That belief is why HeroDevs runs disaster recovery gamedays. These are controlled simulations of the failures we strive to never let happen in production. This past quarter, our platform and infrastructure teams deliberately broke a critical database in a production-like environment and rebuilt it from scratch, measuring every step along the way.

Here is why we do this, what the exercise looked like, and what we took away from the event.

Why run a disaster recovery gameday at all?

Many engineering organizations have a recovery plan, or at least an idea of what one should look like. Far fewer have ever executed it under pressure. There is a big difference between a runbook your team wrote and a runbook your team has actually used. That difference is what separates a manageable outage from a multi-hour scramble.

A gameday closes that gap on purpose. By simulating a real failure in a production-like environment, we get to discover the broken assumptions, the stale documentation, and the manual steps nobody remembered, all without a single customer being affected. Beyond the recovery itself, a gameday surfaces process gaps, hidden application dependencies, and concrete action items we did not know we needed. The cost of learning a lesson during a scheduled simulation is a fraction of the cost of learning it at 3 a.m. during a genuine incident.

For a security-focused company whose main mission is keeping software secure and operational long after it reaches end-of-life, this is not optional. Disaster recovery testing is a SOC 2 requirement, and we take that seriously. But we do not stop at the minimum. We run more gamedays than compliance requires because our customers trust us to be the team that has already thought through the failure modes. Practicing recovery is how we keep earning that trust.

The scenario

We started with a deliberately painful premise. A shared Postgres database becomes corrupted and unusable, while every other piece of infrastructure remains healthy. This is a realistic and nasty failure mode, because a dead database does not always take its dependent applications down with it in an obvious way. The blast radius can be wide and quiet at the same time.

The team set four clear objectives:

Diagnose which systems and applications were impacted
Rebuild the database from backups, minimizing the recovery point objective (RPO)
Cut the affected applications over to the restored database
Validate that application functionality was fully restored, and measure mean time to recovery (MTTR)
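
To make the two metrics in these objectives concrete, here is a minimal Python sketch of how they could be computed from an incident timeline. The function names and timestamps are hypothetical and are not figures from this exercise. It assumes the data-loss window is the gap between the last usable snapshot and the failure, and that recovery time runs from the failure until functionality is validated.

```python
from datetime import datetime, timedelta
from statistics import mean


def data_loss_window(last_snapshot: datetime, failure: datetime) -> timedelta:
    """Data written after the last usable snapshot is lost on restore."""
    if last_snapshot > failure:
        raise ValueError("snapshot must predate the failure")
    return failure - last_snapshot


def time_to_recover(failure: datetime, validated: datetime) -> timedelta:
    """Elapsed time from failure until functionality is validated."""
    if validated < failure:
        raise ValueError("validation must follow the failure")
    return validated - failure


def mean_time_to_recovery(durations: list[timedelta]) -> timedelta:
    """MTTR is the average recovery time across incidents or gamedays."""
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


# Hypothetical timestamps, for illustration only.
snapshot = datetime(2026, 1, 1, 2, 0)
failure = datetime(2026, 1, 1, 2, 45)
validated = datetime(2026, 1, 1, 4, 15)

print(data_loss_window(snapshot, failure))   # 0:45:00
print(time_to_recover(failure, validated))   # 1:30:00
```

A single gameday yields one recovery-time sample; the mean only becomes meaningful once several exercises or incidents have been recorded.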

How the exercise was structured

A gameday is only useful if it mirrors the discipline of a real incident, so we ran it like one. We stood up a dedicated war room and a dedicated incident channel, and we assigned explicit roles before touching anything. An incident coordinator drove the response, took notes, and facilitated communication across the organization. A database recovery lead executed the restoration, adhering to GitFlow and change management policies. An infrastructure representative advised on corrective action and served as a liaison to the recovery lead. And a platform representative, who knew the inner workings of our applications and their dependencies, handled proper validation. Establishing clear ownership and defining roles up front is one of the best strategies for effective disaster recovery.

From there the exercise moved through distinct phases:

Triage. Identify impacted workloads, confirm the scope of impact, notify stakeholders, and decide on a restoration strategy.
Restoration. Locate the latest viable snapshot, rebuild a new instance, repoint applications at the new endpoints, and confirm connectivity from the cluster.
Validation. Work through a concrete checklist: the app loads, authentication works, reads and writes succeed, there are no data integrity issues, and the instance is managed declaratively.
Timeline and postmortem. Reconstruct exactly what happened and when, so the response itself can be measured and improved.
Debrief. Honestly assess what worked, what slowed us down, and what should be automated next time. Identify real action items to improve the process.
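
The validation phase lends itself to a scripted checklist. The sketch below shows one way to structure it in Python, assuming each check is a small function that returns True on success. The check names mirror the validation items above, but the lambda bodies are placeholders rather than real probes.

```python
from typing import Callable

Check = Callable[[], bool]


def run_checklist(checks: dict[str, Check]) -> dict[str, bool]:
    """Run every check, treating an exception as a failure so one broken
    probe cannot hide the results of the others."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def failures(results: dict[str, bool]) -> list[str]:
    """Names of checks that did not pass, in checklist order."""
    return [name for name, ok in results.items() if not ok]


# Placeholder checks: replace each lambda with a real probe.
checks = {
    "app loads": lambda: True,
    "authentication works": lambda: True,
    "reads and writes succeed": lambda: True,
    "no data integrity issues": lambda: True,
    "instance is managed declaratively": lambda: False,
}

results = run_checklist(checks)
print(failures(results))  # ['instance is managed declaratively']
```

Running all checks instead of stopping at the first failure gives the validator a complete picture in one pass, which matters when the clock is running.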

We followed our existing incident response documentation throughout, and treated any place where we diverged from the runbook as an opportunity for improvement.

The difference between a simulation and the real thing

A gameday is valuable because it is a controlled exercise. Roles were assigned before the event even started, the war room was scheduled, and the scenario was known. Everyone showed up focused and ready to perform the simulated recovery.

A real incident is a different story. When something breaks in the middle of the night, the pre-assigned roles go out the window. In real time, you must decide who the incident coordinator will be, who will perform the database recovery, and who will validate each impacted application. The runbook that you practiced against might not match what actually failed.

This is not a shortcoming of HeroDevs; it is the nature of incidents. The value of a gameday is not that it mirrors a real fire. It is that it gives the team the opportunity to work collaboratively as if it were a real event. During gamedays, the team gets to put their process to the test while building muscle memory, a shared vocabulary, and confidence. These characteristics are what make an unpredictable situation more manageable. The goal is not to simulate chaos. The goal is to be prepared when it shows up at 2 a.m.

What worked well

The most reassuring outcomes were the boring ones. Restoring from a snapshot using our declarative infrastructure controller was smooth, and internal databases and users were preserved through the rebuild. The documented process was largely accurate, the incident response guide held up, and every resource the team needed to execute the recovery was available when needed. A plan is only as good as the team behind it, and the team delivered. This is what good preparation looks like in practice.

What we learned

The real value of a gameday comes from what it teaches you. A few areas for improvement surfaced:

Applications do not always fail loudly when the database underneath them dies. Pods kept running even though their datastore was gone, which means health checks that only watch the process can hide a serious outage. We are revisiting how critical dependencies propagate failure.
Configuration drift shows up at the worst moment. The original instance was configured with storage autoscaling, and re-creating it from a backup declaratively surfaced configuration drift in the allocated storage parameter, preventing our controller from performing the restore. Watching the reconciliation events sooner would have saved meaningful time.
Some manual steps quietly cost us time. Certain configuration changes that should have been routine moved awkwardly through our normal change process, and finding that kind of friction is the whole point of a gameday.
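
The storage drift in the second item can be illustrated with a small comparison between declared and observed configuration. This is a sketch under assumptions: the field names (`allocated_storage_gb`, `instance_class`) and their values are hypothetical, and it does not represent how the controller in this exercise, or any particular cloud API, reports state.

```python
def find_drift(declared: dict, observed: dict) -> dict:
    """Return {field: (declared, observed)} for every declared field whose
    live value differs. Fields only present in `observed` are ignored."""
    return {
        key: (want, observed.get(key))
        for key, want in declared.items()
        if observed.get(key) != want
    }


# Hypothetical values: autoscaling grew the live volume past what the
# manifest still declares.
declared = {"allocated_storage_gb": 100, "instance_class": "example.large"}
observed = {"allocated_storage_gb": 250, "instance_class": "example.large"}

print(find_drift(declared, observed))
# {'allocated_storage_gb': (100, 250)}
```

A comparison like this, run before a restore or on a schedule, could flag the mismatch before a reconciliation attempt is rejected.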

Each of these became an action item, from investigating immediate-apply behavior for protected resources, to ensuring dependent workloads fail clearly when a critical database disappears, to defining a standard validation protocol every product team can adopt.
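
One way to make a workload fail clearly is a readiness check that exercises the dependency instead of only confirming the process is alive. The sketch below assumes a hypothetical `ping_database` callable supplied by the application (for example, one that runs `SELECT 1`). It is an illustration of the idea, not the health-check implementation used on our platform.

```python
from typing import Callable


def liveness() -> bool:
    """Process-only check: passes as long as this code can run at all."""
    return True


def readiness(ping_database: Callable[[], None]) -> tuple[bool, str]:
    """Dependency-aware check: passes only if the database answers.

    `ping_database` is a hypothetical callable that raises on failure.
    """
    try:
        ping_database()
    except Exception as exc:
        return False, f"database unreachable: {exc}"
    return True, "ok"


def healthy_db() -> None:
    pass  # stand-in for a successful round trip


def dead_db() -> None:
    raise ConnectionError("connection refused")


print(liveness())             # True, even with the database gone
print(readiness(healthy_db))  # (True, 'ok')
print(readiness(dead_db))     # (False, 'database unreachable: connection refused')
```

In an orchestrator such as Kubernetes, wiring a check like this to a readiness probe takes the pod out of service rotation when the database is gone, so the outage becomes visible instead of hiding behind a running process.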

Resilience as a practice

The point of a gameday is not to prove the plan is perfect. It is to find the places where it is not, while the stakes are low. Every broken assumption we uncover in a gameday event is one we will never have to discover during a real customer-facing event.

This is the same philosophy that drives everything HeroDevs does. Software does not stop having vulnerabilities the day it reaches end-of-life, and infrastructure does not stop failing because you wrote a runbook. Staying secure and operational over the long haul takes deliberate, repeated practice. If you want to see the kind of risks we track and protect against in end-of-life open source, browse our vulnerability directory. When you are ready to have a conversation about securing your applications, reach out to HeroDevs.

We will keep breaking our own systems on purpose, so that when something breaks for real, the team is prepared and ready to perform.
