The Reliability-First Development System: A Practical Checklist for Engineers Who Ship

Most teams know they should think about reliability. Few have a system for it.

The difference between the two shows up in production. Teams without a system react — they scramble when things break, patch the immediate problem, and move on. Teams with a system anticipate — they have runbooks, alerts that fire before users notice, and post-mortems that actually change behaviour.

What follows is the system I use. I call it the Reliability-First Development System — RFDS for short. It is not a heavy methodology. It does not require you to throw out Agile or reorganise your team. It is a mindset and a checklist you layer on top of whatever you already do.

Seven pillars, followed in order, iterated continuously.


The Five Principles You Cannot Violate

Before the pillars, the mindset. These are non-negotiable:

  1. Failure is normal, not rare. Plan for it the same way you plan for the happy path.
  2. Make expectations explicit and enforceable. If it is only in documentation, it will be ignored.
  3. Design for observability, recoverability, and graceful degradation. In that order.
  4. Automate everything that can be automated. Testing, deployment, recovery — if a human has to remember to do it, eventually a human will forget.
  5. Measure reliability continuously. It is not a one-time audit. It is a live metric like any other.

Hold these in your head as you work through the pillars.


Pillar 1: Explicit Expectations

This is the foundation. Everything else builds on it.

The most common source of production surprises is not bad code — it is unexamined assumptions. Someone assumed the database would always respond within 200ms. Someone assumed the third-party API would always return a valid JSON body. Someone assumed that edge case would never happen in production.

Write it down before you write a line of code:

  • What inputs does this accept? What does it reject?
  • What states can this system be in? What transitions between them are valid?
  • What are the non-functional requirements, in numbers? Latency SLO, error budget, availability target (99.9%? 99.99%?), maximum throughput.
  • What components can fail, delay, or behave unexpectedly? Include your own dependencies, the network, hardware, third-party APIs, and yes — humans.
  • What does "done" look like, and how will you verify it?

The output is a living document — a Reliability Spec or a set of Architecture Decision Records. It does not have to be long. It has to be honest.

Practical checklist:

  • Happy paths documented with concrete examples
  • Non-functional requirements stated as numbers, not adjectives ("fast" is not a requirement; "p99 < 300ms" is)
  • Failure model written: every dependency that can fail is listed
  • Acceptance criteria defined upfront, agreed by the team

Pillar 2: Resilient Architecture

Once you know what can fail, you design around it.

The core patterns are not new, but they are under-applied:

Isolation. Use bulkheads — separate thread pools, queues, processes, or services for different concerns. If your image processing pipeline saturates, it should not affect your checkout. Blast radius control is architecture, not an afterthought.

Circuit breakers. When a downstream service is failing, stop hammering it. Open the circuit, serve a degraded response or fail fast, and try again after a cooldown. Libraries like Resilience4j (Java) and Polly (.NET) give you this with a small amount of configuration.
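
To make the open, cooldown, and retry cycle concrete, here is a minimal sketch in Python. It is not how Resilience4j or Polly work internally; the class name, the consecutive-failure threshold, and the injectable clock are all assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown has elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrap each critical downstream call in `breaker.call(...)` and catch `CircuitOpenError` where you serve the degraded response. This sketch is single-threaded; a real one needs locking and usually a failure-rate window rather than a simple counter.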

Idempotency. Assume any operation might be retried — by your retry logic, by a user clicking twice, by a network hiccup. Design writes so that running them twice produces the same result as running them once. This is not optional in distributed systems.
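
One common way to get there is an idempotency key supplied by the caller. The sketch below is a toy: `PaymentService` and its fields are hypothetical, and the in-memory dictionary stands in for what would need to be a durable store with an atomic check-and-write.

```python
class PaymentService:
    """Hypothetical service that deduplicates charges by a client-supplied key."""

    def __init__(self):
        self._results = {}  # idempotency_key -> result of the first attempt
        self.charges_made = 0

    def charge(self, idempotency_key, amount_cents):
        # A retry with the same key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result
```

Calling `charge("order-17", 500)` twice returns the same result and moves money once, which is exactly the property a retry or a double click needs.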

Timeouts and deadlines everywhere. The most common production failure mode is not an error. It is a hang. A downstream service stops responding and your thread pool fills up waiting. Set a timeout on every external call. Set a maximum runtime on every background job. Set a dead-letter destination on every queue.
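
Per-call timeouts work best when they share one overall budget for the request. A rough sketch of that idea, assuming a hypothetical `Deadline` helper that you would pass down the call chain:

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the overall request budget has run out."""


class Deadline:
    """Tracks how much of a request's time budget is left."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, per_call_cap):
        """Timeout to hand to the next external call: the cap or what is left, whichever is smaller."""
        remaining = self.remaining()
        if remaining <= 0.0:
            raise DeadlineExceeded("request budget exhausted")
        return min(per_call_cap, remaining)
```

Each external call then receives `deadline.timeout_for(0.5)` as its timeout argument, so three slow dependencies in a row cannot add up to more than the request was ever allowed to take.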

Degradation modes. Know in advance what your system looks like when it is partially healthy. Can you serve read-only? Can you serve cached data that is 5 minutes stale? Can you return a lower-quality response faster rather than a perfect response slowly? These are architectural decisions. Make them deliberately.
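
The stale-cache option might look like the sketch below. `StaleCacheFallback` is a made-up name, the 300-second limit mirrors the 5-minute example above, and the cache is a plain dictionary purely for illustration.

```python
import time


class StaleCacheFallback:
    """Serve the last known value when the live fetch fails, up to a staleness limit."""

    def __init__(self, fetch, max_stale_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, stored_at)

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._cache.get(key)
            if cached is not None and self._clock() - cached[1] <= self._max_stale:
                return cached[0], "degraded"  # tell the caller this is stale data
            raise  # nothing usable in the cache: surface the failure
        self._cache[key] = (value, self._clock())
        return value, "fresh"
```

Returning the mode alongside the value is a deliberate choice: the caller can show a "prices may be out of date" banner, and you can count degraded responses as a metric.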

Practical checklist:

  • Every external call has a timeout
  • Circuit breakers on all critical downstream dependencies
  • All writes are idempotent or guarded with idempotency keys
  • Degradation modes defined and tested
  • Redundancy applied where the failure cost justifies it (not everywhere)

Pillar 3: Defensive Implementation

Architecture decisions become code. This pillar is about not losing the gains at the implementation layer.

Input validation and schema enforcement. Use types, protobufs, Zod, OpenAPI specs — anything that makes "expected input" machine-enforced rather than documentation-dependent. Documentation rots. Types don't lie. Validate at every system boundary: HTTP handlers, message consumers, database reads from untrusted sources.
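
If you are not using a schema library, the same idea can be sketched with the standard library alone. The `CreateOrder` shape and its two fields are invented for this example; the point is that nothing past the boundary ever sees an unvalidated payload.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Raised when a payload does not match the expected schema."""


@dataclass(frozen=True)
class CreateOrder:
    sku: str
    quantity: int

    @classmethod
    def parse(cls, payload):
        """Turn an untrusted dict into a typed object, or raise ValidationError."""
        if not isinstance(payload, dict):
            raise ValidationError("payload must be an object")
        unknown = set(payload) - {"sku", "quantity"}
        if unknown:
            raise ValidationError(f"unknown fields: {sorted(unknown)}")
        sku = payload.get("sku")
        quantity = payload.get("quantity")
        if not isinstance(sku, str) or not sku:
            raise ValidationError("sku must be a non-empty string")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(quantity, bool) or not isinstance(quantity, int) or quantity < 1:
            raise ValidationError("quantity must be a positive integer")
        return cls(sku=sku, quantity=quantity)
```

An HTTP handler or message consumer would call `CreateOrder.parse(body)` first and translate `ValidationError` into a rejection, so the rest of the code only ever handles a valid `CreateOrder`.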

Explicit error handling. Never swallow an error silently. If you catch an exception and do nothing with it, you have made the failure invisible — which is worse than crashing, because now you don't know something is wrong. Log it. Alert on it. Handle it. Or let it propagate to the layer that can.

Fail fast in development. The best time to find an error is before it reaches production. Assertions, panics, strict type checks — make your code loud in dev and staging. Graceful degradation is for production users. In development, you want the crash.

Feature flags for risky changes. If you are shipping something behavioural — new pricing logic, changed API contracts, reworked authentication — put it behind a flag. You can turn it off in seconds without a deploy if something goes wrong.
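
Stripped to its core, a flag is a lookup that defaults to off. In the sketch below the `FeatureFlags` class, the `new_pricing` flag, and the 10% discount are all hypothetical; a real setup would read flags from a service or config store so they can change without a deploy.

```python
class FeatureFlags:
    """In-memory flag store, standing in for a real flag service."""

    def __init__(self, initial=None):
        self._flags = dict(initial or {})

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)

    def is_enabled(self, name):
        # Unknown flags default to off, so a typo never enables new behaviour.
        return self._flags.get(name, False)


def price_cents(base_cents, flags):
    if flags.is_enabled("new_pricing"):
        return base_cents * 90 // 100  # hypothetical new logic: 10% discount
    return base_cents  # existing behaviour stays the default path
```

Turning the flag off restores the old code path immediately, which is the "seconds without a deploy" property described above.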

Practical checklist:

  • Schema validation at all system boundaries
  • No silently swallowed exceptions
  • Errors logged with enough context to diagnose without reproducing locally
  • Risky changes wrapped in feature flags
  • Fail-fast assertions in development builds

Pillar 4: Comprehensive Verification

Testing for reliability goes further than unit tests. Most teams stop at the happy path. That is where the gap opens.

Unit and integration tests cover the expected behaviour. They are necessary but not sufficient.

Contract tests verify that your assumptions about external services are still true. If a third-party API changes the shape of a response, your contract test should fail before your production code does.
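
At its simplest, a contract test checks the shape of a real or recorded response against what your code relies on. A minimal sketch, assuming a hypothetical payment API whose fields are invented here:

```python
# Hypothetical contract: the fields our code reads from the provider's response.
PAYMENT_RESPONSE_CONTRACT = {"id": str, "amount_cents": int, "currency": str}


def check_contract(response_body, required_fields):
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for name, expected_type in required_fields.items():
        if name not in response_body:
            violations.append(f"missing field: {name}")
        elif not isinstance(response_body[name], expected_type):
            actual = type(response_body[name]).__name__
            violations.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return violations
```

Run it in CI against the provider's sandbox or a freshly recorded response and fail the build on any violation. Dedicated contract-testing tools go further than this, but even a shape check catches a renamed or retyped field before production does.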

Chaos and resilience testing is where most teams have the most to gain. Inject failures deliberately: kill a pod, add latency to a downstream call, corrupt a response, fill a disk. Then watch what happens. Tools like Gremlin and Chaos Monkey make this tractable. If you are not running chaos experiments, you are relying on hope.
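
You do not need a platform to start. A test-only wrapper that injects latency and failures into one dependency is enough for a first experiment. The sketch below is not how Gremlin or Chaos Monkey work; `with_faults` and its parameters are assumptions for illustration.

```python
import random
import time


def with_faults(func, failure_rate=0.0, added_latency_seconds=0.0, rng=None, sleep=time.sleep):
    """Wrap a dependency call so that it is slower and sometimes fails."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if added_latency_seconds > 0:
            sleep(added_latency_seconds)  # simulate a slow downstream
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")  # simulate a failed downstream
        return func(*args, **kwargs)

    return wrapper
```

Wrap a client call with, say, `failure_rate=0.2` in a staging test and check that timeouts, circuit breakers, and fallbacks behave the way the Reliability Spec says they should.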

Load and performance testing under stress reveals failure modes that only appear at scale. A query that runs in 12ms for 10 concurrent users might run in 4 seconds for 1,000. k6 and Locust are your friends here.

Progressive deployment testing. Canary deployments — where you send 5% of traffic to the new version before rolling it out fully — are the most underused reliability practice in shipping teams. They give you production feedback before production impact.
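
The traffic split is usually handled by your load balancer or deploy tooling, but the underlying idea fits in a few lines. This sketch assumes routing by a stable hash of a user ID, so the same user always sees the same version; the function name is made up.

```python
import hashlib


def routes_to_canary(user_id, canary_percent):
    """Deterministically place a user in one of 100 buckets and compare with the canary share."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent
```

With `canary_percent=5`, roughly one user in twenty lands on the new version, and raising the number widens the rollout without reshuffling the users who are already on it.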

Practical checklist:

  • Unit and integration test coverage for all critical paths
  • Contract tests for every external dependency that matters
  • At least one chaos experiment run before every major release
  • Load tests run on a regular cadence, not just pre-launch
  • Canary or progressive rollout configured for production deploys

Pillar 5: Observability and Monitoring

Monitoring tells you something is wrong. Observability lets you ask why — even for failure modes you did not predict.

The three pillars of observability are metrics, logs, and distributed traces. For any non-trivial system, all three are non-negotiable.

Metrics tell you the shape of what is happening: request rates, error rates, latency percentiles, queue depths. Alert on symptoms, not resources. "Error rate above 1%" is a symptom. "CPU above 80%" is a resource. Symptoms affect users. Resources might not.

Logs give you the narrative: what happened to this specific request, in this specific context, at this specific moment. Log enough to debug without having to reproduce the issue locally. That means request IDs, user context, relevant state — not just the error message.
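
In practice that usually means structured log lines that carry a correlation ID. A small sketch using Python's standard `logging` and `json` modules; the `handle_checkout` function and its fields are hypothetical.

```python
import json
import logging
import uuid


def new_correlation_id():
    return uuid.uuid4().hex


def log_json(logger, level, message, **context):
    """Emit one JSON object per log line so fields stay machine-searchable."""
    logger.log(level, json.dumps({"message": message, **context}, sort_keys=True))


def handle_checkout(logger, user_id, cart_total_cents, correlation_id=None):
    # Reuse the caller's ID if one arrived with the request; otherwise start one.
    correlation_id = correlation_id or new_correlation_id()
    log_json(logger, logging.INFO, "checkout started",
             correlation_id=correlation_id, user_id=user_id,
             cart_total_cents=cart_total_cents)
    return correlation_id
```

Pass the same `correlation_id` to every downstream call and log line, and a single search reconstructs the whole story of one request.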

Distributed traces show you where time went across service boundaries. In a microservices architecture, a slow response might be caused by a slow downstream call three hops away. Without traces, you are guessing.

Dashboards should show user experience, not just infrastructure. A dashboard showing CPU and memory tells you about your machines. A dashboard showing p95 latency, error rate, and successful checkout rate tells you about your users.

Practical checklist:

  • Metrics, logs, and traces implemented (OpenTelemetry is a good starting point)
  • Alerts fire on symptoms, not resources
  • Every request carries a correlation ID through all systems
  • At least one user-experience dashboard exists and is reviewed regularly
  • Error budgets tracked and reviewed in sprint reviews

Pillar 6: Safe Deployment and Operations

The deploy is where reliability is most often lost. A system that works perfectly in staging can break in production because of a configuration difference, a data difference, or a load difference.

Automated CI/CD with strong gates. Every commit should pass tests, security scans, and linting before it can reach production. The gate should be automated — not "remember to run the tests before you merge".

Progressive rollouts. Canary first. Then staged. Then full. If your metrics degrade in the canary phase, roll back automatically. This requires your observability to be in place first — which is why Pillar 5 comes before Pillar 6.
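
The automatic rollback decision can start as a very simple rule. In this sketch the thresholds (500 requests minimum, one percentage point of extra errors) are placeholder assumptions, not recommendations; pick values from your own SLOs.

```python
def should_roll_back(baseline_error_rate, canary_error_rate, canary_requests,
                     min_requests=500, max_absolute_increase=0.01):
    """Decide whether the canary is measurably worse than the current version."""
    if canary_requests < min_requests:
        return False  # not enough traffic yet to judge either way
    return canary_error_rate - baseline_error_rate > max_absolute_increase
```

Your deploy tooling would evaluate this on a schedule during the canary phase and trigger the rollback when it returns `True`. A real check would usually look at latency and business metrics as well as error rate.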

Incident response runbooks. For every alert that can fire, there should be a runbook: what it means, what to check first, how to roll back, who to wake up. Written in advance, not improvised at 2am.

Game Days. Once a quarter — or more frequently if you can — simulate a major failure in production. Run it with monitoring live. See what breaks that you did not expect. Fix it before the real incident finds it first.

Blameless post-mortems. When something does break, the question is not who caused it — it is what system conditions made the failure possible. Fix the system, not the person. Add the test you wish had existed. Update the runbook. Close the gap.

Practical checklist:

  • CI/CD pipeline with automated test, lint, and security gates
  • Progressive rollout configured with automatic rollback on metric degradation
  • Runbooks written for every alert
  • At least one Game Day run per quarter
  • Post-mortem template exists and is used consistently

Pillar 7: Continuous Learning and Improvement

Reliability is not a state you reach. It is a practice you maintain.

Every incident — every near-miss, every unexpected edge case — is a data point about the gap between your mental model of the system and how it actually behaves. The system should get more resilient with each cycle, not just more familiar.

Feed incidents back into the failure model. When something breaks in a way you did not anticipate, add it to the list. The list should grow over time — not because your system is getting worse, but because your understanding of it is getting more accurate.

Run reliability reviews before major changes. The same way you run architecture reviews, run reliability reviews. Ask the five questions: What can fail? How will we know? How do we recover? What is the blast radius? Have we tested it?

Track reliability as a KPI. Error budget consumption, mean time to recovery, p99 latency — these belong in your sprint reviews alongside feature velocity. If you only measure what you ship, you will optimise only for shipping.
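
Error budget arithmetic is simple enough to keep in a script. As a worked example, a 99.9% availability target over 30 days allows (1 - 0.999) × 30 × 24 × 60 = 43.2 minutes of downtime. The function names below are my own.

```python
def error_budget_minutes(slo_target, window_days):
    """Allowed downtime for an availability SLO, e.g. 0.999 over 30 days."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed_fraction(slo_target, window_days, downtime_minutes):
    """Share of the error budget already spent in this window."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)
```

If you have had 10.8 minutes of downtime against that 43.2-minute budget, you have consumed a quarter of it, and that is the number to bring to the sprint review.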

Practical checklist:

  • Incident learnings captured and tracked against the failure model
  • Reliability review checklist used before major releases
  • Reliability metrics reviewed every sprint
  • "What could go wrong?" added as a standing question in design reviews

How to Start

New project: Apply all seven pillars from day one. Retrofitting observability or idempotency into an existing system usually costs far more than building it in from the start, because every caller, schema, and data flow already depends on the old behaviour.

Existing system: Start with a reliability audit. Map your current failure modes. Add observability first — it gives you the biggest immediate return because it makes every subsequent problem easier to diagnose. Then add chaos testing to surface the surprises before your users do.

Team practices: Add "What could go wrong?" to every design review. Include reliability stories in your backlog — not as a separate track, but as part of the definition of done. Celebrate recovered incidents as well as shipped features.


The Templates

Reading about a system is one thing. Having the artefacts ready to use is another. Below are the templates I reach for most often. Adapt them — don't let them become bureaucratic overhead. For small teams, a Reliability Spec can live in your README. What matters is that the thinking happens, not that the document is polished.


Template 1: Reliability Spec

Document Title: Reliability Specification - [System/Component Name]
Version: 1.0 | Date: [YYYY-MM-DD] | Owner: [Name/Team]

1. System Overview

  • Purpose / Business Value:
  • Key Stakeholders:
  • Criticality (Tier): High / Medium / Low

2. Quantitative Goals (SLOs)

Metric       | SLO Target | Measurement Window | Error Budget
Availability | 99.9%      | 30 days            |
Success Rate | 99.95%     | 1 hour             |
P95 Latency  | < 300ms    | 1 day              |
P99 Latency  | < 800ms    | 1 day              |
Throughput   | X req/sec  | Peak hour          |

3. Expected Behaviour (Happy Path)

  • Main use cases / flows:
  • Input schemas and validation rules:
  • Output contracts:
  • State transitions (table or diagram):
  • Invariants (conditions that must always be true):

4. Failure Model

Dependency   | Failure Mode            | Probability | Impact | Mitigation              | Detection
Database     | Connection timeout      | High        | High   | Circuit breaker + retry | Metrics + timeout alert
External API | Returns 5xx / slow      | Medium      | High   | Fallback + cache        | Latency histogram
Network      | Partition / packet loss | Medium      | High   | Multi-AZ + retries      | Connectivity checks
Also consider: hardware failure, deployment failures, traffic spikes, data corruption, human/operational errors.

5. Resilience Requirements

  • Graceful degradation modes (ordered by severity):
      1.
      2.
  • Recovery Time Objective (RTO):
  • Recovery Point Objective (RPO):
  • Idempotency requirements:
  • Timeout strategy (default + per-operation):

6. Observability Requirements

  • Key metrics:
  • Critical logs (with context fields):
  • Traces / spans to capture:
  • Alerting rules:

7. Testing Requirements

  • Chaos / failure scenarios to test:
  • Load / stress scenarios:
  • Monitoring coverage target:

8. Deployment and Rollback

  • Rollout plan (canary percentages, duration):
  • Rollback criteria:
  • Feature flag usage:

Sign-off: Architecture/SRE: ___________ | Product: ___________ | Date: ___________


Template 2: Post-Mortem (Incident Review)

Incident Title: [Short description]
Incident ID: INC-XXXX | Severity: SEV-1 / SEV-2 / SEV-3
Date/Time: [Start] → [End] | Owner: [Incident Commander]

1. Executive Summary

  • What happened (1–2 sentences):
  • Impact (duration, users affected, revenue impact):
  • Root cause (high-level):

2. Timeline

Time (UTC)       | Event / Action | Actor
YYYY-MM-DD HH:MM | Alert fired    | System
...              | ...            | ...

3. Root Cause Analysis (5 Whys)

  1. Why did the outage occur?
  2. Why did that happen?
  3. Why did that happen?
  4. Why did that happen?
  5. Underlying cause:

Contributing factors:

4. What Went Well

5. What Could Have Been Better

6. Action Items

Action                  | Owner | Due Date | Type    | Status
Add chaos test for X    |       |          | Prevent | Open
Improve monitoring on Y |       |          | Detect  | Open

Types: Prevent / Detect / Mitigate

7. Lessons Learned

  • Key learnings:
  • Process / documentation changes:
  • Reliability Spec updates required:

Follow-up Review Date: [Date]


Template 3: Failure Mode & Effects Analysis (FMEA)

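Use this when you want to score and prioritise failure modes systematically. RPN = Risk Priority Number (Severity × Likelihood × Detectability, each scored 1–10). Score Detectability so that 10 means hardest to detect; otherwise a well-monitored failure would push the RPN up instead of down.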

Failure Mode | Cause | Effect | Severity | Likelihood | Detectability | RPN | Mitigation

Higher RPN = address first.
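
If the table grows beyond a handful of rows, the scoring and sorting are easy to script. This is a sketch only; the `FailureMode` class is my own naming, and any scores you feed it are your team's judgement calls.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int       # 1-10, 10 = worst impact
    likelihood: int     # 1-10, 10 = most likely
    detectability: int  # 1-10, 10 = hardest to detect

    def __post_init__(self):
        for field_name in ("severity", "likelihood", "detectability"):
            score = getattr(self, field_name)
            if not 1 <= score <= 10:
                raise ValueError(f"{field_name} must be between 1 and 10, got {score}")

    @property
    def rpn(self):
        return self.severity * self.likelihood * self.detectability


def prioritise(modes):
    """Highest RPN first."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)
```

For example, a failure mode scored 8, 6, and 3 has an RPN of 8 × 6 × 3 = 144 and would be addressed before one scored 9, 2, and 7 (RPN 126), even though the second is more severe.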


Template 4: Pre-Launch Reliability Checklist

Run this before every major release. Keep it short enough that the team actually uses it.

  • SLOs defined and actively monitored
  • Chaos tests executed for the top 3 failure modes
  • Observability complete: metrics, logs, traces, and alerts in place
  • Rollback tested and confirmed to work
  • Error budget remaining is positive
  • Runbooks written or updated for every new alert
  • Post-mortem process ready if needed

A Final Note

No system makes you immune to failure. Complexity always finds a way. The point of RFDS is not perfection — it is converting surprises into handled cases. Over time, your failure taxonomy grows, your runbooks get sharper, and the incidents that used to wake someone up at 3am become automated recoveries that nobody notices.

Start with one template. Use it on a real project. Adapt it. The act of filling it out — of writing down what can fail and what success actually looks like — changes how you build. That is where the value is.

The happy path is easy to build. Building something that holds up when the happy path disappears — that is the work. This system is how you do it without starting from scratch every time.