“Production ready” gets thrown around like it means something. Feature complete? Passed QA? Has a deploy button?
None of that matters if nobody knows what to do when it breaks at 2 AM.
Production readiness isn’t about the code. It’s about the people and processes around the code. Can you respond to an incident without panic? Can someone who wasn’t in the original meetings figure out what’s happening? Can you roll back in minutes, not hours?
The 2 AM test
Here’s a simple heuristic: imagine an alert fires at 2 AM. A pager wakes someone up. What happens next?
- Do they know where to look?
- Is there a runbook, or are they digging through Slack history?
- Can they fix or mitigate without waking up three more people?
- Can they roll back if the fix doesn’t work?
If the answer to any of these is “no” or “maybe,” you’re not production ready. You’re production hopeful.
Ownership isn’t optional
Every system needs an owner. Not “the team” — a person. Someone whose name is attached to the on-call rotation, who gets paged when things break, who cares whether the runbooks are up to date.
If nobody owns it, nobody maintains it. The monitoring goes stale. The docs drift. The deploy scripts break. And when it fails, everyone points at everyone else.
Assign ownership before you launch. Write it down. Make it visible.
The minimum viable readiness checklist
Observability: Can you see what’s happening? Logs, metrics, traces for the paths that matter. Dashboards that show health at a glance. If you have to SSH into a box to understand what’s wrong, you’re not ready.
Alerting: Alerts that mean something. Not “CPU is at 80%” — that’s noise. “Error rate exceeded 1% for 5 minutes” — that’s actionable. Every alert should have a response, ideally linked to a runbook.
Runbooks: Short, current, tested. For the top 3-5 failure modes, what do you do? Where do you look? What can you try? Runbooks are not documentation — they’re emergency response guides.
Deployment: Automated. Repeatable. Auditable. If deploying requires a specific person or a specific machine, you’ve built a liability. If rolling back is scarier than pushing forward, your deploys are a risk factor.
Backups: They exist. They’re tested. Someone has actually restored from them, recently. Untested backups are not backups — they’re hope.
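To make the alerting item concrete, here is a minimal Python sketch of the “error rate exceeded 1% for 5 minutes” rule. It assumes you already collect per-minute request and error counts. The `MinuteSample` shape and the `should_alert` name are hypothetical, and reading “for 5 minutes” as “each of the last five one-minute samples is over the threshold” is one reasonable interpretation, not the only one.

```python
from dataclasses import dataclass


@dataclass
class MinuteSample:
    """Request and error counts for one minute (hypothetical shape)."""
    requests: int
    errors: int


def should_alert(samples, threshold=0.01, window=5):
    """Return True if the error rate exceeded `threshold` in each of the
    last `window` one-minute samples (oldest first, newest last)."""
    if len(samples) < window:
        return False  # not enough history to call the problem sustained
    recent = samples[-window:]
    return all(
        s.requests > 0 and s.errors / s.requests > threshold
        for s in recent
    )


if __name__ == "__main__":
    sustained = [MinuteSample(requests=1000, errors=20)] * 5
    one_spike = [MinuteSample(1000, 0)] * 4 + [MinuteSample(1000, 500)]
    print("sustained 2% errors:", should_alert(sustained))
    print("single bad minute:", should_alert(one_spike))
```

Requiring every minute in the window to breach the threshold is what keeps a single bad minute from paging anyone. In practice your monitoring system would usually express this rule for you. The sketch only shows the logic you are asking it to apply.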
What “tested” means
Load testing: does it handle expected traffic? Expected traffic plus 50%? What breaks first?
Failure testing: what happens when the database is slow? When a downstream service times out? When the network partitions?
These don’t need to be elaborate. A simple load test and one failure scenario put you ahead of most launches. The point is knowing where the limits are before users find them.
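As a rough illustration of how small this can be, the Python sketch below runs a basic load test against a handler and then repeats it with a simulated slow database. Everything here is an assumption made for the example: `run_load`, `percentile`, and both handlers are hypothetical stand-ins, and a real test would call your actual service instead of sleeping.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(handler, total_requests, concurrency):
    """Call `handler` total_requests times across `concurrency` threads.
    Returns (sorted latencies in seconds, number of failed calls)."""
    def one_call(_):
        start = time.perf_counter()
        try:
            handler()
            ok = True
        except Exception:
            ok = False  # count the failure instead of aborting the run
        return time.perf_counter() - start, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_call, range(total_requests)))

    latencies = sorted(elapsed for elapsed, _ in results)
    failures = sum(1 for _, ok in results if not ok)
    return latencies, failures


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def healthy_handler():
    time.sleep(0.001)  # stand-in for a normal request


def slow_db_handler():
    time.sleep(0.02)  # failure scenario: simulated slow database


if __name__ == "__main__":
    scenarios = [("healthy", healthy_handler), ("slow db", slow_db_handler)]
    for name, handler in scenarios:
        latencies, failures = run_load(handler, total_requests=50, concurrency=5)
        p95 = percentile(latencies, 95)
        print(f"{name}: p95={p95:.4f}s failures={failures}")
```

Comparing the p95 latency and failure count between the two runs gives a first answer to “what breaks first?” Raising `total_requests` and `concurrency` toward expected traffic plus 50% is the natural next step.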
Support readiness
Production systems have users. Users have problems. What happens when they report issues?
- Is there an intake process?
- Can support see what they need without escalating to engineering?
- Are the common issues documented?
Production readiness includes the human systems around the software.
After launch
Launch is not the end. Schedule a review at 30 days and 90 days:
- Are SLOs being met?
- Are incidents decreasing?
- Can the team sustain the operational load?
If the answer to any of these is no, you have work to do. Production readiness is maintained, not achieved once.
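For the SLO question, one common way to get a yes or no answer is an error budget: the number of failures the SLO allows over the review period, compared with the failures you actually had. The Python sketch below assumes a request-based availability SLO. The function name and the example numbers are hypothetical.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left for the period.

    slo_target is the success ratio promised, e.g. 0.999.
    A negative result means the SLO was missed.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # No traffic, or a 100% target: any failure exhausts the budget.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical 30-day review: 99.9% target, 1,000,000 requests, 400 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 400)
    print(f"error budget remaining: {remaining:.0%}")
```

A budget that is shrinking faster at the 90-day review than at the 30-day review is an early sign that the operational load is not sustainable, even if the SLO is still technically being met.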
The uncomfortable truth
Most systems that “go to production” aren’t production ready. They’re feature complete with fingers crossed. The first real incident reveals the gaps.
That’s not a failure of engineering — it’s a failure of process. Production readiness takes time and attention that’s easy to skip under deadline pressure.
But skipping it doesn’t save time. It just moves the cost to 2 AM, when everything is harder.
XIThing helps teams build systems that stay reliable after launch. Get in touch if your production readiness needs work.


