Practice production judgment before production teaches you the hard way.
Code Duck is an AI-driven incident simulator for DevOps, SRE, cloud, and platform engineers. Instead of memorizing commands or clicking through multiple-choice labs, learners work through realistic production-style incidents in natural language.
You investigate. Code Duck pushes back.
A short snippet from a real Kubernetes incident scenario. Notice the AI doesn't just give you the answer — it nudges your reasoning forward.
- HighMemoryUsage (3 pods OOMKilled)
- 5xxRate (spiked at 14:32)
Recent change: deploy at 14:25 by @backend.
The hard skill isn't command recall.
Most DevOps training overvalues command recall and undervalues operational reasoning. In real incidents, you rarely get a clean prompt — you get symptoms, partial logs, noisy alerts, unclear ownership, recent changes, pressure, and incomplete context.
Code Duck is built to help engineers practice how to think through that mess. The AI plays the role of the system, the logs, the teammate, the interviewer, and the coach — while you investigate, ask questions, form hypotheses, choose safe next steps, and explain your reasoning.
One scenario, run the way real on-call works.
Open-ended, non-linear investigation. You can go in the "wrong" direction, recover, ask better questions, and improve.
Pick a scenario
Pre-built incidents based on real-world production failure patterns. Pick by stack or skill area.
Investigate in natural language
Ask for logs, metrics, configs, recent changes, or context. No rigid command-match required.
Form hypotheses & choose next steps
Propose what's wrong, what you'd check next, and what action you'd take — with awareness of risk.
Get a coaching debrief
Feedback on troubleshooting flow, hypothesis quality, risk awareness, assumptions, and gaps to practice.
The rubric is built around judgment, not trivia.
Knowing every flag isn't the goal. Knowing what to investigate, why it matters, and what a safe next step looks like — that's the goal.
Troubleshooting flow
Symptoms → evidence → cause
Hypothesis quality
Reasoning from what you saw
Technical understanding
Concepts, even without exact syntax
Risk awareness
Avoiding "just restart everything"
Systems thinking
App / infra / deploy / cloud
Communication
Explaining what & why clearly
Assumptions
Spotting what you're inferring
Recovery from wrong turns
Backing out cleanly & learning
The kind of incidents you actually get paged for.
Each scenario is built around a real production failure pattern — not a tutorial.
App failing after a config change — OOMKilled in a loop, but the change looked harmless.
Unexpected infrastructure drift after a routine apply, and state isn't where you'd expect.
GitHub Actions deployment failing after a secrets rotation — with confusing error output.
Service degraded due to an IAM or networking misconfiguration with no obvious smoking gun.
Consumer lag spiking during broker pressure — and rebalances making everything worse.
Failover triggered, but the application is now in a connection-pool death spiral.
Join the Code Duck early access list.
Be one of the first to try Code Duck. No spam, just build updates and an invite when scenarios are ready for early testers.