Early access · Teller’s Tech Labs

Practice production judgment before production teaches you the hard way.

Code Duck is an AI-driven incident simulator for DevOps, SRE, cloud, and platform engineers. Instead of memorizing commands or clicking through multiple-choice labs, learners work through realistic production-style incidents in natural language. No clusters to spin up, no environments to clean up — just open the browser and start.

Join Early Access · See How It Works

Browser-based · No setup · Kubernetes · Terraform · AWS · CI/CD · Incident Response · On-Call Debugging · Interview Prep

Open the browser. Investigate the incident.

No setup. No cleanup. No surprise bills.

Most DevOps practice tools ask you to provision a cluster, configure tooling, and remember to tear it down. Code Duck does none of that.

Nothing to provision

No clusters to spin up. No state files. No cloud account. Scenarios load in your browser in seconds.

Nothing to break

Each scenario is a fresh, isolated environment. Roll back, blow it up, re-run from scratch as many times as you want.

Nothing to clean up

Close the tab when you’re done. No teardown, no orphaned resources, no surprise charges next month.

What It Looks Like

You investigate. Code Duck pushes back.

Watch a session play out: you investigate in natural language, Code Duck answers with the real signals, and it nudges your reasoning instead of handing you the answer.

you: What does the on-call queue look like right now?

duck: One alert firing: checkout success rate dropped 99.2% → 61.1% in us-east-1.
Most-impacted service: orders-api.
A deploy + config update landed ~4 min before the alert. That’s the obvious first thread to pull.

you: kubectl get pods -n orders

duck:
NAME                          READY   STATUS    RESTARTS   AGE
orders-api-7bf6b6d77f-abc12   0/1     Running   3          12m
orders-api-7bf6b6d77f-def34   0/1     Running   2          12m
orders-api-6cb5b96d6f-xyz99   1/1     Running   0          2d

Two new pods are Running but never Ready, so they take no traffic; only xyz99 (the previous revision) is serving. Running isn’t the same as healthy here.

you: kubectl logs orders-api-7bf6b6d77f-abc12 -n orders

duck:
2026-03-30T11:48:09Z INFO starting orders-api
2026-03-30T11:48:10Z INFO loading configuration
2026-03-30T11:48:10Z ERROR required environment variable PAYMENT_API_URL is not set
2026-03-30T11:48:10Z ERROR startup validation failed
2026-03-30T11:48:10Z INFO health endpoint returning 500 until startup validation succeeds

Same error in both pods on every restart: PAYMENT_API_URL is unset, so startup validation fails and readiness never flips. This isn’t a flaky probe.

you: Is rollback safe?

duck: Yes. Revision 18 was healthy and the failure starts exactly at this rollout. A restart wouldn’t help (it reloads the same broken config); rollback restores the working value.

you: Roll back to revision 18, and post that checkout is degraded for some merchants.

↳ evidence first, then a safe action — that's the move.
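The failure pattern in the session above (a required setting is missing, so the health endpoint keeps returning 500 and the pod never becomes Ready) can be sketched in a few lines. This is a minimal illustration, not the orders-api code: only PAYMENT_API_URL comes from the transcript, and the function names and sample values are hypothetical.

```python
from typing import Mapping, Sequence

# Hypothetical list of required settings; only PAYMENT_API_URL appears in the session.
REQUIRED_ENV_VARS = ("PAYMENT_API_URL",)


def missing_env_vars(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_ENV_VARS
) -> list[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


def health_status(env: Mapping[str, str]) -> int:
    """HTTP status a health endpoint might return: 200 once validation passes, else 500."""
    return 500 if missing_env_vars(env) else 200


if __name__ == "__main__":
    broken = {"LOG_LEVEL": "info"}  # the new rollout: variable missing
    working = {"LOG_LEVEL": "info", "PAYMENT_API_URL": "https://payments.example.internal"}
    print(missing_env_vars(broken), health_status(broken))
    print(missing_env_vars(working), health_status(working))
```

Restarting a pod re-runs the same check against the same environment, which is why a restart cannot clear this failure while a rollback to a revision that sets the variable can.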

you: What’s the current state of the incident?

duck: The ALB is throwing 503s, and the routine terraform apply from ~10 min ago is the prime suspect.
But that apply logged no changes, so it may be a red herring.

you: terraform plan

duck:
  # aws_security_group.alb will be updated in-place
  ~ resource "aws_security_group" "alb" {
      + ingress {
          + from_port   = 443
          + to_port     = 443
          + protocol    = "tcp"
          + cidr_blocks = ["0.0.0.0/0"]
        }
    }

Plan: 0 to add, 1 to change, 0 to destroy.

State and config disagree on one rule: the live security group is missing the 443 ingress the config still expects. With HTTPS closed at the SG, the ALB has nowhere to send traffic, hence the 503s.

you: Who changed the security group?

duck: CloudTrail shows a manual console edit ~25 min ago that removed the 443 ingress rule. The apply 10 min ago ran with -refresh=false, so it never saw the drift and logged no diff, which is exactly why this looked like the deploy.

you: Is it safe to re-apply to restore the rule?

duck: Low risk. The diff is just that one rule. A refreshed apply re-adds 443 and reconverges state with config; nothing else is touched.

you: Re-apply to restore the 443 ingress, and flag the manual change in the incident channel.

↳ you found the drift instead of force-pushing state — clean recovery.
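Drift of the kind found in the session above is, at its core, a set difference between the rules the configuration declares and the rules that actually exist. The sketch below illustrates that idea only; it is not how Terraform computes a plan, and the type and function names are hypothetical.

```python
from typing import NamedTuple


class IngressRule(NamedTuple):
    from_port: int
    to_port: int
    protocol: str
    cidr: str


def ingress_drift(
    configured: set[IngressRule], live: set[IngressRule]
) -> dict[str, set[IngressRule]]:
    """Compare declared rules with live rules and report both directions of drift."""
    return {
        "missing_from_live": configured - live,   # declared, but removed out of band
        "unexpected_in_live": live - configured,  # present, but never declared
    }


# The one rule from the plan output in the session.
HTTPS_FROM_ANYWHERE = IngressRule(443, 443, "tcp", "0.0.0.0/0")

if __name__ == "__main__":
    configured = {HTTPS_FROM_ANYWHERE}
    live: set[IngressRule] = set()  # state after the manual console edit
    print(ingress_drift(configured, live))
```

A comparison like this only finds drift if the live side is actually read. That is the step the earlier apply skipped by running with -refresh=false.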

you: What’s failing?

duck: The deploy workflow has failed on main 4 times since 09:10. Every run dies at the same step, “Push image to ECR” (exit 1).
A secrets rotation ran at 09:05, five minutes before the first failure.

you: Show me the failing step logs.

duck:
Run docker push 491516382134.dkr.ecr.us-east-1.amazonaws.com/orders-api:sha-9f3c1a
The push refers to repository [491516382134.dkr.ecr.us-east-1.amazonaws.com/orders-api]
denied: authentication required
Error: Process completed with exit code 1.

Docker never authenticated: the login step ran with a blank AWS_SECRET_KEY, so the registry rejects the push. An empty value points at the secret, not the image.

you: Did the rotation rename the registry secret?

duck: Yes. The rotation created AWS_SECRET_ACCESS_KEY, but the workflow still references the old name AWS_SECRET_KEY, which now resolves to empty.

you: So the login runs with no credentials?

duck: Exactly. The image build and app code are fine. It’s a config reference, not a code regression.

you: Update the workflow to the new secret name and re-run the deploy.

↳ traced it to config, not code — no needless rollback.
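A stale secret reference like the one in the session above can be caught before a deploy by comparing the names a workflow references with the names that are actually defined. The sketch below is one way to do that under stated assumptions: the workflow fragment is hypothetical, the list of defined secrets is supplied by the caller, and names are compared case-insensitively as a simplification.

```python
import re

# Matches GitHub Actions expressions of the form ${{ secrets.NAME }}.
SECRET_REF = re.compile(r"\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def undefined_secret_refs(workflow_text: str, defined_secrets: set[str]) -> list[str]:
    """Return secret names the workflow references but the caller's set does not define."""
    # Assumption of this sketch: names are compared case-insensitively.
    referenced = {name.upper() for name in SECRET_REF.findall(workflow_text)}
    defined = {name.upper() for name in defined_secrets}
    return sorted(referenced - defined)


# Hypothetical workflow fragment that still uses the pre-rotation name.
WORKFLOW = """
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_KEY }}
"""

if __name__ == "__main__":
    after_rotation = {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}
    print(undefined_secret_refs(WORKFLOW, after_rotation))
```

Because an undefined secret resolves to an empty value rather than failing the run, a check like this surfaces the problem earlier than the registry's authentication error does.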

you: Give me the full picture: what’s degraded, and since when?

duck: api-gateway p99 climbed 180ms → 4.2s over ~12 min; error rate 0.1% → 7%, almost all 504s. Onset 13:48.
No deploy in the last 3h, and three services sit behind this gateway.

you: Pull a trace for one slow request.

duck:
trace 9f2c1a · GET /checkout · total 4.18s
api-gateway              4180ms  504
├─ orders-svc              40ms  200
└─ payments-svc          3902ms  timeout
   └─ ext-processor      3500ms  (slow)

About 93% of the request is the payments-svc call, yet payments-svc CPU is only 22%.

you: If its CPU is fine, what’s it waiting on?

duck: payments-svc calls an external processor whose p99 jumped 200ms → 3.5s at 13:46, two minutes before our onset. The trigger is upstream of you.

you: Is our database the bottleneck?

duck:
rds-prod-orders   CPU 35%   connections 88/100   query-p99 6ms

No. Connections are elevated but queries are 6ms. The threads are blocked on the processor call, not the DB.

you: Then why are search and account 504ing too?

duck:
api-gateway · connection pool
in_use     200 / 200
waiting    140
rejected   1.2k/min  504

The shared pool is exhausted: slow payments calls hold every thread, so unrelated routes starve. And retries are 3× with no backoff, amplifying the load.

you: So the vendor is the trigger; retries + the shared pool are the amplifier. Mitigate blast radius before root cause?

duck: Right call. The root cause is the vendor’s latency, which you can’t fix from here. Fastest safe levers: trip the payments circuit breaker, drop retries to 1, and tighten the payments timeout to ~1s so threads release. That frees the pool for search + account immediately.

you: Trip the breaker on payments, retries→1, timeout 1s, and post: payments degraded, checkout impacted, search + account recovering.

↳ separated the trigger (vendor latency) from the amplifier (retries + shared pool) and contained blast radius before chasing root cause — senior-grade.
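The reason a shorter timeout and fewer retries free the shared pool so quickly follows from Little's law: the average number of threads held equals the call rate multiplied by how long each call holds a thread. The arithmetic below is a rough sketch. The pool size and latencies come from the session, but the request rate is a made-up number, and reading "3×" as three attempts per request and "retries→1" as two attempts is an interpretation.

```python
def threads_held(requests_per_second: float, hold_ms: float, attempts_per_request: int) -> float:
    """Little's law: average in-flight calls = arrival rate x time each call holds a thread."""
    return requests_per_second * attempts_per_request * hold_ms / 1000


POOL_SIZE = 200   # from the session: in_use 200 / 200
ASSUMED_RPS = 20  # hypothetical payments request rate, not from the session

# Before mitigation: each attempt holds a thread for ~3.9s, three attempts per request.
before = threads_held(ASSUMED_RPS, hold_ms=3900, attempts_per_request=3)

# After mitigation: 1s timeout, one retry (two attempts) per request.
after = threads_held(ASSUMED_RPS, hold_ms=1000, attempts_per_request=2)

if __name__ == "__main__":
    print(f"before: {before:.0f} threads wanted, pool has {POOL_SIZE}")
    print(f"after:  {after:.0f} threads wanted, pool has {POOL_SIZE}")
```

Under these assumptions the payments route alone wants more threads than the pool holds before mitigation and a fraction of it afterwards, which is the headroom search and account need to recover.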

What You Can Type

kubectl get pods -n orders

command

What changed recently?

question

I think it’s a missing env var

hypothesis

Roll back the deployment

action

I’d tell the team checkout success is down

communication

The Core Belief

The hard skill isn't command recall.

Most DevOps training overvalues command recall and undervalues operational reasoning. In real incidents, you rarely get a clean prompt — you get symptoms, partial logs, noisy alerts, unclear ownership, recent changes, pressure, and incomplete context.

Code Duck is built to help engineers practice how to think through that mess. The AI plays the role of the system, the logs, the teammate, the interviewer, and the coach — while you investigate, ask questions, form hypotheses, choose safe next steps, and explain your reasoning.

How It Works

One scenario, run the way real on-call works.

Open-ended, non-linear investigation. You can go in the "wrong" direction, recover, ask better questions, and improve.

Pick a scenario

Pre-built incidents based on real-world production failure patterns. Pick by stack or skill area.

Investigate in natural language

Ask for logs, metrics, configs, recent changes, or context. No rigid command-match required.

Form hypotheses & choose next steps

Propose what's wrong, what you'd check next, and what action you'd take — with awareness of risk.

Get a coaching debrief

Feedback on troubleshooting flow, hypothesis quality, risk awareness, assumptions, and gaps to practice.

How You Use It

Three modes for how deep you want to go.

The same scenarios, three different relationships with feedback. Pick what matches your moment.

Practice

Hints + coaching

Open-ended investigation with proactive hints during play and a full coaching debrief at the end. Good when you’re building muscle memory or learning a new stack.

Assessment

No hints, recap only

Open-ended, but the AI stays out of your way until you’re done. Recap arrives at the end. Closest to a real on-call shift — or a take-home interview.

Guided

Optional nudges

Open-ended with light nudges available on request. Useful when you want to go solo but keep a safety net for when you’re truly stuck.

What It Evaluates

The rubric is built around judgment, not trivia.

Knowing every flag isn't the goal. Knowing what to investigate, why it matters, and what a safe next step looks like — that's the goal.

Troubleshooting flow

Symptoms → evidence → cause

Hypothesis quality

Reasoning from what you saw

Technical understanding

Concepts, even without exact syntax

Risk awareness

Avoiding "just restart everything"

Systems thinking

App / infra / deploy / cloud

Communication

Explaining what & why clearly

Assumptions

Spotting what you're inferring

Recovery from wrong turns

Backing out cleanly & learning

The six scoring categories

evidence_gathering · prioritization · action_safety · communication · missed_signals · stronger_next_steps

The Debrief

Every session ends with a coaching recap.

When you finish, Code Duck grades the session against the six categories above — not on whether you typed the perfect command, but on how you reasoned. Here’s an illustrative recap of the kind of investigation shown earlier.

Orders API degraded after deploy · SEV2 · checkout success rate down 38%

evidence_gathering Strong

You gathered evidence broadly and used multiple data sources to build a picture.

Checked pod status and restart count
Inspected pod details for readiness probe information
Reviewed pod logs to understand startup failures

prioritization Strong

You prioritised the right signals and formed a clear investigation path.

Correlated symptoms with recent changes early
Formed a hypothesis to guide investigation

action_safety Strong

Actions were well-grounded in evidence and posed low additional risk.

Gathered evidence (logs) before proposing disruptive actions
Considered rollback as a potential safe action
Proposed a safe resolution path with evidence

communication Solid

Some communication happened but could be more detailed.

Provided a status update (could include more impact detail)
Status update lacked specific customer impact information

missed_signals

Both failing pods (abc12 and def34) logged the same startup error — a signal this was a fleet-wide config problem, not one bad pod.

stronger_next_steps

Communicate customer impact to stakeholders with current status.
Strong session — consider documenting this as a post-mortem template.

Overall

Solid session with some areas for improvement. You investigated systematically and chose a safe remediation — tightening stakeholder communication is the fastest win.

Illustrative recap of the scenario shown above. Real recaps are generated from your session — the evidence you gathered, the order you gathered it in, and the actions you proposed.

Example Scenarios

The kind of incidents you actually get paged for.

Each scenario is built around a real production failure pattern — not a tutorial.

Kubernetes

App failing after a config change — OOMKilled in a loop, but the change looked harmless.

Terraform

Unexpected infrastructure drift after a routine apply, and state isn't where you'd expect.

CI/CD

GitHub Actions deployment failing after a secrets rotation — with confusing error output.

AWS

Service degraded due to an IAM or networking misconfiguration with no obvious smoking gun.

Kafka

Consumer lag spiking during broker pressure — and rebalances making everything worse.

RDS