0:00
A repo named Private-CISA was reportedly public
0:03
on GitHub, which is a pretty painful reminder
0:06
that naming something private is not access control.
0:10
This week, we've got leaked AWS keys, internal
0:13
deployment docs, Terraform, Kubernetes manifests,
0:17
Argo CD files, AI root cause analysis, agents
0:22
clicking through legacy apps. Claude Code showing
0:25
up in CI/CD. And another Kubernetes security reminder
0:29
hiding in a boring default. The theme is pretty
0:32
simple. Automation is getting more powerful.
0:35
But we are still leaking secrets, missing defaults,
0:39
and writing postmortem action items that quietly
0:42
go to die. I'm Brian Teller from Teller's Tech,
0:46
and this is Ship It Weekly. Welcome back to Ship
1:06
It Weekly, the show where we look at the DevOps,
1:08
SRE, cloud, platform, and security stories that
1:12
actually matter when you are the person who eventually
1:15
has to keep the thing running. This week, we're
1:18
starting with the CISA contractor GitHub leak.
1:21
Then we'll get into AWS DevOps Agent and automated
1:25
root cause analysis, Microsoft Copilot Studio
1:28
computer-using agents, Atlassian adding Claude
1:32
Code to Bitbucket Pipelines, and CVE-2026-46333
1:38
with the Kubernetes seccomp angle. Then in the
1:41
lightning round, we'll hit GitHub expanding OIDC
1:44
support for Dependabot and code scanning. Java
1:47
pods getting OOMKilled even when heap looks
1:50
fine. And why LLM-generated SQL can be wrong
1:54
in ways that still run. And for the human closer,
1:57
we'll talk about why postmortem action items
2:00
die. So let's get into it. First up, the big
2:08
one. A contractor for CISA reportedly had a public
2:12
GitHub repository named Private-CISA. Again,
2:17
Private-CISA. That is like naming an S3 bucket
2:22
definitely not customer data and hoping AWS respects
2:26
the vibe. GitGuardian says it found the public
2:29
repo on May 14th containing around 844 megabytes
2:33
of exposed data. Reporting included plain text
2:37
passwords, AWS tokens, Entra ID SAML certificates,
2:41
and internal material tied to CISA systems. And
2:46
this is the part that matters for us. This was
2:48
not just one forgotten .env file. The reporting
2:52
describes CI/CD logs, Kubernetes manifests, Argo
2:56
CD files, Terraform code, GitHub Actions workflows,
3:01
deployment details, and internal docs. This is
3:05
not only a secret leak. This is context leakage.
3:08
Credentials are bad. AWS tokens are bad. Plain
3:12
text passwords are bad. But when you also leak
3:15
deployment docs, infrastructure code, manifests,
3:19
CI/CD logs, and internal workflows, you are giving
3:23
someone more than a key. You are giving them
3:25
the floor plan. Now, we should be careful. From
3:28
the outside, we do not know the full operational
3:30
impact. Public reporting says CISA was notified
3:34
and the repo was taken down. But the lesson is
3:38
already clear. Secrets scanning matters, but it
3:41
is not enough. The repo should not have been
3:44
public. The archive should not have been there.
3:46
The credentials should not have been valid. And
3:49
once operational context leaks, the response
3:52
cannot just be rotate the key and move on. You
3:55
need to ask what the leaked context teaches an
3:58
attacker. Most teams hear a story like this and
4:01
think, wow, how does this happen? And then somewhere
4:04
in their own org, there is a repo called old
4:07
prod migration backup final last touched by a
4:10
contractor in 2022, containing a zip file nobody
4:14
wants to open because it feels cursed. Most of
4:17
us have some version of this risk. Old repos,
4:20
personal forks, contractor projects, build logs,
4:24
Terraform state, kube configs, migration scripts,
4:28
weird backup folders. So the takeaway is not
4:31
just don't commit secrets. That is true, but
4:33
it is table stakes. The better takeaway is to
4:36
inventory the weird stuff. Look for old repos,
4:40
archived repos, contractor-owned repos, personal
4:44
forks, internal examples, demo projects, and
4:47
places where somebody may have dumped operational
4:50
context while trying to move fast. And when a
4:54
leak happens, treat the context as compromised
4:56
too, because sometimes the key is the headline,
4:59
but the deployment map is the real prize. Next
5:07
up, AWS published a post on using AWS DevOps
5:10
Agent to automate root cause analysis across
5:13
Datadog and Elasticsearch with CloudTrail and
5:17
EKS involved too. This one sits right at the
5:20
intersection of useful and slightly unsettling.
5:24
The AWS example starts with a Datadog alert.
5:27
AWS DevOps Agent gets access to EKS, so it can
5:31
describe Kubernetes objects, pull pod logs, and
5:35
look at cluster events. It also correlates Elasticsearch
5:38
logs, Datadog metrics, and CloudTrail deployment
5:42
events to figure out what changed. And honestly,
5:45
that sounds a lot like incident response. You
5:47
start with one symptom. Then you try to rebuild
5:50
the timeline. What changed? What deployed? Which
5:54
pod restarted? What metric moved first? Did someone
5:58
push a config change? Did a dependency decide
6:00
today was a great day to build character? Half
6:03
of incident response is not fix the thing. Half
6:07
of it is reconstructing the story fast enough
6:09
that you fix the right thing. So if an agent
6:12
can gather context and build a decent first-pass
6:15
timeline, that is useful. Logs over here. Metrics
6:19
over there. CloudTrail in another tab. Kubernetes
6:23
events in a terminal. Datadog on one monitor
6:26
and Slack on another, and someone asking for an update
6:29
every 90 seconds, which is understandable and
6:32
still emotionally damaging. But automated RCA
6:35
can become very convincing very quickly. A tool
6:39
that says probable root cause with a clean summary
6:41
can easily become the thing everybody believes,
6:44
especially when the incident channel is noisy
6:46
and everybody is tired. So I like the direction,
6:49
but I would treat AI RCA like a fast incident
6:52
scribe or junior investigator. Great at pulling
6:55
threads. Useful for assembling context. Helpful
6:59
for reducing time to understanding. But not the
7:02
final authority on causality. Humans still need
7:06
to ask, is this correlation or cause? Did this
7:09
happen before or after customer impact? What
7:12
else changed? What evidence would prove this
7:14
wrong? Because if the incident review action
7:17
item is that AI said it was Elasticsearch, so
7:20
we restarted Elasticsearch, that is not RCA.
7:23
That is vibes with a dashboard. Before we get
7:26
to the next story, a quick note from this week's
7:28
sponsor, Guardsquare. If you are building mobile
7:30
apps, good enough security is usually a problem
7:34
waiting to happen. Guardsquare focuses on actually
7:37
protecting your code in addition to scanning
7:40
it. That means code hardening, runtime protection,
7:43
testing, and visibility into what's happening
7:46
once your app is out in the wild. So if you are
7:49
responsible for shipping and securing mobile
7:51
apps, Android or iOS, it's definitely worth taking
7:55
a look at guardsquare.com. Alright, back to
7:59
the show. Third story. Microsoft says
8:07
computer-using agents in Copilot Studio are now generally
8:11
available. This one is interesting because it
8:13
changes the automation boundary. A lot of automation
8:16
assumes there is an API. You call an endpoint.
8:20
You get a response. You wire it into a workflow.
8:23
Everyone pretends the internal CRM is not held
8:26
together by three workflows, a CSV export, and
8:30
one person named Linda. Computer-using agents
8:33
are different. Microsoft describes these agents
8:35
as interacting with graphical user interfaces,
8:38
websites, desktop apps, screens, buttons, forms.
8:43
So instead of saying there is no API, so we cannot
8:46
automate this, the pitch becomes the agent can
8:49
use the UI like a human. And look. I get why
8:53
this is appealing. Every company has legacy systems.
8:56
Every company has some vendor portal that looks
8:59
like it was designed during the emotional low
9:01
point of enterprise software. Every company has
9:04
workflows where someone copies data from one
9:06
screen into another and calls it a process. If
9:09
an agent can take some of that away, great. Nobody
9:12
needs a fulfilling career in manually clicking
9:15
invoice screens. But this is also where the risk
9:18
gets weird. API automation usually gives you
9:21
structure: scopes, endpoints, logs, schemas.
9:26
UI automation is messier. The agent is looking
9:28
at screens, reading labels, clicking buttons,
9:31
entering text, and deciding what to do next.
9:34
Which means your automation path may now include
9:37
a model looking at a webpage and deciding which
9:40
button seems right. That sounds funny until the
9:43
button says submit payment, delete record, approve
9:46
request, or yes, I understand this is permanent.
9:50
So the governance questions matter. What apps
9:53
can the agent access? What account does it use?
9:57
Can it reach production admin screens? Are actions
10:00
auditable? Can it pause before destructive steps?
10:03
What happens if the UI changes slightly? And
10:06
who owns the outcome when it clicks the wrong
10:08
thing? Computer using agents may become useful
10:11
because plenty of enterprise systems will never
10:14
get good APIs. But when the agent operates through
10:17
a UI, treat that UI like an automation interface.
10:21
Use restricted accounts. Use test environments.
10:25
Use approval gates. Log actions. Limit destructive
10:29
workflows. And be very suspicious of anything
10:32
involving bulk update. A computer-using agent
10:35
is not just a better macro. It is automation
10:38
with eyeballs. And like most things with eyeballs
10:41
in enterprise software, it probably needs supervision.
10:48
Fourth story. Atlassian says agentic pipelines
10:51
now support Claude Code as a provider in Bitbucket
10:55
pipelines. This sounds like a feature announcement
10:58
until you think about where it lives. It lives
11:01
in CI/CD. Atlassian's examples include README
11:05
updates, security report triage, feature flag
11:09
cleanup, PR descriptions, and other repetitive
11:12
engineering chores. And this keeps coming back
11:15
to a point I've made before. Developer tooling
11:18
is production now. CI/CD is not just the stuff
11:21
around production. It is the path code takes
11:24
to become production. So when you put agents
11:27
inside pipeline workflows, you are not just making
11:29
developers faster. You are changing the delivery
11:32
path. And yes, some of these tasks are genuinely
11:35
annoying. Security report triage, feature flag
11:39
cleanup, PR descriptions, documentation updates.
11:43
Please take them. Nobody is sitting around hoping
11:45
for more stale flags and slightly wrong README
11:49
files. But once an AI agent is part of a pipeline,
11:52
the boring questions matter. What repository
11:55
context does it get? What logs does it see? What
11:59
secrets are visible to that step? Can it modify
12:02
files? Can it open PRs? Can it change tests?
12:06
Can it generate security triage notes that humans
12:09
treat as authoritative? Can it make a flaky test
12:11
look fixed by weakening the assertion? That last
12:15
one is not me being dramatic. If the task is
12:18
fix the failing pipeline, a bad agentic workflow
12:21
might make the pipeline pass without making the
12:24
system better. And yes, humans do this too. We
12:27
just call it temporary and let it survive three
12:29
reorgs. Atlassian also points users to guidance
12:33
around third-party agent providers and data
12:36
handling. That matters because if you bring Claude
12:38
Code into a pipeline, you need to understand
12:41
what code, prompts, logs, and generated context
12:45
may leave your environment. That does not mean
12:48
do not use it. It means do not discover your
12:51
data flow by reading the invoice. Treat agentic
12:54
pipeline steps like any other privileged CI step.
12:58
Start with low-risk tasks. Avoid secrets. Avoid
13:02
production deploy authority. Make outputs reviewable.
13:06
Require human approval when the agent changes
13:08
code. And document what data goes to the provider.
13:12
The agent does not need to be terrifying, but
13:14
it also should not be a surprise guest in your
13:17
delivery path. Fifth story. Let's talk about
13:24
CVE-2026-46333 and Kubernetes seccomp defaults.
13:31
This one is more technical, but it's a good grounding
13:34
story after all of the AI agent stuff. Because
13:37
sometimes the most important security decision
13:39
is not a new tool or a new model. Sometimes it
13:43
is whether your pods are running with the syscall
13:45
profile that you thought they were running with.
13:48
NVD tracks CVE-2026-46333 as a Linux kernel
13:53
issue in the ptrace path. Qualys describes
13:56
it as a local privilege escalation and credential
14:00
disclosure issue. The Kubernetes angle is that
14:03
unset or unconfined seccomp profiles can leave
14:07
pods exposed to the tested path, while RuntimeDefault
14:10
blocked the tested pidfd_getfd path.
14:14
PSS restricted added more protection as well.
14:17
The operator version is simple. Your Kubernetes
14:20
security defaults matter. And they matter most
14:22
when nobody is thinking about them. Seccomp is
14:25
easy to mentally file under container security
14:28
stuff we should revisit someday. Which is engineering
14:31
speak for future incident seasoning. Kubernetes
14:34
docs say that if seccompDefault is enabled,
14:37
pods use the RuntimeDefault seccomp profile
14:40
when no other profile is specified. Otherwise,
14:44
the default is unconfined. That difference matters.
14:47
Because unset does not always mean safe. Sometimes
14:51
unset means congratulations. You are raw-dogging
14:54
syscalls in production. Probably do not put that
14:57
in your architecture diagram. So the takeaway
14:59
is straightforward. Check whether RuntimeDefault
15:02
is actually being applied in your clusters. Review
15:05
your pod security standards posture. Know where
15:08
unconfined seccomp is allowed. Know where privileged
15:11
pods exist. Know which namespaces have exceptions.
15:16
And if you use managed Kubernetes, do not assume
15:19
the provider magically made your pod security
15:21
posture sane because the control plane has a
15:24
nice logo. Containers share the host kernel.
15:27
That is the deal. So when there is a local kernel
15:30
bug and your workload security posture allows
15:33
the relevant path, the cluster configuration
15:36
suddenly matters a lot. The boring defaults are
15:39
not boring. They are latent decisions. And every
15:42
once in a while, a CVE shows up and asks what
15:46
you decided. Now let's do a quick lightning round.
15:56
First, GitHub expanded OIDC support for Dependabot
16:00
and code scanning. GitHub says that Dependabot
16:02
and code scanning now support OpenID Connect
16:05
authentication for organization-level private
16:08
registries for Cloudsmith and Google Artifact
16:12
Registry. The short version is fewer long-lived
16:15
registry credentials sitting around as secrets.
16:18
And that is good. OIDC-based access is not magic.
16:21
But short-lived identity-based auth is usually
16:25
healthier than, here's a token, please don't
16:27
leak it. Best of luck to everyone involved. Second,
16:30
Java pods getting OOMKilled in Kubernetes even
16:34
when the heap looks fine. Classic operator trap.
16:37
Someone sets -Xmx, looks at heap, and thinks we're
16:41
fine. Then Kubernetes kills the pod anyway. And
16:44
everyone stares at the graphs like the cluster
16:47
betrayed them personally. But JVM heap is not
16:50
the whole memory footprint. Metaspace, direct
16:53
buffers, thread stacks, native memory, JIT, GC
16:57
overhead, and off-heap usage still count towards
17:01
the container memory limit. Kubernetes does not
17:03
care that your heap looked reasonable. It cares
17:06
that the process crossed the limit. So if your
17:09
Java pods are getting OOMKilled, do not only
17:12
look at -Xmx. Look at total container memory and
17:15
leave headroom. Because production loves punishing
17:18
tight memory math. Third, LLM-generated SQL
17:21
can be wrong in ways that still run. A broken
17:25
query that fails loudly is annoying, but at least
17:28
you know it failed. A query that returns plausible
17:31
nonsense is worse. The dashboard loads. The numbers
17:35
look reasonable. Someone puts it in a slide deck.
17:38
And now your business metric is powered by a
17:41
hallucinated join and a missing filter. So text
17:44
to SQL needs guardrails. Read-only roles. Query
17:48
limits. Schema-aware validation. Known templates
17:52
where possible. Human review for anything important.
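A rough sketch of a few of those guardrails, not a complete answer. It uses SQLite only because it ships with Python; nothing in the story names a database, and `make_read_only` and `run_guarded` are hypothetical helpers made up for illustration. It covers the read-only, row-limit, and schema-validation pieces. It does not catch a query that is valid but quietly wrong, which is why the human review step stays on the list.

```python
import sqlite3


def make_read_only(conn):
    # SQLite-specific: ask this connection to refuse writes from now on.
    conn.execute("PRAGMA query_only = ON")


def run_guarded(conn, sql, max_rows=100):
    """Run one generated SELECT with basic guardrails (illustrative only)."""
    statement = sql.strip().rstrip(";").strip()
    # Crude single-statement check; it also rejects a ';' inside a string literal.
    if ";" in statement:
        raise ValueError("only one statement is allowed")
    if not statement.lower().startswith("select"):
        raise ValueError("only SELECT queries are allowed")
    # Schema-aware validation: preparing EXPLAIN fails on unknown tables
    # or columns without running the real query.
    conn.execute("EXPLAIN " + statement)
    # Query limit: cap the number of rows handed back to the caller.
    cursor = conn.execute(f"SELECT * FROM ({statement}) LIMIT ?", (max_rows,))
    return cursor.fetchall()
```

A query that references a column that does not exist fails loudly at the `EXPLAIN` step. A query with a missing filter still runs, so this narrows the blast radius rather than proving correctness.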
17:55
The danger is not always that AI writes broken
17:58
SQL. Sometimes it writes SQL that is wrong quietly.
18:02
And quiet wrongness is how bad decisions get
18:05
confidence. The Human Closer this week is about
18:16
postmortem action items. Because if there is
18:18
one place that engineering organizations consistently
18:21
lie to themselves, it is the bottom of a postmortem
18:25
document. Not maliciously, just optimistically.
18:28
The incident happens. People jump in. The team
18:31
writes a timeline. There is a good discussion.
18:33
Nobody blames anyone. Everybody agrees on what
18:37
went wrong. Then the action items show up. Improve
18:40
monitoring. Review runbooks. Add better alerting.
18:44
Investigate retry behavior. Clean up ownership.
18:47
These are not action items. These are wishes
18:50
wearing a Jira costume. incident.io had a good
18:54
piece on why postmortem action items die. The
18:57
reasons are painfully familiar. No named owner,
19:00
wrong tracking place, vague wording, and no
19:04
follow-up cadence. That's basically the whole game.
19:07
A postmortem action item without an owner is
19:09
not an action item. It's a group hallucination.
19:13
An action item that lives in a doc that nobody
19:15
opens again is documentation tax. And an action
19:19
item that says improve monitoring is not an action
19:22
item. It is a mood. A real action item sounds
19:25
more like Maria adds a replication lag alert
19:28
for the payment database by Friday. Or Kevin
19:32
removes production deploy access from the old
19:34
CI token before the next release. Named owner.
19:38
Specific verb. Clear outcome. Real tracking
19:43
location. Due date. That does not make the work
19:46
easy, but it makes it real. This connects back
19:48
to every story this week. The CISA leak is not
19:52
fixed by saying review GitHub practices. AI RCA
19:56
is not useful if the follow-up is improve incident
19:59
response. Computer-using agents are not safe
20:03
because someone wrote ensure governance. Kubernetes
20:06
seccomp is not handled because someone says
20:09
harden workloads. This is the staff and principal
20:12
engineer part of the job that does not always
20:15
look exciting on a roadmap. Turning vague risk
20:18
into specific work. Turning incidents into system
20:22
changes. That is where reliability actually improves.
20:25
Because production does not care how good the
20:28
postmortem sounded. Production cares whether
20:31
anything changed. Okay, that's it for this week
20:34
of Ship It Weekly. We covered the CISA contractor
20:37
GitHub leak, AWS DevOps Agent and automated root
20:41
cause analysis, Microsoft Copilot Studio
20:45
computer-using agents, Atlassian agentic pipelines with
20:48
Claude Code, CVE-2026-46333, and Kubernetes
20:53
seccomp defaults, plus a lightning round on GitHub
20:57
OIDC support, Java pods getting OOMKilled, and
21:01
LLM-generated SQL. If this episode was useful,
21:04
follow or subscribe wherever you are listening
21:07
or watching. If you're on YouTube, hit subscribe.
21:09
If you're in a podcast app, follow the show there.
21:12
And if you know somebody on a DevOps, SRE, platform,
21:16
security, or engineering leadership team who
21:20
is dealing with secrets, agents, Kubernetes defaults,
21:23
or postmortem follow-up, send this one to them.
21:27
It helps the show grow. And honestly, it helps
21:30
me keep making this kind of content for people
21:32
who actually live with these systems. You can
21:35
find the weekly brief at OnCallBrief.com and
21:38
more episodes and show notes on
21:41
ShipItWeekly.fm. I'm Brian from Teller's Tech, and thanks
21:43
for listening. And remember, if your repo is
21:46
named Private-CISA, your agent can click buttons,
21:50
your pipeline can call Claude, and your postmortem
21:53
action item says improve monitoring, maybe take
21:56
a breath. Then go find the owner, the token, the
22:00
default, and the ticket. Because production does
22:02
not run on good intentions. It runs on the stuff
22:05
that someone actually fixed.
This episode is really about one idea: automation does not remove the boring work. It makes the boring work matter more.
That sounds backwards, because most automation is sold as a way to avoid the annoying parts. Less clicking. Less digging through logs. Less manual triage. Less “who owns this?” Less staring at a dashboard trying to remember which service writes to which topic, which database, in which region, for which customer path.
And honestly, I want that too.
Nobody gets into platform or SRE work because they want to spend their best years spelunking through CloudTrail, Kubernetes events, CI logs, and one Confluence page last updated by someone who left in 2021.
But the stories this week all point to the same uncomfortable thing.
The more powerful the automation gets, the more expensive your old mess becomes.
The CISA contractor GitHub leak is the blunt version. GitGuardian said it found a public repository called Private-CISA with 844 MB of exposed material, including plaintext passwords, AWS tokens, and Entra ID SAML certificates. KrebsOnSecurity also reported that the repo exposed credentials for AWS GovCloud accounts and files showing how CISA builds, tests, and deploys software internally. (blog.gitguardian.com)
That is not just a “whoops, rotate the key” story.
That is context exposure.
A leaked credential is bad. A leaked credential plus Terraform, Kubernetes manifests, Argo CD files, CI/CD logs, internal deployment docs, and GitHub Actions workflows is worse. At that point, you may have leaked not just the key, but a pretty good map of how the system works.
That distinction matters.
A lot of teams treat secrets as the only scary artifact. They run secret scanning, rotate tokens, and move on. But attackers do not only care about credentials. They care about shape. Naming conventions. Deployment paths. Control planes. Environments. Build steps. Internal assumptions. Which systems trust which other systems. Which scripts look abandoned but still work.
The floor plan matters.
And that is where the staff/principal engineer alarm bell should go off. Not because every leak is catastrophic in the same way, but because operational context is part of your attack surface.
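A minimal sketch of what a first pass at inventorying that context could look like, assuming you have a local checkout or export of the repos in question. The file-name patterns and the `scan_tree` helper are illustrative assumptions, not a complete rule set, and the only content check is the `AKIA` prefix format used by long-lived AWS access key IDs. A dedicated secret scanner does far more than this.

```python
import re
from pathlib import Path

# Illustrative patterns only: file names that tend to carry operational context.
RISKY_NAME_PATTERNS = [
    r"\.tfstate(\.backup)?$",   # Terraform state
    r"(^|/)kubeconfig",         # kubeconfig files
    r"(^|/)\.env(\..+)?$",      # dotenv files
    r"\.(zip|tar|tgz|gz)$",     # archives nobody has opened in years
    r"\.pem$",                  # private keys and certificates
]

# Long-lived AWS access key IDs start with AKIA followed by 16 characters.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def scan_tree(root):
    """Return (relative_path, reason) pairs for files worth a human look."""
    findings = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if any(re.search(p, rel, re.IGNORECASE) for p in RISKY_NAME_PATTERNS):
            findings.append((rel, "risky file name"))
        try:
            # Only sample the first megabyte so large archives stay cheap.
            with path.open("rb") as handle:
                text = handle.read(1_000_000).decode("utf-8", errors="ignore")
        except OSError:
            continue
        if AWS_KEY_ID.search(text):
            findings.append((rel, "possible AWS access key ID"))
    return findings
```

The output is a list to review, not a verdict. The point is to make the boring files visible again so someone can decide whether they should exist at all.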
Old repos, contractor-owned repos, personal forks, demo projects, migration backups, Terraform state, kubeconfigs, CI logs, and zip files named something like prod-final-backup-really-final are not harmless just because they are boring. Boring is where production risk hides, mostly because boring things stop getting reviewed.
The AWS DevOps Agent story is almost the opposite side of the same coin. Instead of leaking operational context, AWS is showing an agent trying to gather it during an incident. Their post walks through automated RCA across Datadog and Elasticsearch, with EKS access for Kubernetes objects, pod logs, and cluster events, plus CloudTrail deployment context. (Amazon Web Services, Inc.)
That is useful. I can absolutely see the value.
A lot of incident response is context reconstruction. What changed? What deployed? Which pod restarted? What metric moved first? What log line started showing up? Which dependency decided to become a learning opportunity at 2:13 PM on a Tuesday?
If an agent can assemble that timeline faster, great.
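To make the timeline idea concrete, here is a small sketch of merging events from several sources into one ordered list and separating what happened before customer impact from what happened after. This is not a description of how AWS DevOps Agent works. The `Event` type, the helper names, and every event below are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    at: datetime
    source: str   # e.g. "cloudtrail", "kubernetes", "datadog"
    summary: str


def build_timeline(*sources):
    """Merge per-source event lists into one list ordered by timestamp."""
    merged = [event for events in sources for event in events]
    return sorted(merged, key=lambda event: event.at)


def before_impact(timeline, impact_start):
    """Keep events that preceded impact; later events cannot have caused it."""
    return [event for event in timeline if event.at < impact_start]


# Made-up events for illustration only.
deploys = [Event(datetime(2030, 1, 1, 14, 2), "cloudtrail", "config change deployed")]
pods = [
    Event(datetime(2030, 1, 1, 14, 5), "kubernetes", "pod restarted"),
    Event(datetime(2030, 1, 1, 14, 20), "kubernetes", "pod rescheduled"),
]
metrics = [Event(datetime(2030, 1, 1, 14, 4), "datadog", "error rate rising")]

timeline = build_timeline(deploys, pods, metrics)
suspects = before_impact(timeline, datetime(2030, 1, 1, 14, 10))
```

Ordering narrows the list of suspects, but it does not prove cause. An event that came first is only a candidate, which is exactly the gap a human still has to close.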
But automated RCA is one of those places where the output can sound more certain than it deserves to be. A clean summary with “probable root cause” in bold can become the thing everyone believes, especially when the channel is noisy and everyone is tired.
So the question is not “should we use AI for incident response?”
The better question is: where does the agent sit in the decision chain?
Is it a scribe?
An investigator?
A summarizer?
A hypothesis generator?
Or is it becoming the person in the room everyone quietly defers to because it sounds confident and nobody wants to keep digging?
That boundary matters.
The same thing shows up in Microsoft Copilot Studio computer-using agents. Microsoft says computer use in Copilot Studio is generally available, and its docs describe agents interacting with websites and desktop apps through graphical user interfaces. (TECHCOMMUNITY.MICROSOFT.COM)
That sounds amazing if you live in the real enterprise world, where half the important systems have bad APIs, no APIs, or APIs that technically exist while the only supported process is still “log into the portal and click the thing.”
Computer-using agents are going after that mess.
But they also make the boundary fuzzy.
API automation at least gives you endpoints, scopes, schemas, logs, and a reasonably clear mental model. UI automation is more like, “the agent looked at the screen and clicked what seemed right.”
That may be fine when the button is “Download report.”
It is a little less fine when the button is “Approve,” “Delete,” “Submit payment,” or “Yes, I understand this is permanent.”
Again, the tool is not automatically bad. The failure mode is lazy governance. If an agent can use a UI, then the UI is now an automation interface. That means restricted accounts, audit logs, test environments, approval gates, and very strong feelings about bulk updates.
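One way to picture an approval gate with an audit trail is the sketch below. It is not a Copilot Studio API. The `click` and `approve` callbacks are hypothetical stand-ins for whatever the agent runtime and the human approval step actually are, and the keyword list is an assumption you would tune per application.

```python
# Illustrative keywords; a real deployment would maintain this per application.
DESTRUCTIVE_WORDS = ("delete", "approve", "submit payment", "permanent", "bulk")


def needs_approval(label):
    """True when a button label looks like a destructive or irreversible action."""
    lowered = label.lower()
    return any(word in lowered for word in DESTRUCTIVE_WORDS)


def guarded_click(label, click, approve, audit_log):
    """Click a UI element only if it looks harmless or a human approved it."""
    if needs_approval(label) and not approve(label):
        audit_log.append(("blocked", label))
        return False
    click(label)
    audit_log.append(("clicked", label))
    return True
```

Keyword matching on labels is a weak control on its own, because a UI change can rename the button. It is meant to sit on top of restricted accounts and test environments, not replace them.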
Atlassian adding Claude Code support to Bitbucket Agentic Pipelines is another version of this. Atlassian says Agentic Pipelines lets teams embed AI agents into Bitbucket Pipelines steps to analyze code, troubleshoot failing pipelines, fix flaky tests, and more. Atlassian also has separate guidance about third-party agent providers, including Claude, and what that means for permissions and data handling. (Atlassian Support)
That is the part I keep coming back to.
CI/CD is not “developer tooling around production.”
CI/CD is the path code takes to become production.
So when agents enter CI/CD, they are not just helping with chores. They are entering the delivery path. That means the boring questions matter immediately.
What code does the agent see?
What logs does it see?
What secrets are available?
Can it modify tests?
Can it open pull requests?
Can it generate security triage notes that people treat as fact?
Can it make the pipeline pass without making the system better?
That last one is not theoretical. Humans do it constantly. We just call it temporary, put it in a PR description, and then let it survive three reorgs.
The Kubernetes seccomp story is the grounding wire for all of this.
After all the agent talk, CVE-2026-46333 is a reminder that your old defaults still matter. Kubernetes seccomp docs describe how seccompDefault can apply the RuntimeDefault profile when no profile is specified, while otherwise workloads may run unconfined depending on configuration. (Microsoft Learn)
That is not flashy. It will not win a keynote. Nobody is making a cinematic launch video for “check your pod security defaults.”
But those are the kinds of settings that decide whether a theoretical exploit path becomes a practical one.
The boring defaults are not boring. They are latent decisions.
And every once in a while, a CVE shows up and asks what you decided.
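If you want to see what you decided, a starting point is reading the declared seccomp profile off each pod. The sketch below assumes a pod object shaped like the JSON from `kubectl get pod -o json` and uses the `securityContext.seccompProfile.type` field at the pod and container level. The helper names are hypothetical. It reports what the manifest declares, so "unset" still depends on whether the node applies RuntimeDefault by default.

```python
def effective_seccomp(pod):
    """Map each container name to its declared seccomp profile type, or "unset"."""
    spec = pod.get("spec") or {}
    pod_ctx = spec.get("securityContext") or {}
    pod_level = (pod_ctx.get("seccompProfile") or {}).get("type")
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    result = {}
    for container in containers:
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged"):
            # Privileged containers run without a seccomp profile applied.
            result[container["name"]] = "Unconfined"
            continue
        container_level = (ctx.get("seccompProfile") or {}).get("type")
        # A container-level setting takes precedence over the pod-level one.
        result[container["name"]] = container_level or pod_level or "unset"
    return result


def risky_containers(pod):
    """Containers that are explicitly Unconfined or rely on the node default."""
    profiles = effective_seccomp(pod)
    return sorted(name for name, kind in profiles.items() if kind in ("unset", "Unconfined"))
```

Anything that comes back as "unset" is the case worth chasing, because that is where the answer lives in node configuration instead of the manifest.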
That is also why the lightning round fits the episode.
GitHub expanding OIDC support for Dependabot and code scanning is not flashy, but short-lived identity-based access is healthier than long-lived registry secrets sitting around forever. Java pods getting OOMKilled even when heap looks fine is a reminder that abstractions leak, and Kubernetes does not care that your -Xmx looked reasonable. LLM-generated SQL that returns plausible but wrong results is a reminder that failure is not always loud.
Sometimes the system breaks quietly.
Sometimes the dashboard loads.
Sometimes the query runs.
Sometimes the postmortem gets published.
Sometimes the action item says “improve monitoring,” and everyone nods like that is a plan.
That is why the human closer matters.
Postmortem action items die because they are often not real work yet. They are good intentions with vague verbs. “Improve monitoring.” “Review runbooks.” “Clean up ownership.” “Investigate retries.”
Those are not action items.
They are vibes in ticket form.
A real action item has an owner, a clear outcome, a tracking location, and a due date. incident.io’s piece on failed postmortem actions points at the same basic reasons: no named owner, vague wording, wrong tracking place, and no follow-up cadence. (Atlassian Support)
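Those four checks are simple enough to write down as code. The sketch below is one hypothetical way to lint action items before a postmortem is published. The `ActionItem` fields and the vague-verb list are assumptions for illustration, not something from the incident.io piece.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Illustrative list: openers that usually signal a wish, not a task.
VAGUE_VERBS = ("improve", "review", "investigate", "clean up", "ensure", "harden")


@dataclass
class ActionItem:
    text: str
    owner: Optional[str] = None
    due: Optional[date] = None
    tracking_url: Optional[str] = None


def problems(item):
    """List the reasons an action item is likely to die quietly."""
    found = []
    if not item.owner:
        found.append("no named owner")
    if item.due is None:
        found.append("no due date")
    if not item.tracking_url:
        found.append("no tracking location")
    if item.text.strip().lower().startswith(VAGUE_VERBS):
        found.append("vague verb")
    return found
```

A check like this cannot tell whether the work is the right work. It only makes it harder to publish a wish and call it a plan.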
And that is the part that ties the whole episode together.
The CISA leak is not fixed by saying “review GitHub practices.”
AI RCA is not useful if the follow-up is “improve incident response.”
Computer-using agents are not governed because someone wrote “ensure controls.”
Claude Code in CI/CD is not safe because someone said “be careful with third-party providers.”
Kubernetes seccomp is not handled because someone said “harden workloads.”
At some point, someone has to turn the vague thing into real work.
Name the owner.
Find the repo.
Rotate the token.
Delete the archive.
Scope the account.
Document the data flow.
Apply the default.
Track the exception.
Close the loop.
That is not the glamorous part of engineering, but it is the part that compounds.
The staff and principal engineer job is often less about having the cleverest take and more about turning fuzzy risk into specific work that actually changes the system.
Automation is going to keep getting more powerful. Agents will get better. RCA tools will get faster. Pipelines will get more intelligent. UI automation will keep reaching into systems that never had proper APIs.
Fine.
But if the ownership model is messy, the secrets are stale, the defaults are unknown, the CI permissions are broad, and the postmortem actions are vague, then automation does not save you.
It scales the mess.
That is the lesson I keep taking from these stories.
Production does not run on good intentions.
It runs on the stuff someone actually fixed.
Additional links: KrebsOnSecurity’s CISA leak coverage, Microsoft’s computer-use docs, Atlassian’s third-party agent provider guidance, Kubernetes seccomp docs, GitHub’s Dependabot/code scanning OIDC changelog, Readyset’s LLM SQL piece, and incident.io’s postmortem follow-up article. (Krebs on Security)