0:00
Everybody wants AI to help run infrastructure.
0:02
A lot fewer people are asking where that AI is
0:05
allowed to fail. Because the hard part is not
0:07
getting an agent to suggest a change. The hard
0:10
part is making sure that change can be tested,
0:13
challenged, and debugged before anything touches
0:16
real cloud infrastructure. And that is what made
0:19
this conversation interesting to me. Not just
0:21
the AI angle. The idea that before we trust agents
0:24
with real systems, they may need a real training
0:27
ground first. Hey, I'm Brian Teller.
0:46
I work in DevOps and SRE, and I run Teller's
0:49
Tech. Ship It Weekly is where I filter the noise
0:52
and focus on what actually matters when you are
0:55
the one running infrastructure and owning reliability.
0:58
Most weeks, it's a quick news recap. In between
1:01
those, I do interview episodes with people building
1:03
tools, systems, and ideas that could actually
1:06
change how this work gets done. Today is one
1:10
of those conversations. I'm joined by Ang Chen,
1:13
associate professor at the University of Michigan.
1:16
He's working on Project Vera, which is now being
1:19
positioned as a high-fidelity multi-cloud emulator
1:22
you can run locally on your laptop with support
1:25
for AWS EC2 and GCP compute. At a practical level,
1:30
the pitch is pretty straightforward. Test cloud
1:33
infrastructure locally, use standard tooling,
1:35
avoid real accounts, real spend, and real blast
1:39
radius while you are iterating. But the bigger
1:42
idea behind Vera is what really got my attention.
1:45
Ang frames this as part of a longer -term vision
1:48
for giving AI agents a safe learning environment,
1:51
or what he calls a kind of world model for digital
1:55
systems, where they can build operational intelligence
1:58
before ever touching real infrastructure. So
2:01
in this conversation, we get into what high fidelity
2:04
really means, how Vera works at the API layer,
2:07
how it can sit under workflows that already use
2:10
CLI tools, SDKs, or Terraform, and why that matters
2:15
if you want faster feedback without pointing
2:18
tests at the real cloud. We also get into the
2:21
skeptical operator questions. How close does
2:24
something like this actually need to be before
2:26
you trust it? Where is it strong today and where
2:29
is it still early? And if AI is going to play
2:32
a bigger role in infrastructure, what kind of
2:34
safety layers should exist first? That's the
2:37
real conversation here. Not whether AI can generate
2:40
infra work. Whether it can be forced to prove
2:42
itself somewhere safe before it earns access
2:45
to the real thing. If you like these kinds of
2:48
conversations, follow the show wherever you listen.
2:51
Subscribe on YouTube and check out ShipItWeekly.fm
2:54
or TellersTech.com for more episodes, show
2:58
notes, and everything else that I'm building.
3:01
All right, let's jump in. Today, I'm joined by
3:07
Ang Chen, an associate professor at the University
3:10
of Michigan. He's working on Project Vera, which
3:13
is basically trying to build a high-fidelity
3:15
cloud emulator using AI agents, starting with
3:19
EC2. Ang, thank you for joining me. Thank you,
3:21
Brian. I'm excited to be here. So tell me about
3:24
Project Vera. What is it? It's an effort that
3:27
automatically generates a digital twin of your
3:30
cloud deployment. DevOps can be very tricky to
3:34
get right. And we don't want any downtime or
3:38
security issues when we actually push the program
3:41
to the cloud. So you can think of it as a sandbox.
3:44
That's a digital copy of your actual infrastructure.
3:48
Within this sandbox, the DevOps programs can
3:52
be tested. They can be debugged. You can even
3:54
deploy an AI agent to play with the sandbox and
3:57
get to know more about your deployment without
3:59
actually reaching into the actual deployment
4:02
itself. And what's interesting about the sandbox
4:05
is that it's actually generated by an AI agent itself.
4:08
We have an AI agent that reads the cloud documentation,
4:12
and it could also observe traces and logs about
4:15
the deployment. And it uses a very specialized
4:19
program synthesis pipeline to generate an emulator
4:23
framework. The emulator framework will mimic
4:25
the behavior of EC2, for instance, in terms of
4:28
how to respond to a certain call, what should
4:31
be the responses and formats in a very high fidelity
4:34
manner. And the same idea generalizes to other
4:37
services in AWS and it generalizes to other clouds
4:40
as well. Actually, we're building it for Azure
4:43
and GCP and other clouds as well. So it's an
4:46
agent building a simulator of the cloud. And
4:50
on that cloud, on that simulator, the DevOps
4:53
engineers can do a lot of their work much more easily.
4:57
What's the target audience for a tool like this?
4:59
Right. It would be primarily for DevOps engineers
5:03
who... want to test their programs in the sandbox.
5:08
The DevOps engineer can deploy their programs
5:11
in the sandbox and observe the behavior and debug
5:16
their programs before they push it to the actual
5:19
cloud. And that sandbox can also be used to support
5:24
DevOps AI assistants. AI is getting very
5:29
powerful every day, but we often don't want the
5:32
AI to directly work on the infrastructure, because
5:34
it could hallucinate. So having an AI agent testing
5:38
its proposed actions in the sandbox before putting
5:42
it to the cloud would be another use case of
5:45
the sandbox. Is it interfacing with IaC,
5:49
like Terraform or CloudFormation? Or how does
5:52
that integrate? Right. The simulator emulates
5:56
the cloud at the API level. Basically, every
5:59
API that creates virtual machines and subnets
6:02
is captured here. So basically, it can support
6:05
SDK scripts, but it can also support CloudFormation
6:09
and Terraform because eventually they all call
6:12
into the APIs. And in the release that we have,
6:15
we have CLI test cases that mimic Amazon,
6:18
but also Terraform programs that can be booted
6:21
on this emulator. When you say high fidelity,
6:23
what does that mean in practice? Right. It means
6:26
that there are two key properties of this emulator
6:29
because this is generated by an AI co-developer,
6:33
so to speak, that reads the cloud documentation
6:35
and tests against the cloud. We want to make sure
6:39
that this is not vulnerable to hallucination.
6:42
AIs are getting very good, but they still have
6:45
hallucination, and we have two ways to prevent
6:48
this from happening. The first is that the AI agent
6:52
that we have built behind Vera is using
6:55
formal abstractions; it is using formal methods and
6:58
verification to make sure that the code eliminates
7:02
classes of hallucination problems. So it's built
7:05
to be correct by construction without suffering
7:08
from arbitrary errors that an AI model would
7:11
otherwise introduce. And the second is that the
7:14
AI agent also takes this simulator and strategically
7:19
tests this against Amazon. Because this emulator
7:23
is generated by the agent, the agent understands
7:26
the inner workings of the emulator and it can
7:28
understand what might be some edge cases and
7:32
what might be some places where strategic testing
7:36
would be helpful. So the agent also takes this
7:39
emulator, produces traces, and sends them to the
7:41
cloud, and observes whether the behaviors are the
7:44
same or not. And if they're the same, that's what
7:47
we mean by high fidelity. And if there are discrepancies,
7:50
the AI agent will then consume these two traces
7:52
and automatically patch the emulator so that
7:55
in the next test case, they will be aligned with
7:57
each other. Interesting. I guess I'm curious,
8:00
how does it get around the non-deterministic
8:02
behavior of an AI or an LLM specifically? Right.
8:07
And that's a very good question. That's where
8:10
the formal abstractions come in. Instead of having
8:13
the AI write code in a freeform style, we actually
8:18
have a lot of scaffolding. That's the key part.
8:22
The structure of the emulator is a deterministic
8:25
framework. And what we ask the AI to do is essentially
8:29
fill in the blanks that we have left out instead
8:32
of being creative about writing everything about
8:35
the emulator. So it's a combination of neural
8:38
and symbolic methods, where the symbolic framework
8:41
constrains the behavior, and it's fully deterministic.
8:45
And there are strategic parts where the AI needs
8:48
to read the documentation and understand what
8:50
it's supposed to do. And it's only filling in
8:52
these blanks in a way that's constrained by the
8:55
scaffolding. So it's like a spec then that it's
8:58
built around, okay. Exactly. Or in Cursor or
9:01
Kiro, it's like a plan file that it's reading.
9:05
Is it like a pre -prompt or is it more specific
9:08
than that? It's more specific than that. So we
9:11
use a special kind of formal methods that builds
9:15
classes and abstractions, almost like a template.
9:18
And the template has a very well -defined structure.
9:23
And we know that the structure cannot go wrong
9:25
because it's deterministic. But the structure
9:28
also has certain slots. And the slots are where
9:30
the AI agent will generate code and insert it
9:35
into them. So it's more specific than a pre-prompt,
9:38
almost like... a class that can be inherited
9:41
and can be turned into a compute instance, can
9:45
be turned into a subnet, a firewall, and so forth.
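As a rough illustration of that neuro-symbolic split, here is a hypothetical sketch (not Vera's actual code; all class and field names are invented for illustration): the scaffolding is a deterministic base class whose lifecycle logic is fixed, and the only part the AI agent would generate is the small "slot" that formats responses according to the cloud documentation.

```python
# Hypothetical sketch of the "scaffolding with slots" idea; not Vera's real code.
# The base class fixes the resource lifecycle deterministically; only the
# fill-in-the-blank slot (describe_response) would be AI-generated.

class EmulatedResource:
    """Deterministic scaffolding: state transitions cannot be changed by the AI."""

    def __init__(self, resource_id: str):
        self.resource_id = resource_id
        self.state = "pending"

    def create(self):
        # Fixed, verifiable transition logic supplied by the framework.
        if self.state != "pending":
            raise RuntimeError("invalid transition")
        self.state = "running"

    def delete(self):
        self.state = "terminated"

    # --- slot: the AI agent fills this in from the cloud documentation ---
    def describe_response(self) -> dict:
        raise NotImplementedError


class EmulatedInstance(EmulatedResource):
    """Example slot implementation the agent might generate for an EC2-like VM."""

    def describe_response(self) -> dict:
        # Field names and formats come from the docs; the structure around
        # them (the lifecycle above) is fixed and cannot be hallucinated away.
        return {"InstanceId": self.resource_id, "State": {"Name": self.state}}


vm = EmulatedInstance("i-0abc123")
vm.create()
print(vm.describe_response())  # {'InstanceId': 'i-0abc123', 'State': {'Name': 'running'}}
```

The design point is the same one Ang describes: even if the generated slot is wrong, the damage is contained to a response format, not to the state machine itself.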
9:48
So can you walk me through the shape of the system?
9:50
Like if I'm calling an EC2 API, what's happening
9:54
behind the scenes? Right. So if it's calling
9:56
into the EC2 API, that API will be captured
10:00
by the emulator framework and it will create
10:03
a class, so to speak. that captures the behaviors
10:07
as specified in the EC2 virtual machine documentation.
10:12
For instance, there is a RunInstances call which creates
10:14
the virtual machine. And you could destroy it,
10:16
you could attach disks to it, and so forth. And
10:19
that will trigger some state modification, almost
10:22
like IaC, where Terraform contains the state.
10:26
So here, the emulator mimics that workflow, and
10:29
it also contains state. But now we have a virtual
10:32
machine. And the virtual machine could have a
10:34
specific name. And if there's another API that
10:36
attaches a disk to the virtual machine, the emulator
10:39
will also capture that by modifying the internal
10:42
state. So it is a hierarchy of these services
10:46
where you could instantiate a virtual machine
10:49
and the virtual machine could be contained in
10:51
a VPC. So when you're updating the virtual machine,
10:54
for instance, the emulator knows that it also
10:57
must update the VPC. So talking about state,
11:00
how do you deal with weird edge cases like eventual
11:03
consistency, retries, throttling, quota errors,
11:06
that sort of thing? Right. So the emulator framework
11:09
itself is generated by an AI agent that reads
11:13
the cloud documentation. So the cloud documentation
11:16
describes the key behaviors of the cloud, but
11:20
it doesn't describe everything. So the question
11:22
that you ask is a very important class of problems,
11:25
which are not fully documented in the documentation.
11:28
As an example, eventual consistency and consistency
11:31
guarantees are often not described in detail.
11:34
But the API behaviors, how it should perform,
11:37
is documented very extensively. So what we are
11:41
doing here is that we are taking the emulator
11:43
and bootstrapping it to a fully functional emulator,
11:47
but it doesn't capture some of the nuances regarding
11:49
throttling, rate limiting, and consistency. But we
11:53
have another simulator in the backend that can
11:55
supply some of these semantics. So on top of this functional
11:58
emulation, if there's a call into an API,
12:01
we can also emulate latency for that
12:04
API, throttling behaviors, and quotas. So there
12:08
is an orthogonal subsystem that supplies that
12:12
kind of intricate detail to the emulator. That's
12:15
a great question. What's the success bar? Is
12:17
it like same response, same timing, same failure
12:20
mode? Yeah. So there are two milestones. The
12:23
first milestone is that it should produce the
12:26
same outputs for the same inputs to the APIs so that
12:29
DevOps engineers don't have to actually go
12:32
to the cloud to understand whether their program
12:35
is working. They can just test it and
12:38
observe the actions in this emulator. And the
12:41
second milestone is that actually this emulator
12:44
can help DevOps perform better debugging
12:47
than the cloud can. And the reason is that when
12:50
the cloud has an error, it gives you some trace,
12:53
but that trace is often verbose. It doesn't really
12:56
help with pinpointing which line of code is problematic
12:59
in your Terraform file or in your SDK file. Because
13:03
there's an AI agent living in the sandbox, the
13:06
agent can analyze the traces and produce better
13:09
debugging information and even pinpoint the problems
13:12
in Terraform. So the second milestone is actually
13:15
to do better debugging than what the cloud can
13:18
do. Interesting. So how do you prove that it's
13:20
not lying to me? That is the heart of the question:
13:25
how do we make sure that this emulator is actually
13:29
producing the same responses? In the first release,
13:33
in the GitHub we have more than 200 test cases,
13:36
and these are CLI command lines that you would
13:39
type into AWS, and we ran a test between Vera
13:44
and an existing emulator. What we've shown
13:47
is that Vera is already doing much better than
13:51
existing emulators. But the same set of test cases
13:55
has also shown that Vera sometimes fails to
13:58
produce the same behavior, because this is an agent
14:01
that continuously improves itself. And the first
14:05
release gets 70% based on our measurement.
14:09
And by this agent, we have another version that's
14:12
continuously running and improving itself until
14:15
it hits all the test cases. So one way that people
14:18
test it is to use a leading emulator called LocalStack.
14:21
LocalStack is this really nice tool
14:24
that emulates AWS APIs. It's not one-to-one.
14:28
I've found it's good in some ways, but yeah,
14:30
it's... It's not one-to-one. It's close enough
14:34
to enable classes of DevOps testing. In the open
14:37
source release, actually, we did a comparison
14:40
between Vera and local stack. So what we found
14:43
is that Vera covers more than 70% of the cases,
14:47
whereas LocalStack covers 40%. So the first
14:51
version of Vera is already performing quite
14:54
well in that regard. And we also have another
14:57
version that's continuously improving itself.
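The measurement methodology Ang describes is essentially differential testing. A minimal hypothetical sketch of that kind of harness (the call names and response tables are invented stand-ins, not the project's actual suite): replay the same calls against the emulator and a reference, and report coverage as the fraction of matching responses.

```python
# Hypothetical differential-testing sketch: compare an emulator's responses
# with a reference ("the real cloud") over a shared set of test calls.

def reference(call: str) -> dict:
    # Stand-in for the real cloud's behavior.
    table = {
        "run-instances": {"status": 200, "state": "pending"},
        "describe-instances": {"status": 200, "state": "running"},
        "terminate-instances": {"status": 200, "state": "shutting-down"},
    }
    return table[call]

def emulator(call: str) -> dict:
    # Stand-in emulator that gets one case wrong.
    table = {
        "run-instances": {"status": 200, "state": "pending"},
        "describe-instances": {"status": 200, "state": "running"},
        "terminate-instances": {"status": 200, "state": "stopping"},  # mismatch
    }
    return table[call]

def coverage(calls):
    """Return (fraction of matching responses, list of mismatched calls)."""
    matches = [c for c in calls if emulator(c) == reference(c)]
    mismatches = [c for c in calls if c not in matches]
    return len(matches) / len(calls), mismatches

score, diffs = coverage(["run-instances", "describe-instances", "terminate-instances"])
print(score, diffs)
```

In the workflow described above, the mismatch list is exactly what the agent would consume to patch the emulator before the next run.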
15:00
And the goal is to simulate the behavior of the
15:03
cloud to Terraform and DevOps programs, almost
15:06
like the Turing test. The ultimate goal is when
15:10
we run a DevOps program against the simulator
15:13
versus against the cloud, the DevOps program
15:17
doesn't feel any difference. It doesn't necessarily
15:19
mean that it has to be line by line,
15:22
character by character the same regarding the logs
15:25
outputs, but we want it to be high fidelity enough
15:28
that DevOps engineers can test it thoroughly
15:31
in this simulator. So if I'm a platform team,
15:34
where would I actually plug this in? Local dev
15:36
versus CI versus prepod, validation, like what
15:39
would be a good first step? Right. One way of
15:42
using this is to integrate it into the CI/CD pipeline.
15:46
When there are code changes, there's a new Terraform
15:48
file. The agent can take the changes and validate
15:52
it in the sandbox first and suggest changes to
15:55
the program if there are errors and fix these
15:58
errors and generate corrections for the DevOps
16:01
engineers so that this would be integrated into
16:04
the CI/CD pipeline before it's actually pushed to the cloud.
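In pseudocode terms, that CI gate might look like the following hypothetical sketch (the function names, resource fields, and the allowed-instance-type rule are all invented for illustration, not a real integration): validate each planned change against the sandbox and fail the build before anything reaches a real account.

```python
# Hypothetical CI gate: run planned infrastructure changes against a sandbox
# emulator first and collect errors before the real apply step.

def sandbox_apply(change):
    # Stand-in for the emulator; returns an error message or None on success.
    if change.get("instance_type") not in {"t3.micro", "t3.small"}:
        return f"unsupported instance_type: {change.get('instance_type')}"
    return None

def ci_gate(changes):
    """Return all sandbox errors; an empty list means the plan may proceed."""
    errors = []
    for change in changes:
        err = sandbox_apply(change)
        if err:
            errors.append(f"{change['name']}: {err}")
    return errors

plan = [
    {"name": "web", "instance_type": "t3.micro"},
    {"name": "db", "instance_type": "m5.24xlarge"},  # would fail in the sandbox
]
problems = ci_gate(plan)
print(problems)  # ['db: unsupported instance_type: m5.24xlarge']
```

The point is that the build fails on sandbox errors, so the only changes that reach a real account are ones that already survived the emulator.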
16:07
Are there any use cases where maybe it's not well
16:10
suited for yet? Maybe it doesn't have enough
16:13
testing around it or validation. So there are
16:15
two things that we know about the limitations
16:18
of Vera. The first limitation is that it doesn't
16:21
yet cover all resources in AWS. It does cover
16:25
EC2, which is a key service. There's also a lot
16:29
more beyond EC2. That's the first limitation
16:31
that we know. And the second limitation is that
16:34
the current version doesn't
16:39
do some of the things that we are thinking
16:42
about. For instance, I've talked about AI-based
16:45
debugging suggestions to DevOps engineers. So
16:47
that tooling is not fully ready yet. So currently,
16:52
if there's a bug, Vera doesn't automatically
16:54
diagnose the bug for you, which is part of our
16:58
ongoing plan. For the first limitation: it doesn't
17:01
support all APIs, and it doesn't
17:04
support customization. For instance, it doesn't
17:07
automatically understand what a specific deployment
17:10
looks like. The deployment for an enterprise may
17:14
not use all the APIs. They may use the APIs
17:16
in a very specialized way. So these kinds of customizations
17:19
are also not there yet, but they are on our agenda.
17:23
So it sounds like EC2 is the area where you've
17:27
had a lot of focus and it seems like you have
17:29
trusted output there. What's the nastiest EC2
17:33
edge case that you've had to emulate? There's
17:36
a very interesting edge case that we have found
17:39
in this exercise, which is that sometimes our
17:42
AI co-developer that writes the emulator uses
17:46
one type of string format, like camel case,
17:50
where it's easy to format the same string
17:52
differently. And that's very interesting because
17:55
a Terraform program expects a certain type of
17:58
format. And if it's formatted slightly differently,
18:01
then the Terraform program won't run. So there we
18:03
had to create specialized directions for the
18:06
agent so that it would only produce camel case
18:09
when it's supposed to be camel case, and in other
18:13
cases produce snake case and so forth. I thought
18:15
that was a very interesting and unexpected edge
18:19
case where the initial version of Vera didn't
18:22
produce the exact response and we had to do extra
18:25
engineering to make that align. Interesting.
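That casing pitfall is easy to reproduce. Here is a hypothetical sketch of the kind of normalization check that would catch it (the key names and schema are invented examples, not Vera's actual validation code): convert stray snake_case keys to the documented CamelCase form and fail loudly if documented keys are still missing.

```python
# Hypothetical sketch of the casing problem: an emulator that emits snake_case
# keys where the real API uses CamelCase will break consumers that expect the
# documented format, so responses can be normalized and validated before returning.

def snake_to_camel(key: str) -> str:
    # "instance_id" -> "InstanceId"
    return "".join(part.capitalize() for part in key.split("_"))

def normalize_response(resp: dict, expected_keys: set) -> dict:
    """Re-key a response so it matches the documented (CamelCase) schema."""
    fixed = {snake_to_camel(k) if "_" in k else k: v for k, v in resp.items()}
    missing = expected_keys - fixed.keys()
    if missing:
        raise ValueError(f"response missing documented keys: {missing}")
    return fixed

# An early emulator draft might produce the wrong casing for one field:
raw = {"instance_id": "i-0abc123", "State": "running"}
fixed = normalize_response(raw, {"InstanceId", "State"})
print(fixed)  # {'InstanceId': 'i-0abc123', 'State': 'running'}
```

A Terraform-style consumer that looks up `InstanceId` would simply fail on the raw response, which is exactly the class of mismatch the extra engineering had to eliminate.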
18:29
Do you think clouds will ever make official emulators
18:31
good enough or is learned emulation the path?
18:35
Right. One thing that's very special about cloud
18:39
emulation compared to other types of emulators
18:42
is that the cloud is a moving target. There are
18:45
new services every week and there are API changes.
18:48
Many of these changes will introduce different
18:50
behaviors. So in building an emulator for the
18:54
cloud, our experience is that there are two key
18:56
challenges. One is that the size of the cloud
18:59
is so big and there are so many different clouds
19:01
with different behaviors. Beyond AWS, we've also
19:05
investigated Azure and GCP, which are on our agenda
19:08
as well. They all have different APIs. They have
19:11
different behaviors. The emulation for one cloud
19:14
doesn't really generalize to the other. The second
19:17
is that the APIs go through constant evolution.
19:21
Because the clouds want to stay competitive.
19:23
They're introducing new services, new ways of
19:26
using these services. We really believe that
19:28
a learned, AI-agent-built emulator is the path.
19:32
Because the AI agent doesn't have to spend a
19:35
lot of extra effort once the emulator framework
19:38
is there. It still has to align the emulator
19:41
periodically. Whenever there's an API change,
19:44
the agent has to understand what has changed.
19:47
It has to generate strategic test cases to realign
19:51
that API, but it doesn't have to do everything
19:53
from scratch. So this agent can keep up with
19:56
the changes that happen in the cloud and it can
19:59
gradually expand to different clouds. So this
20:01
is an ever-expanding emulator that can catch
20:04
up with the speed of the cloud. And that's something
20:07
that we are very excited about regarding this
20:10
learned emulation. Have you done much as far
20:12
as GCP training yet? I'm just curious because...
20:16
The IAM approach in GCP is completely different
20:20
than the IAM approach in AWS. Right. Or even
20:24
like Cloud Run versus Lambda is also completely
20:27
different. Fundamentally different services,
20:29
although the same general idea or same general
20:33
focus. Yeah. And that is a very good question.
20:36
The clouds call the same service differently.
20:39
So they're almost aliases. So here is where AI
20:43
will shine. Because as long as we can make
20:46
the AI understand that there are certain concepts
20:49
across clouds that are similar. Instances are
20:52
called virtual machines in a different cloud.
20:54
Then there is a certain knowledge base in the
20:56
AI that can transfer from one cloud to another.
20:59
So there's a core of the learned knowledge that
21:02
can transfer, but not everything. The APIs are
21:05
still different and the services do not always
21:09
have a counterpart across clouds. What I'm excited
21:12
about with this approach is that some core knowledge
21:15
of the cloud can be transferred so that when
21:17
we're building the second emulator for GCP, it
21:20
will be much faster than the first one for AWS,
21:23
where it has to learn everything, all the concepts
21:26
from scratch. Where can people find more information
21:29
about Vera? We have an open source release called
21:32
Project Vera. And it's project-vera.github.io,
21:36
where we have an open source release, as
21:39
well as the publications that we had over the
21:41
years that eventually led to this simulator.
21:43
And that could be a good source of
21:47
information, not only about the release,
21:51
but also the rationale behind the release and
21:54
the specific approach that we take in designing
21:56
Vera and other tools that we have built in the
21:59
past couple of years surrounding AIOps and DevOps.
22:03
And we are looking for contributors to help us
22:06
improve Vera. And if you're interested in contributing
22:10
to the open source release or contributing new
22:13
ideas, or if you have a service that you would
22:15
like to see emulated, this is something that
22:17
we are very excited to help you with. What's
22:20
the license model for Vera currently? It's under
22:23
MIT license in open source. That's good to hear.
22:26
Too many new open source projects like to limit
22:29
their open source initiatives, which is a little
22:32
frustrating. Right. It's fully open source under
22:36
MIT license. Thanks. Well, thank you, Ang, for
22:38
coming on. Really appreciate it. Thank you very
22:40
much, Brian. It's great to be here. All right.
22:43
That's my conversation with Ang Chen. My biggest
22:46
takeaway from this one is that AI for infrastructure
22:49
gets a lot more believable when it has to survive
22:52
a sandbox first. That's what makes Vera interesting
22:55
to me. Not just that it emulates cloud behavior,
22:58
but that it is trying to create a local,
23:01
high-fidelity environment where cloud workflows can
23:04
be tested without real credentials, real billing,
23:07
or real production risk. And since we recorded
23:10
this, the project has clearly kept moving. It's
23:13
now being presented publicly as a multi-cloud
23:16
emulator, not just an EC2-focused idea, with
23:20
AWS and GCP support and a stronger public story
23:24
around local testing and safer iteration. I also
23:27
liked that the bigger vision was not just AI
23:30
does ops. It was more grounded than that. Give
23:33
agents a training ground. Let them learn inside
23:36
something rule-based. See how they behave, see
23:39
where they fail, then decide what, if anything,
they should be trusted with in the real world.
24:04
Thanks for listening, and I'll see you later
24:07
this week.