Host Commentary

For this Conversations episode, I wanted to stay anchored on a question that I think is going to matter a lot more over the next couple years.

Not whether AI can help with infrastructure.

Whether it should be trusted anywhere near real infrastructure before it has had a place to prove itself.

That is why this one interested me.

Because Ang Chen is not really pitching “let the agent run prod.” He keeps bringing it back to a safer idea than that. Build a sandbox. Build a digital twin. Let Terraform, CloudFormation, SDK scripts, and even AI-assisted workflows hit that first. Then see what breaks before anything touches the real cloud.

What I liked most is that the conversation did not stay at the vague “AI will change everything” level.

He actually gives a pretty grounded answer for what high fidelity is supposed to mean. Not “trust us, it feels real.” More like: constrain the generation, use formal scaffolding so the model is not just free-writing random emulator logic, then strategically test those behaviors against the actual cloud and patch the gaps when they show up. That is a much more serious answer than a lot of AI infrastructure demos give right now.

And honestly, that is where the episode got interesting for me.

Because if you are a platform engineer or DevOps person, you already know the pain here. Testing directly against real cloud is slow, expensive, and risky. Even when everything works, you are still paying in time, feedback delay, and blast radius. So the promise of something like Vera is not magic. It is faster iteration and safer validation. That is a much better frame for this than hype.

I also liked that Ang did not try to pretend the answer is perfect.

He says pretty directly that it is not one-to-one. The goal is not perfect imitation down to every line of output. The goal is to be close enough to support real classes of DevOps testing. I think that is the honest version of this whole category. Because if a sandbox can catch meaningful mistakes, break bad assumptions, and help validate changes before CI pushes something into actual cloud, that is already very valuable even if it is not a perfect clone of AWS.

The edge case he brought up was great too, because it shows how brutal infra tooling can be about details.

Something as dumb as camelCase versus snake_case in a response can be enough to break Terraform. That is the kind of thing people outside this space miss. Infrastructure tools are not impressed by “close enough.” They are extremely literal. So when people talk about cloud emulation, this is the real bar. Not whether it looks convincing in a demo. Whether it behaves precisely enough that existing tools do not choke on it.
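To make that concrete, here is a minimal sketch of why strict tooling chokes on naming mismatches. The field names (`instance_id`, `instanceId`) are hypothetical stand-ins, not Vera's or Terraform's actual schema, but the failure mode is the same: a client that expects exact snake_case keys hard-fails on a response that is "almost right" but camelCased.

```python
# Hypothetical sketch: a strict client (think: a Terraform provider parsing
# an API response) that requires exact snake_case keys. Field names here are
# invented for illustration, not taken from any real AWS or Vera schema.

EXPECTED_KEYS = {"instance_id", "instance_type", "state"}

def parse_instance(response: dict) -> dict:
    """Reject any response missing the exact expected keys."""
    missing = EXPECTED_KEYS - response.keys()
    if missing:
        raise KeyError(f"response missing expected keys: {sorted(missing)}")
    return {k: response[k] for k in EXPECTED_KEYS}

# Correctly-cased response: parses fine.
ok = parse_instance(
    {"instance_id": "i-123", "instance_type": "t3.micro", "state": "running"}
)

# "Close enough" emulator response with camelCase keys: hard failure.
try:
    parse_instance(
        {"instanceId": "i-123", "instanceType": "t3.micro", "state": "running"}
    )
except KeyError as err:
    print("parse failed:", err)
```

The point is that nothing fuzzy happens in between: either the keys match byte for byte or the tool errors out, which is why "looks convincing in a demo" is a much lower bar than "existing tools do not choke on it."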

Another part I liked was his answer on where this fits first.

Not everywhere. Not all at once. Plug it into CI/CD. Let it validate Terraform changes in a sandbox. Let it catch issues before push. That felt practical. At the same time, he was clear about limits too. EC2 was the main focus in the interview, it does not cover all AWS resources yet, and some of the more ambitious AI debugging and deployment-specific customization ideas are still on the roadmap. That honesty helps, because it keeps this grounded in “useful early tool” instead of “finished answer.”

The bigger thread running through the whole conversation is the one I keep coming back to.

AI for ops is probably not going to be won by whoever gives agents the most access. It is probably going to be won by whoever builds the best guardrails, the best evals, and the best places for those agents to learn safely. And that is what Vera feels like to me. Not the final form of AI in infrastructure, but a much smarter direction than pretending the path forward is just giving an LLM credentials and hoping for the best.

So if you are listening to this episode and want one takeaway, it is this:

Before AI earns the right to touch real infrastructure, it should have to survive a sandbox first.

That is the bar.


Show Notes

This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps.

In this Ship It: Conversations episode, I talk with Ang Chen from the University of Michigan about Project Vera, a cloud emulator built to help teams test infrastructure changes more safely before they touch real cloud.

We talk about why testing against real cloud APIs is slow, expensive, and risky, how Vera works under tools like Terraform and CloudFormation, what “high fidelity” actually means, and where a tool like this could fit in local dev and CI/CD.

The bigger theme is one I think matters a lot: if AI is going to play a real role in cloud operations, it probably needs a sandbox first, not direct access to production.

Note

This interview was recorded on February 13, 2026. Since then, Vera’s public project materials have expanded the framing a bit further around multi-cloud support and safe environments for agent learning, so keep that in mind while listening.

Highlights

• Why real cloud testing still creates cost, delay, and risk

• How Vera emulates cloud behavior at the API layer

• Where this could help with Terraform, CloudFormation, and CI/CD workflows

• Why “useful enough to catch real mistakes” may matter more than perfect emulation

• The limits, tradeoffs, and fidelity questions that still need to be solved

• Why safe training grounds may matter before AI agents touch real infrastructure

Ang’s links

• LinkedIn: https://www.linkedin.com/in/ang-chen-8b877a17/

• University of Michigan profile: https://eecs.engin.umich.edu/people/chen-ang/

• Publications: https://web.eecs.umich.edu/~chenang/pubs.html

Project Vera

• Project site: https://project-vera.github.io/

• GitHub: https://github.com/project-vera/vera

• The quest for AI Agents as DevOps: https://project-vera.github.io/blogs/cloudagent/cloudagent/

• No More Manual Mocks: https://project-vera.github.io/blogs/cloudemu/cloudemu/

Stuff mentioned

• A Case for Learned Cloud Emulators: https://dl.acm.org/doi/10.1145/3718958.3754799

• Cloud Infrastructure Management in the Age of AI Agents: https://dl.acm.org/doi/abs/10.1145/3759441.3759443

• LocalStack: https://www.localstack.cloud/

Our links

• More episodes + show notes + links: https://shipitweekly.fm

• On Call Brief: https://oncallbrief.com