Hey, I'm Brian Teller. I work in DevOps and SRE, and I run Teller's Tech. Ship It Weekly is where I filter the noise and focus on what actually matters when you're the one running infrastructure and owning reliability. Most weeks, it's a quick news recap. In between those, I drop interview episodes with folks who are actually building in the space. Today is one of those interviews. We're talking with Gracious James about TARS, his human-in-the-loop fixer bot wired into CI/CD. We get into how he segments incident response into sub-workflows, how he keeps agents from turning into SSH with a chatbot, and the guardrails he thinks are non-negotiable if you're going to let AI anywhere near production.
Today, I'm joined by Gracious James. He's been building TARS, a human-in-the-loop fixer bot wired into CI/CD, and we're going to talk about what it takes to make AI automation safe enough for real teams. Gracious, thank you for joining me.

Thanks, Brian. Great to be here. How are you doing?

I'm doing really well, thanks for asking. So I'm really interested in this TARS bot that you built. Can you give me a thesis of what you were actually building with TARS? And why is human in the loop the core idea?
All right. So I am a software engineer, and when I coded and pushed my code into, you know, CI/CD pipelines, I saw that it took a lot of time to get through a lot of teams, especially the DevOps ones. So I started looking at the processes DevOps goes through, the incident response and everything like that. There are a lot of steps, manual steps, that each and every person has to work through. Going through it, I realized that all of these things are very manual, very repetitive. And don't get me wrong, everything is important and can break the system at any point, so having a human in the loop is very important. But looking at the processes, I also saw that all of these things can be automated. So having an AI agent, with a human-in-the-loop approach that helps us speed up those processes, is what got me going into all of this.
Interesting. So what problem were you trying to solve that existing tooling didn't handle?

Right. So all of the tools that I used in this workflow are existing tools; it's only that they are not custom-made for, you know, my requirement, my company, or whatever it is. So I had to custom-make the workflow using multiple tools from different domains. I custom-made, you know, the SSH access, in which there are validated SSH commands that my AI agent can utilize to find out what is going on in my Docker containers and get a proper system status report. And in order to do a comprehensive check, I compare those logs, or the errors, or whatever it is that Docker is pointing out, to my previous GitHub commits, be it the previous ones, the latest ones, or whatever it is. I compare them and get a comprehensive check. And based on those checks, it helps me figure out where I should narrow my focus. Once I do that, I understand what fixes I should be figuring out in the very short incident response time that I get.
3:39
Friday, and then you morphed into TARS, right?
3:43
Yes. Can you walk me through this evolution from
3:46
Friday? Like, what was Friday and what is TARS?
3:48
Well, Friday in itself is an Aniton workflow.
3:53
Now, Aniton is a... high -level abstract API,
3:58
sorry, multiple API application system where
4:01
we can connect different programs, subsections,
4:06
and sub -workflows in which we take the data
4:08
from one workflow, one action, and use it for
4:11
the next process and step -by -step so that based
4:15
on our custom input, we get a custom output which
4:18
we utilize for something. But Friday in itself
4:21
used a lot of tools, like some of the tools I
4:24
myself created and named, like visit development
4:27
server, visit production server, visit this API
4:30
container, that container. And based on all of
4:33
this, my LLM creates a system status report in
4:38
which I get to know whether some container is
4:41
down or up or what is going on, if everything
4:44
is working fine. And if anything is down. Then
4:47
it sends me a report via Telegram. And based
4:52
on that report, it's kind of an alarm going off
4:55
like in a plane's cockpit. And I am the pilot.
4:59
All I see that, okay, Friday is telling me something
5:02
is wrong over there. So it's my duty to go and
5:04
evaluate, check what's going on over there. So
5:07
yeah, it's kind of an alarm alert system. Interesting.
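A minimal sketch of that kind of check-and-alert loop, in Python rather than n8n, for readers who want to see the shape of it. The expected container names, the Docker host, and the Telegram credentials are placeholders, not details from the conversation; the point is that the workflow only reads and reports, and the human decides what to do next.

```python
import os
import subprocess
import requests

# Placeholder settings; in a real workflow these would live in a credentials store.
TELEGRAM_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
TELEGRAM_CHAT_ID = os.environ["TELEGRAM_CHAT_ID"]
EXPECTED_CONTAINERS = {"api", "worker", "db"}  # hypothetical container names


def running_containers() -> set[str]:
    """Ask the local Docker daemon which containers are currently running."""
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())


def send_telegram(text: str) -> None:
    """Send the status report to the on-call human via the Telegram Bot API."""
    requests.post(
        f"https://api.telegram.org/bot{TELEGRAM_TOKEN}/sendMessage",
        json={"chat_id": TELEGRAM_CHAT_ID, "text": text},
        timeout=10,
    )


if __name__ == "__main__":
    missing = EXPECTED_CONTAINERS - running_containers()
    if missing:
        # The alarm in the cockpit: report the problem, change nothing.
        send_telegram(f"Status check: containers down: {', '.join(sorted(missing))}")
```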
Interesting. How do you keep this from turning into, like, a chatbot with shell access?

Okay, so yeah, everything revolves around guardrails. When it comes to prompts, the AI agent for doing system checks has its own prompt, and the prompt involves not just doing any actions. It's very limited to commands like docker ps, docker logs, the errors, and all of the information it can gather to make a comprehensive update on what the system is going through. It actually cannot run commands that are harmful to the system, and that's given in the prompt. Now, I get your question: the LLM could go ahead and do anything if it has the SSH access. For that, what I have included is that the tools like visit development server or visit production server are sub-workflows, and those sub-workflows don't just contain the SSH access. They have validation parameters. They won't just let the LLM run any command it wants; they have validation checks based on what I have programmed them for. For logs, it has only docker ps and docker logs, and it can never actually go and, say, docker stop or docker restart, because all of those commands I have deliberately blocked in that sub-workflow. So there is kind of a manual override over there. But yeah, that is a limitation that I have translated into a guardrail.
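The validation layer he describes can be pictured as a small allowlist check sitting in front of SSH. This is a hypothetical sketch, not the real sub-workflow; the host name and the exact command lists are assumptions. Anything that is not explicitly allowed simply never reaches the server, which is the "deliberately blocked" behaviour he mentions.

```python
import subprocess

# Read-only commands the agent is allowed to request (placeholder list).
ALLOWED_PREFIXES = ("docker ps", "docker logs", "docker inspect")
# Mutating commands that are always refused, even if the model asks nicely.
BLOCKED_KEYWORDS = ("stop", "restart", "rm", "kill", "exec")

SSH_TARGET = "deploy@dev-server.example.internal"  # hypothetical host


def run_readonly(command: str) -> str:
    """Run an agent-proposed command over SSH only if it passes validation."""
    cmd = command.strip()
    if not cmd.startswith(ALLOWED_PREFIXES):
        raise PermissionError(f"Command not on the allowlist: {cmd!r}")
    if any(word in cmd.split() for word in BLOCKED_KEYWORDS):
        raise PermissionError(f"Command contains a blocked keyword: {cmd!r}")
    result = subprocess.run(
        ["ssh", SSH_TARGET, cmd],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_readonly("docker ps --format '{{.Names}}: {{.Status}}'"))
```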
6:49
on guardrails a little bit more online, you had
6:51
mentioned explicit rollback phases. What do those
6:54
look like? OK, rollback phases. So, well, as
6:57
a software engineer, when I do push some version
7:00
into my development branch or from development
7:03
to production, there's a lot of chances like.
7:06
after the CI, CD testing and integration, something
7:10
does break when it actually goes out into the
7:13
real world. So what we do know always as a software
7:17
engineer is that the previous version, well,
7:19
right, the previous version was pretty stable.
7:22
So our go -to response is, you know, of course,
7:25
fixing everything and putting it into the right
7:28
place. But if it's going to take time, if we
7:30
know based on the system status check. that,
7:34
okay, this error is a big one and it's going
7:36
to take time rather than, you know, just sitting
7:39
on it for a couple of days and, you know, just
7:42
letting the clients know that, okay, I'm going
7:44
to take my own time and do this. I'm just going
7:47
to roll back to the previous stable version so
7:49
that their work don't get, you know, hindered
7:51
or whatever. You get it, right? Oh, that makes
Oh, that makes sense. Okay, so given the guardrails that you have in place, where do you require human approval? Or is there human approval in the process?

Yeah, there is. The human approval does not come during the system status check. I don't really tell the AI agent where to go to read the data, or where to go to debug the code or whatever it is. I have given it a set of instructions: you can check the Docker logs, you can check the terminal logs, you can check the logging system that I have implemented per code base, you can check the previous GitHub commits to correlate what's going wrong. But I don't say to it, go and check over here, I think it might be wrong over there. It doesn't do that. Where the human-in-the-loop approach comes in is doing some kind of action. Now, the action involves, you know, pressing some kind of button, or running a pipeline or a workflow that actually changes something in the system. All of these things, the system checks and everything, are just reading the data and creating a status report. But where it actually comes to changing something, a change in the system, that's where the human-in-the-loop approach comes in. And without my, you know, verbal API key, me actually telling the agent to run this pipeline, run this workflow, it won't really do that.
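One way to picture that read/write split is a tool registry where every tool is marked as either read-only or mutating, and mutating tools refuse to run without explicit human approval. The tool names and registry below are invented for illustration; the real tools are n8n sub-workflows.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    action: Callable[[], str]
    mutates_system: bool  # read-only tools may run freely; mutating ones need a human


# Hypothetical tool registry.
TOOLS = {
    "read_docker_logs": Tool("read_docker_logs", lambda: "fake log output", False),
    "redeploy_previous_version": Tool(
        "redeploy_previous_version", lambda: "pipeline started", True
    ),
}


def run_tool(name: str, human_approved: bool = False) -> str:
    """Execute a tool; block anything that changes the system unless a human said yes."""
    tool = TOOLS[name]
    if tool.mutates_system and not human_approved:
        return f"BLOCKED: {name} changes the system and has no human approval."
    return tool.action()


if __name__ == "__main__":
    print(run_tool("read_docker_logs"))                                   # runs freely
    print(run_tool("redeploy_previous_version"))                          # blocked
    print(run_tool("redeploy_previous_version", human_approved=True))     # allowed
```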
Okay, so you had mentioned, too, that you had prompts with guardrails. What are the hard "never run this" rules that you have set up?

So it is kind of a corollary. What I'm trying to say is that it needs a specific phrase, like "run the CI/CD pipeline to deploy this version"; that is the exact phrase it requires to run the CI/CD pipeline. In my prompt, I have written it so that it looks for the specific phrase to actually activate that function. Without it, without the actual phrase, without those actual wordings, it can't run it. And that's in the prompt, I get that. But as I told you before, as a guardrail, I have also put it into the sub-workflow so that it doesn't just run anything. Once Friday gives me the comprehensive system check, I usually have two options. One is to ask it again for a comprehensive system check where it has access to the GitHub repositories and the GitHub Actions. And that's when TARS activates, because Friday doesn't have access to GitHub; TARS does. So TARS goes through the normal Docker logs, compares them with the GitHub commits and what might have gone wrong, and gives me a proper report on the area that I need to focus on. Now, it does ask me for a response as to what to do next, because after all, something is wrong in the system. So it's waiting for my approval, or some kind of message that it needs to get from me. I can again ask it to do some kind of comprehensive check. But if I'm good with whatever information it has given me, I can tell it to, you know, run the CI/CD pipeline to redeploy the previous stable version, the previous GitHub commit. What it does is that, based on this specific phrase, it activates the CI/CD pipeline, because in the sub-workflow of the CI/CD pipeline I have actually written that, in order to run this, you need the specific phrase: "run the CI/CD pipeline to redeploy the previous stable version." Without this phrase, if the message is something like "run the pipeline" without the CI/CD part, or "run the pipeline that I last deployed in the GitHub Actions," it won't ever run, because it does not match that specific phrase. And that's kind of rudimentary, but it gives me more believable confidence that it wouldn't just run anything because I wrote something. It is not left up to its comprehension.
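The exact-phrase gate is deliberately rudimentary, which is what makes it a guardrail. A hypothetical sketch of that check, with the trigger phrase as a stand-in and the actual pipeline call stubbed out (in practice it could be a GitHub Actions workflow_dispatch or an n8n sub-workflow). A paraphrase like "run the pipeline" fails the comparison, which is exactly the behaviour described above.

```python
# The one and only phrase allowed to trigger a redeploy (placeholder wording).
TRIGGER_PHRASE = "run the ci/cd pipeline to redeploy the previous stable version"


def maybe_trigger_rollback(human_message: str) -> str:
    """Start the rollback only on an exact phrase match, never on a paraphrase."""
    if human_message.strip().lower() != TRIGGER_PHRASE:
        return "No action taken: message does not match the trigger phrase exactly."
    # Placeholder for the real trigger call; nothing is executed in this sketch.
    return "Rollback pipeline started."


if __name__ == "__main__":
    print(maybe_trigger_rollback("run the pipeline that I last deployed"))   # refused
    print(maybe_trigger_rollback(
        "Run the CI/CD pipeline to redeploy the previous stable version"
    ))  # matches after normalisation
```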
That makes sense. Okay, so let's say that a team wants to adopt human-in-the-loop AI SRE automation. Where do you think they should start?

Okay, they should start, honestly, where they feel that a lot of the steps they do are not just rudimentary but repetitive tasks, because that's where I started. I saw that the incident response procedure in itself has a lot of manual and repetitive steps. And not that I wanted to automate all of it, because after all, it requires access to SSH and a lot of systems, and I can't just let an LLM run wild in there. So segmenting, yeah, that's the right word: segmenting the procedure into different steps, converting those steps into sub-workflows, and integrating all of those sub-workflows into a pipeline one by one, where I have control between the output of one sub-workflow and the input of the next, so that I keep control of what the whole picture is without actually letting one LLM know the entire big picture and be able to manipulate it. That is the go-to for integrating these types of tools.
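One way to picture that segmentation: each step becomes an ordinary function that only ever receives the previous step's output, and the orchestration stays in code the human owns rather than inside any single LLM's context. The stages below are stubs invented for illustration, not pieces of TARS.

```python
def collect_logs() -> str:
    """Stage 1: read-only data gathering (e.g. validated docker logs over SSH)."""
    return "api container: connection refused at 02:14"


def summarise_incident(logs: str) -> str:
    """Stage 2: an LLM call that only ever sees the logs it is handed (stubbed here)."""
    return f"Summary: {logs.splitlines()[0]}"


def propose_fix(summary: str) -> str:
    """Stage 3: another LLM call that only sees the summary, not the raw system."""
    return f"Proposed next step, pending human approval, based on: {summary}"


def incident_pipeline() -> str:
    """The big picture lives here, in plain code, not in any one model's prompt."""
    logs = collect_logs()
    summary = summarise_incident(logs)
    return propose_fix(summary)


if __name__ == "__main__":
    print(incident_pipeline())
```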
How would you pitch human-in-the-loop AI SRE automation to someone who's maybe more skeptical?

All right. So if a person has their own skepticism about AI actions, about LLMs thinking on their own and having access to all of these tools, I get that. I completely get that, because that's where I started as well. But when you delve deep into how an LLM works, it works on an input and an output. So as long as you use that particular stage of giving an input to an LLM, phrasing the output in the way that I want to see it, and using that output as the input to another LLM, so that that LLM only gets the input I am giving it, you keep the big picture to yourself and never let the AI know what the big picture is.

So you obviously introduced this because you're trying to, you know, reduce toil. You're trying to iterate quicker. What was the first thing that you used this AI tool to help solve?

Okay, so...

Where'd you start?
Yeah, it never started as a fixer bot. It actually started just on the basis of: go into the system and tell me what went wrong, and why did the Docker container stop? That's all. What I used to do was create a simple LLM workflow in which I copied the output of the Docker logs, pasted it into the LLM, and let it know: whatever data I have, I'm giving to you, tell me what went wrong. Then it explains to me what went wrong, one by one. And that's when I thought, okay, this is one portion of it. Why not use the same portion for every single container that I have? But I wanted to automate it, so that all I tell it is, okay, go to this container and check it out, rather than me copy-pasting the logs and putting them in. So for that, I needed to give it access to the terminal, right? That's where the SSH access came in. And that became a sub-workflow, because I still needed to keep it safe so that the LLM just can't, you know, write anything. I put the guardrail over there so that it can only write stuff like docker ps or docker logs. And it did the same thing for me, told me what is going wrong or how the system is working. That's where it started. And that's where it saved tons of time: me just copy-pasting stuff, me just looking in the wrong places, because I'd have to look at all of those containers to see what is going wrong. Instead of that, I know exactly where I need to narrow my focus. That's where it all started.
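That first version is easy to reproduce. A rough sketch, assuming an OpenAI-style client; the model name and container name are placeholders, and the original ran as an n8n workflow rather than a script.

```python
import subprocess
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def explain_container_failure(container: str) -> str:
    """Pull recent logs for one container and ask an LLM what went wrong (read-only)."""
    logs = subprocess.run(
        ["docker", "logs", "--tail", "200", container],
        capture_output=True, text=True,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a read-only diagnostics assistant. Explain likely causes; "
                    "do not suggest commands that change the system."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"These are the latest logs for '{container}':\n"
                    f"{logs.stdout}\n{logs.stderr}\nWhat went wrong?"
                ),
            },
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(explain_container_failure("api"))  # hypothetical container name
```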
And then you just iterated on that and started adding more features.

Yeah, yeah, absolutely.

What does future development look like for TARS? Let's say the next three to six months, assuming you have the runway and you have the time, what features would you like to add?

That is a really good question. Well, TARS in itself is, I don't want to say limited. Limited is not the right word, because two or three months ago I was pretty astonished at what TARS could do with the tools that it had, and that itself was a huge accomplishment. But now, after a couple of months of using it regularly, I feel that I could go further. I could give it access to more tools. I can give it access to the infrastructure, maybe Terraform, and go from understanding what resources my system needs and, if it's lagging somewhere, where the focus should be, to maybe increasing some resources somewhere, or maybe giving it access to change stuff in Terraform or something like that. That's the Terraform side of things, infrastructure. And that's one more tool to think about other than just, you know, GitHub Actions and SSH access.
Let's say you have teams that are interested in experimenting with agents in CI/CD, but they're apprehensive. Do you have any advice for how to get started or where to start?

Yep, yep. So it never starts with an LLM, I'll tell you that. Yes, we are trying to embed AI into it, but it never starts with an LLM. It always starts with actually getting your thoughts and putting them on paper, putting them into steps, logical steps that I can program first, and then taking that program and understanding where an LLM will be effective and in what positions. Because if you don't really have a work plan or a game plan to begin with, then the AI will also be as confused as you were.

Yeah.

That's where I'll start.
So what's your take on fully autonomous AI ops agents?

It's scary. Because I myself am not a very big fan of, you know, I don't want to say it out loud, but I don't really believe a lot in experience counting for everything. I believe if you're smart enough to do the smart thing, then sometimes you can outgrow experienced people, if you know where to look and how things work, very logically, instead of just going on gut feeling, because experienced people tend to do that more. And that's where AI and LLMs lack. It's kind of a hypocritical thing to say, thinking about AI and LLMs always taking the smart choice instead of just having lots of experience like a human has. But what I'm trying to say is that I still believe in the gut feeling that I truly hate in a lot of people. And that gut feeling is never going to be there. At least, with the information that I know about AI and LLMs, they have never been shown to have a conscience or a gut feeling; even if it sees all the data pointing to one thing, maybe a human knows where exactly the error is, but the LLM will always go towards the probable cause instead of the gut feeling. So that's where LLMs lack. And that's why they can never replace humans, because we have that creative aspect to us, and it's just a machine at the end of the day.
Yeah, that's fair. Okay, so going back to guardrails. We've talked about guardrails a lot, and we've talked about AI and autonomy a lot. What would you say is the single most important guardrail for an AI fixer bot?

In terms of a single most important one, I would say: validate every output that an LLM gives before you use that output for anything else.
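As a hypothetical illustration of that rule: parse the model's proposed action, check it against an allowlist and a minimal schema, and refuse to pass it downstream otherwise. The JSON shape and tool names here are invented for the example.

```python
import json

ALLOWED_ACTIONS = {"docker_ps", "docker_logs", "report_only"}  # invented tool names


def validate_llm_output(raw: str) -> dict:
    """Parse and check an LLM's proposed action before anything downstream sees it."""
    try:
        proposal = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM output is not valid JSON: {exc}") from exc
    action = proposal.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"Proposed action {action!r} is not on the allowlist.")
    if not isinstance(proposal.get("reason"), str) or not proposal["reason"].strip():
        raise ValueError("Proposal must include a non-empty 'reason' for the human to read.")
    return proposal


if __name__ == "__main__":
    good = '{"action": "docker_logs", "reason": "api container restarting in a loop"}'
    print(validate_llm_output(good))
    bad = '{"action": "docker_rm", "reason": "clean up"}'
    try:
        validate_llm_output(bad)
    except ValueError as err:
        print("Rejected:", err)
```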
Yeah, that's fair. Okay, so closing thoughts. Just curious, what would be one of the main reasons to start using AI in CI/CD? What can that buy you?

Well, AI is like an assistant, and it works to automate a lot of repetitive tasks that humans don't want to do. That primarily saves time, and time saved is money saved. But whenever we do tend to automate these things, we tend to get carried away: okay, I can automate this, I can automate that, I don't need you, I don't need that, I can put all of this into code. Well and good, but anything can go wrong when an AI hallucinates.
Yeah, for sure. All right, so closing out, what advice do you have for people using AI in their pipelines?

Well, I understand there are a lot of disadvantages to a full-scale AI taking over a traditional system, but don't let that hamper you. Don't let that stop you from understanding your own process, understanding your own thoughts, writing it down, writing a program and, you know, running along with it, because everyone gets their own time, and you know how to utilize that time. You can always stop at a point and introspect on what you have done so far, how it's going to affect the future, how it has changed your past, how it's going to help other people, how it's going to hamper other people, and take all of those things into account in building the next step.

Makes sense. Appreciate it. Thank you, Gracious, for coming on.

Thanks a lot, Brian. It was great being on this podcast with you.
All right. That's the conversation with Gracious. I really liked his angle of designing the process first and only then dropping LLMs in as helpers, not magic. If this episode was useful, share it with a platform, SRE, or DevOps friend who's been playing with AI agents and trying not to blow up prod. Hit follow wherever you listen so you don't miss the weekly news recaps plus these guest interviews. We'll be back with a regular Ship It Weekly news episode later this week. See you then. Thank you.