💬 Host Commentary

For this Ship It Conversations episode, I wanted to get past the LinkedIn hype cycle around “AI agents for DevOps” and talk to someone who actually wired one into their stack without losing sleep over it.

Gracious has been doing exactly that with TARS, a human-in-the-loop fixer bot that plugs into CI/CD, GitHub, and your containers. What I like about his story is that it didn't start as "let's build an AIOps platform." It started with a very boring, very real problem: chasing Docker logs and doing the same incident steps over and over. First he built Friday, a status bot that could safely poke at containers and tell him what died and why. Then he layered TARS on top to correlate failures with commits, suggest where to look, and eventually help drive rollbacks, all behind hard guardrails.

The guardrail piece is what made this worth recording. He’s aggressively narrow with what each agent can see and do. One workflow can only run safe Docker commands. Another can read GitHub but can’t touch infra. Actions that change the world, like redeploying the last good build, require an explicit phrase from a human, and even then there’s a second layer of validation in the workflow itself. It’s not perfect or “formally verified” or any of that, but it’s a real example of segmenting incident response into sub-workflows and keeping the agent boxed in at each step.
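To make that pattern concrete, here's a minimal sketch of the two-layer idea in Python. Every name in it (`SAFE_COMMANDS`, `CONFIRM_PHRASE`, the build IDs) is an illustrative assumption, not TARS's actual implementation:

```python
# Layer 1: a narrow workflow that can only run allowlisted, read-only commands.
# Layer 2: state-changing actions need an explicit human phrase AND a second
# validation check inside the workflow itself.
# All names here are hypothetical, for illustration only.

SAFE_COMMANDS = {"docker ps", "docker logs", "docker inspect"}  # read-only pokes
CONFIRM_PHRASE = "approve rollback"  # explicit phrase a human must type

def run_safe(command: str) -> bool:
    """Layer 1: the status-bot workflow refuses anything off the allowlist."""
    base = " ".join(command.split()[:2])  # e.g. "docker logs"
    return base in SAFE_COMMANDS

def approve_rollback(human_message: str, target_build: str,
                     known_good_builds: set[str]) -> bool:
    """Layer 2: even with the phrase, the workflow re-validates the target."""
    if CONFIRM_PHRASE not in human_message.lower():
        return False  # no explicit human approval, do nothing
    return target_build in known_good_builds  # workflow-side sanity check

# The agent can poke, but not mutate:
assert run_safe("docker logs api-container")
assert not run_safe("docker rm -f api-container")
# A rollback needs both the phrase and a known-good build:
assert approve_rollback("approve rollback to v1.4.2", "v1.4.2", {"v1.4.1", "v1.4.2"})
```

The design point is that neither layer trusts the other: the human phrase gates intent, and the workflow independently checks that the requested action is one it recognizes.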

I also appreciated his answer to "where should teams start?" His take: you don't start with an LLM. You start by writing down your own process, breaking it into steps, and turning those into programs and sub-flows. Only then do you drop an LLM in as glue or a helper. The same pragmatism shows up in his skepticism about fully autonomous agents. He's pro-automation, but he's very clear that human judgment, gut checks, and validating every LLM output before acting on it are the non-negotiables.

If you’re the platform / SRE / DevOps person who keeps getting asked about agents, AIOps, or “can we use AI to fix incidents,” this conversation should give you a concrete example and a vocabulary for pushing toward human-in-the-loop systems instead of “give the bot SSH and hope.” Links to TARS, Friday, and Gracious’s posts are down below, along with the other Ship It Conversations episodes.

📝 Show Notes

This is a guest conversation episode of Ship It Weekly (separate from the weekly news recaps).

In this Ship It Conversations episode I talk with Gracious James Eluvathingal about TARS, his "human-in-the-loop" fixer bot wired into CI/CD.

We get into why he built it in the first place, how he stitches together n8n, GitHub, SSH, and guardrailed commands, and what it actually looks like when an AI agent helps with incident response without being allowed to nuke prod. We also dig into rollback phases, where humans stay in the loop, and why validating every LLM output before acting on it is the single most important guardrail.

If you’re curious about AI agents in pipelines but hate the idea of a fully autonomous “ops bot,” this one is very much about the middle ground: segmenting workflows, limiting blast radius, and using agents to reduce toil instead of replace engineers.

Gracious also walks through where he’d like to take TARS next (Terraform, infra-level decisions, more tools) and gives some solid advice for teams who want to experiment with agents in CI/CD without starting with “let’s give it root and see what happens.”

Links from the episode:

Gracious on LinkedIn: https://www.linkedin.com/in/gracious-james-eluvathingal

TARS overview post: https://www.linkedin.com/posts/gracious-james-eluvathingal_aiagents-devops-automation-activity-7391064503892987904-psQ4

If you found this useful, share it with the person on your team who’s poking at AI automation and worrying about guardrails.

More information on our website: https://shipitweekly.fm