Hey, I'm Brian Teller. I work in DevOps and SRE, and I run Teller's Tech. Ship It Weekly is where I filter the noise and focus on what actually matters when you're the one running infrastructure and owning reliability. Most weeks, it's a quick news recap. In between those, I drop interview episodes with folks who are actually building in the space. Today is one of those interviews. We're talking with Gracious James about TARS, his human-in-the-loop fixer bot wired into CI/CD. We get into how he segments incident response into sub-workflows, how he keeps agents from turning into SSH with a chatbot, and the guardrails he thinks are non-negotiable if you're going to let AI anywhere near production.
Today, I'm joined by Gracious James. He's been building TARS, a human-in-the-loop fixer bot wired into CI/CD, and we're going to talk about what it takes to make AI automation safe enough for real teams. Gracious, thank you for joining me.

Thanks, Brian. Great to be here. How are you doing?

I'm doing really well, thanks for asking. So I'm really interested in this TARS bot that you built. Can you give me a thesis of what you were actually building with TARS? And why is human in the loop the core idea?
All right. So I am a software engineer, and when I coded and pushed my code into, you know, CI/CD pipelines, I saw that it took a lot of time to get through a lot of teams, especially the DevOps ones. So I started looking at the processes DevOps goes through, the incident response and everything like that. There are a lot of steps, manual steps, that each and every person has to work through. Going through it, I realized that all of these things are very manual, very repetitive. And don't get me wrong, everything is important and can break the system at any point, so having a human in the loop is very important. But looking at the processes, I also saw that all of these things can be automated. So having an AI agent, with a human-in-the-loop approach that helps us speed up those processes, is what got me going into all of this.
Interesting. So what problem were you trying to solve that existing tooling didn't handle?

Right. So all of the tools that I used in this workflow are existing tools; it's only that they are not custom-made for, you know, my requirement, my company, or whatever it is. So I had to custom-make the workflow using multiple tools from different domains. I custom-made, you know, the SSH access, in which there are validated SSH commands that my AI agent can utilize to find out what is going on in my Docker containers and get a proper system status report. And in order to do a comprehensive check, I compare those logs, or the errors, or whatever it is that Docker is pointing out, to my previous GitHub commits, be it the previous ones, the latest ones, or whatever it is. I compare them and get a comprehensive check. And based on those checks, it helps me figure out where I should narrow my focus. Once I do that, I understand what fixes I should be figuring out in the very short incident response time that I get.
3:39
Friday, and then you morphed into TARS, right?
3:43
Yes. Can you walk me through this evolution from
3:46
Friday? Like, what was Friday and what is TARS?
3:48
Well, Friday in itself is an Aniton workflow.
3:53
Now, Aniton is a... high -level abstract API,
3:58
sorry, multiple API application system where
4:01
we can connect different programs, subsections,
4:06
and sub -workflows in which we take the data
4:08
from one workflow, one action, and use it for
4:11
the next process and step -by -step so that based
4:15
on our custom input, we get a custom output which
4:18
we utilize for something. But Friday in itself
4:21
used a lot of tools, like some of the tools I
4:24
myself created and named, like visit development
4:27
server, visit production server, visit this API
4:30
container, that container. And based on all of
4:33
this, my LLM creates a system status report in
4:38
which I get to know whether some container is
4:41
down or up or what is going on, if everything
4:44
is working fine. And if anything is down. Then
4:47
it sends me a report via Telegram. And based
4:52
on that report, it's kind of an alarm going off
4:55
like in a plane's cockpit. And I am the pilot.
4:59
All I see that, okay, Friday is telling me something
5:02
is wrong over there. So it's my duty to go and
5:04
evaluate, check what's going on over there. So
5:07
yeah, it's kind of an alarm alert system. Interesting.
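A minimal sketch of that kind of check-and-alert loop, in Python rather than n8n, for readers who want to see the shape of it. The expected container names, the Docker host, and the Telegram credentials are placeholders, not details from the conversation; the point is that the workflow only reads and reports, and the human decides what to do next.

```python
import os
import subprocess
import requests

# Placeholder settings; in a real workflow these would live in a credentials store.
TELEGRAM_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
TELEGRAM_CHAT_ID = os.environ["TELEGRAM_CHAT_ID"]
EXPECTED_CONTAINERS = {"api", "worker", "db"}  # hypothetical container names


def running_containers() -> set[str]:
    """Ask the local Docker daemon which containers are currently running."""
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())


def send_telegram(text: str) -> None:
    """Send the status report to the on-call human via the Telegram Bot API."""
    requests.post(
        f"https://api.telegram.org/bot{TELEGRAM_TOKEN}/sendMessage",
        json={"chat_id": TELEGRAM_CHAT_ID, "text": text},
        timeout=10,
    )


if __name__ == "__main__":
    missing = EXPECTED_CONTAINERS - running_containers()
    if missing:
        # The alarm in the cockpit: report the problem, change nothing.
        send_telegram(f"Status check: containers down: {', '.join(sorted(missing))}")
```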
Interesting. How do you keep this from turning into, like, a chatbot with shell access?

Okay, so yeah, everything revolves around guardrails. When it comes to prompts, the AI agent for doing system checks has its own prompt, and the prompt involves not just doing any actions. It's very limited to commands like docker ps, docker logs, the errors, and all of the information it can gather to make a comprehensive update on what the system is going through. It actually cannot run commands that are harmful to the system, and that's given in the prompt. Now, I get your question: the LLM could go ahead and do anything if it has the SSH access. For that, what I have included is that the tools like visit development server or visit production server are sub-workflows, and those sub-workflows don't just contain the SSH access. They have validation parameters. They won't just let the LLM run any command it wants; they have validation checks based on what I have programmed them for. For logs, it has only docker ps and docker logs, and it can never actually go and, say, docker stop or docker restart, because all of those commands I have deliberately blocked in that sub-workflow. So there is kind of a manual override over there. But yeah, that is a limitation that I have translated into a guardrail.
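The validation layer he describes can be pictured as a small allowlist check sitting in front of SSH. This is a hypothetical sketch, not the real sub-workflow; the host name and the exact command lists are assumptions. Anything that is not explicitly allowed simply never reaches the server, which is the "deliberately blocked" behaviour he mentions.

```python
import subprocess

# Read-only commands the agent is allowed to request (placeholder list).
ALLOWED_PREFIXES = ("docker ps", "docker logs", "docker inspect")
# Mutating commands that are always refused, even if the model asks nicely.
BLOCKED_KEYWORDS = ("stop", "restart", "rm", "kill", "exec")

SSH_TARGET = "deploy@dev-server.example.internal"  # hypothetical host


def run_readonly(command: str) -> str:
    """Run an agent-proposed command over SSH only if it passes validation."""
    cmd = command.strip()
    if not cmd.startswith(ALLOWED_PREFIXES):
        raise PermissionError(f"Command not on the allowlist: {cmd!r}")
    if any(word in cmd.split() for word in BLOCKED_KEYWORDS):
        raise PermissionError(f"Command contains a blocked keyword: {cmd!r}")
    result = subprocess.run(
        ["ssh", SSH_TARGET, cmd],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_readonly("docker ps --format '{{.Names}}: {{.Status}}'"))
```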
6:49
on guardrails a little bit more online, you had
6:51
mentioned explicit rollback phases. What do those
6:54
look like? OK, rollback phases. So, well, as
6:57
a software engineer, when I do push some version
7:00
into my development branch or from development
7:03
to production, there's a lot of chances like.
7:06
after the CI, CD testing and integration, something
7:10
does break when it actually goes out into the
7:13
real world. So what we do know always as a software
7:17
engineer is that the previous version, well,
7:19
right, the previous version was pretty stable.
7:22
So our go -to response is, you know, of course,
7:25
fixing everything and putting it into the right
7:28
place. But if it's going to take time, if we
7:30
know based on the system status check. that,
7:34
okay, this error is a big one and it's going
7:36
to take time rather than, you know, just sitting
7:39
on it for a couple of days and, you know, just
7:42
letting the clients know that, okay, I'm going
7:44
to take my own time and do this. I'm just going
7:47
to roll back to the previous stable version so
7:49
that their work don't get, you know, hindered
7:51
or whatever. You get it, right? Oh, that makes
Oh, that makes sense. Okay, so given the guardrails that you have in place, where do you require human approval? Or is there human approval in the process?

Yeah, there is. The human approval does not come during the system status check. I don't really tell the AI agent where to go to read the data, or where to go to debug the code or whatever it is. I have given it a set of instructions: you can check the Docker logs, you can check the terminal logs, you can check the logging system that I have implemented per code base, you can check the previous GitHub commits to correlate what's going wrong. But I don't say to it, go and check over here, I think it might be wrong over there. It doesn't do that. Where the human-in-the-loop approach comes in is doing some kind of action. Now, the action involves, you know, pressing some kind of button, or running a pipeline or a workflow that actually changes something in the system. All of these things, the system checks and everything, are just reading the data and creating a status report. But where it actually comes to changing something, a change in the system, that's where the human-in-the-loop approach comes in. And without my, you know, verbal API key, me actually telling the agent to run this pipeline, run this workflow, it won't really do that.
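One way to picture that read/write split is a tool registry where every tool is marked as either read-only or mutating, and mutating tools refuse to run without explicit human approval. The tool names and registry below are invented for illustration; the real tools are n8n sub-workflows.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    action: Callable[[], str]
    mutates_system: bool  # read-only tools may run freely; mutating ones need a human


# Hypothetical tool registry.
TOOLS = {
    "read_docker_logs": Tool("read_docker_logs", lambda: "fake log output", False),
    "redeploy_previous_version": Tool(
        "redeploy_previous_version", lambda: "pipeline started", True
    ),
}


def run_tool(name: str, human_approved: bool = False) -> str:
    """Execute a tool; block anything that changes the system unless a human said yes."""
    tool = TOOLS[name]
    if tool.mutates_system and not human_approved:
        return f"BLOCKED: {name} changes the system and has no human approval."
    return tool.action()


if __name__ == "__main__":
    print(run_tool("read_docker_logs"))                                   # runs freely
    print(run_tool("redeploy_previous_version"))                          # blocked
    print(run_tool("redeploy_previous_version", human_approved=True))     # allowed
```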
Okay, so you had mentioned, too, that you had prompts with guardrails. What are the hard "never run this" rules that you have set up?

So it is kind of a corollary. What I'm trying to say is that it needs a specific phrase, like "run the CI/CD pipeline to deploy this version"; that is the exact phrase it requires to run the CI/CD pipeline. In my prompt, I have written it so that it looks for the specific phrase to actually activate that function. Without it, without the actual phrase, without those actual wordings, it can't run it. And that's in the prompt, I get that. But as I told you before, as a guardrail, I have also put it into the sub-workflow so that it doesn't just run anything. Once Friday gives me the comprehensive system check, I usually have two options. One is to ask it again for a comprehensive system check where it has access to the GitHub repositories and the GitHub Actions. And that's when TARS activates, because Friday doesn't have access to GitHub; TARS does. So TARS goes through the normal Docker logs, compares them with the GitHub commits and what might have gone wrong, and gives me a proper report on the area that I need to focus on. Now, it does ask me for a response as to what to do next, because after all, something is wrong in the system. So it's waiting for my approval, or some kind of message that it needs to get from me. I can again ask it to do some kind of comprehensive check. But if I'm good with whatever information it has given me, I can tell it to, you know, run the CI/CD pipeline to redeploy the previous stable version, the previous GitHub commit. What it does is that, based on this specific phrase, it activates the CI/CD pipeline, because in the sub-workflow of the CI/CD pipeline I have actually written that, in order to run this, you need the specific phrase: "run the CI/CD pipeline to redeploy the previous stable version." Without this phrase, if the message is something like "run the pipeline" without the CI/CD part, or "run the pipeline that I last deployed in the GitHub Actions," it won't ever run, because it does not match that specific phrase. And that's kind of rudimentary, but it gives me more believable confidence that it wouldn't just run anything because I wrote something. It is not left up to its comprehension.
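The exact-phrase gate is deliberately rudimentary, which is what makes it a guardrail. A hypothetical sketch of that check, with the trigger phrase as a stand-in and the actual pipeline call stubbed out (in practice it could be a GitHub Actions workflow_dispatch or an n8n sub-workflow). A paraphrase like "run the pipeline" fails the comparison, which is exactly the behaviour described above.

```python
# The one and only phrase allowed to trigger a redeploy (placeholder wording).
TRIGGER_PHRASE = "run the ci/cd pipeline to redeploy the previous stable version"


def maybe_trigger_rollback(human_message: str) -> str:
    """Start the rollback only on an exact phrase match, never on a paraphrase."""
    if human_message.strip().lower() != TRIGGER_PHRASE:
        return "No action taken: message does not match the trigger phrase exactly."
    # Placeholder for the real trigger call; nothing is executed in this sketch.
    return "Rollback pipeline started."


if __name__ == "__main__":
    print(maybe_trigger_rollback("run the pipeline that I last deployed"))   # refused
    print(maybe_trigger_rollback(
        "Run the CI/CD pipeline to redeploy the previous stable version"
    ))  # matches after normalisation
```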
That makes sense. Okay, so let's say that a team wants to adopt human-in-the-loop AI SRE automation. Where do you think they should start?

Okay, they should start, honestly, where they feel that a lot of the steps they do are not just rudimentary but repetitive tasks, because that's where I started. I saw that the incident response procedure in itself has a lot of manual and repetitive steps. And not that I wanted to automate all of it, because after all, it requires access to SSH and a lot of systems, and I can't just let an LLM run wild in there. So segmenting, yeah, that's the right word: segmenting the procedure into different steps, converting those steps into sub-workflows, and integrating all of those sub-workflows into a pipeline one by one, where I have control between the output of one sub-workflow and the input of the next, so that I keep control of what the whole picture is without actually letting one LLM know the entire big picture and be able to manipulate it. That is the go-to for integrating these types of tools.
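One way to picture that segmentation: each step becomes an ordinary function that only ever receives the previous step's output, and the orchestration stays in code the human owns rather than inside any single LLM's context. The stages below are stubs invented for illustration, not pieces of TARS.

```python
def collect_logs() -> str:
    """Stage 1: read-only data gathering (e.g. validated docker logs over SSH)."""
    return "api container: connection refused at 02:14"


def summarise_incident(logs: str) -> str:
    """Stage 2: an LLM call that only ever sees the logs it is handed (stubbed here)."""
    return f"Summary: {logs.splitlines()[0]}"


def propose_fix(summary: str) -> str:
    """Stage 3: another LLM call that only sees the summary, not the raw system."""
    return f"Proposed next step, pending human approval, based on: {summary}"


def incident_pipeline() -> str:
    """The big picture lives here, in plain code, not in any one model's prompt."""
    logs = collect_logs()
    summary = summarise_incident(logs)
    return propose_fix(summary)


if __name__ == "__main__":
    print(incident_pipeline())
```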
How would you pitch human-in-the-loop AI SRE automation to someone who's maybe more skeptical?

All right. So if a person has their own skepticism about AI actions, about LLMs thinking on their own and having access to all of these tools, I get that. I completely get that, because that's where I started as well. But when you delve deep into how an LLM works, it works on an input and an output. So as long as you use that particular stage of giving an input to an LLM, phrasing the output in the way that I want to see it, and using that output as the input to another LLM, so that that LLM only gets the input I am giving it, you keep the big picture to yourself and never let the AI know what the big picture is.

So you obviously introduced this because you're trying to, you know, reduce toil. You're trying to iterate quicker. What was the first thing that you used this AI tool to help solve?

Okay, so...

Where'd you start?
Yeah, it never started as a fixer bot. It actually started just on the basis of: go into the system and tell me what went wrong, and why did the Docker container stop? That's all. What I used to do was create a simple LLM workflow in which I copied the output of the Docker logs, pasted it into the LLM, and let it know: whatever data I have, I'm giving to you, tell me what went wrong. Then it explains to me what went wrong, one by one. And that's when I thought, okay, this is one portion of it. Why not use the same portion for every single container that I have? But I wanted to automate it, so that all I tell it is, okay, go to this container and check it out, rather than me copy-pasting the logs and putting them in. So for that, I needed to give it access to the terminal, right? That's where the SSH access came in. And that became a sub-workflow, because I still needed to keep it safe so that the LLM just can't, you know, write anything. I put the guardrail over there so that it can only write stuff like docker ps or docker logs. And it did the same thing for me, told me what is going wrong or how the system is working. That's where it started. And that's where it saved tons of time: me just copy-pasting stuff, me just looking in the wrong places, because I'd have to look at all of those containers to see what is going wrong. Instead of that, I know exactly where I need to narrow my focus. That's where it all started.
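That first version is easy to reproduce. A rough sketch, assuming an OpenAI-style client; the model name and container name are placeholders, and the original ran as an n8n workflow rather than a script.

```python
import subprocess
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def explain_container_failure(container: str) -> str:
    """Pull recent logs for one container and ask an LLM what went wrong (read-only)."""
    logs = subprocess.run(
        ["docker", "logs", "--tail", "200", container],
        capture_output=True, text=True,
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a read-only diagnostics assistant. Explain likely causes; "
                    "do not suggest commands that change the system."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"These are the latest logs for '{container}':\n"
                    f"{logs.stdout}\n{logs.stderr}\nWhat went wrong?"
                ),
            },
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(explain_container_failure("api"))  # hypothetical container name
```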
And then you just iterated on that and started adding more features.

Yeah, yeah, absolutely.

What does future development look like for TARS? Let's say the next three to six months, assuming you have the runway and you have the time, what features would you like to add?

That is a really good question. Well, TARS in itself is, I don't want to say limited. Limited is not the right word, because two or three months ago I was pretty astonished at what TARS could do with the tools that it had, and that itself was a huge accomplishment. But now, after a couple of months of using it regularly, I feel that I could go further. I could give it access to more tools. I can give it access to the infrastructure, maybe Terraform, and go from understanding what resources my system needs and, if it's lagging somewhere, where the focus should be, to maybe increasing some resources somewhere, or maybe giving it access to change stuff in Terraform or something like that. That's the Terraform side of things, infrastructure. And that's one more tool to think about other than just, you know, GitHub Actions and SSH access.
Let's say you have teams that are interested in experimenting with agents in CI/CD, but they're apprehensive. Do you have any advice for how to get started or where to start?

Yep, yep. So it never starts with an LLM, I'll tell you that. Yes, we are trying to embed AI into it, but it never starts with an LLM. It always starts with actually getting your thoughts and putting them on paper, putting them into steps, logical steps that I can program first, and then taking that program and understanding where an LLM will be effective and in what positions. Because if you don't really have a work plan or a game plan to begin with, then the AI will also be as confused as you were.

Yeah.

That's where I'll start.
So what's your take on fully autonomous AI ops agents?

It's scary. Because I myself am not a very big fan of, you know, I don't want to say it out loud, but I don't really believe a lot in experience counting for everything. I believe if you're smart enough to do the smart thing, then sometimes you can outgrow experienced people, if you know where to look and how things work, very logically, instead of just going on gut feeling, because experienced people tend to do that more. And that's where AI and LLMs lack. It's kind of a hypocritical thing to say, thinking about AI and LLMs always taking the smart choice instead of just having lots of experience like a human has. But what I'm trying to say is that I still believe in the gut feeling that I truly hate in a lot of people. And that gut feeling is never going to be there. At least, with the information that I know about AI and LLMs, they have never been shown to have a conscience or a gut feeling; even if it sees all the data pointing to one thing, maybe a human knows where exactly the error is, but the LLM will always go towards the probable cause instead of the gut feeling. So that's where LLMs lack. And that's why they can never replace humans, because we have that creative aspect to us, and it's just a machine at the end of the day.
Yeah, that's fair. Okay, so going back to guardrails. We've talked about guardrails a lot, and we've talked about AI and autonomy a lot. What would you say is the single most important guardrail for an AI fixer bot?

In terms of a single most important one, I would say: validate every output that an LLM gives before you use that output for anything else.
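As a hypothetical illustration of that rule: parse the model's proposed action, check it against an allowlist and a minimal schema, and refuse to pass it downstream otherwise. The JSON shape and tool names here are invented for the example.

```python
import json

ALLOWED_ACTIONS = {"docker_ps", "docker_logs", "report_only"}  # invented tool names


def validate_llm_output(raw: str) -> dict:
    """Parse and check an LLM's proposed action before anything downstream sees it."""
    try:
        proposal = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"LLM output is not valid JSON: {exc}") from exc
    action = proposal.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"Proposed action {action!r} is not on the allowlist.")
    if not isinstance(proposal.get("reason"), str) or not proposal["reason"].strip():
        raise ValueError("Proposal must include a non-empty 'reason' for the human to read.")
    return proposal


if __name__ == "__main__":
    good = '{"action": "docker_logs", "reason": "api container restarting in a loop"}'
    print(validate_llm_output(good))
    bad = '{"action": "docker_rm", "reason": "clean up"}'
    try:
        validate_llm_output(bad)
    except ValueError as err:
        print("Rejected:", err)
```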
Yeah, that's fair. Okay, so closing thoughts. Just curious, what would be one of the main reasons to start using AI in CI/CD? What can that buy you?

Well, AI is like an assistant, and it works to automate a lot of repetitive tasks that humans don't want to do. That primarily saves time, and time saved is money saved. But whenever we do tend to automate these things, we tend to get carried away: okay, I can automate this, I can automate that, I don't need you, I don't need that, I can put all of this into code. Well and good, but anything can go wrong when an AI hallucinates.
Yeah, for sure. All right, so closing out, what advice do you have for people using AI in their pipelines?

Well, I understand there are a lot of disadvantages to a full-scale AI taking over a traditional system, but don't let that hamper you. Don't let that stop you from understanding your own process, understanding your own thoughts, writing it down, writing a program and, you know, running along with it, because everyone gets their own time, and you know how to utilize that time. You can always stop at a point and introspect on what you have done so far, how it's going to affect the future, how it has changed your past, how it's going to help other people, how it's going to hamper other people, and take all of those things into account in building the next step.

Makes sense. Appreciate it. Thank you, Gracious, for coming on.

Thanks a lot, Brian. It was great being on this podcast with you.
All right. That's the conversation with Gracious. I really liked his angle of designing the process first and only then dropping LLMs in as helpers, not magic. If this episode was useful, share it with a platform, SRE, or DevOps friend who's been playing with AI agents and trying not to blow up prod. Hit follow wherever you listen so you don't miss the weekly news recaps plus these guest interviews. We'll be back with a regular Ship It Weekly news episode later this week. See you then. Thank you.