0:00
A lot of teams say they want better CI/CD. What they usually mean is that they want fewer weird failures, less shared-state nonsense, less tribal knowledge, and way less time wasted fighting the delivery system itself. Because the problem is not just that Jenkins is old. The problem is when your pipelines are noisy, fragile, hard to reuse, hard to observe, and tightly coupled to a bunch of infrastructure decisions developers should not have to think about. And when you finally fix the problem, you get a different problem: success. Because once delivery gets fast and easy, people start using it for everything. Bots, maintenance changes, dependency bumps, mass rollouts. And now the real question becomes: how do you keep the system smooth when everybody finally trusts it? Like and subscribe!
1:08
Hey, I'm Brian. I work in DevOps and SRE, and I run Tellers Tech. Ship It Weekly is where I filter the noise and focus on what actually matters when you are the one running infrastructure and owning reliability. Most weeks, it's a quick news recap. In between those, I do interview episodes with people who have actually built things, migrated real systems, and learned what works the hard way.

1:28
Today is one of those conversations. I'm joined by Stefan Moser from Pipedrive. He helped lead a big move from Jenkins to GitHub Actions, built a self-hosted runner platform on Kubernetes, moved delivery towards GitOps with Argo CD, and helped roll that model out across a large internal estate with hundreds of services. And what I like about this one is that it's not just tool talk. We get into why Jenkins had become painful, from Groovy friction to noisy neighbor problems on shared VMs. Why GitHub Actions ended up fitting better. How reusable workflows and custom actions helped. Why they chose Argo CD over other deployment options. And how they had to build better internal observability, because GitHub alone was not enough at their scale.

2:14
We also talk about the migration strategy, which honestly is one of the best parts: dogfooding first, migrating in batches, using internal teams as the first proving ground, letting the process get polished before pushing it wider, and building something self-service enough that teams eventually started migrating on their own. And then there is the mobile story, which is its own thing: Mac minis, messy runner drift, different toolchains, and the surprisingly practical path they landed on after testing a few different options for stabilizing mobile CI.

2:48
If you care about CI/CD architecture, platform engineering, GitHub Actions at scale, or how to do a migration like this without setting your org on fire, this one is worth your time. All right, let's jump in.
3:08
Today, I'm joined by Stefan Moser from Pipedrive. He helped lead a big move from Jenkins to GitHub Actions and built a self-hosted runner platform on Kubernetes, plus a GitOps CD flow with Argo CD. And we're going to talk about what worked, what broke, and what's worth copying. Stefan, thanks for joining me.
3:26
Thank you. Thank you for the opportunity to talk about this adventure. It's not the first time I'm talking about this: I already did a meetup session that was recorded on YouTube, plus two blog posts, so I'm basically sharing these adventures again. But now with a little extra spark, because after one year some things have already changed, so I have more stuff to add to those blog posts and that meetup.
4:00
Awesome. Well, I'm excited to learn more. So starting off, can you give me the thesis: why did Jenkins stop working for you, and what were you trying to optimize for with this new system?
4:11
So basically, the first problem we had with Jenkins is that Jenkins decided the pipelines had to be written in Groovy. In Pipedrive, we mainly work with TypeScript and Go. Groovy was really not needed for us, so it was a barrier for the DevOps teams we had at that time, plus for the engineers trying to write something. That was the first issue.
4:34
The second issue was basically the setup. We had VMs, and those VMs were not isolated. That means if a big pipeline landed on one VM, next to other pipelines like Docker container builds, we had the classic noisy neighbor issue: resource starvation. It was really not predictable.
5:01
We tried to improve that, and our first crazy idea was to build an internal CI engine. That was really the craziest thing we did: basically copy the YAML idea from GitLab CI, build the engine, and start working on that. But then we had another issue: people needed to learn and use a new syntax.
5:24
GitHub Actions, people started using it, some
5:27
developers started using it in their app stores.
5:29
And then we were thinking, why not experimenting?
5:32
And the first idea was basically, besides Jenkins,
5:35
we had another product, CodeChip, that was to
5:38
do the pull request validation. So every time
5:40
the pull request is open, we had a typical lint
5:44
and test work. So basically linting to building
5:47
the... Code, leading the code, and doing the
5:52
unit tests. They were running in the code chip.
5:55
And I can tell it was worse than Jenkins because
5:59
every project was individual. So imagine if I
6:02
had to change or to share something, it's very
6:05
difficult. So we had an idea. Well, let's try
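For reference, the lint-and-test validation described here maps onto a very small GitHub Actions workflow. A minimal sketch, assuming a TypeScript service; the npm script names are illustrative:

```yaml
# Minimal pull request validation: lint, build, unit test.
name: pr-validation
on:
  pull_request:

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci        # install dependencies
      - run: npm run lint  # lint the code
      - run: npm run build # build the code
      - run: npm test      # run the unit tests
```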
6:08
So we had an idea: well, let's try GitHub Actions. I teamed up with a colleague, Gregor, and the idea was basically: let's replace Codeship with GitHub Actions and see what happens. That was really the kickoff. We got a green light to explore the idea.
6:31
The first thing was: okay, I need to run this somewhere. GitHub-hosted runners, even with the quota we had, would probably not be enough, so let's find a solution for self-hosting. After some searching, we found a community project: ARC, the Actions Runner Controller. At this moment it already belongs to, and is maintained by, GitHub, but at that time it was purely community-based. One thing that was very important for us, or very convenient, is that it's a Kubernetes controller. We had spent the last years working with Kubernetes; we understand the Kubernetes API, the CRDs, everything speaking the Kubernetes way, so it was easier to work with: define a resource, get runners running, fine. Another awesome thing is that you can grab a pre-built image and add the customization you want. So if a developer needs some extra CLI or different tooling, we just put it inside that custom image and ship it.
7:48
It got even easier: we already knew how to restrict and monitor Kubernetes in production, so we just applied the same ideas in the CI environment. And I did one thing there that is not the usual standard. I didn't set the requests to the smallest values necessary. No, I went brute force and set the requests and the limits to be almost the same, so that every time a developer gets a runner, it always has the same CPU and memory. That avoids the noisy neighbor problem.
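As a sketch of what that looks like with the community-era ARC API: a RunnerDeployment where requests equal limits, which gives each runner pod Kubernetes' Guaranteed QoS class and therefore identical, isolated CPU and memory. The org name, runner label, and image below are hypothetical:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ci-runners
spec:
  replicas: 4
  template:
    spec:
      organization: example-org                 # hypothetical GitHub org
      labels:
        - self-hosted-k8s
      image: ghcr.io/example-org/runner:latest  # pre-built runner image plus custom tooling
      resources:
        requests:
          cpu: "4"      # requests == limits: every runner gets the same
          memory: 8Gi   # guaranteed CPU and memory, so one big build
        limits:         # cannot starve its neighbors
          cpu: "4"
          memory: 8Gi
```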
8:40
Then, since we already knew how to see and control resources in Kubernetes, we applied the same metrics: we brought the observability tools we had in production back into the CI cluster. We also wanted the process of maintaining the clusters to be more straightforward. Instead of building the cluster from scratch, we used EKS to make it easier. And we wanted scalability in terms of nodes, so we decided to go with a new project at that time, Karpenter: basically the magic AWS way to spawn nodes faster than the Cluster Autoscaler.
9:29
So then we got this solution: a controller listening to GitHub webhooks; when a job is queued, it creates a new pod that represents a runner, and if that runner doesn't have space, Karpenter creates a new node. That really brought flexibility: I just need to set the max number of pods, or runners, and the ecosystem scales the cluster up and down when necessary. That was basically the why of the move, and the first steps.
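The node side of that setup can be sketched with Karpenter's current NodePool API (at the time described it would have been the older alpha Provisioner resource); the values here are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ci
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "500"               # cap on total CI compute
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s    # tear idle CI nodes down quickly
```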
10:06
We then did the migration from Codeship to GitHub Actions. Around that time, reusable workflows were also released, which meant we could create one workflow and spread it across all the repositories, reducing the repetition and the manual configuration we had with Jenkins and with Codeship.
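Mechanically, a reusable workflow is just a workflow triggered by workflow_call, which every service repository can then invoke with a single job. A sketch with hypothetical repo and input names:

```yaml
# Central repo: .github/workflows/service-ci.yml
name: service-ci
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: "20"
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci && npm test
---
# Each service repository: one tiny caller instead of a copied pipeline.
name: ci
on: [push]
jobs:
  ci:
    uses: example-org/workflows/.github/workflows/service-ci.yml@main
    with:
      node-version: "20"
```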
10:32
Basically, after we dropped Codeship, that was when we had the opportunity to revamp CI/CD. We brought in a group of engineers, I think four engineers plus a developer experience product manager; we had a product manager dedicated to developer experience. And then we decided to revamp, and we basically set up a kind of competition in terms of tooling. I remember the first contender was GitHub Actions, in terms of CI, but we also brought in Argo Workflows and Tekton, because those were two projects I was curious about using. And in terms of deployment, we were thinking Argo CD plus Flux. We even tried Spinnaker, but it was so messy to spin up that we just dropped it.
11:38
And how did we choose? First was readability, or feasibility: the ease of creating the workflows and the specific customizations. That is really where GitHub Actions shines for us. People complain about actions, but the fact that actions are written in JavaScript, and we are using TypeScript, makes it natural to create custom logic in TypeScript. We can also create composite actions with some bash scripting; we just use the languages we know to build the customization. And it was really easy to package up: we create an action, and it's like a package, a block we can put in different places.
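A composite action is exactly that kind of packaged block. A minimal sketch, with illustrative action and script names:

```yaml
# setup-and-lint/action.yml: a shared block reusable across repositories.
name: setup-and-lint
description: Shared Node.js setup plus lint step
inputs:
  node-version:
    description: Node.js version to install
    default: "20"
runs:
  using: composite
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ inputs.node-version }}
    - run: npm ci && npm run lint
      shell: bash   # composite run steps must declare a shell
```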
12:23
The other point was basically how we expose the workflows to the developers. The fewer clicks a developer has to make, the better. And that, of course, makes GitHub Actions a big contender: the workflows live next to the repository, the execution happens next to the repository, and the developer doesn't need to switch platforms to see it. That was really a big plus. And that's really the reason why we went with GitHub Actions.
13:02
At some point after this migration, what we had was basically a monorepo of GitHub Actions; I think we have 50 or 60 GitHub Actions in there. We just build our own actions, and it's even the monorepo for all the code we want to reuse across the organization.
13:22
In terms of deployment, one thing we wanted to make clear is that we wanted to avoid push-based deploys. One of the issues we always had with Jenkins and the push model is that I need a runner that pushes to the cluster. That means I have to give Kubernetes credentials to a runner so it can write things, and that means someone could, in some way, try to capture that runner and do malicious stuff. So let's do the reverse: the cluster stays in isolation, and what we have is something inside the cluster that checks out something that is already trusted and uses it to apply changes in the cluster. That is really why we wanted to do GitOps. And of course, in GitOps we had two tools to choose from: Argo CD or Flux.
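The pull model he describes is what an Argo CD Application expresses: a controller inside the cluster watches a trusted Git repo, so no external runner ever holds cluster credentials. A sketch, with hypothetical repo and service names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/environment-state.git  # trusted manifests repo
    targetRevision: main
    path: services/my-service
  destination:
    server: https://kubernetes.default.svc   # in-cluster: credentials never leave
    namespace: my-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```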
14:20
Yeah. And in that case, one of the reasons was basically that Argo CD had a good UI compared with Flux at that time; this was two or three years ago. That was really the big reason we picked Argo CD over Flux: the UI. It gave us a way to show developers what happened in an easier way.
14:48
Because again, that is that it's easy for developers
14:51
to visualize doesn't mean that i allow them to
14:55
change manifest with rcd yeah i know what i allow
14:58
is to basically scope in inside of the application
15:02
saying okay this is your application see your
15:05
status you can go see the status of your resource
15:08
you can see the logs if you need it you can see
15:12
if the content the pod is it's restarting or
15:14
not really um we added that idea. Yeah, you can
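In Argo CD terms, that kind of scoping can be done with its RBAC ConfigMap: developers get read access to application status and logs, but no sync or edit rights. A sketch, with a hypothetical SSO group name:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    # read-only visibility: status and logs, but no sync/update/delete
    p, role:developer, applications, get, */*, allow
    p, role:developer, logs, get, */*, allow
    g, example-org:engineers, role:developer
```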
15:22
After that, picking up GitHub Actions, I think it took me like two days to create an MVP of the deployment flow, basically because I was reusing everything I had already done in the Codeship migration. We decided to ship that MVP and then see what the gaps were.
15:53
One thing I forgot to mention: in the first step, the first iteration, when we migrated away from Codeship, we had a problem with a lack of observability around Actions. GitHub at that time didn't provide any statistics about actions, so if you wanted to know how long a run took, you had no idea; you could only look at the repository level, and that was not good enough. Especially in our organization: at that time we had around 700 services, a microservices architecture plus libraries, so it was impossible to track at the repository level. What we did was create a service that listens to these events and puts everything into a database, so we could build a source we can query to find the performance of the workflows. Because at some point a manager, a director, or even the CTO comes to us and asks: what is the failure rate of these workflows? Which deployment flows are taking the most time? Which steps do we need to optimize? It was really our gut feeling that we would need that information, so we store it. And we still use that service today, now even to track billing information. It's really the source of truth.
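GitHub's workflow_job webhook carries enough to build exactly this kind of database. A sketch of the sort of record such a listener might persist; the field names follow GitHub's workflow_job payload, and the values are invented:

```yaml
# One stored record per workflow job, derived from a workflow_job webhook:
event: workflow_job
record:
  repository: example-org/my-service    # repository.full_name
  workflow_name: service-ci             # workflow_job.workflow_name
  job_name: test                        # workflow_job.name
  run_id: 123456789                     # workflow_job.run_id
  labels: [self-hosted-k8s]             # workflow_job.labels
  created_at: "2024-05-01T12:00:00Z"    # queued
  started_at: "2024-05-01T12:00:20Z"    # picked up by a runner (queue time)
  completed_at: "2024-05-01T12:04:05Z"  # duration and failure rate come from here
  conclusion: success                   # success | failure | cancelled | ...
```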
17:29
Okay, that was phase one. Phase two and phase three were basically about making the product ready for consumption. One thing we discovered we lacked was, again, organization-wide visibility. GitHub Actions is great for seeing your own workflows; for seeing the whole organization, it's bad, because you don't have anything. So we created a service: basically a registration service plus a UI in our back office. We have a UI in our back office that shows all the deployments, but it needs a source, so we created a service that consumes events that we produce in our workflows, and after consuming those events it maps them to what happened in which app. We basically have our own interpretation of the events, because GitHub has a different interpretation of what a deployment is, what a deploy is, what a region is. So we added that internal understanding of what is a build, a test, a unit test, a functional test, an end-to-end test, a deploy to one region, a deployment at full scale. That understanding lives inside the service, and this was the service that consumed everything to create observability of what we have in terms of deployments to production. After having that UI, we were more confident to go.
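The registrar is internal to Pipedrive, so its API is not public; but "events we produce in our workflows" reduces to a step that posts a small JSON document. A purely illustrative sketch: the URL, secret name, and payload shape are all invented:

```yaml
# Hypothetical workflow step reporting a deployment event to an
# internal registrar service (endpoint and payload are invented).
- name: Report deployment event
  env:
    REGISTRAR_URL: https://registrar.internal.example   # hypothetical
  run: |
    curl -sS -X POST "$REGISTRAR_URL/events" \
      -H "Authorization: Bearer ${{ secrets.REGISTRAR_TOKEN }}" \
      -H "Content-Type: application/json" \
      -d '{
            "service":   "${{ github.repository }}",
            "phase":     "deploy-region",
            "region":    "eu-1",
            "run_id":    "${{ github.run_id }}",
            "initiator": "${{ github.actor }}"
          }'
```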
19:18
of release? 700 service. I think we have five
19:22
teams or 10 teams at that time. It was really
19:28
a big migration. How you start migrations? I
19:33
think like any product should migrate or should
19:39
be released, you first try it yourself. So basically...
19:42
And we decided to migrate our own service. My
19:46
team doesn't only produce CI, CD. We have a couple
19:49
of service that supports the deployment flows
19:53
and support the developer experience. So for
19:56
example, we had one service that creates and
19:59
do all the lifecycle of remote dev environment.
20:03
So basically we have remote dev environments
20:05
that developers can use that it's a replica of
20:08
production. And then we have a system that creates
20:09
that. For example, this is one of these servers
20:12
that we need to migrate. And that is really basically
20:15
like this. So I have scripts, I have workflow,
20:19
I have documentation to do it. So I deliver to
20:22
my teammates so that my teammates can migrate.
20:25
These bring two good things. First, we test our
20:29
process to someone that doesn't know about the
20:32
process. And second, we allow... to share knowledge.
20:36
So basically, knowledge that was restricted to
20:39
that five people that create systems starting
20:41
to be spread to the team because I'm sharing
20:43
that knowledge by forcing the people to do the
20:46
migration. And basically, it was basically this
20:49
alpha grouping. Then we had a different team.
20:53
So basically, it was a platform team that it's
20:56
more experienced than the normal developers around
20:59
stuff with Kubernetes. and the deployment process
21:02
and they have more out of the box or out of the
21:06
standard service, then we go to them and say,
21:09
okay, we have this Polish workflow. Let's now
21:13
teach you, do a session and then allow them to
21:18
migrate. And then it was basically like the open
21:22
beta of the system, of the migration. Again,
21:26
the idea was to... make everything more polished,
21:30
more clear for the users. After they give us
21:34
our feedback and then we improve the workflows,
21:36
define, we start doing to think how we do all
21:39
the role. In this case, we pick up the prioritization
21:46
that SRE team define for the service. So basically
21:51
SRE define tiers. So tier one is a service that's
21:56
very critical. Tier two, it's more or less. And
21:59
then tier three is less critical. So we decided,
22:01
okay, let's split this in batch. Let's go team
22:05
by team, starting in tier three. Then they can
22:09
move to tier two and then tier one. And that
22:12
was the idea. And then the plan was basically
22:16
split it in batch and assign a DevOps engineer
22:20
to each batch, not to do the... the migration
22:26
but to be the assistant so basically the person
22:30
who does the introduction of the new way ui the
22:34
new workflow and then this is the the the interaction
22:39
goal that we have to to do your mind and also
22:43
to make them so basically if i want everyone
22:47
one person that's responsible to achieve a goal
22:50
of migrate x number of of service until the end
22:55
of the month it will keep the momentum in the
22:58
team and then we started doing that so team by
23:01
And then we started doing it. Team by team we started moving; we had a schedule, we had a limited number of engineers, so basically we had a queue. At some point we had people from the middle of the queue coming to us and saying: I already saw the other teams doing the migration, and I think I can do it alone. Okay: just check the recording we made, and try it yourself. And they decided to try alone and started doing the migration on their own. The system was already so well oiled and everything was moving so smoothly that they were able to migrate alone. And that's really, I think, the real success story of doing these migrations: by batch, doing your dogfooding, and making everything automatic, with less human intervention, so that people can just use it. You can also see this as the core idea of platform engineering: making everything self-service. When you have a product that's usable by a normal product engineer who doesn't need the context of CI/CD or how to deploy to Kubernetes, it's a service they can use to do their work autonomously. And yeah, after five months I think we had migrated everything.
24:23
Yes, we found what we get as issues, basically
24:29
our success starting break things. So the process
24:34
was so smoothly. The deployment flow was so good.
24:38
that then we're starting an introduction of bots
24:40
to look at deployment for the maintenance tasks.
24:43
So basically, the SRA team developed a service
24:46
to create pull requests to adjust the resource
24:48
usage in production and then automatically deploy
24:51
that to production. So basically, it creates
24:53
a pull request, we introduce a set of resources,
24:55
and this pull request is already pre -approved,
24:58
so it moves by all the normal pipeline flow and
25:04
goes to production. Then we have the PandaBot.
25:07
creating pull requests for the pendants and in
25:10
some case the developers were already confident
25:12
enough in the unit test they have in the functional
25:15
test they have allowed that obligations or updates
25:20
of of dependence to go without any approval from
25:24
a human so we had then a situation that we have
25:27
multiple um departments happen and then was struggling
25:31
or doing impact in our system. Even we can grow,
25:36
grow, grow. Then we add some bottlenecks. At
25:40
that point, we decide to improve that service
25:44
that was the deployment registrar to have a queue.
25:48
So the idea is that you then have a way that
25:52
you, when you start the deployment, you send
25:54
event, I have a deployment. And then what happens
25:57
is that the deployment registrar register that
26:00
deployment and put in the queue. evaluate the
26:03
queue size so basically we define that we have
26:06
the number of 50 deployments per available in
26:09
parallel and also we do a tricky thing that is
26:14
we only allow 10 of that queue to be used by
26:18
bots so basically we always want to have some
26:22
kind of free space for the moments to develop
26:25
that develop features to ship and basically that's
26:28
a and so we have a queue with all the idea to
26:31
put a margin for humans. And then if everything
26:35
is okay, then we rerun the workflow that allows
26:40
to execute everything. Yeah, it was basically
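Since the registrar is an internal service, its configuration is not public; the policy he describes boils down to two numbers. A purely invented sketch of the shape:

```yaml
# Invented config sketch of the queue policy described above.
queue:
  maxParallelDeployments: 50   # global concurrency ceiling
  botSlots: 10                 # at most 10 of the 50 slots for bot-initiated deploys;
                               # the remaining 40 stay as headroom for humans
```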
26:46
that thing that we had to improve. Also, one
26:49
of the bottlenecks that we had was we had issues
26:52
with how we commit to an environment state. So
26:57
basically, it was one of our... really bottlenecks
27:01
is when you have multiple deployments, you need
27:05
to be, only can commit once, one at a time, the
27:10
deployment. So basically, we had a strategy that
27:13
we create one commit per region. So we have a
27:17
lot, a pile of commits to push to that, this
27:23
environment site repository. And that was really
27:26
a bottleneck. Fortunately, we were not very clever
27:29
at that time to create the queue for that specific
27:32
step. And then we had to re -engineering all
27:35
the process because we had the issue that we
27:38
didn't do like a FIFO. So basically, we did implement
27:42
a hot hobby. And that means that some guys was
27:45
very unlucky that it was the first one to arrive
27:48
and was not the first one to get their stuff
27:52
deployed. Yeah. This is really the story that
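For what it's worth, plain GitHub Actions can serialize a step like this with a shared concurrency group, though with the same fairness caveat: GitHub keeps only the newest pending run in a group rather than a strict FIFO queue, which echoes exactly the problem described here. A sketch, with an illustrative script name:

```yaml
jobs:
  commit-environment-state:
    runs-on: ubuntu-latest
    concurrency:
      group: environment-state-commits   # one writer at a time across all runs
      cancel-in-progress: false          # let the running commit finish
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/commit-region-state.sh   # hypothetical commit/push script
```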
27:55
Yeah. This is really the story we have. What was missing at that time, in the presentation and the blog posts, was the migration of the mobile team. I don't know if you want to go to that story already, it's a new story that I have, or if you have any other questions.
28:16
in a second. So going back over your CICD process,
28:20
I'm curious. First off, are you worried about
28:23
The GitHub change to private self -hosted runners,
28:27
they're changing the pricing model. So I guess
28:29
they're going to charge for self -hosted as well
28:31
now. Is that going to change your implementation
28:33
at all? That's the problem. I can think that
28:36
if they're starting to charge that, then I need
28:40
to start to thinking, if they start charging
28:42
like that, then I need to really be very picky
28:45
in their SLAs. so i they how they can charge
28:51
me a fee for the control plane when their control
28:54
plane it's not really 99 or doesn't yeah meet
29:01
the slas it certainly hasn't been lately that's
29:04
for sure that that is really the thing even even
29:07
we don't even we get that uh we more or less
29:10
exclude that from our matrix of of um I already
29:14
starting to think already seeing some page that
29:17
shows the SLA in the last couple of months. And
29:22
then I'm starting to thinking if you charge me
29:24
for this, I need to have a better SLA. I cannot
29:27
stop working. And then for the size of what we
29:32
have, it's basically or I need to find a different
29:37
CI tooling and going to that path or even can
29:41
go more crazy. Don't forget that, okay, we are
29:46
a special case, or probably not a normal case.
29:49
That is, we are already in enterprise level.
29:53
And we have GitHub hosted enterprise server.
29:58
So basically, you can bring yourself. And if
30:02
this is starting to be expensive, probably that's
30:05
the thing I need to do. I bring down and then
30:08
try to be less dependent of... the ability of
30:12
the cloud version and then be my concern. And
30:15
then I can then point to myself. Yeah, your GitHub
30:18
is down because of myself and not because of
30:20
some change in the cloud version. But yeah, it
30:25
will be really a concern. And if they force us,
30:28
it's basically now I will go to my legal team
30:31
and then check the SLA and let's see if they
30:33
don't like the SLA. They're starting to get some
30:37
notice from our legal team to get a recharge
30:40
back or something. So also you had mentioned,
30:42
that's a fair statement for sure. You had mentioned
30:45
looking at Flux originally and then going with
30:49
Argo. Are you using anything like Crossplane
30:52
with Argo yet or no? No. So basically the fact
30:57
that we don't use Crossplane is basically because
30:59
it's not in our domain of work. Okay. From what
31:04
I understand, really, the idea is that cross
31:06
-plane is a good way to provision a resource
31:11
using Kubernetes as a native language. Even I
31:16
try to understand from Victor, Victor Werczek,
31:19
that's working with cross -plane, the difference
31:22
between Terraform and cross -plane, and that's
31:25
really the reason that... We didn't go deep in
31:28
crossfire because the main player of Terraform
31:31
and provision infrastructure is the infrastructure
31:34
team, not the engineering excellence department.
31:36
Basically, we are the middle layer. Imagine it's
31:40
like a lasagna or a burger. Basically, the team
31:43
is the lowest band and I am the lettuce. And
31:49
then we have even up on me, we have the burger
31:52
and then we have the tomato and they have the
31:54
other. So basically we have. these layers and
31:56
i'm in that layer that i consume service from
32:00
infrastructure team and then i deliver service
32:02
to to to the to the platform team and to the
32:07
to the developers so that's the reason that we
32:09
don't look for crossplane that's thing that i
32:12
would like to to experiment but then i need to
32:14
really have a good case of why did a terraform
32:18
to use crossplan yeah it's more i guess if you
32:21
want the infrastructure definitions closer to
32:25
the actual service right so if you want that
32:28
all defined together and there's there's pros
32:30
and cons for both ways i would say honestly even
32:33
in our organization we kind of or i've worked
32:35
in organizations where we've done both yeah even
32:38
even listen uh what victor said about the idea
32:41
of cross plan and how to use cross plan and that
32:43
year for example is one of the examples that
32:45
i can tell or i can get a model in terraform
32:49
that's getting a postgres to be from Mother bless.
32:54
But probably if I want to talk with the developer,
32:57
developer wants to have the minimal settings
32:59
to change. So basically, I think the WinCross
33:02
plan can abstract that with their internal resources
33:05
and say, okay, define this YAML manifest or that
33:08
YAML resource or, sorry, that CRD in YAML and
33:13
then the controls and everything will set up
33:16
everything for you. So at this moment, we don't
33:19
add this kind of... need in terms of organization
33:21
to have exposed so much the infrastructure to
33:25
developers. And that's real. Without need, I
33:28
don't have a way to force a tech to be used.
33:33
Oh, that's fair. It's always a balance.

Yes, it's cool, but it really has to make sense for the case.

33:40
So tell me about this, the new mobile deployments. How is that going, and how did you set that up?
33:51
The mobile team was using Jenkins too, but with an even stranger setup. They had a farm of Mac minis, connected to the Jenkins controller, and they were using that. So they were using Groovy in their Jenkins pipelines, plus fastlane for the iOS team; the Android team used different tooling. And that was really massive, because part of the team had to play operations: understanding and upgrading the nodes, fixing issues. And again, the same isolation problem: we had cases where the nodes were not identical, and a job that landed on one machine would pass while the same job landing on another machine would fail, or leftover artifacts would screw things up, with Node versions, Ruby versions. It was really a mess, and a loss of productivity.
34:54
was really the idea We need to move to GitHub
34:57
actions, but plus also find a way to make their
35:01
compute power or compute resource be very stable,
35:05
having the same identity that we had in the CI
35:11
-CD for the microservice. So that was a multiple
35:14
deployment and isolate. And at that point, it
35:18
was really a surprise for me, the end solution,
35:21
but basically it was this case. So I went to
35:24
research. that basically I put 4Ks on top of
35:28
the table. First, using VMs inside of Mac. So
35:32
basically a product from CircleCI. So basically
35:40
a company that's already providing GitHub Action
35:43
Runners for Macs and have the system to work.
35:48
And basically it's on top of the one tooling
35:50
at the start. And basically it's a... a nice
35:54
tooling to spawn VMs inside of Mac. Then the
35:58
idea, okay, let's try to use Nix for some independence
36:02
in isolation. And the other idea was basically
36:05
also to use AWS. So why not spawn Macs in AWS
36:11
and use it as a GitHub address? And at last I
36:15
was thinking, I need to at least think in the
36:18
way of outsourcing this to a company. This guy
36:21
is GitHub Actions, but... I could also think
36:23
to other GitHub action providers or first to
36:26
GitHub itself and then to other providers because
36:31
you can see that Blacksmith and I think Depot
36:35
are providers of GitHub action trends that you
36:40
can offload and don't depend of GitHub to have
36:44
the best performance in your machines. Basically,
36:47
it was really a poor scientific research. I have
36:52
three, four hypotheses. I have one month to test
36:55
it. And I set like one week for each hypothesis
37:00
and then try to go. And this is really the first.
37:03
It was in summer of last year. And also that
37:07
culminate in the appearance of the... AI native
37:13
mindset. So in this experimental research, after
37:17
I collided with teachers, with tooling, I found
37:20
that I had to use start to create VMs. So first
37:24
hypothesis. And then I would decide, well, if
37:27
I try to create a controller, like I have the
37:29
idea or already the use case that to have a control
37:32
for Kubernetes, to run runners. So why I don't
37:36
have an action runner control for that? And that
37:39
was really the idea. starting to think i did
37:42
a poc with shell script because it was a very
37:45
easy command but then i decided okay i have a
37:49
nice shell script let me do a really nice mvp
37:52
and then i did my first specification development
37:55
project so basically i used this idea i then
37:59
signed to define my specifications in a markdown
38:03
file what I want, what were the toolings, what
38:06
were the constraints. And then use, in this case,
38:09
was already still using GitHub Copilot and say,
38:12
I have this idea in this file, let's make a plan.
38:15
And starting elaborating the plan, creating the
38:18
plan. And then the process was really that way
38:20
that after I have the plan, we need to have to
38:22
-do lists or to -dos for each point. And then
38:25
basically I force the Copilot to use, go for
38:30
each to -do or each step. do the implementation,
38:33
I review it. Okay, it's fine. Let's move to the
38:36
next one. Fine. And then also ask to do a summary
38:40
of each implementation. So basically to have
38:42
a history of what I did. Just important for me,
38:45
but also important to share with the team all
38:47
this process. And then after one day and a half,
38:53
I got a control. So basically I had a Mac mini
38:57
in my desk. I put the control there and it was
39:00
spawning. and doing the lifecycle of the VM.
39:05
So basically, a workflow, it's my runner, execute
39:10
a niche, drop, tear down the VM, start a new
39:13
one, register against GitHub, like normal flow.
39:16
And I will say, yeah, nice. I just need to make
39:19
some improvements. Then I moved to the Nix. AI,
39:22
in this case, also it was Copilot, helped me
39:26
a lot. how to build the recipes with Nix, but
39:30
it fails tremendously just because of the way
39:33
Xcode works. It was very annoying to work. And
39:37
then I had the issue of how to distribute Xcode.
39:40
At that time, I was running out of time. AWS
39:42
was not really an option to investigate. And
39:45
then I started doing some calculations about
39:48
cost if I use GitHub as a provider of headers.
39:52
And surprise, surprise. It was cheap for our
39:56
use case. So it was basically a matter of after
39:59
spending three or four months collecting metrics
40:02
in Jenkins, I say, yeah, we can use it. And basically
40:05
it was really the idea. I said, okay, this is
40:08
the amount of money and comparing the working
40:12
hours that it's necessary for an engineer to
40:14
fix this, it's a good balance. And then after
40:16
I convinced my director that it really makes
40:19
sense in terms of financial terms. you approve
40:22
and then we move and then we move to migration
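For context, moving to GitHub-hosted macOS runners is mostly a matter of the runs-on label. A sketch of an iOS job (the runner label is a real GitHub-hosted one; the fastlane lane name is illustrative, and macOS runner minutes are billed at a premium):

```yaml
jobs:
  build-ios:
    runs-on: macos-14    # GitHub-hosted Apple silicon macOS runner
    steps:
      - uses: actions/checkout@v4
      - run: bundle install               # fastlane ships as a Ruby gem
      - run: bundle exec fastlane build   # hypothetical lane name
```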
40:25
and that this migration was i did with the junior
40:27
the first thing that we did when this migration
40:29
was really sit down with the developers and i
40:34
asked anoint us to to to my junior in this case
40:38
it was to internities sorry but it was really
40:41
important to that yeah i asked him go for each
40:44
um jenkins pipeline they have and start doing
40:50
a flowchart so basically we had a flowchart for
40:53
each Jenkins pipeline with steps and the steps
40:58
in the way that what is supposed to do and what
41:01
are the commands that are executed and then i
41:03
sit down with the with each uh team from android
41:08
and and ios and then let's go forward for each
41:12
step and then try to understand as this makes
41:14
sense this flow i don't care about what is the
41:17
command that is executed. Does this step make
41:20
sense? Does this test make sense? Does this fork
41:23
in the logic make sense? And then we also understand
41:27
some good things that is some workflows or some
41:31
checking job that we have was already redundant.
41:33
We could refactor the input and combine in one
41:38
single pipeline. That was really the idea. And
41:42
then that was the second time we used the AI
41:44
to speed up. And this time I already added the
41:47
session of agenting coding with my engineering
41:51
department. So basically the engineering excellence
41:53
create agenting sessions to teachers all to use
41:57
AI tools in a more agentic way. So not to auto
42:01
-complete features, but to give context, to give
42:04
a goal. have this guy, this AI, as really a partner
42:09
to execute. And then at this time, already we
42:12
was using cloud code. And then, okay, let's bring
42:15
these flowcharts, convert to something that is
42:17
more digestible by AI. So I was able to export
42:21
as a CSV file. And then, okay, these are CSV
42:25
files, contains flowcharts of our workflows.
42:28
Let's build GitHub Action workflows. and documentation.
42:32
And he's starting doing the old workflow. Was
42:35
not really exactly what we want, but was close
42:38
enough. Imagine that this was really a good best
42:42
draft, a good first draft. This was really, now
42:45
just adjust this step, this step, this step,
42:47
and then we just starting building on that. And
42:50
then, of course, this really make the work very
42:53
easy. So we add like 17 workflows to migrate.
42:58
What is the difference from this migration to
43:01
the other migration? The other migration, we
43:02
add like one or two workflows for all the process.
43:08
So basically, the issue was to replicate that
43:11
to use workflows to 700 service. In this case,
43:14
we have only two service, two repositories, the
43:17
Android and iOS, but we have multiple workflows
43:19
to migrate. And then we had to rewrite a lot
43:22
of stuff. And that was the way that we use AI
43:26
to basically at the pace of each day we migrate
43:29
a workflow and then cloud code was able to digest
43:34
some part of the code base. So I had two issues.
43:37
First, parts of the customization that they have
43:40
in the iOS team was using FastLine that is written
43:44
in Ruby. I don't know Ruby. So I used it to understand
43:49
what was that. what was the logic behind that
43:55
ruby scripts and had extra features then i'm
44:00
not very well versatile in the ios test and compiler
44:06
so i never i'm not irs developer i don't know
44:10
all the quirks about the the process to compile
44:13
language compile ios app And I use protocol for
44:17
that. So basically, it was already in a way that
44:20
I had an issue in my pipeline. And I tell them,
44:23
OK, I have problems like this. I have an issue
44:27
in my pipeline. This is the idea of the workflow.
44:31
This is the idea of job. This is a step that's
44:33
failing. Fetch. using github cli in that time
44:36
even we using mcp probably it's best if you have
44:40
a cli to tell the ai model to use that cli to
44:43
fetch the information instead of uh bloat the
44:49
contacts with mcps use the cli fetch the logs
44:53
in that section and that let's go investigate
44:55
what it's what it's failing and it was really
44:58
able at some point i get get surprised because
45:01
when i go in deep mode of troubleshooting the
45:04
model in this case even the the agent that is
45:07
called cloud cloud was able to go to the internet
45:10
and find github issues about the problem pointing
45:14
out i have found this issue probably it's about
45:17
this let's double check and then i double check
45:19
yeah probably makes sense let's try this change
45:22
and it was really basically it was a more family
45:26
language or i cannot what i can say it was like
45:31
while we were in the past doing Googling. So
45:33
you put the problem, you try to find the issues.
45:35
Here, I have the problem. Also find the issues
45:38
in the internet and get me back the information
45:41
and then validate with me and then explore. And
45:43
basically, after four weeks, we migrate everything.
45:47
We even add extra features they want. And they
45:51
were very happy. And still, they are very happy.
45:55
Of course, the initial costs failed because it
45:58
was more than I expected. But for one reason,
46:01
developers were delivering more. So it was really
46:04
a situation that the workflows and run are so
46:09
stable, they can focus more in future and then
46:12
increase the cost. But it's because they are
46:14
shipping more features than they did in the past.
46:19
That's a good cost problem to have, you know?
46:21
Yeah. Yeah. I think that's the thing I have.
46:26
Do you have any questions about that topic? I
46:29
did want to ask, wrapping up, if someone that's
46:34
listening, they wanted to pull off a big CICD
46:37
switch like you did, are there some lessons learned
46:40
that you could give that they could follow so
46:42
they don't set their org on fire? I mean, because
46:44
this is a complex... lift and shift, right? Going
46:48
from Jenkins to GitOps and introducing Argo and
46:52
all the complexities around, you know, the Mac
46:54
mini pipelines, like be interested. There's some
46:57
like core lessons learned that you could impart.
47:00
So the first thing is, I think you need to, even
47:03
you have a big pipeline, I bet that you have
47:06
a small part. So a niche. So try to find. a small
47:12
part that you can replace and do it in isolation.
47:15
In my case, it was basically the pull request
47:17
validation. It's detached in some way of the
47:21
big flow. Try on that. If you cannot do that,
47:25
try to find segmentations that you have in your
47:28
organization in terms of teams. That's a good
47:30
way to approach so that you can move parts of
47:38
your team. Basically, if you have 10 teams or
47:41
five teams, pick one team and try to go in that
47:47
way. So this is a way that you can try to reduce
47:50
the buster radius. And of course, I think that
47:53
was a thing that was very important for us, basically
47:55
doing dogfood. I think it's very unfair for someone
47:59
that is developing tools for developers not using
48:02
that in their work. So I think this is really
48:05
the most important thing. It's doing dogfooding.
48:07
And then if you want, try to find in each place
48:10
if you don't have anything think do you have
48:13
internal tooling that doesn't uh provide for
48:16
your final customers that can be bad but for
48:18
trying to apply this to your internal tooling
48:22
Yeah, that makes sense. Cool; where can people find your posts, and where can they reach out to you?

48:30
Okay, so my posts are on Medium; you can probably find them, and have the links in the description. Both posts were based on the talk that I did, so that talk is also on YouTube. And sometimes I publish on LinkedIn, so you can go there and try to reach me on LinkedIn.

48:48
Awesome. I'll leave the links for your Medium posts and your LinkedIn and anything else in the show notes. Stefan, thanks for coming on. Really appreciate it.

48:57
Okay. Thank you.
49:00
All right. That's my conversation with Stefan Moser. My biggest takeaway from this one is that good CI/CD is not just about picking a newer tool. It is about building a delivery system that is predictable, observable, isolated, and usable enough that engineers can trust it without needing constant help from platform teams. That is really the thread running through this whole episode. They did not just swap Jenkins for GitHub Actions. They reduced noisy neighbor problems. They standardized runners. They leaned into reusable workflows. They moved deployment towards GitOps. They built their own visibility layer when the platform was not giving them enough, and they rolled it out in a way that let teams build confidence instead of forcing a giant overnight cutover.

49:45
I also liked that he was honest about what happens when the new system works. Once deploys get easier, people use them more. Bots start shipping changes, automation starts piling up, and then you discover the next bottleneck, whether that is queuing, fairness, or protecting enough room for humans to still get work out. That is the real platform lesson. Success creates new load. And the better your self-service story gets, the more you have to think about throughput, guardrails, and the system's behavior under trust.

50:16
The other part I liked was his migration advice at the end. Start with a niche. Reduce blast radius. Dogfood your own system first. And if you are building tools for developers, use them yourself before asking everyone else to bet on them. That is probably the cleanest takeaway from the whole episode. If you enjoyed this episode, follow Ship It Weekly wherever you listen to podcasts. If you want the show notes, links to Stefan, his write-ups, and the resources we talked about, head over to shipitweekly.fm. Thanks for listening, and I'll see you later this week. Thank you.