Host Commentary

For this Conversations episode, I wanted to stay anchored on something I think a lot of teams feel when they talk about “modernizing CI/CD,” but do not always say out loud.

A lot of the time, they are not really asking for a newer tool.

They are asking for a delivery system that is less weird.

Less shared-state nonsense.
Less pipeline tribal knowledge.
Less unpredictability.
Less waiting around for infrastructure quirks to decide whether a build passes or fails.

That is what I liked about Stephane Moser’s story. It is easy to reduce this to “Pipedrive moved from Jenkins to GitHub Actions,” but that misses the point. The real issue was that Jenkins had become painful in ways that compound over time: Groovy was not a natural fit for a team working mostly in TypeScript and Go, shared VMs created noisy-neighbor problems, and the whole thing had become harder to reason about and harder to scale cleanly.

What makes this episode useful is that they did not just swap one logo for another.

They changed the operating model.

They used Kubernetes because it was already a language they knew well. They used Actions Runner Controller because it fit that model. They standardized runner size more aggressively than a lot of people would. They used Karpenter to scale nodes faster. And they brought the same observability mindset they already trusted in production back into the CI environment instead of treating CI like some magical side box that did not need real engineering discipline.
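As a rough illustration of what aggressive runner-size standardization buys you, here is a minimal TypeScript sketch. The tier names and limits are made up for illustration, not Pipedrive's actual sizes: the idea is that jobs request resources, but requests snap to a small set of standard tiers, which keeps node bin-packing and Karpenter's scaling decisions predictable.

```typescript
// Hypothetical sketch of "standardize runner size aggressively":
// instead of arbitrary per-team runner shapes, every request snaps to
// one of a few fixed tiers. Fewer shapes means more predictable
// Kubernetes bin-packing and simpler autoscaling behavior.
const TIERS = [
  { name: "small", cpu: 2, memGb: 4 },
  { name: "medium", cpu: 4, memGb: 8 },
  { name: "large", cpu: 8, memGb: 16 },
] as const;

// Return the smallest standard tier that satisfies the request.
function pickTier(cpu: number, memGb: number): string {
  const tier = TIERS.find((t) => cpu <= t.cpu && memGb <= t.memGb);
  if (!tier) throw new Error("request exceeds largest standard runner");
  return tier.name;
}
```

The trade-off is deliberate: some jobs get slightly more machine than they need, in exchange for a fleet that is much easier to reason about and scale.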

That part hit home for me, because a lot of CI conversations still get stuck at the YAML layer.

People argue about pipeline syntax, workflow reuse, or whether GitHub Actions is better than Jenkins or GitLab CI or whatever else. But the deeper issue is whether the system is predictable, isolated, observable, and understandable enough that engineers trust it. That is a much more important bar than whether your pipeline file looks cleaner.

I also liked how pragmatic the migration path was.

They did not begin by trying to move the whole company at once. They started by replacing pull request validation, which was running in CodeShip, because it was a smaller, more isolated slice of the bigger problem. That was the wedge. Then they used that work to build toward the bigger platform shift. That is a good pattern in general. Pick the part of the flow that has the lowest blast radius and the clearest upside, and prove it there first.

That same pragmatism shows up again in how they chose tools.

They did not just assume the shiny thing wins. They compared GitHub Actions with Argo Workflows and Tekton on the CI side, and Argo CD with Flux on the deployment side. They even took a shot at Spinnaker and basically decided it was too messy to justify. GitHub Actions won partly because it was easier to customize in languages they already used, and partly because the workflows and logs lived right next to the repo, which meant fewer clicks and less context switching for developers. Argo CD won because of the UI and the ability to show developers useful deployment status without giving them unsafe write access into the cluster.

That is another thing I appreciated here.

Stephane keeps coming back to the developer experience angle, but not in a fluffy way. Not “developer joy” as a slogan. More like, if the system is awkward to use, people will avoid it. If they have to jump between too many tools, they lose context. If they cannot see what is happening, they open tickets or start guessing. So the platform has to be legible. That matters just as much as the underlying architecture.

And then there is the part I really liked.

GitHub itself was not enough.

At their scale, repository-level visibility only went so far. They had hundreds of services, and leadership wanted real answers: what is failing, what is slow, what needs optimization, what deployment health looks like across the org. So they built their own internal observability and deployment registration layer around GitHub Actions events. That is a very real lesson. Sometimes the vendor product gives you enough to get started, but not enough to operate at scale. If you are serious about platform engineering, you eventually wind up building the missing context layer yourself.
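To make the "missing context layer" concrete, here is a hedged TypeScript sketch of the kind of aggregation such a layer might do over GitHub Actions `workflow_run` events. The event shape and field names here are simplified assumptions for illustration, not Pipedrive's actual schema; real webhook payloads carry far more fields.

```typescript
// Hypothetical, simplified shape of a workflow_run-style event.
type WorkflowRunEvent = {
  repo: string;                         // e.g. "org/some-service" (made up)
  conclusion: "success" | "failure";
  durationMs: number;
};

type ServiceStats = {
  runs: number;
  failures: number;
  avgDurationMs: number;
};

// Fold a stream of events into an org-wide view keyed by repository,
// answering "what is failing, what is slow" without opening each repo.
function aggregate(events: WorkflowRunEvent[]): Map<string, ServiceStats> {
  const stats = new Map<string, ServiceStats>();
  for (const e of events) {
    const s = stats.get(e.repo) ?? { runs: 0, failures: 0, avgDurationMs: 0 };
    // Incremental mean keeps the fold single-pass.
    s.avgDurationMs = (s.avgDurationMs * s.runs + e.durationMs) / (s.runs + 1);
    s.runs += 1;
    if (e.conclusion === "failure") s.failures += 1;
    stats.set(e.repo, s);
  }
  return stats;
}
```

The point is not the code itself but the shape of the solution: consume the events the vendor already emits, and build the cross-repo view the vendor does not give you.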

The migration story itself is probably the strongest part of the whole episode.

They dogfooded first, migrating their own services before anyone else's. Then they used more platform-savvy internal teams as an open beta. Then they rolled out in batches, starting with lower-criticality services and moving upward. And eventually the process got polished enough that teams later in the queue started migrating on their own, because they had already watched it happen elsewhere. That is exactly what you want. Not just a migration that technically works, but a migration model that creates confidence and spreads knowledge as it goes.

That ties into something Stephane says near the end that I think is probably the cleanest lesson in the whole conversation.

If you build tools for developers, use them yourself first.

That sounds obvious, but a lot of internal platforms still skip that. They build something for everybody else, but the platform team itself never really lives inside the system the way normal engineers do. Then they wonder why adoption is weird or why the rough edges only show up later. Dogfooding is not just a nice principle. It is one of the fastest ways to find out whether your platform is actually usable.

I also liked that he was honest about what happens when the migration succeeds.

Success creates new load.

Once the system got smooth enough, people trusted it more. Bots started opening PRs for maintenance work. Dependency updates could move automatically. More deployments started happening in parallel. And then they discovered the next problem, which is the platform version of “great, now we have traffic.” They had to think about queueing, fairness, protecting capacity for humans, and fixing the fact that some deployment steps were not actually FIFO. That is such a real platform lesson. Solving one bottleneck does not end the story. It just moves the pressure somewhere else.

The mobile side of the episode was good too, mostly because it shows how messy “just migrate it” can get once you leave the clean happy path.

The mobile team had Mac minis, runner drift, different toolchains, and all the usual weirdness that shows up when physical machines and language-specific build chains get involved. I liked that he approached it almost like a real research project. Test a few hypotheses. Timebox them. See what is actually viable. He tried different directions, including Mac virtualization options, Nix, AWS, and outsourcing the runners, and the answer wound up being more practical than exotic. In their case, GitHub-hosted ended up being cheap enough relative to the engineering time being burned on the old setup. That is a good reminder that the “purest” architecture is not always the best one. Sometimes the right answer is the one that stops wasting expensive human time.

And then there is the AI thread, which I think is interesting here precisely because it was not treated like magic.

Stephane does not present AI as “press button, migration complete.” He uses it more like a force multiplier. Convert flowcharts into first-draft workflows. Help understand Ruby in Fastlane when you do not live in Ruby. Help investigate build failures. Help search for likely causes faster. That feels a lot more believable than the hype version. AI sped parts of the move up, especially in the mobile migration, but it still sat inside a very human process of evaluation, review, correction, and rollout.

So if I had to boil this episode down to one takeaway, it would be this:

A good CI/CD migration is not really about replacing one tool with another.

It is about turning delivery into a product.

That means isolation.
Observability.
Reusable building blocks.
Safer deployment mechanics.
A rollout plan that respects blast radius.
And a user experience good enough that engineers eventually stop needing hand-holding.

That is the part worth copying.

Show Notes

This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps.

In this Ship It: Conversations episode, I talk with Stephane Moser about Pipedrive’s move from Jenkins to GitHub Actions, building self-hosted runners on Kubernetes, shifting deployments toward GitOps with Argo CD, and what it actually takes to roll out a big CI/CD change across a large engineering org.

We talk about why Jenkins had become painful, from Groovy friction to noisy-neighbor problems on shared VMs, why GitHub Actions fit better, how reusable workflows and custom actions helped, why Argo CD beat out Flux for their use case, and how they had to build better observability and internal deployment visibility around GitHub as they scaled.

The bigger theme here is that this was not just a tooling swap. It was a product and platform migration. Isolation, repeatability, self-service, rollout strategy, and observability mattered just as much as the actual CI/CD tools.

Highlights

• Why Jenkins stopped working well for them: Groovy friction, shared VM contention, and poor predictability

• Replacing CodeShip pull request validation first as the low-blast-radius starting point

• Using Actions Runner Controller on Kubernetes with EKS and Karpenter for self-hosted runners

• Why reusable workflows and custom actions helped cut repetition across hundreds of services

• Comparing GitHub Actions against Argo Workflows and Tekton for CI, and Argo CD against Flux for deployment, plus a short Spinnaker attempt

• Moving from push-based deploys toward GitOps for better isolation and safer credentials handling

• Building internal observability because GitHub’s workflow visibility was not enough at their scale

• Dogfooding first, then rolling migration out in batches until teams could self-serve the move

• What broke when the new system actually worked too well: bot-driven deploy volume, queueing, and fairness

• The mobile side of the story: Mac minis, unstable runners, GitHub-hosted runners, and a very different migration path

• How AI sped up parts of the mobile migration and troubleshooting, without making the migration trivial

• Stephane’s advice for big CI/CD shifts: start small, reduce blast radius, and use your own platform first

Stephane’s links

• LinkedIn: https://www.linkedin.com/in/moserss/

• Talk video: https://www.youtube.com/watch?v=VrE1dh-1zEY

• Blog post Part 1: https://medium.com/pipedrive-engineering/so-long-jenkins-hello-github-actions-pipedrives-big-ci-cd-switch-03be29c75f63

• Blog post Part 2: https://medium.com/pipedrive-engineering/all-aboard-the-github-actions-express-pipedrives-big-ci-cd-switch-part-2-fcacf834afd2

• GitHub: https://github.com/moser-ss

Our links

More episodes + show notes + links: https://shipitweekly.fm

On Call Brief: https://oncallbrief.com