0:00
A lot of teams say they want better CI/CD. What they usually mean is that they want fewer weird failures, less shared-state nonsense, less tribal knowledge, and way less time wasted fighting the delivery system itself. Because the problem is not just that Jenkins is old. The problem is when your pipelines are noisy, fragile, hard to reuse, hard to observe, and tightly coupled to a bunch of infrastructure decisions developers should not have to think about. And when you finally fix the problem, you get a different problem: success. Because once delivery gets fast and easy, people start using it for everything. Bots, maintenance changes, dependency bumps, mass rollouts. And now the real question becomes: how do you keep the system smooth when everybody finally trusts it? Like and subscribe!
1:08
Hey, I'm Brian. I work in DevOps and SRE, and I run Tellers Tech. Ship It Weekly is where I filter the noise and focus on what actually matters when you are the one running infrastructure and owning reliability. Most weeks, it's a quick news recap. In between those, I do interview episodes with people who have actually built things, migrated real systems, and learned what works the hard way.

1:28
Today is one of those conversations. I'm joined by Stefan Moser from Pipedrive. He helped lead a big move from Jenkins to GitHub Actions, built a self-hosted runner platform on Kubernetes, moved delivery towards GitOps with Argo CD, and helped roll that model out across a large internal estate with hundreds of services. And what I like about this one is that it's not just tool talk. We get into why Jenkins had become painful, from Groovy friction to noisy neighbor problems on shared VMs. Why GitHub Actions ended up fitting better. How reusable workflows and custom actions helped. Why they chose Argo CD over other deployment options. And how they had to build better internal observability, because GitHub alone was not enough at their scale.

2:14
We also talk about the migration strategy, which honestly is one of the best parts: dogfooding first, migrating in batches, using internal teams as the first proving ground, letting the process get polished before pushing it wider, and building something self-service enough that teams eventually started migrating on their own. And then there is the mobile story, which is its own thing: Mac minis, messy runner drift, different toolchains, and the surprisingly practical path they landed on after testing a few different options for stabilizing mobile CI.

2:48
If you care about CI/CD architecture, platform engineering, GitHub Actions at scale, or how to do a migration like this without setting your org on fire, this one is worth your time. All right, let's jump in.
3:08
Today, I'm joined by Stefan Moser from Pipedrive. He helped lead a big move from Jenkins to GitHub Actions and built a self-hosted runner platform on Kubernetes, plus a GitOps CD flow with Argo CD. And we're going to talk about what worked, what broke, and what's worth copying. Stefan, thanks for joining me.
3:26
Thank you. Thank you for the opportunity to talk about this adventure. It's not the first time I'm talking about this: I already did a meetup session that was recorded on YouTube, plus two blog posts, so I'm basically sharing these adventures again. But now with a little extra spark, because after one year some things have already changed, so I have more stuff to add to those blog posts and that meetup.
4:00
Awesome. Well, I'm excited to learn more. So starting off, can you give me the thesis: why did Jenkins stop working for you, and what were you trying to optimize for with this new system?
4:11
So basically, the first problem we had with Jenkins is that Jenkins decided the pipelines had to be written in Groovy. In Pipedrive, we mainly work with TypeScript and Go. Groovy was really not needed for us, so it was a barrier for the DevOps teams we had at that time, plus for the engineers trying to write something. That was the first issue.
4:34
The second issue was basically the setup. We had VMs, and those VMs were not isolated. That means if a big pipeline landed on one VM, next to other pipelines like Docker container builds, we had the classic noisy neighbor issue: resource starvation. It was really not predictable.
5:01
We tried to improve that, and our first crazy idea was to build an internal CI engine. That was really the craziest thing we did: basically copy the YAML idea from GitLab CI, build the engine, and start working on that. But then we had another issue: people needed to learn and use a new syntax.
5:24
GitHub Actions, people started using it, some
5:27
developers started using it in their app stores.
5:29
And then we were thinking, why not experimenting?
5:32
And the first idea was basically, besides Jenkins,
5:35
we had another product, CodeChip, that was to
5:38
do the pull request validation. So every time
5:40
the pull request is open, we had a typical lint
5:44
and test work. So basically linting to building
5:47
the... Code, leading the code, and doing the
5:52
unit tests. They were running in the code chip.
5:55
And I can tell it was worse than Jenkins because
5:59
every project was individual. So imagine if I
6:02
had to change or to share something, it's very
6:05
difficult. So we had an idea. Well, let's try
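For reference, the lint-and-test validation described here maps onto a very small GitHub Actions workflow. A minimal sketch, assuming a TypeScript service; the npm script names are illustrative:

```yaml
# Minimal pull request validation: lint, build, unit test.
name: pr-validation
on:
  pull_request:

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci        # install dependencies
      - run: npm run lint  # lint the code
      - run: npm run build # build the code
      - run: npm test      # run the unit tests
```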
6:08
So we had an idea: well, let's try GitHub Actions. I teamed up with a colleague, Gregor, and the idea was basically: let's replace Codeship with GitHub Actions and see what happens. That was really the kickoff. We got a green light to explore the idea.
6:31
The first thing was: okay, I need to run this somewhere. GitHub-hosted runners, even with the quota we had, would probably not be enough, so let's find a solution for self-hosting. After some searching, we found a community project: ARC, the Actions Runner Controller. At this moment it already belongs to, and is maintained by, GitHub, but at that time it was purely community-based. One thing that was very important for us, or very convenient, is that it's a Kubernetes controller. We had spent the last years working with Kubernetes; we understand the Kubernetes API, the CRDs, everything speaking the Kubernetes way, so it was easier to work with: define a resource, get runners running, fine. Another awesome thing is that you can grab a pre-built image and add the customization you want. So if a developer needs some extra CLI or different tooling, we just put it inside that custom image and ship it.
7:48
It got even easier: we already knew how to restrict and monitor Kubernetes in production, so we just applied the same ideas in the CI environment. And I did one thing there that is not the usual standard. I didn't set the requests to the smallest values necessary. No, I went brute force and set the requests and the limits to be almost the same, so that every time a developer gets a runner, it always has the same CPU and memory. That avoids the noisy neighbor problem.
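As a sketch of what that looks like with the community-era ARC API: a RunnerDeployment where requests equal limits, which gives each runner pod Kubernetes' Guaranteed QoS class and therefore identical, isolated CPU and memory. The org name, runner label, and image below are hypothetical:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ci-runners
spec:
  replicas: 4
  template:
    spec:
      organization: example-org                 # hypothetical GitHub org
      labels:
        - self-hosted-k8s
      image: ghcr.io/example-org/runner:latest  # pre-built runner image plus custom tooling
      resources:
        requests:
          cpu: "4"      # requests == limits: every runner gets the same
          memory: 8Gi   # guaranteed CPU and memory, so one big build
        limits:         # cannot starve its neighbors
          cpu: "4"
          memory: 8Gi
```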
8:40
Then, since we already knew how to see and control resources in Kubernetes, we applied the same metrics: we brought the observability tools we had in production back into the CI cluster. We also wanted the process of maintaining the clusters to be more straightforward. Instead of building the cluster from scratch, we used EKS to make it easier. And we wanted scalability in terms of nodes, so we decided to go with a new project at that time, Karpenter: basically the magic AWS way to spawn nodes faster than the Cluster Autoscaler.
9:29
So then we got this solution: a controller listening to GitHub webhooks; when a job is queued, it creates a new pod that represents a runner, and if that runner doesn't have space, Karpenter creates a new node. That really brought flexibility: I just need to set the max number of pods, or runners, and the ecosystem scales the cluster up and down when necessary. That was basically the why of the move, and the first steps.
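The node side of that setup can be sketched with Karpenter's current NodePool API (at the time described it would have been the older alpha Provisioner resource); the values here are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ci
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "500"               # cap on total CI compute
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s    # tear idle CI nodes down quickly
```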
10:06
We then did the migration from Codeship to GitHub Actions. Around that time, reusable workflows were also released, which meant we could create one workflow and spread it across all the repositories, reducing the repetition and the manual configuration we had with Jenkins and with Codeship.
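Mechanically, a reusable workflow is just a workflow triggered by workflow_call, which every service repository can then invoke with a single job. A sketch with hypothetical repo and input names:

```yaml
# Central repo: .github/workflows/service-ci.yml
name: service-ci
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: "20"
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci && npm test
---
# Each service repository: one tiny caller instead of a copied pipeline.
name: ci
on: [push]
jobs:
  ci:
    uses: example-org/workflows/.github/workflows/service-ci.yml@main
    with:
      node-version: "20"
```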
10:32
Basically, after we dropped Codeship, that was when we had the opportunity to revamp CI/CD. We brought in a group of engineers, I think four engineers plus a developer experience product manager; we had a product manager dedicated to developer experience. And then we decided to revamp, and we basically set up a kind of competition in terms of tooling. I remember the first contender was GitHub Actions, in terms of CI, but we also brought in Argo Workflows and Tekton, because those were two projects I was curious about using. And in terms of deployment, we were thinking Argo CD plus Flux. We even tried Spinnaker, but it was so messy to spin up that we just dropped it.
11:38
And how did we choose? First was readability, or feasibility: the ease of creating the workflows and the specific customizations. That is really where GitHub Actions shines for us. People complain about actions, but the fact that actions are written in JavaScript, and we are using TypeScript, makes it natural to create custom logic in TypeScript. We can also create composite actions with some bash scripting; we just use the languages we know to build the customization. And it was really easy to package up: we create an action, and it's like a package, a block we can put in different places.
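A composite action is exactly that kind of packaged block. A minimal sketch, with illustrative action and script names:

```yaml
# setup-and-lint/action.yml: a shared block reusable across repositories.
name: setup-and-lint
description: Shared Node.js setup plus lint step
inputs:
  node-version:
    description: Node.js version to install
    default: "20"
runs:
  using: composite
  steps:
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ inputs.node-version }}
    - run: npm ci && npm run lint
      shell: bash   # composite run steps must declare a shell
```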
12:23
The other point was basically how we expose the workflows to the developers. The fewer clicks a developer has to make, the better. And that, of course, makes GitHub Actions a big contender: the workflows live next to the repository, the execution happens next to the repository, and the developer doesn't need to switch platforms to see it. That was really a big plus. And that's really the reason why we went with GitHub Actions.
13:02
At some point after this migration, what we had was basically a monorepo of GitHub Actions; I think we have 50 or 60 GitHub Actions in there. We just build our own actions, and it's even the monorepo for all the code we want to reuse across the organization.
13:22
In terms of deployment, one thing we wanted to make clear is that we wanted to avoid push-based deploys. One of the issues we always had with Jenkins and the push model is that I need a runner that pushes to the cluster. That means I have to give Kubernetes credentials to a runner so it can write things, and that means someone could, in some way, try to capture that runner and do malicious stuff. So let's do the reverse: the cluster stays in isolation, and what we have is something inside the cluster that checks out something that is already trusted and uses it to apply changes in the cluster. That is really why we wanted to do GitOps. And of course, in GitOps we had two tools to choose from: Argo CD or Flux.
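The pull model he describes is what an Argo CD Application expresses: a controller inside the cluster watches a trusted Git repo, so no external runner ever holds cluster credentials. A sketch, with hypothetical repo and service names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/environment-state.git  # trusted manifests repo
    targetRevision: main
    path: services/my-service
  destination:
    server: https://kubernetes.default.svc   # in-cluster: credentials never leave
    namespace: my-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```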
14:20
Yeah. And in that case, one of the reasons was basically that Argo CD had a good UI compared with Flux at that time; this was two or three years ago. That was really the big reason we picked Argo CD over Flux: the UI. It gave us a way to show developers what happened in an easier way.
14:48
Because again, that is that it's easy for developers
14:51
to visualize doesn't mean that i allow them to
14:55
change manifest with rcd yeah i know what i allow
14:58
is to basically scope in inside of the application
15:02
saying okay this is your application see your
15:05
status you can go see the status of your resource
15:08
you can see the logs if you need it you can see
15:12
if the content the pod is it's restarting or
15:14
not really um we added that idea. Yeah, you can
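In Argo CD terms, that kind of scoping can be done with its RBAC ConfigMap: developers get read access to application status and logs, but no sync or edit rights. A sketch, with a hypothetical SSO group name:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    # read-only visibility: status and logs, but no sync/update/delete
    p, role:developer, applications, get, */*, allow
    p, role:developer, logs, get, */*, allow
    g, example-org:engineers, role:developer
```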
15:22
After that, picking up GitHub Actions, I think it took me like two days to create an MVP of the deployment flow, basically because I was reusing everything I had already done in the Codeship migration. We decided to ship that MVP and then see what the gaps were.
15:53
One thing I forgot to mention: in the first step, the first iteration, when we migrated away from Codeship, we had a problem with a lack of observability around Actions. GitHub at that time didn't provide any statistics about actions, so if you wanted to know how long a run took, you had no idea; you could only look at the repository level, and that was not good enough. Especially in our organization: at that time we had around 700 services, a microservices architecture plus libraries, so it was impossible to track at the repository level. What we did was create a service that listens to these events and puts everything into a database, so we could build a source we can query to find the performance of the workflows. Because at some point a manager, a director, or even the CTO comes to us and asks: what is the failure rate of these workflows? Which deployment flows are taking the most time? Which steps do we need to optimize? It was really our gut feeling that we would need that information, so we store it. And we still use that service today, now even to track billing information. It's really the source of truth.
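GitHub's workflow_job webhook carries enough to build exactly this kind of database. A sketch of the sort of record such a listener might persist; the field names follow GitHub's workflow_job payload, and the values are invented:

```yaml
# One stored record per workflow job, derived from a workflow_job webhook:
event: workflow_job
record:
  repository: example-org/my-service    # repository.full_name
  workflow_name: service-ci             # workflow_job.workflow_name
  job_name: test                        # workflow_job.name
  run_id: 123456789                     # workflow_job.run_id
  labels: [self-hosted-k8s]             # workflow_job.labels
  created_at: "2024-05-01T12:00:00Z"    # queued
  started_at: "2024-05-01T12:00:20Z"    # picked up by a runner (queue time)
  completed_at: "2024-05-01T12:04:05Z"  # duration and failure rate come from here
  conclusion: success                   # success | failure | cancelled | ...
```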
17:29
Okay, that was phase one. Phase two and phase three were basically about making the product ready for consumption. One thing we discovered we lacked was, again, organization-wide visibility. GitHub Actions is great for seeing your own workflows; for seeing the whole organization, it's bad, because you don't have anything. So we created a service: basically a registration service plus a UI in our back office. We have a UI in our back office that shows all the deployments, but it needs a source, so we created a service that consumes events that we produce in our workflows, and after consuming those events it maps them to what happened in which app. We basically have our own interpretation of the events, because GitHub has a different interpretation of what a deployment is, what a deploy is, what a region is. So we added that internal understanding of what is a build, a test, a unit test, a functional test, an end-to-end test, a deploy to one region, a deployment at full scale. That understanding lives inside the service, and this was the service that consumed everything to create observability of what we have in terms of deployments to production. After having that UI, we were more confident to go.
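The registrar is internal to Pipedrive, so its API is not public; but "events we produce in our workflows" reduces to a step that posts a small JSON document. A purely illustrative sketch: the URL, secret name, and payload shape are all invented:

```yaml
# Hypothetical workflow step reporting a deployment event to an
# internal registrar service (endpoint and payload are invented).
- name: Report deployment event
  env:
    REGISTRAR_URL: https://registrar.internal.example   # hypothetical
  run: |
    curl -sS -X POST "$REGISTRAR_URL/events" \
      -H "Authorization: Bearer ${{ secrets.REGISTRAR_TOKEN }}" \
      -H "Content-Type: application/json" \
      -d '{
            "service":   "${{ github.repository }}",
            "phase":     "deploy-region",
            "region":    "eu-1",
            "run_id":    "${{ github.run_id }}",
            "initiator": "${{ github.actor }}"
          }'
```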
19:18
of release? 700 service. I think we have five
19:22
teams or 10 teams at that time. It was really
19:28
a big migration. How you start migrations? I
19:33
think like any product should migrate or should
19:39
be released, you first try it yourself. So basically...
19:42
And we decided to migrate our own service. My
19:46
team doesn't only produce CI, CD. We have a couple
19:49
of service that supports the deployment flows
19:53
and support the developer experience. So for
19:56
example, we had one service that creates and
19:59
do all the lifecycle of remote dev environment.
20:03
So basically we have remote dev environments
20:05
that developers can use that it's a replica of
20:08
production. And then we have a system that creates
20:09
that. For example, this is one of these servers
20:12
that we need to migrate. And that is really basically
20:15
like this. So I have scripts, I have workflow,
20:19
I have documentation to do it. So I deliver to
20:22
my teammates so that my teammates can migrate.
20:25
These bring two good things. First, we test our
20:29
process to someone that doesn't know about the
20:32
process. And second, we allow... to share knowledge.
20:36
So basically, knowledge that was restricted to
20:39
that five people that create systems starting
20:41
to be spread to the team because I'm sharing
20:43
that knowledge by forcing the people to do the
20:46
migration. And basically, it was basically this
20:49
alpha grouping. Then we had a different team.
20:53
So basically, it was a platform team that it's
20:56
more experienced than the normal developers around
20:59
stuff with Kubernetes. and the deployment process
21:02
and they have more out of the box or out of the
21:06
standard service, then we go to them and say,
21:09
okay, we have this Polish workflow. Let's now
21:13
teach you, do a session and then allow them to
21:18
migrate. And then it was basically like the open
21:22
beta of the system, of the migration. Again,
21:26
the idea was to... make everything more polished,
21:30
more clear for the users. After they give us
21:34
our feedback and then we improve the workflows,
21:36
define, we start doing to think how we do all
21:39
the role. In this case, we pick up the prioritization
21:46
that SRE team define for the service. So basically
21:51
SRE define tiers. So tier one is a service that's
21:56
very critical. Tier two, it's more or less. And
21:59
then tier three is less critical. So we decided,
22:01
okay, let's split this in batch. Let's go team
22:05
by team, starting in tier three. Then they can
22:09
move to tier two and then tier one. And that
22:12
was the idea. And then the plan was basically
22:16
split it in batch and assign a DevOps engineer
22:20
to each batch, not to do the... the migration
22:26
but to be the assistant so basically the person
22:30
who does the introduction of the new way ui the
22:34
new workflow and then this is the the the interaction
22:39
goal that we have to to do your mind and also
22:43
to make them so basically if i want everyone
22:47
one person that's responsible to achieve a goal
22:50
of migrate x number of of service until the end
22:55
of the month it will keep the momentum in the
22:58
team and then we started doing that so team by
23:01
And then we started doing it. Team by team we started moving; we had a schedule, we had a limited number of engineers, so basically we had a queue. At some point we had people from the middle of the queue coming to us and saying: I already saw the other teams doing the migration, and I think I can do it alone. Okay: just check the recording we made, and try it yourself. And they decided to try alone and started doing the migration on their own. The system was already so well oiled and everything was moving so smoothly that they were able to migrate alone. And that's really, I think, the real success story of doing these migrations: by batch, doing your dogfooding, and making everything automatic, with less human intervention, so that people can just use it. You can also see this as the core idea of platform engineering: making everything self-service. When you have a product that's usable by a normal product engineer who doesn't need the context of CI/CD or how to deploy to Kubernetes, it's a service they can use to do their work autonomously. And yeah, after five months I think we had migrated everything.
24:23
Yes, we found what we get as issues, basically
24:29
our success starting break things. So the process
24:34
was so smoothly. The deployment flow was so good.
24:38
that then we're starting an introduction of bots
24:40
to look at deployment for the maintenance tasks.
24:43
So basically, the SRA team developed a service
24:46
to create pull requests to adjust the resource
24:48
usage in production and then automatically deploy
24:51
that to production. So basically, it creates
24:53
a pull request, we introduce a set of resources,
24:55
and this pull request is already pre -approved,
24:58
so it moves by all the normal pipeline flow and
25:04
goes to production. Then we have the PandaBot.
25:07
creating pull requests for the pendants and in
25:10
some case the developers were already confident
25:12
enough in the unit test they have in the functional
25:15
test they have allowed that obligations or updates
25:20
of of dependence to go without any approval from
25:24
a human so we had then a situation that we have
25:27
multiple um departments happen and then was struggling
25:31
or doing impact in our system. Even we can grow,
25:36
grow, grow. Then we add some bottlenecks. At
25:40
that point, we decide to improve that service
25:44
that was the deployment registrar to have a queue.
25:48
So the idea is that you then have a way that
25:52
you, when you start the deployment, you send
25:54
event, I have a deployment. And then what happens
25:57
is that the deployment registrar register that
26:00
deployment and put in the queue. evaluate the
26:03
queue size so basically we define that we have
26:06
the number of 50 deployments per available in
26:09
parallel and also we do a tricky thing that is
26:14
we only allow 10 of that queue to be used by
26:18
bots so basically we always want to have some
26:22
kind of free space for the moments to develop
26:25
that develop features to ship and basically that's
26:28
a and so we have a queue with all the idea to
26:31
put a margin for humans. And then if everything
26:35
is okay, then we rerun the workflow that allows
26:40
to execute everything. Yeah, it was basically
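Since the registrar is an internal service, its configuration is not public; the policy he describes boils down to two numbers. A purely invented sketch of the shape:

```yaml
# Invented config sketch of the queue policy described above.
queue:
  maxParallelDeployments: 50   # global concurrency ceiling
  botSlots: 10                 # at most 10 of the 50 slots for bot-initiated deploys;
                               # the remaining 40 stay as headroom for humans
```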
26:46
that thing that we had to improve. Also, one
26:49
of the bottlenecks that we had was we had issues
26:52
with how we commit to an environment state. So
26:57
basically, it was one of our... really bottlenecks
27:01
is when you have multiple deployments, you need
27:05
to be, only can commit once, one at a time, the
27:10
deployment. So basically, we had a strategy that
27:13
we create one commit per region. So we have a
27:17
lot, a pile of commits to push to that, this
27:23
environment site repository. And that was really
27:26
a bottleneck. Fortunately, we were not very clever
27:29
at that time to create the queue for that specific
27:32
step. And then we had to re -engineering all
27:35
the process because we had the issue that we
27:38
didn't do like a FIFO. So basically, we did implement
27:42
a hot hobby. And that means that some guys was
27:45
very unlucky that it was the first one to arrive
27:48
and was not the first one to get their stuff
27:52
deployed. Yeah. This is really the story that
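For what it's worth, plain GitHub Actions can serialize a step like this with a shared concurrency group, though with the same fairness caveat: GitHub keeps only the newest pending run in a group rather than a strict FIFO queue, which echoes exactly the problem described here. A sketch, with an illustrative script name:

```yaml
jobs:
  commit-environment-state:
    runs-on: ubuntu-latest
    concurrency:
      group: environment-state-commits   # one writer at a time across all runs
      cancel-in-progress: false          # let the running commit finish
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/commit-region-state.sh   # hypothetical commit/push script
```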
27:55
Yeah. This is really the story we have. What was missing at that time, in the presentation and the blog posts, was the migration of the mobile team. I don't know if you want to go to that story already, it's a new story that I have, or if you have any other questions.
28:16
in a second. So going back over your CICD process,
28:20
I'm curious. First off, are you worried about
28:23
The GitHub change to private self -hosted runners,
28:27
they're changing the pricing model. So I guess
28:29
they're going to charge for self -hosted as well
28:31
now. Is that going to change your implementation
28:33
at all? That's the problem. I can think that
28:36
if they're starting to charge that, then I need
28:40
to start to thinking, if they start charging
28:42
like that, then I need to really be very picky
28:45
in their SLAs. so i they how they can charge
28:51
me a fee for the control plane when their control
28:54
plane it's not really 99 or doesn't yeah meet
29:01
the slas it certainly hasn't been lately that's
29:04
for sure that that is really the thing even even
29:07
we don't even we get that uh we more or less
29:10
exclude that from our matrix of of um I already
29:14
starting to think already seeing some page that
29:17
shows the SLA in the last couple of months. And
29:22
then I'm starting to thinking if you charge me
29:24
for this, I need to have a better SLA. I cannot
29:27
stop working. And then for the size of what we
29:32
have, it's basically or I need to find a different
29:37
CI tooling and going to that path or even can
29:41
go more crazy. Don't forget that, okay, we are
29:46
a special case, or probably not a normal case.
29:49
That is, we are already in enterprise level.
29:53
And we have GitHub hosted enterprise server.
29:58
So basically, you can bring yourself. And if
30:02
this is starting to be expensive, probably that's
30:05
the thing I need to do. I bring down and then
30:08
try to be less dependent of... the ability of
30:12
the cloud version and then be my concern. And
30:15
then I can then point to myself. Yeah, your GitHub
30:18
is down because of myself and not because of
30:20
some change in the cloud version. But yeah, it
30:25
will be really a concern. And if they force us,
30:28
it's basically now I will go to my legal team
30:31
and then check the SLA and let's see if they
30:33
don't like the SLA. They're starting to get some
30:37
notice from our legal team to get a recharge
30:40
back or something. So also you had mentioned,
30:42
that's a fair statement for sure. You had mentioned
30:45
looking at Flux originally and then going with
30:49
Argo. Are you using anything like Crossplane
30:52
with Argo yet or no? No. So basically the fact
30:57
that we don't use Crossplane is basically because
30:59
it's not in our domain of work. Okay. From what
31:04
I understand, really, the idea is that cross
31:06
-plane is a good way to provision a resource
31:11
using Kubernetes as a native language. Even I
31:16
try to understand from Victor, Victor Werczek,
31:19
that's working with cross -plane, the difference
31:22
between Terraform and cross -plane, and that's
31:25
really the reason that... We didn't go deep in
31:28
crossfire because the main player of Terraform
31:31
and provision infrastructure is the infrastructure
31:34
team, not the engineering excellence department.
31:36
Basically, we are the middle layer. Imagine it's
31:40
like a lasagna or a burger. Basically, the team
31:43
is the lowest band and I am the lettuce. And
31:49
then we have even up on me, we have the burger
31:52
and then we have the tomato and they have the
31:54
other. So basically we have. these layers and
31:56
i'm in that layer that i consume service from
32:00
infrastructure team and then i deliver service
32:02
to to to the to the platform team and to the
32:07
to the developers so that's the reason that we
32:09
don't look for crossplane that's thing that i
32:12
would like to to experiment but then i need to
32:14
really have a good case of why did a terraform
32:18
to use crossplan yeah it's more i guess if you
32:21
want the infrastructure definitions closer to
32:25
the actual service right so if you want that
32:28
all defined together and there's there's pros
32:30
and cons for both ways i would say honestly even
32:33
in our organization we kind of or i've worked
32:35
in organizations where we've done both yeah even
32:38
even listen uh what victor said about the idea
32:41
of cross plan and how to use cross plan and that
32:43
year for example is one of the examples that
32:45
i can tell or i can get a model in terraform
32:49
that's getting a postgres to be from Mother bless.
32:54
But probably if I want to talk with the developer,
32:57
developer wants to have the minimal settings
32:59
to change. So basically, I think the WinCross
33:02
plan can abstract that with their internal resources
33:05
and say, okay, define this YAML manifest or that
33:08
YAML resource or, sorry, that CRD in YAML and
33:13
then the controls and everything will set up
33:16
everything for you. So at this moment, we don't
33:19
add this kind of... need in terms of organization
33:21
to have exposed so much the infrastructure to
33:25
developers. And that's real. Without need, I
33:28
don't have a way to force a tech to be used.
33:33
Oh, that's fair. It's always a balance.

Yes, it's cool, but it really has to make sense for the case.

33:40
So tell me about this, the new mobile deployments. How is that going, and how did you set that up?
33:51
The mobile team was using Jenkins too, but with an even stranger setup. They had a farm of Mac minis, connected to the Jenkins controller, and they were using that. So they were using Groovy in their Jenkins pipelines, plus fastlane for the iOS team; the Android team used different tooling. And that was really massive, because part of the team had to play operations: understanding and upgrading the nodes, fixing issues. And again, the same isolation problem: we had cases where the nodes were not identical, and a job that landed on one machine would pass while the same job landing on another machine would fail, or leftover artifacts would screw things up, with Node versions, Ruby versions. It was really a mess, and a loss of productivity.
34:54
was really the idea We need to move to GitHub
34:57
actions, but plus also find a way to make their
35:01
compute power or compute resource be very stable,
35:05
having the same identity that we had in the CI
35:11
-CD for the microservice. So that was a multiple
35:14
deployment and isolate. And at that point, it
35:18
was really a surprise for me, the end solution,
35:21
but basically it was this case. So I went to
35:24
research. that basically I put 4Ks on top of
35:28
the table. First, using VMs inside of Mac. So
35:32
basically a product from CircleCI. So basically
35:40
a company that's already providing GitHub Action
35:43
Runners for Macs and have the system to work.
35:48
And basically it's on top of the one tooling
35:50
at the start. And basically it's a... a nice
35:54
tooling to spawn VMs inside of Mac. Then the
35:58
idea, okay, let's try to use Nix for some independence
36:02
in isolation. And the other idea was basically
36:05
also to use AWS. So why not spawn Macs in AWS
36:11
and use it as a GitHub address? And at last I
36:15
was thinking, I need to at least think in the
36:18
way of outsourcing this to a company. This guy
36:21
is GitHub Actions, but... I could also think
36:23
to other GitHub action providers or first to
36:26
GitHub itself and then to other providers because
36:31
you can see that Blacksmith and I think Depot
36:35
are providers of GitHub action trends that you
36:40
can offload and don't depend of GitHub to have
36:44
the best performance in your machines. Basically,
36:47
it was really a poor scientific research. I have
36:52
three, four hypotheses. I have one month to test
36:55
it. And I set like one week for each hypothesis
37:00
and then try to go. And this is really the first.
37:03
It was in summer of last year. And also that
37:07
culminate in the appearance of the... AI native
37:13
mindset. So in this experimental research, after
37:17
I collided with teachers, with tooling, I found
37:20
that I had to use start to create VMs. So first
37:24
hypothesis. And then I would decide, well, if
37:27
I try to create a controller, like I have the
37:29
idea or already the use case that to have a control
37:32
for Kubernetes, to run runners. So why I don't
37:36
have an action runner control for that? And that
37:39
was really the idea. starting to think i did
37:42
a poc with shell script because it was a very
37:45
easy command but then i decided okay i have a
37:49
nice shell script let me do a really nice mvp
37:52
and then i did my first specification development
37:55
project so basically i used this idea i then
37:59
signed to define my specifications in a markdown
38:03
file what I want, what were the toolings, what
38:06
were the constraints. And then use, in this case,
38:09
was already still using GitHub Copilot and say,
38:12
I have this idea in this file, let's make a plan.
38:15
And starting elaborating the plan, creating the
38:18
plan. And then the process was really that way
38:20
that after I have the plan, we need to have to
38:22
-do lists or to -dos for each point. And then
38:25
basically I force the Copilot to use, go for
38:30
each to -do or each step. do the implementation,
38:33
I review it. Okay, it's fine. Let's move to the
38:36
next one. Fine. And then also ask to do a summary
38:40
of each implementation. So basically to have
38:42
a history of what I did. Just important for me,
38:45
but also important to share with the team all
38:47
this process. And then after one day and a half,
38:53
I got a control. So basically I had a Mac mini
38:57
in my desk. I put the control there and it was
39:00
spawning. and doing the lifecycle of the VM.
39:05
So basically, a workflow, it's my runner, execute
39:10
a niche, drop, tear down the VM, start a new
39:13
one, register against GitHub, like normal flow.
39:16
And I will say, yeah, nice. I just need to make
39:19
some improvements. Then I moved to the Nix. AI,
39:22
in this case, also it was Copilot, helped me
39:26
a lot. how to build the recipes with Nix, but
39:30
it fails tremendously just because of the way
39:33
Xcode works. It was very annoying to work. And
39:37
then I had the issue of how to distribute Xcode.
39:40
At that time, I was running out of time. AWS
39:42
was not really an option to investigate. And
39:45
then I started doing some calculations about
39:48
cost if I use GitHub as a provider of headers.
39:52
And surprise, surprise. It was cheap for our
39:56
use case. So it was basically a matter of after
39:59
spending three or four months collecting metrics
40:02
in Jenkins, I say, yeah, we can use it. And basically
40:05
it was really the idea. I said, okay, this is
40:08
the amount of money and comparing the working
40:12
hours that it's necessary for an engineer to
40:14
fix this, it's a good balance. And then after
40:16
I convinced my director that it really makes
40:19
sense in terms of financial terms. you approve
40:22
and then we move and then we move to migration
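For context, moving to GitHub-hosted macOS runners is mostly a matter of the runs-on label. A sketch of an iOS job (the runner label is a real GitHub-hosted one; the fastlane lane name is illustrative, and macOS runner minutes are billed at a premium):

```yaml
jobs:
  build-ios:
    runs-on: macos-14    # GitHub-hosted Apple silicon macOS runner
    steps:
      - uses: actions/checkout@v4
      - run: bundle install               # fastlane ships as a Ruby gem
      - run: bundle exec fastlane build   # hypothetical lane name
```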
40:25
and that this migration was i did with the junior
40:27
the first thing that we did when this migration
40:29
was really sit down with the developers and i
40:34
asked anoint us to to to my junior in this case
40:38
it was to internities sorry but it was really
40:41
important to that yeah i asked him go for each
40:44
um jenkins pipeline they have and start doing
40:50
a flowchart so basically we had a flowchart for
40:53
each Jenkins pipeline with steps and the steps
40:58
in the way that what is supposed to do and what
41:01
are the commands that are executed and then i
41:03
sit down with the with each uh team from android
41:08
and and ios and then let's go forward for each
41:12
step and then try to understand as this makes
41:14
sense this flow i don't care about what is the
41:17
command that is executed. Does this step make
41:20
sense? Does this test make sense? Does this fork
41:23
in the logic make sense? And then we also understand
41:27
some good things that is some workflows or some
41:31
checking job that we have was already redundant.
41:33
We could refactor the input and combine in one
41:38
single pipeline. That was really the idea. And
41:42
then that was the second time we used the AI
41:44
to speed up. And this time I already added the
41:47
session of agenting coding with my engineering
41:51
department. So basically the engineering excellence
41:53
create agenting sessions to teachers all to use
41:57
AI tools in a more agentic way. So not to auto
42:01
-complete features, but to give context, to give
42:04
a goal. have this guy, this AI, as really a partner
42:09
to execute. And then at this time, already we
42:12
was using cloud code. And then, okay, let's bring
42:15
these flowcharts, convert to something that is
42:17
more digestible by AI. So I was able to export
42:21
as a CSV file. And then, okay, these are CSV
42:25
files, contains flowcharts of our workflows.
42:28
Let's build GitHub Action workflows. and documentation.
42:32
And he's starting doing the old workflow. Was
42:35
not really exactly what we want, but was close
42:38
enough. Imagine that this was really a good best
42:42
draft, a good first draft. This was really, now
42:45
just adjust this step, this step, this step,
42:47
and then we just starting building on that. And
42:50
then, of course, this really make the work very
42:53
easy. So we add like 17 workflows to migrate.
42:58
What is the difference from this migration to
43:01
the other migration? The other migration, we
43:02
add like one or two workflows for all the process.
43:08
So basically, the issue was to replicate that
43:11
to use workflows to 700 service. In this case,
43:14
we have only two service, two repositories, the
43:17
Android and iOS, but we have multiple workflows
43:19
to migrate. And then we had to rewrite a lot
43:22
of stuff. And that was the way that we use AI
43:26
to basically at the pace of each day we migrate
43:29
a workflow and then cloud code was able to digest
43:34
some part of the code base. So I had two issues.
43:37
First, parts of the customization that they have
43:40
in the iOS team was using FastLine that is written
43:44
in Ruby. I don't know Ruby. So I used it to understand
43:49
what was that. what was the logic behind that
43:55
ruby scripts and had extra features then i'm
44:00
not very well versatile in the ios test and compiler
44:06
so i never i'm not irs developer i don't know
44:10
all the quirks about the the process to compile
44:13
language compile ios app And I use protocol for
44:17
that. So basically, it was already in a way that
44:20
I had an issue in my pipeline. And I tell them,
44:23
OK, I have problems like this. I have an issue
44:27
in my pipeline. This is the idea of the workflow.
44:31
This is the idea of job. This is a step that's
44:33
failing. Fetch. using github cli in that time
44:36
even we using mcp probably it's best if you have
44:40
a cli to tell the ai model to use that cli to
44:43
fetch the information instead of uh bloat the
44:49
contacts with mcps use the cli fetch the logs
44:53
in that section and that let's go investigate
44:55
what it's what it's failing and it was really
44:58
able at some point i get get surprised because
45:01
when i go in deep mode of troubleshooting the
45:04
model in this case even the the agent that is
45:07
called cloud cloud was able to go to the internet
45:10
and find github issues about the problem pointing
45:14
out i have found this issue probably it's about
45:17
this let's double check and then i double check
45:19
yeah probably makes sense let's try this change
45:22
and it was really basically it was a more family
45:26
language or i cannot what i can say it was like
45:31
while we were in the past doing Googling. So
45:33
you put the problem, you try to find the issues.
45:35
Here, I have the problem. Also find the issues
45:38
in the internet and get me back the information
45:41
and then validate with me and then explore. And
45:43
basically, after four weeks, we migrate everything.
45:47
We even add extra features they want. And they
45:51
were very happy. And still, they are very happy.
45:55
Of course, the initial costs failed because it
45:58
was more than I expected. But for one reason,
46:01
developers were delivering more. So it was really
46:04
a situation that the workflows and run are so
46:09
stable, they can focus more in future and then
46:12
increase the cost. But it's because they are
46:14
shipping more features than they did in the past.
46:19
That's a good cost problem to have, you know?
46:21
Yeah. Yeah. I think that's the thing I have.
46:26
Do you have any questions about that topic? I
46:29
did want to ask, wrapping up, if someone that's
46:34
listening, they wanted to pull off a big CICD
46:37
switch like you did, are there some lessons learned
46:40
that you could give that they could follow so
46:42
they don't set their org on fire? I mean, because
46:44
this is a complex... lift and shift, right? Going
46:48
from Jenkins to GitOps and introducing Argo and
46:52
all the complexities around, you know, the Mac
46:54
mini pipelines, like be interested. There's some
46:57
like core lessons learned that you could impart.
47:00
So the first thing is, I think you need to, even
47:03
you have a big pipeline, I bet that you have
47:06
a small part. So a niche. So try to find. a small
47:12
part that you can replace and do it in isolation.
47:15
In my case, it was basically the pull request
47:17
validation. It's detached in some way of the
47:21
big flow. Try on that. If you cannot do that,
47:25
try to find segmentations that you have in your
47:28
organization in terms of teams. That's a good
47:30
way to approach so that you can move parts of
47:38
your team. Basically, if you have 10 teams or
47:41
five teams, pick one team and try to go in that
47:47
way. So this is a way that you can try to reduce
47:50
the buster radius. And of course, I think that
47:53
was a thing that was very important for us, basically
47:55
doing dogfood. I think it's very unfair for someone
47:59
that is developing tools for developers not using
48:02
that in their work. So I think this is really
48:05
the most important thing. It's doing dogfooding.
48:07
And then if you want, try to find in each place
48:10
if you don't have anything think do you have
48:13
internal tooling that doesn't uh provide for
48:16
your final customers that can be bad but for
48:18
trying to apply this to your internal tooling
48:22
Yeah, that makes sense. Cool; where can people find your posts, and where can they reach out to you?

48:30
Okay, so my posts are on Medium; you can probably find them, and have the links in the description. Both posts were based on the talk that I did, so that talk is also on YouTube. And sometimes I publish on LinkedIn, so you can go there and try to reach me on LinkedIn.

48:48
Awesome. I'll leave the links for your Medium posts and your LinkedIn and anything else in the show notes. Stefan, thanks for coming on. Really appreciate it.

48:57
Okay. Thank you.
49:00
All right. That's my conversation with Stefan Moser. My biggest takeaway from this one is that good CI/CD is not just about picking a newer tool. It is about building a delivery system that is predictable, observable, isolated, and usable enough that engineers can trust it without needing constant help from platform teams. That is really the thread running through this whole episode. They did not just swap Jenkins for GitHub Actions. They reduced noisy neighbor problems. They standardized runners. They leaned into reusable workflows. They moved deployment towards GitOps. They built their own visibility layer when the platform was not giving them enough, and they rolled it out in a way that let teams build confidence instead of forcing a giant overnight cutover.

49:45
I also liked that he was honest about what happens when the new system works. Once deploys get easier, people use them more. Bots start shipping changes, automation starts piling up, and then you discover the next bottleneck, whether that is queuing, fairness, or protecting enough room for humans to still get work out. That is the real platform lesson. Success creates new load. And the better your self-service story gets, the more you have to think about throughput, guardrails, and the system's behavior under trust.

50:16
The other part I liked was his migration advice at the end. Start with a niche. Reduce blast radius. Dogfood your own system first. And if you are building tools for developers, use them yourself before asking everyone else to bet on them. That is probably the cleanest takeaway from the whole episode. If you enjoyed this episode, follow Ship It Weekly wherever you listen to podcasts. If you want the show notes, links to Stefan, his write-ups, and the resources we talked about, head over to shipitweekly.fm. Thanks for listening, and I'll see you later this week. Thank you.