Doing CI/CD at Sainsbury's | Pablo Maceda | Software Engineer in London, UK

During my time at Sainsbury's, I've worked on a small team that takes care of several projects. We were a fully CI/CD team, as in Continuous Integration/Continuous Deployment. This means that developers merge their changes into the main branch as often as possible, and every change merged into the main branch is automatically released to the customers without human intervention.

This is not an article about the benefits and trade-offs of doing CI/CD, just a list of things and ideas that helped us succeed. It may not work on every kind of project, but it worked extremely well for us.

Main statements

These are the main ideas that we try to apply everywhere.

Make everything as simple as possible

It may look like an obvious statement, but simplicity is key.

Keep an agile approach, challenge requirements coming from the business and other teams, etc. Reducing complexity makes everything cheaper, easier to maintain, and harder to break.

A common misconception about CI/CD is that it increases the risk of bugs affecting end-users. CI/CD doesn't mean we ship changes without enough testing. We take all the effort and time that would otherwise be spent on releases, bugfixes, etc. and put it into making simpler designs and maintaining a resilient process.

"Best engineers love simplicity and are capable. Second best engineers love complexity and are capable. Bad engineers are not capable." (Someone on HackerNews)

Automate all the things

Engineers should focus on solving business problems and delivering value. If something can be automated it should be automated. This reduces human errors, speeds up the process, reduces context switching, and overall it improves the developer experience.

For example:

No manual deployments.

Automatically open a Request For Changes (RFC) just before deploying to prod. Fail the RFC if the deployment failed.

Use git hooks to check the code is properly formatted/linted on commit. On top of it, you can also run unit tests on related files. We use husky + lint-staged to run these validations only on changed files.
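As a rough illustration of the git-hook item above, here is a minimal sketch of a function-style lint-staged configuration. The globs, the choice of ESLint, Prettier and Jest, and the helper name `quote` are assumptions made for the example, not our actual setup. In a real project an object like this would be exported from the lint-staged config file and triggered from a husky pre-commit hook.

```typescript
// Hypothetical lint-staged configuration in function form.
// lint-staged passes the staged file paths matching each glob to the function,
// which returns the shell commands to run against those files only.
type Task = (stagedFiles: string[]) => string[];

// Quote each path so file names containing spaces survive the shell.
const quote = (files: string[]): string =>
  files.map((file) => `"${file}"`).join(" ");

const lintStagedConfig: Record<string, Task> = {
  // Lint and format the staged source files, then run the unit tests related to them.
  "*.{ts,tsx}": (files) => [
    `eslint --fix ${quote(files)}`,
    `prettier --write ${quote(files)}`,
    `jest --bail --findRelatedTests ${quote(files)}`,
  ],
  // Non-code files only need formatting.
  "*.{json,md}": (files) => [`prettier --write ${quote(files)}`],
};
```

Because every command receives only the staged files, the hook stays fast even in a large repository.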

Automation should be combined with good monitoring.

Everything as code

Absolutely everything, except secrets.

Including:

Configurations

Infrastructure (IaC)

Database schema (migrations)

Documentation and diagrams

This allows creating new environments very easily.

It can be used as documentation, which helps new joiners a lot.

Single source of truth.

💡

I needed a copy of the dev environment to spike NextJS in Lambda@Edge. With only two extra lines of code and running one command I was able to create my own environment that later I destroyed with a single command.
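To show why a throwaway environment can be that cheap, here is a minimal sketch of environments declared as data in the repository. The environment names, domains and the `usesRealCustomerData` field are all hypothetical; the real definitions live in our infrastructure code and are not shown in this article.

```typescript
// Hypothetical environment definitions kept in the repository.
interface EnvironmentConfig {
  domain: string;
  usesRealCustomerData: boolean;
}

const environments: Record<string, EnvironmentConfig> = {
  prod: { domain: "example.com", usesRealCustomerData: true },
  dev: { domain: "dev.example.com", usesRealCustomerData: false },
  // A throwaway copy of dev for a spike: in this sketch, one extra entry.
  spike: { domain: "spike.example.com", usesRealCustomerData: false },
};

// Fail loudly on unknown names so a typo never targets the wrong environment.
function resolveEnvironment(name: string): EnvironmentConfig {
  const config = environments[name];
  if (config === undefined) {
    throw new Error(`Unknown environment: ${name}`);
  }
  return config;
}
```

When the tooling derives everything else from an entry like this, creating or destroying an environment is mostly a matter of adding or removing the entry and running the deploy or destroy command.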

How we do it

Development

Monorepo

The whole project resides in one place.

Massively helps communication and collaboration within the technical team.

Reduces configurations.

Simplifies the pipeline.

Avoids the need to coordinate multiple deployments, for example when deploying a new feature that involves both front-end and back-end. This can be improved even more by doing blue/green deployments.

Yarn workspaces allow sharing code between the front-end and the back-end.
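The code-sharing point can be sketched as follows. In a real monorepo the three parts would live in separate workspaces, with the front-end and back-end importing from a shared package; they are shown in a single block here so the example is self-contained. The `BasketItem` type and all function names are hypothetical.

```typescript
// --- shared workspace (hypothetical package) ---
interface BasketItem {
  sku: string;
  quantity: number;
}

// One validation rule, written once and used on both sides.
function validateBasketItem(item: BasketItem): string[] {
  const errors: string[] = [];
  if (item.sku.trim() === "") {
    errors.push("sku is required");
  }
  if (!Number.isInteger(item.quantity) || item.quantity < 1) {
    errors.push("quantity must be a positive whole number");
  }
  return errors;
}

// --- front-end workspace: validate before submitting the form ---
function canSubmit(item: BasketItem): boolean {
  return validateBasketItem(item).length === 0;
}

// --- back-end workspace: reject invalid requests with the same rule ---
function handleAddToBasket(item: BasketItem): { status: number; errors: string[] } {
  const errors = validateBasketItem(item);
  return { status: errors.length === 0 ? 200 : 400, errors };
}
```

Since both sides call the same function, a change to the rule ships to the front-end and the back-end in the same PR and the same deployment.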

Serverless

Serverless helps a lot to achieve simplicity. In principle, it forces you to split the business logic into small functions that will achieve only one thing. It also frees you from maintaining servers or the complexities of using containers.
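A minimal sketch of what "small functions that achieve only one thing" can look like. The event and result shapes are simplified, hand-written stand-ins rather than the official AWS types, and the greeting logic is a made-up placeholder for real business logic.

```typescript
// Minimal hand-written shapes so the sketch needs no external type packages.
interface HttpEvent {
  queryStringParameters?: Record<string, string | undefined> | null;
}

interface HttpResult {
  statusCode: number;
  body: string;
}

// Pure business logic: easy to unit test without any AWS machinery.
function greet(name: string | undefined): string {
  const trimmed = (name ?? "").trim();
  return trimmed === "" ? "Hello, stranger" : `Hello, ${trimmed}`;
}

// The function entry point only translates between HTTP and the pure function.
async function handler(event: HttpEvent): Promise<HttpResult> {
  const name = event.queryStringParameters?.name;
  return { statusCode: 200, body: JSON.stringify({ message: greet(name) }) };
}
```

Keeping the handler this thin means most of the unit coverage comes from plain function tests, and the E2E tests only need to confirm the wiring.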

The downside is that doing complex things can be a little bit more complex, which forces you into pushing for simple requirements.

Tests

Strong code linting and formatting

High unit coverage (but not enforced)

E2E tests are a must

TDD helps, even though it is not suitable for everyone

Branches

master - deployed to dev and prod

short-lived feature branches

Small Pull Requests - 1 line patches are the best PRs

Create the smallest Pull Request possible as long as it works and it makes sense for the review. You may need to split your feature into different PRs that will get deployed independently.

This makes the code easy to review for people that might not have enough context about the feature.

👉

Delivering very small changes dramatically reduces the chances of breaking things.

Every PR must be approved by at least one other developer. It's a great idea if this dev was not involved in the development, as it helps knowledge sharing.

Every PR should include the tests, including E2E tests if applicable.

💡

On the Homepage we use GitHub Actions to comment on the PRs with screenshots taken from different browsers/resolutions to give front-end devs more confidence. This feature was added after an incident with the Christmas video, which was not working on IE due to missing polyfills. The incident was quickly solved and we're now running E2E tests in IE using BrowserStack.

Squash merge

Squash merge keeps the git log clean and linear. It also adds a link to the PR in the title, where you can get more context about the change.

Merge Pull Request means Deploy to production

If something goes wrong you will know instantly after merging, before moving to a different task. This allows you to quickly code and release the fix.

Don't roll back, fix forward, unless you know that fixing the issue will take too long.

Roadmaps / estimations

Kanban works better in a truly agile environment. It empowers the team and allows more flexibility.

This doesn't mean there is no clear roadmap or that there is no way to forecast when a feature will be delivered. Doing frequent backlog refinements helps the team to understand the big picture and have clear goals.

This may be a controversial statement: developers shouldn't give estimations. There are other ways to calculate them based on metrics (e.g. kanban velocity) with the same inaccuracy. There can be some situations where estimations are useful, but it shouldn't be the norm.
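One possible way to derive a forecast from metrics instead of developer estimates is sketched below. It assumes tickets are small and roughly similar in size, and it uses the plain average of recent weekly throughput; the sample numbers are made up for illustration.

```typescript
// Tickets finished in each of the last few weeks (made-up sample data).
const weeklyThroughput: number[] = [6, 8, 7, 7];

// Average number of tickets the team closes per week.
function averageThroughput(history: number[]): number {
  if (history.length === 0) {
    throw new Error("Need at least one week of history");
  }
  return history.reduce((sum, done) => sum + done, 0) / history.length;
}

// Weeks needed to finish the remaining tickets at the historical pace, rounded up.
function forecastWeeks(remainingTickets: number, history: number[]): number {
  const pace = averageThroughput(history);
  if (pace <= 0) {
    throw new Error("Throughput must be positive to forecast");
  }
  return Math.ceil(remainingTickets / pace);
}
```

With the sample history above the average is 7 tickets per week, so `forecastWeeks(21, weeklyThroughput)` gives 3 and `forecastWeeks(22, weeklyThroughput)` gives 4. The forecast is only as good as the ticket sizes are consistent, which is one more reason to keep tickets small.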

Depending on the product you may need to spend some time on creating a good communication channel with the end-users or stakeholders. New features are released every day and you should keep them informed.

Creating small tasks

Big tasks should be split into very small tickets that bring value independently. This helps estimations and allows different team members to collaborate on the same task. There should be a ticket for everything.

A good example could be splitting the back-end and the front-end work into different tickets, using feature flags if necessary.

Definition of done

The acceptance criteria should be met.

It should be deployed to prod and QA'd by either the technical team or PO.

The feature shouldn't be under a feature flag.

Keep dependencies updated

Not updating dependencies with enough frequency is the same as doing monthly releases.

This is especially important in the JavaScript ecosystem due to the huge number of transitive dependencies and tiny libraries.

Pipeline

Environments

We have only 3 environments. Adding more would increase complexity and costs, and slow down the pipelines. It would also increase the chances of having differences between environments.

Prod

Dev

It's a prod clone (it runs the same code) with just configuration changes.
It should not use the prod database, but a clone (an overnight clone with the PII data overwritten).
It's used as a playground for developers and by the pipeline to confirm the build works before deploying to prod. It's especially useful in OnePlan since we can't access most of the production features due to PII data.
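The "overwriting PII data" step of the overnight clone could look roughly like this. The `Customer` shape and the placeholder values are hypothetical, since the real schema is not part of this article; the point is only that personal fields are replaced while everything else keeps its production shape.

```typescript
// Hypothetical customer record; the real schema is not shown here.
interface Customer {
  id: number;
  name: string;
  email: string;
  loyaltyTier: string;
}

// Replace personal fields with deterministic placeholders and keep the rest,
// so the dev data has the same shape as production without identifying anyone.
function scrubPii(customer: Customer): Customer {
  return {
    ...customer,
    name: `Customer ${customer.id}`,
    email: `customer-${customer.id}@example.invalid`,
  };
}
```

Deriving the placeholders from the id keeps the scrubbed values unique and stable between nightly runs, which makes E2E tests against dev more predictable.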

Local

Keep it as similar as possible to production.
Aim to work offline (for instance, by mocking AWS with serverless-s3-local, serverless-offline-ssm, serverless-offline, etc).

CircleCI workflows

There are only two workflows:

feature branch (<10min)

test + build local + deploy local + e2e

master branch (20-30min)

test + build dev + build prod + deploy dev + e2e + RFC + deploy prod

deploy includes the front-end (copy the React build to S3), the back-end (serverless deploy), and the infrastructure (terraform apply).

Keep it fast and reliable. This is especially important on bigger teams. Creating small PRs means the pipeline will run a lot. If it's slow it will become a bottleneck. Also, keep an eye on new technologies or new CircleCI features that can help to maintain this goal.

Automated RFCs. ServiceNow is the tool used by Sainsbury's to share visibility of all the issues and deployments within the whole business in one place. Every time we deploy something to production, a Change Request is automatically opened with the commit details and a link to the pipeline. If the deployment is successful, the RFC is closed. In case of any error, it is flagged as failed.

Only the pipeline can make changes in production and only after the RFC has been created.
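The RFC lifecycle described above can be sketched as a small wrapper around the production deployment. The `ChangeRequests` interface and its method names are placeholders invented for this example; they are not the ServiceNow API, and the real integration lives in the pipeline configuration.

```typescript
// Hypothetical client for the change-management tool.
// These method names are placeholders, not a real vendor API.
interface ChangeRequests {
  open(summary: string): Promise<string>; // resolves to the new RFC id
  close(id: string): Promise<void>;
  fail(id: string, reason: string): Promise<void>;
}

// Opens the RFC first, deploys, then records the outcome on the RFC.
async function deployWithRfc(
  rfcs: ChangeRequests,
  commit: string,
  deploy: () => Promise<void>,
): Promise<void> {
  const id = await rfcs.open(`Deploy ${commit} to prod`);
  try {
    await deploy();
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    await rfcs.fail(id, reason);
    throw error; // Rethrow so the pipeline is marked as failed too.
  }
  await rfcs.close(id);
}
```

Because the deployment only runs after `open` resolves, nothing reaches production without an RFC, and the RFC always ends up either closed or failed.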

Deploying and testing something that is not finished

Since you can't deploy your build to any specific environment, unfinished features should be deployed under feature flags. This allows enabling the feature only on dev or for specific users.

This approach also makes it possible to demo unreleased features to stakeholders.
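A minimal sketch of such a feature flag check, assuming flags are defined per environment with an optional allow-list of users. The flag name, the user id and the rule shape are hypothetical and not tied to any particular feature-flag library.

```typescript
type Environment = "local" | "dev" | "prod";

interface FlagRule {
  // Environments where the feature is on for everyone.
  environments: Environment[];
  // Users who get the feature everywhere, for example for a stakeholder demo.
  allowedUsers: string[];
}

// Hypothetical flag definitions; in practice these would live in configuration.
const flags: Record<string, FlagRule> = {
  newCheckout: { environments: ["local", "dev"], allowedUsers: ["demo-user"] },
};

function isEnabled(flag: string, environment: Environment, userId?: string): boolean {
  const rule = flags[flag];
  if (rule === undefined) {
    return false; // Unknown flags are off by default.
  }
  const onForEnvironment = rule.environments.some((env) => env === environment);
  const onForUser = userId !== undefined && rule.allowedUsers.some((id) => id === userId);
  return onForEnvironment || onForUser;
}
```

Here `isEnabled("newCheckout", "prod")` is false for regular users, while `isEnabled("newCheckout", "prod", "demo-user")` is true, so the unfinished code can be merged and deployed without being visible to customers.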

Monitoring

There is a single place where we get all the notifications: Slack.

Two channels: non-prod and prod alerts.

It's very important to keep it clean, with no false positives, so that every time you get alerted you know it's important. Any follow-up discussion, other than confirming who is taking care of the alert, should happen in the appropriate channel to avoid spamming.

NewRelic Synthetics - Scripted browser

Scripted browser that simulates the user journey.

NewRelic Synthetics - Pings

Allows checking the endpoint more frequently from different locations, but it only checks for a 200 status.

NewRelic Browser script

Runs in the browser. It checks for JS errors, monitors performance, etc.

Serverless CloudWatch for lambdas

Every time a new lambda is deployed, all the logs are automatically forwarded to our Slack channel without any human intervention.

CircleCI integration

Monitors CircleCI master pipeline failures.

Wrapping up

Pros

Feedback comes almost in real time, instead of writing the feature and deploying it a month later, so you still remember all the details. This is especially useful when there is an issue, because there is no context switch.

The whole team owns all the code, instead of having silos of a single branch or repository.

Easier refactoring. The rest of the team can keep delivering value as they will continually create small PRs using a fresh master branch.

Empowers developers and makes them conscious of the consequences of their changes: merge means deploy. You get used to not breaking things and to keeping master functional.

Better visibility of what everyone is doing. Everything happens on the same repo, no silos.

Reduces manual testing.

No bugfix releases.

The business is constantly getting value every day.

Good developer experience, no frustrating/complex releases.

Cons

Code freezes are bad, really bad, and pointless. Once the code freeze is over you need to do a big release with all the implied risks.

Requires a big effort on challenging the business and solutions proposed by the architects and other teams.

Requires continuous care of the process, pipelines, development experience, etc.

Switching to Kanban and no developer estimations can cause some friction with people outside the development team until they get used to the new process.

CircleCI doesn't provide a way to handle race conditions to avoid two master deployments running in parallel. As we're deploying code all the time the chances of this happening increase a lot. There are ways to create infinite loops waiting until other workflows have finished but we prefer to communicate through Slack every time before deploying.
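For reference, the "wait until other workflows have finished" alternative can be sketched as a bounded polling loop. This is not what we do, since we coordinate through Slack. The `countRunning` function is injected and purely hypothetical, so the sketch makes no claim about how to query CircleCI.

```typescript
// Polls until no other master workflow is running, or gives up after a limit.
// `countRunning` is injected: how to query the CI provider is left open here.
async function waitForTurn(
  countRunning: () => Promise<number>,
  maxAttempts: number,
  delayMs: number,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if ((await countRunning()) === 0) {
      return true; // Nothing else is running, so it is our turn.
    }
    if (attempt < maxAttempts) {
      // Wait before asking again.
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false; // Bounded, so the job can never wait forever.
}
```

Note that this only narrows the window: two workflows can still both see zero running deployments at the same moment and start together, so it reduces the race rather than eliminating it.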