
Are you production ready? Going live!

Releasing a brand new IT project or introducing a major change to an existing IT system is usually a challenge. Such tasks require careful planning and preparation. A successful release may have a big, positive impact on the organisation. A release that results in an unstable, flaky or unavailable system can negatively impact the organisation’s reputation, sales, performance and so on.

Going live doesn’t need to be a scary or very risky task, provided enough thought and preparation. You cannot completely eliminate all risks or make a release fail-proof, but you can mitigate the negative impact, get prepared and be ready to roll back in a smooth manner.

Based on my experience, I wanted to share what I consider a set of good practices and things to look out for in order to increase the probability of a successful release – from a technical point of view.


Remember, no two projects are the same, and no two teams work in exactly the same way or have the same release process in place. You may be thinking: hold on, services/systems/applications should be tested and auto-deployed on each PR being merged… Yes, in an ideal world. However, there are situations where things are not that straightforward – going live or introducing a major new release of software is a bit more complicated than making sure the software is successfully deployed to environment X, Y or Z.

This post is full of questions. That’s intentional – when a major change is underway, I want you to ask as many questions as possible and think of the best available and feasible solutions.

Communicate with all stakeholders

Communication is key – don’t keep your release secret. It is essential that people in your team, organisation, partners, suppliers, etc. (delete as appropriate) are aware of the planned change and when it is going to happen. The release date may not suit your marketing team – they may have a campaign lined up for that day, and it would be too risky to introduce a major change at the same time. Likewise, one of your partners (a service provider, supplier, advertiser, etc.) may have their services affected by their own releases or maintenance windows. It may seem a bit exaggerated, but it’s better not to assume that everything related to the system or project will work as usual on the release date. Also, don’t assume that by some kind of magic everyone will know about your release. Communicate with the key parties and make sure they cascade the information down the chain.

Identify show-stoppers

As part of release planning, identify what would stop the release from happening or being successful. Would the absence of a particular individual (sickness, etc.) stop go-live? If you have a data migration to execute, would a prolonged data transformation or a migration failure prevent go-live? Do you have any downstream dependencies, such as services or third-party systems you rely on? Would a failure of CI/CD infrastructure (Jenkins, an artifact repository, a Docker registry, etc.) be an obstacle?

Do you know what’s going on in the system?

Visibility, and the ability to quickly assess what’s happening in the system, is key to identifying issues and assessing the behaviour and health of the system. Do you have comprehensive monitoring and logging infrastructure in place? Is it resilient, and would it cope with a potential, unexpected load? Does your staff have easy access to those systems? All of these are essential to diagnosing your service. Also, do you have a way to easily run a test suite against the production system, such as executing a typical user journey (be it a website or a service)?
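As a rough illustration, a user-journey check against production can be as small as a script that walks a few endpoints and reports anything that doesn’t answer. This is just a sketch – the URLs and journey steps below are hypothetical placeholders, and the fetch function is pluggable so you can point it at your own system:

```python
"""Minimal post-deployment smoke-test sketch (endpoints are hypothetical)."""
from urllib.request import urlopen
from urllib.error import URLError

# Hypothetical user-journey steps: name -> URL expected to return HTTP 200.
JOURNEY = {
    "home page": "https://example.com/",
    "search": "https://example.com/search?q=test",
    "health check": "https://example.com/healthz",
}

def check(url, fetch=urlopen):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with fetch(url, timeout=5) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

def run_journey(journey, fetch=urlopen):
    """Run every step; return the names of the steps that failed."""
    return [name for name, url in journey.items() if not check(url, fetch)]
```

A script like this, run on a schedule and wired into alerting, doubles as a cheap synthetic monitor once the release is out.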

Do your team and support engineers have a good understanding of how the system works as a whole?

IT systems vary in complexity. Regardless of that, does your team (developers, testers, DevOps, etc.) have a good understanding of how the system works as a whole and how its components depend on each other? Do they know how to diagnose common faults and problems? If your system is composed of multiple microservices, not every member of the team will be equally proficient in understanding how each of the services works inside out (probably). That’s fine, as long as there is a general understanding of how each service/component works – and team members know who to talk to about a specific microservice or component. Does everyone know which components of the system can fail gracefully (causing minor degradation) and which would render the whole system useless? Taking a travel search engine as an example, an outage of the recommendation engine would degrade the system, while an outage of the search component would render it pretty much useless. You can classify components as critical, important and nice-to-have, or tier 1, 2 and 3, etc. If your operation is 24/7, are your engineers comfortable identifying faults, fixing them, or knowing where to escalate issues?
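The tiering idea above can be made concrete in a few lines. The component names and tier assignments below are illustrative (loosely based on the travel search example), not a prescription:

```python
"""Sketch: roll component failures up into an overall system status.

Tier 1 = critical, tier 2 = important, tier 3 = nice to have.
Component names and tiers are illustrative assumptions.
"""

TIERS = {
    "search": 1,           # an outage here makes the system useless
    "booking": 1,
    "recommendations": 3,  # an outage here only degrades the system
    "reviews": 3,
}

def system_status(failed_components, tiers=TIERS):
    """Return 'healthy', 'degraded' or 'outage' given failed components."""
    worst = min((tiers[c] for c in failed_components), default=None)
    if worst is None:
        return "healthy"
    if worst == 1:
        return "outage"    # at least one critical component is down
    return "degraded"      # only non-critical components are affected
```

Keeping such a mapping somewhere explicit (runbooks, dashboards, alert routing) means the on-call engineer doesn’t have to work out criticality at 3am.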

Identify risks, assess the likelihood of faults and their impact

Things can go wrong, and most likely they will. When you are releasing a new system, the likelihood may be higher than usual. It’s important to identify risks (e.g. a data migration not completing, data corruption, service failure for some expected but unlikely reason) – and it’s crucial to highlight them to the business and have an open discussion about the cost of mitigation, accepting the risk, or executing plan B.

Do you have a plan B?

One thing you should definitely do is have a plan for when all hell breaks loose. If you have an existing system in place and are introducing a major change with data migration, etc., you need to plan for a rollback. If it’s a new system, then you have a different kind of battle to fight. Part of “plan B” may be to involve the PR or incident management team and get their help if things go bad. Whatever your release is – a brand new system or a replacement of an old one, internal or external – you should always acknowledge that things may go wrong, and you should really have a “plan B”.
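Part of planning a rollback is agreeing up front, with the business, which measurable conditions trigger it, so the decision isn’t made under pressure. A trivial sketch of such a trigger – the metric names and thresholds here are assumptions to be replaced with whatever you agree on:

```python
"""Sketch of a pre-agreed rollback trigger (thresholds are assumptions)."""

def should_roll_back(error_rate, p99_latency_ms,
                     max_error_rate=0.05, max_p99_ms=2000):
    """Return True when post-release metrics breach the agreed thresholds."""
    return error_rate > max_error_rate or p99_latency_ms > max_p99_ms
```

Whether the rollback itself is then automated or a human pulls the lever matters less than having the criteria written down before go-live.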

Are you ready for the load?

That’s one of my favourites, and it is quite often overlooked. I’m a fan of load and performance testing from the early stages of a project – the more automated the better. If performance testing didn’t happen for whatever reason until the pre-release stage, then it should definitely happen now. The performance and load tests should be executed against the exact configuration and version of the system that is going to production. Speak with stakeholders and establish the expected user journeys and expected load. If possible, aim above the expected load. Make sure you can scale at reasonably short notice. Your system should ideally auto-scale and deal gracefully with spiky traffic. Also, in case the system cannot scale for whatever reason, it should degrade gracefully, affecting only a subset of clients while still serving its purpose to the others. Generally speaking, do not underestimate this task and do not leave it until the last minute. From my experience, it takes more time and effort than anticipated, and performance testing can uncover some nasty surprises.
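For anything serious you’d reach for a dedicated tool (JMeter, Gatling, Locust, k6, …), but the shape of a load test is simple enough to sketch: fire requests concurrently, collect latencies, and look at percentiles rather than averages. A minimal illustration, with `hit` standing in for your real request function:

```python
"""Toy load-test sketch: run `hit()` concurrently and report latencies.

`hit` is a placeholder for a real request function; use a proper
load-testing tool for anything beyond a quick experiment.
"""
import time
from concurrent.futures import ThreadPoolExecutor

def measure(hit, requests=100, concurrency=10):
    """Call `hit()` `requests` times across `concurrency` workers;
    return per-call latencies in seconds."""
    def timed():
        start = time.perf_counter()
        hit()
        return time.perf_counter() - start
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed) for _ in range(requests)]
        return [f.result() for f in futures]

def percentile(latencies, p):
    """p-th percentile of a latency sample (nearest-rank method)."""
    ordered = sorted(latencies)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]
```

Watching p95/p99 rather than the mean is the point of the `percentile` helper: averages hide exactly the tail behaviour that hurts users under load.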

Dry run if you can

If the release process is complex and involved – for example, it requires data migration, mid-deployment testing, etc. – it may be worth doing a dry run as a rehearsal for the release. If your release process is rather straightforward and no different from simply releasing to a QA or staging environment, then it’s of lesser importance.

Security and pen-testing

The security of a system is quite often overlooked. Ideally, security should be everyone’s concern and be built into the development cycle. Keeping the security of the data and the integrity of the system in mind should be a focus of every person on the project. Sometimes things get missed, and that’s why it’s a good idea to add extra focus on trying to find security holes in the system – for example, through penetration testing.

Regulatory compliance

Be it a new system or a new release, consider if you are bound by regulatory compliance rules, such as storing personal data in a secure manner or being required to anonymise it.

Go/No-go meeting and a checklist

You are getting there, right? 🙂 Exciting times. The last thing I recommend before deciding on the big release is to call a “Go/No-go” meeting with a wide range of stakeholders – team leads of related projects/services, senior business and technical people, and other relevant parties. This meeting is the time to confirm readiness and the absence of obstacles, that risks have been highlighted and accepted, and that the release plan – the order of technical tasks, the expected ETA, etc. – has been discussed and approved.
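The “Go/No-go” decision is, at heart, a checklist where every item must be ticked and any outstanding item is a named blocker. A toy sketch of that rule (the checklist items are illustrative, not exhaustive):

```python
"""Go/No-go sketch: a release is a go only when every item is ticked.

Checklist items are illustrative assumptions.
"""

CHECKLIST = {
    "stakeholders informed": True,
    "performance tests passed": True,
    "rollback plan approved": True,
    "monitoring dashboards ready": False,  # still outstanding -> no-go
}

def go_no_go(checklist):
    """Return ('go', []) when everything is done, else ('no-go', blockers)."""
    blockers = [item for item, done in checklist.items() if not done]
    return ("go" if not blockers else "no-go", blockers)
```

The useful part isn’t the code, of course – it’s that the meeting ends with every blocker written down next to an owner and a date.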


Communication with stakeholders, openness about risk, testing, and allowing enough time for preparation are all factors contributing to successful major releases. Things don’t always go as smoothly as expected. If they go wrong, we should be ready to take action and at least learn from the experience, so as not to repeat the same mistakes next time! It is great if you have a smooth deployment and release process, but doing a major release goes beyond having code running in a specific environment.

Good luck!