Published on: Wed Dec 01 2021
Last Updated on: Tue Dec 28 2021
Thanks to the guests for contributing to this topic!
What is production ready or production level code?
If you had asked me years back, I would probably have just said it is code that lives in a production environment.
My definition of production ready code has evolved over the years. I’d like to take the time to share some of my learnings, which I think would be helpful for individuals trying to grasp this idea.
As a bonus, I have invited two guests (Shawn and Mike) to contribute and share their views on this topic to get a diverse perspective on what it means to be production ready.
I think if you ask different individuals ‘what production ready means’, you’d likely get different answers depending on the project, domain and other requirements on that project.
This is a great point brought up by Shawn, who mentions “There isn't a one size fits all answer. Rather it needs to be scaled to the project.”
I totally agree because each project will have different requirements and varying focus on what is important for successful delivery on their project.
However, I still think there are categories that one should be mindful of when deciding whether or not code is production ready.
I’d like to explore some of those categories today.
As engineers, we are hired to solve business problems. One way to quantify that is to measure the impact or success.
More often than not, you are also collaborating with other teams (business, design, project management, customer support, data etc) to drive a certain outcome or achieve some goal.
I believe this is one of the core tenets of being production ready: ensuring that the “solutions” to the problems being deployed to production are quantifiable.
The impact of the code should be measured and tracked.
Likewise, as we measure how our code is being used, we also come to understand it better, and can in turn make improvements to better optimize it for our users. This doesn’t replace actually surveying users and getting feedback, but it’s a good start.
Some metrics to track may include (and are not limited to):
Also be mindful that these measurements are only useful if they can quantify the business impact (directly or indirectly).
Ultimately, you want to be able to tie this to your business goals like:
They don’t all need to be about revenue and costs. A lot of these metrics can also impact the business indirectly, meaning they indirectly contribute to the bottom line, which is to save cost or increase revenue.
This can be something as simple as making a change to help out the customer support team, which in turn improves a metric like the number of support tickets being created.
Finally, the metrics one can measure can also be what Google defines as SDO (software delivery and operational) metrics, which include:
These metrics are a good indicator of the strength and process excellence of your product delivery.
According to the DORA State of DevOps report (2019), elite performing teams deploy code 208 times more frequently and have a 106 times faster lead time from commit to deploy than low performing teams.
The faster you can iterate and learn while maintaining the quality, the more competitive you are in the market.
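As a rough illustration of "more" and "faster", deployment frequency and lead time can be computed from a list of deploy records. This is only a minimal sketch: the `Deploy` record and its two fields are hypothetical and do not come from any particular tool or from the DORA report itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deploy:
    # Hypothetical record: when a change was committed and when it reached production.
    committed_at: datetime
    deployed_at: datetime


def deployment_frequency(deploys: list[Deploy], window_days: int) -> float:
    """Average number of production deploys per day over the window."""
    return len(deploys) / window_days


def median_lead_time(deploys: list[Deploy]) -> timedelta:
    """Median time from commit to production deploy."""
    return median(d.deployed_at - d.committed_at for d in deploys)


# Made-up sample data: three deploys in one week.
deploys = [
    Deploy(datetime(2021, 12, 1, 9), datetime(2021, 12, 1, 11)),
    Deploy(datetime(2021, 12, 2, 9), datetime(2021, 12, 2, 13)),
    Deploy(datetime(2021, 12, 3, 9), datetime(2021, 12, 3, 18)),
]
print(deployment_frequency(deploys, window_days=7))
print(median_lead_time(deploys))
```

Tracking these two numbers over time is usually more informative than any single snapshot, since the goal is to see whether the team is iterating faster without quality dropping.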
Shawn agrees a lot with this, and mentions this is why FinOps is rising and so dominant in the industry.
So, not everything needs to be directly tied to the bottom line, and there are many ways one can quantify and measure the impact. At the end of the day, business drives technology, and not the other way around.
What is risk management in software? This likely deserves a whole post in itself.
However, to me that means any risk associated with further development, on-going support and maintenance of the software in every stage of the lifecycle.
After code goes to production, that is not the end of it. It requires on-going support and improvements; otherwise, it just deteriorates over time.
Identifying and addressing the major risks would be ideal before going to production.
Of course, this exercise is done in consultation with the stakeholders involved with the project, because many times there will be trade-offs that need to be made to meet a budget, timeline or some other constraint.
Often, you’ll end up with a solution whose risk profile is good enough for every stakeholder involved to accept, rather than an ideal one. That is just reality.
There can also be other risks not related to the software itself, such as people, team and project risks. These may include:
As for the software, beyond just evaluating risk before the project goes to production, it is just as important to monitor the other risk factors on an on-going basis.
These may include:
These all indirectly impact the business metrics that we are measuring. So, it would be good to have a way to easily identify and evaluate these risks as they appear in production.
Quality Assurance or Software testing is a big part of software risk reduction but it does not end there.
I like to think of it as a tool which provides 80% of the confidence you need in your code. The last 20% you can only truly capture in production.
This is because there is just no way to test every use-case across devices and platforms. Not only that, the code is running in a very complex environment with edge locations and multiple instances.
This is why I am a firm believer in having good observability. Not only can you monitor behaviours of your code, often times, you can get feedback which you can use to improve the product or take the learnings back to improve your overall process, which many refer to as “shifting left”.
Similar to measuring success metrics, we also need to quantify the system or components that make up the solution to our problems, and keep tabs on the risk factors.
As Mike mentions, to be production ready “our services should be monitored and ideally have SLO (Service-Level objective)”.
I totally agree, and this is the most important area that I have added into my definition of production ready code over the years.
You would think that if you have good test coverage, everything will be fine in production. But when you go into production, the complexity increases 10 to 100 fold.
You are dealing with tens or hundreds of internal and external systems, each of which can break a contract at any time.
For example, one time I noticed an intermittent error in a production service due to another service running two different versions of the software on the instances because of a bad deploy.
Only 10% of customers were impacted but I was able to catch it only because we had metrics available.
These are just things that are much more difficult to test for.
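To make the bad-deploy example a little more concrete, here is a minimal sketch of how an error rate broken down by deployed version could surface that kind of problem. It is not how the original incident was diagnosed; the `(version, status)` pairs are hypothetical tags that a metrics pipeline would need to record for each request.

```python
from collections import defaultdict


def error_rate_by_version(requests: list[tuple[str, int]]) -> dict[str, float]:
    """Return the fraction of 5xx responses for each deployed version.

    Each request is a (version, http_status) pair. Both fields are
    hypothetical and stand in for whatever tags your metrics carry.
    """
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for version, status in requests:
        totals[version] += 1
        if status >= 500:
            errors[version] += 1
    return {version: errors[version] / totals[version] for version in totals}


# Made-up sample: nine healthy requests on v1, one failing request on v2.
sample = [("v1", 200)] * 9 + [("v2", 500)]
rates = error_rate_by_version(sample)
print(rates)
```

When one version shows a very different error rate from the other, a partially rolled out or mismatched deploy becomes an obvious suspect, even if the overall error rate still looks tolerable.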
Ultimately, setting these SLOs allows us to tie them to the SLA (service-level agreement), which is the goal or success outcome we are measuring.
Example:
We need the payment API to be 99.999% reliable so we can ensure we are meeting the goal of an 80% success rate on the purchase flow.
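One common way to reason about a target like this is an error budget: the number of failures the SLO still allows. The sketch below is only an illustration under simple assumptions (every request counts equally, and "reliable" means the request succeeded). The function name and parameters are made up for this example.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent.

    slo_target is the required success ratio, e.g. 0.99999 for 99.999%.
    A negative result means the SLO has been breached.
    """
    allowed_failures = (1 - slo_target) * total
    if allowed_failures == 0:
        # No failures are allowed at all (or there was no traffic).
        return 0.0 if failed == 0 else float("-inf")
    return 1 - failed / allowed_failures


# With a 99.999% target, 1,000,000 requests allow only about 10 failures,
# so 5 failures use up roughly half of the budget.
print(error_budget_remaining(0.99999, total=1_000_000, failed=5))
```

Seeing the budget shrink gives an early warning before the target is actually missed, which is usually a better signal than a raw failure count.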
In addition, I think just having insights or visibility into the different parts of the systems running in production is so empowering when things do go wrong.
Tools that can help with observability:
I share a lot of the same opinions as Mike on this particular topic where the goal here is to ensure that the code can be debugged in production.
Essentially, you want to set yourself up for success when things do go badly in production. With good observability you should be able to methodically isolate and resolve an issue, rather than throwing a dart in the dark and hoping things are fixed.
This may not be something you worry about if you hand off code to an Ops team. However, if you are that person, then your future self would thank you for doing that (or at least your Ops team would), especially at 3AM fixing a live bug impacting many customers.
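One small habit that tends to make production debugging easier is emitting structured logs that carry a request identifier, so a single failing request can be traced across log lines. The following is a minimal sketch using only Python's standard `logging` module. The `request_id` field and the `payments` logger name are hypothetical.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id is a hypothetical field passed in via `extra=`.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger("payments")
logger.warning("charge failed", extra={"request_id": "req-123"})
```

Because every line is machine-readable, whatever log tooling you use can filter by `request_id` instead of relying on free-text search at 3AM.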
Finally, observability doesn’t have to be just for when things go wrong. It is also a big part of security and tracking bad actors in the production environment (e.g. which endpoints are being abused, DDoS attacks).
So, constantly observing our system in production allows us to easily track, identify and resolve risk factors as they appear in real time.
This is every engineer’s favourite part.
The obvious one is speed, but that is not the only area. There are other areas that Mike believes to be just as important for evaluating code for production readiness.
These include:
This is a great list of things to consider for production readiness. I would also add to the list:
Of course, this isn’t a comprehensive list of everything, but I think it covers a good chunk of it.
Beyond just thinking about optimizing everything, I think it is just as important to understand, holistically, the limitations and constraints of the code we intend to release to production.
Ideally, we also have strategies in place to mitigate or get past those limits and constraints.
I believe this is a topic that the guests and I all agree on to be one of the core tenets of production readiness.
Beyond just having everything working, optimized and observable, we also need to ensure we perform security audits on our code, and typically you would be working closely with a security team to do this.
However, if that is not possible, a quick audit against the OWASP Top 10 is a good start, but it is not the be-all and end-all.
Some examples include:
Beyond just security hardening of code, we also need to be mindful of Personally Identifiable Information (PII) and meeting compliance requirements (e.g. GDPR for EU users, HIPAA, PCI).
For GDPR compliance, this includes properly handling a user’s data (consent, retrieval, storage and deletion). You must also ensure confidentiality, so that the data is not leaked and does not get into the hands of bad actors.
For more information, please visit gdpr.eu.
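As one small, concrete example of keeping PII out of places it should not be, the sketch below masks anything that looks like an email address before a message is logged. It is illustrative only: the regular expression is a simplification, the function name is made up, and redaction like this is just one piece of handling personal data properly.

```python
import re

# Simplified pattern; real email addresses can be more varied than this.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)


print(redact_emails("login failed for jane.doe@example.com"))
```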
So, part of being production ready is ensuring that the code has been audited for security flaws and data compliance.
This is one I personally added as part of being production ready: a pre-production checklist of items to be done before we go live.
Sometimes there will be no changes required, but other times there will be many items that need to be updated and changed in the production environment.
I have noticed over the years when taking a feature to production, there are always a bunch of things that needed to be done in the production environment before the code can be deployed to production.
So, I definitely think it’s worth having a checklist of items you need done before going to production (e.g. updating configurations, configuring resources, CMS changes, configuring some infrastructure).
You definitely don’t want to find out you missed something critical once the code is already in production.
This may be something small but it is easy to forget if you don’t note it down, especially if you are jumping between projects, stories and bug tickets.
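Parts of such a checklist can sometimes be automated. As a minimal sketch, assuming configuration is supplied through environment variables, a small preflight check can report which required settings are still missing before a deploy. The setting names below are hypothetical and only for illustration.

```python
import os
from typing import Mapping, Optional

# Hypothetical setting names, for illustration only.
REQUIRED_SETTINGS = ["DATABASE_URL", "PAYMENT_API_KEY", "CMS_BASE_URL"]


def missing_settings(
    required: list[str], env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Not ready for production, missing settings:", ", ".join(missing))
```

Items that cannot be checked by a script (CMS changes, manual infrastructure steps) still belong on the written checklist.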
This one is a little bit of a tangent from the rest of the points we have already covered: impact.
At the core, before we move to production, Shawn mentions “we need to consider whether we are solving the problems we set forth to solve in the first place”.
I agree, and this is a process that is typically done in consultation with key stakeholders involved with the project.
This could be something as simple as having a quick demo of the finished product with the group to get some consensus, tying up any loose ends, and ultimately to gauge how ready the whole group feels about going to production.
Also, we need to consider the downstream implications and impact for adjacent teams like customer and technical support. For example, the changes we make can have a direct impact on the number of support tickets being created if things do go wrong.
So, it is good to be mindful of the type of impact that your change may introduce. The support team, or any other team, will likely need to be looped in to get context and training on the change (especially if it is for a new feature).
So, documenting the change will likely be highly valuable to the other teams! This is a great segue into the next section.
This one may be viewed as a “nice to have” by many people, but I believe it is one of the core tenets of having code that is production ready.
Documentation is a pretty broad term but to Mike, this could be something as simple as adding some documentation describing the new feature being added.
I like to think of documentation as giving the next person enough context to be productive or to carry on from where you left off. In terms of production readiness, that typically means being productive within the whole software development life cycle (SDLC).
Someone needs to be able to jump in, and be able to get enough context to build, test, deploy and monitor your services. So, ideally, you’d have documentation about each of those areas.
The documentation may include:
This will very likely vary from team to team but I would just think back to the question of what information would someone else need to successfully run and operate this service, from feature development all the way to production.
This is becoming more and more important as teams scale, and become more distributed across time zones.
Finally, we tie it all together with code reviews. Code reviews are a great time to evaluate many if not all of the things listed above to determine if code is production ready.
Not everything may be applicable as every project is different but it is a good reference to build upon.
To review for production readiness, Mike also mentions reviewing the list in 12 Factor Application. It is a great list to review!
Other than the items listed above, the guests and I agree on reviewing for code maintainability and readability. Once the code is in production, we will need to ensure we make it easy to make changes but also ensure that we don’t break any contracts for the existing business logic in our code.
So, that means having the right tools in place (testing, dev tool automation - linters, formatting, continuous integration and delivery) to set us up for success as we make more changes in the future.
So, to wrap everything up, these are the categories that the guests and I look for when we are reviewing code for production readiness.
Of course, you may find that you won’t need everything on the list or you have other areas you evaluate for production readiness in your project.
If you are unsure, feel free to use this list as a good starting point.
Ultimately, it’s about solving the business problems you set out to solve, and being able to consistently meet that goal by measuring and observing it. Of course, this is done in collaboration with many other team members outside of engineering!
This isn’t a must-have list but more of a list of items to strive for in order to achieve long-term and predictable value.
I believe we should also strive for pragmatism in our approach. There will always be constraints in time, budget and quality.
Rather than striving for perfection, we should always strive for effectiveness, meaning creating the most value from the resources that we have. As Shawn emphasizes, developing software is an exercise of constraint management.