Production Ready Code

Published on: Wed Dec 01 2021

Last Updated on: Tue Dec 28 2021

Guests

Thanks to the guests for contributing to this topic!

  • Mike Nikes, Software Architect, Educator and Author building Webstone
  • Shawn MacIntyre, Software Entrepreneur & AWS Consultant specialized in building cloud native solutions and leading teams

Introduction

What is production ready or production-level code?

If you’d asked me years back, I would probably just have said it is code that lives in a production environment.

My definition of production ready code has evolved over the years. I’d like to take some time to share my learnings, which I think will be helpful for individuals trying to grasp this idea.

As a bonus, I have invited two guests (Shawn and Mike) to contribute and share their views on this topic to get a diverse perspective on what it means to be production ready.

Production readiness

I think if you asked different individuals what production ready means, you’d likely get different answers depending on the project, domain, and other requirements.

This is a great point brought up by Shawn, who mentions: “There isn't a one size fits all answer. Rather it needs to be scaled to the project.”

I totally agree, because each project will have different requirements and a varying focus on what is important for successful delivery.

However, I still think there are categories that one should be mindful of when deciding whether or not code is production ready.

I’d like to explore some of those categories today.

Measurable

As engineers, we are hired to solve business problems. One way to quantify that is to measure the impact or success.

More often than not, you are also collaborating with other teams (business, design, project management, customer support, data etc) to drive a certain outcome or achieve some goal.

I believe this is one of the core tenets of being production ready: ensuring the “solutions” to the problems being deployed to production are quantifiable.

The impact of the code should be measured and tracked.

Likewise, as we measure how our code is being used, we also understand it better, and can in turn make improvements to better optimize it for our users. This doesn’t replace actually surveying users and getting feedback, but it’s a good start.

Some metrics to track may include (but are not limited to):

  • Google Analytics
  • Custom tracking (in-house)
  • API endpoints (# of successful purchases etc)
  • E-mail campaign tracking
  • Customer retention
  • Number of support tickets
  • Customer churn rate (i.e. deletions or unsubscribes)
  • and many more

Also be mindful that these measurements are only useful if they can quantify the business impact (directly or indirectly).
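To make the custom (in-house) tracking idea concrete, here is a minimal sketch in Python. The `MetricsTracker` class and the event names are hypothetical stand-ins for whatever metrics pipeline your team actually uses:

```python
from collections import Counter

class MetricsTracker:
    """Minimal in-house event counter (a stand-in for a real metrics backend)."""

    def __init__(self):
        self._counts = Counter()

    def track(self, event: str, n: int = 1) -> None:
        # In production this would ship the event to a metrics backend
        # instead of counting in memory.
        self._counts[event] += n

    def count(self, event: str) -> int:
        return self._counts[event]

# Example: tracking purchase outcomes on an API endpoint
tracker = MetricsTracker()
tracker.track("purchase.success")
tracker.track("purchase.success")
tracker.track("purchase.failure")
print(tracker.count("purchase.success"))  # 2
```

Even a thin wrapper like this gives you a single place to later swap in a real backend without touching call sites.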

Ultimately, you want to be able to tie this to your business goals like:

  • Sales and growth optimization
  • Improve customer satisfaction
  • Increase the number of leads
  • Improve community engagement & satisfaction
  • and many more

They don’t all need to be about revenue and costs. A lot of these metrics can also indirectly impact the business, meaning they indirectly contribute to the bottom line, which is to save costs or increase revenue.

This can be something as simple as making a change to help out the customer support teams, which in turn decreases the number of support tickets being created.

Finally, the metrics one can measure can also be what Google defines as SDO (Software Delivery and Operational) metrics, which include:

  • Lead time
  • Deployment Frequency
  • Mean time to restore (MTTR)
  • Change fail percentage

These metrics are a good indicator of the strength and process excellence of your product delivery.
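As a rough illustration of how two of these SDO metrics can be computed, here is a sketch over a hypothetical list of deployment records (the record shape is invented for the example):

```python
# Hypothetical deployment records: each deploy either succeeded,
# or failed and took some number of minutes to restore.
deploys = [
    {"failed": False},
    {"failed": True, "restore_minutes": 45},
    {"failed": False},
    {"failed": True, "restore_minutes": 15},
]

# Change fail percentage: share of deploys that caused a failure
change_fail_pct = 100 * sum(d["failed"] for d in deploys) / len(deploys)

# Mean time to restore (MTTR): average restore time across failed deploys
restores = [d["restore_minutes"] for d in deploys if d["failed"]]
mttr = sum(restores) / len(restores)

print(change_fail_pct)  # 50.0
print(mttr)             # 30.0
```

In practice these numbers would come from your CI/CD and incident tooling rather than a hand-written list, but the arithmetic is the same.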

According to the DORA DevOps report (2019), high-performing teams deploy 208 times more frequently and have 106 times faster lead time than low-performing teams.

The faster you can iterate and learn while maintaining the quality, the more competitive you are in the market.

Shawn strongly agrees with this, and mentions it is why FinOps is on the rise and so dominant in the industry.

So, not everything needs to be directly tied to the bottom line, and there are many ways one can quantify and measure the impact. At the end of the day, business drives technology, and not the other way around.

Risk Management

What is risk management in software? This likely deserves a whole post in itself.

However, to me it means any risk associated with further development, ongoing support, and maintenance of the software at every stage of the lifecycle.

After code goes to production, that is not the end of it. It requires ongoing support and improvements; otherwise, it just deteriorates over time.

Identifying and addressing the major risks would be ideal before going to production.

Of course, this exercise is done in consultation with the stakeholders involved with the project, because many times there will be trade-offs that need to be made to meet a budget, timeline, or some other constraint.

Often, you’ll end up with a solution whose risk profile is good enough for every stakeholder involved to accept, rather than an ideal one. That is just reality.

There are also other risks not related to the software itself, like people, team, and project risks. These may include:

  • Do we have enough people to continue operations?
  • Do we have only one person on-call all the time?
  • Knowledge silos (only a few people have knowledge of critical parts of the code)
    • Is any documentation available?
    • Are there strategies for knowledge sharing?
  • Based on the team, are we able to meet the ongoing goals of the stakeholders?
  • The amount of work does not line up with the number of people available for future work
  • Team morale
  • Project technical debt
  • and many more

As for the software, beyond just evaluating risk before the project goes to production, it is just as important to monitor the other risk factors on an ongoing basis.

These may include:

  • Security & Compliance
    • What is the status of our customers’ data? Any unwanted access or leaks?
  • Quality
    • Does our software meet our business requirements most of the time?
  • Reliability
    • How much downtime are we experiencing?
  • Performance & Usability
    • Is our performance acceptable for our customers’ use cases?
    • How is the usability across devices and platforms?
  • and many other areas

These all indirectly impact the business metrics that we are measuring. So, it would be good if we can easily identify and evaluate these risks as they appear in production.

Quality Assurance or Software testing is a big part of software risk reduction but it does not end there.

I like to think of it as a tool which provides 80% of the confidence you need in your code. Then you truly capture the last 20% in production.

This is because there is just no way to test every use case across devices and platforms. Not only that, the code is running in a very complex environment with edge locations and multiple instances.

This is why I am a firm believer in having good observability. Not only can you monitor the behaviour of your code, but oftentimes you can get feedback which you can use to improve the product, or take the learnings back to improve your overall process, which many refer to as “shifting left”.

Observability

Similarly to measuring success metrics, we also need to quantify the systems or components that make up the solution which solves our problems, and keep tabs on the risk factors.

As Mike mentions, to be production ready, “our services should be monitored and ideally have SLO (Service-Level objective)”.

I totally agree, and this is the most important area that I have added into my definition of production ready code over the years.

You would think that if you have good test coverage, everything will be fine, but when you go into production, the complexity increases by 10 to 100 fold.

You are dealing with tens or hundreds of internal and external systems, each of which can break a contract at any time.

For example, one time I noticed an intermittent error in a production service because another service was running two different versions of the software on its instances due to a bad deploy.

Only 10% of customers were impacted, and I was able to catch it only because we had metrics available.

These are just the kinds of things that are much more difficult to test for.

Ultimately, setting these SLOs allows us to tie them to the SLA (service-level agreement), which is the goal or success outcome we are measuring.

Example:

We need the payment API to be 99.999% reliable so we can ensure we are meeting the goal of an 80% success rate on the purchase flow.
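The arithmetic behind a target like this is worth making explicit. A 99.999% availability SLO translates into a very small error budget; a quick back-of-the-envelope calculation over a 30-day window:

```python
# Error budget for a 99.999% availability SLO over a 30-day window.
slo = 0.99999
window_minutes = 30 * 24 * 60           # 43200 minutes in the window
budget_minutes = window_minutes * (1 - slo)

# Only about 0.43 minutes (~26 seconds) of downtime allowed per month
print(round(budget_minutes, 2))         # 0.43
```

Seeing the budget in seconds makes it obvious why such a target requires serious investment in monitoring and automated recovery.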

In addition, I think just having insight or visibility into the different parts of the systems running in production is so empowering when things do go wrong.

Tools that can help with observability:

  • Logging points (structured and well thought out ones)
  • Metrics (ie higher level metrics on conversion flows, % of success vs % failure)
  • Tracing (how the different systems interact with each other)
  • Alerting
  • APM (application performance monitoring)
  • Client side error tracking for websites (ie Sentry)
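To illustrate the first item, a structured logging point can be as simple as emitting one JSON object per log line so that logs are machine-searchable. This sketch uses Python's standard `logging` module; the logger name and fields are made up for the example:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so logs are machine-searchable."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Structured fields attached via the `extra=` argument
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A well thought out logging point: what happened, and the key identifiers
logger.info("payment captured",
            extra={"fields": {"order_id": "o-123", "amount_cents": 4999}})
```

The win is that a log aggregator can now filter on `order_id` or `amount_cents` directly instead of grepping free-form text.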

I share a lot of the same opinions as Mike on this particular topic where the goal here is to ensure that the code can be debugged in production.

Essentially, you want to set yourself up for success for when things do go badly in production. With good observability, you should be able to methodically isolate and resolve an issue rather than shooting a dart in the dark and hoping things are fixed.

This may not be something you worry about if you hand off code to an Ops team. However, if you are that person, then your future self will thank you (or at least your Ops team will), especially at 3 AM while fixing a live bug impacting many customers.

Finally, observability doesn’t have to be just for when things go wrong. It is also a big part of security and tracking bad actors in the production environment (i.e. which endpoints are being abused, DDoS attacks).

So, by constantly observing our system in production, we can easily track, identify, and resolve risk factors as they appear in real time.

Optimization

This is every engineer’s favourite part.

The obvious one is speed, but that is not the only area. There are other areas that Mike believes to be just as important when evaluating code for production readiness.

These include:

  • Lighthouse score, at least green (ideally, 90+)
  • No errors in console
  • Assets are optimized for delivery (Images, minified JS, minified CSS for websites)
  • UI is optimized across devices (Mobile, Tablet, Desktop)
  • The services have been security tested

This is a great list of things to consider for production readiness. I would also add to the list:

  • Load testing (if there is a hard requirement for performance)
  • Optimize for resiliency
    • Feature flag to toggle off feature for graceful degradation
    • A small feature failing in the app doesn’t blow up the whole app (error is contained)
    • Testing expected behaviour during success and failure in the code
  • Optimize feature for recovery
    • Thinking about strategies for when things go wrong (rollback, fix forward, or toggle off the feature if possible)
  • Accessibility
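A sketch of the feature-flag idea for graceful degradation, using an invented in-memory flag store (a real system would use a flag service). The point is that a failing recommendations widget doesn't take down the whole page:

```python
# Hypothetical in-memory feature flags; real systems would use a flag service.
FLAGS = {"recommendations": True}

def get_recommendations(user_id: str) -> list:
    # Simulated flaky downstream dependency
    raise TimeoutError("recommendation service unavailable")

def render_home(user_id: str) -> dict:
    page = {"user": user_id, "recommendations": []}
    if FLAGS.get("recommendations"):
        try:
            page["recommendations"] = get_recommendations(user_id)
        except Exception:
            # Contain the failure: the page still renders without this widget.
            # A real system would also log the error for observability.
            pass
    return page

print(render_home("u-1"))  # {'user': 'u-1', 'recommendations': []}
```

Flipping `FLAGS["recommendations"]` to `False` skips the flaky dependency entirely, which is the "toggle off for graceful degradation" strategy from the list above.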

Of course, this isn’t a comprehensive list of everything, but I think it covers a good chunk of it.

Beyond just thinking about optimizing everything, I think it is just as important to holistically understand the limitations and constraints of the code we intend to release to production.

Ideally, we also have strategies in place to mitigate or get past those limits and constraints.

Security & Compliance

I believe this is a topic that the guests and I all agree on to be one of the core tenets of production readiness.

Beyond just having everything working, optimized, and observable, we also need to ensure we perform security audits on our code. Typically, you would work closely with a security team to do this.

However, if that is not possible, a quick audit against the OWASP Top 10 is a good start, though it is not the be-all and end-all.

Some examples include:

  • XSS prevention - sanitizing HTML injected into the page
  • open redirects
  • Spam and bot mitigation (especially for endpoints with SMS, email, and other spam-sensitive features)
  • CSRF protection is in place to guard against forged requests on authenticated sessions (for websites)
  • Sensitive endpoints are gated by authentication
    • core data mutation endpoints (withdraw funds, or large purchase) are gated by 2FA
  • SQL injection or injection of any kind to the API (inputs should be sanitized and validated)
  • API doesn’t expose more data than it needs to (or even sensitive information)
    • Good to also ensure the API does not unintentionally leak internal details or state (stack traces or personal info)
  • Ensure data (at rest and in-transit) is properly encrypted
  • and many more
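To illustrate the SQL injection item, the standard mitigation is a parameterized query, where user input is bound as a value rather than concatenated into the SQL string. A minimal sketch with Python's built-in `sqlite3` (table and data invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

def find_user(conn, name: str):
    # Parameterized query: input is bound via `?`, never concatenated into SQL
    return conn.execute(
        "SELECT email FROM users WHERE name = ?", (name,)
    ).fetchall()

print(find_user(conn, "alice"))             # [('alice@example.com',)]
print(find_user(conn, "alice' OR '1'='1"))  # [] -- injection attempt matches nothing
```

Because the driver treats the whole input as a literal value, the classic `' OR '1'='1` payload is just an unusual name that matches no rows.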

Beyond just security hardening of code, we also need to be mindful of Personally Identifiable Information (PII) and meeting compliance requirements (i.e. GDPR compliance for EU users, HIPAA, PCI).

For GDPR compliance, this includes proper handling (consent, retrieval, storage, and deletion) of a user’s data. You must also ensure confidentiality so that the data is not leaked and does not get into the hands of bad actors.

For more information, please visit gdpr.eu.

So, part of being production ready is ensuring that the code has been audited for security flaws and data compliance.

Pre-production Checklist

This is one I personally added as something I feel is part of being production ready: considering the pre-production checklist of items to be done before we go live.

Sometimes there will be no changes required, but other times, there will be many items that need to be updated and changed in the production environment.

I have noticed over the years that when taking a feature to production, there are always a bunch of things that need to be done in the production environment before the code can be deployed.

So, I definitely think it’s worth having a checklist of items you need done before going to production (ie updating configurations, configuring resources, CMS changes, configuring some infrastructure).

You definitely don’t want to find out you missed something critical once the code is already in production.

This may be something small but it is easy to forget if you don’t note it down, especially if you are jumping between projects, stories and bug tickets.

Impact

This one is a little bit of a tangent from the rest of the points we already covered, and that topic is on impact.

At the core, before we move to production, Shawn mentions “we need to consider whether we are solving the problems we set forth to solve in the first place”.

I agree, and this is a process that is typically done in consultation with key stakeholders involved with the project.

This could be something as simple as having a quick demo of the finished product with the group to get some consensus, tying up any loose ends, and ultimately to gauge how ready the whole group feels about going to production.

Also, we need to consider the downstream implications and impact for adjacent teams like customer and technical support. For example, the changes we make can have a direct impact on the number of support tickets being created if things do go wrong.

So, it is good to be mindful of the type of impact that your change may introduce. The support team, or any other teams, will likely need to be looped in to get context and training on the change (especially if it is a new feature).

So, documenting the change will likely be highly valuable to the other teams! This is a great segue into the next section.

Documentation

This one may be viewed as a “nice to have” by many people, but I believe it is a core tenet of having code that is production ready.

Documentation is a pretty broad term but to Mike, this could be something as simple as adding some documentation describing the new feature being added.

I like to think of documentation as giving the next person enough context to be productive or carry on where you left off. In terms of production readiness, that typically means being productive within the whole software development life cycle (SDLC).

Someone needs to be able to jump in, and be able to get enough context to build, test, deploy and monitor your services. So, ideally, you’d have documentation about each of those areas.

The documentation may include:

  • Service architecture
  • Continuous integration
  • Deployment workflow
  • Disaster recovery
  • Links to key indicators (ie dashboards, for SLA, SLO, APM, log filters)
  • Core alerts and Paging setup
  • Caveats or problematic areas
    • Areas to watch out for (and how to resolve it if problem arises)
    • Design decisions, constraints and limitations

This will very likely vary from team to team, but I would just think back to the question of what information someone else would need to successfully run and operate this service, from feature development all the way to production.

This is becoming more and more important as teams scale, and become more distributed across time zones.

Code reviews

Finally, we tie it all together with code reviews. Code reviews are a great time to evaluate many if not all of the things listed above to determine if code is production ready.

Not everything may be applicable as every project is different but it is a good reference to build upon.

To review for production readiness, Mike also mentions reviewing the Twelve-Factor App checklist. It is a great list to review!

Other than the items listed above, the guests and I agree on reviewing for code maintainability and readability. Once the code is in production, we need to make it easy to make changes while also ensuring we don’t break any contracts for the existing business logic in our code.

So, that means having the right tools in place (testing, dev tool automation - linters, formatting, continuous integration and delivery) to set us up for success as we make more changes in the future.

Conclusion

So, to wrap everything up: these are the categories that the guests and I look for when we are reviewing code for production readiness.

Of course, you may find that you won’t need everything on the list or you have other areas you evaluate for production readiness in your project.

If you are unsure, feel free to use this list as a good starting point.

Ultimately, it’s about solving the business problems you set forth to solve and being able to consistently meet that goal by measuring and observing it. Of course, this is done in collaboration with many other team members outside of engineering!

This isn’t a must-have list, but more a list of items to strive for in order to achieve long-term and predictable value.

I believe we should also strive for pragmatism in our approach. There will always be constraints on time, budget, and quality.

Rather than striving for perfection, we should always strive for effectiveness, meaning creating the most value from the resources that we have. As Shawn emphasizes, developing software is an exercise in constraint management.



Jerry Chang 2022. All rights reserved.