Production features vs fixing nuts, bolts and tools

Deepti Mittal
5 min read · Oct 8, 2024


Photo by Nina Mercado on Unsplash

Programming languages have progressed a lot, with advanced frameworks available and the heavy lifting for scaling, storage, security, messaging and many other areas mainly done by cloud service providers. It takes a month or a little more to develop and deploy basic functionality for a new product or idea.

A good development team can wow company management by turning an idea into reality in a few months, something that used to take at least a year a few years back.

Don’t mistake this for POC code. I am talking about production-grade code with the first feature in it, which is something we achieved in my recent project.

What most teams fail to think about and plan for is supporting the feature in production, which I call the nuts, bolts and tools to fix the issues.

Below is the list of nuts, bolts and tools which I think are essential and which I have been focusing on for the last 2 months.

Build, deployment and test pipelines

It is not enough to have a pipeline that deploys code to production; it should be a smooth pipeline, free of false negatives and false positives, that gives us feedback on every commit. The ability to run experiments or performance tests through pipelines is crucial for speeding up development work, and running them through pipelines provides visibility and traceability as well.

Code quality checks in local

No one can deny that issues found before committing code have 60% less turnaround time than issues found after committing it. It’s always a good idea to run quality profiles, unit tests and, if possible, e2e tests as well before committing code. Some of this can easily be configured through Git pre-push hooks.
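To make the hook idea concrete, here is a minimal sketch of a pre-push script. It assumes a Python project that uses flake8 and pytest; those two commands and the helper names `first_failing_check` and `main` are my own placeholders, so swap in whatever checks your project actually runs.

```python
import subprocess
import sys

# Assumed checks: replace these with your project's own lint and test commands.
CHECKS = [
    ("lint", [sys.executable, "-m", "flake8", "."]),
    ("unit tests", [sys.executable, "-m", "pytest", "-q"]),
]


def first_failing_check(checks):
    """Run each check in order; return the name of the first failure, or None."""
    for name, command in checks:
        result = subprocess.run(command)
        if result.returncode != 0:
            return name
    return None


def main(checks=CHECKS):
    """Return the exit code the hook should finish with (0 lets the push go ahead)."""
    failed = first_failing_check(checks)
    if failed is not None:
        print(f"pre-push: '{failed}' failed, push aborted", file=sys.stderr)
        return 1
    return 0
```

To use it as a hook, save it as an executable `.git/hooks/pre-push` file and end the script with `sys.exit(main())`. Git aborts the push when the hook exits with a non-zero status.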

Performance of the system

I have witnessed this debate many times: let’s build the feature first and then we will look into its performance. I believe this is where the software industry differs from the hardware industry. Can you ever imagine someone saying, let’s first launch the car and then we will see if it can really run at 100 kmph? Or take a home appliance like an oven: would anyone decide to first release it with just the food-warming feature and worry about speed or temperature intensity later?

It is relatively easy to change software and fix bugs after release, and that leads us to decisions like deferring performance testing. One thing to note: it takes a lot more time and mental energy to fix issues later and to put out the fire when issues arise in production.

For software, performance testing refers not just to latency tests but also to load testing, stress testing and scalability testing. Covering all of it before the first release might not always be feasible, but at least the non-negotiable ones, based on the use cases, should be considered, and then an informed call should be made involving all the stakeholders.
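Even a small load test before the first release gives the stakeholders a number to discuss. The sketch below is only an illustration using the Python standard library: `operation` stands in for a real request to your service, and the function names, request counts and latency budget are all assumptions of mine, not values from my project.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(sorted_values, fraction):
    """Nearest-rank percentile of a sorted, non-empty list (fraction in 0..1)."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]


def timed_call(operation):
    """Return how long one call to operation takes, in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def run_load_test(operation, total_requests=200, concurrency=20):
    """Call operation total_requests times from a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(
            pool.map(lambda _: timed_call(operation), range(total_requests))
        )
    return {
        "p50": percentile(latencies, 0.50),
        "p95": percentile(latencies, 0.95),
        "max": latencies[-1],
    }


def within_budget(report, p95_budget_seconds):
    """True when the measured p95 latency meets the agreed budget."""
    return report["p95"] <= p95_budget_seconds
```

Running `run_load_test` against a staging endpoint from the pipeline and failing the build when `within_budget` returns `False` is one way to keep the non-negotiable numbers visible on every change. A dedicated load testing tool will do this better; the point is to have some check before release.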

Observability & Metrics

I also call this the tools for production support and health checks. This area gets no better treatment than performance, and sometimes worse. Issues in production are inevitable, but what worries me is not having the right tools in place to perform RCA (root cause analysis), and not knowing how our services are doing in terms of overall health. How would we know if servers start going bad, and since when they have been showing symptoms? Is it a hardware issue, or did the system suddenly get too much load? Without the right metrics in place we might not be able to answer these questions or sleep peacefully while our code is running in production.
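As a rough illustration of the kind of numbers I mean, here is a small in-memory tracker that keeps request count, error rate and average latency over a sliding window. The class name `HealthTracker` and its fields are hypothetical; in a real system these values would be exported to whatever metrics backend your team uses rather than held in process memory.

```python
import time
from collections import deque


class HealthTracker:
    """Keeps recent request outcomes so a health endpoint can report on them."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock  # injectable so the window can be tested
        self._events = deque()  # (timestamp, succeeded, latency_seconds)

    def record(self, succeeded, latency_seconds):
        self._events.append((self._clock(), succeeded, latency_seconds))

    def snapshot(self):
        # Drop events that have fallen out of the sliding window.
        cutoff = self._clock() - self._window
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        total = len(self._events)
        errors = sum(1 for _, succeeded, _ in self._events if not succeeded)
        latency_sum = sum(latency for _, _, latency in self._events)
        return {
            "requests": total,
            "error_rate": errors / total if total else 0.0,
            "avg_latency": latency_sum / total if total else 0.0,
        }
```

A sudden jump in `requests` points towards load, while a rising `error_rate` with flat traffic points towards the service or the hardware underneath it. That is the kind of question the metrics should help answer.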

Logging

People who have worked on production issues understand the importance of logging and how helpless we feel when there are not adequate logs or the logs are not traceable. Logs use a lot of network bandwidth and perform IO operations, hence identifying the crucial points for logging is also important. To know whether we have adequate logging in place, don’t use debug breakpoints; rely on logs while analysing an issue in the local development setup. If we have to debug the application through breakpoints, we don’t have the right logging in place.

Just having logs is not enough; they need to be readable, searchable and traceable.
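One way to get searchable and traceable logs is to emit each entry as JSON and attach a request id to every line. The sketch below uses only the Python standard library; the names `request_id_var`, `RequestIdFilter`, `JsonFormatter` and `build_logger` are illustrative, and the field names are an assumption rather than any standard.

```python
import contextvars
import json
import logging
import sys

# Holds the id of the request currently being handled (hypothetical name).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copies the current request id onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object per line, easy to search."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "request_id": getattr(record, "request_id", "-"),
            "message": record.getMessage(),
        })


def build_logger(name, stream=sys.stderr):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines from the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger
```

Setting `request_id_var` once at the start of a request, for example `request_id_var.set("req-123")`, means every later `logger.info(...)` call for that request carries the same id, so one search pulls up the whole story of a failing request.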

Error Handling & Resilience

“Let’s build the happy path first” is a statement used in iteration 1 of every project.

“This is a very edge-case scenario, it will not happen in production” is a statement used in iterations 4–5 of a project.

The right time to take care of error handling is while analysing, designing and implementing a feature for the first time. In every phase we need to consider error handling, or at least check that the analysis and design outcomes can be extended to support it. Sometimes handling an error scenario later requires major refactoring or a framework change, causing rework.

Also think through how your system will recover from failure. Can it recover by itself, or would it need manual intervention?

If it needs manual intervention, do we have the right alerting in place for it?
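The two questions above, self-recovery and alerting, can be sketched together: retry a failing call a few times with exponential backoff, and raise an alert only when the retries are exhausted. This is a minimal illustration, not a full resilience strategy; `call_with_retry`, the `alert` callback and the default values are all assumptions of mine.

```python
import time


def call_with_retry(
    operation,
    alert,
    max_attempts=3,
    base_delay_seconds=0.5,
    retryable=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Try operation up to max_attempts times, then alert and re-raise.

    Only errors listed in retryable are retried; anything else propagates
    immediately, because retrying a programming error will not fix it.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable as error:
            last_error = error
            if attempt < max_attempts:
                # Exponential backoff: 0.5s, 1s, 2s, ... with the defaults.
                sleep(base_delay_seconds * 2 ** (attempt - 1))
    # The system could not recover by itself, so ask for manual intervention.
    alert(f"giving up after {max_attempts} attempts: {last_error!r}")
    raise last_error
```

Here `alert` could be any callable that pages the on-call person or posts to a channel. Keeping it as a parameter makes the behaviour easy to test without sending real alerts.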

Developer experience

One of the most underrated requirements for any project, because no one captures metrics around how much time devs spend on local setup, on testing code locally to get faster feedback, and on being able to experiment quickly.

Making changes to run the code locally once in a while is okay, but if I have to do it every time, here are a few issues which arise:

  • A new developer will not know about it, will face issues, and then someone will have to help them. I know you are thinking about README files. How many of us read the README first? Only when things are not working as expected do we look for instructions in the README.
  • Making changes for local setup makes commits risky and painful. We have to be cautious to commit only the required changes. So many times I have witnessed people committing credentials by mistake, which goes unnoticed for a while. Git pre-push hooks can prevent this. Another way to prevent these accidental commits is working with configurations and environment variables.
  • I recently faced an issue where the code was working through the terminal, but running it through IntelliJ required code changes. It was inconvenient to make code changes every time we took a pull and then make sure these changes were not committed. So having the code work locally is not enough; it is important that it works in the whole development environment, which includes build tools, IDEs and anything else devs use for development. Someone in a hurry, or on a Friday evening, might just commit those unwanted changes or secrets, and we might not realise it till late.
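For the configuration and environment variable approach mentioned in the list above, a minimal sketch looks like this. The variable names `APP_DATABASE_URL`, `APP_API_KEY` and `APP_DEBUG`, and the `Settings` class, are made up for illustration; the idea is only that secrets live outside the code and the application fails fast with a clear message when they are missing.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names; use whatever your project defines.
REQUIRED_VARIABLES = ("APP_DATABASE_URL", "APP_API_KEY")


@dataclass(frozen=True)
class Settings:
    database_url: str
    api_key: str
    debug: bool


def load_settings(environ=os.environ):
    """Build Settings from environment variables, failing fast if any are missing."""
    missing = [name for name in REQUIRED_VARIABLES if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(
        database_url=environ["APP_DATABASE_URL"],
        api_key=environ["APP_API_KEY"],
        debug=environ.get("APP_DEBUG", "false").lower() == "true",
    )
```

With this in place, a developer's local credentials stay in their shell or in an untracked file, so there is nothing in the working tree to commit by mistake, and a new developer gets an error naming exactly which variables to set.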

All of these are very broad topics, but together they form a list to consider while planning the phase 1 release of your project. If, as a development team, you sign up for just building functionality and handing it over for integration or release, the team may get to celebrate an on-time release, but it should also pray that the celebration is not disturbed by issues coming from production.

Work on some of these areas will not be a one-time effort; it will need to keep evolving as the product grows. It is important to keep this list ready and plan the enhancements along with feature planning.


Deepti Mittal

I have been working as a software engineer in Bangalore for over a decade. I love to solve core technical problems and then blog about them.