Have you ever planned and developed a software product for months, only to have it break down when you start getting real users? It happens more often than you would think, and it is a hard lesson to learn. I talked to people at all levels who had faced such situations and worked towards a solution by learning from the experience of multiple product teams.
For most teams, it started with the ambition to venture into a new market or product category. The revenue projections excited the team, or there was FOMO (Fear of Missing Out). This made them think they had to get the product out quickly before someone else came in. They decided either to develop the product in-house or to onboard a vendor.
On a strict schedule, the developers started coding, and the business team was happy to see their UI/UX wireframes turning into beautiful screens. Hi-fi’s all around, and the team was happy to see the progress, so they pushed the tech team harder. UI screens were getting done faster. The app looked amazing and the team was sure their users would love it! The marketing team started acquiring a lot of new users. And then, disaster struck!
The app started crashing. User complaints increased tenfold. Basic features stopped working. And all this happened in front of live users, which hurt the brand. This led to tensions and in-fighting, but didn’t really solve the problem. So what went wrong?
The builder’s dilemma
A new building goes through many steps on the way to the completed structure.
Usually a design firm and a construction firm jointly take up the project (hence the name Design-Build process). The design firm has interior and exterior designers, who ensure the building looks good. The construction firm has architects and structural engineers who work on risk assessment and mitigation.
They do this by estimating the forces that the building will have to endure in a real-world scenario. Using precise equations of mechanics, they draw up a structure that will help the building withstand outside forces. The hardworking labourers and workers start the actual construction work only after both the design and the structure are approved.
What will happen if you give the design of the building directly to the labourers without going through the architect or structural engineer? They will build it, it will look great and will definitely save a lot of time. But there is no guarantee that the building will survive the next earthquake, cyclone…or a breeze.
In software terminology, you can think of the design firm as the UI/UX team and the construction firm as the development team. Even if the outer frame, i.e. the user interface, looks good, that is no guarantee that the product is scalable and robust.
You cannot evaluate robustness on the basis of how good the product looks. Then how should you ensure that your product will scale?
In the software world, there is a dedicated role called Software Architect. The job of the architect is to assess risks and mitigate them. The 3 principles of scalability are Availability, Performance and Reliability. A good architect will be able to ensure that your app guarantees 2 out of the 3 principles at any given moment.
You can hire an in-house Software Architect or hire a software architecture firm BEFORE starting the actual development work. The technical development firm will have to adhere to the architecture laid down by the architectural firm. This ensures that the risk mitigation planned by the architect is actually implemented in the project.
After the project implementation is done, you can verify the scalability of the project by measuring key metrics. Think of these as the health metrics of your application. Any change in these numbers will help you quickly detect the root cause and save the life of your app!
Response Time: This is the time in milliseconds that the server takes to respond to a given client request. The lower the better (tending to 0). If your response time is fine in a low-traffic internal environment but spikes in a live environment with real users, you would definitely want to study and fix it.
Memory Usage: The RAM (Random Access Memory) allocated to your servers is used as temporary storage to process information. Any spikes here would mean that you are either not allocating enough memory or that memory is not being freed up after it is no longer needed.
Throughput: The number of requests that your servers can serve per second. Response time should not be affected by your throughput. If your response time rises in direct proportion to throughput, you might want to have the issue checked and fixed.
Users / <Metric>: This ratio is used to quickly arrive at the metric which is getting affected by the increase in the number of users. This is a performance metric which can help you optimize your system for a smoother user experience.
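To make these health metrics concrete, here is a minimal Python sketch of how they could be computed from load-test measurements. It is an illustration under stated assumptions, not a prescribed method: the LoadSample structure, the function names and all the numbers are made up for this example, and using the median as the summary of response time is just one reasonable choice.

```python
import statistics
from dataclasses import dataclass


@dataclass
class LoadSample:
    """One measurement window from a load test (hypothetical structure)."""
    concurrent_users: int     # users active during the window
    requests_served: int      # requests completed during the window
    window_seconds: float     # length of the window in seconds
    response_times_ms: list   # per-request response times in milliseconds
    memory_mb: float          # server memory in use at the end of the window


def throughput_rps(sample):
    """Throughput: requests served per second in this window."""
    return sample.requests_served / sample.window_seconds


def median_response_ms(sample):
    """Response time, summarized here as the median (an assumption)."""
    return statistics.median(sample.response_times_ms)


def health_metrics(sample):
    """Collect the three health metrics for one window."""
    return {
        "response_ms": median_response_ms(sample),
        "memory_mb": sample.memory_mb,
        "throughput_rps": throughput_rps(sample),
    }


def users_per_metric(sample):
    """Users / <Metric> for each health metric in this window."""
    metrics = health_metrics(sample)
    return {name: sample.concurrent_users / value
            for name, value in metrics.items()}


# Made-up numbers: a quiet internal test and a busier live window.
internal = LoadSample(100, 500, 10.0, [100, 110, 120], 200.0)
live = LoadSample(1000, 5000, 10.0, [400, 440, 480], 2000.0)

for label, sample in (("internal", internal), ("live", live)):
    print(label, health_metrics(sample), users_per_metric(sample))
```

Comparing the ratios between the two windows shows which metric changes character as users grow. How to interpret a given shift depends on the metric, so the sketch only reports the numbers and leaves the judgment to the reader.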