Technical Excellence

Ivan Evsikov
12 min read · Jun 23, 2021

Problem

As a technical leader, you may ask yourself — what should I do to be sure that our software can successfully meet all of our stakeholders' requests? In other words, let's look at several important areas and best practices which you can use to sleep well at night while the company's product is making money. This article is inspired by the book Designing Data-Intensive Applications by Martin Kleppmann, and here are my thoughts based on my years of experience in the software industry.

Important disclaimer: we know that "premature optimization is the root of all evil" — thanks to the great Donald Knuth. At the same time, we should not forget the second part of this quote — "Yet we should not pass up our opportunities in that critical 3%." The hardest part is to distinguish between "premature optimizations" and the "opportunities" to save money and time tomorrow.

Reliability

The first question we should ask about our application is — how well can our system handle failures? Try to think in advance about the possible issues your software product can have and what their consequences might be. Based on your load profile and business specifics, you should determine which types of issues have the highest damage potential. It makes sense to start mitigating risks with those.

Hardware

Hardware issues are the first type of problem you can run into. There are a lot of things that can go wrong with hardware — overheating, a network break, a power outage, hard disk drive failure, and even fire. Obviously, the first thought that should come to mind here is backups. Any project of any size ought to have them from the very beginning of its life. Backups have to be regular and automatic. If they are manual — Murphy and I will guarantee that one day you will forget to take a dump right before your disk kicks the bucket. Another crucial piece of advice regarding backups — restore your system from them periodically. There is no other way to know that your backup system is working properly. Ideally, the restore process should also be automated — one button, and the rest is done by hardworking "robots." To increase reliability, it makes sense to store your backups in a different data center.
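
To make this concrete, here is a minimal sketch of an automated backup with restore verification, assuming a PostgreSQL database and a cron- or timer-driven script; the paths, connection strings, and database names are placeholders, not a prescription:

```python
import datetime
import pathlib
import subprocess

BACKUP_DIR = pathlib.Path("/var/backups/app")          # hypothetical location
DB_URL = "postgresql://backup_user@db-primary/appdb"   # hypothetical connection string

def take_backup() -> pathlib.Path:
    """Dump the database to a timestamped file; meant to be run by cron or a systemd timer."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    target = BACKUP_DIR / f"appdb-{stamp}.dump"
    subprocess.run(["pg_dump", "--format=custom", f"--file={target}", DB_URL], check=True)
    return target

def verify_backup(dump_file: pathlib.Path) -> None:
    """The 'restore periodically' part: load the dump into a scratch database.

    If this step fails, the alerting system (see the monitoring section) should wake somebody up.
    """
    scratch = "postgresql://backup_user@db-restore-test/appdb_check"  # hypothetical scratch DB
    subprocess.run(["pg_restore", "--clean", f"--dbname={scratch}", str(dump_file)], check=True)

if __name__ == "__main__":
    verify_backup(take_backup())
```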

Let's say your solution should handle hardware failures imperceptibly for your customers. At this point, redundancy joins the chat. It means that, at least for your database storage, you should have several machines. Ideally, replicas should live on different physical storage or even in different data centers and regions, but having redundancy even inside the same provider will improve reliability. It may be that data replication is not enough for your SLA. In this case, you have to think about having several instances of your application on different physical servers or even in several data centers. This is very closely related to scalability, and there is a good software design principle that can help you improve both of them in the future — "make two." It means that when designing your solution, you need to build in the possibility of having more than one instance of key elements — i.e., two databases, two workers, two network connections, two IP addresses, etc. As a result, even though at the moment you might only have one, later it will be much easier to add more and scale at this point.
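
Here is a small illustration of the "make two" principle applied to configuration; the class and connection strings are invented for the example, and the point is simply that key resources are lists from day one, even while each list has a single entry:

```python
from dataclasses import dataclass, field

@dataclass
class AppConfig:
    # Lists, not scalars: adding a second replica later is a config change, not a redesign.
    database_urls: list[str] = field(default_factory=lambda: [
        "postgresql://app@db-eu-1/appdb",    # today: the only replica
        # "postgresql://app@db-us-1/appdb",  # tomorrow: uncomment and deploy
    ])
    worker_addresses: list[str] = field(default_factory=lambda: ["10.0.0.5:8000"])

def pick_database(config: AppConfig, request_id: int) -> str:
    """Trivial round-robin that works unchanged for one replica or ten."""
    return config.database_urls[request_id % len(config.database_urls)]

print(pick_database(AppConfig(), request_id=7))
```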

Software

The next class of problems is software issues. It is inevitable that from time to time your application will run into a state it was not programmed to handle properly — as a result, different things may happen: the program can stop responding, fail with an exception, or even corrupt the data.

First of all, you should know when this has happened in your system. Because of that, different kinds of monitoring are really indispensable. The simplest kind is plain log collectors. With them, you should be able to analyze problematic situations that happened a relatively long time ago — days, or even weeks and months. Business analytics systems that collect users' actions are also useful for investigating past incidents and customer complaints. In the end, the most important tool is the alert system. It is crucial to have an iron guard that will call you automatically and immediately if something is not going well right now — i.e., some services are not responding, a bunch of new dangerous exceptions appeared in the logs, or resource utilization on your server is close to 100%. It can be very annoying, but eventually, if you configure it properly, it can save you a lot of money and stress.
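
As an illustration of such an "iron guard," here is a deliberately simplified check-and-page loop built only on the standard library; the health endpoints and the webhook URL are placeholders, and a real setup would rather rely on tools like Prometheus, Alertmanager, or PagerDuty:

```python
import json
import shutil
import urllib.request

HEALTH_ENDPOINTS = ["http://api.internal/health", "http://worker.internal/health"]  # placeholders
ALERT_WEBHOOK = "https://alerts.example.com/page-on-call"                            # placeholder

def page_on_call(message: str) -> None:
    """Send the alert that actually wakes somebody up."""
    body = json.dumps({"text": message}).encode()
    request = urllib.request.Request(ALERT_WEBHOOK, data=body,
                                     headers={"Content-Type": "application/json"})
    urllib.request.urlopen(request, timeout=5)

def run_checks() -> None:
    # Disk close to 100% is exactly the kind of thing that should page immediately.
    usage = shutil.disk_usage("/")
    if usage.used / usage.total > 0.9:
        page_on_call(f"Disk usage at {usage.used / usage.total:.0%} on the app host")

    for url in HEALTH_ENDPOINTS:
        try:
            urllib.request.urlopen(url, timeout=5)
        except Exception as exc:  # any failure here means the service is not responding
            page_on_call(f"Health check failed for {url}: {exc}")

if __name__ == "__main__":
    run_checks()
```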

Testing is the most powerful technique for preventing software issues. This is a big topic, and a ton of books have been written solely about it. Here we can only mention that you need to automate regression checks as much as possible. Otherwise, you will get stuck with a relatively small number of manual checks while the number of issues you miss keeps growing. At the same time, you should try to keep a good balance between different types of tests. This is important because end-to-end tests are effective but very expensive — they require a lot of resources to write and run. Based on my experience, what is written in tests matters much more than how many tests you have or even how good the coverage is. In other words, having smart quality assurance engineers who can really challenge your software and find weak spots in the dark hidden corners of your app is a key factor in successfully fighting bugs.
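
For illustration, here is a tiny sketch of that balance using pytest (my choice for the example, not a requirement): many cheap unit-level regression checks plus very few expensive end-to-end flows; the function under test and the `e2e` marker are invented names:

```python
import pytest

def calculate_invoice_total(items: list[dict]) -> float:
    """Toy function under test (invented for the example)."""
    return sum(item["price"] * item["quantity"] for item in items)

def test_empty_invoice_regression():
    # Fast, isolated regression check: say an earlier release crashed on empty invoices.
    assert calculate_invoice_total([]) == 0

def test_rounding_regression():
    assert calculate_invoice_total([{"price": 0.1, "quantity": 3}]) == pytest.approx(0.3)

@pytest.mark.e2e  # expensive: keep only a handful of these, run against a staging environment
def test_full_checkout_flow():
    pytest.skip("End-to-end flow runs only in the nightly pipeline")
```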

Let's say you have done everything described above. The bad news is that issues still can and will happen. You don't have a choice: you have to build into your system's architecture scenarios for handling runtime issues gracefully and automatically. First of all, it means that your software should be prepared to handle failure. For example, on catching an exception, it needs to revert the partial transaction, close all opened resources, and restart itself. Moreover, if you have only one service process, your users will have to wait until it is restarted. It means that if you don't want to make your customers wait, you need to distribute the load by splitting your system into several processes that can substitute for each other. Another important advantage of multi-worker systems is that you can deploy software updates gradually. In other words, you can update a small number of processes in the system first, check how they work, and then continue updating the rest of them — or shut down and revert the updated ones if something goes wrong.
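
A minimal sketch of such graceful handling might look like the following; the job queue, the table, and the SQLite connection are stand-ins for your real infrastructure:

```python
import logging
import sqlite3
import time

log = logging.getLogger("worker")

def process_job(conn: sqlite3.Connection, job: dict) -> None:
    # Two writes that must succeed or fail together (table and fields are invented).
    conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (job["amount"], job["src"]))
    conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (job["amount"], job["dst"]))

def worker_loop(fetch_job) -> None:
    while True:
        conn = sqlite3.connect("app.db")  # placeholder for the real database connection
        try:
            process_job(conn, fetch_job())
            conn.commit()
        except Exception:
            log.exception("Job failed; rolling back and restarting the iteration")
            conn.rollback()   # revert the partial transaction
            time.sleep(1)     # back off instead of hot-looping on a persistent error
        finally:
            conn.close()      # always release opened resources
```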

Human factor

The last, but by no means least, difficulty is people. The main problem is that we are unpredictable. A lot of things can affect us, and as a result, we may make mistakes. Fortunately or unfortunately, we can't avoid human participation in some processes, but what we can do is help people do the right things.

The first thing you can do to reduce the possibility of human mistakes is to design proper interfaces. If there is some dangerous action a user can perform, you should think twice — should this button exist at all? At the very least, this control should probably be available only to certain specialists. You may also hide it somewhere or make the path to this functionality deliberately harder. Besides that, for some operations it may be a good idea to request additional approval from other team members — a second pair of eyes can notice the mistake of the first one and prevent a disaster. Finally, your control interfaces should show explicit and clear information about the outcomes of a particular action. To summarize: encourage doing good things and make doing dangerous things complicated.
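
As a rough sketch of the "second pair of eyes" idea, the destructive operation below refuses to run until a different person has approved it; the action name and the in-memory approval store are invented for the example:

```python
approvals: dict[str, set[str]] = {}  # action id -> people who approved it (in-memory for the example)

def approve(action_id: str, approver: str, requester: str) -> None:
    if approver == requester:
        raise PermissionError("The approver must be a different person than the requester")
    approvals.setdefault(action_id, set()).add(approver)

def delete_all_customer_data(action_id: str, requester: str) -> None:
    """An invented destructive operation that refuses to run without a second approval."""
    if not approvals.get(action_id):
        raise PermissionError(f"Dangerous action {action_id!r} needs approval from another team member")
    print(f"{requester} ran {action_id}, approved by {sorted(approvals[action_id])}")  # destructive work goes here

approve("purge-2021-06", approver="bob", requester="alice")
delete_all_customer_data("purge-2021-06", requester="alice")
```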

It is good if people have solid practical knowledge of the system they use. To get that, they should try to do everything with their own hands — click all the buttons, imitate all possible crazy situations, and basically just break everything they can. For sure, you can't allow them to do it in a production environment. Because of that, it is a very popular and useful practice to create some sort of "sandbox." These are also often called "stagings" — system instances that resemble the production environment as much as possible. It can even be just a clone of your main product, but with obfuscated customer data.
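
Here is a tiny sketch of obfuscating customer data while cloning production into such a sandbox; the column names are invented, and a real pipeline would also cover foreign keys, free-text fields, and anything else that can leak personal data:

```python
import hashlib

def obfuscate_email(email: str) -> str:
    """Deterministic fake address: the same input maps to the same output, so joins keep working."""
    digest = hashlib.sha256(email.encode()).hexdigest()[:12]
    return f"user_{digest}@staging.invalid"

def obfuscate_row(row: dict) -> dict:
    # Column names are invented for the example.
    return {**row, "email": obfuscate_email(row["email"]), "full_name": "Staging User", "phone": None}

print(obfuscate_row({"id": 42, "email": "alice@example.com", "full_name": "Alice", "phone": "+1 555 0100"}))
```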

It is worth mentioning that the techniques described above for software and hardware issues are also useful for preventing human errors — especially backups and testing. The first is like a save point in a computer game: ideally, you should be able to revert any unwary move of your staff members. The second should check that there is no sequence of clicks that sends some part of the project, or the whole system, to hell.

The most complicated and most impactful part of the human factor problem is management. This topic alone includes tons of best practices, recommendations, and exercises. If you can remember only one idea from this article, let it be this rule: if something went wrong — check the management system first. In other words, every time even some purely technical thing is broken, start by analyzing the human relationships around the problem. It also helps to improve your management practices constantly; when properly tuned, a good culture can motivate people to solve all the other problems.

Scalability

Is your application scalable? This is a tricky question, and to answer it, you first need to determine — scalable for what? The problem is that you can't, and shouldn't even try to, design a system that is suitable for any workload. In other words, there is a huge difference between the architecture of a program designed to handle 10 requests per day with 5 seconds of processor time each and a system that handles 30k requests per second with the same computational complexity. So, to talk about scalability, you should first describe your current load properly — i.e., how many read and write requests per unit of time, how big they are, how much data you should store and for how long, which types of requests you have, and how quickly each of them should be handled. Then determine your goals in the same terms — what are you going to have tomorrow, next quarter, next year, and in the next five years? Only with this on the table can we say whether your application is scalable or not.
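
It helps to write this profile down explicitly; the numbers below are invented, but having the current figures and the targets side by side is what turns "is it scalable?" into an answerable question:

```python
from dataclasses import dataclass

@dataclass
class LoadProfile:
    read_rps: float        # read requests per second
    write_rps: float       # write requests per second
    avg_request_kb: float  # typical request payload size
    stored_tb: float       # total data kept, in terabytes
    p99_latency_ms: float  # target for the slowest 1% of requests

current = LoadProfile(read_rps=120, write_rps=15, avg_request_kb=4, stored_tb=0.8, p99_latency_ms=300)
next_year = LoadProfile(read_rps=1500, write_rps=200, avg_request_kb=4, stored_tb=6, p99_latency_ms=300)
```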

Service level objectives

When describing performance, it is crucial to understand that we always deal not with single numbers but with distributions of numbers. Let's say you want a particular request to be handled in under 5 seconds. Does it mean that all such requests should be shorter than this time? Usually, there are many factors that affect the resulting latency — network latency, client-side process performance, current server-side load, and even mechanical vibrations. Because of this, it is super hard and sometimes even impossible to provide a 100% guarantee. That is why it is very popular to use percentiles to describe SLOs (service level objectives).
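
Here is a tiny worked example with made-up latencies: saying "p95 < 5 s" means that 95% of requests finish within 5 seconds, while the remaining 5% may be slower:

```python
import statistics

latencies_s = [0.8, 1.1, 1.3, 1.4, 1.6, 1.9, 2.2, 2.5, 3.1, 9.7]  # invented request durations

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
percentiles = statistics.quantiles(latencies_s, n=100)
p95, p99 = percentiles[94], percentiles[98]

print(f"median={statistics.median(latencies_s):.1f}s  p95={p95:.1f}s  p99={p99:.1f}s")
# The single 9.7 s outlier barely moves the median but dominates the high percentiles,
# which is exactly why SLOs are stated as percentiles rather than averages.
```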

Performance testing

Talking about scalability, we usually want to be able to cope with the load that may come tomorrow. If we have already described our current performance requirements, understood the current load profile, and set our next goals, the next step is testing. Performance testing is probably one of the most complicated types of testing. There are so many hidden aspects, but let's highlight some of them. First of all, when you are trying to check the performance of your system under increased load, you have to imitate the load profile as closely as possible. It means that you have to make a lot of parallel requests, just as your users do. They should query and modify different data — not the same data all the time. They should use data fetched by previous requests in subsequent queries. They should ask not only for recent data but for old records as well. All of this means that your performance testing system should be aware of the business logic of your product. Building it is therefore a very complex task, but there is no other way to avoid service outages after a load increase. Of course, it will not give you a 100% guarantee, as usual, but you can hope that it will prevent at least some issues, and in most cases your service will be able to handle a new influx of users.
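
As a rough sketch, a business-logic-aware load generator could look like this; the URLs and fields are placeholders, and real tools such as Locust or k6 exist for the heavy lifting, but the shape of the scenario is the important part:

```python
import concurrent.futures
import json
import random
import urllib.request

BASE_URL = "http://staging.internal/api"  # hypothetical load-test target

def create_order(session_id: int) -> str:
    body = json.dumps({"item": random.randint(1, 10_000), "qty": random.randint(1, 5)}).encode()
    request = urllib.request.Request(f"{BASE_URL}/orders", data=body,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["order_id"]  # reused by the follow-up read below

def user_scenario(session_id: int) -> None:
    order_id = create_order(session_id)                                  # write...
    urllib.request.urlopen(f"{BASE_URL}/orders/{order_id}", timeout=10)  # ...then read it back
    old_id = random.randint(1, 1_000)                                    # and touch old records too
    urllib.request.urlopen(f"{BASE_URL}/orders/{old_id}", timeout=10)

if __name__ == "__main__":
    # Many parallel "users" running a realistic scenario, not one endpoint hit in a tight loop.
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        list(pool.map(user_scenario, range(1_000)))
```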

Microservices

When you know that there are performance problems in your system and load testing results are not good enough to say you will handle a new wave, what can you do? Based on my experience, there are two parallel directions in which you should work: decomposition and "statelessification." Both are needed to scale your solution not only vertically but also horizontally. Adding a whole new instance of your application on another machine may be too wasteful. I bet that in your app there are parts that handle most of the load. As a result, it makes sense to add more power to them exclusively. To do that, you have to be able to run them independently, and that is where you need decomposition. You do not necessarily have to dive headlong into the microservices ocean — it can be too expensive and too heavy to convert the whole system to this approach — but you can start with a small step and detach at least the hottest parts. At the same time, scaling stateful services is not only tricky and dangerous — it can even be impossible. Even if, after a giant amount of pain and suffering, you achieve a custom distribution of state between the detached services, because of its complexity, this solution will be very error-prone and unmaintainable. That is why it is much better to do your best to remove shared state from the logic and use existing eventually consistent systems to store and communicate state between services.
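
A minimal sketch of "statelessification" is below: the shared state moves out of the process into an existing external store. Redis is used purely as an example of such a store (my assumption, not a prescription from this article); any replicated key-value system plays the same role:

```python
import redis  # pip install redis; any replicated key-value store plays the same role

store = redis.Redis(host="state.internal", port=6379)  # hypothetical shared store

# Before: each worker kept its own in-memory counter, so adding workers changed the numbers.
# After: every worker instance is interchangeable, because the state lives outside the process.
def register_visit(user_id: str) -> int:
    return store.incr(f"visits:{user_id}")  # atomic on the store, safe with any number of workers
```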

Maintainability

If you have ever created any software products, you already know that the cost of maintenance is always much higher than the cost of the initial development. It happens because when you design a new system, you have a strong concept of how it should work and, as a result, how it should be implemented. Depending on how much time you have spent on your first release, your initial version can look clear, simple, and just perfect. Unfortunately, then the real world comes into play. From time to time, new factors appear, and you have to adjust your system again and again. In the end, you may find yourself working with something big, complicated, and fragile, which we often call "legacy." Let's talk about three key factors that you have to control if you want to avoid such a situation and reduce the maintenance cost.

DevOps

First of all, it is hard to overestimate the importance of developer operations in your process. Everything described above about scalability and reliability relies on the level of your operations. Maintainability is no exception. Based on my experience, I would say that we can even evaluate the quality of software just by checking how easily it can be integrated with dev-ops systems. In other words, if we can install, deploy, update, and monitor our application without any workarounds, manual checks, or adjustments — consistently and repeatably every time — it means that our software is designed well. This principle works the same way TDD helps software developers keep their code modular, flexible, and extensible. If, when designing a service or module, you start from thinking about its "users" — the other engineers who will use your API or who will have to control and monitor your product — you are forced to split it properly, organize it logically, provide controls and documentation, and so on and so forth. At the end of the day, even if your product and software are not perfect, you can at least spend time calmly thinking about how to improve them and beat your competitors — but only if your developer operations are great and have your back.
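
As a small illustration of designing for the people who operate the service, here is a plain health endpoint built on the standard library; the checks inside are placeholders for real ones such as "database reachable" or "migrations applied":

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def collect_health() -> dict:
    # Real checks would query the database, the cache, the job queue, and so on.
    return {"status": "ok", "version": "1.4.2", "pending_migrations": 0}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = json.dumps(collect_health()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```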

Simplicity

Then, if you can control and monitor everything in your system, the next factor to look at is simplicity. Growth in complexity over time is inevitable, but without control, it can lead to disaster. Things like tangled dependencies, inconsistent naming, performance hacks, and all kinds of workarounds can tie your hands. It may get to the point where your engineers are scared to make any changes to some parts of your system, because the complexity level makes them unpredictable. The process of fixing such problems starts with identifying the most problematic and complex places. As the next step, we should redesign the abstractions for them — rethink the interfaces of such modules, their responsibilities, and their requirements. After introducing a good "facade" for such parts, we can start the refactoring process there. It is difficult, time-consuming, and (the worst part) hard to "sell" to "business" people. The art of eloquence is not part of this article, but you will definitely need it to support a good technical strategy for the future.
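
A rough sketch of that "facade" step might look like this; the legacy functions are stand-ins for the tangled code, and the idea is that callers switch to the small new surface first, so the internals can then be rewritten piece by piece:

```python
def _legacy_prepare_ledger(customer_id: str) -> None:  # stand-in for the old, tangled code
    print(f"preparing ledger for {customer_id} the old, convoluted way")

def _legacy_do_charge_v2(customer_id: str, amount_cents: int, retry: bool) -> str:
    print(f"charging {customer_id} {amount_cents} cents (retry={retry})")
    return "charge-001"

class BillingFacade:
    """The new, small surface the rest of the system is allowed to depend on."""

    def charge(self, customer_id: str, amount_cents: int) -> str:
        # Today: delegate to the old code paths, quirks and all.
        _legacy_prepare_ledger(customer_id)
        return _legacy_do_charge_v2(customer_id, amount_cents, retry=True)

# Callers depend only on the facade, so the legacy internals can later be replaced
# piece by piece without touching the rest of the codebase.
print(BillingFacade().charge("customer-42", 1999))
```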

Finally, there are further things — Agile practices, Extreme Programming, company culture — that are mostly related to people management. Using them, you should find and motivate great professionals to help you with everything described above. Good luck!

Ivan Evsikov

I have been a software engineer for more than ten years. I have experience in development and in people management. I am here to share my knowledge and to learn more.