The evolution of software from monolithic to distributed architectures (microservices, FaaS, etc.) has many benefits, including speed, flexibility, fault isolation, scalability, smaller and faster deployments, and reduced vendor risk. However, building distributed systems creates a tremendous amount of complexity. As of 2017, Netflix estimated that it had around 700 microservices. 700! That is a lot of services to maintain! While you may not have 700, even an application with 50 microservices can be quite complex. And what happens when you have a bug!?
“The main problem with debugging and finding the root cause in a distributed system is being able to recreate the state of the system when the error occurred so that you can obtain a holistic view.”
My experience with bugs in distributed systems – whether in dev, QA, staging, or production – generally includes some or all of the following events. QA spends lots of time tediously documenting the bug in a ticket for the dev team. The developer spends a few hours trying to recreate the bug. The developer almost always, at some point, says: “it works on my machine!” The developer spends hours digging through log files to try to find where the bug might be. The developer determines they probably aren’t logging the right thing. The developer starts putting print statements all over the code to try to find it. QA is feeling some pressure from the product manager, so they start pushing the dev team. The dev team starts pointing fingers at DevOps. The dev team starts pointing fingers at the third-party API, but doesn’t have any data to prove that the problem is on the third party’s side. In extreme situations I have had to step in, put on my firefighter hat, and attach remote debuggers in other environments to solve mission-critical problems.
“Large-scale distributed systems can be a nightmare to debug…Pervasive logging may record events of interest at appropriate granularity, but correlating events across the logs of large numbers of machines is prohibitively difficult.”
I have experienced this problem as the new dev, the experienced dev, the team lead, the dev manager, and the CTO. Regardless of my role, this process has been painful and expensive. Countless hours are spent trying to identify, locate, and fix bugs in software, particularly when there are many services involved. In a 2018 study, researchers found that the “average the time used to locate and fix a fault increases with the number of microservices involved in the fault: 9.5 hours for one microservice, 20 hours for two microservices, 40 hours for three microservices, 48 hours for more than three microservices.”
So we set out to create a solution.
The Vortex API Debugger allows you to securely route API endpoints directly to your IDE from any environment. What’s more, it is a network-layer solution that requires zero code changes.
As a developer or QA engineer, you can create conditional rules to redirect traffic (i.e., the problem transaction) directly to your IDE of choice. You can then step through the code in your IDE, identify the problem, make a change locally, and push the change through to see the result. Once you have fixed the problem, you can then go through your normal deployment process with the changes.
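To make the idea of a conditional redirect rule concrete, here is a minimal sketch in Python. It is not the Vortex API or its rule format: every name in it (`Request`, `RedirectRule`, `choose_upstream`, the `X-Customer-Id` header, and both URLs) is a hypothetical stand-in, and it only illustrates the general technique of matching one problem transaction at the routing layer while leaving all other traffic untouched.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """A simplified view of an incoming API call (hypothetical type)."""
    method: str
    path: str
    headers: dict = field(default_factory=dict)


@dataclass
class RedirectRule:
    """Hypothetical rule: match on a path prefix plus one exact header value."""
    path_prefix: str
    header_name: str
    header_value: str
    debug_target: str  # assumed address where the developer's local service listens

    def matches(self, request: Request) -> bool:
        # Header lookup is case-sensitive here to keep the sketch short.
        return (
            request.path.startswith(self.path_prefix)
            and request.headers.get(self.header_name) == self.header_value
        )


def choose_upstream(request: Request, rules: list, default_upstream: str) -> str:
    """Return the debug target of the first matching rule, else the default."""
    for rule in rules:
        if rule.matches(request):
            return rule.debug_target
    return default_upstream


# Only the problem transaction (customer 1234) is redirected to the developer.
rule = RedirectRule("/orders", "X-Customer-Id", "1234", "http://localhost:8080")
problem = Request("POST", "/orders/checkout", {"X-Customer-Id": "1234"})
normal = Request("POST", "/orders/checkout", {"X-Customer-Id": "9999"})
```

In this sketch, `choose_upstream(problem, [rule], "http://orders.internal")` picks the developer's machine, while the same call with `normal` falls through to the regular service. Because the decision is made where requests are routed, the services themselves would not need to change.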
Join our beta today! https://www.vortexhub.io