Tracing is king: How we sped up our PaaS by 10x

Matěj
January 28, 2025 · 7 min read

At Zerops, we've built our platform from the ground up and run it on bare metal, giving us complete control over every layer - from infrastructure to user interface. Our UI takes advantage of this by reflecting all changes down to the infrastructure as they happen. But with great control comes great responsibility: managing all these layers means fighting latency at every step. This blog post takes you through the beginning of our journey to make everything faster - from API calls to spinning up example projects with your preferred stack during onboarding - and shows how we quickly and efficiently achieved 10x performance improvements.




Zerops interface reflecting the infrastructure

Our Initial Efforts

After re-launching Zerops as its own company (a long story for another day), we had a year and a half of technical debt to take care of: most of the services Zerops supported had released a couple of new major versions and deprecated the older versions we were offering. As part of this effort, we also wanted to add the option to choose between Ubuntu and Alpine Linux. Zerops containers are managed through Incus, and at that time all our images were based on Ubuntu alone.

Ubuntu is great, but it is quite sizable. We suspected the main reason our containers were slow to create was that we had to unpack hundreds of megabytes of Ubuntu files. So we introduced Alpine Linux—a very lightweight Linux distribution—to our tech stack, hoping spawn times would decrease.

But the difference between Alpine and Ubuntu spawn times was marginal at best. There had to be another reason why we were so slow...

We really felt the need for some improvements. Our team spends a great amount of time on our platform debugging things, trying to create one-click software recipes, and even deploying our own projects on Zerops. And if you've ever tried to set up a deployment pipeline, you know the pain of slow iteration. It takes time to create CI/CD pipelines, and every second chipped away from that process is invaluable.

Information Is Power: You Can’t Afford to Be Blind!

The ancient Egyptians revered gods with the heads of falcons—like Ra—for a reason. They understood the importance of clear vision.

Ancient Egyptian God Ra

Running software is no different. We, as software operators, need to know what is going on in our systems. Otherwise, we can't choose the right actions when something isn't working as expected. That's why we log things and collect metrics. Like Ra, we need to see. Optimization is not a guessing game, and this is where tracing comes into play.

Enter APM

APM, or Application Performance Monitoring, is an open-source observability tool developed by Elastic as part of the Elastic Stack, which we were already using.

Alternatives such as OpenTelemetry (OTel) offer similar capabilities. OTel is a vendor-neutral open-source framework for collecting telemetry data, including traces, metrics, and logs.

APM lets you collect traces—timed execution paths your code takes—across different parts of your software. These traces are linked by IDs that are passed along to tie the execution flow together. Once you start collecting traces, you see visualizations like this:

Timings of different transactions, RPC calls, and even individual SQL queries.


Neat, right? Luckily, we already had Elastic APM set up, but we hadn't used it for some time. After a few tweaks of context propagation in Go and a few extra APM transactions, we were ready to examine our processes.

And here lies the power of tracing. Once set up, we sifted through the traces and saw what was going on. Almost always, the unnecessarily time-consuming operation screamed at us from the trace visualization.

Lost time is screaming at you.


The great thing about APM is that it comes with a Go library that makes the collection of traces quite easy. Just wrap your SQL, HTTP, and gRPC clients with the library functions, set up Kibana and Elastic, and you should be ready to go.

(We're working on an article about setting up APM or OTel; follow Zerops on X or LinkedIn to catch it.)

What We Changed

Based on our findings from the trace visualizations, we decided to implement a few key changes that seemed worthwhile. Here they are:

  • Custom Boot Sequence: We modified the boot sequence of our Alpine Linux services and wrote a custom Syslog-NG OpenRC service definition.
  • Fast-Track Mechanisms: We implemented several fast-track mechanisms in our microservice architecture to bypass waiting for periodic ticks (Go channels were a great help).
  • Eliminating Technical Debt: After gathering some courage, we tackled one of our most pressing legacy technical debts—fetching Elasticsearch data in backend algorithms. We replaced it with fetching from SQL by "simply" rewriting all queries. This also allowed us to eliminate some very costly waits for data syncing.
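The fast-track mechanism from the list above can be sketched as a worker that normally wakes up on a periodic tick, plus a buffered "kick" channel that lets callers trigger the work immediately instead of waiting out the interval. The names and timings below are illustrative, not our actual code:

```go
package main

import (
	"fmt"
	"time"
)

// worker runs sync on every tick, but a send on kick fast-tracks it,
// bypassing the wait for the next periodic interval.
func worker(kick <-chan struct{}, done <-chan struct{}, sync func(reason string)) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-kick:
			sync("fast-track")
		case <-ticker.C:
			sync("periodic")
		case <-done:
			return
		}
	}
}

func main() {
	kick := make(chan struct{}, 1) // buffered, so a kick never blocks the caller
	done := make(chan struct{})
	ran := make(chan string, 1)

	go worker(kick, done, func(reason string) {
		select {
		case ran <- reason:
		default:
		}
	})

	kick <- struct{}{} // e.g. "a container was just created, sync now"
	fmt.Println(<-ran) // the fast-track fires long before the periodic tick
	close(done)
}
```

The buffer of one on `kick` is the key design choice: the caller fires and forgets, and multiple kicks arriving while a sync is already running collapse into a single pending run.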

We would never have focused on some of these areas without the insights provided by APM; it would have been a blind undertaking. Returning to our initial hypothesis about unpack times, we discovered that on pre-cached images, the LXC create and start operations are almost instantaneous for both Ubuntu and Alpine.

The Results 🎉

  • Up to 4x Faster Asynchronous Infrastructure Processes: This is actually huge for us.
  • About 10x Faster API Responses: For most endpoints.

API latencies before the optimization efforts


API latencies after the optimization efforts


Here are some satisfying benchmark results:


The GUI feels crisper and more responsive. About 80% of the logistical time for building and managing infrastructure is gone. And... it just feels good to say that our KeyDB (a fork of Redis) database service is up and running in just 6 seconds.

And it is all thanks to tracing.

Reach for the Low-Hanging Fruit First

A quick word of advice: A former colleague of mine always talked about the principle of low-hanging fruit, and I've come to truly appreciate the simple wisdom in it. If you have limited resources like time (which you always do), focus on the easiest improvements first. It will bring the results you need, and often it will be enough to keep things going.

Experience the Difference

Zerops has already come a long way, but we're just getting started! Experience our 10x performance improvement and deploy complete projects - with database, cache, frontend, backend, storage, and utility services - in minutes. Get started with $15 in free credits on sign-up, no credit card required. Or use promo code speedup on your first top-up to get a total of $95 worth of credits for just $10.
