OpenTelemetry and Kubernetes are the two most active projects at the Cloud Native Computing Foundation ( CNFC ).
This blog post aims to familiarize the reader with Jaeger , an open-source distributed tracing backend. We’ll be deploying it in a Kubernetes cluster alongside a demo application from which we will collect traces. Familiarity with basic K8s and OTel terminology is helpful but not required.
OpenTelemetry and Tracing
Most of us started “debugging” programs with log statements (printf debugging). But in a world of microservices and async concurrency, logs and metrics are insufficient tools for tracking the life cycle of a single request. Since 2019, OpenTelemetry has established itself as the standard for tracing requests through distributed systems.
Jaeger
Jaeger was developed by Uber, and after becoming open source in 2017, it was quickly incubated by the CNCF . While it lacks the feature set of some commercial tracing tools such as Honeycomb (which we use here at Massdriver), it is quick to get up and running, free, and stable.
Kubernetes
Originating at Google over 15 years ago, Kubernetes is a orchestration engine for automating the deployment and management of containers. If you’re completely new to this tool, the docs contain a number of great tutorials.
Setup
You’ll need a kubernetes cluster for this demo. For running a cluster locally on your machine, I would recomment minikube . I will be using Massdriver’s aws-eks-cluster bundle to spin up a managed cluster on AWS. The Jaeger Operator we’ll be using does require an ingress .
You will also need the Kubernetes CLI, kubectl , with access to your cluster.
Installing the Jaeger Operator
We’ll be managing Jaeger using an operator . In this case, the operator deploys and manages Jaeger for us, but operators can be used to automate all kinds of tasks. Frameworks for writing operators to extend the Kubernetes API should exist in your favorite programming language - here at Massdriver, we’re partial to bonny written in elixir .
First, we will create a namespace for all our observability-related resources:
Then, download the Jaeger operator and install it into the newly created namespace:
For more information on any of the kubectl commands we’ll be using, you can use the -h flag after the command:
Finally, we verify that our operator is up and running:
Deploying Jaeger
With the operator installed, we can now apply a Jaeger
custom resource to our cluster and the operator will create a Jaeger instance for us:
You can either save the above snippet in a file, say jaeger.yaml
, and apply it using kubectl apply -f jaeger.yaml -n observability
, or create the object directly from stdin:
Now if we check our deployments in the observability namespace, we should see both the operator and the jaeger instance:
We can access our Jaeger instance by port-forwarding the correct service:
We can see two services exposed by the operator, and four exposed by Jaeger. Of those, we are currently interested in jaeger-query
. 16687 is the canonical port for the Service UI.
Now you should be able to access the Jaeger UI in a web browser at localhost:16686
. We don’t have any traces to search yet, so let’s leave this window open for later.
Deploying Hotrod
Hotrod is an example application by the folks at Jaeger. But instead of deploying hotrod and Jaeger together as outlined in the linked README, we’ll be writing our own deployment for the application. We will be keeping our Jaeger infrastructure in the observability namespace, and create a new namespace for hotrod:
$ kubectl create
namespace
hotrod
Following that, we need to create a Deployment , in which we declare how we would like kubernetes to run the hotrod application:
Save the above snippet in a file, hotrod.yaml
, and deploy it via kubectl apply -f hotrod.yaml -n hotrod
.
Of note here is the environment variable OTEL_EXPORTER_OTLP_ENDPOINT
, which we use to specify where traces from hotrod should be sent. We know we have a jaeger-collector
service, which acts similarly to the OpenTelemetry Collector , and we know it runs on port 4318. Within kubernetes, a URL to a given service can be resolved via http://<service_name>.<namespace>.svc.cluster.local:<port>
, and so we can assemble the necessary URL for hotrod to send its traces to Jaeger in a different namespace.
Speaking of URLs, let us expose hotrod as a service so we don’t have to port-forward the specific pod:
The hotrod namespace should now contain the hotrod deployment, replicaset, pod, and service:
Now we can port-forward the hotrod service using kubectl port-forward svc/hotrod -n hotrod 8080
, and visit the frontend in a browser at http://localhost:8080
:
Clicking on any of the four buttons will kick off a request to the customer
service to find the customer, then reach out to a mock Redis to find a driver, and finally to the route
service. Given you are still port-forwarding Jaeger, you can click on open trace
to view the trace in Jaeger:
Within the trace, we can see that this single click on the hotrod web page caused a flurry of activity within a variety of microservices: the frontend
first calls out to a customer
service to retrieve the customer from mysql
, then a driver
service that makes several requests to redis
to find a driver, and finally ~10 requests to a route
service.
We can also see that several timeouts occurred in redis
, but that the request was still able to complete successfully. We might also notice that the requests to redis
to find drivers were executed in series, and could be parallelized to improve latency.
These insights would be difficult to derive from metrics and logs alone.
If we submit a high number of requests to hotrod in a small time frame, we notice that the latency of our requests increases:
Looking at the latest trace, for driver T720428C, we find that the /customer
span took about 1.4 seconds, up from the .25 seconds in the first trace we viewed. Drilling down into that span, we find the culprit in the attached logs: the request has to wait on a lock behind 4 other transactions. Another opportunity for improvement in the hotrod application. And we found all of this without needing to be familiar with hotrod’s source code.
Takeaways
Tracing, and in particular OpenTelemetry, has become a vital tool at Massdriver. We use it to profile and debug our applications, where a request can span multiple kubernetes clusters that can contain hundreds of containers. We use it to define SLOs and decide which areas of the platform we need to work on to reduce latencies and error rates. But your setup does not need to be complicated to benefit from distributed tracing - I also use it for my personal websites and projects, and with this simple setup you can as well.