Get to know OpenTelemetry without the confusion
OpenTelemetry is the second most active CNCF project, behind only Kubernetes. It can boost your observability and isn't rocket science once you get the hang of it, even though it can feel that way early on. Let's demystify OTel.
What does OpenTelemetry solve?
It’s a framework that promises to solve several messy problems in observability.
But at its core, it solves one thing unlike anything else out there.
Fragmented instrumentation and collection.
It can instrument all kinds of services for different kinds of data like metrics, logs, and traces and then process them for you.
In other words, OpenTelemetry can make it painless to get the data you need on how your software system is performing.
It helps you generate and capture telemetry data that contributes to better observability.
You’ll also hear some people say it democratizes the process of collecting observability data.
In plain English, it can cut the risk of vendor lock-in. We’ll talk about this later.
What was observability like before OpenTelemetry?
Viktor Farcic of the DevOps Toolkit has said that “OpenTelemetry can help fix the observability mess.”
His words make sense when you think of having to configure and maintain a multitude of tools just to get your telemetry data.
Before OpenTelemetry, it was not unusual to find 5 to 10 different tools and methods for collecting telemetry data in one software system.
Before OpenTelemetry, there was a mix and match of tooling that collected disparate telemetry data
You had a variety of SDKs, collectors, protocols, APIs, and agents trying to work together.
Each new tool came with an additional layer of complexity, because Tool A is configured differently from Tool B, which is configured differently from Tool C, and so on.
So engineers had to stay on top of it all.
You’ll hear people talk about this situation with descriptions like lacking standardization and interoperability. Big words highlighting big problems.
The lack of standardization meant engineers struggled to get consistent outcomes across different languages, frameworks, and environments.
The worst aspect of it all: sure, you could put in the hard work to keep everything running, but what you got back was fragmented data.
You then faced serious challenges in integrating that data, plus all kinds of other data flow issues.
By unifying how you instrument your code and how you collect and export telemetry data, OpenTelemetry promises to change this for good.
Its interoperability also allows you to change vendors for your querying and visualization work.
Key benefits of OpenTelemetry
Reduces fragmentation from disparate observability tooling
Eliminates the risk of vendor lock-in
Offers comprehensive observability covering logs, metrics, and traces
Ensures a consistent approach to observability across the organization
Works well within complex cloud-native environments
How does OpenTelemetry prevent vendor lock-in?
You might hear from people that OTel makes vendor lock-in a thing of the past.
Or at least makes it easier for you to swap out components because nothing is proprietary.
That means you can replace instrumentation libraries, collectors, or exporters without much friction.
The key idea here is that you aren’t relegated to any vendor’s specific tech stack.
You don’t have to recode or re-instrument your services when you decide to change tooling or vendors.
Which might be a nice thing when you’re looking to try something else.
Now you might be thinking: why would an observability vendor want this to happen?
It could be that they think OpenTelemetry will enable organizations to instrument more services, leading to a larger volume of observability data.
Most commercial plans are priced based on data storage and usage. And there you are.
It’s still a win for engineers because you don’t have to use commercial solutions. You could go with unsupported open source instead, though running it without support rarely works out for most organizations.
Where does OpenTelemetry collect data from?
The simple answer is almost everywhere you can think of.
OpenTelemetry pulls data from:
your frontend and backend — supporting a whole bunch of languages like Go, Java, Python, Ruby, JavaScript, and even Erlang
containerized environments like Docker and Kubernetes
all the major cloud providers including AWS, Azure, and GCP
existing observability tools like Prometheus, Jaeger, and more
What kind of data does OpenTelemetry collect?
It can handle the 3 core signals of observability at varying levels of compatibility.
By core signals, I mean logs, metrics, and traces.
Traces are the best covered pillar across different programming languages.
That might have something to do with the much older OpenTracing folding into OTel.
OpenTelemetry’s working groups are also investigating other signal types like continuous profiling.
That’s when you continuously collect data on the application’s runtime. That might cover aspects like CPU usage, memory usage, function calls, and more.
What is OpenTelemetry’s architecture like?
In the least possible words, I’d say OpenTelemetry is a loosely coupled framework.
You can use all of it or some of it. It’s super flexible!
You could just use the SDK to instrument your services and export straight to tracing backends like Jaeger or Zipkin, skipping the OTel Collector entirely.
Or you could go the other way around and use only the OTel collector. When would you do this? Let’s say you have Kafka as a middleware to stream data from various services.
You could stream data into the collector to push out to your exporter and then the visualization tool of choice.
For simplicity, I’m going to break OTel into 3 separate areas: instrumentation, deployment, and integration.
Let’s talk about instrumentation first because… that’s the first part of the process.
(Phase 1 of 3) Instrumentation
This initial phase marks where you instrument (or place) OpenTelemetry code within your services.
It gives your applications and components the ability to emit telemetry data.
Like I mentioned earlier, OTel has most of the popular programming languages covered.
Your job at this point is to incorporate the right SDK or instrumentation libraries into your application components.
Depending on what kind of data you need from the specific application, you will need to:
create spans for trace data
instrument to collect performance metrics and/or
add logging statements to generate logs
You can do all of this using API calls within the SDK you’re using.
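If you want to see what those API calls look like in practice, here's a minimal sketch of a manual span plus a custom metric. It assumes an SDK is already set up (as shown later in this article) and a recent @opentelemetry/api version that includes the metrics API; the tracer, meter, and checkout names are hypothetical:
// Manual span and metric via the OpenTelemetry API (no-ops until an SDK is registered)
const { trace, metrics } = require('@opentelemetry/api');
const tracer = trace.getTracer('checkout-service'); // hypothetical instrumentation name
const meter = metrics.getMeter('checkout-service');
const checkoutCounter = meter.createCounter('checkouts_completed'); // custom metric
function completeCheckout(orderId) {
  const span = tracer.startSpan('completeCheckout'); // span for trace data
  span.setAttribute('order.id', orderId); // domain-specific attribute
  // ... business logic goes here; logs usually flow through your existing logging library ...
  checkoutCounter.add(1);
  span.end();
}
completeCheckout('order-123');
Until a TracerProvider and MeterProvider are registered, these calls are safe no-ops, so you can sprinkle them in before wiring up any exporters.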
You can choose how you instrument your system’s components in 3 modes:
auto instrumentation
manual instrumentation
a hybrid of manual and auto instrumentation
The choice you make will make a huge difference to the amount of time and energy you spend on the initial OTel setup.
Let’s explore each of these in more detail:
Auto Instrumentation
This is when the instrumentation mechanism automatically attaches to your applications' runtime and injects code to capture telemetry data.
When would you do this?
When you want rapid deployment to instrument your application without making extensive changes to the code
When your application or other component follows common patterns that work well with auto instrumentation
When you need to capture telemetry data from a service that you don’t directly control e.g. a cloud provider’s database
When might you consider against it?
Auto instrumentation can have limitations in capturing domain-specific data (e.g. conversion rate in eCommerce) or custom metrics that fit only your use case
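To make the auto route concrete, here's a hedged Node.js sketch. It assumes the @opentelemetry/sdk-node and @opentelemetry/auto-instrumentations-node packages (not part of the install list later in this article), which attach instrumentation for common libraries on your behalf:
// tracing.js — auto instrumentation sketch (package names may vary with SDK versions)
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ConsoleSpanExporter } = require('@opentelemetry/sdk-trace-base');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(), // swap for a Jaeger or OTLP exporter in real use
  instrumentations: [getNodeAutoInstrumentations()], // HTTP, Express, popular DB clients, and more
});
sdk.start();
// Load it before your app starts, e.g. node --require ./tracing.js app.js
You write this setup once and your existing application code stays untouched, which is exactly the appeal of auto instrumentation.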
Manual Instrumentation
This is when you add the instrumenting code to the application yourself — without any automation.
When would you do this?
When you need fine-grained control over what and how you instrument, like the custom metrics I mentioned, or when your application doesn’t follow standard patterns
It can also come in useful with legacy systems or unique architectures
When might you consider against it?
If you’re pressed for time or short on the resources to do it, because manual instrumentation implies manual work
When you don’t have direct access to the codebase of the service you need to instrument
Hybrid of Manual and Auto Instrumentation
A mix of the automatic and manual modes I highlighted earlier
When would you do this?
When you have a clear idea of which components are eligible for auto instrumentation and which are more suitable for manual tweaks
A hybrid approach is useful for transitioning from manual to auto instrumentation or vice versa
When would you consider against it?
When you lack the planning capacity to assign the right mode of instrumentation to each component
This process requires consistent maintenance work as components evolve or completely change
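As a rough illustration of the hybrid mode, you might let auto instrumentation cover the common libraries and reserve manual spans for business logic it can't see. This sketch reuses the hypothetical packages and names from the earlier examples:
// Hybrid sketch: auto instrumentation for libraries, manual spans for domain logic
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { trace } = require('@opentelemetry/api');
const sdk = new NodeSDK({
  instrumentations: [getNodeAutoInstrumentations()], // the "auto" half
});
sdk.start();
function applyDiscountRules(cart) {
  // the "manual" half: a custom span auto instrumentation would never create
  const span = trace.getTracer('pricing').startSpan('applyDiscountRules'); // hypothetical names
  span.setAttribute('cart.items', cart.items.length);
  // ... pricing logic ...
  span.end();
}
applyDiscountRules({ items: [{ sku: 'abc-123' }] });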
Instrumenting your application doesn’t push the data out for collection at this point.
That comes in the next phase — deployment.
Advice for instrumenting complex software systems
Before you start to instrument your code, you need to know how OpenTelemetry will best complement your system.
So you need to audit your stack.
Have a clear mental or (ideally) written model outlining:
Languages in your software system
Kinds of signals needed for each component or service (group them to make it easier)
Which protocols you will use (OpenTelemetry’s default OTLP or another)
Which analytics tools you will use
What does OTel instrumented code look like?
OpenTelemetry’s CNCF ambassadors like Adriana Villela are better people to follow for specific examples of OpenTelemetry instrumentation.
Listen to our podcast interview where she discusses OpenTelemetry.
I’ll show you a simple example to get you thinking about instrumentation mechanics.
Below you can see 3 different code samples.
We'll instrument a Node.js application with OpenTelemetry for tracing, then export the span data to Jaeger, a dedicated tracing tool.
The first code sample installs OpenTelemetry packages for our Node.js application:
npm install \
  @opentelemetry/api \
  @opentelemetry/context-base \
  @opentelemetry/core \
  @opentelemetry/instrumentation \
  @opentelemetry/instrumentation-express \
  @opentelemetry/instrumentation-http \
  @opentelemetry/sdk-trace-node \
  @opentelemetry/tracing \
  @opentelemetry/resources
This second code sample installs the exporter needed to send data to Jaeger:
npm install @opentelemetry/exporter-jaeger
This third code sample sits within our Node.js application and shows the instrumentation steps:
// Step 1: Import OpenTelemetry
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor, ConsoleSpanExporter } = require('@opentelemetry/tracing');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
// Step 2: Import Jaeger exporter
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
// Step 3: Create and configure a TracerProvider with Jaeger exporter
const tracerProvider = new NodeTracerProvider();
tracerProvider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter())); // For console logging
tracerProvider.addSpanProcessor(new SimpleSpanProcessor(new JaegerExporter({ serviceName: 'example-service' }))); // Export to Jaeger
tracerProvider.register();
// Step 4: Instrumentation - Import instrumentations for popular libraries
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
// Step 5: Register instrumentations with the tracer provider
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    // Add more instrumentations as needed
  ],
});
// Step 6: Use OpenTelemetry APIs in your code
const { trace } = require('@opentelemetry/api');
// Function to simulate a simple HTTP request
async function makeHttpRequest() {
  const span = trace.getTracer('example-tracer').startSpan('makeHttpRequest');
  // Simulate some work
  await new Promise(resolve => setTimeout(resolve, 100));
  span.end();
}
// Sample application logic
async function main() {
  const span = trace.getTracer('example-tracer').startSpan('main');
  // Simulate some work
  await makeHttpRequest();
  span.end();
}
// Step 7: Execute the application
main();
(Phase 2 of 3) Deployment
After your application has been successfully instrumented, OpenTelemetry needs to run alongside the application’s runtime.
It can do this in various ways, including sidecars, DaemonSets, and gateways.
These are all deployment patterns for what you’ll hear referred to as the “Collector”.
This is a critical step that cannot be messed up.
It’s the point when telemetry data is gathered during the execution of your application.
Now that we’ve covered deployment, let’s look at the integration phase:
(Phase 3 of 3) Integration
This phase involves receiving, transforming, and then pushing the data out for human use.
Some vendors will have an “Observability Agent” or “Agent” for short that will do all of this.
You can mix and match depending on your programming language, OS, etc.
We can break OpenTelemetry’s integration phase into components like:
Receivers
Processors
Exporters
OTLP
Let’s talk about each of these components now:
1. Receivers
Receivers collect raw telemetry data from your instrumented applications or system components.
They are the entry points for your incoming observability data.
2. Processors
Processors manipulate, or more precisely transform, the raw telemetry data taken in by receivers.
You can transform the data by filtering, sampling, and enriching it.
3. Exporters
Exporters take your processed telemetry data and send it to external systems. This can be your observability dashboard for example.
You can also direct the data to storage for future querying or visualization needs.
4. OTLP
OTLP is not so much a stage that does anything to the data as the protocol that moves it around. It stands for OpenTelemetry Protocol.
In a way, this is the secret sauce that keeps the data consistent. It’s the common language for communicating within the OTel framework.
There are 2 ways the data can be encoded for transport: as protobuf or as JSON.
Protobuf is a compact and efficient way of encoding structured data, so it's the default (used over both gRPC and HTTP).
JSON is available over HTTP if you want the data to be human readable, but it increases the payload size.
Keep in mind that you don’t have to use OTLP at all for transporting the data.
You can use other protocols like Jaeger Thrift for tracing, Prometheus exposition format for metrics, and more.
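To show what that choice looks like from the application side in Node.js, here's a hedged sketch that sends spans to a Collector over OTLP/HTTP. It assumes the @opentelemetry/exporter-trace-otlp-http package and the Collector's default OTLP/HTTP port; package names and defaults can vary by SDK version:
// Exporting spans over OTLP/HTTP to a local Collector
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(new OTLPTraceExporter({
  url: 'http://localhost:4318/v1/traces', // the Collector's default OTLP/HTTP traces endpoint
})));
provider.register();
// Prefer a different protocol? Swap the exporter and keep the rest:
// const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
// provider.addSpanProcessor(new SimpleSpanProcessor(new JaegerExporter()));
Because both the SDK and the Collector speak OTLP, changing your backend later is usually a configuration change rather than a re-instrumentation job.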
Challenges with taking up OpenTelemetry
Not all of it will work well with your architecture
The most important thing you have to keep in mind is that OTel is made up of various components, and not every component is stable or even available.
For example, as of December 29th 2023, logs were not available for Go-based services.
But you can be confident in OTel’s tracing capabilities as that part is mature across most languages.
You can easily check if OTel will work with your various system components.
Go to opentelemetry.io/status and follow the instructions there.
You will see that various OTel components are marked as stable, experimental, in development, or not yet started.
Once a component is stable, you’re good to use it in most situations because stable implies:
long term support — the component is well-tested and ready for production use
dependency isolation — designed to minimize dependencies and provide clear APIs
backwards compatibility — future updates will strive to avoid breaking existing functionality
Some systems just won’t work with OpenTelemetry
Now, I said that a stable component can be used in most situations.
There are some situations where you might not be able to or want to use it.
The first instance is when you’re looking to instrument a legacy system.
You’ll have to make considerations about whether that component will work with your legacy system effectively.
Another instance is when you have a low-latency system like a trading platform.
The performance overhead of OpenTelemetry might warrant a closer look at how instrumentation affects your system's latency before you commit to it.
OpenTelemetry has a learning curve
While it’s a lot easier than learning 5 or even 10 different tools, OpenTelemetry is still intricate and you will need to learn how to use it well.
A few things I’d recommend for you to do are:
deep dive into the documentation on the OpenTelemetry website
talk with developer advocates at the observability provider you use
reach out to CNCF ambassadors or maintainers who are focused on OpenTelemetry
While I’m on that train of thought, a quick shout-out to the hardworking maintainers of OpenTelemetry’s codebase who are working to make every component stable.