OpenTelemetry in client libraries
Disclaimer: This article is my personal opinion based on experience instrumenting and supporting .NET HTTP client/server layers, Azure client libraries, and Functions. YMMV and I’d love to hear your story.
In the simple case, the code that traces a call (incoming and outgoing RPC calls are usually a good start) does three things: it creates a span that describes the call with the necessary attributes, propagates context, and ends the span after the call completes.
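The three steps above can be sketched roughly as follows. This is a toy stand-in written with the Python standard library only, not the OpenTelemetry API: `Span`, `start_span`, `inject`, `send`, and `traced_get` are hypothetical names, and the attribute keys are illustrative. The only real-world detail is the shape of the W3C Trace Context `traceparent` header.

```python
import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional


@dataclass
class Span:
    """Toy stand-in for a span; hypothetical, not the OpenTelemetry API."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None
    attributes: Dict[str, object] = field(default_factory=dict)
    start_ns: int = 0
    end_ns: Optional[int] = None


@contextmanager
def start_span(name: str, parent: Optional[Span] = None) -> Iterator[Span]:
    """Create a span (a child of `parent` if given) and end it on exit."""
    span = Span(
        name=name,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        start_ns=time.time_ns(),
    )
    try:
        yield span
    finally:
        span.end_ns = time.time_ns()  # the span ends even if the call raises


def inject(span: Span, headers: Dict[str, str]) -> None:
    """Propagate context as a W3C Trace Context `traceparent` header."""
    headers["traceparent"] = f"00-{span.trace_id}-{span.span_id}-01"


def send(url: str, headers: Dict[str, str]) -> int:
    """Hypothetical transport: returns a fixed status instead of doing I/O."""
    return 200


def traced_get(url: str, parent: Optional[Span] = None) -> Span:
    with start_span("GET", parent) as span:
        span.attributes["url"] = url  # illustrative attribute names
        headers: Dict[str, str] = {}
        inject(span, headers)
        span.attributes["status"] = send(url, headers)
    return span
```

Passing a previous span as `parent` yields a nested span that shares the trace id, which is the simplest of the relationships between spans.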
Spans have some relationships with each other and other signals, so users can correlate (and causate, if you will) them. Relationships are app-specific — you may have nested spans, forks, merges, and whatnot.
Instrumentation can be done in different ways: app developers can write it along with business logic or reuse a common handler, configure it with a few lines of code, or inject it at runtime with no code at all.
A library can also provide instrumentation out of the box. OpenTelemetry makes this a viable option, and it only enables instrumentation for apps that explicitly configure it.
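To illustrate the opt-in idea, here is a minimal sketch in which the library code is always instrumented but spans do nothing until the app turns tracing on. All names (`NoopSpan`, `RecordingSpan`, `enable_tracing`, `upload`) are hypothetical; the real OpenTelemetry API achieves a similar effect by returning no-op objects until an SDK is configured.

```python
from typing import Dict, List, Optional


class NoopSpan:
    """Default span: every operation does nothing."""

    def set_attribute(self, key: str, value: object) -> None:
        pass

    def end(self) -> None:
        pass


class RecordingSpan:
    """Span that stores attributes and reports itself to a sink on end()."""

    def __init__(self, name: str, sink: List["RecordingSpan"]) -> None:
        self.name = name
        self.attributes: Dict[str, object] = {}
        self._sink = sink

    def set_attribute(self, key: str, value: object) -> None:
        self.attributes[key] = value

    def end(self) -> None:
        self._sink.append(self)


_sink: Optional[List[RecordingSpan]] = None  # None means tracing is off


def enable_tracing(sink: List[RecordingSpan]) -> None:
    """Called by the application; the library never calls this itself."""
    global _sink
    _sink = sink


def _start_span(name: str):
    return NoopSpan() if _sink is None else RecordingSpan(name, _sink)


def upload(data: bytes) -> int:
    """Hypothetical library operation, instrumented unconditionally."""
    span = _start_span("upload")
    try:
        span.set_attribute("size", len(data))
        return len(data)
    finally:
        span.end()
```

An app that never calls `enable_tracing` pays only for a couple of empty method calls per operation.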
Instrumentation in client libraries is the ultimate goal
A client library has the most knowledge about the underlying service: protocols, specific attributes, how context propagates inside the library and to the service. It can produce telemetry with the best quality and richness at a minimum cost to the user.
Tracing embedded in the library does not need manual configuration or runtime injection (reflection, byte-code instrumentation, profiler APIs, monkey-patching are great, but sometimes not possible or too expensive).
Client libraries have all the freedom to implement instrumentation without the hacks and limitations that third-party instrumentation usually needs. It’s likely the most efficient way in terms of performance.
There are some issues even with this approach:
- dependency hell. Libraries have to depend on the opentelemetry-api package, and users may bring lower versions where some features or fixes are missing. In .NET, where we made the OpenTelemetry predecessor (Activity) part of the platform, we still had a fair amount of version hell issues (hi binding-redirect, I’m not gonna miss you). Some extra steps (plugins or reflection) can reduce the pain, but they complicate onboarding.
- general-case support. Manual instrumentation does not have to support all cases, as long as it solves one app’s observability needs. Client library instrumentation has to be unopinionated and support most use cases. That can be tricky: e.g. messaging system consumers can read and process messages however they want (or use library-provided handlers).
- expertise. Instrumenting a library requires reasonable experience with tracing (and vice versa, instrumentation requires some expertise with the library).
Auto-instrumentation should be the target of conventions
OpenTelemetry provides an evolving list of conventions describing how common kinds of operations should be instrumented: how spans should be created and which properties should be populated.
I believe instrumentation in client libraries brings the best experience to users, and while writing conventions, we should target library developers as their main consumers. External auto-instrumentation should benefit from this approach too.
Here’s what I learned while instrumenting libraries and scaling it beyond myself:
- feasibility and performance implications. A library can only control what happens within its API calls. E.g. HTTP clients don’t handle retries, may follow HTTP redirects depending on the user’s choice, and usually can’t control how the app reads the response. Conventions should reflect what’s possible, even if it’s not perfect.
- being unopinionated. Conventions have to support general use cases and not prescribe how applications use the library. E.g. messaging client libraries don’t always prescribe how messages are processed (one-by-one or in batches), so conventions should specify what makes sense and brings value regardless of usage. They may suggest how manual telemetry can fill the gaps. Another example is context propagation: realistically, only the client library knows how to propagate context to the service; the user-configured propagator is not necessarily helpful and is sometimes dangerous (e.g. with request signatures).
- evolution. We can publish conventions for proven and non-controversial cases, leaving controversial or not well-tested details for the future as long as we have a general idea of how to evolve instrumentation (through new spans, events, logs, or metrics). We should evolve them based on user feedback.
- reasoning. We can’t assume everyone has deep tracing expertise, so it’s important to explain why something has to be done a certain way. It helps library writers understand how minor details translate into different user experiences. E.g. a non-obvious requirement to restart a trace at a trust boundary or for a specific scenario has to be explained to be respected.
- not being too prescriptive. Questions like span kinds (internal or client for logical calls with nested spans?) or restarting a trace on consumers don’t have a single answer. Library and app developers instrument code based on what makes sense to them. E.g. users have to make trade-offs based on instrumentation costs or what their vendor supports (links, attribute sensitivity, visual experiences, and pricing model). Library developers care about core functionality, performance, and support costs. Being too specific (if unnecessary) may increase costs for users and affect the supportability of instrumentation.
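The messaging example from the list above (one-by-one versus batch processing, and whether a consumer continues or restarts a trace) can be sketched like this. It shows one possible choice, not a recommendation from any convention: a single message continues the producer's trace as a child, while a batch starts a new trace and links to every producer context. `Span`, `extract`, `span_for_message`, and `span_for_batch` are hypothetical names, and the parser only accepts the four-field `traceparent` form.

```python
import re
import secrets
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Context = Tuple[str, str]  # (trace_id, span_id)

_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def extract(headers: Dict[str, str]) -> Optional[Context]:
    """Parse a four-field `traceparent` header; return None if invalid."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, span_id, _flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id


@dataclass
class Span:
    """Toy span with links; hypothetical, not the OpenTelemetry API."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None
    links: List[Context] = field(default_factory=list)


def span_for_message(headers: Dict[str, str]) -> Span:
    """One-by-one processing: continue the producer's trace when possible."""
    ctx = extract(headers)
    if ctx is None:
        return Span("process", secrets.token_hex(16), secrets.token_hex(8))
    return Span("process", ctx[0], secrets.token_hex(8), parent_id=ctx[1])


def span_for_batch(batch: List[Dict[str, str]]) -> Span:
    """Batch processing: start a new trace and link to every producer."""
    contexts = (extract(headers) for headers in batch)
    links = [ctx for ctx in contexts if ctx is not None]
    return Span(
        "process batch",
        secrets.token_hex(16),
        secrets.token_hex(8),
        links=links,
    )
```

A batch has many potential parents, so links are the only relationship that keeps all of them; whether a vendor's UI does anything useful with links is exactly the kind of trade-off mentioned above.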
What about manual instrumentation?
We can’t expect users to become tracing experts and instrument their code perfectly according to all conventions, and in my experience, it’s not what users want.
When popular client libraries are instrumented, users should get reasonable observability without any manual instrumentation, and custom instrumentation that complements the automatic one can make the experience even better.