OpenTelemetry in client libraries

What is instrumentation?

In the simple case, it’s the code that traces calls (incoming and outgoing RPC calls are usually a good start): we create a span that describes the call with the necessary attributes, propagate context, and end the span after the call completes.
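The three steps above (create a span, propagate context, end the span) can be sketched as follows. To stay dependency-free, this uses a toy `ToySpan` class and an in-memory `FINISHED_SPANS` list rather than the real OpenTelemetry API. `traced_call`, `send_request`, `get_user`, and the example URL are all hypothetical names made up for illustration. The `traceparent` value follows the W3C Trace Context layout (version, trace id, parent id, flags).

```python
import secrets
import time
from contextlib import contextmanager

FINISHED_SPANS = []  # stand-in for an exporter; a real SDK would ship spans elsewhere


class ToySpan:
    """Toy span for illustration only, not the OpenTelemetry Span class."""

    def __init__(self, name):
        self.name = name
        self.trace_id = secrets.token_hex(16)  # 32 hex characters
        self.span_id = secrets.token_hex(8)    # 16 hex characters
        self.attributes = {}
        self.error = None
        self.start = time.time()
        self.end = None

    def traceparent(self):
        # W3C Trace Context header value: version-traceid-parentid-flags
        return f"00-{self.trace_id}-{self.span_id}-01"


@contextmanager
def traced_call(name, **attributes):
    span = ToySpan(name)                 # 1. create a span describing the call
    span.attributes.update(attributes)
    try:
        yield span
    except Exception as exc:
        span.error = repr(exc)           # record the failure, then re-raise
        raise
    finally:
        span.end = time.time()           # 3. end the span when the call completes
        FINISHED_SPANS.append(span)


def send_request(url, headers):
    # Hypothetical transport; a real client would perform the network call here.
    return 200


def get_user(user_id):
    with traced_call("GET", **{"http.request.method": "GET"}) as span:
        headers = {"traceparent": span.traceparent()}  # 2. propagate context
        status = send_request(f"https://example.invalid/users/{user_id}", headers)
        span.attributes["http.response.status_code"] = status
        return status
```

Calling `get_user(42)` leaves one finished span in `FINISHED_SPANS` with both attributes set and an end timestamp, whether or not the call raised.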

Instrumentation in client libraries is the ultimate goal

A client library has the most knowledge about the underlying service: protocols, specific attributes, how context propagates inside the library and to the service. It can produce telemetry with the best quality and richness at a minimum cost to the user.

It comes with challenges, though:

  • dependency hell. Libraries have to depend on the opentelemetry-api package, and users may bring lower versions where some features or fixes are missing. In .NET, where we made OpenTelemetry’s predecessor (Activity) part of the platform, we still had a fair amount of version hell issues (hi binding-redirect, I’m not gonna miss you). Some extra steps (plugins or reflection) can reduce the pain, but they complicate onboarding.
  • general-case support. Manual instrumentation does not have to support all the cases, as long as it solves one app’s observability needs. Client library instrumentation has to be unopinionated and support most use cases. It can be tricky: e.g. messaging system consumers can read and process messages as they want (or use library-provided handlers).
  • expertise. Instrumenting a library requires reasonable experience with tracing (and vice versa, instrumentation requires some expertise with the library).
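One way to soften the dependency problem from the list above is to treat the tracing API as optional and fall back to a no-op when it is absent. The sketch below assumes the Python `opentelemetry-api` package. `library_span`, `fetch_item`, the tracer name `example.client`, and the `example.item_id` attribute are hypothetical names chosen for illustration. It is not how any particular library does it, and it does not address version mismatches.

```python
from contextlib import contextmanager

try:
    from opentelemetry import trace  # provided by the opentelemetry-api package
except ImportError:
    trace = None  # the user did not install the API: tracing is disabled


@contextmanager
def library_span(name, attributes=None):
    """Start a span if the tracing API is available, otherwise do nothing."""
    if trace is None:
        yield None
        return
    tracer = trace.get_tracer("example.client")  # hypothetical instrumentation name
    with tracer.start_as_current_span(name) as span:
        for key, value in (attributes or {}).items():
            span.set_attribute(key, value)
        yield span


def fetch_item(item_id):
    # Hypothetical client call; the body runs the same with or without tracing.
    with library_span("fetch_item", {"example.item_id": item_id}):
        return {"id": item_id}
```

The trade-off is the one the list mentions: the library no longer forces a dependency on its users, but they need an extra step (installing the API package) before any telemetry shows up.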

Auto-instrumentation should be the target of conventions

OpenTelemetry provides an evolving list of conventions describing how common kinds of operations should be instrumented: how spans should be created and which attributes should be populated.

  • feasibility and performance implications. A library can only control what happens within its API calls. E.g. HTTP clients don’t handle retries, may follow HTTP redirects depending on the user’s choice, and usually can’t control how the app reads the response. Conventions should reflect what’s possible, even if it’s not perfect.
  • being unopinionated. Conventions have to support general use cases and not prescribe how applications use the library. E.g. messaging client libraries don’t always prescribe how messages are processed (e.g. one-by-one or in batches), so conventions should specify what makes sense and brings value regardless of usage. They may suggest how manual telemetry can fill the gaps. Another example is context propagation: realistically, only the client library knows how to propagate context to the service. The user-configured propagator is not necessarily helpful and is sometimes dangerous (request signatures).
  • evolution. We can publish conventions for proven and non-controversial cases, leaving controversial or not well-tested details for the future as long as we have a general idea of how to evolve instrumentation (through new spans, events, logs, or metrics). We should evolve them based on user feedback.
  • reasoning. We can’t assume everyone has deep tracing expertise, so it’s important to explain why something has to be done a certain way. It helps library writers understand how minor details translate to different user experiences. E.g. a non-obvious requirement to restart a trace at a trust boundary or for a specific scenario has to be explained to be respected.
  • not being too prescriptive. Questions like span kinds (internal or client for logical calls with nested spans?) or restarting a trace on consumers don’t have a single answer. Library and app developers instrument code based on what makes sense to them. Users have to make trade-offs based on instrumentation costs or what their vendor supports (e.g. links, attribute sensitivity, visual experiences, and pricing model). Library developers care about core functionality, performance, and support costs. Being too specific (if unnecessary) may increase costs for users and affect the supportability of instrumentation.
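The request-signature remark in the list above can be illustrated with a small sketch. It assumes a made-up signing scheme in which the client signs a fixed set of headers with HMAC-SHA256. `SIGNED_HEADERS`, `sign`, `build_request`, `verify`, and the `x-signature` header are hypothetical and not taken from any real service. The point is only that the library knows the trace header has to be in place before the signature is computed.

```python
import hashlib
import hmac
import secrets

# Hypothetical scheme: these headers are covered by the request signature.
SIGNED_HEADERS = ("host", "traceparent")


def sign(headers, key):
    """Compute an HMAC-SHA256 signature over the signed headers."""
    canonical = "\n".join(
        f"{name}:{headers[name]}" for name in SIGNED_HEADERS if name in headers
    )
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()


def new_traceparent():
    # W3C Trace Context layout: version-traceid-parentid-flags
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def build_request(host, key):
    headers = {"host": host}
    headers["traceparent"] = new_traceparent()   # inject context first...
    headers["x-signature"] = sign(headers, key)  # ...then sign the final headers
    return headers


def verify(headers, key):
    """Service-side check: recompute the signature and compare."""
    expected = sign(headers, key)
    return hmac.compare_digest(headers.get("x-signature", ""), expected)
```

Under this scheme, a generic propagator that rewrote `traceparent` after `build_request` returned would make `verify` fail, which is one reading of why a user-configured propagator can be dangerous here.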

What about manual instrumentation?

We can’t expect users to become tracing experts and instrument their code perfectly according to all conventions, and in my experience, it’s not what users want.

Liudmila Molkova

Software engineer working on Azure SDKs. Azure Monitor in the past. Distributed tracing enthusiast.