Thoughts on HTTP instrumentation with OpenTelemetry

Instrumentation requirements

  • Answer typical user questions about the request: when it was made, in which context, how long it took, and what the destination and result were.
  • Enable instrumentation (ideally without code) at runtime. Code configuration should also be possible, and some languages prefer an explicit approach. For Java, Python, Node.js, and .NET, enabling HTTP tracing implicitly is the standard users expect.
  • Support generic use-cases consistently (within the laws of physics) across client implementations, languages, configurations, and usage patterns.
  • Don’t be expensive: there should be nearly zero perf impact when there is no OpenTelemetry SDK (for native instrumentation), and the default configuration should have a small perf impact. Extra details cost users more in telemetry bills and performance and should be opt-in.
  • Don’t break HTTP client core functionality: instrumentation should follow OpenTelemetry error conventions — errors are visible, but applications keep working no matter what happens with tracing. Tracing API stability is a big part of this.

High-level overview

  • each try or redirect is an HTTP span with a unique context that must be propagated downstream. HTTP clients don’t control how their APIs are used, and the spans they create need to have the same meaning regardless of use-case. Unambiguously mapping client spans to server spans also requires each try/redirect to have and propagate an individual context.
  • nesting retries under another span is not always feasible, because HTTP client APIs can be used in different ways and the client doesn’t know about all retries. While higher-level spans with nested retries may be desirable, for the most popular happy case they are redundant. Users or libraries that make HTTP calls MAY add more instrumentation layers when it makes sense.
  • DNS, TLS, TCP, and any lower levels can be instrumented, but should not be enabled by default (due to low demand, costs, and perf implications). They can be added incrementally as events, logs, or new nested spans (with semantics other than HTTP). We need some real-world instrumentation examples to come up with conventions.
  • current attributes have proven themselves over the years and are great!
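The first bullet can be illustrated with plain Java. The class below is an illustrative stand-in, not the OpenTelemetry API — it only shows the W3C traceparent header layout: the trace id is shared by the whole operation, while every try gets a fresh span id, so each outgoing request carries a unique context that a server span can be mapped back to unambiguously.

```java
import java.security.SecureRandom;

// Illustrative sketch (not real OpenTelemetry types): the trace id stays
// stable for the whole operation, but every attempt gets a new span id,
// so the W3C traceparent header sent downstream is unique per try.
public class PerTryContext {
    private static final SecureRandom RANDOM = new SecureRandom();

    static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        RANDOM.nextBytes(buf);
        StringBuilder sb = new StringBuilder();
        for (byte b : buf) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) {
        String traceId = randomHex(16);       // one trace for the operation
        for (int attempt = 0; attempt < 3; attempt++) {
            String spanId = randomHex(8);     // new span per try
            String traceparent = "00-" + traceId + "-" + spanId + "-01";
            System.out.println("try " + attempt + ": " + traceparent);
        }
    }
}
```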

More details

Consider a typical HTTP client:

  • it has a send call to execute a request
  • it likely has handlers/interceptors for auth, retries, logging, etc. Users probably use them for retries — but not all users, and not all the time
  • it has configuration (default retry policy, auto-redirects, response buffering, etc.)

HTTP Client use-cases

Basic, happy case

OkHttpClient client = new OkHttpClient();

Request request = new Request.Builder()
    .url(url)
    .build();

try (Response response = client.newCall(request).execute()) {
    return response.body().string();
}

Retries

OkHttpClient client = new OkHttpClient.Builder()
    .addInterceptor(new Interceptor() {
        @Override
        public Response intercept(Chain chain) throws IOException {
            Request request = chain.request();
            Response response = chain.proceed(request);
            int tryCount = 0;
            while (!response.isSuccessful() && tryCount < 3) {
                tryCount++;
                // you'd also need backoff and jitter
                response.close(); // close the failed response before retrying
                response = chain.proceed(request);
            }
            return response;
        }
    })
    .build();
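As the comment in the interceptor notes, a real retry loop also needs backoff and jitter. A minimal "full jitter" sketch — sleep a random duration between zero and an exponentially growing cap (the base delay and cap values here are illustrative, not OkHttp defaults):

```java
import java.util.concurrent.ThreadLocalRandom;

// "Full jitter" exponential backoff: cap doubles per attempt, the actual
// delay is drawn uniformly from [0, cap] to spread retries out.
public class Backoff {
    static final long BASE_MS = 100;     // illustrative base delay
    static final long MAX_MS = 10_000;   // illustrative upper bound

    static long delayMillis(int attempt) {
        long cap = Math.min(MAX_MS, BASE_MS * (1L << attempt)); // 100, 200, 400...
        return ThreadLocalRandom.current().nextLong(cap + 1);   // full jitter
    }
}
```

You would call `Thread.sleep(Backoff.delayMillis(tryCount))` before each `chain.proceed` retry.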
Span per HTTP try

  • It’s clear which server span belongs to which client span.
  • It’s clear when each retry starts and ends, whether the backoff interval is good, and what the HTTP status of each try was (which can potentially be achieved with events too).
  • The user is responsible for wrapper span creation and can put stream reading and logging under it.

Span per HTTP client send call

  • It’s impossible to tell which server span belongs to which client span (timestamps are misleading).
  • It may be clear when each retry starts and ends, plus the backoff interval and HTTP status of each try, if it’s done with events — but this needs backend support.
  • Users must know how it works and should use interceptors/handlers for tracing to work properly, or they will see a mixture of these two pictures (maybe even in the same app).
  • Consider a user who investigates the reason for retries:
  • they ask for support from the downstream service owner, e.g. a cloud provider.
  • with a span context common among multiple tries, their ask is ambiguous (“check what happened with something that has a certain traceparent”). Since we don’t even know whether this particular try ever reached the cloud provider, getting support for this request would likely be hard. It increases costs for both users and service providers.
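For completeness, here is roughly what the "one span plus per-try events" alternative looks like. The types below are illustrative stand-ins, not the OpenTelemetry API: instead of a child span per try, a single client span records one event per attempt carrying the try number and status code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative model (not OpenTelemetry types): one client span collects
// an event per attempt instead of creating a child span per try.
public class RetryEvents {
    record Event(String name, Map<String, Object> attributes) {}

    static List<Event> recordTries(int[] statusCodes) {
        List<Event> events = new ArrayList<>();
        for (int i = 0; i < statusCodes.length; i++) {
            events.add(new Event("http.try",
                Map.of("try_number", i, "http.status_code", statusCodes[i])));
        }
        return events;
    }
}
```

Note that all attempts still share one span context here, which is exactly the ambiguity described above — the downstream service cannot tell which try a given traceparent refers to.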

Redirects

Async body stream reading/writing

Long polling and streaming

Other instrumentation concerns

Context Propagation

Sampling

Status

  • users can override span statuses in SpanProcessors
  • client libraries that make HTTP calls usually know a bit more about particular status codes and can make a better success/failure decision
  • an extra manual span that wraps an HTTP call with retries may also set the overall status (making it explicit that the operation underneath was eventually successful)
  • vendors may, and do, provide additional tricks to help ignore certain errors
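For reference, one possible status mapping (the name matches the httpCodeToStatus helper used in the pseudo code later in this post). The rule sketched here is my reading of the HTTP conventions for CLIENT spans — 4xx and 5xx map to Error, everything else stays Unset so libraries and users can refine it — treat it as an assumption, not the normative mapping.

```java
// Sketch of a status mapping for CLIENT spans: 4xx/5xx are errors from
// the client's point of view; success is left Unset rather than forced
// to Ok, so higher layers can override it.
public class HttpStatusMapper {
    enum StatusCode { UNSET, OK, ERROR }

    static StatusCode httpCodeToStatus(int httpStatus) {
        return httpStatus >= 400 ? StatusCode.ERROR : StatusCode.UNSET;
    }
}
```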

Request/response body

Lower levels

  • I haven’t heard about a need to instrument lower levels from customers (so this is my availability bias)
  • We don’t have many examples of such instrumentations (Wireshark doesn’t count)
  • We should be cautious about performance and the amount of telemetry (since it seems most customers don’t need it)
  • We may express them as logs/events (with verbosity) that customers opt in to.
  • They can be added incrementally and don’t have to block progress on HTTP span semantics as long as there is consensus on the direction

Higher Levels

  • they can create an extra span for nested retries/redirects
  • they can measure the time for response body stream reading
  • they frequently know the real status of HTTP spans and the result of the overall logical operation
  • they know extra details about underlying services that are important to users
  • they also know the context propagation protocols supported by the downstream service

Instrumentation pseudo code

// reduces attribute collection perf impact
// if there is no OTel SDK
instrumentationEnabled = tracer.isEnabled()
if (instrumentationEnabled) {
    samplingAttributes = getSamplingAttributes()
    span = tracer.startSpan(
        spanName,
        parentContext,
        SpanKind.CLIENT,
        samplingAttributes)
    // next layers of instrumentation can discover this span
    scope = span.makeCurrent()

    if (span.isRecorded()) {
        // add other non-sampling-related attributes
        span.addAttribute(...)
    }

    if (span.getContext().isValid()) {
        // may internally take care of cleaning up
        // context before injection
        propagator.inject(span.getContext(), setter)
    }
}
try {
    response = await send(request)

    if (instrumentationEnabled && span.isRecorded()) {
        span.setAttributes(STATUS_CODE, response.statusCode)
        span.setStatus(httpCodeToStatus(response.statusCode))
    }
} catch (Exception ex) {
    if (instrumentationEnabled && span.isRecorded()) {
        span.setException(ex)
    }
    throw ex
} finally {
    if (instrumentationEnabled) {
        scope.close()
        span.end()
    }
}

Liudmila Molkova

Software engineer working on Azure SDKs. Azure Monitor in the past. Distributed tracing enthusiast.