Azure Messaging SDKs instrumentation

Sep 19, 2021

As we’re working on the new version of the OpenTelemetry messaging semantic conventions, I wanted to share some historical context and the design choices we made when instrumenting the Azure SDKs for Event Hubs and Service Bus, before the current conventions were created.

Overview

Tracing an individual message is somewhat similar to tracing a synchronous HTTP request, but it also involves brokers, settlement, and multiple deliveries. Things like configuration and the time a message spends between producer and consumer become important for observability.

With any sort of batching involved, the relationship between spans can no longer be described as parent-child, and in highly batched aggregation scenarios existing trace visualization tools no longer work.

Instrumentation

When instrumenting libraries, we follow a few principles:
- support the generic case: tracing should work for every kind of API we have (batching, no batching). We don’t assume how library APIs are used.
- don’t require users to write any tracing code: we instrument public API-level calls and make sure they correlate with everything else.
- enable generic tracing needs: when/where/how a message was sent, when/where/how it was processed, and enough hints for the tracing backend to visualize it.
- be performant, especially when tracing is not enabled.
- start from the minimum and add things based on feedback. It keeps users' telemetry bills reasonable and leaves more wiggle room to add things in the future without breaking anyone.

Sending

Our API looks like createMessage, sendMessage, or sendMessages. That is, sending can happen in different ways: you can send one message or a batch, or create a message and send it later from a background task along with other independent messages as a performance optimization.

Since we want to support the generic case, we should assume the worst: a batch where messages are created in different scopes and sent in the background.

Message span

We introduced a span for message creation to assign a unique context to each message. A message can’t simply carry the parent context: it would then be impossible to distinguish multiple messages created in the scope of the same span. While a span for a local operation that always succeeds is not typical, we have no other means to create a new context and record it.

It’s a producer span, which is a hint for the backend that this is an async call.

We set trace context on the message during construction and never change it.

Sometimes messages are created within a send(content) call. We still generate the message span to keep instrumentation uniform, so backends and users always know what to expect. In this case the message and send spans are siblings. Perhaps violating that uniformity principle, we omit links in this case as an optimization for the backend.
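
To make the idea concrete, here is a minimal sketch of stamping a unique context on each message at creation. It is not the Azure SDK code: the class names, the helper methods, and the "traceparent" property key are assumptions made for illustration. Only the header layout (version-traceId-spanId-flags) follows the W3C Trace Context format.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for tracing types; this is not the Azure SDK or OpenTelemetry API.
final class MessageSpanSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    static final class SpanContext {
        final String traceId; // 32 hex characters
        final String spanId;  // 16 hex characters

        SpanContext(String traceId, String spanId) {
            this.traceId = traceId;
            this.spanId = spanId;
        }

        // W3C traceparent layout: version-traceId-spanId-flags ("01" means sampled).
        String toTraceparent() {
            return "00-" + traceId + "-" + spanId + "-01";
        }
    }

    static final class Message {
        final String body;
        final Map<String, String> properties = new HashMap<>();

        Message(String body) {
            this.body = body;
        }
    }

    static String randomHex(int byteCount) {
        byte[] bytes = new byte[byteCount];
        RANDOM.nextBytes(bytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    static SpanContext newTrace() {
        return new SpanContext(randomHex(16), randomHex(8));
    }

    // Models the "message" producer span: same trace as the caller, new span id.
    // The context is stamped once at construction and never changed afterwards.
    static Message createMessage(String body, SpanContext current) {
        SpanContext messageSpan = new SpanContext(current.traceId, randomHex(8));
        Message message = new Message(body);
        message.properties.put("traceparent", messageSpan.toTraceparent()); // assumed key
        return message;
    }
}
```

Two messages created under the same span end up in the same trace but with different span ids, which is what lets a backend tell them apart.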

Send span

We have another span for the actual send call. It has links to the contexts of all the messages it carries. The call may be retried on the user side, but the retry spans would have links to the same messages. The send span has all the information needed to uniquely identify the broker.

It’s a client span: it’s a normal RPC call, and the UX doesn’t need any hint indicating an async pattern.

We don’t instrument protocol-level calls (AMQP), but we might, and they would be children of the send span.
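
The relationship between a send span and the messages it carries can be sketched roughly as below. The Span and Kind types are hypothetical and only model the shape of the telemetry; they are not the Azure SDK or OpenTelemetry API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical span model for illustration; not the Azure SDK or OpenTelemetry API.
final class SendSpanSketch {
    enum Kind { PRODUCER, CLIENT, CONSUMER, INTERNAL }

    static final class Span {
        final String name;
        final Kind kind;
        final List<String> links; // contexts of the messages this call carries

        Span(String name, Kind kind, List<String> links) {
            this.name = name;
            this.kind = kind;
            this.links = Collections.unmodifiableList(new ArrayList<>(links));
        }
    }

    // One client span per send attempt, linked to every message context in the batch.
    static Span startSendSpan(List<String> messageContexts) {
        return new Span("send", Kind.CLIENT, messageContexts);
    }
}
```

Because the contexts live on the messages, a retry simply calls startSendSpan again with the same list: a new span, but the same links.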

Receiving

In our case, messaging brokers push messages to consumers, so the receive call is local: no request is made to the broker. If enough messages are already waiting to be retrieved, it has zero duration. When it does take time to retrieve messages, the duration mostly reflects how many messages the producer creates and says little about the consumer.

Regardless of the push or pull model, the receive span starts before we know what it is going to receive. So there is no way to attach the received messages to the already started span. Also, the sampling decision cannot be based on the received message contexts.

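The call is local, its duration has limited usefulness, and sampling leaves a low chance (sampling-probability²) of the receive span being sampled consistently with the rest of the telemetry. For example, if the receive span and the message trace are sampled independently at 10%, both are recorded only about 1% of the time. So we decided that instrumenting it brings low value and postponed it.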

This leaves problems like consumer misconfiguration undetectable by traces. We may also miss scaling issues, although they can still be deduced from the time messages spend between producer and consumer.

If we did instrument it, we’d make it a client span, since we didn’t find a need for any specific hint for the UX.

Consuming

The message processing API is the most interesting one.

Nothing stops our users from just receiving messages and processing them in a for loop. In this case, we can’t help users trace their messages and leave it up to manual instrumentation. We might provide APIs to help extract the context, so that users don’t have to know the context propagation details.

We instrument handlers, which many users choose to use. Handlers invoke user code with batches or individual messages, similar to:

Consumer.subscribe(events -> {...})

We still don’t know the app logic though: an app may consume a batch and aggregate it, making tracing of individual messages irrelevant, or it may still have some logic per message.

This span is a consumer span, a hint for the backend UX indicating async call processing.
It has links to all messages being consumed. Each link carries information on when the message was sent (stamped by the broker), so we can calculate the time spent in the queue. If the API supports a single message, we make the message context the parent of the consumer span; this helps since not all backends support links.
The consumer span has all the information needed to identify the broker. It is presumably a root span and starts a new trace.
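
As a rough sketch of how the consumer span, its links, and the time-in-queue calculation fit together: all type and method names below are hypothetical and chosen for illustration, and the enqueuedTime field stands in for whatever timestamp the broker stamps on the message.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration; not the Azure SDK or OpenTelemetry API.
final class ConsumerSpanSketch {
    static final class Link {
        final String messageContext; // context stamped on the message by the producer
        final Instant enqueuedTime;  // when the broker accepted the message

        Link(String messageContext, Instant enqueuedTime) {
            this.messageContext = messageContext;
            this.enqueuedTime = enqueuedTime;
        }
    }

    static final class ConsumerSpan {
        final Instant startTime;
        final String parentContext; // null for a batch: the span is a root
        final List<Link> links;

        ConsumerSpan(Instant startTime, String parentContext, List<Link> links) {
            this.startTime = startTime;
            this.parentContext = parentContext;
            this.links = links;
        }
    }

    // Batch handler: root span with one link per message.
    static ConsumerSpan startBatchSpan(List<Link> batch, Instant now) {
        return new ConsumerSpan(now, null, new ArrayList<>(batch));
    }

    // Single-message handler: the message context is linked and also becomes the parent.
    static ConsumerSpan startSingleSpan(Link message, Instant now) {
        List<Link> links = new ArrayList<>();
        links.add(message);
        return new ConsumerSpan(now, message.messageContext, links);
    }

    // Time each linked message waited between broker enqueue and processing start.
    static List<Duration> timeInQueue(ConsumerSpan span) {
        List<Duration> waits = new ArrayList<>();
        for (Link link : span.links) {
            waits.add(Duration.between(link.enqueuedTime, span.startTime));
        }
        return waits;
    }
}
```

The calculation compares a broker timestamp with a consumer timestamp, so it is only as accurate as the clocks on the two machines are aligned.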

If users want, they can trace individual messages by creating new spans based on message context.
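
A sketch of what such manual per-message tracing could look like, assuming the message context is available as a W3C traceparent string. The parse and startChild helpers and the SpanContext type are hypothetical; a real application would use its tracing library's propagation API instead.

```java
// Hypothetical helper for illustration; not the Azure SDK or OpenTelemetry API.
final class PerMessageSpanSketch {
    static final class SpanContext {
        final String traceId;
        final String spanId;
        final String parentSpanId; // null when the parent is unknown

        SpanContext(String traceId, String spanId, String parentSpanId) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
        }
    }

    // Parses a W3C traceparent value: version-traceId-spanId-flags.
    static SpanContext parse(String traceparent) {
        String[] parts = traceparent.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("not a traceparent: " + traceparent);
        }
        return new SpanContext(parts[1], parts[2], null);
    }

    // Starts a span for one message, continuing the trace the producer started.
    static SpanContext startChild(SpanContext messageContext, String newSpanId) {
        return new SpanContext(messageContext.traceId, newSpanId, messageContext.spanId);
    }
}
```

Inside a batch handler, a user would call parse and startChild once per message, so each message's processing lands in the trace where that message was created.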

Settlement

Settlement calls are system-specific: you settle a message or an offset. Letting users know about settlement is important for debugging applications: did the app settle, did it do so properly, and so on. We trace it as an internal span (for reasons that are not relevant here, and this is likely to change).

When we handle settlement ourselves (in the handler pattern), we make it a child of the processing span. When users settle manually, though, we don’t control whether it happens within the processing scope.
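
The handler-pattern case can be sketched as follows. This is an illustration under assumptions, not the SDK implementation: the Span type, the handle method, and the "complete" and "abandon" outcome names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical handler wrapper for illustration; not the Azure SDK API.
final class SettlementSketch {
    static final class Span {
        final String name;
        final String kind;
        final Span parent; // null for a root span

        Span(String name, String kind, Span parent) {
            this.name = name;
            this.kind = kind;
            this.parent = parent;
        }
    }

    // Runs the user handler inside a "process" span, then records settlement
    // as an internal span that is a child of the process span.
    static List<Span> handle(String message, Consumer<String> userHandler) {
        List<Span> recorded = new ArrayList<>();
        Span process = new Span("process", "CONSUMER", null);
        recorded.add(process);
        String outcome;
        try {
            userHandler.accept(message);
            outcome = "complete"; // assumed name for successful settlement
        } catch (RuntimeException e) {
            outcome = "abandon";  // assumed name for settlement after a failure
        }
        recorded.add(new Span(outcome, "INTERNAL", process));
        return recorded;
    }
}
```

Because the library owns the whole callback here, it can guarantee the parent-child relationship; with manual settlement that guarantee depends on where the user makes the call.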