Telemetry design

PyTango’s telemetry support is split between public API and internal runtime machinery.

tango.telemetry is the public Python API for telemetry enums, endpoint objects, and tracer-provider factory hooks.
tango._telemetry is the internal runtime/configuration layer. It parses environment variables, tracks OpenTelemetry package availability, manages the client tracer provider lifecycle, and exposes the runtime hooks used by the rest of PyTango.
tango._instrumentation is the internal tracing wrapper layer. It decides when client and server calls create spans, and when only trace-context propagation should happen.

For device servers, cppTango owns the runtime telemetry configuration. That includes environment-variable defaults, database properties, and admin device commands. When cppTango changes telemetry state on a device, it calls the virtual DeviceImpl methods. PyTango’s trampoline callbacks react by refreshing the Python-side tracer provider or switching to no-op tracing for future spans.

For pure clients, PyTango lazily creates a singleton-like tracer provider from the current runtime configuration, which is derived from the telemetry-related environment variables. Since environment variables cannot be changed at runtime, there is no effective way to enable or disable tracing, change endpoints, or change topics for the client tracer at runtime. However, the code exists to refresh the tracer if such a change does occur. If the tracer-provider factory is replaced, the cached client tracer is invalidated and recreated on the next traced call.
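The lazy creation and invalidation described above can be sketched in a few lines. This is a minimal illustration, not PyTango's implementation: `ClientTracerCache`, `FakeProvider`, and `set_factory` are hypothetical names invented for this example, and the provider is a plain stand-in object rather than an OpenTelemetry class.

```python
import threading
from typing import Callable, Optional


class FakeProvider:
    """Stand-in for a tracer provider; not an OpenTelemetry class."""

    def __init__(self, label: str) -> None:
        self.label = label


class ClientTracerCache:
    """Hypothetical model of a lazily created, singleton-like provider."""

    def __init__(self, factory: Callable[[], FakeProvider]) -> None:
        self._factory = factory
        self._provider: Optional[FakeProvider] = None
        self._lock = threading.Lock()

    def get(self) -> FakeProvider:
        # Build the provider on the first traced call only.
        with self._lock:
            if self._provider is None:
                self._provider = self._factory()
            return self._provider

    def set_factory(self, factory: Callable[[], FakeProvider]) -> None:
        # Replacing the factory drops the cached provider; the next
        # get() rebuilds it with the new factory.
        with self._lock:
            self._factory = factory
            self._provider = None


cache = ClientTracerCache(lambda: FakeProvider("default"))
first = cache.get()
same = cache.get()
cache.set_factory(lambda: FakeProvider("custom"))
rebuilt = cache.get()
```

Here `first` and `same` are the same object, while `rebuilt` is a new provider produced by the replacement factory.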

When tracing endpoints change at runtime, PyTango shuts down the old Python tracer provider and constructs a new one with the current configuration. This applies to device tracers but not to the client tracer. Runtime changes affect future spans only.
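The "future spans only" rule can be illustrated with a small model. Everything here is an assumption made for the example: `DeviceTracing`, `FakeProvider`, `on_endpoint_changed`, and the endpoint strings are hypothetical and do not correspond to PyTango or OpenTelemetry names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FakeProvider:
    """Stand-in for a tracer provider bound to one endpoint."""

    endpoint: str
    is_shut_down: bool = False

    def shutdown(self) -> None:
        self.is_shut_down = True


class DeviceTracing:
    """Hypothetical holder for one device's Python tracer provider."""

    def __init__(self, endpoint: str) -> None:
        self.provider = FakeProvider(endpoint)

    def on_endpoint_changed(self, endpoint: str) -> FakeProvider:
        # Shut down the old provider, then construct a replacement
        # from the current configuration.
        old = self.provider
        old.shutdown()
        self.provider = FakeProvider(endpoint)
        return old

    def start_span(self, name: str) -> Tuple[str, str]:
        # A span is tied to whichever provider is current when it starts.
        return (name, self.provider.endpoint)


tracing = DeviceTracing("collector-a:4317")
before = tracing.start_span("read_attribute")
retired = tracing.on_endpoint_changed("collector-b:4317")
after = tracing.start_span("read_attribute")
```

The span started before the change stays associated with the original endpoint; only the span started afterwards uses the new one.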

Context propagation and span emission

Since version 10.3.0, PyTango treats trace-context propagation and Python span emission as related but separate decisions. That matters because cppTango can often continue an existing distributed trace even when PyTango itself is not meant to emit a client or server span.

On the client side, the runtime has three modes:

If telemetry is disabled, PyTango does nothing special and just calls through.
If telemetry is enabled and client tracing is enabled, PyTango creates a Python client span and injects that context into cppTango.
If telemetry is enabled but client tracing is disabled, PyTango still injects the current OpenTelemetry context into cppTango, but it does not create a Python client span.

In simplified form, the client-side flow is:

current Python OTel context
            |
            v
    PyTango client wrapper
            |
            +-- telemetry disabled ----------> call cppTango directly
            |
            +-- telemetry enabled
                    |
                    +-- client tracing enabled
                    |        |
                    |        +--> create Python client span
                    |        +--> inject context into cppTango
                    |        +--> call cppTango
                    |
                    +-- client tracing disabled
                             |
                             +--> no Python client span
                             +--> inject context into cppTango
                             +--> call cppTango

This means that setting TANGO_TELEMETRY_TYPES=logging or TANGO_TELEMETRY_TYPES=none, or using a no-op Python tracer provider, does not automatically break end-to-end context propagation. The trace can still continue through cppTango, as long as telemetry is enabled and the OpenTelemetry API packages are available.
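The three client-side modes can be written down as a small decision function. This is a sketch under stated assumptions: `ClientMode`, `select_client_mode`, `traced_call`, and `fake_cpptango_call` are hypothetical names, and trace contexts are modelled as plain strings rather than OpenTelemetry objects.

```python
from enum import Enum
from typing import Callable, List, Optional


class ClientMode(Enum):
    CALL_THROUGH = "call through"
    SPAN_AND_INJECT = "create span, then inject"
    INJECT_ONLY = "inject without span"


def select_client_mode(telemetry_enabled: bool,
                       client_tracing_enabled: bool) -> ClientMode:
    """Pick one of the three client-side modes."""
    if not telemetry_enabled:
        return ClientMode.CALL_THROUGH
    if client_tracing_enabled:
        return ClientMode.SPAN_AND_INJECT
    return ClientMode.INJECT_ONLY


def traced_call(call: Callable[[Optional[str]], str],
                current_context: Optional[str],
                telemetry_enabled: bool,
                client_tracing_enabled: bool,
                emitted_spans: List[str]) -> str:
    """Run `call`, passing it the context that would be injected.

    Contexts are plain strings here; None means nothing is injected.
    """
    mode = select_client_mode(telemetry_enabled, client_tracing_enabled)
    if mode is ClientMode.CALL_THROUGH:
        return call(None)
    injected = current_context
    if mode is ClientMode.SPAN_AND_INJECT:
        # The new client span becomes the context seen downstream.
        injected = f"{current_context or 'root'}/client-span"
        emitted_spans.append(injected)
    return call(injected)


def fake_cpptango_call(injected: Optional[str]) -> str:
    # Stand-in for the real call; it only reports what it received.
    return f"received {injected}"
```

In propagation-only mode the downstream side still receives the caller's context unchanged, which is the property that keeps the trace connected.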

The same separation exists on the server side. If a device method is eligible to emit a Python server span, PyTango extracts the incoming cppTango context and starts a server span from it. If span emission is suppressed, for example because tracing is disabled or the selected topics do not include the relevant wrapper, PyTango still attaches the incoming context to the current Python OpenTelemetry context for the duration of the method call. That allows nested outgoing client calls made by the device method to continue the same trace without requiring a Python server span.

In simplified form, the server-side flow is:

incoming cppTango trace context
            |
            v
    PyTango device wrapper
            |
            +-- emit server span
            |        |
            |        +--> extract cppTango context
            |        +--> create Python server span
            |        +--> run device method
            |
            +-- propagate only
            |        |
            |        +--> extract cppTango context
            |        +--> attach Python OTel context
            |        +--> run device method
            |
            +-- no context available
                     |
                     +--> run device method without trace context

inside device method
            |
            v
    nested DeviceProxy call
            |
            +--> inject current Python context into cppTango
            +--> downstream call continues the trace

This is why topic filtering only controls span emission. It does not act as a propagation filter. For example, using a topic set that suppresses user-level PyTango spans can still preserve the incoming trace context for downstream client calls originating inside the server method.

The executor integrations use the same model. When PyTango delegates work through the asyncio, gevent, or futures executors, they capture the current OpenTelemetry context and pass it explicitly via the internal trace_context argument. The client wrappers then inject that explicit context into cppTango even in propagation-only mode, so an executor handoff does not require a Python client span to keep the trace connected.
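The explicit handoff can be sketched with `concurrent.futures` and `contextvars`. The `trace_context` keyword mirrors the argument name mentioned above, but `client_call` and `submit_with_context` are hypothetical helpers, and a context variable again stands in for the OpenTelemetry context.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the current Python OpenTelemetry context.
_current_trace = contextvars.ContextVar("current_trace", default=None)


def client_call(trace_context=None):
    # An explicitly passed context takes precedence over whatever
    # happens to be ambient in the worker thread.
    if trace_context is not None:
        return trace_context
    return _current_trace.get()


def submit_with_context(executor, fn):
    # Capture the context in the submitting thread and pass it along
    # explicitly, instead of relying on the worker's ambient context.
    captured = _current_trace.get()
    return executor.submit(fn, trace_context=captured)


_current_trace.set("trace-abc")
with ThreadPoolExecutor(max_workers=1) as executor:
    with_capture = submit_with_context(executor, client_call).result()
```

Passing the context as an argument avoids depending on whether a worker thread inherits context variables from the thread that submitted the work.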

Mixed telemetry and tracing configuration

In a distributed trace, propagation and span emission are separate behaviours. Propagation passes the active trace context to the next actor. Span emission records local work for export to a collector.

For example, consider a chain with a client, a parent device, and a child device:

client ---> parent device ---> child device

If all actors have telemetry enabled and tracing enabled, the expected result is one connected trace:

client span
    |
    +-- parent device span
            |
            +-- child device span

If the parent device propagates context but does not emit spans, the trace can still stay connected. The collector will not show local parent-device work, but the child span can still be linked to the incoming client context:

client span
    |
    +-- child device span

In practice, this is the behaviour to aim for when span export is not wanted but trace continuity still matters.

If the parent device has tracing disabled inside cppTango, or telemetry disabled completely, then it may not preserve the incoming context for the downstream child call. The child device can still emit spans if its own telemetry is enabled, but those spans can start a separate trace:

client span

child device span

The same rule applies to any actor in the chain. An actor that does not emit spans can still preserve trace continuity if it continues to propagate context. An actor that neither emits nor propagates telemetry breaks the trace at that point; downstream actors cannot recover the original parent context later.
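The three outcomes above can be reproduced with a toy simulation. The `Mode` enum and `run_chain` function are invented for this example; they model each actor as emitting, propagating only, or dropping the context, and report the spans a collector would see as `(actor, trace_id, parent)` tuples.

```python
import itertools
from enum import Enum
from typing import List, Optional, Tuple


class Mode(Enum):
    EMIT = "emit span and propagate"
    PROPAGATE_ONLY = "propagate without span"
    DROP = "neither emit nor propagate"


Span = Tuple[str, int, Optional[str]]  # (actor, trace_id, parent actor)


def run_chain(actors: List[Tuple[str, Mode]]) -> List[Span]:
    """Walk the call chain and return the spans that would be exported."""
    spans: List[Span] = []
    trace_ids = itertools.count(1)
    context: Optional[Tuple[int, str]] = None  # (trace_id, parent actor)
    for name, mode in actors:
        if mode is Mode.EMIT:
            if context is None:
                # No incoming context: this span starts a new trace.
                trace_id, parent = next(trace_ids), None
            else:
                trace_id, parent = context
            spans.append((name, trace_id, parent))
            context = (trace_id, name)
        elif mode is Mode.DROP:
            # The incoming context is lost for everything downstream.
            context = None
        # PROPAGATE_ONLY leaves the context untouched.
    return spans


all_on = run_chain([("client", Mode.EMIT), ("parent", Mode.EMIT),
                    ("child", Mode.EMIT)])
propagate_only = run_chain([("client", Mode.EMIT),
                            ("parent", Mode.PROPAGATE_ONLY),
                            ("child", Mode.EMIT)])
broken = run_chain([("client", Mode.EMIT), ("parent", Mode.DROP),
                    ("child", Mode.EMIT)])
```

In `propagate_only` the child span is parented directly on the client span within the same trace; in `broken` the child span has no parent and a different trace id.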

Current limitations

PyTango now treats context propagation as separate from span emission, but there are still cppTango-side cases where the correct context does not reach PyTango. Once cppTango has dropped or replaced the incoming context before PyTango enters a device method, PyTango cannot reconstruct it later.

Topic-filtered cppTango kernel spans

cppTango currently couples two concerns in its internal telemetry macros: activating propagated context, and creating hidden kernel spans. When those hidden spans are filtered out by topic, context propagation can be affected too.

On the server side, with newer opentelemetry-cpp versions, a dropped hidden kernel span can propagate a non-sampled context. Downstream Python spans then disappear, because the Python OpenTelemetry SDK correctly honors the propagated sampling decision.
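Why a non-sampled parent context makes downstream spans disappear can be shown with a simplified parent-based sampling check. The sketch assumes the context is carried in the W3C `traceparent` format, where the last field holds the trace flags and bit `0x01` means "sampled"; whether that exact format is what crosses the cppTango boundary is an assumption of this example, and `should_record` is a hypothetical helper rather than the SDK's sampler.

```python
from typing import Optional

SAMPLED_FLAG = 0x01


def is_sampled(traceparent: str) -> bool:
    """Read the sampled bit from a W3C traceparent string."""
    _version, _trace_id, _parent_id, flags = traceparent.split("-")
    return bool(int(flags, 16) & SAMPLED_FLAG)


def should_record(parent_traceparent: Optional[str]) -> bool:
    """Simplified parent-based rule: roots are recorded, children
    follow the sampling decision carried by their parent."""
    if parent_traceparent is None:
        return True
    return is_sampled(parent_traceparent)


sampled_parent = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
unsampled_parent = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-00"
```

A child span whose parent carries flags `00` is not recorded, which matches the behaviour described above: the Python side is following the propagated decision, not losing the span by mistake.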

On the client side, the hidden client/kernel span can replace an already-active user span before the trace context is serialized for the outgoing request. If that hidden span is filtered out by topic, the server can receive the wrong parent context, and the exported parent-child relationship is no longer the expected direct chain.

The intended behaviour is that topic filtering controls which cppTango spans are emitted, without changing the active context used for propagation. For example, using only the user topic should suppress cppTango kernel spans but still keep the user-level trace connected.

This is tracked in cppTango #1645.

Tracing disabled in cppTango

There is also a related limitation when telemetry remains enabled but tracing is disabled inside cppTango. In that mode, cppTango currently does not preserve the active trace context for downstream calls. Nested client calls made from a device method can therefore start disconnected traces, even though they are logically part of the incoming request.

That limitation looks like:

incoming request
      |
      v
cppTango server path
      |
      +-- tracing enabled ------> context reaches PyTango wrapper
      |
      +-- tracing disabled -----> context dropped before PyTango sees it
                                   |
                                   +--> PyTango cannot reattach it later
                                   +--> nested downstream call starts
                                        without the original trace context

If the goal is to avoid exporting spans while preserving propagation, the current workaround is to keep tracing enabled but configure no tracing endpoints. That keeps the propagation path active without exporting cppTango spans. Disabling telemetry completely also avoids telemetry overhead, but then propagation and telemetry-backed logging are disabled too.

This is tracked in cppTango #1652.

Dependency and topic handling

Dependency handling is also runtime-oriented. Importing tango does not warn merely because OpenTelemetry packages are missing. Warnings are emitted only when telemetry is actually requested. If opentelemetry-api is missing, PyTango can neither propagate context nor emit Python spans. If the SDK or exporter packages are missing, cppTango telemetry can still be enabled, but Python spans fall back to no-op behaviour.
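A runtime-oriented dependency check of this kind might look like the sketch below. It is not PyTango's code: `module_available` and `resolve_python_tracing` are hypothetical helpers, the returned labels are invented, and exporter packages are left out for brevity. The module names `opentelemetry.trace` and `opentelemetry.sdk` are the ones shipped by the opentelemetry-api and opentelemetry-sdk distributions.

```python
import importlib.util
import warnings
from typing import Callable


def module_available(name: str) -> bool:
    """Report whether a module could be imported, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package of a dotted name is missing.
        return False


def resolve_python_tracing(
    telemetry_requested: bool,
    available: Callable[[str], bool] = module_available,
) -> str:
    """Decide the Python-side tracing level, warning only on request."""
    if not telemetry_requested:
        # Nothing was asked for, so missing packages are not reported.
        return "disabled"
    if not available("opentelemetry.trace"):
        warnings.warn("opentelemetry-api missing: no propagation or spans",
                      RuntimeWarning)
        return "unavailable"
    if not available("opentelemetry.sdk"):
        warnings.warn("opentelemetry-sdk missing: Python spans are no-op",
                      RuntimeWarning)
        return "no-op"
    return "sdk"
```

Injecting the `available` probe keeps the decision testable without installing or uninstalling packages.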

Topic support is currently limited in practice. PyTango documents user and all as the stable topics to rely on:

user covers the user-level spans emitted around device methods.
all additionally enables the currently surfaced kernel-level tracing.
Some topic-producing spans originate in cppTango, while PyTango mainly surfaces the configuration API and applies the topic filter to its Python-side tracing wrappers.
Additional topic names may be present in cppTango, but PyTango does not yet treat them as a stable public tracing contract.
The longer-term default-topic decision is still under discussion, so avoid baking in assumptions beyond the documented user and all behaviour.