# Telemetry design

PyTango's telemetry support is split
between public API and internal runtime machinery.

- `tango.telemetry` is the public Python API
  for telemetry enums, endpoint objects,
  and tracer-provider factory hooks.
- `tango._telemetry` is the internal runtime/configuration layer.
  It parses environment variables,
  tracks OpenTelemetry package availability,
  manages the client tracer provider lifecycle,
  and exposes the runtime hooks used by the rest of PyTango.
- `tango._instrumentation` is the internal tracing wrapper layer.
  It decides when client and server calls create spans,
  and when only trace-context propagation should happen.

For device servers,
cppTango owns the runtime telemetry configuration.
That includes environment-variable defaults,
database properties,
and admin device commands.
When cppTango changes telemetry state on a device,
it calls the virtual `DeviceImpl` methods.
PyTango's trampoline callbacks react
by refreshing the Python-side tracer provider
or switching to no-op tracing for future spans.

For pure clients,
PyTango lazily creates a singleton-like tracer provider
from the current runtime configuration.
This depends on the telemetry-related environment variables.
Since environment variables cannot be changed at runtime,
there is no effective way to
enable or disable tracing, change endpoints, or change topics
for the client tracer at runtime.
However,
the code exists to refresh the tracer
if such a change were detected.
If the tracer-provider factory is replaced,
the cached client tracer is invalidated
and recreated on the next traced call.
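
The lazy creation and invalidation described above
can be illustrated with a minimal sketch.
`ClientTracerCache`, `factory_a`, and `factory_b` are hypothetical names
used only for illustration;
they are not part of PyTango's API,
and the real implementation may differ.

```python
from typing import Callable, List, Optional


class ClientTracerCache:
    """Hypothetical stand-in for a lazily created, cached tracer provider."""

    def __init__(self, factory: Callable[[], object]) -> None:
        self._factory = factory
        self._provider: Optional[object] = None

    def get(self) -> object:
        # Build the provider on first use only.
        if self._provider is None:
            self._provider = self._factory()
        return self._provider

    def set_factory(self, factory: Callable[[], object]) -> None:
        # Replacing the factory drops the cached provider;
        # the next get() call rebuilds it with the new factory.
        self._factory = factory
        self._provider = None


calls: List[str] = []


def factory_a() -> object:
    calls.append("a")
    return object()


def factory_b() -> object:
    calls.append("b")
    return object()


cache = ClientTracerCache(factory_a)
first = cache.get()    # factory_a runs here, not at construction time
same = cache.get()     # cached: factory_a is not called again
cache.set_factory(factory_b)
rebuilt = cache.get()  # factory_b runs on the next request
```

The point of the sketch is that replacing the factory does no work by itself;
the cost of building a provider is deferred to the next traced call.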

When tracing endpoints change at runtime,
PyTango shuts down the old Python tracer provider
and constructs a new one with the current configuration.
This applies to device tracers
but not the client tracer.
Runtime changes affect future spans only.

## Context propagation and span emission

Since version 10.3.0,
PyTango treats
trace-context propagation
and Python span emission
as related but separate decisions.
That matters because
cppTango can often continue
an existing distributed trace
even when PyTango itself
is not meant to emit a client
or server span.

On the client side,
the runtime has three modes:

- If telemetry is disabled,
  PyTango does nothing special
  and just calls through.
- If telemetry is enabled
  and client tracing is enabled,
  PyTango creates a Python client span
  and injects that context into cppTango.
- If telemetry is enabled
  but client tracing is disabled,
  PyTango still injects the current OpenTelemetry context
  into cppTango,
  but it does not create a Python client span.

In simplified form,
the client-side flow is:

```text
current Python OTel context
            |
            v
    PyTango client wrapper
            |
            +-- telemetry disabled ----------> call cppTango directly
            |
            +-- telemetry enabled
                    |
                    +-- client tracing enabled
                    |        |
                    |        +--> create Python client span
                    |        +--> inject context into cppTango
                    |        +--> call cppTango
                    |
                    +-- client tracing disabled
                             |
                             +--> no Python client span
                             +--> inject context into cppTango
                             +--> call cppTango
```
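
The same decision can be written as a small function.
This is a sketch of the logic only:
`ClientMode` and `select_client_mode` are hypothetical names,
and the two boolean inputs stand in for
whatever runtime state PyTango actually consults.

```python
from enum import Enum


class ClientMode(Enum):
    PASS_THROUGH = "call cppTango directly"
    SPAN_AND_INJECT = "create client span, inject context"
    INJECT_ONLY = "inject context, no client span"


def select_client_mode(
    telemetry_enabled: bool, client_tracing_enabled: bool
) -> ClientMode:
    # Telemetry off: nothing special, just call through.
    if not telemetry_enabled:
        return ClientMode.PASS_THROUGH
    # Telemetry on: context is injected in both remaining modes;
    # only the creation of a Python client span differs.
    if client_tracing_enabled:
        return ClientMode.SPAN_AND_INJECT
    return ClientMode.INJECT_ONLY
```

Note that the client-tracing flag has no effect once telemetry is disabled.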

This means that setting
`TANGO_TELEMETRY_TYPES=logging` or
`TANGO_TELEMETRY_TYPES=none`,
or using a no-op Python tracer provider,
does not automatically break
end-to-end context propagation.
The trace can still continue
through cppTango,
as long as telemetry is enabled
and the OpenTelemetry API packages
are available.

The same separation exists
on the server side.
If a device method is eligible
to emit a Python server span,
PyTango extracts the incoming cppTango context
and starts a server span from it.
If span emission is suppressed,
for example because tracing is disabled
or the selected topics do not include
the relevant wrapper,
PyTango still attaches the incoming context
to the current Python OpenTelemetry context
for the duration of the method call.
That allows nested outgoing client calls
made by the device method
to continue the same trace
without requiring a Python server span.

In simplified form,
the server-side flow is:

```text
incoming cppTango trace context
            |
            v
    PyTango device wrapper
            |
            +-- emit server span
            |        |
            |        +--> extract cppTango context
            |        +--> create Python server span
            |        +--> run device method
            |
            +-- propagate only
            |        |
            |        +--> extract cppTango context
            |        +--> attach Python OTel context
            |        +--> run device method
            |
            +-- no context available
                     |
                     +--> run device method without trace context

inside device method
            |
            v
    nested DeviceProxy call
            |
            +--> inject current Python context into cppTango
            +--> downstream call continues the trace
```

This is why topic filtering
only controls span emission.
It does not act
as a propagation filter.
For example,
using a topic set
that suppresses user-level PyTango spans
can still preserve the incoming trace context
for downstream client calls
originating inside the server method.

The executor integrations
use the same model.
When PyTango delegates work
through the asyncio,
gevent,
or futures executors,
the executor captures the current OpenTelemetry context
and passes it explicitly
via the internal `trace_context` argument.
The client wrappers then inject that explicit context
into cppTango
even in propagation-only mode,
so an executor handoff
does not require a Python client span
to keep the trace connected.
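
The reason for an explicit handoff is that
work submitted to another thread
does not necessarily run with the submitter's context.
The sketch below illustrates the capture-and-pass pattern
with `concurrent.futures` and a `ContextVar` standing in for the OpenTelemetry context.
Only the `trace_context` argument name comes from the text above;
`client_call` and `submit_with_context` are hypothetical.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional

# Stand-in for the active Python OpenTelemetry context: a trace ID or None.
_active_trace = contextvars.ContextVar("active_trace", default=None)


def client_call(trace_context: Optional[str] = None) -> Optional[str]:
    # Prefer the explicitly passed context; fall back to the ambient one.
    # The return value is the trace ID that would be injected into cppTango.
    if trace_context is not None:
        return trace_context
    return _active_trace.get()


def submit_with_context(
    executor: ThreadPoolExecutor, fn: Callable[..., Optional[str]]
) -> Future:
    # Capture in the submitting thread, where the context is still active.
    captured = _active_trace.get()
    return executor.submit(fn, trace_context=captured)


_active_trace.set("trace-42")
with ThreadPoolExecutor(max_workers=1) as pool:
    # Worker threads usually start with an empty context,
    # so the ambient lookup alone would typically miss the trace.
    without_handoff = pool.submit(client_call).result()
    # The explicit argument keeps the trace connected regardless.
    with_handoff = submit_with_context(pool, client_call).result()
```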

## Mixed telemetry and tracing configuration

In a distributed trace,
propagation and span emission
are separate behaviours.
Propagation passes the active trace context
to the next actor.
Span emission records local work
for export to a collector.

For example,
consider a chain with a client,
a parent device,
and a child device:

```text
client ---> parent device ---> child device
```

If all actors have telemetry enabled
and tracing enabled,
the expected result is one connected trace:

```text
client span
    |
    +-- parent device span
            |
            +-- child device span
```

If the parent device propagates context
but does not emit spans,
the trace can still stay connected.
The collector will not show local parent-device work,
but the child span can still be linked
to the incoming client context:

```text
client span
    |
    +-- child device span
```

In practice,
this is the behaviour to aim for
when span export is not wanted
but trace continuity still matters.

If the parent device has tracing disabled
inside cppTango,
or telemetry disabled completely,
then it may not preserve the incoming context
for the downstream child call.
The child device can still emit spans
if its own telemetry is enabled,
but those spans can start a separate trace:

```text
client span

child device span
```

The same rule applies to any actor
in the chain.
An actor that does not emit spans
can still preserve trace continuity
if it continues to propagate context.
An actor that neither emits spans nor propagates context
breaks the trace at that point;
downstream actors cannot recover
the original parent context later.
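
The three scenarios can be reproduced with a toy model
that tracks only trace IDs along the chain.
`Actor` and `run_chain` are hypothetical,
and the model ignores parent-child span links and sampling;
it only shows where a trace stays connected or splits.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Actor:
    name: str
    propagates: bool   # passes the active context to the next actor
    emits_spans: bool  # records local work for export


def run_chain(actors: List[Actor]) -> List[Tuple[str, str]]:
    """Return (actor name, trace ID) for every exported span."""
    spans: List[Tuple[str, str]] = []
    new_ids = itertools.count(1)
    context: Optional[str] = None  # context handed to the current actor
    for actor in actors:
        if actor.emits_spans:
            if context is None:
                # No incoming context: this span starts a new trace.
                context = f"trace-{next(new_ids)}"
            spans.append((actor.name, context))
        if not actor.propagates:
            # The context is lost here; later actors cannot recover it.
            context = None
    return spans


client = Actor("client", propagates=True, emits_spans=True)
child = Actor("child", propagates=True, emits_spans=True)

all_on = run_chain([client, Actor("parent", True, True), child])
propagate_only = run_chain([client, Actor("parent", True, False), child])
broken = run_chain([client, Actor("parent", False, False), child])
```

In `propagate_only` the parent is absent from the exported spans
but the child shares the client's trace ID;
in `broken` the child ends up in a second trace.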

## Current limitations

PyTango now treats context propagation
as separate from span emission,
but there are still cppTango-side cases
where the correct context does not reach PyTango.
Once cppTango has dropped
or replaced the incoming context
before PyTango enters a device method,
PyTango cannot reconstruct it later.

### Topic-filtered cppTango kernel spans

cppTango currently couples two concerns
in its internal telemetry macros:
activating propagated context,
and creating hidden kernel spans.
When those hidden spans are filtered out by topic,
context propagation can be affected too.

On the server side,
with newer `opentelemetry-cpp` versions,
a dropped hidden kernel span can propagate
a non-sampled context.
Downstream Python spans then disappear,
because the Python OpenTelemetry SDK correctly honours
the propagated sampling decision.
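
This effect follows from parent-based sampling,
where a span with a remote parent inherits the parent's sampled flag.
The function below is a minimal model of that rule,
not SDK code;
`should_record` is a hypothetical name,
and it assumes the Python side uses a parent-based sampler.

```python
from typing import Optional


def should_record(parent_sampled: Optional[bool], root_decision: bool = True) -> bool:
    # No parent context: the local (root) decision applies.
    if parent_sampled is None:
        return root_decision
    # With a parent, its sampling decision is inherited unchanged,
    # so a non-sampled parent suppresses every downstream span.
    return parent_sampled
```

Under this model,
a context that arrives marked as non-sampled
silences the Python spans below it
even though tracing is enabled locally.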

On the client side,
the hidden client/kernel span can replace
an already-active user span
before the trace context is serialized
for the outgoing request.
If that hidden span is filtered out by topic,
the server can receive the wrong parent context,
and the exported parent-child relationship
is no longer the expected direct chain.

The intended behaviour is that topic filtering
controls which cppTango spans are emitted,
without changing the active context used
for propagation.
For example,
using only the `user` topic
should suppress cppTango kernel spans
but still keep the user-level trace connected.

This is tracked in
[cppTango #1645](https://gitlab.com/tango-controls/cppTango/-/work_items/1645).

### Tracing disabled in cppTango

There is also a related limitation
when telemetry remains enabled
but tracing is disabled
inside cppTango.
In that mode,
cppTango currently does not preserve
the active trace context for downstream calls.
Nested client calls made from a device method
can therefore start disconnected traces,
even though they are logically part
of the incoming request.

That limitation looks like:

```text
incoming request
      |
      v
cppTango server path
      |
      +-- tracing enabled ------> context reaches PyTango wrapper
      |
      +-- tracing disabled -----> context dropped before PyTango sees it
                                   |
                                   +--> PyTango cannot reattach it later
                                   +--> nested downstream call starts
                                        without the original trace context
```

If the goal is to avoid exporting spans
while preserving propagation,
the current workaround is to keep tracing enabled
but configure no tracing endpoints.
That keeps the propagation path active
without exporting cppTango spans.
Disabling telemetry completely
also avoids telemetry overhead,
but then propagation and telemetry-backed logging
are disabled too.

This is tracked in
[cppTango #1652](https://gitlab.com/tango-controls/cppTango/-/work_items/1652).

### Dependency and topic handling

Dependency handling is also runtime-oriented.
Importing `tango` does not warn
merely because OpenTelemetry packages are missing.
Warnings are emitted only
when telemetry is actually requested.
Missing `opentelemetry-api`
means PyTango cannot propagate context
or emit Python spans.
Missing SDK/exporter packages
still allows cppTango telemetry to be enabled,
but Python spans fall back to no-op behaviour.
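
A runtime availability check of this kind
can be sketched with `importlib.util.find_spec`,
which looks a module up without importing it.
`has_module` and `python_tracing_capability` are hypothetical helpers,
and the returned labels are illustrative;
this is not how PyTango necessarily performs the check.

```python
import importlib.util


def has_module(name: str) -> bool:
    # find_spec raises ModuleNotFoundError when a parent package is missing,
    # so treat that the same as "not available".
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def python_tracing_capability(api_available: bool, sdk_available: bool) -> str:
    if not api_available:
        # Without the API package there is no context to propagate
        # and no way to create spans.
        return "no propagation, no spans"
    if not sdk_available:
        # The API alone yields no-op spans.
        return "no-op spans"
    return "spans can be emitted"


# opentelemetry.trace ships with opentelemetry-api,
# opentelemetry.sdk with opentelemetry-sdk.
capability = python_tracing_capability(
    has_module("opentelemetry.trace"),
    has_module("opentelemetry.sdk"),
)
```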

Topic support is currently limited in practice.
PyTango documents `user` and `all`
as the stable topics to rely on:

- `user` covers the user-level spans emitted around device methods.
- `all` additionally enables the currently surfaced kernel-level tracing.
- Some topic-producing spans originate in cppTango,
  while PyTango mainly surfaces the configuration API
  and applies the topic filter
  to its Python-side tracing wrappers.
- Additional topic names may be present in cppTango,
  but PyTango does not yet treat them
  as a stable public tracing contract.
- The longer-term default-topic decision is still under discussion,
  so avoid baking in assumptions
  beyond the documented `user` and `all` behaviour.
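
As a rough illustration of the `user` and `all` semantics,
a topic filter can be modelled as a set-membership test
that affects span emission only.
The function names are hypothetical,
and `"kernel"` is used here as an assumed label for kernel-level spans,
not as a documented topic name.

```python
from typing import Set


def emits_span(span_topic: str, enabled_topics: Set[str]) -> bool:
    # "all" enables every surfaced topic, including user-level spans.
    return "all" in enabled_topics or span_topic in enabled_topics


def propagates_context(telemetry_enabled: bool, enabled_topics: Set[str]) -> bool:
    # Topics are deliberately ignored: filtering controls emission,
    # not propagation.
    return telemetry_enabled
```

With `{"user"}` enabled,
`emits_span("kernel", {"user"})` is false
while `propagates_context(True, {"user"})` stays true,
which is the intended separation described in this document.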
