OpenTelemetry from Scratch

Observability tooling has been fragmented for years. Prometheus for metrics. Jaeger for traces. ELK for logs. Different agents, different formats, different query languages.

OpenTelemetry changes this. It’s a single standard for collecting telemetry data - traces, metrics, and logs - from your applications. Instrument once, send anywhere.

This guide covers how to implement OpenTelemetry from scratch, including the parts the documentation glosses over.

Why OpenTelemetry

Before OpenTelemetry, every observability vendor had their own instrumentation. If you used Datadog, you used Datadog’s SDK. Switch to New Relic? Rewrite your instrumentation.

OpenTelemetry is vendor-neutral. You instrument your code with OTel, then send data to whatever backend you want - Jaeger, Zipkin, Prometheus, Datadog, Honeycomb, or all of them simultaneously.

The other benefit is correlation. When traces, metrics, and logs share the same context (trace IDs, service names, resource attributes), debugging becomes dramatically easier. You can go from an error log to the exact trace that caused it.

The Three Signals

OpenTelemetry handles three types of telemetry:

Traces show the path of a request through your system. A single trace contains multiple spans, each representing a unit of work. When a request hits your API, calls a database, and returns a response - that’s one trace with multiple spans.

Metrics are numerical measurements over time. Request count, latency percentiles, CPU usage, queue depth. Metrics tell you what’s happening in aggregate.

Logs are discrete events. Error messages, audit records, debug output. Logs tell you what happened at a specific moment.

Each signal has its strengths. Traces for understanding request flow. Metrics for alerting and dashboards. Logs for detailed debugging. Together, they give complete observability.
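
To make the trace and span relationship concrete, here is a minimal sketch of the data model. The Span class and the child_of helper are illustrative stand-ins, not the OpenTelemetry SDK types; the ID widths (16-byte trace ID, 8-byte span ID) follow the W3C Trace Context format.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Simplified stand-in for a span; not the OpenTelemetry SDK class."""
    name: str
    trace_id: str                      # shared by every span in one trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None    # None marks the root span


def child_of(parent: Span, name: str) -> Span:
    """Create a span in the same trace, linked to its parent."""
    return Span(name=name, trace_id=parent.trace_id, parent_id=parent.span_id)


# One request: the API handler span is the root, the database call is its child
root = Span(name="GET /orders", trace_id=secrets.token_hex(16))
db = child_of(root, "SELECT orders")
```

Both spans carry the same trace_id, which is what lets a backend reassemble them into one trace.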

Basic Architecture

A typical OpenTelemetry setup has three components:

Instrumentation in your application code generates telemetry data. This can be automatic (agent-based) or manual (SDK calls).

The Collector receives telemetry, processes it, and exports it to backends. It’s optional - you can send directly to backends - but recommended for production.

Backends store and query the data. Jaeger for traces, Prometheus for metrics, Loki for logs. Or all-in-one platforms like Grafana Cloud.

Instrumenting a Python Application

Let’s instrument a Flask application. First, install the packages:

pip install opentelemetry-api \
    opentelemetry-sdk \
    opentelemetry-instrumentation-flask \
    opentelemetry-instrumentation-requests \
    opentelemetry-exporter-otlp

Basic setup:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure the tracer
resource = Resource.create({
    "service.name": "my-api",
    "service.version": "1.0.0",
    "deployment.environment": "production"
})

provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument Flask and requests
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Your Flask app
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello, World!"

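This automatically traces incoming Flask requests and outgoing HTTP calls made through the requests library. FlaskInstrumentor().instrument() has to run before the Flask app object is created, which is why the setup code comes first. No manual span creation is needed for basic visibility.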

Adding Custom Spans

Auto-instrumentation covers HTTP boundaries. For visibility into your business logic, add custom spans:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        
        # Validate order
        with tracer.start_as_current_span("validate_order"):
            validate(order_id)
        
        # Charge payment
        with tracer.start_as_current_span("charge_payment"):
            charge(order_id)
        
        # Send confirmation
        with tracer.start_as_current_span("send_confirmation"):
            notify(order_id)

Now when you view a trace, you’ll see the breakdown of time spent in each step.

Setting Up the Collector

The OpenTelemetry Collector receives, processes, and exports telemetry. Deploy it with this configuration:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

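exporters:
  # The dedicated jaeger exporter was removed from recent Collector
  # releases. Jaeger accepts OTLP directly, so use the otlp exporter.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889

  # Check that your Collector build still includes the loki exporter;
  # recent Loki versions can also ingest OTLP directly.
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]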
    
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

This collector receives OTLP data and fans it out to Jaeger (traces), Prometheus (metrics), and Loki (logs).
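
A common mistake in collector configs is referencing a component in a pipeline without defining it at the top level. The sketch below checks for that on a config that has already been loaded into a Python dict (for example with a YAML parser). The function name undefined_components is hypothetical, and the Collector’s own startup validation remains the authoritative check.

```python
from typing import List


def undefined_components(config: dict) -> List[str]:
    """List pipeline references that have no matching top-level definition."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline, spec in pipelines.items():
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind, {})
            for name in spec.get(kind, []):
                if name not in defined:
                    problems.append(
                        f"{pipeline}: {kind[:-1]} '{name}' is not defined"
                    )
    return problems


# Trimmed-down config: memory_limiter is used in the pipeline but never defined
example = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}},
    "exporters": {"prometheus": {}},
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["prometheus"],
            }
        }
    },
}
problems = undefined_components(example)
```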

For Kubernetes, deploy as a DaemonSet or Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel/config.yaml"]
          ports:
            - containerPort: 4317
            - containerPort: 4318
          volumeMounts:
            - name: config
              mountPath: /etc/otel
      volumes:
        - name: config
          configMap:
            name: otel-collector-config

Metrics with OpenTelemetry

Metrics work similarly to traces. Create a meter and record measurements:

import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Configure metrics (reuses the `resource` object from the tracing setup)
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://collector:4317"),
    export_interval_millis=60000
)
provider = MeterProvider(metric_readers=[reader], resource=resource)
metrics.set_meter_provider(provider)

# Create instruments
meter = metrics.get_meter(__name__)

request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)

latency_histogram = meter.create_histogram(
    "http_request_duration_seconds",
    description="HTTP request latency"
)

# Use them
def handle_request():
    start = time.time()
    
    # ... handle request ...
    
    duration = time.time() - start
    request_counter.add(1, {"method": "GET", "status": "200"})
    latency_histogram.record(duration, {"method": "GET", "endpoint": "/api"})
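
Timing code like this tends to get copied into every handler. One way to avoid that is a small decorator, sketched below. The name measured is hypothetical; it only assumes that the counter has an add(value, attributes) method and the histogram a record(value, attributes) method, which matches how the instruments are called above.

```python
import time
from functools import wraps


def measured(counter, histogram, attributes):
    """Count calls and record their duration in seconds.

    `counter` needs an add(value, attributes) method and `histogram`
    a record(value, attributes) method.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on exceptions alike
                histogram.record(time.perf_counter() - start, attributes)
                counter.add(1, attributes)
        return wrapper
    return decorator
```

With the instruments defined earlier, usage would look like @measured(request_counter, latency_histogram, {"method": "GET"}) above a handler function. Note that the attributes are fixed at decoration time in this sketch, so per-request values such as the status code would need a different design.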

Logs with Context

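The newest addition to OpenTelemetry is logs. The key feature is correlation - logs include trace and span IDs automatically. In the Python packages the logs modules are still underscore-prefixed (_logs), which signals that this API may change between releases.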

import logging
from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

# Configure logging (reuses the `resource` object from the tracing setup)
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(
    BatchLogRecordProcessor(OTLPLogExporter(endpoint="http://collector:4317"))
)
set_logger_provider(logger_provider)

# Add OTel handler to Python logging
handler = LoggingHandler(level=logging.INFO, logger_provider=logger_provider)
logging.getLogger().addHandler(handler)

# Now regular logging includes trace context
logger = logging.getLogger(__name__)
logger.info("Processing order", extra={"order_id": "12345"})

When the log call happens inside an active span, the exported record carries that span’s trace ID. When you view the log in Loki or Grafana, click through to see the full trace.
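
If you also write plain-text logs to the console, the same correlation can be approximated with a standard logging filter. This is a sketch using only the standard library: TraceContextFilter and the get_ids callable are hypothetical names, and the IDs in the demo are fixed example values rather than real span context.

```python
import logging


class TraceContextFilter(logging.Filter):
    """Attach trace_id and span_id to every record passing through."""

    def __init__(self, get_ids):
        super().__init__()
        # get_ids: hypothetical callable returning (trace_id, span_id) as ints
        self._get_ids = get_ids

    def filter(self, record):
        trace_id, span_id = self._get_ids()
        # Same hex widths as the W3C Trace Context header
        record.trace_id = format(trace_id, "032x")
        record.span_id = format(span_id, "016x")
        return True


def example_ids():
    # Fixed example values; a real setup would read the current span context
    return 0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7


handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter(example_ids))
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))

logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Processing order")
```

With the OpenTelemetry API installed, get_ids could read trace.get_current_span().get_span_context() and return its trace_id and span_id fields.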

Common Patterns

Service mesh integration. If you’re using Istio or Linkerd, they generate traces at the network level. Configure them to use the same trace headers (W3C Trace Context), and OTel traces will connect to service mesh traces.
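
For reference, W3C Trace Context travels in a traceparent HTTP header of the form version-traceid-parentid-flags. The sketch below parses and formats version 00 of that header with the standard library. The helper names are made up for illustration, and in practice the OTel propagators handle this for you.

```python
import re
from typing import NamedTuple, Optional

# Version "00": 32 hex trace ID, 16 hex parent span ID, 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(trace_id, parent_id, sampled=bool(int(flags, 16) & 0x01))


def format_traceparent(tp: TraceParent) -> str:
    """Serialize back to a header value, keeping only the sampled flag."""
    flags = "01" if tp.sampled else "00"
    return f"00-{tp.trace_id}-{tp.parent_id}-{flags}"


parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

If the mesh and your services disagree on header format (for example B3 versus traceparent), the two sets of spans end up in separate traces, which is why aligning on one format matters.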

Sampling. Not every request needs to be traced. Configure sampling to reduce volume:

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

provider = TracerProvider(
    resource=resource,
    sampler=TraceIdRatioBased(0.1)  # Sample 10% of traces
)
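
Ratio-based sampling decides from the trace ID itself rather than rolling a die per span, so every span of a trace gets the same verdict. The sketch below shows the general idea: compare the low 64 bits of the trace ID against ratio * 2**64. It is an illustration of the technique, not the SDK’s code, and should_sample is a made-up name.

```python
import random


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


# Simulate 10,000 random 128-bit trace IDs at a 10% ratio
rng = random.Random(42)
ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in ids)
```

Because the decision is a pure function of the trace ID, two services using the same ratio agree without coordinating. In multi-service setups it is also common to wrap the sampler in ParentBased so that a service follows the decision already made by its caller.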

Baggage. Pass context across services without it appearing in spans:

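from opentelemetry import baggage, context

# set_baggage returns a new Context; attach it so it becomes current
token = context.attach(baggage.set_baggage("user.id", "12345"))
try:
    ...  # outgoing calls made here propagate the value downstream
finally:
    context.detach(token)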
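
On the wire, baggage travels in a baggage HTTP header as comma-separated key=value pairs with percent-encoded values. The sketch below shows that encoding with the standard library. The function names are hypothetical, keys are assumed to be plain tokens that need no encoding, and the real propagator handles more edge cases.

```python
from typing import Dict
from urllib.parse import quote, unquote


def encode_baggage(entries: Dict[str, str]) -> str:
    """Serialize key/value pairs into a baggage header value."""
    return ",".join(
        f"{key}={quote(value, safe='')}" for key, value in entries.items()
    )


def decode_baggage(header: str) -> Dict[str, str]:
    """Parse a baggage header value, ignoring optional ;properties."""
    entries = {}
    for member in header.split(","):
        if "=" not in member:
            continue
        key, _, value = member.partition("=")
        entries[key.strip()] = unquote(value.split(";", 1)[0].strip())
    return entries


header = encode_baggage({"user.id": "12345", "tier": "gold plan"})
```

Since baggage is sent in plain headers to every downstream service, avoid putting secrets or large values in it.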

Kubernetes Auto-instrumentation

For Kubernetes, the OTel Operator can auto-instrument pods without code changes:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
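  exporter:
    # Python auto-instrumentation exports OTLP over HTTP by default,
    # so target the collector's HTTP port (4318) rather than gRPC (4317)
    endpoint: http://otel-collector:4318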
  propagators:
    - tracecontext
    - baggage
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest

Then annotate your deployments:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-python: "true"

The operator injects the instrumentation automatically. No code changes required.

Making Sense of the Data

Collecting telemetry is pointless without using it effectively.

Start with traces for debugging. When something breaks, find the trace. The spans show exactly where time was spent and where errors occurred.

Use metrics for alerting. Don’t alert on traces. Traces are usually sampled, so counts derived from them are incomplete, while metrics aggregate every request. Alert when error_rate exceeds your SLO.
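
As a worked example of that alert condition, the sketch below turns an SLO target into an allowed error rate and compares it against counts from one evaluation window. The function names and the 99.9% figure are assumptions for illustration. In practice this logic usually lives in your alerting rules rather than in application code.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests in a window; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def slo_breached(errors: int, total: int, slo_target: float) -> bool:
    """True when the error rate exceeds the budget implied by the SLO target."""
    allowed = 1.0 - slo_target  # e.g. a 99.9% target allows 0.1% errors
    return error_rate(errors, total) > allowed


# 5 failures out of 1,000 requests is 0.5%, above the 0.1% allowed at 99.9%
breached = slo_breached(errors=5, total=1000, slo_target=0.999)
```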

Correlate through trace IDs. Error log → find trace ID → view trace → understand context. This workflow should take seconds.

Build service maps. Most tracing backends can generate service dependency maps from trace data. Use these to understand your architecture.
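
The idea behind a service map is simple enough to sketch: whenever a span’s parent belongs to a different service, that is a dependency edge. The record layout and the service_edges name below are assumptions for illustration, since real backends work from their own stored span format.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# (span_id, parent_span_id, service_name): a simplified span record
SpanRow = Tuple[str, Optional[str], str]


def service_edges(spans: Iterable[SpanRow]) -> Counter:
    """Count caller -> callee service pairs across parent/child spans."""
    rows = list(spans)
    service_of: Dict[str, str] = {span_id: svc for span_id, _, svc in rows}
    edges: Counter = Counter()
    for _, parent_id, svc in rows:
        parent_svc = service_of.get(parent_id) if parent_id else None
        # Only cross-service hops are dependencies; skip in-process children
        if parent_svc and parent_svc != svc:
            edges[(parent_svc, svc)] += 1
    return edges


spans = [
    ("a", None, "gateway"),
    ("b", "a", "gateway"),
    ("c", "b", "orders"),
    ("d", "c", "payments"),
    ("e", "c", "orders"),
]
edges = service_edges(spans)
```

Here the result contains two edges, gateway to orders and orders to payments, each seen once. Spans dropped by sampling do not contribute, so edge counts from sampled traces indicate structure rather than exact call volume.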

Getting Started

Week 1: Instrument one service with traces. Send to Jaeger. Get comfortable with the trace view.

Week 2: Add the collector. Configure sampling. Add a second service.

Week 3: Add metrics. Create a basic Grafana dashboard.

Week 4: Add logs with trace correlation. Practice the debug workflow.

OpenTelemetry has a learning curve, but the payoff is significant. Unified observability beats cobbled-together tools every time.