The three pillars, demystified

Logs, metrics, and traces are the three telemetry types every observability stack supports. They are not interchangeable, and using them as if they were is the source of most observability bills that get out of control.

  • Logs are discrete events with rich context. They answer "what happened in this specific operation?" They are expensive to store at scale, expensive to search, and easy to abuse.
  • Metrics are pre-aggregated counters and gauges. They answer "how is the system behaving in aggregate?" They are cheap, fast to query, and excellent for alerts and dashboards.
  • Traces are end-to-end records of a single request as it crosses services. They answer "where did this slow or failed request actually spend its time?" They are critical in distributed systems and far less valuable in a monolith, where a profiler usually answers the same question.

The rule of thumb is to use metrics for everything you want to alert on or chart, traces for everything you need to debug across services, and logs for everything else — and to be deliberately stingy about that "everything else".

Logs that actually help

Logs are where most observability budgets quietly die. Teams log everything, then complain that searching is slow and expensive. A few disciplines make logs dramatically more useful at a fraction of the cost.

  • Structured logging, always. Emit JSON or another structured format so that search and aggregation become possible. Free-text logs are a debugging tool, not an observability tool. (See the sketch after this list.)
  • Sample debug logs in production. Full-fidelity debug logging is rarely necessary for steady-state traffic. Sample at a low rate, and turn it up on demand when needed.
  • Correlation IDs everywhere. Every log line should carry a request ID (or trace ID) so you can reconstruct a single business operation across services.
  • Never log secrets or PII. This is a compliance issue, not a stylistic one. Build the log pipeline so accidental PII can be redacted or quarantined.
  • Set retention by purpose. 7 days of full logs, 30 days of error logs, 1 year of audit logs. Treat each tier differently. Do not pay archive prices for noise.
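
To make the first three disciplines concrete, here is a minimal sketch in plain Python using only the standard library. The logger name, request ID, and 1 percent sample rate are illustrative choices, not a prescription; a real service would take the request ID from an incoming header or the active trace ID.

  import json
  import logging
  import random

  class JsonFormatter(logging.Formatter):
      """Render each record as one JSON object per line."""
      def format(self, record):
          return json.dumps({
              "ts": self.formatTime(record),
              "level": record.levelname,
              "msg": record.getMessage(),
              # Correlation ID, attached via the `extra` kwarg at the call site.
              "request_id": getattr(record, "request_id", None),
          })

  class DebugSampler(logging.Filter):
      """Pass everything above DEBUG; keep only a sampled fraction of DEBUG."""
      def __init__(self, rate=0.01):
          super().__init__()
          self.rate = rate

      def filter(self, record):
          return record.levelno > logging.DEBUG or random.random() < self.rate

  handler = logging.StreamHandler()
  handler.setFormatter(JsonFormatter())
  log = logging.getLogger("checkout")
  log.addHandler(handler)
  log.setLevel(logging.DEBUG)
  log.addFilter(DebugSampler(rate=0.01))  # raise the rate on demand when debugging

  log.info("order placed", extra={"request_id": "req-42"})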

Metrics that matter

Useful metrics map directly to the questions you want to answer. The classic frameworks help shape the catalog:

  • RED method (for services): Rate (requests per second), Errors (error rate), Duration (latency distribution). Almost every service should expose these; a sketch follows this list.
  • USE method (for resources): Utilization, Saturation, Errors. For infrastructure — CPU, memory, disk, network.
  • The four golden signals (Google SRE): Latency, traffic, errors, saturation. Essentially RED plus saturation from USE.
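
As a sketch of what the RED method looks like in practice, here is an example using the official prometheus_client library for Python. The metric and label names are illustrative, and a business counter such as orders placed would be mechanically identical: one more Counter.

  import time
  from prometheus_client import Counter, Histogram, start_http_server

  REQUESTS = Counter("http_requests_total", "Requests served",
                     ["method", "status"])
  LATENCY = Histogram("http_request_duration_seconds", "Request latency")

  def handle(method):
      start = time.monotonic()
      status = "200"
      try:
          pass  # real handler work goes here
      except Exception:
          status = "500"
          raise
      finally:
          REQUESTS.labels(method=method, status=status).inc()  # Rate and Errors
          LATENCY.observe(time.monotonic() - start)            # Duration

  start_http_server(8000)  # exposes a /metrics endpoint for scraping
  handle("GET")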

Beyond technical metrics, the highest-leverage move is to add a small set of business metrics directly into the observability stack. "Orders placed per minute", "payments processed", "active users". These metrics tell the team whether the business is healthy in a way no infrastructure metric ever can. When the business metric drops, you know there is a real problem, regardless of what the CPU graph says.

Avoid metric explosion. Every dimension you add multiplies cardinality. High-cardinality metrics (per-user, per-tenant, per-request-ID) belong in traces or logs, not metrics. The single largest metrics bill we have helped reduce was caused by a single label that carried a unique customer ID.
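
The multiplication is easy to underestimate. The numbers below are hypothetical, but the shape is typical:

  # method (5) x status (10) x endpoint (50)          =       2,500 series
  # ...then add a customer_id label (100,000 values)  = 250,000,000 series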

Distributed tracing as the connective tissue

In a distributed system, the difference between an observable architecture and a guessing game is distributed tracing. A trace shows the full journey of a request — every service, every database call, every external API — with timing information at each step. Without it, debugging cross-service issues becomes archaeology.

Modern tracing systems work via propagation: a request enters the system, gets assigned a trace ID, and that ID is passed in headers to every downstream service. Each service records spans (operations) under that trace ID. The observability backend assembles the spans into a tree, showing exactly where time was spent.
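
A minimal sketch of that propagation with the OpenTelemetry SDK for Python, assuming the default W3C traceparent propagator; the service and span names are placeholders.

  from opentelemetry import trace
  from opentelemetry.propagate import inject
  from opentelemetry.sdk.trace import TracerProvider

  trace.set_tracer_provider(TracerProvider())
  tracer = trace.get_tracer("checkout-service")

  with tracer.start_as_current_span("place-order") as span:
      span.set_attribute("order.id", "o-123")
      with tracer.start_as_current_span("charge-card"):
          headers = {}
          inject(headers)  # writes a traceparent header carrying the trace ID
          print(headers)   # e.g. {'traceparent': '00-<trace id>-<span id>-01'}
          # A real service would pass these headers on the downstream call;
          # the callee extracts them and records its spans under the same
          # trace ID, which is how the backend assembles the tree.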

Tracing also pays off for cost analysis (which service is causing the most database load?), capacity planning (which spans are creeping up in latency?), and incident response (which span had the error?). For services that handle business-critical flows, the first 80 percent of debugging happens in the trace before anyone looks at logs.

Service-Level Objectives make observability concrete

Observability data without targets is just data. SLOs (Service-Level Objectives) close the loop by defining what "healthy" actually means for each service. A typical SLO has three parts:

  • An indicator (SLI). A specific metric, like "percentage of requests that return successfully in under 200 ms".
  • A target. For example, 99.5 percent over a rolling 28-day window.
  • An error budget. The complement of the target: 0.5 percent of requests are allowed to fail or be slow. If you exceed the budget, you stop new features and prioritize reliability. (A worked example follows this list.)
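
A worked example of the budget arithmetic in Python, using the 99.5 percent target above; the request counts are made up:

  target = 0.995
  window_requests = 10_000_000   # requests seen in the 28-day window
  bad_requests = 35_000          # failed, or slower than 200 ms

  budget = (1 - target) * window_requests  # 50,000 bad requests allowed
  consumed = bad_requests / budget         # 0.70
  print(f"error budget consumed: {consumed:.0%}")  # -> 70%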

The discipline of SLOs replaces vague phrases like "the system is slow today" with actionable language: "we have consumed 70 percent of our latency error budget this month." That precision changes prioritization conversations because everyone knows what the system is supposed to do.

Cost discipline matters more than people admit

Observability platforms can become one of the largest line items in a modern engineering budget. Without discipline, telemetry volume compounds: more services, more replicas, more logs per replica, more dimensions per metric. Then someone notices the invoice and the panic refactor begins.

A few practices prevent that:

  • Use sampling for high-volume traces. 100 percent tracing is rarely necessary or affordable; a sampler sketch follows this list.
  • Tier log storage. Hot search for a few days; cheap object storage for everything older.
  • Audit metric cardinality regularly. Cap labels that explode.
  • Drop telemetry that nobody queries. If a metric or log type has not been touched in 90 days, retire it.
  • Charge teams for what they emit. Internal showback gives engineers the right incentive.
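
For the sampling point, here is what a head-based sampler looks like with the OpenTelemetry SDK for Python; the 10 percent ratio is an arbitrary example, not a recommendation.

  from opentelemetry import trace
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

  # Keep roughly 10% of traces. ParentBased honors the upstream decision,
  # so a sampled trace stays complete across services instead of having holes.
  sampler = ParentBased(root=TraceIdRatioBased(0.10))
  trace.set_tracer_provider(TracerProvider(sampler=sampler))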

Why OpenTelemetry is the safer bet

Every observability vendor has its own agent, SDK, and conventions. Choosing a vendor used to mean rewriting your instrumentation later if you ever switched. OpenTelemetry (OTel) is the cross-industry standard that decouples instrumentation from the backend. You instrument once, and the data can flow to any compatible backend — Datadog, New Relic, Splunk, Honeycomb, Grafana, or anything else.
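
The decoupling shows up directly in the setup code: instrumentation talks only to the OTel API, and the destination is an exporter detail. A sketch assuming the OTLP/gRPC exporter package and a local Collector endpoint:

  from opentelemetry import trace
  from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import BatchSpanProcessor

  provider = TracerProvider()
  # Swapping backends means repointing the Collector (or this endpoint),
  # not rewriting any instrumentation.
  provider.add_span_processor(
      BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317")))
  trace.set_tracer_provider(provider)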

For new systems, OTel is essentially the default. For older systems on vendor-specific agents, migration is a multi-quarter effort, but it is worth doing for any team that expects to still be running the same systems two years from now. The portability also creates negotiating leverage at contract renewal time, which by itself can fund the migration.

Final takeaway

Observability is about reducing the surprise area of your system. Logs, metrics, and traces each answer different questions. SLOs translate the data into decisions. Cost discipline keeps the whole thing sustainable. Spend more time on the few signals that matter than on the dashboards that look impressive.

Need help designing your observability strategy?

If you want to design SLOs, modernize an existing observability stack, or untangle a runaway telemetry bill, we can help.

Talk to Soutello IT about observability