The Three Pillars of Observability — What They Actually Mean in Practice
Observability Theater
A client had Datadog, Grafana, PagerDuty, and Sentry all running. Dashboards everywhere. Alerts firing constantly. But when a production outage hit, nobody could answer the basic question: "What changed and what's affected?"
They had monitoring. They didn't have observability. The difference: monitoring tells you something is wrong. Observability helps you figure out why.
Pillar 1: Structured Logs
Unstructured logs are useless at scale. If your logs look like this, you can't search, filter, or aggregate them:
❌ Unstructured:
2026-04-17 10:23:45 ERROR Failed to process order for customer
2026-04-17 10:23:45 INFO Order processing started
2026-04-17 10:23:46 WARNING Payment retry attempt 2
Structured logs are JSON with consistent fields:
// Structured logging setup
import pino from "pino";
const logger = pino({
level: process.env.LOG_LEVEL || "info",
formatters: {
level: (label) => ({ level: label }),
},
base: {
service: "order-service",
version: process.env.APP_VERSION,
environment: process.env.NODE_ENV,
},
});
// Usage
logger.info({
orderId: "ord_123",
customerId: "cust_456",
action: "order.process.started",
amount: 99.50,
paymentMethod: "card",
}, "Processing order");
// Output:
// {"level":"info","service":"order-service","version":"1.2.3",
// "environment":"production","orderId":"ord_123","customerId":"cust_456",
// "action":"order.process.started","amount":99.50,"msg":"Processing order"}Now you can search: "Show me all logs where orderId = ord_123" or "Show me all ERROR logs for the order-service in the last hour."
The Correlation ID Pattern
Every request gets a unique ID that flows through every service:
import crypto from "node:crypto"; // randomUUID is also available globally in Node 19+

// Middleware: attach correlation ID to every request
function correlationMiddleware(req, res, next) {
const correlationId = req.headers["x-correlation-id"] || crypto.randomUUID();
req.correlationId = correlationId;
res.setHeader("x-correlation-id", correlationId);
// Attach to logger context for this request
req.log = logger.child({ correlationId, requestId: crypto.randomUUID() });
next();
}
// Now every log in this request includes correlationId
// When this service calls another service, pass the header:
// headers: { "x-correlation-id": req.correlationId }

Pillar 2: Metrics
Metrics are aggregated numbers over time. They answer "how much?" and "how fast?":
// Key metrics every service should expose
// (sketch using prom-client, a common Prometheus client for Node)
import { Counter, Histogram, Gauge } from "prom-client";

const metrics = {
// RED metrics (Rate, Errors, Duration)
requestCount: new Counter({
name: "http_requests_total",
help: "Total HTTP requests",
labelNames: ["method", "path", "status"],
}),
requestDuration: new Histogram({
name: "http_request_duration_seconds",
help: "HTTP request latency in seconds",
labelNames: ["method", "path"],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
}),
errorCount: new Counter({
name: "http_errors_total",
help: "Total HTTP 5xx responses",
labelNames: ["method", "path", "error_type"],
}),
// USE metrics (Utilization, Saturation, Errors) for resources
dbConnectionPool: new Gauge({ name: "db_connection_pool_size", help: "Open DB connections" }),
dbConnectionWaiting: new Gauge({ name: "db_connection_waiting", help: "Requests waiting for a connection" }),
cacheHitRate: new Gauge({ name: "cache_hit_rate", help: "Cache hit ratio (0-1)" }),
};
// Instrument in middleware
app.use((req, res, next) => {
const start = Date.now();
res.on("finish", () => {
const duration = (Date.now() - start) / 1000;
// req.route is an object in Express; use its path (fall back to req.path)
const path = req.route?.path ?? req.path;
metrics.requestCount.inc({ method: req.method, path, status: String(res.statusCode) });
metrics.requestDuration.observe({ method: req.method, path }, duration);
if (res.statusCode >= 500) {
metrics.errorCount.inc({ method: req.method, path, error_type: "5xx" });
}
});
next();
});

The Four Golden Signals
Google's SRE book defines four signals every service needs:
1. Latency: How long requests take (p50, p95, p99)
2. Traffic: How many requests per second
3. Errors: What percentage of requests fail
4. Saturation: How full your resources are (CPU, memory, disk, connections)
If you monitor nothing else, monitor these four. They'll catch 90% of production issues.
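For latency in particular, averages hide tail pain; p50/p95/p99 are read off the sorted sample. A minimal sketch of the nearest-rank method (the sample latencies are made up):

```typescript
// Nearest-rank percentile over raw latency samples (milliseconds).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based nearest rank
  return sorted[Math.max(0, rank - 1)];
}

const latenciesMs = [12, 15, 18, 22, 25, 30, 45, 60, 120, 900];
console.log(percentile(latenciesMs, 50)); // 25
console.log(percentile(latenciesMs, 99)); // 900: one slow request dominates the tail
```

This is why the histogram buckets in the metrics example matter: a monitoring backend estimates these percentiles from bucket counts rather than raw samples.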
Pillar 3: Distributed Traces
Traces show the full journey of a request across services:
// OpenTelemetry trace setup
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service");
async function processOrder(orderId: string) {
return tracer.startActiveSpan("processOrder", async (span) => {
span.setAttribute("order.id", orderId);
// Each sub-operation creates a child span
const order = await tracer.startActiveSpan("fetchOrder", async (child) => {
const result = await db.orders.findUnique({ where: { id: orderId } });
child.setAttribute("order.total", result.total);
child.end();
return result;
});
await tracer.startActiveSpan("chargePayment", async (child) => {
await paymentService.charge(order.paymentMethodId, order.total);
child.end();
});
await tracer.startActiveSpan("sendConfirmation", async (child) => {
await emailService.send(order.customerEmail, "confirmation", order);
child.end();
});
span.setStatus({ code: SpanStatusCode.OK });
span.end();
});
}

The trace shows: processOrder took 850ms total. fetchOrder: 50ms. chargePayment: 600ms. sendConfirmation: 200ms. Now you know the payment service is the bottleneck.
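For a trace to cross service boundaries, the trace context has to travel over the network. In practice OpenTelemetry's propagation API injects it into outgoing headers for you; the wire format it produces is the W3C traceparent header, sketched here by hand (the IDs are example values):

```typescript
// W3C trace context, version 00: "00-<trace-id>-<parent-span-id>-<flags>".
// Real code should call propagation.inject() from @opentelemetry/api instead
// of building this string manually; this just shows the wire format.
function traceparent(traceId: string, spanId: string, sampled: boolean): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

const header = traceparent("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", true);
console.log(header); // 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```

The downstream service reads this header, makes its spans children of `b7ad6b7169203331`, and the whole request shows up as one tree in your tracing UI.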
Connecting the Three Pillars
The real power is when all three work together:
Alert fires: "Error rate > 1% on /api/checkout" (METRIC)
→ Look at traces for failed requests (TRACE)
→ Trace shows payment service timing out (TRACE)
→ Filter logs by correlation ID from the trace (LOG)
→ Logs show: "Stripe connection timeout after 30s" (LOG)
→ Check Stripe status: partial outage (ROOT CAUSE)
Time to diagnosis: 3 minutes (with good observability)
Time to diagnosis: 45 minutes (with monitoring but no observability)
Build all three pillars. Connect them with correlation IDs. When the next outage hits, you'll find the root cause in minutes instead of hours.