
The Three Pillars of Observability — What They Actually Mean in Practice

April 17, 2026 · ScaledByDesign
Tags: observability, monitoring, logging, metrics, traces

Observability Theater

A client had Datadog, Grafana, PagerDuty, and Sentry all running. Dashboards everywhere. Alerts firing constantly. But when a production outage hit, nobody could answer the basic question: "What changed and what's affected?"

They had monitoring. They didn't have observability. The difference: monitoring tells you something is wrong. Observability helps you figure out why.

Pillar 1: Structured Logs

Unstructured logs are useless at scale. If your logs look like this, you can't search, filter, or aggregate them:

❌ Unstructured:
2026-04-17 10:23:45 ERROR Failed to process order for customer
2026-04-17 10:23:45 INFO Order processing started
2026-04-17 10:23:46 WARNING Payment retry attempt 2

Structured logs are JSON with consistent fields:

// Structured logging setup
import pino from "pino";
 
const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    service: "order-service",
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
});
 
// Usage
logger.info({
  orderId: "ord_123",
  customerId: "cust_456",
  action: "order.process.started",
  amount: 99.50,
  paymentMethod: "card",
}, "Processing order");
 
// Output:
// {"level":"info","service":"order-service","version":"1.2.3",
//  "environment":"production","orderId":"ord_123","customerId":"cust_456",
//  "action":"order.process.started","amount":99.50,"msg":"Processing order"}

Now you can search: "Show me all logs where orderId = ord_123" or "Show me all ERROR logs for the order-service in the last hour."
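As a minimal sketch of why that matters (hypothetical in-memory log lines, not tied to any log backend), those queries reduce to a filter over parsed JSON:

```typescript
// Hypothetical sketch: querying structured NDJSON logs in memory.
// A real backend (Datadog, Loki, CloudWatch) indexes these fields for you.
const rawLines = [
  '{"level":"error","service":"order-service","orderId":"ord_123","msg":"Payment failed"}',
  '{"level":"info","service":"order-service","orderId":"ord_123","msg":"Processing order"}',
  '{"level":"error","service":"order-service","orderId":"ord_999","msg":"Payment failed"}',
];

// "Show me all ERROR logs where orderId = ord_123"
const matches = rawLines
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.orderId === "ord_123" && entry.level === "error");

console.log(matches.length); // 1
```

With unstructured text lines, the same question becomes a brittle regex; with structured fields it is an exact match you can also aggregate on.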

The Correlation ID Pattern

Every request gets a unique ID that flows through every service:

// Middleware: attach correlation ID to every request
import crypto from "node:crypto"; // global in Node 19+; imported here for clarity

function correlationMiddleware(req, res, next) {
  const correlationId = req.headers["x-correlation-id"] || crypto.randomUUID();
  req.correlationId = correlationId;
  res.setHeader("x-correlation-id", correlationId);

  // Attach a child logger so every log line in this request carries the IDs
  req.log = logger.child({ correlationId, requestId: crypto.randomUUID() });
  next();
}
 
// Now every log in this request includes correlationId
// When this service calls another service, pass the header:
// headers: { "x-correlation-id": req.correlationId }
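To make that last step concrete, here is a hedged sketch of propagating the ID downstream (the helper name and the downstream URL are illustrative, not from any library):

```typescript
// Hypothetical helper: merge the correlation ID into outgoing request headers.
function withCorrelation(
  correlationId: string,
  headers: Record<string, string> = {}
): Record<string, string> {
  return { ...headers, "x-correlation-id": correlationId };
}

// Usage inside a request handler (URL is illustrative):
// await fetch("https://inventory.internal/reserve", {
//   method: "POST",
//   headers: withCorrelation(req.correlationId, { "content-type": "application/json" }),
//   body: JSON.stringify({ orderId }),
// });
```

The downstream service's own correlation middleware sees the incoming `x-correlation-id` header and reuses it instead of generating a new one, so the whole call chain shares a single ID.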

Pillar 2: Metrics

Metrics are aggregated numbers over time. They answer "how much?" and "how fast?":

// Key metrics every service should expose (prom-client shown here)
import { Counter, Histogram, Gauge } from "prom-client";

const metrics = {
  // RED metrics (Rate, Errors, Duration)
  requestCount: new Counter({
    name: "http_requests_total",
    help: "Total HTTP requests",
    labelNames: ["method", "path", "status"],
  }),
  requestDuration: new Histogram({
    name: "http_request_duration_seconds",
    help: "HTTP request duration in seconds",
    labelNames: ["method", "path"],
    buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  }),
  errorCount: new Counter({
    name: "http_errors_total",
    help: "Total HTTP 5xx responses",
    labelNames: ["method", "path", "error_type"],
  }),

  // USE metrics (Utilization, Saturation, Errors) for resources
  dbConnectionPool: new Gauge({ name: "db_connection_pool_size", help: "Open DB connections" }),
  dbConnectionWaiting: new Gauge({ name: "db_connection_waiting", help: "Requests waiting for a connection" }),
  cacheHitRate: new Gauge({ name: "cache_hit_rate", help: "Cache hit ratio (0-1)" }),
};

// Instrument in middleware
app.use((req, res, next) => {
  const start = Date.now();
  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000;
    // Use the route template ("/orders/:id"), not the raw URL, to keep label cardinality low
    const path = req.route?.path ?? req.path;
    metrics.requestCount.inc({ method: req.method, path, status: res.statusCode });
    metrics.requestDuration.observe({ method: req.method, path }, duration);
    if (res.statusCode >= 500) {
      metrics.errorCount.inc({ method: req.method, path, error_type: "5xx" });
    }
  });
  next();
});

The Four Golden Signals

Google's SRE book defines four signals every service needs:

1. Latency:    How long requests take (p50, p95, p99)
2. Traffic:    How many requests per second
3. Errors:     What percentage of requests fail
4. Saturation: How full your resources are (CPU, memory, disk, connections)

If you monitor nothing else, monitor these four. They surface the vast majority of production issues.
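Note that the latency signal is quoted as percentiles (p50, p95, p99), not averages: a few slow requests can hide behind a healthy mean. A minimal nearest-rank sketch over raw samples (production systems derive this from histogram buckets, and this helper is illustrative, not a library function):

```typescript
// Hypothetical sketch: nearest-rank percentile over raw latency samples.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based nearest rank
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

const latenciesMs = [12, 15, 20, 22, 30, 35, 40, 120, 450, 900];
console.log(percentile(latenciesMs, 50)); // 30
console.log(percentile(latenciesMs, 95)); // 900
```

The mean of those samples is roughly 164ms, yet half of users see under 30ms and the slowest see 900ms, which is exactly the spread percentiles make visible.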

Pillar 3: Distributed Traces

Traces show the full journey of a request across services:

// OpenTelemetry trace setup
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");
 
async function processOrder(orderId: string) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    span.setAttribute("order.id", orderId);
 
    // Each sub-operation creates a child span
    const order = await tracer.startActiveSpan("fetchOrder", async (child) => {
      const result = await db.orders.findUnique({ where: { id: orderId } });
      child.setAttribute("order.total", result.total);
      child.end();
      return result;
    });
 
    await tracer.startActiveSpan("chargePayment", async (child) => {
      await paymentService.charge(order.paymentMethodId, order.total);
      child.end();
    });
 
    await tracer.startActiveSpan("sendConfirmation", async (child) => {
      await emailService.send(order.customerEmail, "confirmation", order);
      child.end();
    });
 
    span.setStatus({ code: SpanStatusCode.OK });
    span.end();
  });
}

The trace shows: processOrder took 850ms total. fetchOrder: 50ms. chargePayment: 600ms. sendConfirmation: 200ms. Now you know the payment service is the bottleneck.
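That bottleneck read can be automated. A hypothetical sketch over flattened span timings (the helper and types are illustrative; the names and numbers match the example above):

```typescript
// Hypothetical sketch: find the slowest child span in a flattened trace.
interface SpanTiming {
  name: string;
  durationMs: number;
}

function slowestSpan(spans: SpanTiming[]): SpanTiming {
  return spans.reduce((worst, s) => (s.durationMs > worst.durationMs ? s : worst));
}

const childSpans: SpanTiming[] = [
  { name: "fetchOrder", durationMs: 50 },
  { name: "chargePayment", durationMs: 600 },
  { name: "sendConfirmation", durationMs: 200 },
];

console.log(slowestSpan(childSpans).name); // "chargePayment"
```

Trace backends do this ranking for you in their waterfall views; the point is that the answer falls directly out of span durations, with no log spelunking required.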

Connecting the Three Pillars

The real power is when all three work together:

Alert fires: "Error rate > 1% on /api/checkout"  (METRIC)
  → Look at traces for failed requests               (TRACE)
  → Trace shows payment service timing out            (TRACE)  
  → Filter logs by correlation ID from the trace      (LOG)
  → Logs show: "Stripe connection timeout after 30s"  (LOG)
  → Check Stripe status: partial outage               (ROOT CAUSE)

Time to diagnosis: 3 minutes (with good observability)
Time to diagnosis: 45 minutes (with monitoring but no observability)

Build all three pillars. Connect them with correlation IDs. When the next outage hits, you'll find the root cause in minutes instead of hours.
