

Observability That Actually Helps You Sleep at Night

December 31, 2025 · ScaledByDesign
observability · monitoring · devops · infrastructure

Your Monitoring Is Noise

You have Datadog. Or Grafana. Or New Relic. The dashboards look impressive. Nobody looks at them. When something breaks, the team opens Slack and asks "is anyone else seeing this?" — which means your $50k/year monitoring investment is a screensaver.

Why this costs you: We audited a client spending $63K/year on observability tools. In 6 months, they had 14 production incidents. Average time to detection: 8.3 minutes (customers noticed first, not monitoring). Average time to resolution: 47 minutes. Total revenue lost to undetected incidents: $340K. The monitoring was generating 2,400 alerts per week — all ignored.

Observability isn't dashboards. It's the ability to understand what your system is doing when things go wrong — and ideally, before they go wrong.

The Three Pillars (Actually Useful Version)

Pillar 1: Structured Logging

// ❌ Useless log
console.log("Order processed");
 
// ❌ Slightly better but still useless
console.log(`Order ${orderId} processed for customer ${customerId}`);
 
// ✅ Structured, searchable, actionable
logger.info("order.processed", {
  orderId,
  customerId,
  total: order.total,
  itemCount: order.items.length,
  paymentMethod: order.paymentMethod,
  processingTimeMs: Date.now() - startTime,
  isFirstOrder: customer.orderCount === 1,
});

The rules:

  1. Every log has a dot-notation event name (order.processed, payment.failed)
  2. Every log includes the entity IDs involved
  3. Every log includes timing information
  4. Logs are JSON, not strings — so you can query them
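
The `logger` used above is assumed to be a thin wrapper over a JSON logger. As a minimal sketch, here is one way to build it on pino (an assumption; any structured JSON logger works) so every entry becomes a single queryable object:

// Sketch of a structured-logging wrapper, assuming pino (npm i pino)
import pino from "pino";

const base = pino();

// Enforce the rules: dot-notation event name, entity IDs and timing in fields, JSON output
export const logger = {
  info: (event, fields = {}) => base.info({ event, ...fields }),
  error: (event, fields = {}) => base.error({ event, ...fields }),
};

With this in place, the logger.info("order.processed", { ... }) call above emits one JSON line that a log backend (Loki, CloudWatch, etc.) can filter by event name or aggregate by processingTimeMs.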

Pillar 2: Metrics That Matter

Stop measuring everything. Measure these:

Business Metrics (the ones that pay the bills):
  ├── Orders per minute (is the business working?)
  ├── Revenue per hour (are we making money?)
  ├── Checkout completion rate (is checkout broken?)
  └── Error rate by endpoint (what's failing?)

Infrastructure Metrics (the ones that predict problems):
  ├── Response time p50/p95/p99
  ├── Database connection pool utilization
  ├── Memory usage trend (not current, TREND)
  ├── Queue depth and processing lag
  └── Disk usage and growth rate

THE golden metric:
  └── "Can a customer complete a purchase right now?"
      If you can only monitor one thing, monitor this.
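
As a sketch of how these metrics might be recorded, here is a prom-client (Prometheus) version; the metric names, labels, and the Express /metrics endpoint are illustrative assumptions, not a fixed schema:

// Sketch: business metrics with prom-client (npm i prom-client express)
import express from "express";
import client from "prom-client";

const app = express();

const ordersTotal = new client.Counter({
  name: "orders_total",
  help: "Orders processed",
});

const checkoutAttempts = new client.Counter({
  name: "checkout_attempts_total",
  help: "Checkout attempts by outcome",
  labelNames: ["outcome"], // "success" | "failure"
});

// Call from the checkout handler
export function recordCheckout(succeeded) {
  checkoutAttempts.inc({ outcome: succeeded ? "success" : "failure" });
  if (succeeded) ordersTotal.inc();
}

// Expose metrics for Prometheus to scrape
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

Checkout completion rate (the golden metric) is then a ratio of rates in PromQL, e.g. sum(rate(checkout_attempts_total{outcome="success"}[5m])) / sum(rate(checkout_attempts_total[5m])).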

Pillar 3: Distributed Tracing

For any request that touches multiple services or takes > 200ms:

// Trace a critical path (OpenTelemetry JS API; the SDK is configured separately)
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout");

// Inside an async request handler
await tracer.startActiveSpan("checkout.process", async (span) => {
  span.setAttributes({
    "checkout.orderId": orderId,
    "checkout.total": total,
    "checkout.itemCount": items.length,
  });

  try {
    // Child spans pick up the active context set by startActiveSpan
    const payment = await tracer.startActiveSpan("checkout.payment", (child) =>
      processPayment(order).finally(() => child.end())
    );
    const fulfillment = await tracer.startActiveSpan("checkout.fulfillment", (child) =>
      createFulfillment(order).finally(() => child.end())
    );
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR });
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
});
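
The spans above only go anywhere if an OpenTelemetry SDK is configured at process startup. A minimal bootstrap sketch, assuming the OTLP HTTP exporter and a collector on its default endpoint:

// tracing.js: load before the rest of the app starts
// Assumes @opentelemetry/sdk-node, @opentelemetry/exporter-trace-otlp-http,
// and @opentelemetry/auto-instrumentations-node are installed
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(), // defaults to http://localhost:4318/v1/traces
  instrumentations: [getNodeAutoInstrumentations()], // auto-traces http, pg, redis, etc.
});

sdk.start();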

The Alert Strategy That Doesn't Cry Wolf

Tier 1: Page Someone (Immediately)

These wake people up at 3 AM:

alerts:
  - name: checkout_broken
    condition: checkout_success_rate < 90%
    for: 5 minutes
    severity: critical
    action: page on-call
 
  - name: payment_failures_spike
    condition: payment_failure_rate > 15%
    for: 3 minutes
    severity: critical
    action: page on-call
 
  - name: site_down
    condition: health_check_failures > 3
    for: 2 minutes
    severity: critical
    action: page on-call

Rules for Tier 1:

  • Maximum 3-5 alert types
  • Every alert has a runbook linked
  • If it pages and doesn't need action, remove it
  • Review monthly: if an alert fires and nobody acts, it's noise

The cost of too many alerts: One team had 47 different paging alerts, and on-call engineers were woken up 3-5 times per night. After 4 months, 60% of the team refused the on-call rotation and two senior engineers quit, citing burnout. Cost to replace them: $180K in recruiting plus 6 months of reduced velocity.

We reduced alerts to 4 critical types. Pages dropped from 4.2/night to 0.3/night. Zero engineers quit in the following 12 months.

Tier 2: Notify the Team (Business Hours)

alerts:
  - name: response_time_degraded
    condition: p95_response_time > 2s
    for: 15 minutes
    severity: warning
    action: slack #engineering
 
  - name: error_rate_elevated
    condition: error_rate > 2%
    for: 10 minutes
    severity: warning
    action: slack #engineering
 
  - name: disk_space_low
    condition: disk_usage > 80%
    for: 30 minutes
    severity: warning
    action: slack #infrastructure

Tier 3: Log for Investigation (No Notification)

Everything else. Visible in dashboards, searchable in logs, but doesn't interrupt anyone.

The Runbook Pattern

Every Tier 1 alert needs a runbook. No exceptions.

## Alert: checkout_broken
**What it means:** Checkout success rate dropped below 90%
 
### Quick diagnosis (< 2 minutes)
1. Check payment provider status: [status page URL]
2. Check database connections: `SELECT count(*) FROM pg_stat_activity`
3. Check recent deployments: [deployment dashboard URL]
 
### Common causes
1. **Payment provider outage**
   → Action: Enable backup provider, notify customers
2. **Database connection pool exhausted**
   → Action: Restart app servers, investigate long-running queries
3. **Bad deployment**
   → Action: Rollback last deployment
 
### Escalation
If not resolved in 15 minutes:
  → Page @engineering-lead
If not resolved in 30 minutes:
  → Page @cto + notify @customer-support

The Observability Stack (Startup Budget)

You don't need to spend $50k/year:

Component    Budget Option                 Premium Option
Logging      Grafana Loki (free)           Datadog ($$$)
Metrics      Prometheus + Grafana (free)   Datadog ($$$)
Tracing      Jaeger (free)                 Honeycomb ($$)
Alerting     Grafana Alerting (free)       PagerDuty ($$)
Uptime       UptimeRobot ($7/mo)           Datadog Synthetics ($$$)

Total budget option: $50-200/month
Total premium option: $2,000-10,000/month

Real ROI calculation: One client switched from Datadog ($6,800/month) to self-hosted Grafana stack ($180/month). Saved $79K annually. Time to implement: 2 weeks. The budget stack detected incidents just as fast — the premium was paying for features they never used.

Start with the budget option. Upgrade components that become pain points.

The Implementation Roadmap

Week 1: The Basics

  • Structured logging in your application
  • Health check endpoint that tests critical dependencies
  • Uptime monitoring for your site and API
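
For the health check endpoint, a sketch that actually exercises critical dependencies (assuming Express, pg, and ioredis; substitute whatever your stack uses):

// Sketch: /healthz that checks real dependencies, not just process liveness
import express from "express";
import { Pool } from "pg";
import Redis from "ioredis";

const app = express();
const pool = new Pool();   // connection settings from PG* env vars
const redis = new Redis(); // defaults to localhost:6379

app.get("/healthz", async (_req, res) => {
  const checks = { database: false, cache: false };
  try {
    await pool.query("SELECT 1");
    checks.database = true;
  } catch {}
  try {
    await redis.ping();
    checks.cache = true;
  } catch {}

  const healthy = Object.values(checks).every(Boolean);
  res.status(healthy ? 200 : 503).json({ healthy, checks });
});

Point the uptime monitor at /healthz so "up" means "dependencies up", not just "the load balancer answered".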

Week 2: Business Metrics

  • Order rate, revenue, and checkout completion dashboards
  • Error rate by endpoint
  • Tier 1 alerts for checkout and payment

Week 3: Infrastructure Metrics

  • Response time percentiles
  • Database and cache metrics
  • Queue depth monitoring
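
A sketch of capturing the response time percentiles with a prom-client histogram in Express middleware (bucket boundaries and label names are illustrative):

// Sketch: request duration histogram that feeds p50/p95/p99
import express from "express";
import client from "prom-client";

const app = express();

const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route", "status"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5],
});

app.use((req, res, next) => {
  const stop = httpDuration.startTimer();
  res.on("finish", () => {
    stop({
      method: req.method,
      route: req.route?.path ?? req.path,
      status: String(res.statusCode),
    });
  });
  next();
});

The percentiles then come from histogram_quantile() in PromQL or directly from a Grafana panel.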

Week 4: Runbooks

  • Write a runbook for every Tier 1 alert
  • Run a game day: trigger an alert and follow the runbook
  • Iterate on what's missing

The Test

Here's how you know your observability works: at 2 AM, a Tier 1 alert fires. The on-call engineer opens the runbook, follows the steps, and resolves the issue in 15 minutes — without waking anyone else up.

The math that matters:

Bad observability (before):
  - Mean time to detection (MTTD): 8.3 minutes (customers report it)
  - Mean time to resolution (MTTR): 47 minutes
  - Incidents per month: 14
  - Downtime per month: 658 minutes (11 hours)
  - Revenue lost at $15K/hour GMV: $165K/month

Good observability (after):
  - MTTD: 1.2 minutes (alerts fire before customers notice)
  - MTTR: 12 minutes (runbooks guide resolution)
  - Incidents per month: 14 (same issues, better response)
  - Downtime per month: 168 minutes (2.8 hours)
  - Revenue lost: $42K/month

Improvement: $123K/month in recovered revenue
Investment: $4K in engineering time to fix alerts + runbooks
ROI: 30x in first month

If that's not your reality, your observability needs work. Not more dashboards — better alerts, better runbooks, and better signal-to-noise ratio.
