AI Won't Fix Your Broken Data Pipeline
The Most Expensive Mistake in AI
A client came to us wanting an "AI-powered analytics dashboard." They'd been quoted $150k by an agency. When we audited their data, we found:
- Customer data in 4 different systems with no shared ID
- Revenue numbers that didn't match between Stripe, their database, and their spreadsheets
- Inventory counts that were off by 15-30% depending on which system you checked
- Event tracking that fired duplicate events 40% of the time
They didn't need AI. They needed a data pipeline that worked.
The Hierarchy of Data Needs
Before AI can help, you need these layers, built from the bottom up:
Layer 5: AI & ML ← Most companies start here
Layer 4: Analytics ← Dashboards, reports, insights
Layer 3: Transformation ← Clean, deduplicate, normalize
Layer 2: Integration ← Connect systems, unified IDs
Layer 1: Collection ← Accurate event tracking, logging
You can't skip layers. AI on top of broken data doesn't give you insights — it gives you confident wrong answers.
Signs Your Data Isn't Ready for AI
1. The "Which Number Is Right?" Problem
- Finance says revenue is $2.1M
- Sales dashboard says $2.4M
- Stripe says $1.9M
- The CEO's spreadsheet says $2.3M
If your team argues about basic numbers, AI will just add a fifth wrong answer to the mix.
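One way to make the disagreement concrete is to measure it. The sketch below is illustrative only: `ReportedFigure`, `SpreadReport`, and `revenueSpread` are hypothetical names, and the 2% default tolerance is an assumption borrowed from the reconciliation threshold used later in this post.

```typescript
// Illustrative sketch: all type and function names here are hypothetical.
interface ReportedFigure {
  source: string;
  amount: number; // same currency and reporting period for every source
}

interface SpreadReport {
  lowest: ReportedFigure;
  highest: ReportedFigure;
  relativeSpread: number; // (highest - lowest) / lowest
  withinTolerance: boolean;
}

function revenueSpread(figures: ReportedFigure[], tolerance = 0.02): SpreadReport {
  const [first, ...rest] = figures;
  if (first === undefined) throw new Error("need at least one figure");
  let lowest = first;
  let highest = first;
  for (const figure of rest) {
    if (figure.amount < lowest.amount) lowest = figure;
    if (figure.amount > highest.amount) highest = figure;
  }
  if (lowest.amount <= 0) throw new Error("amounts must be positive");
  const relativeSpread = (highest.amount - lowest.amount) / lowest.amount;
  return { lowest, highest, relativeSpread, withinTolerance: relativeSpread <= tolerance };
}

// The four figures from the example above.
const spread = revenueSpread([
  { source: "finance", amount: 2100000 },
  { source: "sales_dashboard", amount: 2400000 },
  { source: "stripe", amount: 1900000 },
  { source: "ceo_spreadsheet", amount: 2300000 },
]);
console.log(
  `${spread.lowest.source} vs ${spread.highest.source}: ` +
    `${(spread.relativeSpread * 100).toFixed(1)}% apart, ok=${spread.withinTolerance}`,
);
```

On those four figures the gap between the lowest and highest source is roughly 26%, far outside any tolerance you would accept for a number that feeds a model.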
2. The Identity Crisis
-- Same customer, four different records
SELECT id, name, email FROM customers WHERE LOWER(name) LIKE '%smith%';
-- Result:
-- id: 1001, name: "John Smith", email: "john.smith@gmail.com"
-- id: 1847, name: "J. Smith", email: "john.smith@gmail.com"
-- id: 2103, name: "John Smith", email: "jsmith@company.com"
-- id: 3299, name: "john smith", email: "John.Smith@gmail.com"
AI can't predict customer behavior when it doesn't know which records belong to the same customer.
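A first pass at finding these duplicates could group records by a normalized email. This is a minimal sketch, assuming emails can be compared case-insensitively after trimming; `CustomerRecord`, `normalizeEmail`, and `groupByEmail` are hypothetical names, not part of any library.

```typescript
// Illustrative sketch: names are hypothetical, and the normalization rule
// (trim + lowercase) is an assumption that may not suit every provider.
interface CustomerRecord {
  id: number;
  name: string;
  email: string;
}

function normalizeEmail(email: string): string {
  return email.trim().toLowerCase();
}

// Maps each normalized email to the IDs of every record that uses it.
function groupByEmail(records: CustomerRecord[]): Map<string, number[]> {
  const groups = new Map<string, number[]>();
  for (const record of records) {
    const key = normalizeEmail(record.email);
    const ids = groups.get(key) ?? [];
    ids.push(record.id);
    groups.set(key, ids);
  }
  return groups;
}

// The four records from the query result above.
const duplicateGroups = groupByEmail([
  { id: 1001, name: "John Smith", email: "john.smith@gmail.com" },
  { id: 1847, name: "J. Smith", email: "john.smith@gmail.com" },
  { id: 2103, name: "John Smith", email: "jsmith@company.com" },
  { id: 3299, name: "john smith", email: "John.Smith@gmail.com" },
]);
duplicateGroups.forEach((ids, email) => console.log(email, ids.join(", ")));
```

On the four records above, this groups 1001, 1847, and 3299 under one email and leaves 2103 on its own. Email matching alone misses the work address, so a real merge would likely need a second signal such as name, phone, or payment details.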
3. The Time Travel Problem
Your data arrives out of order, gets backfilled, or has timestamps in different timezones. Your "real-time" dashboard is actually showing data from 6 hours ago.
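A small sketch of what fixing this can look like: convert every timestamp to UTC, order by when the event happened rather than when it arrived, and measure the lag between the two. The event shape and the names `RawEvent`, `normalizeEvent`, `orderByEventTime`, and `staleEvents` are assumptions for illustration, and the timestamps are expected to be ISO 8601 strings with an explicit offset.

```typescript
// Illustrative sketch: the event shape and function names are hypothetical.
interface RawEvent {
  id: string;
  occurredAt: string; // ISO 8601 with an explicit offset
  receivedAt: string; // ISO 8601 time the pipeline ingested the event
}

interface NormalizedEvent {
  id: string;
  occurredAtUtc: string;
  occurredAtMs: number;
  lagMs: number; // how long the event took to reach the pipeline
}

function normalizeEvent(event: RawEvent): NormalizedEvent {
  const occurredAtMs = Date.parse(event.occurredAt);
  const receivedAtMs = Date.parse(event.receivedAt);
  if (Number.isNaN(occurredAtMs) || Number.isNaN(receivedAtMs)) {
    throw new Error(`unparseable timestamp on event ${event.id}`);
  }
  return {
    id: event.id,
    occurredAtUtc: new Date(occurredAtMs).toISOString(),
    occurredAtMs,
    lagMs: receivedAtMs - occurredAtMs,
  };
}

// Order by event time, not by arrival order.
function orderByEventTime(events: RawEvent[]): NormalizedEvent[] {
  return events.map(normalizeEvent).sort((a, b) => a.occurredAtMs - b.occurredAtMs);
}

function staleEvents(events: NormalizedEvent[], maxLagMs: number): NormalizedEvent[] {
  return events.filter((event) => event.lagMs > maxLagMs);
}

const ONE_HOUR_MS = 60 * 60 * 1000;
const orderedEvents = orderByEventTime([
  { id: "a", occurredAt: "2024-03-01T09:00:00-05:00", receivedAt: "2024-03-01T14:00:05Z" },
  { id: "b", occurredAt: "2024-03-01T13:30:00Z", receivedAt: "2024-03-01T19:30:00Z" },
]);
console.log(orderedEvents.map((event) => `${event.id} ${event.occurredAtUtc}`));
console.log(staleEvents(orderedEvents, ONE_HOUR_MS).map((event) => event.id));
```

In this sample, event "b" arrived second but happened first, and it reached the pipeline six hours late. Tracking that lag per event is what tells you how "real-time" the dashboard really is.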
4. The Missing Data Problem
30% of your order records don't have a source attribution. 20% of your customer records are missing key fields. AI models trained on incomplete data learn incomplete patterns.
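Before training on anything, it helps to measure how much is missing per field. The following is a rough sketch, assuming rows are plain objects and that `null`, `undefined`, and blank strings all count as missing; `Row`, `isMissing`, and `missingRates` are hypothetical names, and the sample rows are made up.

```typescript
// Illustrative sketch: names and sample data are hypothetical.
type Row = Record<string, unknown>;

// Assumption: null, undefined, and blank strings all count as missing.
function isMissing(value: unknown): boolean {
  return value === null || value === undefined || (typeof value === "string" && value.trim() === "");
}

// Returns the fraction of rows (0 to 1) missing each requested field.
function missingRates(rows: Row[], fields: string[]): Record<string, number> {
  const rates: Record<string, number> = {};
  for (const field of fields) {
    const missing = rows.filter((row) => isMissing(row[field])).length;
    rates[field] = rows.length === 0 ? 0 : missing / rows.length;
  }
  return rates;
}

const sampleOrders: Row[] = [
  { id: 1, source: "google_ads", customerId: "c1" },
  { id: 2, source: "", customerId: "c2" },
  { id: 3, source: null, customerId: "c3" },
  { id: 4, source: "email" },
  { id: 5, source: "organic", customerId: "c5" },
];
const orderRates = missingRates(sampleOrders, ["source", "customerId"]);
console.log(orderRates);
```

Running a report like this per table, per field, gives you a concrete list of gaps to close instead of a vague sense that the data is "a bit patchy."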
What to Fix First
Step 1: Single Source of Truth
Pick one system as the authority for each data type:
| Data Type | Source of Truth | Syncs To |
|---|---|---|
| Revenue | Stripe | Database, Analytics |
| Customers | Database | CRM, Email platform |
| Inventory | ERP/WMS | Shopify, Dashboard |
| Orders | Database | Analytics, Support |
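The table can also live in code, so that conflicts are settled by a rule rather than a meeting. This is a minimal sketch: the system keys mirror the table, while `AUTHORITIES`, `Resolution`, and `resolveValue` are hypothetical names and the numeric values are sample data.

```typescript
// Illustrative sketch: names are hypothetical; system keys mirror the table above.
type DataType = "revenue" | "customers" | "inventory" | "orders";

interface Authority {
  sourceOfTruth: string;
  syncsTo: string[];
}

const AUTHORITIES: Record<DataType, Authority> = {
  revenue: { sourceOfTruth: "stripe", syncsTo: ["database", "analytics"] },
  customers: { sourceOfTruth: "database", syncsTo: ["crm", "email_platform"] },
  inventory: { sourceOfTruth: "erp_wms", syncsTo: ["shopify", "dashboard"] },
  orders: { sourceOfTruth: "database", syncsTo: ["analytics", "support"] },
};

interface Resolution {
  sourceOfTruth: string;
  value: number;
  outOfSync: string[]; // systems whose value disagrees with the authority
}

// Keep the authoritative value and list every system that needs a resync.
function resolveValue(dataType: DataType, reported: Record<string, number>): Resolution {
  const { sourceOfTruth } = AUTHORITIES[dataType];
  const value = reported[sourceOfTruth];
  if (value === undefined) {
    throw new Error(`no value reported by ${sourceOfTruth} for ${dataType}`);
  }
  const outOfSync = Object.keys(reported).filter(
    (system) => system !== sourceOfTruth && reported[system] !== value,
  );
  return { sourceOfTruth, value, outOfSync };
}

const revenueResolution = resolveValue("revenue", {
  stripe: 1900000,
  database: 2100000,
  analytics: 1900000,
});
console.log(revenueResolution);
```

Here Stripe's figure wins and the database is flagged as out of sync, which turns "which number is right?" into a sync job.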
Step 2: Identity Resolution
Build a unified customer ID that works across systems:
interface UnifiedCustomer {
  id: string; // Your canonical ID
  externalIds: {
    stripe: string;
    shopify: string;
    klaviyo: string;
    zendesk: string;
  };
  mergedFrom: string[]; // IDs that were deduplicated
}
Step 3: Event Tracking That Doesn't Lie
// Bad: fire-and-forget tracking
analytics.track("purchase", { amount: order.total });
// Good: validated, deduplicated, server-side
// (`events` and `validateEvent` stand in for your own event store and schema validator)
async function trackPurchase(order: Order) {
  // Deduplicate. Find-then-insert is not atomic, so also enforce a unique
  // constraint on (type, orderId) if your store supports one.
  const exists = await events.find({
    type: "purchase",
    orderId: order.id,
  });
  if (exists) return;
  // Validate
  const validated = validateEvent({
    type: "purchase",
    orderId: order.id,
    amount: order.total,
    currency: order.currency,
    timestamp: order.completedAt,
    source: "server",
  });
  // Store with audit trail
  await events.insert(validated);
}
Step 4: Data Quality Monitoring
You need alerts for data problems, not just application problems:
const DATA_QUALITY_CHECKS = [
  {
    name: "revenue_reconciliation",
    query: "Compare Stripe settlements vs database orders",
    threshold: 0.02, // 2% variance max
    frequency: "daily",
  },
  {
    name: "customer_duplicates",
    query: "Find customers with matching email, different IDs",
    threshold: 10, // Max 10 new duplicates per day
    frequency: "daily",
  },
  {
    name: "event_completeness",
    query: "Orders without corresponding tracking events",
    threshold: 0.05, // 5% missing max
    frequency: "hourly",
  },
];
When You're Actually Ready for AI
Your data is ready for AI when:
- One number for revenue, and everyone agrees on it
- Customer identity is resolved across systems
- Event tracking is server-side and deduplicated
- Data quality checks run on a schedule and stay within their thresholds (for example, revenue variance under 2%)
- Historical data is clean enough to train against
- You can answer basic analytics questions without caveats
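The measurable parts of this checklist can be turned into a simple gate. The sketch below assumes each data quality check reports a measured value alongside its maximum threshold; `CheckResult`, `Readiness`, and `assessReadiness` are hypothetical names, and the measured values are invented for illustration.

```typescript
// Illustrative sketch: names and measured values are hypothetical.
interface CheckResult {
  name: string;
  measured: number; // observed value from the latest run
  threshold: number; // maximum acceptable value
}

interface Readiness {
  ready: boolean;
  failing: string[];
}

// Ready only when at least one check ran and none exceeded its threshold.
function assessReadiness(results: CheckResult[]): Readiness {
  const failing = results
    .filter((result) => result.measured > result.threshold)
    .map((result) => result.name);
  return { ready: results.length > 0 && failing.length === 0, failing };
}

const readiness = assessReadiness([
  { name: "revenue_reconciliation", measured: 0.012, threshold: 0.02 },
  { name: "customer_duplicates", measured: 14, threshold: 10 },
  { name: "event_completeness", measured: 0.03, threshold: 0.05 },
]);
console.log(readiness);
```

In this made-up run, revenue and event tracking pass but duplicates do not, so the answer to "are we ready for AI?" is "not yet, fix identity resolution first."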
The Honest Conversation
Half the companies that come to us wanting AI actually need data infrastructure. That's not a failure — it's a foundation. The companies that build the pipeline first get 10x more value from AI when they eventually add it.
The ones that skip to AI spend 6 months building models on bad data, get bad results, and conclude "AI doesn't work for us." It does. Your data just wasn't ready.
Fix the pipes. Then add the intelligence.