Hook Mesh Team

From 0 to 10K Webhooks: Scaling Your First Implementation

A practical guide for startups on how to scale webhooks from your first implementation to handling 10,000+ events per hour. Learn what breaks at each growth phase and how to fix it before your customers notice.

Your MVP works. Then you land an enterprise customer or hit Hacker News—and everything breaks.

Scaling webhooks is predictable. The same problems break systems at the same volume thresholds. This guide maps that journey from first webhook to 10,000/hour, explaining what breaks at each phase and how to fix it before customers complain.

Phase 1: 0-100 Webhooks/Hour

Simplicity wins. With only a handful of customers, the goal is shipping features.

Starting Architecture

Most teams use synchronous delivery: event occurs, HTTP POST, wait for response:

async function sendWebhook(endpoint, payload) {
  try {
    const response = await fetch(endpoint.url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
      // fetch has no `timeout` option; use an abort signal instead
      signal: AbortSignal.timeout(30000)
    });
    await logDelivery(endpoint, payload, response.status);
  } catch (error) {
    await logFailure(endpoint, payload, error);
  }
}

It's easy to understand, easy to debug, and carries no infrastructure overhead.

What Works

  • Synchronous delivery from main app
  • Simple database logging
  • Manual retry on customer reports
  • Basic error logging

Warning Signs

Watch for:

  • API response times increasing during webhook peaks
  • Customer complaints about delays/missing webhooks
  • Timeouts on slow customer endpoints

Phase 2: 100-1K Webhooks/Hour

Real customers depend on webhooks. Reliability becomes non-negotiable.

What Breaks: Synchronous Delivery

One slow customer endpoint blocks processing for everyone. Your API slows, background jobs back up, users notice.

The Fix: Introduce a Queue

Make delivery asynchronous: push to a queue and return immediately. A separate worker handles delivery:

// Event occurs - push to queue instantly
async function onOrderCompleted(order) {
  await webhookQueue.push({
    type: 'order.completed',
    payload: serializeOrder(order),
    endpoints: await getSubscribedEndpoints(order.customerId)
  });
}

// Worker processes queue independently
async function processWebhookJob(job) {
  for (const endpoint of job.endpoints) {
    await attemptDelivery(endpoint, job.payload);
  }
}

Decoupling is transformative. Main app stays fast. Customer endpoint problems become isolated incidents.

Retry Logic

With a queue, implement proper retries. Exponential backoff prevents hammering failed endpoints:

const RETRY_DELAYS = [60, 300, 1800, 7200, 86400]; // 1m, 5m, 30m, 2h, 24h

async function attemptDelivery(endpoint, payload, attempt = 0) {
  try {
    const response = await deliverWebhook(endpoint, payload);
    if (response.ok) {
      await markDelivered(endpoint, payload);
    } else if (attempt < RETRY_DELAYS.length) {
      await scheduleRetry(endpoint, payload, attempt + 1);
    } else {
      await markFailed(endpoint, payload);
    }
  } catch (error) {
    if (attempt < RETRY_DELAYS.length) {
      await scheduleRetry(endpoint, payload, attempt + 1);
    } else {
      await markFailed(endpoint, payload);
    }
  }
}
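The `scheduleRetry` helper above is left abstract. A minimal sketch, assuming a queue that supports delayed jobs; the `pushDelayed` method and in-memory queue below are stand-ins, not a real library API (with BullMQ, for example, you'd pass a `delay` option when adding the job):

```javascript
// Same schedule as above: 1m, 5m, 30m, 2h, 24h (in seconds)
const RETRY_DELAYS = [60, 300, 1800, 7200, 86400];

// Stand-in queue: a real queue would persist the job and release it later
const webhookQueue = {
  async pushDelayed(job, delayMs) {
    setTimeout(() => console.log('retrying attempt', job.attempt), delayMs);
  }
};

function retryDelaySeconds(attempt) {
  // attempt is 1-based: attempt 1 waits 60s, attempt 5 waits 24h
  return RETRY_DELAYS[attempt - 1];
}

async function scheduleRetry(endpoint, payload, attempt) {
  await webhookQueue.pushDelayed(
    { endpoint, payload, attempt },
    retryDelaySeconds(attempt) * 1000 // queue APIs usually take milliseconds
  );
}
```

Because the delay lives in the queue rather than in a sleeping worker, retries survive worker restarts.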

Metrics to Track

  • Delivery rate (target: 99%+)
  • Latency p95 (target: <5 sec)
  • Failure rate (target: <0.1%)
  • Queue depth (should stay low)

Climbing queue depth or dropping delivery rate signals problems early.

Phase 3: 1K-10K Webhooks/Hour

Scaling reveals infrastructure limits. Theoretical problems become urgent.

Challenge 1: Database Load

Every delivery attempt generates database writes. At 10K webhooks/hour with up to 3 retries each, that's 30K+ write operations/hour for logging alone.

Solutions:

  • Batch writes instead of individual inserts
  • Use time-series DB for logs (optimized for append-heavy)
  • Rotate and archive logs (no need for instant 6-month access)
  • Separate webhook logging from main DB
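Batching is the cheapest of these wins. A sketch of batched delivery-log writes; `db` here is a stand-in that records queries, and the `VALUES ?` bulk placeholder follows mysql2-style syntax, so adapt it to your client:

```javascript
// Stand-in DB client: records queries instead of executing them
const db = {
  queries: [],
  async query(sql, params) { this.queries.push({ sql, params }); }
};

const buffer = [];
const BATCH_SIZE = 100;

async function logDeliveryAttempt(entry) {
  buffer.push(entry);
  if (buffer.length >= BATCH_SIZE) await flushLogs();
}

async function flushLogs() {
  const rows = buffer.splice(0, buffer.length); // drain the buffer
  if (rows.length === 0) return;
  const values = rows.map(r => [r.endpointId, r.status, r.timestamp]);
  // One round-trip for the whole batch instead of BATCH_SIZE inserts
  await db.query(
    'INSERT INTO webhook_logs (endpoint_id, status, created_at) VALUES ?',
    [values]
  );
}
```

In production you'd also flush on a timer (say, every second) so a quiet period doesn't strand a partial batch in memory.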

Challenge 2: Retry Storms

Customer endpoint down 1 hour. All queued retries fire simultaneously when it recovers. Thousands of requests in seconds. Overwhelms queue, delays new webhooks, crashes recovering endpoint.

Solutions:

  • Add jitter to retry timing
  • Per-endpoint rate limiting
  • Circuit breakers for consistently failing endpoints
  • Gradually ramp up delivery on recovery

// Circuit breaker pattern
async function checkEndpointHealth(endpoint) {
  const recentFailures = await getRecentFailures(endpoint, '5m');
  if (recentFailures > 10) {
    await pauseEndpoint(endpoint, '15m');
    await notifyCustomer(endpoint, 'Endpoint paused due to failures');
    return false;
  }
  return true;
}
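Jitter is the simplest of these fixes. A sketch: each retry waits the base delay plus a random offset of up to 25%, so retries queued during an outage spread out instead of firing together when the endpoint recovers:

```javascript
// Same schedule as Phase 2: 1m, 5m, 30m, 2h, 24h (in seconds)
const RETRY_DELAYS = [60, 300, 1800, 7200, 86400];

function jitteredDelay(attempt, jitterFraction = 0.25) {
  const base = RETRY_DELAYS[attempt - 1];
  // Random offset in [0, base * jitterFraction) desynchronizes retries
  return base + Math.random() * base * jitterFraction;
}
```

With a 25% jitter fraction, a thousand retries scheduled for the same 24-hour mark land spread across a six-hour window instead of a single burst.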

Challenge 3: Queue Management

A simple Redis queue starts to strain at 10K/hour. You need visibility, prioritization, and graceful backlog handling.

Considerations:

  • Dead letter queues for permanent failures
  • Priority lanes for time-sensitive events
  • Monitor consumer lag, scale workers
  • Consider managed queues for auto-scaling
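Dead-letter handling in particular is worth wiring in early. A sketch of the idea: after the final retry, park the job with context for inspection and replay instead of dropping it. The queue here is an in-memory stand-in and `scheduleRetry` is stubbed; use your queue system's built-in DLQ support (e.g. SQS redrive policies) where available:

```javascript
const MAX_ATTEMPTS = 5;
const deadLetterQueue = [];

// Stub: re-enqueue with a delay, as in the Phase 2 retry logic
async function scheduleRetry(endpoint, payload, attempt) {}

async function handleFailedDelivery(job, error) {
  if (job.attempt >= MAX_ATTEMPTS) {
    // Permanent failure: keep the payload and error for inspection/replay
    deadLetterQueue.push({
      ...job,
      lastError: String(error),
      failedAt: Date.now()
    });
  } else {
    await scheduleRetry(job.endpoint, job.payload, job.attempt + 1);
  }
}
```

The dead-letter queue doubles as a support tool: when a customer asks where a webhook went, the parked job tells you exactly what failed and why.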

Monitoring & Alerting

Proactive alerting is essential at this volume:

  • Alert if delivery rate drops below 98%
  • Alert on p95 latency >30 sec
  • Alert on sustained high queue depth
  • Alert on high-value customer failures

Set up on-call rotations. Webhook failures at 3 AM affect customers' businesses.

When DIY Stops Making Sense

As volume grows, teams discover:

Maintenance grows faster than expected. Edge cases (slow endpoints, DNS failures, SSL problems) accumulate complexity.

Customer expectations increase. Enterprise customers expect 99.99% delivery, detailed logs, same-day resolution.

Opportunity cost rises. Engineers debugging retry logic could ship differentiating features.

The math changes at 10K/hour: a managed service costs a few hundred dollars per month, less than a few hours of engineering time, and you spend far more than that on maintenance.

Signs to Consider Managed Services

  • Spending more than a few hours/week on webhook infrastructure
  • Customers report undiagnosable delivery issues
  • Scaling queue workers is routine
  • Building features services already provide
  • Engineer departure creates knowledge gap

Conclusion

Scaling from 0 to 10K/hour is predictable. Synchronous works at launch, breaks when reliability matters. Simple queues scale poorly. DIY systems become distractions from core product.

Every startup reaches the point where webhook infrastructure stops being an advantage and becomes a tax on engineering time. Recognizing that transition is the difference between smooth and painful scaling.

Queue your delivery, implement retries, monitor, and know when to rent expertise rather than build it. Your customers depend on reliable webhooks. How you deliver that reliability matters less than delivering it.
