Back to Blog
·Hook Mesh Team

From 0 to 10K Webhooks: Scaling Your First Implementation

A practical guide for startups on how to scale webhooks from your first implementation to handling 10,000+ events per hour. Learn what breaks at each growth phase and how to fix it before your customers notice.

From 0 to 10K Webhooks: Scaling Your First Implementation

From 0 to 10K Webhooks: Scaling Your First Implementation

Your MVP works. Then you land an enterprise customer or hit Hacker News—and everything breaks.

Scaling webhooks is predictable. The same problems break systems at the same volume thresholds. This guide maps that journey from first webhook to 10,000/hour, explaining what breaks at each phase and how to fix it before customers complain.

Webhook scaling phases showing architecture evolution from synchronous delivery to queue-based workers with circuit breakers

Phase 1: 0-100 Webhooks/Hour

Simplicity wins. Handful of customers, goal is shipping features.

Starting Architecture

Most teams use synchronous delivery: event occurs, HTTP POST, wait for response:

async function sendWebhook(endpoint, payload) {
  try {
    const response = await fetch(endpoint.url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
      timeout: 30000
    });
    await logDelivery(endpoint, payload, response.status);
  } catch (error) {
    await logFailure(endpoint, payload, error);
  }
}

Easy to understand, debug, no infrastructure overhead.

What Works

  • Synchronous delivery from main app
  • Simple database logging
  • Manual retry on customer reports
  • Basic error logging

Warning Signs

Watch for:

  • API response times increasing during webhook peaks
  • Customer complaints about delays/missing webhooks
  • Timeouts on slow customer endpoints

Phase 2: 100-1K Webhooks/Hour

Real customers depend on webhooks. Reliability becomes non-negotiable.

What Breaks: Synchronous Delivery

One slow customer endpoint blocks processing for everyone. Your API slows, background jobs back up, users notice.

The Fix: Introduce a Queue

Move delivery asynchronous. Push to queue, return immediately. Separate worker handles delivery:

// Event occurs - push to queue instantly
async function onOrderCompleted(order) {
  await webhookQueue.push({
    type: 'order.completed',
    payload: serializeOrder(order),
    endpoints: await getSubscribedEndpoints(order.customerId)
  });
}

// Worker processes queue independently
async function processWebhookJob(job) {
  for (const endpoint of job.endpoints) {
    await attemptDelivery(endpoint, job.payload);
  }
}

Decoupling is transformative. Main app stays fast. Customer endpoint problems become isolated incidents.

Retry Logic

With a queue, implement proper retries. Exponential backoff prevents hammering failed endpoints:

const RETRY_DELAYS = [60, 300, 1800, 7200, 86400]; // 1m, 5m, 30m, 2h, 24h

async function attemptDelivery(endpoint, payload, attempt = 0) {
  try {
    const response = await deliverWebhook(endpoint, payload);
    if (response.ok) {
      await markDelivered(endpoint, payload);
    } else if (attempt < RETRY_DELAYS.length) {
      await scheduleRetry(endpoint, payload, attempt + 1);
    } else {
      await markFailed(endpoint, payload);
    }
  } catch (error) {
    if (attempt < RETRY_DELAYS.length) {
      await scheduleRetry(endpoint, payload, attempt + 1);
    } else {
      await markFailed(endpoint, payload);
    }
  }
}

Metrics to Track

  • Delivery rate (target: 99%+)
  • Latency p95 (target: <5 sec)
  • Failure rate (target: <0.1%)
  • Queue depth (should stay low)

Climbing queue depth or dropping delivery rate signals problems early.

Idempotency: Handle Duplicates Gracefully

At-least-once delivery means duplicates happen. Your handlers must be idempotent—processing the same event twice produces the same result:

async function processWebhookIdempotently(event) {
  const eventKey = `${event.type}:${event.id}`;

  // Check if already processed
  const existing = await db.webhookEvents.findUnique({ where: { eventKey } });
  if (existing?.status === 'processed') {
    return { skipped: true, reason: 'duplicate' };
  }

  // Process with upsert to handle race conditions
  await db.webhookEvents.upsert({
    where: { eventKey },
    create: { eventKey, status: 'processing', receivedAt: new Date() },
    update: { lastAttempt: new Date() }
  });

  await handleEvent(event);
  await db.webhookEvents.update({ where: { eventKey }, data: { status: 'processed' } });
}

Three common patterns:

  • Event ID tracking: Store processed IDs, skip duplicates
  • Timestamp comparison: Only process if event timestamp is newer than last processed
  • Idempotent operations: Design handlers so duplicates are harmless (e.g., upserts instead of inserts)

Phase 3: 1K-10K Webhooks/Hour

Scaling reveals infrastructure limits. Theoretical problems become urgent.

Challenge 1: Database Load

Every attempt generates writes. At 10K/hour with 3 retries: 30K+ ops/hour.

Solutions:

  • Batch writes instead of individual inserts
  • Use time-series DB for logs (optimized for append-heavy)
  • Rotate and archive logs (no need for instant 6-month access)
  • Separate webhook logging from main DB

Challenge 2: Retry Storms

Customer endpoint down 1 hour. All queued retries fire simultaneously when it recovers. Thousands of requests in seconds. Overwhelms queue, delays new webhooks, crashes recovering endpoint.

Diagram showing retry storm problem with all retries hitting at once versus controlled recovery with jitter and rate limiting

Solutions:

  • Add jitter to retry timing (randomize within a window)
  • Per-endpoint rate limiting
  • Circuit breakers for consistently failing endpoints
  • Gradually ramp up delivery on recovery
// Circuit breaker pattern with jitter
function getRetryDelayWithJitter(baseDelay) {
  const jitter = Math.random() * 0.3 * baseDelay; // 0-30% jitter
  return baseDelay + jitter;
}

async function checkEndpointHealth(endpoint) {
  const recentFailures = await getRecentFailures(endpoint, '5m');
  if (recentFailures > 10) {
    await pauseEndpoint(endpoint, '15m');
    await notifyCustomer(endpoint, 'Endpoint paused due to failures');
    return false;
  }
  return true;
}

Challenge 3: Queue Management

Simple Redis queues struggle at 10K. Need visibility, prioritization, graceful backlog handling.

Considerations:

  • Dead letter queues for permanent failures
  • Priority lanes for time-sensitive events (payment webhooks > analytics)
  • Monitor consumer lag, scale workers based on queue depth
  • Consider managed queues (SQS, Cloud Tasks) for auto-scaling

Challenge 4: Hot Endpoints

One customer receives 80% of your webhooks. Their endpoint slows, your entire queue backs up.

Solutions:

  • Per-endpoint queues or partitioning
  • Concurrency limits per customer
  • Separate worker pools for high-volume customers
  • Backpressure detection: pause delivery when endpoint response times spike

Monitoring & Alerting

Proactive alerting essential at this volume. Define SLOs (Service Level Objectives) tied to business impact:

Dashboard showing key webhook metrics: delivery rate, queue depth, P95 latency, and failure rate with alert thresholds

Key SLOs for webhook delivery:

MetricTargetAlert Threshold
Delivery rate99.5%<98%
P95 latency<5 sec>30 sec
Queue depthNear zeroSustained >1000
Failure rate<0.1%>1%

Track these with proper observability:

  • Queue depth + max age: Detect backpressure before it cascades
  • Estimated drain time: Predict when backlogs clear
  • Per-endpoint failure rates: Isolate problem customers quickly

Set up on-call rotations. Webhook failures at 3 AM affect customers' businesses.

Beyond 10K: Event Destinations

At extreme scale (100K+/hour), HTTP webhooks hit fundamental limits. Modern platforms like Stripe and Shopify now support event destinations—direct delivery to message buses:

  • Amazon EventBridge
  • Google Cloud Pub/Sub
  • Amazon Kinesis

This bypasses HTTP entirely. No timeouts, no retries, no authentication handshakes. Events flow directly into your event infrastructure. If you're building for massive scale, design your architecture to support both traditional webhooks and event destinations.

When DIY Stops Making Sense

Teams discover:

Maintenance grows faster than expected. Edge cases (slow endpoints, DNS failures, SSL problems) accumulate complexity.

Customer expectations increase. Enterprise customers expect 99.99% delivery, detailed logs, same-day resolution.

Opportunity cost rises. Engineers debugging retry logic could ship differentiating features.

Math changes at 10K/hr: Managed service costs a few hundred $/month. Less than hours of engineering time—and you spend far more on maintenance. See our build vs buy analysis for detailed cost comparisons.

Signs to Consider Managed Services

  • Spending >few hours/week on webhook infrastructure
  • Customers report undiagnosable delivery issues
  • Scaling queue workers is routine
  • Building features services already provide
  • Engineer departure creates knowledge gap

Conclusion

Scaling from 0 to 10K/hour is predictable. Synchronous works at launch, breaks when reliability matters. Simple queues scale poorly. DIY systems become distractions from core product.

Every startup reaches the point where webhook infrastructure stops being an advantage and becomes a tax on engineering time. Recognizing that transition is the difference between smooth and painful scaling.

Queue your delivery, implement retries, monitor, and know when to rent expertise rather than build it. Your customers depend on reliable webhooks. How you deliver that reliability matters less than delivering it.

For your MVP webhook architecture, start simple. Add complexity only when metrics demand it. And when infrastructure maintenance exceeds feature development time, consider managed solutions like Hook Mesh that handle scaling automatically.

Related Posts