Retry Strategy

Hook Mesh automatically retries failed webhook deliveries using exponential backoff with jitter. Retries recover events that would otherwise be lost to short outages, and the growing delays avoid flooding your customers' endpoints while they recover.

Why Retries Matter

Networks are unreliable. Customer endpoints experience temporary issues:

  • Deployments: Brief downtime during code deploys (30-60 seconds)
  • Network blips: Transient connection failures or timeouts
  • Rate limiting: Temporary throttling during traffic spikes
  • Server overload: 503 errors during high load

Without retries, these temporary failures would result in lost events. Hook Mesh keeps redelivering for up to 48 hours, so an event still arrives once an endpoint recovers from a brief issue.

Exponential Backoff Schedule

Hook Mesh retries failed deliveries with exponentially increasing delays between attempts. This gives endpoints time to recover without overwhelming them with requests.

Attempt   Delay        Cumulative Time   With Jitter (±25%)
1         5 seconds    5s                3.75s - 6.25s
2         25 seconds   30s               18.75s - 31.25s
3         2 minutes    2.5 min           1m 33s - 2m 35s
4         10 minutes   12.5 min          7.5m - 12.5m
5         50 minutes   1 hour            37.5m - 62.5m
6         4 hours      5 hours           3h - 5h

Formula: Each retry delay is 5x the previous delay (5s, 25s, 125s, 625s, 3125s...) with ±25% random jitter. The table rounds the longer delays for readability (125s is shown as 2 minutes, 625s as 10 minutes, and so on).

48-Hour Delivery Window

After 6 retry attempts (~5 hours), Hook Mesh continues retrying with a maximum delay of 6 hours until 48 hours have elapsed from job creation. If delivery still hasn't succeeded after 48 hours, the job is marked as discarded.

Complete Retry Timeline
Initial attempt:        0s
Retry 1:                5s
Retry 2:                30s
Retry 3:                2m 30s
Retry 4:                12m 30s
Retry 5:                1h 2m
Retry 6:                5h 2m
Retry 7+:               Every 6 hours until 48 hours elapsed

Final status at 48h:    Discarded (if still failing)
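
The timeline above can be reproduced with a short script. This is a minimal sketch, not Hook Mesh's implementation: it assumes exact (unrounded) 5x delays, no jitter, and a 48-hour window measured from the initial attempt. The function and constant names are illustrative.

```javascript
// Sketch only: the constants come from the schedule described above.
// Exact 5x delays and a window counted from the initial attempt are assumptions.
const BASE_DELAY_S = 5;
const MULTIPLIER = 5;
const MAX_DELAY_S = 6 * 60 * 60; // delays are capped at 6 hours
const WINDOW_S = 48 * 60 * 60;   // 48-hour delivery window

function buildRetryTimeline() {
  const timeline = [];
  let elapsed = 0;
  for (let retry = 1; ; retry++) {
    const delay = Math.min(BASE_DELAY_S * MULTIPLIER ** (retry - 1), MAX_DELAY_S);
    if (elapsed + delay > WINDOW_S) break; // next retry would fall outside the window
    elapsed += delay;
    timeline.push({ retry, delaySeconds: delay, elapsedSeconds: elapsed });
  }
  return timeline;
}

for (const { retry, delaySeconds, elapsedSeconds } of buildRetryTimeline()) {
  console.log(`Retry ${retry}: +${delaySeconds}s (elapsed ${elapsedSeconds}s)`);
}
```

Under these assumptions the sketch produces 13 retries. The exact elapsed times differ slightly from the rounded timeline because 125s, 625s, and so on are not rounded here.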

Jitter Explained

Jitter adds ±25% random variation to each retry delay. This prevents the "thundering herd" problem where many failed webhooks would retry at exactly the same time, potentially overwhelming a recovering endpoint.

Jitter Calculation Example
// Base delay for retry 2: 25 seconds
// Jitter range: ±25% = ±6.25 seconds
// Actual delay: randomly between 18.75s and 31.25s

function calculateRetryDelay(attemptNumber) {
  const baseDelay = 5 * Math.pow(5, attemptNumber - 1); // 5s * 5^(n-1)
  const jitterPercent = (Math.random() - 0.5) * 0.5; // ±25%
  const jitter = baseDelay * jitterPercent;
  return baseDelay + jitter;
}

console.log(calculateRetryDelay(2)); // ~25s ±6.25s
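
To see the effect on a recovering endpoint, the following sketch simulates many jobs that all failed at the same instant and counts how many retries land in each one-second bucket. It reuses the delay formula from the example above; `jitteredDelay`, `retriesPerSecond`, and the injectable `rng` parameter are illustrative names, not part of Hook Mesh.

```javascript
// Same delay formula as the example above; rng is injectable so the
// behavior can be checked deterministically.
function jitteredDelay(attemptNumber, rng = Math.random) {
  const baseDelay = 5 * Math.pow(5, attemptNumber - 1);
  return baseDelay * (1 + (rng() - 0.5) * 0.5); // base delay ±25%
}

// Count how many retries fall into each one-second bucket.
function retriesPerSecond(jobCount, attemptNumber, rng = Math.random) {
  const buckets = new Map();
  for (let i = 0; i < jobCount; i++) {
    const second = Math.floor(jitteredDelay(attemptNumber, rng));
    buckets.set(second, (buckets.get(second) || 0) + 1);
  }
  return buckets;
}

const buckets = retriesPerSecond(1000, 2);
const peak = Math.max(...buckets.values());
console.log(`1000 retries spread over ${buckets.size} distinct seconds, peak ${peak} per second`);
```

Without jitter, all 1000 retries for attempt 2 would arrive in the same second (25s after the failure). With jitter they are spread across the 18.75s to 31.25s range.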

When Retries Happen

Hook Mesh automatically retries deliveries for these failure conditions:

Condition          Retry?   Reason
5xx Server Error   Yes      Temporary server issue
429 Rate Limited   Yes      Endpoint is throttling requests
Timeout            Yes      Endpoint took too long to respond
Connection Error   Yes      Network issue or endpoint down
DNS Failure        Yes      Temporary DNS resolution issue

When Retries DON'T Happen

Hook Mesh does NOT retry in these cases, because the delivery either already succeeded or failed in a way that a retry cannot fix:

Condition          Retry?   Reason
2xx Success        No       Webhook delivered successfully
400 Bad Request    No       Invalid payload (won't change on retry)
401 Unauthorized   No       Authentication issue (permanent)
404 Not Found      No       Endpoint URL doesn't exist
410 Gone           No       Endpoint permanently removed
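
The two tables can be condensed into a single decision function. This is a sketch under stated assumptions: the `outcome` shape is hypothetical, and the treatment of 1xx/3xx responses is a guess, since the tables do not cover them. Treating every 4xx other than 429 as permanent follows the job state flow later in this page.

```javascript
// outcome is a hypothetical shape: { status: number } for an HTTP response,
// or { error: 'timeout' | 'connection' | 'dns' } when no response arrived.
function shouldRetry(outcome) {
  if (outcome.error) {
    return ['timeout', 'connection', 'dns'].includes(outcome.error);
  }
  const { status } = outcome;
  if (status >= 200 && status < 300) return false; // delivered successfully
  if (status === 429) return true;                 // rate limited: retry later
  if (status >= 500 && status < 600) return true;  // temporary server issue
  // Other 4xx responses are permanent failures. 1xx/3xx are not covered by
  // the tables; treating them as non-retryable is an assumption of this sketch.
  return false;
}

console.log(shouldRetry({ status: 503 }));      // true
console.log(shouldRetry({ status: 404 }));      // false
console.log(shouldRetry({ error: 'timeout' })); // true
```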

Manual Retries

You can manually retry failed webhook jobs via the API or dashboard. This is useful when:

  • Your customer fixed their endpoint and wants immediate delivery
  • A temporary configuration issue has been resolved
  • You want to retry a job that was discarded after 48 hours

Manual Retry API
POST /v1/webhook-jobs/{job_id}/retry
// Manually retry a failed job
const response = await fetch(
  'https://api.hookmesh.com/v1/webhook-jobs/job_xyz/retry',
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.HOOKMESH_API_KEY}`
    }
  }
);

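// Surface HTTP errors instead of parsing an error body as a job
if (!response.ok) {
  throw new Error(`Manual retry failed with HTTP ${response.status}`);
}

const job = await response.json();
console.log('Job status:', job.status); // "created" (queued for retry)

Manual retries reset the attempt counter: The job starts again from attempt 0 and gets the full retry schedule.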

Job Lifecycle Visualization

Webhook Job State Flow
Created
   │
   ▼
Executing
   │
   ├─ 2xx ──▶ Succeeded (✓ Terminal)
   │
   ├─ 4xx ──▶ Discarded (✗ Terminal, except 429)
   │
   └─ 5xx/429/timeout/error
        │
        ▼
   Awaiting Retry
        │
        ├─ Within 48h ──▶ Back to Executing
        │
        └─ After 48h ──▶ Discarded (✗ Terminal)
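
The flow above can be read as a small state machine. The sketch below is illustrative only: `created` and `awaiting_retry` appear as status values elsewhere on this page, while `executing`, `succeeded`, `discarded`, and the event shape are assumed names.

```javascript
const WINDOW_MS = 48 * 60 * 60 * 1000; // 48-hour delivery window

// event is a hypothetical shape:
//   { result: 'success' | 'permanent_failure' | 'temporary_failure' } while executing,
//   { ageMs: number } (time since job creation) while awaiting retry.
function nextState(state, event) {
  switch (state) {
    case 'created':
      return 'executing';
    case 'executing':
      if (event.result === 'success') return 'succeeded';
      if (event.result === 'permanent_failure') return 'discarded';
      return 'awaiting_retry';
    case 'awaiting_retry':
      return event.ageMs < WINDOW_MS ? 'executing' : 'discarded';
    default:
      return state; // succeeded and discarded are terminal
  }
}

// Walk one job through a failed attempt followed by a successful retry.
let state = 'created';
state = nextState(state);                                  // executing
state = nextState(state, { result: 'temporary_failure' }); // awaiting_retry
state = nextState(state, { ageMs: 5000 });                 // executing
state = nextState(state, { result: 'success' });           // succeeded
console.log(state);
```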

Circuit Breaker Integration

If an endpoint fails consistently (5+ consecutive failures or 50%+ failure rate), Hook Mesh's circuit breaker automatically pauses deliveries to that endpoint. This prevents wasting retry attempts on endpoints that are clearly down.

  • Circuit opens: Stop attempting deliveries to protect the endpoint
  • Jobs queued: Webhooks wait in queue instead of being discarded
  • Test delivery: After 5 minutes, send a test webhook
  • Circuit closes: If test succeeds, resume normal deliveries

Retries + Circuit Breaker = Maximum Reliability: Temporary failures get retried. Persistent failures trigger the circuit breaker to avoid overwhelming bad endpoints.
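
The circuit breaker steps above can be sketched as a small class. This is not Hook Mesh's implementation. The thresholds (5 consecutive failures, 50% failure rate, 5-minute test delay) come from this section; the sliding window of 20 results, the minimum of 10 samples before the rate applies, and restarting the wait after a failed test delivery are assumptions made to keep the example concrete.

```javascript
const MAX_CONSECUTIVE = 5;       // 5+ consecutive failures opens the circuit
const FAILURE_RATE = 0.5;        // 50%+ failure rate opens the circuit
const COOLDOWN_MS = 5 * 60 * 1000; // test delivery after 5 minutes

class CircuitBreaker {
  constructor({ now = Date.now, windowSize = 20, minSamples = 10 } = {}) {
    this.now = now;               // injectable clock, useful for tests
    this.windowSize = windowSize; // assumption: rate over the last N results
    this.minSamples = minSamples; // assumption: ignore the rate below N results
    this.results = [];            // recent outcomes, true = success
    this.consecutiveFailures = 0;
    this.openedAt = null;         // null means the circuit is closed
  }

  // 'deliver' when closed, 'queue' while open, 'test' once the cooldown has passed.
  decide() {
    if (this.openedAt === null) return 'deliver';
    return this.now() - this.openedAt >= COOLDOWN_MS ? 'test' : 'queue';
  }

  record(success) {
    if (this.openedAt !== null) {
      // Outcome of a test delivery: close on success, otherwise wait again (assumption).
      if (success) {
        this.openedAt = null;
        this.results = [];
        this.consecutiveFailures = 0;
      } else {
        this.openedAt = this.now();
      }
      return;
    }
    this.results.push(success);
    if (this.results.length > this.windowSize) this.results.shift();
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1;
    const failures = this.results.filter((ok) => !ok).length;
    const rateTripped = this.results.length >= this.minSamples &&
      failures / this.results.length >= FAILURE_RATE;
    if (this.consecutiveFailures >= MAX_CONSECUTIVE || rateTripped) {
      this.openedAt = this.now();
    }
  }
}
```

A delivery loop would call `decide()` before each attempt, keep the job queued when it returns `'queue'`, and report the outcome with `record()` after each delivery or test delivery.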

Monitoring Retry Metrics

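Track retry behavior in your dashboard, or query it through the API:

// Get retry statistics for an application
const apiKey = process.env.HOOKMESH_API_KEY;

const response = await fetch(
  'https://api.hookmesh.com/v1/webhook-jobs?application_id=app_xyz&status=awaiting_retry',
  {
    headers: {
      'Authorization': `Bearer ${apiKey}`
    }
  }
);

if (!response.ok) {
  throw new Error(`Listing jobs failed with HTTP ${response.status}`);
}

const { data, pagination } = await response.json();
console.log(`${pagination.total} jobs awaiting retry`);

// Calculate retry rate.
// getTotalJobs() is a placeholder: supply your own function that returns
// the total number of jobs for the same application and period.
const totalJobs = await getTotalJobs();
const retryRate = totalJobs > 0 ? (pagination.total / totalJobs) * 100 : 0;
console.log(`Retry rate: ${retryRate.toFixed(1)}%`);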

Healthy Retry Metrics

Metric                Healthy Range   Alert Threshold
Retry rate            < 5%            > 10%
Jobs in retry queue   < 100           > 1000
Average attempts      1-2             > 3
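
The thresholds in the table can be checked automatically. The sketch below classifies each metric as `healthy`, `alert`, or, for values between the two ranges, `watch`. The metric field names and the `watch` label are assumptions for illustration; only the numeric thresholds come from the table.

```javascript
// Thresholds taken from the table above; field names are illustrative.
const RULES = {
  retryRatePercent: { healthy: (v) => v < 5, alert: (v) => v > 10 },
  jobsInRetryQueue: { healthy: (v) => v < 100, alert: (v) => v > 1000 },
  averageAttempts: { healthy: (v) => v <= 2, alert: (v) => v > 3 },
};

function classifyMetrics(metrics) {
  const report = {};
  for (const [name, rule] of Object.entries(RULES)) {
    const value = metrics[name];
    if (rule.alert(value)) report[name] = 'alert';
    else if (rule.healthy(value)) report[name] = 'healthy';
    else report[name] = 'watch'; // between the healthy range and the alert threshold
  }
  return report;
}

console.log(classifyMetrics({
  retryRatePercent: 3,
  jobsInRetryQueue: 500,
  averageAttempts: 3.5,
}));
// { retryRatePercent: 'healthy', jobsInRetryQueue: 'watch', averageAttempts: 'alert' }
```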

Best Practices

  • Return correct status codes: Use 5xx for temporary failures, 4xx for permanent errors
  • Implement idempotency: Store webhook IDs to handle duplicate deliveries
  • Respond quickly: Return 200 immediately, process async (prevents timeouts)
  • Use 429 for rate limiting: Hook Mesh will respect your rate limits and retry
  • Monitor retry rates: High retry rates indicate endpoint issues
  • Set appropriate timeouts: Match your endpoint's response time (15-30s typical)
  • Test failure scenarios: Verify your endpoint handles retries correctly
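
On the receiving side, several of these practices fit into one small handler. This is a hypothetical sketch, not Hook Mesh SDK code: `event.id`, `processedIds`, and `enqueue` are illustrative names, and the in-memory `Set` stands in for a durable store. The function returns the HTTP status code the endpoint should respond with.

```javascript
// Hypothetical receiver-side handler; names are illustrative only.
const processedIds = new Set(); // use a durable store (database, cache) in production

function handleWebhook(event, enqueue) {
  if (!event || typeof event.id !== 'string') {
    return 400; // malformed payload: permanent error, will not be retried
  }
  if (processedIds.has(event.id)) {
    return 200; // duplicate delivery: acknowledge without reprocessing
  }
  try {
    enqueue(event); // hand off to a background worker and return quickly
  } catch (err) {
    return 503; // temporary problem: a 5xx asks for a retry
  }
  processedIds.add(event.id);
  return 200; // acknowledge immediately; processing happens asynchronously
}

const queue = [];
console.log(handleWebhook({ id: 'evt_1' }, (e) => queue.push(e))); // 200
console.log(handleWebhook({ id: 'evt_1' }, (e) => queue.push(e))); // 200, not enqueued again
console.log(queue.length); // 1
```

The ID is recorded only after the hand-off succeeds, so a delivery that failed with 503 is processed normally when it is retried.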

Comparison with Other Providers

Provider    Initial Delay   Max Retries   Total Duration
Hook Mesh   5 seconds       6+ attempts   48 hours
Stripe      5 seconds       8 attempts    ~6 hours
Svix        5 seconds       5 attempts    ~1 hour
Slack       Immediate       5 attempts    ~2 hours

Hook Mesh offers the longest delivery window in this comparison: 48 hours gives critical events the most time to be delivered while remaining cost-effective for SMB customers.

Next Steps