Retry Strategy

Hook Mesh automatically retries failed webhook deliveries using exponential backoff with jitter. This lets deliveries survive brief outages without flooding your customers' endpoints with requests.

Why Retries Matter

Networks are unreliable. Customer endpoints experience temporary issues:

Deployments: Brief downtime during code deploys (30-60 seconds)
Network blips: Transient connection failures or timeouts
Rate limiting: Temporary throttling during traffic spikes
Server overload: 503 errors during high load

Without retries, each of these temporary failures would mean a lost event. Retrying on a schedule lets Hook Mesh deliver the webhook once the endpoint has recovered.

Exponential Backoff Schedule

Hook Mesh retries failed deliveries with exponentially increasing delays between attempts. This gives endpoints time to recover without overwhelming them with requests.

Retry	Delay (approx.)	Cumulative Time	Delay with Jitter (±25%)
1	5 seconds	5s	3.75-6.25s
2	25 seconds	30s	18.75-31.25s
3	2 minutes	2.5 min	1m 30s - 2m 30s
4	10 minutes	12.5 min	7.5m - 12.5m
5	50 minutes	~1 hour	37.5m - 62.5m
6	4 hours	~5 hours	3h - 5h

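Formula: The base delay for retry n is 5 × 5^(n-1) seconds, so each delay is 5x the previous one (5s, 25s, 125s, 625s, 3125s...), with ±25% random jitter applied. The table rounds the larger values for readability (for example, 125s is shown as 2 minutes).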

48-Hour Delivery Window

After 6 retry attempts (~5 hours), Hook Mesh continues retrying with a maximum delay of 6 hours until 48 hours have elapsed from job creation. If delivery still hasn't succeeded after 48 hours, the job is marked as discarded.

Complete Retry Timeline

Initial attempt:        0s
Retry 1:                5s
Retry 2:                30s
Retry 3:                2m 30s
Retry 4:                12m 30s
Retry 5:                1h 2m
Retry 6:                5h 2m
Retry 7+:               Every 6 hours until 48 hours elapsed

Final status at 48h:    Discarded (if still failing)
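The timeline above can be reproduced with a short sketch. It is only an illustration under stated assumptions: the 6-hour cap is applied to the base delay, jitter is left out, the 48-hour window is measured from job creation, and a retry that would land after the window is simply not scheduled. The constant and function names are hypothetical, not part of the Hook Mesh API.

```javascript
// Sketch of the nominal (jitter-free) retry schedule.
// All names here are illustrative; the values come from the text above.
const BASE_DELAY_S = 5;            // first retry delay
const MULTIPLIER = 5;              // each delay is 5x the previous one
const MAX_DELAY_S = 6 * 60 * 60;   // delays are capped at 6 hours
const WINDOW_S = 48 * 60 * 60;     // 48-hour delivery window

function buildRetrySchedule() {
  const schedule = [];
  let elapsed = 0; // seconds since job creation
  for (let retry = 1; ; retry++) {
    const uncapped = BASE_DELAY_S * Math.pow(MULTIPLIER, retry - 1);
    const delay = Math.min(uncapped, MAX_DELAY_S);
    // Assumption: a retry that would fall outside the window is not scheduled.
    if (elapsed + delay > WINDOW_S) break;
    elapsed += delay;
    schedule.push({ retry, delay, elapsed });
  }
  return schedule;
}

for (const { retry, delay, elapsed } of buildRetrySchedule()) {
  console.log(`Retry ${retry}: wait ${delay}s, total ${elapsed}s`);
}
```

Because the sketch uses exact multiples of 5 rather than the rounded figures in the timeline, its cumulative times run slightly later (for example, retry 3 lands at 155s rather than 2m 30s). Under these assumptions the loop yields 13 retries, the last one roughly 47.4 hours after job creation.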

Jitter Explained

Jitter adds ±25% random variation to each retry delay. This prevents the "thundering herd" problem where many failed webhooks would retry at exactly the same time, potentially overwhelming a recovering endpoint.

Jitter Calculation Example

// Base delay for retry 2: 25 seconds
// Jitter range: ±25% = ±6.25 seconds
// Actual delay: randomly between 18.75s and 31.25s

function calculateRetryDelay(attemptNumber) {
  const baseDelay = 5 * Math.pow(5, attemptNumber - 1); // 5s * 5^(n-1)
  const jitterPercent = (Math.random() - 0.5) * 0.5; // ±25%
  const jitter = baseDelay * jitterPercent;
  return baseDelay + jitter;
}

console.log(calculateRetryDelay(2)); // ~25s ±6.25s

When Retries Happen

Hook Mesh automatically retries deliveries for these failure conditions:

Condition	Retry?	Reason
`5xx Server Error`	Yes	Temporary server issue
`429 Rate Limited`	Yes	Endpoint is throttling requests
`Timeout`	Yes	Endpoint took too long to respond
`Connection Error`	Yes	Network issue or endpoint down
`DNS Failure`	Yes	Temporary DNS resolution issue

When Retries DON'T Happen

Hook Mesh does NOT retry these conditions because they indicate permanent failures:

Condition	Retry?	Reason
`2xx Success`	No	Webhook delivered successfully
`400 Bad Request`	No	Invalid payload (won't change on retry)
`401 Unauthorized`	No	Authentication issue (permanent)
`404 Not Found`	No	Endpoint URL doesn't exist
`410 Gone`	No	Endpoint permanently removed

Important: 4xx errors (except 429) are treated as permanent failures and won't be retried. Make sure your customers' endpoints return 5xx or 429, not another 4xx code, for any failure they want retried.
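The two tables above amount to a small decision rule, sketched below. The `result` shape is hypothetical (Hook Mesh's internal representation is not documented here), and status codes the tables do not mention, such as 1xx and 3xx, fall through to "no retry" purely as a placeholder.

```javascript
// Sketch of the retry decision described by the tables above.
// Hypothetical input shape:
//   { status: 503 }        for an HTTP response
//   { error: 'timeout' }   when no response arrived
//                          ('timeout' | 'connection' | 'dns')
function shouldRetry(result) {
  if (result.error) return true; // timeout, connection error, DNS failure

  const { status } = result;
  if (status >= 200 && status < 300) return false; // delivered successfully
  if (status === 429) return true;                 // endpoint is throttling
  if (status >= 500 && status < 600) return true;  // temporary server issue

  // Remaining 4xx codes are permanent. Codes not covered by the tables
  // (1xx, 3xx) also end up here, which is an assumption of this sketch.
  return false;
}

console.log(shouldRetry({ status: 503 }));      // true
console.log(shouldRetry({ status: 404 }));      // false
console.log(shouldRetry({ error: 'timeout' })); // true
```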

Manual Retries

You can manually retry failed webhook jobs via the API or dashboard. This is useful when:

Your customer fixed their endpoint and wants immediate delivery
A temporary configuration issue has been resolved
You want to retry a job that was discarded after 48 hours

Manual Retry API

POST /v1/webhook-jobs/{job_id}/retry

// Manually retry a failed job
const response = await fetch(
  'https://api.hookmesh.com/v1/webhook-jobs/job_xyz/retry',
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.HOOKMESH_API_KEY}`
    }
  }
);

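// Fail loudly if the API rejected the retry request
if (!response.ok) {
  throw new Error(`Manual retry failed: HTTP ${response.status}`);
}

const job = await response.json();
console.log('Job status:', job.status); // "created" (queued for retry)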

Manual retries reset the attempt counter: The job starts fresh with attempt 0, getting the full retry schedule again.

Job Lifecycle Visualization

Webhook Job State Flow

Created
   │
   ▼
Executing
   │
   ├─ 2xx ──▶ Succeeded (✓ Terminal)
   │
   ├─ 4xx (except 429) ──▶ Discarded (✗ Terminal)
   │
   └─ 5xx/429/timeout/error
        │
        ▼
   Awaiting Retry
        │
        ├─ Within 48h ──▶ Back to Executing
        │
        └─ After 48h ──▶ Discarded (✗ Terminal)
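The state flow above can be written as a small transition function. This is a sketch, not Hook Mesh's implementation: `created` and `awaiting_retry` appear as status values elsewhere on this page, while `executing`, `succeeded`, `discarded`, and the outcome labels are assumed spellings chosen for illustration.

```javascript
const WINDOW_MS = 48 * 60 * 60 * 1000; // 48-hour delivery window

// state:   'created' | 'executing' | 'awaiting_retry' | 'succeeded' | 'discarded'
// outcome: 'success' | 'permanent' | 'transient' (hypothetical labels for
//          2xx, non-429 4xx, and 5xx/429/timeout/error respectively)
// ageMs:   time since job creation, in milliseconds
function nextState(state, outcome, ageMs) {
  switch (state) {
    case 'created':
      return 'executing';
    case 'executing':
      if (outcome === 'success') return 'succeeded';
      if (outcome === 'permanent') return 'discarded';
      return 'awaiting_retry';
    case 'awaiting_retry':
      // Retry only while the job is still inside the delivery window.
      return ageMs < WINDOW_MS ? 'executing' : 'discarded';
    default:
      return state; // 'succeeded' and 'discarded' are terminal
  }
}

console.log(nextState('executing', 'transient', 1000));       // awaiting_retry
console.log(nextState('awaiting_retry', null, WINDOW_MS + 1)); // discarded
```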

Circuit Breaker Integration

If an endpoint fails consistently (5+ consecutive failures or 50%+ failure rate), Hook Mesh's circuit breaker automatically pauses deliveries to that endpoint. This prevents wasting retry attempts on endpoints that are clearly down.

Circuit opens: Stop attempting deliveries to protect the endpoint
Jobs queued: Webhooks wait in queue instead of being discarded
Test delivery: After 5 minutes, send a test webhook
Circuit closes: If test succeeds, resume normal deliveries

Retries + Circuit Breaker = Maximum Reliability: Temporary failures get retried. Persistent failures trigger the circuit breaker to avoid overwhelming bad endpoints.
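To make the open, queue, test, close cycle concrete, here is a minimal circuit breaker sketch. It is not Hook Mesh's implementation. The thresholds (5 consecutive failures, 50% failure rate, 5-minute wait) come from the text above; everything else is an assumption: the failure rate is measured over the last `windowSize` results and only once that window is full, and a failed test delivery restarts the 5-minute wait.

```javascript
class CircuitBreaker {
  constructor({ consecutiveLimit = 5, rateLimit = 0.5, windowSize = 20,
                cooldownMs = 5 * 60 * 1000 } = {}) {
    this.consecutiveLimit = consecutiveLimit;
    this.rateLimit = rateLimit;
    this.windowSize = windowSize; // assumption: rate is taken over the last N results
    this.cooldownMs = cooldownMs;
    this.results = [];            // recent outcomes, true = success
    this.consecutiveFailures = 0;
    this.openedAt = null;         // null means the circuit is closed
  }

  // 'deliver' while closed, 'queue' while open, 'test' once the cooldown has passed.
  decide(now) {
    if (this.openedAt === null) return 'deliver';
    return now - this.openedAt >= this.cooldownMs ? 'test' : 'queue';
  }

  record(success, now) {
    if (this.openedAt !== null) {
      // This result belongs to a test delivery.
      if (success) {
        this.openedAt = null; // close the circuit and start with clean counters
        this.results = [];
        this.consecutiveFailures = 0;
      } else {
        this.openedAt = now;  // assumption: a failed test restarts the wait
      }
      return;
    }
    this.results.push(success);
    if (this.results.length > this.windowSize) this.results.shift();
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1;

    const failures = this.results.filter((ok) => !ok).length;
    const windowFull = this.results.length === this.windowSize;
    const rateTripped = windowFull && failures / this.results.length >= this.rateLimit;
    if (this.consecutiveFailures >= this.consecutiveLimit || rateTripped) {
      this.openedAt = now;
    }
  }
}
```

Timestamps are passed in rather than read from `Date.now()` so the behavior is easy to exercise in tests; in a delivery loop you would call `decide(Date.now())` before each attempt and `record(ok, Date.now())` after it.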

Monitoring Retry Metrics

Track retry behavior in your dashboard, or query the API directly:

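// Get retry statistics for an application
const apiKey = process.env.HOOKMESH_API_KEY;
const baseUrl = 'https://api.hookmesh.com/v1/webhook-jobs?application_id=app_xyz';

// Fetch one page of jobs and return the total count reported by the API
async function countJobs(url) {
  const response = await fetch(url, {
    headers: {
      'Authorization': `Bearer ${apiKey}`
    }
  });
  if (!response.ok) {
    throw new Error(`Request failed: HTTP ${response.status}`);
  }
  const { pagination } = await response.json();
  return pagination.total;
}

const awaitingRetry = await countJobs(`${baseUrl}&status=awaiting_retry`);
console.log(`${awaitingRetry} jobs awaiting retry`);

// Calculate retry rate
// Assumption: omitting the status filter returns all jobs for the application
const totalJobs = await countJobs(baseUrl);
const retryRate = totalJobs > 0 ? (awaitingRetry / totalJobs) * 100 : 0;
console.log(`Retry rate: ${retryRate.toFixed(1)}%`);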

Healthy Retry Metrics

Metric	Healthy Range	Alert Threshold
Retry rate	< 5%	> 10%
Jobs in retry queue	< 100	> 1000
Average attempts	1-2	> 3
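If you feed these numbers into your own alerting, the table can be encoded as a small lookup. This is a sketch with hypothetical metric keys; the table does not name the band between "healthy" and "alert", so the sketch calls it `warning`.

```javascript
// Thresholds copied from the table above; key names are illustrative.
const THRESHOLDS = {
  retryRatePct: { healthyBelow: 5, alertAbove: 10 },
  queuedJobs:   { healthyBelow: 100, alertAbove: 1000 },
  avgAttempts:  { healthyUpTo: 2, alertAbove: 3 },
};

function classifyMetric(name, value) {
  const t = THRESHOLDS[name];
  if (!t) throw new Error(`Unknown metric: ${name}`);
  if (value > t.alertAbove) return 'alert';
  // "1-2" is an inclusive range; the other healthy ranges are strict "<".
  const healthy = 'healthyUpTo' in t ? value <= t.healthyUpTo : value < t.healthyBelow;
  return healthy ? 'healthy' : 'warning';
}

console.log(classifyMetric('retryRatePct', 3));  // healthy
console.log(classifyMetric('queuedJobs', 1500)); // alert
```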

Best Practices

Return correct status codes: Use 5xx for temporary failures, 4xx for permanent errors
Implement idempotency: Store webhook IDs to handle duplicate deliveries
Respond quickly: Return 200 immediately, process async (prevents timeouts)
Use 429 for rate limiting: Hook Mesh will respect your rate limits and retry
Monitor retry rates: High retry rates indicate endpoint issues
Set appropriate timeouts: Match your endpoint's response time (15-30s typical)
Test failure scenarios: Verify your endpoint handles retries correctly
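Several of these practices (correct status codes, idempotency, responding quickly) can be combined in a single receiving function, sketched below from the customer's side. It assumes each event carries a unique ID in an `id` field, which is a hypothetical field name, and it uses an in-memory `Set` where a real endpoint would need a durable store. `enqueue` stands for whatever hands the event to a background worker.

```javascript
// Returns the HTTP status code the endpoint should respond with.
// seenIds: Set of webhook IDs already accepted
// enqueue: function that defers processing to a background worker
function receiveWebhook(event, seenIds, enqueue) {
  // Malformed payload: a permanent error, so answer 4xx (no retry).
  if (!event || typeof event.id !== 'string') return 400;

  // Duplicate delivery (for example, a retry after a lost response):
  // acknowledge it without processing the event twice.
  if (seenIds.has(event.id)) return 200;

  try {
    enqueue(event); // defer the real work so the response stays fast
  } catch (err) {
    return 503;     // temporary problem: 5xx asks Hook Mesh to retry
  }

  seenIds.add(event.id); // remember the ID only once the event is accepted
  return 200;
}

const seen = new Set();
const queue = [];
console.log(receiveWebhook({ id: 'evt_1' }, seen, (e) => queue.push(e))); // 200
console.log(receiveWebhook({ id: 'evt_1' }, seen, (e) => queue.push(e))); // 200, not queued again
```

Recording the ID only after `enqueue` succeeds means a failed attempt is not mistaken for a duplicate when the retry arrives.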

Comparison with Other Providers

Provider	Initial Delay	Max Retries	Total Duration
Hook Mesh	5 seconds	6+ attempts	48 hours
Stripe	5 seconds	8 attempts	~6 hours
Svix	5 seconds	5 attempts	~1 hour
Slack	Immediate	5 attempts	~2 hours

Of the schedules listed here, Hook Mesh has the longest delivery window: 48 hours gives critical events more time to be delivered while remaining cost-effective for SMB customers.

Next Steps

Learn about the circuit breaker mechanism for handling persistent failures
Implement idempotency to handle duplicate deliveries safely
View the Webhook Jobs API for manual retry operations
Understand delivery guarantees and SLA commitments