Retry Strategy
Hook Mesh automatically retries failed webhook deliveries using exponential backoff with jitter. This improves delivery reliability without overloading your customers' endpoints.
Why Retries Matter
Networks are unreliable. Customer endpoints experience temporary issues:
- Deployments: Brief downtime during code deploys (30-60 seconds)
- Network blips: Transient connection failures or timeouts
- Rate limiting: Temporary throttling during traffic spikes
- Server overload: 503 errors during high load
Without retries, these temporary failures would result in lost events. Hook Mesh's retry strategy keeps attempting delivery, so webhooks still arrive when an endpoint has a brief outage.
Exponential Backoff Schedule
Hook Mesh retries failed deliveries with exponentially increasing delays between attempts. This gives endpoints time to recover without overwhelming them with requests.
| Attempt | Delay | Cumulative Time | With Jitter (±25%) |
|---|---|---|---|
| 1 | 5 seconds | 5s | 3.75-6.25s |
| 2 | 25 seconds | 30s | 18.75-31.25s |
| 3 | 2 minutes | 2.5 min | 1m 30s - 2m 30s |
| 4 | 10 minutes | 12.5 min | 7.5m - 12.5m |
| 5 | 50 minutes | 1 hour | 37.5m - 62.5m |
| 6 | 4 hours | 5 hours | 3h - 5h |
48-Hour Delivery Window
After 6 retry attempts (~5 hours), Hook Mesh continues retrying with a maximum delay of 6 hours until 48 hours have elapsed from job creation. If delivery still hasn't succeeded after 48 hours, the job is marked as discarded.
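The schedule described above can be sketched as a small function. This is a minimal illustration, not Hook Mesh's implementation: it uses the rounded base delays from the schedule table, ignores jitter, and assumes that a retry which would start after the 48-hour window is simply not attempted. The names `BASE_DELAYS_S`, `MAX_DELAY_S`, `WINDOW_S`, and `retryOffsets` are hypothetical.

```javascript
// Sketch only: base delays taken from the schedule table, jitter ignored.
const BASE_DELAYS_S = [5, 25, 120, 600, 3000, 14400]; // retries 1-6
const MAX_DELAY_S = 6 * 60 * 60; // 6-hour delay used from retry 7 onward
const WINDOW_S = 48 * 60 * 60;   // 48-hour delivery window

// Returns the offset (seconds after job creation) of every retry.
function retryOffsets() {
  const offsets = [];
  let elapsed = 0;
  for (let retry = 1; ; retry++) {
    const delay = retry <= BASE_DELAYS_S.length
      ? BASE_DELAYS_S[retry - 1]
      : MAX_DELAY_S;
    // Assumption: a retry that would start after the window is not attempted.
    if (elapsed + delay > WINDOW_S) break;
    elapsed += delay;
    offsets.push(elapsed);
  }
  return offsets;
}

const offsets = retryOffsets();
console.log(offsets.slice(0, 6)); // [ 5, 30, 150, 750, 3750, 18150 ]
console.log(offsets.length);      // 13
```

Under these assumptions, 13 retries fit inside the window, with the last one starting roughly 47 hours after job creation. Jitter shifts the real offsets, so treat the numbers as approximate.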
Initial attempt: 0s
Retry 1: 5s
Retry 2: 30s
Retry 3: 2m 30s
Retry 4: 12m 30s
Retry 5: 1h 2m
Retry 6: 5h 2m
Retry 7+: Every 6 hours until 48 hours elapsed
Final status at 48h: Discarded (if still failing)
Jitter Explained
Jitter adds ±25% random variation to each retry delay. This prevents the "thundering herd" problem where many failed webhooks would retry at exactly the same time, potentially overwhelming a recovering endpoint.
// Base delay for retry 2: 25 seconds
// Jitter range: ±25% = ±6.25 seconds
// Actual delay: randomly between 18.75s and 31.25s
function calculateRetryDelay(attemptNumber) {
  const baseDelay = 5 * Math.pow(5, attemptNumber - 1); // 5s * 5^(n-1)
  const jitterPercent = (Math.random() - 0.5) * 0.5; // ±25%
  const jitter = baseDelay * jitterPercent;
  return baseDelay + jitter;
}
console.log(calculateRetryDelay(2)); // ~25s ±6.25s
When Retries Happen
Hook Mesh automatically retries deliveries for these failure conditions:
| Condition | Retry? | Reason |
|---|---|---|
| 5xx Server Error | Yes | Temporary server issue |
| 429 Rate Limited | Yes | Endpoint is throttling requests |
| Timeout | Yes | Endpoint took too long to respond |
| Connection Error | Yes | Network issue or endpoint down |
| DNS Failure | Yes | Temporary DNS resolution issue |
When Retries DON'T Happen
Hook Mesh does NOT retry these conditions because the delivery has either succeeded or failed permanently:
| Condition | Retry? | Reason |
|---|---|---|
| 2xx Success | No | Webhook delivered successfully |
| 400 Bad Request | No | Invalid payload (won't change on retry) |
| 401 Unauthorized | No | Authentication issue (permanent) |
| 404 Not Found | No | Endpoint URL doesn't exist |
| 410 Gone | No | Endpoint permanently removed |
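The two tables above amount to a classification rule for each delivery attempt. The sketch below shows one way to express it. It is illustrative only: the outcome shape (`{ status }` or `{ error }`), the error names, and the function `classifyAttempt` are hypothetical, and the handling of status codes the tables do not list (such as 3xx) is an assumption.

```javascript
// Sketch only: the outcome shape ({ status } or { error }) is hypothetical.
const RETRYABLE_ERRORS = new Set(["timeout", "connection_error", "dns_failure"]);

// Returns "succeeded", "retry", or "discard" for one delivery attempt.
function classifyAttempt(outcome) {
  if (outcome.error) {
    return RETRYABLE_ERRORS.has(outcome.error) ? "retry" : "discard";
  }
  const status = outcome.status;
  if (status >= 200 && status < 300) return "succeeded";
  if (status === 429) return "retry";                    // rate limited
  if (status >= 500 && status < 600) return "retry";     // temporary server issue
  // Other 4xx codes are permanent. Unlisted codes (e.g. 3xx) are treated
  // as permanent too, which is an assumption of this sketch.
  return "discard";
}

console.log(classifyAttempt({ status: 503 }));      // "retry"
console.log(classifyAttempt({ status: 404 }));      // "discard"
console.log(classifyAttempt({ error: "timeout" })); // "retry"
```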
Manual Retries
You can manually retry failed webhook jobs via the API or dashboard. This is useful when:
- Your customer fixed their endpoint and wants immediate delivery
- A temporary configuration issue has been resolved
- You want to retry a job that was discarded after 48 hours
POST /v1/webhook-jobs/{job_id}/retry
// Manually retry a failed job
const response = await fetch(
  'https://api.hookmesh.com/v1/webhook-jobs/job_xyz/retry',
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.HOOKMESH_API_KEY}`
    }
  }
);
const job = await response.json();
console.log('Job status:', job.status); // "created" (queued for retry)
Job Lifecycle Visualization
Created
   │
   ▼
Executing
   │
   ├─ 2xx ──▶ Succeeded (✓ Terminal)
   │
   ├─ 4xx ──▶ Discarded (✗ Terminal, except 429)
   │
   └─ 5xx/timeout/error
          │
          ▼
     Awaiting Retry
          │
          ├─ Within 48h ──▶ Back to Executing
          │
          └─ After 48h ──▶ Discarded (✗ Terminal)
Circuit Breaker Integration
If an endpoint fails consistently (5+ consecutive failures or 50%+ failure rate), Hook Mesh's circuit breaker automatically pauses deliveries to that endpoint. This prevents wasting retry attempts on endpoints that are clearly down.
- Circuit opens: Stop attempting deliveries to protect the endpoint
- Jobs queued: Webhooks wait in queue instead of being discarded
- Test delivery: After 5 minutes, send a test webhook
- Circuit closes: If test succeeds, resume normal deliveries
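The four steps above can be sketched as a small state machine. This is a simplified illustration rather than Hook Mesh's implementation: it models only the consecutive-failure trigger (the 50% failure-rate trigger needs a measurement window the text does not specify), and the class `CircuitBreaker`, its methods, and the injectable clock are hypothetical.

```javascript
// Sketch only: thresholds come from the text; everything else is assumed.
const FAILURE_THRESHOLD = 5;         // consecutive failures before opening
const TEST_DELAY_MS = 5 * 60 * 1000; // wait before sending a test delivery

class CircuitBreaker {
  constructor(now = () => Date.now()) {
    this.now = now; // injectable clock, useful for testing
    this.state = "closed";
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  // "deliver": send normally; "test": send one test webhook; "queue": hold the job.
  nextAction() {
    if (this.state === "closed") return "deliver";
    return this.now() - this.openedAt >= TEST_DELAY_MS ? "test" : "queue";
  }

  recordSuccess() {
    this.state = "closed";
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure() {
    this.consecutiveFailures += 1;
    if (this.state === "open" || this.consecutiveFailures >= FAILURE_THRESHOLD) {
      // Open the circuit, or restart the wait after a failed test delivery.
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}

let clock = 0;
const breaker = new CircuitBreaker(() => clock);
for (let i = 0; i < 5; i++) breaker.recordFailure();
console.log(breaker.nextAction()); // "queue"
clock += TEST_DELAY_MS;
console.log(breaker.nextAction()); // "test"
breaker.recordSuccess();
console.log(breaker.nextAction()); // "deliver"
```

Passing the clock in as a function keeps the sketch deterministic, so the five-minute wait can be exercised without real delays.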
Monitoring Retry Metrics
Track retry behavior in your dashboard or through the API:
// Get retry statistics for an application
const response = await fetch(
  'https://api.hookmesh.com/v1/webhook-jobs?application_id=app_xyz&status=awaiting_retry',
  {
    headers: {
      'Authorization': `Bearer ${process.env.HOOKMESH_API_KEY}`
    }
  }
);
const { data, pagination } = await response.json();
console.log(`${pagination.total} jobs awaiting retry`);
// Calculate retry rate
// getTotalJobs() is a placeholder for your own helper that returns the total job count
const totalJobs = await getTotalJobs();
const retryRate = totalJobs > 0 ? (pagination.total / totalJobs) * 100 : 0;
console.log(`Retry rate: ${retryRate.toFixed(1)}%`);
Healthy Retry Metrics
| Metric | Healthy Range | Alert Threshold |
|---|---|---|
| Retry rate | < 5% | > 10% |
| Jobs in retry queue | < 100 | > 1000 |
| Average attempts | 1-2 | > 3 |
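The thresholds in the table can be turned into a simple check for an alerting script. The sketch below is one possible shape: the metric keys, the `classifyMetric` function, and the "warning" band for values between the healthy range and the alert threshold are assumptions, since the table only defines the two ends.

```javascript
// Thresholds from the table; the "warning" band between them is an assumption.
const THRESHOLDS = {
  retryRatePercent: { isHealthy: (v) => v < 5, alertAbove: 10 },
  jobsInRetryQueue: { isHealthy: (v) => v < 100, alertAbove: 1000 },
  averageAttempts: { isHealthy: (v) => v >= 1 && v <= 2, alertAbove: 3 },
};

// Returns "healthy", "warning", or "alert" for a single metric value.
function classifyMetric(name, value) {
  const threshold = THRESHOLDS[name];
  if (!threshold) throw new Error(`Unknown metric: ${name}`);
  if (value > threshold.alertAbove) return "alert";
  return threshold.isHealthy(value) ? "healthy" : "warning";
}

console.log(classifyMetric("retryRatePercent", 3.2)); // "healthy"
console.log(classifyMetric("jobsInRetryQueue", 450)); // "warning"
console.log(classifyMetric("averageAttempts", 3.5));  // "alert"
```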
Best Practices
- Return correct status codes: Use 5xx for temporary failures, 4xx for permanent errors
- Implement idempotency: Store webhook IDs to handle duplicate deliveries
- Respond quickly: Return a 2xx immediately and process the event asynchronously to avoid timeouts
- Use 429 for rate limiting: Hook Mesh will respect your rate limits and retry
- Monitor retry rates: High retry rates indicate endpoint issues
- Set appropriate timeouts: Match your endpoint's response time (15-30s typical)
- Test failure scenarios: Verify your endpoint handles retries correctly
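Several of these practices (correct status codes, idempotency, responding quickly) can be combined in a receiving handler. The sketch below is framework-agnostic and illustrative only: the event shape (`{ id, ... }`), the function names, and the in-memory `Set` are assumptions, and a real endpoint would keep seen IDs in a durable store and verify the request before trusting it.

```javascript
// Sketch only: the event shape ({ id, ... }) and helper names are hypothetical.
function createWebhookHandler(processAsync, schedule = setImmediate) {
  const seenIds = new Set(); // use a durable store (e.g. a database) in production

  // Returns the HTTP status code the endpoint should respond with.
  return function handleWebhook(event) {
    if (!event || typeof event.id !== "string") return 400; // permanent: malformed
    if (seenIds.has(event.id)) return 200; // duplicate delivery: acknowledge and skip
    seenIds.add(event.id);
    // Defer the real work so the response is not delayed by processing.
    schedule(() => processAsync(event));
    return 200;
  };
}

const handle = createWebhookHandler((event) => console.log("processing", event.id));
console.log(handle({ id: "evt_123" })); // 200, processing is scheduled
console.log(handle({ id: "evt_123" })); // 200, duplicate is skipped
```

Returning 400 for a malformed payload signals a permanent failure, so it will not be retried. An endpoint that is temporarily unable to accept events should return a 5xx or 429 instead.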
Comparison with Other Providers
| Provider | Initial Delay | Max Retries | Total Duration |
|---|---|---|---|
| Hook Mesh | 5 seconds | 6+ attempts | 48 hours |
| Stripe | Not published | Not published | Up to 3 days (live mode) |
| Svix | 5 seconds | 7 retries | ~28 hours |
| Slack | Nearly immediate | 3 retries | ~5 minutes |
Next Steps
- Learn about the circuit breaker mechanism for handling persistent failures
- Implement idempotency to handle duplicate deliveries safely
- View the Webhook Jobs API for manual retry operations
- Understand delivery guarantees and SLA commitments