Monitoring

Effective webhook monitoring is essential for reliability and customer trust. This guide covers key metrics, monitoring strategies, alerting best practices, and how to set up comprehensive observability for your webhook infrastructure.

Why monitoring matters: Webhooks are critical integration points between systems. Poor webhook reliability can lead to data loss, delayed processing, and customer dissatisfaction. Proactive monitoring helps you catch issues before they impact customers.

Key Metrics to Track

These metrics provide visibility into webhook health and performance. Track them continuously and set up alerts for anomalies.

| Metric | Target | Description |
| --- | --- | --- |
| Delivery Success Rate | >99% | % of webhooks successfully delivered (excludes 4xx client errors) |
| Average Response Time | <1s | Mean endpoint response time (SLO: <2s) |
| P95 Response Time | <3s | 95th percentile response time |
| P99 Response Time | <5s | 99th percentile response time |
| Retry Rate | <2% | % of webhooks requiring retry (baseline for healthy system) |
| Circuit Breaker Trips | <1/day | Endpoints entering circuit breaker state |
| Event Throughput | Monitor | Events per hour (baseline for anomaly detection) |
| Queue Depth | <100 | Pending webhook jobs (spikes indicate delivery issues) |
SLO vs Target: Targets are aspirational goals for optimal performance. Service Level Objectives (SLOs) are the minimum guaranteed performance:
  • Success Rate SLO: >99.5%
  • Response Time SLO: <2s average
  • Availability SLO: 99.9% uptime

Note: Thresholds should be tuned based on your system's throughput and business requirements.
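As a quick sanity check, the SLO thresholds above can be encoded in a small helper; the values mirror this guide, so tune them to your own objectives:

```python
def meets_slo(success_rate: float, avg_response_time_ms: float) -> bool:
    """Return True when metrics satisfy the SLOs above (>99.5% success, <2s average)."""
    return success_rate > 99.5 and avg_response_time_ms < 2000

# A healthy reading passes; a slow one does not.
print(meets_slo(99.8, 850))   # True
print(meets_slo(99.8, 2400))  # False
```

Run this against the values returned by the metrics API (see below) to gate deploys or drive a status page.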

Queue Depth Guidelines

  • Normal: 0-100 jobs (typical for healthy system)
  • Elevated: 100-500 jobs (monitor for growth trend)
  • Critical: >500 jobs (indicates delivery issues or endpoint outages)
  • Note: Thresholds scale with throughput; adjust for high-volume systems
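The queue depth bands above translate directly into a classifier you can use in alerting or dashboards (the cutoffs are this guide's defaults, not fixed values):

```python
def classify_queue_depth(depth: int) -> str:
    """Map a pending-job count to the severity bands above.

    Thresholds are the guide's defaults; scale them with your throughput.
    """
    if depth > 500:
        return "critical"
    if depth > 100:
        return "elevated"
    return "normal"

print(classify_queue_depth(50))   # normal
print(classify_queue_depth(250))  # elevated
print(classify_queue_depth(800))  # critical
```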

Hook Mesh Dashboard

The Hook Mesh dashboard provides real-time visibility into webhook delivery health and performance.

Dashboard Features

  • ✅ Overview metrics: success rate, average response time (last 24 hours)
  • ✅ Job timeline: delivery attempts over last 48 hours
  • ✅ Endpoint health status: healthy, degraded, circuit breaker open
  • ✅ Circuit breaker status: which endpoints are paused
  • ✅ Recent failures: latest failed deliveries with error details
  • ✅ Filtering: by endpoint, event type, status, date range
  • ✅ Search: find specific jobs by ID, payload content, or metadata

Access the dashboard at https://app.hookmesh.com/dashboard

Webhook Metrics API

Query webhook metrics programmatically to build custom dashboards or integrate with your monitoring stack.

API Endpoint

GET /v1/applications/{id}/metrics

Query Parameters

  • start_date (required) - ISO 8601 timestamp (e.g., 2024-01-01T00:00:00Z)
  • end_date (required) - ISO 8601 timestamp (e.g., 2024-01-02T00:00:00Z)
  • granularity (optional) - hour, day (default: hour)
  • endpoint_id (optional) - Filter by specific endpoint
Node.js - Fetch Metrics
import fetch from 'node-fetch';

const applicationId = 'app_abc123';
const startDate = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000); // 7 days ago
const endDate = new Date();

const params = new URLSearchParams({
  start_date: startDate.toISOString(),
  end_date: endDate.toISOString(),
  granularity: 'day'
});

const response = await fetch(
  `https://api.hookmesh.com/v1/applications/${applicationId}/metrics?${params}`,
  {
    headers: {
      'Authorization': `Bearer ${process.env.HOOKMESH_API_KEY}`
    }
  }
);

const metrics = await response.json();

console.log('Success rate:', metrics.success_rate);
console.log('Average response time:', metrics.avg_response_time_ms);
console.log('Total events:', metrics.total_events);

// Time-series data
metrics.timeseries.forEach(point => {
  console.log(`${point.timestamp}: ${point.success_rate}% success`);
});
Python - Fetch Metrics
import requests
import os
from datetime import datetime, timedelta

application_id = 'app_abc123'
start_date = (datetime.now() - timedelta(days=7)).isoformat()
end_date = datetime.now().isoformat()

response = requests.get(
    f'https://api.hookmesh.com/v1/applications/{application_id}/metrics',
    params={
        'start_date': start_date,
        'end_date': end_date,
        'granularity': 'day'
    },
    headers={'Authorization': f'Bearer {os.environ["HOOKMESH_API_KEY"]}'}
)

metrics = response.json()

print(f'Success rate: {metrics["success_rate"]}%')
print(f'Average response time: {metrics["avg_response_time_ms"]}ms')
print(f'Total events: {metrics["total_events"]}')

# Time-series data
for point in metrics['timeseries']:
    print(f'{point["timestamp"]}: {point["success_rate"]}% success')
Go - Fetch Metrics
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "os"
    "time"
)

type Metrics struct {
    SuccessRate      float64 `json:"success_rate"`
    AvgResponseTimeMs int     `json:"avg_response_time_ms"`
    TotalEvents      int     `json:"total_events"`
    Timeseries       []struct {
        Timestamp   string  `json:"timestamp"`
        SuccessRate float64 `json:"success_rate"`
    } `json:"timeseries"`
}

func main() {
    applicationId := "app_abc123"
    startDate := time.Now().AddDate(0, 0, -7).Format(time.RFC3339)
    endDate := time.Now().Format(time.RFC3339)

    params := url.Values{}
    params.Add("start_date", startDate)
    params.Add("end_date", endDate)
    params.Add("granularity", "day")

    // Avoid shadowing the imported net/url package
    endpoint := fmt.Sprintf("https://api.hookmesh.com/v1/applications/%s/metrics?%s",
        applicationId, params.Encode())

    req, _ := http.NewRequest("GET", endpoint, nil)
    req.Header.Set("Authorization", "Bearer "+os.Getenv("HOOKMESH_API_KEY"))

    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var metrics Metrics
    if err := json.NewDecoder(resp.Body).Decode(&metrics); err != nil {
        panic(err)
    }

    fmt.Printf("Success rate: %.2f%%\n", metrics.SuccessRate)
    fmt.Printf("Average response time: %dms\n", metrics.AvgResponseTimeMs)
    fmt.Printf("Total events: %d\n", metrics.TotalEvents)
}
PHP - Fetch Metrics
<?php

$applicationId = 'app_abc123';
$startDate = (new DateTime('-7 days'))->format('c');
$endDate = (new DateTime())->format('c');

$url = sprintf(
    'https://api.hookmesh.com/v1/applications/%s/metrics?%s',
    $applicationId,
    http_build_query([
        'start_date' => $startDate,
        'end_date' => $endDate,
        'granularity' => 'day'
    ])
);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'Authorization: Bearer ' . getenv('HOOKMESH_API_KEY')
]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
curl_close($ch);

$metrics = json_decode($response, true);

echo "Success rate: {$metrics['success_rate']}%\n";
echo "Average response time: {$metrics['avg_response_time_ms']}ms\n";
echo "Total events: {$metrics['total_events']}\n";

// Time-series data
foreach ($metrics['timeseries'] as $point) {
    echo "{$point['timestamp']}: {$point['success_rate']}% success\n";
}

Application Metrics (Consumer Side)

Track metrics on your webhook consumer endpoints to complement Hook Mesh metrics and gain full observability.

Metrics to Track

  • Processing time: Time to process webhook payload (after signature verification)
  • Error rates: % of webhooks that fail processing
  • Duplicate events: Count of duplicate webhook IDs received
  • Queue depth: Number of webhooks waiting to be processed
  • Signature verification failures: Count of invalid signatures
Express.js Middleware with Prometheus Metrics
import express from 'express';
import { Histogram, Counter, Gauge, register } from 'prom-client';

const app = express();

// Define metrics
const webhookDuration = new Histogram({
  name: 'webhook_processing_duration_seconds',
  help: 'Time to process webhook',
  labelNames: ['event_type', 'status']
});

const webhookErrors = new Counter({
  name: 'webhook_errors_total',
  help: 'Total webhook processing errors',
  labelNames: ['event_type', 'error_type']
});

const duplicateWebhooks = new Counter({
  name: 'webhook_duplicates_total',
  help: 'Total duplicate webhooks received',
  labelNames: ['event_type']
});

const webhookQueueDepth = new Gauge({
  name: 'webhook_queue_depth',
  help: 'Number of webhooks waiting to be processed'
});

// Middleware to track metrics
export function webhookMetricsMiddleware(req, res, next) {
  const startTime = Date.now();
  const eventType = req.body?.event_type || 'unknown';

  // Track response
  res.on('finish', () => {
    const duration = (Date.now() - startTime) / 1000;
    const status = res.statusCode >= 200 && res.statusCode < 300 ? 'success' : 'failure';

    webhookDuration.observe({ event_type: eventType, status }, duration);

    if (status === 'failure') {
      webhookErrors.inc({ event_type: eventType, error_type: 'processing_error' });
    }
  });

  next();
}

// Example webhook handler
app.post('/webhooks/hookmesh',
  express.json(),
  webhookMetricsMiddleware,
  async (req, res) => {
    try {
      // Verify signature
      verifyWebhook(req.body, req.headers, process.env.WEBHOOK_SECRET);

      // Check for duplicate
      const isDuplicate = await checkDuplicate(req.headers['webhook-id']);
      if (isDuplicate) {
        duplicateWebhooks.inc({ event_type: req.body.event_type });
        return res.status(200).json({ received: true, duplicate: true });
      }

      // Process webhook
      await processWebhook(req.body);

      res.status(200).json({ received: true });
    } catch (error) {
      console.error('Webhook error:', error);

      if (error.message.includes('signature')) {
        webhookErrors.inc({ event_type: req.body.event_type, error_type: 'signature_failure' });
        return res.status(400).json({ error: 'Invalid signature' });
      }

      return res.status(500).json({ error: 'Internal server error' });
    }
  }
);

// Expose metrics endpoint for Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Alerting Strategies

Set up alerts for critical issues that require immediate attention. Alert on trends and sustained issues, not transient failures.

| Alert Condition | Threshold | Severity |
| --- | --- | --- |
| Success rate drops below 99% | >5 minutes | Critical |
| Average response time > 2 seconds | >10 minutes | Warning |
| P95 response time > 3 seconds | >10 minutes | Warning |
| Circuit breaker opens | Immediate | Critical |
| Retry rate > 5% | >10 minutes | Warning |
| Queue depth growing | >500 jobs | Warning |
Prometheus Alert Rules
groups:
  - name: webhook_alerts
    interval: 30s
    rules:
      # Alert on low success rate
      - alert: WebhookSuccessRateLow
        expr: |
          (
            sum(rate(webhook_deliveries_total{status="success"}[5m]))
            /
            sum(rate(webhook_deliveries_total[5m]))
          ) < 0.99
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Webhook success rate below 99%"
          description: "Success rate is {{ $value | humanizePercentage }} (target: >99%, SLO: >99.5%)"

      # Alert on average response time
      - alert: WebhookAvgResponseTimeSlow
        expr: |
          avg(rate(webhook_processing_duration_seconds_sum[10m]))
          /
          avg(rate(webhook_processing_duration_seconds_count[10m]))
          > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Webhook average response time >2s"
          description: "Average response time is {{ $value }}s (target: <1s, SLO: <2s)"

      # Alert on P95 response time
      - alert: WebhookP95ResponseTimeSlow
        expr: |
          histogram_quantile(0.95,
            rate(webhook_processing_duration_seconds_bucket[10m])
          ) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Webhook P95 response time >3s"
          description: "P95 response time is {{ $value }}s (target: <3s)"

      # Alert on circuit breaker open
      - alert: WebhookCircuitBreakerOpen
        expr: webhook_circuit_breaker_state{state="open"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Webhook circuit breaker opened"
          description: "Circuit breaker open for endpoint {{ $labels.endpoint_url }}"

      # Alert on high retry rate
      - alert: WebhookRetryRateHigh
        expr: |
          (
            sum(rate(webhook_retries_total[10m]))
            /
            sum(rate(webhook_deliveries_total[10m]))
          ) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Webhook retry rate >5%"
          description: "Retry rate is {{ $value | humanizePercentage }} (target: <2%)"

      # Alert on growing queue
      - alert: WebhookQueueDepthHigh
        expr: webhook_queue_depth > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Webhook queue depth high"
          description: "Queue depth is {{ $value }} jobs (normal: <100, critical: >500)"

      # Alert on critical queue depth
      - alert: WebhookQueueDepthCritical
        expr: webhook_queue_depth > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Webhook queue depth critical"
          description: "Queue depth is {{ $value }} jobs - system may be backed up"

Alert fatigue: Set alert thresholds to avoid false positives. Transient failures are normal; only alert on sustained issues that require human intervention.

Health Checks

Hook Mesh provides endpoint health check APIs and automated monitoring to track endpoint availability.

Node.js - Check Endpoint Health
import fetch from 'node-fetch';

const endpointId = 'ep_abc123';

const response = await fetch(
  `https://api.hookmesh.com/v1/endpoints/${endpointId}/health`,
  {
    headers: {
      'Authorization': `Bearer ${process.env.HOOKMESH_API_KEY}`
    }
  }
);

const health = await response.json();

console.log('Status:', health.status); // healthy, degraded, unhealthy
console.log('Success rate:', health.success_rate);
console.log('Average response time:', health.avg_response_time_ms);
console.log('Last successful delivery:', health.last_success_at);
console.log('Last failure:', health.last_failure_at);
console.log('Consecutive failures:', health.consecutive_failures);

// Alert if unhealthy
if (health.status === 'unhealthy') {
  console.error(`⚠ Endpoint ${endpointId} is unhealthy!`);
  // Send notification to ops team
  await sendPagerDutyAlert({
    message: `Webhook endpoint unhealthy: ${health.last_failure_reason}`,
    severity: 'critical'
  });
}

Automated Health Monitoring

  • Frequency: Hook Mesh checks endpoint health every 1 minute
  • Healthy: Success rate >99%, average response time <1s
  • Degraded: Success rate 95-99% or response time 1-3s
  • Unhealthy: Success rate <95%, response time >3s, or circuit breaker open
  • Notifications: Webhooks sent on status changes (endpoint.health_changed)
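The health bands above can be approximated locally, which is useful when mirroring Hook Mesh's classification in your own dashboards. Note the boundary handling here is an assumption; Hook Mesh's exact cutoffs may differ:

```python
def classify_endpoint_health(success_rate: float,
                             avg_response_time_ms: float,
                             circuit_breaker_open: bool = False) -> str:
    """Approximate the health bands above: healthy, degraded, or unhealthy."""
    # Unhealthy: success rate <95%, response time >3s, or circuit breaker open
    if circuit_breaker_open or success_rate < 95 or avg_response_time_ms > 3000:
        return "unhealthy"
    # Healthy: success rate >99% and average response time <1s
    if success_rate > 99 and avg_response_time_ms < 1000:
        return "healthy"
    # Everything in between: degraded
    return "degraded"

print(classify_endpoint_health(99.8, 600))                            # healthy
print(classify_endpoint_health(97.0, 600))                            # degraded
print(classify_endpoint_health(99.8, 600, circuit_breaker_open=True)) # unhealthy
```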

Log Aggregation

Aggregate and structure logs from both Hook Mesh and your webhook consumer endpoints for debugging and analysis.

Recommended Log Fields

  • event_id / webhook_id: Unique identifier for correlation
  • event_type: Type of webhook event
  • application_id: Application or customer identifier
  • endpoint_url: Destination endpoint
  • duration_ms: Processing time in milliseconds
  • status: success, failure, retry
  • status_code: HTTP response code
  • error_message: Error details (if failed)
  • timestamp: ISO 8601 timestamp
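For a Python consumer, these fields can be emitted as one JSON object per line with the standard library's logging module. This is a minimal sketch; the field values below are illustrative:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON line with the recommended fields."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via `extra={"webhook_fields": {...}}`
        entry.update(getattr(record, "webhook_fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("webhook-consumer")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage: attach the recommended fields on every webhook log line
logger.info("Webhook processed", extra={"webhook_fields": {
    "event_id": "evt_123", "event_type": "order.created",
    "duration_ms": 42, "status": "success", "status_code": 200,
}})
```

Consistent field names across services make these logs easy to correlate by `event_id` in your aggregation tool.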
Winston Structured Logger Configuration
import winston from 'winston';

const webhookLogger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'webhook-consumer',
    environment: process.env.NODE_ENV
  },
  transports: [
    // Write to file
    new winston.transports.File({
      filename: 'logs/webhooks-error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: 'logs/webhooks.log'
    }),
    // Send to log aggregation service
    new winston.transports.Http({
      host: 'logs.example.com',
      port: 443,
      path: '/logs',
      ssl: true
    })
  ]
});

// Example usage in webhook handler
app.post('/webhooks/hookmesh', async (req, res) => {
  const startTime = Date.now();
  const webhookId = req.headers['webhook-id'];
  const eventType = req.body.event_type;

  try {
    webhookLogger.info('Webhook received', {
      event_id: webhookId,
      event_type: eventType,
      application_id: req.body.application_id,
      endpoint_url: req.url,
      timestamp: new Date().toISOString()
    });

    // Process webhook
    await processWebhook(req.body);

    const duration = Date.now() - startTime;
    webhookLogger.info('Webhook processed', {
      event_id: webhookId,
      event_type: eventType,
      duration_ms: duration,
      status: 'success',
      status_code: 200
    });

    res.status(200).json({ received: true });
  } catch (error) {
    const duration = Date.now() - startTime;
    webhookLogger.error('Webhook processing failed', {
      event_id: webhookId,
      event_type: eventType,
      duration_ms: duration,
      status: 'failure',
      error_message: error.message,
      error_stack: error.stack
    });

    res.status(500).json({ error: 'Internal server error' });
  }
});

Grafana Dashboard Setup

Create a comprehensive Grafana dashboard to visualize webhook metrics and health.

Grafana Dashboard JSON (Sample Configuration)
{
  "dashboard": {
    "title": "Webhook Monitoring",
    "panels": [
      {
        "id": 1,
        "title": "Success Rate (24h)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(webhook_deliveries_total{status=\"success\"}[5m])) / sum(rate(webhook_deliveries_total[5m]))",
            "legendFormat": "Success Rate"
          }
        ],
        "yaxis": {
          "format": "percentunit",
          "max": 1,
          "min": 0
        },
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [0.99],
                "type": "lt"
              }
            }
          ]
        }
      },
      {
        "id": 2,
        "title": "Response Time Distribution",
        "type": "heatmap",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(webhook_processing_duration_seconds_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(webhook_processing_duration_seconds_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "id": 3,
        "title": "Event Throughput",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(webhook_deliveries_total[5m])) * 3600",
            "legendFormat": "Events per Hour"
          }
        ]
      },
      {
        "id": 4,
        "title": "Circuit Breaker Status",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(webhook_circuit_breaker_state{state=\"open\"})",
            "legendFormat": "Open Circuits"
          }
        ],
        "thresholds": [
          {
            "value": 0,
            "color": "green"
          },
          {
            "value": 1,
            "color": "red"
          }
        ]
      },
      {
        "id": 5,
        "title": "Top Errors (Last Hour)",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, sum by (error_type, event_type) (rate(webhook_errors_total[1h])))",
            "format": "table"
          }
        ]
      },
      {
        "id": 6,
        "title": "Queue Depth",
        "type": "graph",
        "targets": [
          {
            "expr": "webhook_queue_depth",
            "legendFormat": "Pending Jobs"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [500],
                "type": "gt"
              }
            }
          ]
        }
      }
    ],
    "refresh": "30s",
    "time": {
      "from": "now-24h",
      "to": "now"
    }
  }
}

Performance Optimization

Use monitoring data to identify and fix performance bottlenecks in your webhook consumer endpoints.

Optimization Strategies

  • Identify slow endpoints: Use P95/P99 metrics to find outliers
  • Optimize database queries: Add indexes, use read replicas, cache frequently accessed data
  • Connection pooling: Reuse database connections instead of creating new ones per request
  • Implement caching: Cache lookup data (users, products, etc.) to reduce DB queries
  • Offload heavy processing: Queue long-running tasks instead of processing synchronously
  • Auto-scaling: Scale webhook consumer instances based on queue depth and throughput
Example: Queue Heavy Processing
import express from 'express';
import { Queue } from 'bullmq';

// Create job queue for heavy processing
const processingQueue = new Queue('webhook-processing', {
  connection: { host: 'localhost', port: 6379 }
});

app.post('/webhooks/hookmesh', async (req, res) => {
  try {
    // Verify signature (fast)
    verifyWebhook(req.body, req.headers, process.env.WEBHOOK_SECRET);

    // Store webhook for idempotency check (fast)
    await markWebhookReceived(req.headers['webhook-id']);

    // Queue heavy processing (instead of doing it synchronously)
    await processingQueue.add('process-webhook', {
      webhook_id: req.headers['webhook-id'],
      event_type: req.body.event_type,
      payload: req.body
    });

    // Return 200 immediately (within <1s)
    res.status(200).json({ received: true });

    // Heavy processing happens asynchronously in worker
  } catch (error) {
    console.error('Webhook error:', error);
    res.status(400).json({ error: 'Invalid webhook' });
  }
});

Monitoring Checklist

Complete this checklist to ensure comprehensive monitoring coverage:

| Category | Item | Status |
| --- | --- | --- |
| Dashboard | Dashboard showing key metrics (success rate, response time, throughput) | |
| Alerts | Alerts configured for critical issues (success rate, circuit breaker) | |
| Logs | Logs aggregated and searchable (ELK, Datadog, CloudWatch) | |
| Review | Regular review of metrics (weekly or bi-weekly) | |
| Runbooks | Runbooks documented for common issues (circuit breaker, timeouts) | |
| On-Call | On-call rotation defined for webhook incidents | |
| Health Checks | Health check endpoints configured and monitored | |
| Retention | Metrics retention policy defined (e.g., 90 days) | |

Best Practices

Do This

  • ✓ Monitor both provider (Hook Mesh) and consumer (your app) side metrics
  • ✓ Set realistic SLOs based on business requirements (e.g., 99.5% success rate)
  • ✓ Alert on trends and sustained issues, not transient failures
  • ✓ Include context in alerts (job_id, endpoint, error_message) for faster debugging
  • ✓ Review metrics regularly (weekly) to identify patterns and optimization opportunities
  • ✓ Document resolution steps in runbooks for common issues
  • ✓ Use structured logging with consistent field names for easy querying

Avoid This

  • ✗ Alerting on every single failure (creates alert fatigue)
  • ✗ Ignoring circuit breaker events (they indicate serious problems)
  • ✗ Setting unrealistic SLOs (100% uptime is impossible)
  • ✗ Logging sensitive data (secrets, passwords, credit cards, PII)
  • ✗ Only monitoring from one side (need both provider and consumer metrics)
  • ✗ No alerting after hours (webhooks fail 24/7, not just during business hours)

Related Documentation