Monitoring

Effective webhook monitoring is essential for reliability and customer trust. This guide covers key metrics, monitoring strategies, alerting best practices, and how to set up comprehensive observability for your webhook infrastructure.

Why monitoring matters: Webhooks are critical integration points between systems. Poor webhook reliability can lead to data loss, delayed processing, and customer dissatisfaction. Proactive monitoring helps you catch issues before they impact customers.

Key Metrics to Track

These metrics provide visibility into webhook health and performance. Track them continuously and set up alerts for anomalies.

| Metric | Target | Description |
| --- | --- | --- |
| Delivery Success Rate | >99% | % of webhooks successfully delivered (excludes 4xx client errors) |
| Average Response Time | <1s | Mean endpoint response time (SLO: <2s) |
| P95 Response Time | <3s | 95th percentile response time |
| P99 Response Time | <5s | 99th percentile response time |
| Retry Rate | <2% | % of webhooks requiring retry (baseline for healthy system) |
| Circuit Breaker Trips | <1/day | Endpoints entering circuit breaker state |
| Event Throughput | Monitor | Events per hour (baseline for anomaly detection) |
| Queue Depth | <100 | Pending webhook jobs (spikes indicate delivery issues) |
SLO vs Target: Targets are aspirational goals for optimal performance. Service Level Objectives (SLOs) are the minimum guaranteed performance:
  • Success Rate SLO: >99.5%
  • Response Time SLO: <2s average
  • Availability SLO: 99.9% uptime

Note: Thresholds should be tuned based on your system's throughput and business requirements.
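As a quick sanity check, the SLO thresholds above can be encoded in a small helper; the values mirror this guide, so tune them to your own objectives:

```python
def meets_slo(success_rate: float, avg_response_time_ms: float) -> bool:
    """Return True when metrics satisfy the SLOs above (>99.5% success, <2s average)."""
    return success_rate > 99.5 and avg_response_time_ms < 2000

# A healthy reading passes; a slow one does not.
print(meets_slo(99.8, 850))   # True
print(meets_slo(99.8, 2400))  # False
```

Run this against the values returned by the metrics API (see below) to gate deploys or drive a status page.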

Queue Depth Guidelines

  • Normal: 0-100 jobs (typical for healthy system)
  • Elevated: 100-500 jobs (monitor for growth trend)
  • Critical: >500 jobs (indicates delivery issues or endpoint outages)
  • Note: Thresholds scale with throughput; adjust for high-volume systems
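The queue depth bands above translate directly into a classifier you can use in alerting or dashboards (the cutoffs are this guide's defaults, not fixed values):

```python
def classify_queue_depth(depth: int) -> str:
    """Map a pending-job count to the severity bands above.

    Thresholds are the guide's defaults; scale them with your throughput.
    """
    if depth > 500:
        return "critical"
    if depth > 100:
        return "elevated"
    return "normal"

print(classify_queue_depth(50))   # normal
print(classify_queue_depth(250))  # elevated
print(classify_queue_depth(800))  # critical
```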

Hook Mesh Dashboard

The Hook Mesh dashboard provides real-time visibility into webhook delivery health and performance.

Dashboard Features

  • ✅ Overview metrics: success rate, average response time (last 24 hours)
  • ✅ Job timeline: delivery attempts over last 48 hours
  • ✅ Endpoint health status: healthy, degraded, circuit breaker open
  • ✅ Circuit breaker status: which endpoints are paused
  • ✅ Recent failures: latest failed deliveries with error details
  • ✅ Filtering: by endpoint, event type, status, date range
  • ✅ Search: find specific jobs by ID, payload content, or metadata

Access the dashboard at https://app.hookmesh.com/dashboard

Webhook Metrics API

Query webhook metrics programmatically to build custom dashboards or integrate with your monitoring stack.

API Endpoint

GET /v1/applications/{id}/metrics

Query Parameters

  • start_date (required) - ISO 8601 timestamp (e.g., 2024-01-01T00:00:00Z)
  • end_date (required) - ISO 8601 timestamp (e.g., 2024-01-02T00:00:00Z)
  • granularity (optional) - hour, day (default: hour)
  • endpoint_id (optional) - Filter by specific endpoint
Node.js - Fetch Metrics
import fetch from 'node-fetch';

const applicationId = 'app_abc123';
const startDate = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000); // 7 days ago
const endDate = new Date();

const params = new URLSearchParams({
  start_date: startDate.toISOString(),
  end_date: endDate.toISOString(),
  granularity: 'day'
});

const response = await fetch(
  `https://api.hookmesh.com/v1/applications/${applicationId}/metrics?${params}`,
  {
    headers: {
      'Authorization': `Bearer ${process.env.HOOKMESH_API_KEY}`
    }
  }
);

const metrics = await response.json();

console.log('Success rate:', metrics.success_rate);
console.log('Average response time:', metrics.avg_response_time_ms);
console.log('Total events:', metrics.total_events);

// Time-series data
metrics.timeseries.forEach(point => {
  console.log(`${point.timestamp}: ${point.success_rate}% success`);
});
Python - Fetch Metrics
import requests
import os
from datetime import datetime, timedelta

application_id = 'app_abc123'
start_date = (datetime.now() - timedelta(days=7)).isoformat()
end_date = datetime.now().isoformat()

response = requests.get(
    f'https://api.hookmesh.com/v1/applications/{application_id}/metrics',
    params={
        'start_date': start_date,
        'end_date': end_date,
        'granularity': 'day'
    },
    headers={'Authorization': f'Bearer {os.environ["HOOKMESH_API_KEY"]}'}
)

metrics = response.json()

print(f'Success rate: {metrics["success_rate"]}%')
print(f'Average response time: {metrics["avg_response_time_ms"]}ms')
print(f'Total events: {metrics["total_events"]}')

# Time-series data
for point in metrics['timeseries']:
    print(f'{point["timestamp"]}: {point["success_rate"]}% success')
Go - Fetch Metrics
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "os"
    "time"
)

type Metrics struct {
    SuccessRate      float64 `json:"success_rate"`
    AvgResponseTimeMs int     `json:"avg_response_time_ms"`
    TotalEvents      int     `json:"total_events"`
    Timeseries       []struct {
        Timestamp   string  `json:"timestamp"`
        SuccessRate float64 `json:"success_rate"`
    } `json:"timeseries"`
}

func main() {
    applicationId := "app_abc123"
    startDate := time.Now().AddDate(0, 0, -7).Format(time.RFC3339)
    endDate := time.Now().Format(time.RFC3339)

    params := url.Values{}
    params.Add("start_date", startDate)
    params.Add("end_date", endDate)
    params.Add("granularity", "day")

    // Avoid shadowing the imported net/url package
    endpoint := fmt.Sprintf("https://api.hookmesh.com/v1/applications/%s/metrics?%s",
        applicationId, params.Encode())

    req, _ := http.NewRequest("GET", endpoint, nil)
    req.Header.Set("Authorization", "Bearer "+os.Getenv("HOOKMESH_API_KEY"))

    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var metrics Metrics
    if err := json.NewDecoder(resp.Body).Decode(&metrics); err != nil {
        panic(err)
    }

    fmt.Printf("Success rate: %.2f%%\n", metrics.SuccessRate)
    fmt.Printf("Average response time: %dms\n", metrics.AvgResponseTimeMs)
    fmt.Printf("Total events: %d\n", metrics.TotalEvents)
}
PHP - Fetch Metrics
<?php

$applicationId = 'app_abc123';
$startDate = (new DateTime('-7 days'))->format('c');
$endDate = (new DateTime())->format('c');

$url = sprintf(
    'https://api.hookmesh.com/v1/applications/%s/metrics?%s',
    $applicationId,
    http_build_query([
        'start_date' => $startDate,
        'end_date' => $endDate,
        'granularity' => 'day'
    ])
);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'Authorization: Bearer ' . getenv('HOOKMESH_API_KEY')
]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
curl_close($ch);

$metrics = json_decode($response, true);

echo "Success rate: {$metrics['success_rate']}%\n";
echo "Average response time: {$metrics['avg_response_time_ms']}ms\n";
echo "Total events: {$metrics['total_events']}\n";

// Time-series data
foreach ($metrics['timeseries'] as $point) {
    echo "{$point['timestamp']}: {$point['success_rate']}% success\n";
}

Application Metrics (Consumer Side)

Track metrics on your webhook consumer endpoints to complement Hook Mesh metrics and gain full observability.

Metrics to Track

  • Processing time: Time to process webhook payload (after signature verification)
  • Error rates: % of webhooks that fail processing
  • Duplicate events: Count of duplicate webhook IDs received
  • Queue depth: Number of webhooks waiting to be processed
  • Signature verification failures: Count of invalid signatures
Express.js Middleware with Prometheus Metrics
import express from 'express';
import { Histogram, Counter, Gauge, register } from 'prom-client';

const app = express();

// Define metrics
const webhookDuration = new Histogram({
  name: 'webhook_processing_duration_seconds',
  help: 'Time to process webhook',
  labelNames: ['event_type', 'status']
});

const webhookErrors = new Counter({
  name: 'webhook_errors_total',
  help: 'Total webhook processing errors',
  labelNames: ['event_type', 'error_type']
});

const duplicateWebhooks = new Counter({
  name: 'webhook_duplicates_total',
  help: 'Total duplicate webhooks received',
  labelNames: ['event_type']
});

const webhookQueueDepth = new Gauge({
  name: 'webhook_queue_depth',
  help: 'Number of webhooks waiting to be processed'
});

// Middleware to track metrics
export function webhookMetricsMiddleware(req, res, next) {
  const startTime = Date.now();
  const eventType = req.body?.event_type || 'unknown';

  // Track response
  res.on('finish', () => {
    const duration = (Date.now() - startTime) / 1000;
    const status = res.statusCode >= 200 && res.statusCode < 300 ? 'success' : 'failure';

    webhookDuration.observe({ event_type: eventType, status }, duration);

    if (status === 'failure') {
      webhookErrors.inc({ event_type: eventType, error_type: 'processing_error' });
    }
  });

  next();
}

// Example webhook handler
app.post('/webhooks/hookmesh',
  express.json(),
  webhookMetricsMiddleware,
  async (req, res) => {
    try {
      // Verify signature
      verifyWebhook(req.body, req.headers, process.env.WEBHOOK_SECRET);

      // Check for duplicate
      const isDuplicate = await checkDuplicate(req.headers['webhook-id']);
      if (isDuplicate) {
        duplicateWebhooks.inc({ event_type: req.body.event_type });
        return res.status(200).json({ received: true, duplicate: true });
      }

      // Process webhook
      await processWebhook(req.body);

      res.status(200).json({ received: true });
    } catch (error) {
      console.error('Webhook error:', error);

      if (error.message.includes('signature')) {
        webhookErrors.inc({ event_type: req.body.event_type, error_type: 'signature_failure' });
        return res.status(400).json({ error: 'Invalid signature' });
      }

      return res.status(500).json({ error: 'Internal server error' });
    }
  }
);

// Expose metrics endpoint for Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

Alerting Strategies

Set up alerts for critical issues that require immediate attention. Alert on trends and sustained issues, not transient failures.

| Alert Condition | Threshold | Severity |
| --- | --- | --- |
| Success rate drops below 99% | >5 minutes | Critical |
| Average response time > 2 seconds | >10 minutes | Warning |
| P95 response time > 3 seconds | >10 minutes | Warning |
| Circuit breaker opens | Immediate | Critical |
| Retry rate > 5% | >10 minutes | Warning |
| Queue depth growing | >500 jobs | Warning |
Prometheus Alert Rules
groups:
  - name: webhook_alerts
    interval: 30s
    rules:
      # Alert on low success rate
      - alert: WebhookSuccessRateLow
        expr: |
          (
            sum(rate(webhook_deliveries_total{status="success"}[5m]))
            /
            sum(rate(webhook_deliveries_total[5m]))
          ) < 0.99
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Webhook success rate below 99%"
          description: "Success rate is {{ $value | humanizePercentage }} (target: >99%, SLO: >99.5%)"

      # Alert on average response time
      - alert: WebhookAvgResponseTimeSlow
        expr: |
          avg(rate(webhook_processing_duration_seconds_sum[10m]))
          /
          avg(rate(webhook_processing_duration_seconds_count[10m]))
          > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Webhook average response time >2s"
          description: "Average response time is {{ $value }}s (target: <1s, SLO: <2s)"

      # Alert on P95 response time
      - alert: WebhookP95ResponseTimeSlow
        expr: |
          histogram_quantile(0.95,
            rate(webhook_processing_duration_seconds_bucket[10m])
          ) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Webhook P95 response time >3s"
          description: "P95 response time is {{ $value }}s (target: <3s)"

      # Alert on circuit breaker open
      - alert: WebhookCircuitBreakerOpen
        expr: webhook_circuit_breaker_state{state="open"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Webhook circuit breaker opened"
          description: "Circuit breaker open for endpoint {{ $labels.endpoint_url }}"

      # Alert on high retry rate
      - alert: WebhookRetryRateHigh
        expr: |
          (
            sum(rate(webhook_retries_total[10m]))
            /
            sum(rate(webhook_deliveries_total[10m]))
          ) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Webhook retry rate >5%"
          description: "Retry rate is {{ $value | humanizePercentage }} (target: <2%)"

      # Alert on growing queue
      - alert: WebhookQueueDepthHigh
        expr: webhook_queue_depth > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Webhook queue depth high"
          description: "Queue depth is {{ $value }} jobs (normal: <100, critical: >500)"

      # Alert on critical queue depth
      - alert: WebhookQueueDepthCritical
        expr: webhook_queue_depth > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Webhook queue depth critical"
          description: "Queue depth is {{ $value }} jobs - system may be backed up"

Alert fatigue: Set alert thresholds to avoid false positives. Transient failures are normal; only alert on sustained issues that require human intervention.

Health Checks

Hook Mesh provides endpoint health check APIs and automated monitoring to track endpoint availability.

Node.js - Check Endpoint Health
import fetch from 'node-fetch';

const endpointId = 'ep_abc123';

const response = await fetch(
  `https://api.hookmesh.com/v1/endpoints/${endpointId}/health`,
  {
    headers: {
      'Authorization': `Bearer ${process.env.HOOKMESH_API_KEY}`
    }
  }
);

const health = await response.json();

console.log('Status:', health.status); // healthy, degraded, unhealthy
console.log('Success rate:', health.success_rate);
console.log('Average response time:', health.avg_response_time_ms);
console.log('Last successful delivery:', health.last_success_at);
console.log('Last failure:', health.last_failure_at);
console.log('Consecutive failures:', health.consecutive_failures);

// Alert if unhealthy
if (health.status === 'unhealthy') {
  console.error(`⚠ Endpoint ${endpointId} is unhealthy!`);
  // Send notification to ops team
  await sendPagerDutyAlert({
    message: `Webhook endpoint unhealthy: ${health.last_failure_reason}`,
    severity: 'critical'
  });
}

Automated Health Monitoring

  • Frequency: Hook Mesh checks endpoint health every 1 minute
  • Healthy: Success rate >99%, average response time <1s
  • Degraded: Success rate 95-99% or response time 1-3s
  • Unhealthy: Success rate <95%, response time >3s, or circuit breaker open
  • Notifications: Webhooks sent on status changes (endpoint.health_changed)
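The health bands above can be approximated locally, which is useful when mirroring Hook Mesh's classification in your own dashboards. Note the boundary handling here is an assumption; Hook Mesh's exact cutoffs may differ:

```python
def classify_endpoint_health(success_rate: float,
                             avg_response_time_ms: float,
                             circuit_breaker_open: bool = False) -> str:
    """Approximate the health bands above: healthy, degraded, or unhealthy."""
    # Unhealthy: success rate <95%, response time >3s, or circuit breaker open
    if circuit_breaker_open or success_rate < 95 or avg_response_time_ms > 3000:
        return "unhealthy"
    # Healthy: success rate >99% and average response time <1s
    if success_rate > 99 and avg_response_time_ms < 1000:
        return "healthy"
    # Everything in between: degraded
    return "degraded"

print(classify_endpoint_health(99.8, 600))                            # healthy
print(classify_endpoint_health(97.0, 600))                            # degraded
print(classify_endpoint_health(99.8, 600, circuit_breaker_open=True)) # unhealthy
```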

Log Aggregation

Aggregate and structure logs from both Hook Mesh and your webhook consumer endpoints for debugging and analysis.

Recommended Log Fields

  • event_id / webhook_id: Unique identifier for correlation
  • event_type: Type of webhook event
  • application_id: Application or customer identifier
  • endpoint_url: Destination endpoint
  • duration_ms: Processing time in milliseconds
  • status: success, failure, retry
  • status_code: HTTP response code
  • error_message: Error details (if failed)
  • timestamp: ISO 8601 timestamp
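For a Python consumer, these fields can be emitted as one JSON object per line with the standard library's logging module. This is a minimal sketch; the field values below are illustrative:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON line with the recommended fields."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via `extra={"webhook_fields": {...}}`
        entry.update(getattr(record, "webhook_fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("webhook-consumer")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage: attach the recommended fields on every webhook log line
logger.info("Webhook processed", extra={"webhook_fields": {
    "event_id": "evt_123", "event_type": "order.created",
    "duration_ms": 42, "status": "success", "status_code": 200,
}})
```

Consistent field names across services make these logs easy to correlate by `event_id` in your aggregation tool.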
Winston Structured Logger Configuration
import winston from 'winston';

const webhookLogger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'webhook-consumer',
    environment: process.env.NODE_ENV
  },
  transports: [
    // Write to file
    new winston.transports.File({
      filename: 'logs/webhooks-error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: 'logs/webhooks.log'
    }),
    // Send to log aggregation service
    new winston.transports.Http({
      host: 'logs.example.com',
      port: 443,
      path: '/logs',
      ssl: true
    })
  ]
});

// Example usage in webhook handler
app.post('/webhooks/hookmesh', async (req, res) => {
  const startTime = Date.now();
  const webhookId = req.headers['webhook-id'];
  const eventType = req.body.event_type;

  try {
    webhookLogger.info('Webhook received', {
      event_id: webhookId,
      event_type: eventType,
      application_id: req.body.application_id,
      endpoint_url: req.url,
      timestamp: new Date().toISOString()
    });

    // Process webhook
    await processWebhook(req.body);

    const duration = Date.now() - startTime;
    webhookLogger.info('Webhook processed', {
      event_id: webhookId,
      event_type: eventType,
      duration_ms: duration,
      status: 'success',
      status_code: 200
    });

    res.status(200).json({ received: true });
  } catch (error) {
    const duration = Date.now() - startTime;
    webhookLogger.error('Webhook processing failed', {
      event_id: webhookId,
      event_type: eventType,
      duration_ms: duration,
      status: 'failure',
      error_message: error.message,
      error_stack: error.stack
    });

    res.status(500).json({ error: 'Internal server error' });
  }
});

Grafana Dashboard Setup

Create a comprehensive Grafana dashboard to visualize webhook metrics and health.

Grafana Dashboard JSON (Sample Configuration)
{
  "dashboard": {
    "title": "Webhook Monitoring",
    "panels": [
      {
        "id": 1,
        "title": "Success Rate (24h)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(webhook_deliveries_total{status=\"success\"}[5m])) / sum(rate(webhook_deliveries_total[5m]))",
            "legendFormat": "Success Rate"
          }
        ],
        "yaxis": {
          "format": "percentunit",
          "max": 1,
          "min": 0
        },
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [0.99],
                "type": "lt"
              }
            }
          ]
        }
      },
      {
        "id": 2,
        "title": "Response Time Distribution",
        "type": "heatmap",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(webhook_processing_duration_seconds_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(webhook_processing_duration_seconds_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "id": 3,
        "title": "Event Throughput",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(webhook_deliveries_total[5m])) * 3600",
            "legendFormat": "Events per Hour"
          }
        ]
      },
      {
        "id": 4,
        "title": "Circuit Breaker Status",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(webhook_circuit_breaker_state{state=\"open\"})",
            "legendFormat": "Open Circuits"
          }
        ],
        "thresholds": [
          {
            "value": 0,
            "color": "green"
          },
          {
            "value": 1,
            "color": "red"
          }
        ]
      },
      {
        "id": 5,
        "title": "Top Errors (Last Hour)",
        "type": "table",
        "targets": [
          {
            "expr": "topk(10, sum by (error_type, event_type) (rate(webhook_errors_total[1h])))",
            "format": "table"
          }
        ]
      },
      {
        "id": 6,
        "title": "Queue Depth",
        "type": "graph",
        "targets": [
          {
            "expr": "webhook_queue_depth",
            "legendFormat": "Pending Jobs"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [500],
                "type": "gt"
              }
            }
          ]
        }
      }
    ],
    "refresh": "30s",
    "time": {
      "from": "now-24h",
      "to": "now"
    }
  }
}

Performance Optimization

Use monitoring data to identify and fix performance bottlenecks in your webhook consumer endpoints.

Optimization Strategies

  • Identify slow endpoints: Use P95/P99 metrics to find outliers
  • Optimize database queries: Add indexes, use read replicas, cache frequently accessed data
  • Connection pooling: Reuse database connections instead of creating new ones per request
  • Implement caching: Cache lookup data (users, products, etc.) to reduce DB queries
  • Offload heavy processing: Queue long-running tasks instead of processing synchronously
  • Auto-scaling: Scale webhook consumer instances based on queue depth and throughput
Example: Queue Heavy Processing
import express from 'express';
import { Queue } from 'bullmq';

// Create job queue for heavy processing
const processingQueue = new Queue('webhook-processing', {
  connection: { host: 'localhost', port: 6379 }
});

app.post('/webhooks/hookmesh', async (req, res) => {
  try {
    // Verify signature (fast)
    verifyWebhook(req.body, req.headers, process.env.WEBHOOK_SECRET);

    // Store webhook for idempotency check (fast)
    await markWebhookReceived(req.headers['webhook-id']);

    // Queue heavy processing (instead of doing it synchronously)
    await processingQueue.add('process-webhook', {
      webhook_id: req.headers['webhook-id'],
      event_type: req.body.event_type,
      payload: req.body
    });

    // Return 200 immediately (within <1s)
    res.status(200).json({ received: true });

    // Heavy processing happens asynchronously in worker
  } catch (error) {
    console.error('Webhook error:', error);
    res.status(400).json({ error: 'Invalid webhook' });
  }
});

Monitoring Checklist

Complete this checklist to ensure comprehensive monitoring coverage:

| Category | Item | Status |
| --- | --- | --- |
| Dashboard | Dashboard showing key metrics (success rate, response time, throughput) | |
| Alerts | Alerts configured for critical issues (success rate, circuit breaker) | |
| Logs | Logs aggregated and searchable (ELK, Datadog, CloudWatch) | |
| Review | Regular review of metrics (weekly or bi-weekly) | |
| Runbooks | Runbooks documented for common issues (circuit breaker, timeouts) | |
| On-Call | On-call rotation defined for webhook incidents | |
| Health Checks | Health check endpoints configured and monitored | |
| Retention | Metrics retention policy defined (e.g., 90 days) | |

Best Practices

Do This

  • ✓ Monitor both provider (Hook Mesh) and consumer (your app) side metrics
  • ✓ Set realistic SLOs based on business requirements (e.g., 99.5% success rate)
  • ✓ Alert on trends and sustained issues, not transient failures
  • ✓ Include context in alerts (job_id, endpoint, error_message) for faster debugging
  • ✓ Review metrics regularly (weekly) to identify patterns and optimization opportunities
  • ✓ Document resolution steps in runbooks for common issues
  • ✓ Use structured logging with consistent field names for easy querying

Avoid This

  • ✗ Alerting on every single failure (creates alert fatigue)
  • ✗ Ignoring circuit breaker events (they indicate serious problems)
  • ✗ Setting unrealistic SLOs (100% uptime is impossible)
  • ✗ Logging sensitive data (secrets, passwords, credit cards, PII)
  • ✗ Only monitoring from one side (need both provider and consumer metrics)
  • ✗ No alerting after hours (webhooks fail 24/7, not just during business hours)

Related Documentation