Webhook Monitoring: How to Know When Your Integrations Break
Webhook integrations fail silently. A third-party service stops sending. Your handler throws an exception. A deployment breaks your endpoint URL. None of these send you an email. You find out when a customer complains.
Testing and deploying a webhook integration is only half the job. The other half is building enough visibility that you know, within minutes, when something stops working in production. This post covers the practical strategies for doing that: logging, alerting, heartbeat checks, and the metrics worth tracking.
The three failure modes
Most webhook failures fit into one of three categories. Knowing which one you are dealing with determines where you look.
The sender stops sending. The provider has a service disruption, you misconfigured the webhook registration, or the account that owns the integration was deactivated. From your side, no requests arrive. Your handler never runs. Your logs are empty. This failure is invisible unless you are actively watching for expected traffic.
Delivery fails. The provider tries to send the request, but your endpoint returns an error or times out. The provider logs a failed delivery attempt and may retry with exponential backoff. Eventually, retries are exhausted and the event is dropped. Your application never processes it.
The handler fails silently. The webhook arrives and your server returns 200. The provider considers delivery successful. But inside your handler, a bug causes the wrong behavior: the wrong record gets updated, an email is skipped, a payment status is set incorrectly. There is no error in the delivery log because the HTTP handshake succeeded.
Each failure mode requires a different detection strategy. Sender failures need heartbeat monitoring. Delivery failures show up in provider logs and your server's error rate. Silent handler failures need application-level logging and business metric monitoring.
Logging every incoming webhook
The foundation of webhook observability is a structured log entry for every request that arrives. Write an arrival entry before your handler processes anything, so you capture arrivals even when the handler crashes, then record the response code and processing time once the handler returns.
The fields worth logging at minimum:
- Timestamp (with millisecond precision)
- Source IP
- Event type, parsed from the provider's header (X-GitHub-Event, Stripe's type field in the body, etc.)
- Event ID, for deduplication and tracing
- HTTP response code you returned
- Processing time in milliseconds
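# Django example: middleware that logs on arrival, before the handler runs,
# then logs the outcome once the handler has returned
import logging
import time

logger = logging.getLogger(__name__)


class WebhookLoggingMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # Only webhook endpoints are logged; everything else passes through
        if not request.path.startswith('/webhooks/'):
            return self.get_response(request)

        fields = {
            'path': request.path,
            'source_ip': request.META.get('REMOTE_ADDR'),
            'event_type': request.headers.get('X-GitHub-Event')
                or request.headers.get('X-Shopify-Topic', 'unknown'),
        }
        # Arrival entry: written first, so it survives a handler crash
        logger.info('webhook_received', extra=fields)

        start = time.monotonic()
        response = self.get_response(request)
        duration_ms = int((time.monotonic() - start) * 1000)
        # Outcome entry: adds the status code and processing time
        logger.info('webhook_processed', extra={
            **fields,
            'status_code': response.status_code,
            'duration_ms': duration_ms,
        })
        return response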
Use structured logging (JSON output) rather than plain text. Structured logs are queryable in Datadog, CloudWatch, Loki, and most log aggregators. Plain text logs are not.
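As a rough sketch of what JSON output can look like with only the standard library, the formatter below turns each log record into one JSON object per line and copies across any fields passed through `extra=`. The class name `JsonFormatter` and the output field names are illustrative choices, not part of any particular logging product; in practice a maintained library or your aggregator's own agent may be a better fit.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord carries by default; anything else came from `extra=`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            # ISO 8601 in UTC with millisecond precision
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy fields passed via `extra=` (event_type, status_code, ...)
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # default=str keeps non-serializable values from raising
        return json.dumps(payload, default=str)


# Usage: attach the formatter to whichever handler ships your logs
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
```

With this attached, a call such as `logger.info('webhook_received', extra={'event_type': 'push'})` produces a line whose `event_type` key can be filtered on directly in the aggregator.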
Alerting on delivery failures
For delivery failures, the fastest source of truth is the provider's own dashboard. Stripe, GitHub, and Shopify all maintain a webhook delivery log with response codes and failure counts.
- Stripe: Developers > Webhooks > select your endpoint > the event log shows each attempt, the response code, and whether retries are pending.
- GitHub: Repository Settings > Webhooks > Recent Deliveries tab.
- Shopify: Settings > Notifications > scroll to your webhook.
- Linear: Settings > API > Webhooks > click the webhook to see delivery history.
Most providers also send notification emails when a webhook endpoint consistently fails. Make sure these emails go to a monitored address, not a shared inbox that no one reads.
For webhooks your own system sends outbound, monitor the dead letter queue (DLQ) depth. If events pile up in the DLQ, delivery is failing. Alert when DLQ depth stays above a threshold across several checks, not on a single momentary spike.
// BullMQ example: alert on DLQ depth
// Jobs that exhaust their attempts land in the queue's failed set, used here as the DLQ
const dlqSize = await webhookQueue.getFailedCount();
if (dlqSize > 50) {
    alerting.trigger('webhook_dlq_depth_high', { count: dlqSize });
}
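One way to alert on sustained depth rather than a one-off spike is to require several consecutive samples above the limit before firing. The sketch below is in Python and independent of any queue library; `SustainedThreshold`, its parameters, and the sample values are all hypothetical, and the right window size depends on how often you poll.

```python
from collections import deque


class SustainedThreshold:
    """Fire only when every one of the last `window` samples exceeds `limit`."""

    def __init__(self, limit, window):
        self.limit = limit
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def observe(self, value):
        """Record a sample; return True if the breach has been sustained."""
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        return full and all(s > self.limit for s in self.samples)


# Usage: poll the DLQ depth on a schedule and feed each reading in
check = SustainedThreshold(limit=50, window=3)
for depth in [10, 80, 20, 60, 70, 90]:
    if check.observe(depth):
        print(f"DLQ depth sustained above limit, latest reading {depth}")
```

In this run the single reading of 80 does not fire, because the readings around it are below the limit; only the final three readings (60, 70, 90) are all above 50.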
Heartbeat monitoring
Some integrations go quiet legitimately; others silently break. The sender-stops-sending failure is the hardest to detect because your logs show nothing. There is no error to alert on, because nothing arrives.
The solution is a heartbeat check: expect at least one webhook of a specific type within a given time window. Alert if none arrives.
On accounts with steady payment volume, Stripe's balance.available event tends to arrive at least once a day. If you process that event to update a balance record, you can also check: has the record been updated in the last 25 hours? If not, either the event stopped coming or your handler broke.
# Celery beat task: runs every hour, alerts if balance not updated recently
# BalanceSyncLog (model) and alerting (client) are defined elsewhere in the app
from datetime import timedelta

from celery import shared_task
from django.utils import timezone


@shared_task
def check_stripe_heartbeat():
    threshold = timezone.now() - timedelta(hours=25)
    last_update = BalanceSyncLog.objects.order_by('-updated_at').first()
    if last_update is None or last_update.updated_at < threshold:
        alerting.trigger('stripe_balance_heartbeat_missing', {
            'last_seen': last_update.updated_at.isoformat() if last_update else None,
        })
The time window depends on how frequently the event legitimately fires. For a daily event, a 25-26 hour window catches missed deliveries without generating false positives. For hourly events, a 90-minute window is appropriate.
Heartbeat monitoring works for any predictable periodic event. Set it up for the events your application actually depends on, not all of them.
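The window logic generalizes to any periodic event: take the expected interval, add a grace period, and compare against the last arrival. The helper below is a minimal sketch with no framework dependencies; `heartbeat_missing` and its parameter names are hypothetical, and the grace values simply mirror the 25-hour and 90-minute windows mentioned above.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missing(last_seen, expected_interval, grace, now=None):
    """Return True if no event arrived within expected_interval + grace."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Never having seen the event counts as missing
    if last_seen is None:
        return True
    return now - last_seen > expected_interval + grace


# Usage: a daily event with one hour of grace gives a 25-hour window
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_seen = now - timedelta(hours=24, minutes=30)
print(heartbeat_missing(last_seen, timedelta(days=1), timedelta(hours=1), now=now))
```

Passing `now` explicitly keeps the check easy to test; in a scheduled job you would normally omit it and let the function read the current time.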
Using Payloader as a lightweight audit log
During staging and QA, pointing your webhooks through Payloader gives you a complete, browsable record of every request: timestamp, headers, and full body. When a test fails and you need to know what the provider actually sent, you do not need to reproduce the event. You look it up.
In production, the same request history is useful as a lightweight audit log. Payloader captures every incoming webhook with its timestamp, headers, and body, and stores it for the retention period of your plan. If a customer reports a missed order notification or a billing discrepancy, you can check whether the relevant event arrived and what it contained, without involving the provider's support team.
This is not a replacement for application-level logging, but it is a useful second source of truth that operates independently of your application. If your app crashes and loses its in-flight log entries, Payloader's record is unaffected.
What metrics to track in production
Once you have structured logging in place, these are the metrics worth instrumenting and tracking over time:
- Webhooks received per hour, broken down by event type. Sudden drops are often the first signal that a sender-side problem has developed.
- Error rate on your handler, meaning the percentage of incoming requests that result in a non-2xx response or an unhandled exception.
- p95 processing time. Slow handlers risk timing out and triggering retries. Track the 95th percentile, not just the average.
- Retry count from provider. If Stripe or GitHub is actively retrying deliveries, something is wrong. Track this as a metric, not just as a log entry.
- DLQ depth for any outbound webhook queue your system operates.
These five metrics, tracked over time, give you a clear picture of whether your webhook integration is healthy. Set alert thresholds on each one.
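To make the first three metrics concrete, here is a small sketch that derives them from a batch of structured log entries. It assumes each entry is a dict with `event_type`, `status_code`, and `duration_ms` keys, matching the fields logged earlier; the function name `summarize` is hypothetical, and a metrics backend would normally compute these for you.

```python
from collections import Counter


def summarize(entries):
    """Compute volume by event type, error rate, and p95 latency."""
    if not entries:
        return {"by_type": {}, "error_rate": 0.0, "p95_ms": None}

    by_type = Counter(e["event_type"] for e in entries)
    # Anything outside 2xx counts as an error
    errors = sum(1 for e in entries if not 200 <= e["status_code"] < 300)
    durations = sorted(e["duration_ms"] for e in entries)

    # Nearest-rank percentile: smallest value with at least 95% of
    # samples at or below it. Integer math avoids float rounding.
    rank = (95 * len(durations) + 99) // 100
    return {
        "by_type": dict(by_type),
        "error_rate": errors / len(entries),
        "p95_ms": durations[rank - 1],
    }


# Usage: feed in one hour of entries pulled from your log store
sample = [
    {"event_type": "push", "status_code": 200, "duration_ms": 40},
    {"event_type": "push", "status_code": 500, "duration_ms": 900},
    {"event_type": "orders/create", "status_code": 200, "duration_ms": 65},
]
print(summarize(sample))
```

Running the same summary per hour and comparing `by_type` counts against the previous hours is one simple way to spot the sudden drops described above.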
Setting up basic alerting
For teams that already have observability infrastructure, route your webhook metrics into whatever you already use: Datadog, Grafana, New Relic. Create a dashboard for the five metrics above and set alert rules on the ones that matter most for your business.
For teams without dedicated observability tooling, start with two things:
Provider notification emails. Most major providers can email you when webhook deliveries fail consistently. This is the simplest possible alerting. It is delayed and blunt, but it catches the most common failure mode (delivery errors) with zero infrastructure.
A Slack or PagerDuty alert on your error rate. If you ship structured logs, most log aggregators (Datadog, Logtail, Papertrail) can trigger a webhook or email when a specific log pattern appears at high frequency. Wire that to a Slack channel for non-critical issues and to PagerDuty for critical payment or fulfillment flows.
# Example: Datadog monitor query for high webhook error rate
# Alert when error rate exceeds 5% over a 10-minute window
sum(last_10m):
sum:webhook.errors{env:production}.as_count()
/
sum:webhook.requests{env:production}.as_count()
> 0.05
Start with provider email notifications and one Slack alert on error rate. That covers the two most common failure modes. Add heartbeat checks for the integrations your business depends on most. Expand from there as you learn which signals actually fire in practice.
Closing
Good webhook monitoring is mostly about structured logging and a handful of targeted alerts. Log every arrival. Watch your provider dashboards for delivery failures. Add heartbeat checks for integrations that go quiet when they break. Track the five metrics above and set thresholds before something breaks in production.
The goal is to find out a webhook integration broke before a customer does. That bar is lower than it sounds: most teams find out from customers. A few hours of logging and alerting setup changes that.