Webhook Retry Strategies: Handling Delivery Failures Reliably
Webhooks do not guarantee delivery. Networks fail, servers restart, deployments cause brief windows of unavailability, and TLS handshakes time out. A well-designed provider will retry on your behalf, but only if you understand the rules they follow and design your system around them.
This post covers the retry side of webhook reliability: how providers schedule retries, what triggers them, how to build a receiver that survives them, and how to implement retries correctly if you are sending webhooks yourself. Handling duplicate deliveries once retries succeed is a separate concern covered in the idempotent webhooks guide.
How major providers retry
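Retry policies differ from platform to platform, and not every provider retries automatically. Those that do never retry immediately after a failure; they wait before the next attempt and space out subsequent retries.
- Stripe retries live-mode deliveries for up to 3 days using exponential backoff, so the gap between attempts grows as failures accumulate. Stripe documents the overall window rather than a fixed per-attempt schedule. If you disable or delete the endpoint, pending retries to it stop.
- GitHub does not automatically redeliver failed deliveries. Failed deliveries stay in the delivery log for a limited time, and you can redeliver them manually or through the REST API, so recovery from a missed GitHub event is something you have to build yourself.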
- Shopify is the most aggressive: it retries up to 19 times over a 48-hour period. After 19 consecutive failures, Shopify automatically disables the webhook endpoint.
- WooCommerce retries 5 times total, with delays between each attempt. After 5 failures, the webhook is marked as disabled.
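The key point: retries are not instant, and they are not unlimited. If your server has extended downtime, you will miss events entirely once the retry window closes. Providers also revise these numbers from time to time, so confirm the current policy in each provider's documentation before you depend on it.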
What triggers a retry
Providers retry when they do not receive a successful response. What counts as "unsuccessful" varies by platform, but there are three primary triggers.
Non-2xx HTTP response. Any status code outside the 200-299 range tells the provider the delivery failed. Stripe retries on any non-2xx, including 4xx status codes. Some providers treat 4xx differently: since a 4xx usually indicates a client error (bad request, unauthorized), a few platforms interpret it as a permanent failure and skip retrying. Check your provider's documentation on this distinction.
Connection failure. If the provider cannot establish a TCP connection to your server at all (DNS failure, port closed, server not listening), the delivery fails before any HTTP exchange takes place. A refused connection fails right away, while an unreachable host fails once the provider's connect timeout expires. Either way it is logged as a connection error, not an HTTP error.
Response timeout. The provider establishes a connection, your server accepts it, but you do not return a response within the allowed time window. Stripe allows 30 seconds. GitHub allows 10 seconds. Slack's Events API allows only 3 seconds. If your handler exceeds the timeout, the provider marks the delivery as failed even if your server eventually finishes the work.
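If you are on the sending side, the triggers above translate into a small classification step. The sketch below is one possible policy, not any provider's documented behavior: the Outcome enum, the classify_delivery name, and the choice to retry only 408 and 429 among the 4xx codes are all assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    DELIVERED = "delivered"
    RETRY = "retry"
    PERMANENT_FAILURE = "permanent_failure"


# Assumption: these are the only 4xx codes that usually clear up on their own
# (request timeout, rate limited). Adjust to match your own guarantees.
RETRYABLE_4XX = {408, 429}


def classify_delivery(status_code: Optional[int]) -> Outcome:
    """Classify one delivery attempt.

    status_code is None when no HTTP response arrived at all, which covers
    both connection failures and response timeouts.
    """
    if status_code is None:
        return Outcome.RETRY
    if 200 <= status_code <= 299:
        return Outcome.DELIVERED
    if 400 <= status_code <= 499 and status_code not in RETRYABLE_4XX:
        return Outcome.PERMANENT_FAILURE
    # Everything else (3xx, 5xx, 408, 429) is treated as temporary.
    return Outcome.RETRY
```

A stricter sender can drop the permanent-failure branch and retry every non-2xx response, which is the behavior the text attributes to Stripe.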
Exponential backoff with jitter
Exponential backoff is the standard algorithm for spacing out retries. The wait time doubles with each attempt: after attempt 1 you wait 2 seconds, after attempt 2 you wait 4 seconds, after attempt 3 you wait 8 seconds, and so on.
wait_seconds = 2 ** attempt_number
Compared to fixed-interval retries (where every retry waits the same interval), exponential backoff reduces the load on a struggling server by spacing retries further apart as failures accumulate. A server that is overwhelmed benefits from this breathing room.
Jitter adds a random offset to the calculated wait time. Without jitter, every client that failed at the same moment will retry at the same moment. That synchronized wave of retries, known as a thundering herd, can overwhelm a server that has just recovered and knock it over a second time.
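To make the difference in spacing concrete, this small sketch lists the waits produced by the formula above next to a constant 2-second interval. The function names and the 2-second interval are illustrative assumptions.

```python
from typing import List


def exponential_waits(attempts: int) -> List[int]:
    """Waits in seconds for attempts 1..attempts using 2 ** attempt_number."""
    return [2 ** n for n in range(1, attempts + 1)]


def fixed_waits(attempts: int, interval: int = 2) -> List[int]:
    """Waits in seconds when every retry uses the same interval."""
    return [interval] * attempts


print(exponential_waits(6))       # [2, 4, 8, 16, 32, 64]
print(sum(exponential_waits(6)))  # 126
print(sum(fixed_waits(6)))        # 12
```

Six fixed-interval retries are spent within 12 seconds, while six exponential retries stretch over roughly two minutes, which gives a recovering server far more room.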
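import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    # Exponential growth, limited to `cap` seconds.
    exponential = base * (2 ** attempt)
    capped = min(exponential, cap)
    # Full jitter: pick anywhere between 0 and the capped delay.
    jitter = random.uniform(0, capped)
    return jitter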
This "full jitter" approach selects a random value between zero and the capped exponential delay. It spreads retry traffic across time instead of concentrating it. Decorrelated jitter, where each delay is drawn from a range based on the previous delay rather than on the attempt number, is a common alternative for high-volume systems. Measure both under your own load before assuming one spreads traffic better than the other.
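For comparison, here is a minimal sketch of decorrelated jitter. It assumes the commonly used form in which the next delay is drawn between the base and three times the previous delay; the multiplier of 3 and the decorrelated_delay name are illustrative choices, not requirements.

```python
import random


def decorrelated_delay(previous: float, base: float = 1.0, cap: float = 300.0) -> float:
    """Return the next delay, drawn from [base, previous * 3] and limited to cap."""
    upper = max(base, previous * 3)
    return min(cap, random.uniform(base, upper))


# Each delay feeds into the next one, so the caller keeps the previous value.
delay = 1.0  # start from the base delay
for attempt in range(1, 6):
    delay = decorrelated_delay(delay)
    print(f"attempt {attempt}: wait {delay:.1f}s")
```

Unlike full jitter, the delay never drops below the base value, so a retry cannot fire almost immediately after a failure.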
Designing your receiver to survive retries
The most important rule for a webhook receiver: respond 200 as fast as possible, then do the actual work afterward.
If your handler synchronously processes the payload (database writes, external API calls, email sends) before returning a response, you risk hitting the provider's timeout. The provider sees no response, marks the delivery failed, and schedules a retry. Your server may have actually processed the event correctly, but the retry will deliver it again.
The solution is to push the work to a background queue.
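# Django + Celery example
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST

from .tasks import process_webhook_payload


@csrf_exempt
@require_POST
def webhook_handler(request):
    raw_body = request.body  # bytes, exactly as the provider sent them
    # Verify signature here before queuing
    # Celery's default JSON serializer cannot encode bytes, so pass text.
    process_webhook_payload.delay(raw_body.decode("utf-8"))
    return HttpResponse(status=200)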
The handler verifies the signature, enqueues the payload, and returns immediately. The background worker picks up the job and does the real work. The provider gets its 200 in milliseconds regardless of how long processing takes.
Also consider the timeout ceiling for your specific providers. If one of your integrations uses Slack's 3-second timeout, your handler must return within 3 seconds including any synchronous work before the queue. Even signature verification and JSON parsing have a cost at scale.
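The signature check is usually the only synchronous work worth keeping in the handler. As a rough sketch, assuming the provider signs the raw request body with HMAC-SHA256 and sends the hex digest in a header, verification could look like the following. The verify_signature name and the hex encoding are assumptions; every provider defines its own header name and encoding, and many also sign a timestamp, so prefer the provider's official SDK where one exists.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, received_signature: str, secret: bytes) -> bool:
    """Return True if received_signature is the hex HMAC-SHA256 of raw_body."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode("utf-8"), received_signature.encode("utf-8"))
```

Always verify against the raw bytes of the request body. Parsing the JSON and serializing it again can change whitespace or key order, which changes the digest.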
Dead letter queues
When a provider exhausts its retry schedule, it stops trying. If your handler also processes events from an internal queue, your queue will have its own retry logic. At some point, a job may fail enough times that the system gives up on it entirely. That job needs to go somewhere you can inspect and act on it.
A dead letter queue (DLQ) is a separate queue where failed jobs land after exceeding their retry limit. Rather than silently disappearing, the job is preserved for investigation.
A minimal DLQ setup has three components:
- Failure capture: The job processor catches exceptions, increments a failure counter, and moves the job to the DLQ after a configured maximum.
- Alerting: Something monitors the DLQ and pages someone when items land there. A queue that silently fills is worse than no DLQ at all.
- Replay: You can re-enqueue DLQ items manually once the underlying problem is fixed. Without replay, recovering from a DLQ requires re-triggering events at the source, which may not always be possible.
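The three components above can be sketched with an in-memory queue. This is a teaching sketch under stated assumptions, not a substitute for a real job queue: the Job and QueueWithDLQ names, the limit of 3 failures, and the alert callback are all hypothetical, and nothing here is persisted or delayed between retries.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List

MAX_FAILURES = 3  # hypothetical retry limit before a job is dead-lettered


@dataclass
class Job:
    payload: dict
    failures: int = 0
    last_error: str = ""


class QueueWithDLQ:
    def __init__(self, handler: Callable[[dict], None], alert: Callable[[Job], None]):
        self.handler = handler
        self.alert = alert
        self.main: Deque[Job] = deque()
        self.dead: List[Job] = []

    def enqueue(self, payload: dict) -> None:
        self.main.append(Job(payload))

    def run_once(self) -> None:
        """Process the next job, if there is one."""
        if not self.main:
            return
        job = self.main.popleft()
        try:
            self.handler(job.payload)
        except Exception as exc:
            # Failure capture: count the failure and keep the error for later.
            job.failures += 1
            job.last_error = repr(exc)
            if job.failures >= MAX_FAILURES:
                self.dead.append(job)
                self.alert(job)  # Alerting: tell a human right away.
            else:
                self.main.append(job)  # A real queue would delay this retry.

    def replay_dead(self) -> int:
        """Replay: move every dead job back to the main queue, counters reset."""
        replayed = len(self.dead)
        for job in self.dead:
            job.failures = 0
            self.main.append(job)
        self.dead.clear()
        return replayed
```

Keeping last_error on the job is what makes the DLQ useful for investigation: whoever is paged can see why the job failed without digging through logs.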
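BullMQ and most other job queue systems can retain failed jobs for you. In BullMQ, a job that exhausts its attempts is moved to the "failed" set with the failure reason and stack trace preserved. Celery has no built-in DLQ. The usual approach is to dead-letter at the broker (for example, a RabbitMQ dead letter exchange) and, with acks_late enabled, have the task raise Reject(requeue=False) once it gives up. The reject_on_worker_lost setting covers a different case: it requeues a task whose worker process died mid-execution.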
For local DLQ testing, Payloader is useful for replaying specific payloads to your handler. You can capture the original delivery, then use Payloader's replay feature to re-POST the exact payload to your local server after fixing the bug, without needing to re-trigger the event at the source.
Building retry logic into your own webhook sender
If your product sends webhooks to customers, you need your own retry implementation. The common mistake is sending webhooks synchronously in the request handler that triggered the event. If the customer's server is slow or down, your request blocks, times out, and the event is lost.
The correct pattern uses a background job for every webhook delivery:
1. Store the pending delivery in your database. When an event occurs, write a record with the target URL, payload, and status (pending). This is your source of truth. If your process crashes before the HTTP request goes out, the delivery is not lost.
2. Enqueue a background job. The job makes the HTTP request to the customer's endpoint. Use a timeout (10-30 seconds is standard). Do not block indefinitely.
3. Implement exponential backoff with jitter. On failure, update the delivery record with the failure count and the calculated next attempt time. Schedule the next job for that time rather than retrying immediately.
4. Cap total attempts. Pick a maximum (19 attempts over 48 hours, or 5 attempts over 3 days, depending on your product's guarantees). After the cap, mark the delivery as permanently failed and stop retrying.
5. Alert on repeated failures. If a customer's endpoint fails consistently across many events, send them an alert. Silently abandoning deliveries damages trust. Give customers visibility into delivery failures and a way to replay missed events.
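import json
from datetime import timedelta

import requests
from celery import shared_task
from django.utils.timezone import now

# WebhookDelivery, sign, notify_customer_of_failure, and backoff_delay are
# application code assumed to be defined elsewhere.

MAX_ATTEMPTS = 10


@shared_task
def schedule_webhook_delivery(delivery_id: int, attempt: int) -> None:
    delivery = WebhookDelivery.objects.get(id=delivery_id)
    # Serialize once so the signature covers the exact bytes that are sent.
    body = json.dumps(delivery.payload).encode("utf-8")
    try:
        response = requests.post(
            delivery.target_url,
            data=body,
            timeout=30,
            headers={
                "Content-Type": "application/json",
                "X-Webhook-Signature": sign(body),
            },
        )
        response.raise_for_status()  # treat any 4xx/5xx as a failed attempt
    except requests.RequestException:
        attempt += 1
        delivery.attempt_count = attempt
        if attempt >= MAX_ATTEMPTS:
            delivery.status = "failed"
            delivery.save()
            notify_customer_of_failure(delivery)
            return
        delay = backoff_delay(attempt)
        delivery.next_attempt_at = now() + timedelta(seconds=delay)
        delivery.save()
        # Re-enqueue this task for later instead of sleeping in the worker.
        schedule_webhook_delivery.apply_async(
            (delivery_id, attempt), countdown=delay
        )
    else:
        delivery.status = "delivered"
        delivery.save()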
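One thing worth checking before shipping a sender like this is how long the retry window actually is. The sketch below computes the upper bound for the full-jitter backoff_delay shown earlier; the max_retry_window name is hypothetical, and the result ignores the time each HTTP request itself takes.

```python
def max_retry_window(max_attempts: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Upper bound, in seconds, on the time between the first and last attempt."""
    # Retries are scheduled after failures 1 .. max_attempts - 1. Full jitter
    # picks at most the capped exponential value for each of them.
    return sum(min(base * (2 ** n), cap) for n in range(1, max_attempts))


print(max_retry_window(10))                        # 810.0
print(max_retry_window(10, base=60.0, cap=21600))  # 52200.0
```

With MAX_ATTEMPTS = 10 and the default base and cap, all retries are spent within about 13.5 minutes, which is far short of a multi-day window. Raising the base to 60 seconds and the cap to 6 hours stretches the same 10 attempts to at most 14.5 hours, so tune these values to match the guarantee you want to offer.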
Testing retry behavior with Payloader
Retry logic is difficult to test manually because you have to simulate failures and timing. Payloader makes this easier in a few ways.
When you use a Payloader endpoint as your webhook destination during development, every delivery attempt arrives as a separate request in the log. If a provider retries three times, you will see three entries with their exact timestamps. The time gaps between them confirm whether your provider is following its documented backoff schedule.
To simulate a retry scenario locally, capture a real delivery from the provider in Payloader, then use the replay feature to POST that exact payload to your local handler. You can replay it multiple times in quick succession to test your idempotency handling, or space out the replays manually to simulate the timing of a retry sequence.
If you are testing your own webhook sender's retry logic, point the target URL at a Payloader endpoint that you control. You can inspect each delivery attempt, check the headers (your retry implementation should include metadata like attempt count and delivery ID), and verify that the backoff timing matches your implementation.
Payloader captures the full request: method, headers, body, and timestamp. For retry debugging, the timestamp is often the most important field, since it tells you whether the timing between attempts matches what your code calculated.
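Once you have the timestamps of each attempt, checking them against your backoff settings is a short calculation. The sketch below assumes you have already copied the timestamps into datetime objects (no particular export format is implied) and that the sender uses the full-jitter backoff_delay shown earlier. The function names and the 5-second tolerance for request time are illustrative assumptions.

```python
from datetime import datetime
from typing import List


def gaps_in_seconds(timestamps: List[datetime]) -> List[float]:
    """Seconds between consecutive delivery attempts, oldest first."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def retries_exceeding_bound(
    gaps: List[float], base: float = 1.0, cap: float = 300.0, tolerance: float = 5.0
) -> List[int]:
    """Return the 1-based retry numbers whose gap is larger than full jitter allows."""
    return [
        n
        for n, gap in enumerate(gaps, start=1)
        if gap > min(base * (2 ** n), cap) + tolerance
    ]


attempts = [
    datetime(2024, 1, 1, 12, 0, 0),
    datetime(2024, 1, 1, 12, 0, 1),   # retry 1, 1s after the first attempt
    datetime(2024, 1, 1, 12, 0, 4),   # retry 2, 3s later
    datetime(2024, 1, 1, 12, 0, 40),  # retry 3, 36s later
]
print(retries_exceeding_bound(gaps_in_seconds(attempts)))  # [3]
```

Full jitter only sets an upper bound on each wait, so a gap that is shorter than expected is normal. A gap that exceeds the bound suggests the job was delayed in the queue or the schedule is being computed differently from what you intended.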
Reliability is bidirectional
Webhook reliability requires care on both ends of the connection. Providers do their part by retrying failed deliveries according to a backoff schedule. Your receiver's job is to respond fast and queue work so it does not time out. If you send webhooks, the same rules apply in reverse.
Getting these pieces right means your integrations survive the network failures, deployment windows, and brief outages that are inevitable in production systems.