
Self-Healing Health Checks for OpenClaw

Executive Summary

This white paper documents a real-world production failure in which an OpenClaw gateway was down for more than 18 hours without any alert reaching the operator. We present the root cause analysis, the self-healing health check architecture we built to prevent a recurrence, and recommendations for any OpenClaw deployment.

Key Findings

  • A Telegram polling conflict between two gateway instances silently disabled all cron jobs for days
  • The existing health check detected failures but had no effective alerting or remediation
  • A self-healing health check reduced mean-time-to-recovery (MTTR) from 18+ hours to under 3 minutes

The Solution

We built a self-healing health check with three design shifts:

  1. Alert through channels people actually monitor – send Telegram DMs instead of macOS notifications
  2. Detect, then fix – automatically restart the gateway, restart containers, and bypass failed schedulers
  3. Audit at the job level – run a cron job auditor that finds missed jobs and triggers them
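
The three shifts above can be sketched roughly as follows. This is a minimal illustration, not the implementation from the white paper: the names `heal`, `audit_jobs`, `CronJob`, and `telegram_alert`, the restart limit, and the five-minute grace period are all assumptions made for the example. The health probe, restart action, and job trigger are passed in as callables because the summary does not say how OpenClaw exposes them. The only external interface used is the Telegram Bot API `sendMessage` method.

```python
import urllib.parse
import urllib.request
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class CronJob:
    """Hypothetical record of a scheduled job (not an OpenClaw type)."""
    name: str
    interval: timedelta   # how often the job is expected to run
    last_run: datetime    # last run recorded for this job


def telegram_alert(token: str, chat_id: str, text: str) -> None:
    """Send a DM through the Telegram Bot API sendMessage method."""
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    urllib.request.urlopen(url, data=data, timeout=10)


def heal(is_healthy: Callable[[], bool],
         restart: Callable[[], None],
         alert: Callable[[str], None],
         max_attempts: int = 3) -> bool:
    """Detect, then fix: restart until healthy, and always report the outcome."""
    if is_healthy():
        return True
    for attempt in range(1, max_attempts + 1):
        restart()
        if is_healthy():
            alert(f"Gateway recovered after {attempt} restart(s)")
            return True
    alert(f"Gateway still down after {max_attempts} restart(s); manual action needed")
    return False


def audit_jobs(jobs: List[CronJob],
               now: datetime,
               trigger: Callable[[CronJob], None],
               alert: Callable[[str], None],
               grace: timedelta = timedelta(minutes=5)) -> List[str]:
    """Audit at the job level: re-run any job that is overdue by more than `grace`."""
    missed = [j for j in jobs if now - j.last_run > j.interval + grace]
    for job in missed:
        trigger(job)  # run the job directly, bypassing the scheduler
        alert(f"Re-triggered missed job: {job.name}")
    return [j.name for j in missed]
```

In use, `alert` would be something like `lambda text: telegram_alert(token, chat_id, text)`, and both functions would run on a short timer that does not depend on the gateway's own scheduler, so that a scheduler failure cannot silence the check itself.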

Results

Metric | Before | After
Time to recovery | 18+ hours | < 3 minutes
Missed cron jobs | All of them | Zero
Alert delivery | Silent notification | Telegram DM + audio

Read the full white paper on GitHub