Self-healing fleets: design for absence

The standard you are held to

A professional install runs unattended for weeks. Nobody reboots anything. Every recovery you would perform by hand must be performed by the device itself — you design for your own absence.

The five layers (from production)

Our data-center fleet firmware ships five recovery layers; use them as your checklist:

WiFi watchdog — no connectivity for 5 minutes → reboot. Catches router reboots, DHCP hiccups, radio wedges.
HTTP fail-counter — N consecutive failed posts → reboot. The network can be "up" while the path to your server is not.
Stuck-state recovery — anomaly score pegged for 10 minutes → re-baseline automatically. A rearranged room must not alarm forever.
Scheduled reboot — every 12 hours, unconditionally. Clears slow leaks and wedged drivers before they matter. Unglamorous; extremely effective.
Dead-man on the server — the fleet dashboard alerts when a device stops reporting. Silence is a signal; someone must own it.

Plus the bridge for gaps: an in-RAM ring buffer holds the last frames through WiFi outages, so short blips lose nothing.

Telemetry that answers "why"

Ship the boot reason with every report (esp_reset_reason()): BROWNOUT means bad power at that socket, WDT means firmware wedging, EXT means a human unplugged it. A fleet's reset-reason histogram diagnoses sites from your desk.

The discipline

Every time you fix a device by hand, ask: which layer should have caught this? Add it to the firmware, not to your calendar. Fleets scale; visits do not.