The standard you are held to
A professional install runs unattended for weeks. Nobody reboots anything. Every recovery you would perform by hand must be performed by the device itself — you design for your own absence.
The five layers (from production)
Our data-center fleet firmware ships five recovery layers; use them as your checklist:
- WiFi watchdog — no connectivity for 5 minutes → reboot. Catches router reboots, DHCP hiccups, radio wedges.
- HTTP fail-counter — N consecutive failed posts → reboot. The network can be "up" while the path to your server is not.
- Stuck-state recovery — anomaly score pegged for 10 minutes → re-baseline automatically. A rearranged room must not alarm forever.
- Scheduled reboot — every 12 hours, unconditionally. Clears slow leaks and wedged drivers before they matter. Unglamorous; extremely effective.
- Dead-man on the server — the fleet dashboard alerts when a device stops reporting. Silence is a signal; someone must own it.
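The four on-device layers above reduce to one decision made on every pass of the main loop. The following is a minimal sketch of that decision, not the production firmware: the struct, the function names, and the value chosen for N (HTTP_FAIL_LIMIT) are assumptions made for illustration, and the caller is assumed to supply uptime in milliseconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thresholds taken from the checklist above. */
#define WIFI_TIMEOUT_MS     (5u * 60u * 1000u)          /* 5 minutes  */
#define STUCK_TIMEOUT_MS    (10u * 60u * 1000u)         /* 10 minutes */
#define SCHEDULED_REBOOT_MS (12u * 60u * 60u * 1000u)   /* 12 hours   */
#define HTTP_FAIL_LIMIT     5u  /* assumed value for "N"; the text leaves it open */

typedef enum { ACT_NONE, ACT_REBASELINE, ACT_REBOOT } recovery_action_t;

/* Hypothetical bookkeeping, updated by the WiFi, HTTP and scoring code. */
typedef struct {
    uint32_t last_wifi_ok_ms;          /* uptime when WiFi was last connected */
    uint32_t http_fail_streak;         /* consecutive failed posts            */
    bool     anomaly_pegged;           /* score currently at its ceiling?     */
    uint32_t anomaly_pegged_since_ms;  /* uptime when it became pegged        */
} recovery_state_t;

/* Record one HTTP post result: any success clears the streak. */
void recovery_note_http(recovery_state_t *s, bool ok)
{
    s->http_fail_streak = ok ? 0u : s->http_fail_streak + 1u;
}

/* Decide what to do right now. now_ms is uptime since boot; it cannot wrap
 * a uint32_t because the scheduled reboot fires long before 49 days. */
recovery_action_t recovery_poll(const recovery_state_t *s, uint32_t now_ms)
{
    if (now_ms >= SCHEDULED_REBOOT_MS)
        return ACT_REBOOT;                              /* scheduled reboot  */
    if (now_ms - s->last_wifi_ok_ms >= WIFI_TIMEOUT_MS)
        return ACT_REBOOT;                              /* WiFi watchdog     */
    if (s->http_fail_streak >= HTTP_FAIL_LIMIT)
        return ACT_REBOOT;                              /* HTTP fail-counter */
    if (s->anomaly_pegged &&
        now_ms - s->anomaly_pegged_since_ms >= STUCK_TIMEOUT_MS)
        return ACT_REBASELINE;                          /* stuck-state       */
    return ACT_NONE;
}
```

On an ESP32 the caller would typically act on ACT_REBOOT with esp_restart() and on ACT_REBASELINE by resetting its own baseline. Keeping the decision free of hardware calls means it can be unit-tested on a desk.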
Plus the bridge for gaps: an in-RAM ring buffer holds the most recent frames through WiFi outages, so blips shorter than the buffer's capacity lose nothing.
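A ring buffer of this kind can be sketched in a few lines. This is an illustrative version under stated assumptions, not the fleet's actual buffer: the frame size, slot count, and function names are hypothetical. When the buffer is full, the oldest frame is overwritten, so an outage is covered only up to RING_SLOTS frames.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_BYTES 64u  /* assumed frame size */
#define RING_SLOTS  8u   /* assumed capacity; size it to the outage you must bridge */

typedef struct {
    uint8_t data[RING_SLOTS][FRAME_BYTES];
    size_t  head;   /* index of the oldest stored frame */
    size_t  count;  /* number of frames currently held  */
} frame_ring_t;

void ring_reset(frame_ring_t *r)
{
    r->head = 0;
    r->count = 0;
}

/* Store a frame. If the ring is full, the oldest frame is overwritten. */
void ring_push(frame_ring_t *r, const uint8_t frame[FRAME_BYTES])
{
    size_t tail = (r->head + r->count) % RING_SLOTS;
    memcpy(r->data[tail], frame, FRAME_BYTES);
    if (r->count < RING_SLOTS)
        r->count++;
    else
        r->head = (r->head + 1) % RING_SLOTS;  /* oldest slot was just reused */
}

/* Remove the oldest frame into out. Returns 1 on success, 0 if empty. */
int ring_pop(frame_ring_t *r, uint8_t out[FRAME_BYTES])
{
    if (r->count == 0)
        return 0;
    memcpy(out, r->data[r->head], FRAME_BYTES);
    r->head = (r->head + 1) % RING_SLOTS;
    r->count--;
    return 1;
}
```

While WiFi is down the sampling loop calls ring_push(). Once the link returns, the uploader drains frames oldest-first with ring_pop() until it returns 0.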
Telemetry that answers "why"
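Ship the boot reason with every report (esp_reset_reason()): BROWNOUT means bad power at that socket, a watchdog reset (WDT) means the firmware wedged, and POWERON means power was cut and restored, usually because a human unplugged the device. (EXT is a reset driven by the external reset pin, not a power cycle.) A fleet's reset-reason histogram diagnoses sites from your desk.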
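One way to attach the boot reason to a report is to map the value from esp_reset_reason() to a short label and format it as a field. The sketch below makes several assumptions: the helper names, the JSON field name "boot", and the grouping of the three watchdog codes under one "WDT" label are illustrative choices. The `#else` branch is a host-side stand-in so the logic compiles off-target; on a device, ESP-IDF's esp_system.h supplies the real function and constants.

```c
#include <stdio.h>

#ifdef ESP_PLATFORM
#include "esp_system.h"   /* real esp_reset_reason() and ESP_RST_* constants */
#else
/* Host-side stand-ins mirroring ESP-IDF names, for off-target builds only. */
typedef enum {
    ESP_RST_UNKNOWN, ESP_RST_POWERON, ESP_RST_EXT, ESP_RST_SW,
    ESP_RST_PANIC, ESP_RST_INT_WDT, ESP_RST_TASK_WDT, ESP_RST_WDT,
    ESP_RST_DEEPSLEEP, ESP_RST_BROWNOUT
} esp_reset_reason_t;
static esp_reset_reason_t esp_reset_reason(void) { return ESP_RST_BROWNOUT; }
#endif

/* Collapse the reset reason into a label suitable for a fleet histogram. */
const char *reset_reason_label(esp_reset_reason_t r)
{
    switch (r) {
    case ESP_RST_POWERON:  return "POWERON";
    case ESP_RST_EXT:      return "EXT";
    case ESP_RST_SW:       return "SW";       /* e.g. a deliberate esp_restart() */
    case ESP_RST_PANIC:    return "PANIC";
    case ESP_RST_INT_WDT:                     /* all watchdog flavours share one bucket */
    case ESP_RST_TASK_WDT:
    case ESP_RST_WDT:      return "WDT";
    case ESP_RST_BROWNOUT: return "BROWNOUT";
    default:               return "OTHER";
    }
}

/* Write a JSON fragment such as "boot":"WDT" into buf; returns snprintf's result. */
int format_boot_field(char *buf, size_t len)
{
    return snprintf(buf, len, "\"boot\":\"%s\"",
                    reset_reason_label(esp_reset_reason()));
}
```

The server can then count labels per device or per site. Planned reboots show up as "SW", which keeps the scheduled 12-hour restarts from being mistaken for faults.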
The discipline
Every time you fix a device by hand, ask: which layer should have caught this? Add it to the firmware, not to your calendar. Fleets scale; visits do not.