← Field Engineer Playbook

Self-healing fleets: design for absence

12 min

The standard you are held to

A professional install runs unattended for weeks. Nobody reboots anything. Every recovery you would perform by hand must be performed by the device itself — you design for your own absence.

The five layers (from production)

Our data-center fleet firmware ships five recovery layers; use them as your checklist:

  1. WiFi watchdog — no connectivity for 5 minutes → reboot. Catches router reboots, DHCP hiccups, radio wedges.
  2. HTTP fail-counter — N consecutive failed posts → reboot. The network can be "up" while the path to your server is not.
  3. Stuck-state recovery — anomaly score pegged for 10 minutes → re-baseline automatically. A rearranged room must not alarm forever.
  4. Scheduled reboot — every 12 hours, unconditionally. Clears slow leaks and wedged drivers before they matter. Unglamorous; extremely effective.
  5. Dead-man on the server — the fleet dashboard alerts when a device stops reporting. Silence is a signal; someone must own it.

Plus the bridge for gaps: an in-RAM ring buffer holds the last frames through WiFi outages, so short blips lose nothing.

Telemetry that answers "why"

Ship the boot reason with every report (esp_reset_reason()): BROWNOUT means bad power at that socket, WDT means firmware wedging, EXT means a human unplugged it. A fleet's reset-reason histogram diagnoses sites from your desk.

The discipline

Every time you fix a device by hand, ask: which layer should have caught this? Add it to the firmware, not to your calendar. Fleets scale; visits do not.


Sign in to use this feature — it takes 20 seconds and it’s free. Sign in Next lesson →