Resilience & Crash Handling
The FirstBreath Vision system is designed to be resilient to network failures, camera crashes, and container restarts. This page details the robust crash handling and auto-recovery mechanisms implemented in the camera-manager service.
Architecture
The resilience logic is centralized in the Camera Manager, which orchestrates the lifecycle of camera connections (threads) whether running in Batch or Distributed mode.
Key Mechanisms
1. Health Monitoring Loop
A dedicated background thread in manager.py polls the status of all active camera threads every 10 seconds. It uses an internal state machine to decide when to intervene, which avoids flapping (repeatedly restarting a camera over transient issues).
- Event-Driven Updates: To minimize database IO, the `running_scripts` table is ONLY updated when the camera status changes (e.g., `RUNNING` -> `CRASHED`).
- Timestamping: When a change occurs, `status_updated_at` is set to `NOW()`. This precise timestamp lets the UI calculate how long the camera has been in its current state.
- No Periodic Heartbeat: We deliberately avoid writing to the DB while the status is stable ("no news is good news"), significantly reducing database load.
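The event-driven update rule can be sketched as follows. This is a minimal illustration, not the actual code in manager.py: the names `camera_status`, `last_persisted`, `persist_status`, and `sync_statuses` are hypothetical.

```python
import threading

POLL_INTERVAL_S = 10  # matches the 10-second polling cadence described above

# Hypothetical in-memory view of camera threads: {camera_id: status}
camera_status = {}
# Last status written to the DB, per camera.
last_persisted = {}

def persist_status(camera_id, status):
    """Placeholder for the single-row UPDATE on running_scripts.

    The real query would set status and status_updated_at = NOW()
    for this camera only.
    """
    pass

def sync_statuses():
    """One pass of the monitor: persist only statuses that changed."""
    changed = []
    for camera_id, status in camera_status.items():
        if last_persisted.get(camera_id) != status:
            persist_status(camera_id, status)
            last_persisted[camera_id] = status
            changed.append((camera_id, status))
    return changed

def health_monitor_loop(stop_event: threading.Event):
    """Background polling loop; stable statuses trigger no DB writes."""
    while not stop_event.is_set():
        sync_statuses()
        stop_event.wait(POLL_INTERVAL_S)
```

A stable status passes through `sync_statuses()` without touching the DB, which is what keeps the write load proportional to state changes rather than to the number of cameras.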
2. Auto-Restart Strategy
The system automatically recovers from various failure modes:
| Failure Mode | Detection Logic | Action |
|---|---|---|
| Thread Crash | CameraReader thread catches exception and sets status='CRASHED' | Immediate Remove + Add of the camera. |
| Stalled Stream | Status is RUNNING but last_frame_time is more than 60s old | Systematic restart (assumes a frozen connection). |
| Network Flap | Status is RECONNECTING | < 30s: No action (debounce). > 30s: Mark as CRASHED in DB. > 60s: Force restart. |
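The failure-mode table above boils down to a small decision function. The sketch below is a hypothetical restatement of that table (`decide_action` and the threshold constants are illustrative names; the real manager may structure this differently):

```python
STALL_TIMEOUT_S = 60       # RUNNING but no frame for this long -> restart
RECONNECT_CRASH_S = 30     # RECONNECTING longer than this -> mark CRASHED
RECONNECT_RESTART_S = 60   # RECONNECTING longer than this -> force restart

def decide_action(status, seconds_in_state, seconds_since_last_frame):
    """Map a camera's current state onto one of the table's actions."""
    if status == "CRASHED":
        return "restart"                      # immediate remove + add
    if status == "RUNNING" and seconds_since_last_frame > STALL_TIMEOUT_S:
        return "restart"                      # stalled stream
    if status == "RECONNECTING":
        if seconds_in_state > RECONNECT_RESTART_S:
            return "restart"                  # network flap escalated
        if seconds_in_state > RECONNECT_CRASH_S:
            return "mark_crashed"             # persist CRASHED in DB
        return "none"                         # inside the debounce window
    return "none"
```

Keeping the escalation thresholds as named constants makes the debounce behavior easy to tune without touching the control flow.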
3. Startup Recovery (Persistence)
The camera-manager service is stateless in memory but stateful via the database.
- On Startup: The service runs `load_snapshot()`.
- Logic: It queries the `running_scripts` table for any camera marked as `running` OR `crashed`.
- Effect: If the container was restarted (upgrade, crash, manual restart), all previously active cameras are automatically re-initialized.
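A minimal sketch of the `load_snapshot()` query, using SQLite as a stand-in for the service's actual database and assuming a `camera_id` column:

```python
import sqlite3  # stand-in; the real service's DB driver may differ

def load_snapshot(conn):
    """Return the cameras that should be re-initialized on startup.

    Picks up both 'running' rows (container was killed mid-flight)
    and 'crashed' rows (written by the graceful-shutdown handler).
    """
    rows = conn.execute(
        "SELECT camera_id FROM running_scripts "
        "WHERE status IN ('running', 'crashed')"
    ).fetchall()
    return [row[0] for row in rows]
```

Note that `stopped` rows are deliberately excluded, so cameras a user shut down by hand stay down across reboots.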
4. Graceful Shutdown
To support Startup Recovery effectively, we must know which cameras should be running.
- On SIGTERM (Docker Stop): A signal handler intercepts the shutdown request.
- Action: It executes
UPDATE running_scripts SET status='crashed' WHERE status='running'. - Why?: Marking them as
crashedensures they are picked up by the Startup Recovery logic on the next boot. Cameras explicitly stopped by the user (stoppedstatus) remain stopped.
Manual Control
- Start: API sends Redis `start` -> Manager adds the camera -> DB status set to `running`.
- Stop: API sends Redis `stop` -> Manager removes the camera -> DB status set to `stopped`.
- Result: A manually stopped camera will not automatically restart on container reboot, which is the expected behavior.
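The manual-control flow reduces to a small dispatcher. This is a sketch only: the `handle_command` function and the `manager`/`db` interfaces are assumptions based on the description above, not the actual Redis message protocol.

```python
def handle_command(command, camera_id, manager, db):
    """Dispatch a manual start/stop command received over Redis.

    The DB write is what makes the user's intent survive reboots:
    'stopped' rows are skipped by Startup Recovery.
    """
    if command == "start":
        manager.add_camera(camera_id)
        db.set_status(camera_id, "running")
    elif command == "stop":
        manager.remove_camera(camera_id)
        db.set_status(camera_id, "stopped")
```

Because the status column doubles as the recovery snapshot, no separate "desired state" store is needed: the same row drives both the UI and the restart logic.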