Introduction
The FirstBreath Vision platform relies on a robust, production-grade monitoring stack to ensure high availability, performance capabilities, and rapid incident response.
This monitoring system provides full observability into the AI pipeline, tracking everything from hardware health (GPU temps) to high-level business logic (inference frames per second).
π System Architectureβ
The monitoring stack operates alongside the core application services (camera-manager, batch-inference) on a shared network.
π Key Componentsβ
| Component | Role | Port |
|---|---|---|
| Grafana | Visualization dashboard and alerting interface. | 3000 |
| Prometheus | Scrapes and stores metrics from all services. | 9090 |
| cAdvisor | Tracks Docker container resource usage (RAM, CPU). | 8080 |
| Node Exporter | Monitors the host OS (Disk, Network I/O). | 9100 |
| DCGM Exporter | Specialized NVIDIA exporter for GPU telemetry. | 9400 |
π― Objectivesβ
- Reliability: Detect service crashes or restarts instantly.
- Performance Tuning: Identify bottlenecks (e.g., Inference is too slow, or Redis is lagging).
- Hardware Health: Prevent GPU overheating or OOM (Out Of Memory) kills.