Introduction

The FirstBreath Vision platform relies on a production-grade monitoring stack to support high availability, consistent performance, and rapid incident response.

This monitoring system provides end-to-end observability into the AI pipeline, from hardware health (GPU temperature) to application-level metrics (inference frames per second).
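As a rough illustration of the application-level end of that range, the sketch below estimates inference frames per second over a sliding window and renders the value in the Prometheus text exposition format. The `FpsTracker` class, the `render_gauge` helper, and the metric name `inference_frames_per_second` are hypothetical and chosen for this example only; the platform's actual instrumentation may look different.

```python
from collections import deque


class FpsTracker:
    """Sliding-window frames-per-second estimate (hypothetical helper)."""

    def __init__(self, window_seconds=5.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_frame(self, now):
        """Register one processed frame at time `now` (in seconds)."""
        self._timestamps.append(now)
        # Drop frames that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()

    def fps(self):
        """Frames currently in the window divided by the window length."""
        return len(self._timestamps) / self.window_seconds


def render_gauge(name, help_text, value):
    """Render a single gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


if __name__ == "__main__":
    tracker = FpsTracker(window_seconds=2.0)
    for t in (0.0, 0.5, 1.0, 1.5):
        tracker.record_frame(t)
    # Metric name is an assumption made for this sketch.
    print(render_gauge("inference_frames_per_second",
                       "Frames processed per second.",
                       tracker.fps()))
```

Note that this estimate reads low until the first full window has elapsed, which is usually acceptable for a dashboard gauge.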

🏗 System Architecture

The monitoring stack operates alongside the core application services (camera-manager, batch-inference) on a shared network.

🚀 Key Components

| Component | Role | Port |
| --- | --- | --- |
| Grafana | Visualization dashboard and alerting interface. | `3000` |
| Prometheus | Scrapes and stores metrics from all services. | `9090` |
| cAdvisor | Tracks Docker container resource usage (RAM, CPU). | `8080` |
| Node Exporter | Monitors the host OS (Disk, Network I/O). | `9100` |
| DCGM Exporter | Specialized NVIDIA exporter for GPU telemetry. | `9400` |
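The component table can be captured in a small lookup for scripting, for example to list the exporter endpoints Prometheus would scrape or to run a quick reachability check. This is a minimal sketch: the ports are taken from the table, but the hostnames (assumed here to match Docker service names on the shared network) and all function names are assumptions.

```python
import socket

# Ports come from the component table; the hostnames are assumed
# to be the Docker service names on the shared network.
COMPONENTS = {
    "grafana": 3000,
    "prometheus": 9090,
    "cadvisor": 8080,
    "node-exporter": 9100,
    "dcgm-exporter": 9400,
}

# The three exporters that serve metrics for Prometheus to scrape.
EXPORTERS = ("cadvisor", "node-exporter", "dcgm-exporter")


def scrape_targets(components, exporters):
    """Return 'host:port' strings for the given exporters."""
    return [f"{name}:{components[name]}" for name in exporters]


def metrics_url(host, port, path="/metrics"):
    """Build the HTTP URL of an exporter's metrics endpoint."""
    return f"http://{host}:{port}{path}"


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for target in scrape_targets(COMPONENTS, EXPORTERS):
        print(target)
```

A check such as `port_is_open("prometheus", 9090)` only confirms that something is listening on the port; it does not confirm that the service behind it is healthy.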

🎯 Objectives

Reliability: Detect service crashes or restarts promptly.
Performance Tuning: Identify bottlenecks (e.g., inference running too slowly or Redis lagging).
Hardware Health: Catch rising GPU temperatures and memory pressure early, before they lead to overheating or OOM (out-of-memory) kills.
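One way to make these objectives concrete is to pair each with a PromQL expression and query it through the Prometheus HTTP API (`/api/v1/query`). The sketch below is illustrative only: `up` is the standard per-target health series and `DCGM_FI_DEV_GPU_TEMP` is exposed by the DCGM exporter, but the thresholds, the `inference_frames_per_second` metric name, and the `prometheus` hostname are assumptions, not values from the platform.

```python
from urllib.parse import urlencode

# Example expressions per objective. All thresholds are placeholders.
OBJECTIVE_QUERIES = {
    # Reliability: any scrape target that is currently down.
    "reliability": "up == 0",
    # Performance: hypothetical application metric and threshold.
    "performance": "inference_frames_per_second < 10",
    # Hardware health: GPU temperature in degrees Celsius.
    "hardware_health": "DCGM_FI_DEV_GPU_TEMP > 85",
}


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": promql})


if __name__ == "__main__":
    # Hostname is an assumption; the port matches the component table.
    base = "http://prometheus:9090"
    for objective, promql in OBJECTIVE_QUERIES.items():
        print(objective, build_query_url(base, promql))
```

The same expressions could serve as a starting point for Grafana or Prometheus alert rules once the thresholds are tuned to the actual hardware and workload.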
