Introduction

The FirstBreath Vision platform relies on a production-grade monitoring stack to ensure high availability, consistent performance, and rapid incident response.

This monitoring system provides full observability into the AI pipeline, from hardware health (GPU temperature) to application-level throughput (inference frames per second).

🏗 System Architecture

The monitoring stack operates alongside the core application services (camera-manager, batch-inference) on a shared network.

🚀 Key Components

| Component | Role | Port |
| --- | --- | --- |
| Grafana | Visualization dashboard and alerting interface. | 3000 |
| Prometheus | Scrapes and stores metrics from all services. | 9090 |
| cAdvisor | Tracks Docker container resource usage (RAM, CPU). | 8080 |
| Node Exporter | Monitors the host OS (Disk, Network I/O). | 9100 |
| DCGM Exporter | Specialized NVIDIA exporter for GPU telemetry. | 9400 |
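
As a quick way to confirm that each component in the table above is reachable, the sketch below probes one HTTP endpoint per service. The ports come from the table; the `localhost` host and the idea of a standalone probe script are assumptions, not part of the platform. The paths are the stock health or metrics endpoints these tools normally expose, so adjust them if your deployment remaps them.

```python
import http.client
from urllib.error import URLError
from urllib.request import urlopen

# Port per component (from the table) plus a stock endpoint to probe.
COMPONENTS = {
    "grafana": (3000, "/api/health"),
    "prometheus": (9090, "/-/healthy"),
    "cadvisor": (8080, "/healthz"),
    "node-exporter": (9100, "/metrics"),
    "dcgm-exporter": (9400, "/metrics"),
}


def endpoint_url(name, host="localhost"):
    """Build the probe URL for a component; `host` is an assumed default."""
    port, path = COMPONENTS[name]
    return f"http://{host}:{port}{path}"


def is_up(name, host="localhost", timeout=2.0):
    """Return True if the component answers its probe URL with HTTP 200."""
    try:
        with urlopen(endpoint_url(name, host), timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, non-2xx status, or a malformed reply.
        return False


if __name__ == "__main__":
    for name in COMPONENTS:
        print(f"{name}: {'up' if is_up(name) else 'unreachable'}")
```

Run from the monitoring host, this prints one line per component. Inside the shared network you would likely pass the service name as `host` instead of `localhost`.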

🎯 Objectives​

  1. Reliability: Detect service crashes or restarts as soon as the next metrics scrape picks them up.
  2. Performance Tuning: Identify bottlenecks (e.g., slow inference or a lagging Redis).
  3. Hardware Health: Catch GPU overheating and memory pressure before they lead to throttling or OOM (Out Of Memory) kills.
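
To make the objectives above concrete, here is a minimal sketch of how two of them could be checked against the Prometheus HTTP API (`/api/v1/query`). The `up` metric is built into Prometheus, and `DCGM_FI_DEV_GPU_TEMP` is the GPU temperature gauge exported by DCGM Exporter. The base URL, the check names, and the 85 °C threshold are illustrative assumptions, not values from the platform's configuration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS = "http://localhost:9090"  # assumed address; port from the component table

# PromQL expressions that return series only while the condition holds.
CHECKS = {
    "target_down": "up == 0",               # a scrape target stopped answering
    "gpu_hot": "DCGM_FI_DEV_GPU_TEMP > 85",  # threshold is illustrative only
}


def query_url(expr, base=PROMETHEUS):
    """Build an instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def firing_series(payload):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each vector sample carries its value as [timestamp, "string number"].
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def run_check(name):
    """Run one named check against Prometheus and return the matching series."""
    with urlopen(query_url(CHECKS[name]), timeout=5) as resp:
        return firing_series(json.load(resp))


if __name__ == "__main__":
    for name in CHECKS:
        for labels, value in run_check(name):
            print(f"{name}: {labels} = {value}")
```

An empty result means the condition is not currently met. In practice the same expressions would more likely live in Prometheus alerting rules or Grafana alerts than in a polling script.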