Monitoring & Observability

This document describes the monitoring stack deployed to ensure solution availability and performance.

Monitoring Architecture

Components

Prometheus

  • Image: prom/prometheus:latest
  • Scrape interval: 15 seconds
  • Retention: 15 days (--storage.tsdb.retention.time=15d)
  • Network: monitor-net + dokploy-network
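
As a rough illustration, the settings above could map onto a Docker Compose service like the one below. This is a sketch, not the deployed file: the service name, the config and volume paths, and the assumption that dokploy-network is an external network managed by Dokploy are all illustrative.

```yaml
# Hypothetical Compose fragment; names and paths are assumptions.
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"   # 15-day retention, as listed above
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro   # assumed config location
      - prometheus-data:/prometheus                           # assumed named volume for the TSDB
    networks:
      - monitor-net
      - dokploy-network

volumes:
  prometheus-data:

networks:
  monitor-net:
  dokploy-network:
    external: true   # assumption: created outside this file
```

Setting `command` replaces the image's default arguments, which is why the config file and storage path are repeated next to the retention flag.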

Configured Targets

| Job | Target | Port | Metrics |
|---|---|---|---|
| prometheus | monitoring-prometheus | 9090 | Self-monitoring |
| node-exporter | monitoring-node-exporter | 9100 | CPU, RAM, disk, network |
| cadvisor | monitoring-cadvisor | 8080 | CPU/RAM/network per container |
| dcgm-exporter | monitoring-dcgm-exporter | 9400 | GPU usage, temperature, VRAM |
| camera-manager | control-hub-camera-manager | 4000 | Business metrics (RTSP streams, frames) |
| batch-inference | control-hub-batch-inference | 4002 | Inference latency, batch size, throughput |
| redis-worker | control-hub-redis-worker | 4001 | Jobs processed, latency, errors |
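
For reference, a minimal prometheus.yml consistent with the targets listed above might look like the following. It assumes every target exposes metrics on the default /metrics path and is reachable by the hostname shown; the actual file may group or label targets differently.

```yaml
# Sketch of a scrape configuration; metrics paths and labels are assumptions.
global:
  scrape_interval: 15s   # applies to every job unless overridden

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["monitoring-prometheus:9090"]
  - job_name: node-exporter
    static_configs:
      - targets: ["monitoring-node-exporter:9100"]
  - job_name: cadvisor
    static_configs:
      - targets: ["monitoring-cadvisor:8080"]
  - job_name: dcgm-exporter
    static_configs:
      - targets: ["monitoring-dcgm-exporter:9400"]
  - job_name: camera-manager
    static_configs:
      - targets: ["control-hub-camera-manager:4000"]
  - job_name: batch-inference
    static_configs:
      - targets: ["control-hub-batch-inference:4002"]
  - job_name: redis-worker
    static_configs:
      - targets: ["control-hub-redis-worker:4001"]
```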

Grafana

  • Image: grafana/grafana:latest
  • Authentication: GitHub OAuth SSO (no login/password)
    • Authorized organizations configured
    • Auto-assign Admin role
    • Login form disabled
  • Provisioning: datasources and dashboards provisioned automatically via YAML files
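
The authentication settings above can be expressed through Grafana's environment-variable overrides. The fragment below is a sketch under assumptions: the organization name, the variable names holding the OAuth client credentials, and the provisioning mount path are placeholders, not values taken from the deployment.

```yaml
# Hypothetical Compose fragment; credentials and organization are placeholders.
services:
  grafana:
    image: grafana/grafana:latest
    environment:
      GF_AUTH_GITHUB_ENABLED: "true"
      GF_AUTH_GITHUB_CLIENT_ID: "${GITHUB_CLIENT_ID}"           # assumed variable name
      GF_AUTH_GITHUB_CLIENT_SECRET: "${GITHUB_CLIENT_SECRET}"   # assumed variable name
      GF_AUTH_GITHUB_SCOPES: "user:email,read:org"
      GF_AUTH_GITHUB_ALLOWED_ORGANIZATIONS: "example-org"       # placeholder organization
      GF_AUTH_DISABLE_LOGIN_FORM: "true"                        # hides the login/password form
      GF_USERS_AUTO_ASSIGN_ORG_ROLE: "Admin"                    # new users receive the Admin role
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro     # assumed provisioning directory
```

GitHub OAuth also needs Grafana's root URL to match the callback URL registered in the GitHub OAuth application, so a deployment like this would normally set `GF_SERVER_ROOT_URL` as well.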

Pre-configured Dashboards

| Dashboard | File | Displayed Metrics |
|---|---|---|
| Custom Dashboard | custom_dashboard.json | Business metrics (inference, cameras) |
| DCGM (GPU) | dcgm.json | GPU usage, temperature, VRAM |
| Node Exporter | node-exporter.json | Host CPU, RAM, disk, network |
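
Grafana loads provisioned datasources and dashboards from YAML files under its provisioning directory. The two documents below sketch what those files might contain; the file paths, the provider name, and the dashboard directory are assumptions, and the datasource URL simply reuses the Prometheus target name from the scrape configuration.

```yaml
# provisioning/datasources/prometheus.yml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://monitoring-prometheus:9090
    isDefault: true
---
# provisioning/dashboards/default.yml (hypothetical path)
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      # Assumed directory holding custom_dashboard.json, dcgm.json, node-exporter.json
      path: /var/lib/grafana/dashboards
```

In practice these are two separate files; they are shown as one multi-document YAML stream only for brevity.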

Node Exporter

  • Image: prom/node-exporter:latest
  • Access: read-only mount of /proc, /sys, /
  • Exclusions: mount points under /sys, /proc, /dev, /host, and /etc (mostly virtual filesystems) are left out of filesystem metrics
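
A Compose service matching this description could look like the sketch below. The container-side mount paths (/host/proc, /host/sys, /rootfs) follow a common convention and are assumptions, as is the exact exclusion regex.

```yaml
# Hypothetical Compose fragment; mount targets and flags are assumptions.
services:
  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      # "$$" is Compose's escape for a literal "$" in the regex
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
```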

cAdvisor

  • Image: gcr.io/cadvisor/cadvisor:v0.47.2 (pinned version for cgroup v2 stability)
  • Access: Docker socket + /sys, /var/lib/docker, /dev/disk mounts
  • Mode: privileged + /dev/kmsg access
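
The access requirements above translate roughly into the following Compose service. This is a sketch: the service name and the read-only flags on the mounts are assumptions.

```yaml
# Hypothetical Compose fragment; read-only flags are assumptions.
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2   # pinned, as noted above
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg                   # kernel message buffer access
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
```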

DCGM Exporter (GPU)

  • Image: nvidia/dcgm-exporter:latest
  • Required: NVIDIA GPU + DCGM drivers
  • Capabilities: SYS_ADMIN
  • Metrics: GPU usage, temperature, VRAM, power, clock speed
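
One way to grant the exporter GPU access in Compose is a device reservation, sketched below. Whether the deployment uses this form or `runtime: nvidia` is not stated in this document, so treat the `deploy` block as an assumption; either approach requires the NVIDIA Container Toolkit on the host.

```yaml
# Hypothetical Compose fragment; the GPU reservation style is an assumption.
services:
  dcgm-exporter:
    image: nvidia/dcgm-exporter:latest
    cap_add:
      - SYS_ADMIN
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all            # expose every GPU on the host
              capabilities: [gpu]
```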

Docker Healthchecks

Every critical service has a Docker healthcheck:

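| Service | Test | Interval | Timeout | Start Period | Retries |
|---|---|---|---|---|---|
| MySQL | mysqladmin ping -h localhost | 10s | 5s | - | 5 |
| Redis | redis-cli -a $PASS ping | 10s | 5s | - | 5 |
| Camera Manager | pgrep -f src/manager.py | 30s | 10s | 10s | 3 |
| Batch Inference | pgrep -f inference_service | 30s | 10s | 30s | 3 |
| Redis Worker | pgrep -f python worker.py | 30s | 10s | 5s | 3 |
| Showcase | wget --spider http://127.0.0.1:3000 | 30s | 10s | 40s | 3 |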
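
In a Compose file, these checks are declared per service under `healthcheck`. The sketch below shows three of them; the service names are assumptions, and other service keys (image, volumes, and so on) are omitted.

```yaml
# Hypothetical Compose fragment; service names are assumptions.
services:
  mysql:
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 10s
      timeout: 5s
      retries: 5
  batch-inference:
    healthcheck:
      test: ["CMD", "pgrep", "-f", "inference_service"]
      interval: 30s
      timeout: 10s
      start_period: 30s   # failures during this window do not count toward retries
      retries: 3
  redis-worker:
    healthcheck:
      # The pattern is passed as a single argument so pgrep matches the full command line
      test: ["CMD", "pgrep", "-f", "python worker.py"]
      interval: 30s
      timeout: 10s
      start_period: 5s
      retries: 3
```

The exec form (`CMD` with an argument list) avoids shell quoting issues; a check that relies on shell variable expansion, such as the Redis one, would typically use `CMD-SHELL` instead.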

Availability

| Mechanism | Description |
|---|---|
| API Replicas | 2 instances with Traefik load balancing |
| Restart Policies | unless-stopped or always on all services |
| Healthchecks | Automatic detection of failing services |
| Depends_on | Startup ordering (MySQL → Redis → API) |
| Start Period | Grace delay for slow-starting services (batch-inference: 30s) |
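
The mechanisms above can be combined in Compose roughly as follows. This is a sketch with assumptions: the service names are illustrative, images and Traefik labels are omitted, and gating startup on `service_healthy` is one possible reading of the ordering, since plain `depends_on` only waits for the container to start.

```yaml
# Hypothetical Compose fragment; names and conditions are assumptions.
services:
  mysql:
    restart: unless-stopped
  redis:
    restart: unless-stopped
    depends_on:
      mysql:
        condition: service_healthy   # wait for the MySQL healthcheck to pass
  api:
    restart: always
    depends_on:
      redis:
        condition: service_healthy   # wait for the Redis healthcheck to pass
    deploy:
      replicas: 2                    # two instances behind the load balancer
```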