# Monitoring & Observability
This document describes the monitoring stack deployed to ensure solution availability and performance.
## Monitoring Architecture

### Components
#### Prometheus

- Image: `prom/prometheus:latest`
- Scrape interval: 15 seconds
- Retention: 15 days (`--storage.tsdb.retention.time=15d`)
- Networks: `monitor-net` + `dokploy-network`
##### Configured Targets
| Job | Target | Port | Metrics |
|---|---|---|---|
| `prometheus` | monitoring-prometheus | 9090 | Self-monitoring |
| `node-exporter` | monitoring-node-exporter | 9100 | CPU, RAM, disk, network |
| `cadvisor` | monitoring-cadvisor | 8080 | CPU/RAM/network per container |
| `dcgm-exporter` | monitoring-dcgm-exporter | 9400 | GPU usage, temperature, VRAM |
| `camera-manager` | control-hub-camera-manager | 4000 | Business metrics (RTSP streams, frames) |
| `batch-inference` | control-hub-batch-inference | 4002 | Inference latency, batch size, throughput |
| `redis-worker` | control-hub-redis-worker | 4001 | Jobs processed, latency, errors |
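A `prometheus.yml` matching the targets above could look like the following sketch. Hostnames and ports are taken from the table; the global scrape interval and retention flag come from the Prometheus bullets. The actual file may group targets differently:

```yaml
# prometheus.yml — sketch of the scrape configuration described above
global:
  scrape_interval: 15s  # matches the 15-second interval noted above

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['monitoring-prometheus:9090']
  - job_name: node-exporter
    static_configs:
      - targets: ['monitoring-node-exporter:9100']
  - job_name: cadvisor
    static_configs:
      - targets: ['monitoring-cadvisor:8080']
  - job_name: dcgm-exporter
    static_configs:
      - targets: ['monitoring-dcgm-exporter:9400']
  - job_name: camera-manager
    static_configs:
      - targets: ['control-hub-camera-manager:4000']
  - job_name: batch-inference
    static_configs:
      - targets: ['control-hub-batch-inference:4002']
  - job_name: redis-worker
    static_configs:
      - targets: ['control-hub-redis-worker:4001']
```

Retention is not set in this file; it is passed on the command line as `--storage.tsdb.retention.time=15d`.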
#### Grafana

- Image: `grafana/grafana:latest`
- Authentication: GitHub OAuth SSO (no login/password)
  - Authorized organizations configured
  - Auto-assigned `Admin` role
  - Login form disabled
- Provisioning: datasources and dashboards are provisioned automatically via YAML files
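As an illustration of the provisioning mechanism, a datasource file for this setup might look like the sketch below. The file path and datasource `name` are hypothetical; only the Prometheus URL follows from the stack described above:

```yaml
# provisioning/datasources/prometheus.yml — hypothetical example
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://monitoring-prometheus:9090
    isDefault: true
```

The GitHub OAuth behavior (allowed organizations, auto-assigned role, disabled login form) is typically driven by `GF_AUTH_GITHUB_*`, `GF_USERS_AUTO_ASSIGN_ORG_ROLE`, and `GF_AUTH_DISABLE_LOGIN_FORM` environment variables on the Grafana container.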
##### Pre-configured Dashboards
| Dashboard | File | Displayed Metrics |
|---|---|---|
| Custom Dashboard | `custom_dashboard.json` | Business metrics (inference, cameras) |
| DCGM (GPU) | `dcgm.json` | GPU usage, temperature, VRAM |
| Node Exporter | `node-exporter.json` | Host CPU, RAM, disk, network |
#### Node Exporter

- Image: `prom/node-exporter:latest`
- Access: read-only mounts of `/proc`, `/sys`, and `/`
- Exclusions: virtual filesystems (`/sys`, `/proc`, `/dev`, `/host`, `/etc`)
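A compose service matching these bullets could be sketched as follows. The container-side mount paths and the exclusion regex are illustrative; the actual compose file may differ:

```yaml
# docker-compose sketch — node-exporter with read-only host mounts
node-exporter:
  image: prom/node-exporter:latest
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--path.rootfs=/rootfs'
    # exclude virtual filesystems listed above ($$ escapes $ for compose)
    - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
```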
#### cAdvisor

- Image: `gcr.io/cadvisor/cadvisor:v0.47.2` (version pinned for cgroup v2 stability)
- Access: Docker socket + `/sys`, `/var/lib/docker`, and `/dev/disk` mounts
- Mode: `privileged` + `/dev/kmsg` access
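The cAdvisor bullets translate to roughly the following compose service (a sketch; read-only flags on the mounts are an assumption):

```yaml
# docker-compose sketch — cAdvisor with the mounts and mode described above
cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.47.2  # pinned for cgroup v2 stability
  privileged: true
  devices:
    - /dev/kmsg
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock:ro
    - /sys:/sys:ro
    - /var/lib/docker:/var/lib/docker:ro
    - /dev/disk:/dev/disk:ro
```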
#### DCGM Exporter (GPU)

- Image: `nvidia/dcgm-exporter:latest`
- Requirements: NVIDIA GPU + DCGM drivers
- Capabilities: `SYS_ADMIN`
- Metrics: GPU usage, temperature, VRAM, power, clock speed
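A minimal compose sketch for the GPU exporter, assuming GPU access is granted through the Compose `deploy.resources` device reservation (the deployment may instead use the legacy `runtime: nvidia` approach):

```yaml
# docker-compose sketch — DCGM exporter with GPU access
dcgm-exporter:
  image: nvidia/dcgm-exporter:latest
  cap_add:
    - SYS_ADMIN  # required by the exporter, as noted above
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
```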
## Docker Healthchecks
Every critical service has a Docker healthcheck:
| Service | Test | Interval | Timeout | Start Period | Retries |
|---|---|---|---|---|---|
| MySQL | `mysqladmin ping -h localhost` | 10s | 5s | — | 5 |
| Redis | `redis-cli -a $PASS ping` | 10s | 5s | — | 5 |
| Camera Manager | `pgrep -f src/manager.py` | 30s | 10s | 10s | 3 |
| Batch Inference | `pgrep -f inference_service` | 30s | 10s | 30s | 3 |
| Redis Worker | `pgrep -f python worker.py` | 30s | 10s | 5s | 3 |
| Showcase | `wget --spider http://127.0.0.1:3000` | 30s | 10s | 40s | 3 |
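Taking the Batch Inference row as an example, the corresponding compose healthcheck could be sketched like this (exec-form `test` is an assumption; the file may use shell form):

```yaml
# docker-compose sketch — healthcheck from the Batch Inference row above
batch-inference:
  healthcheck:
    test: ["CMD", "pgrep", "-f", "inference_service"]
    interval: 30s
    timeout: 10s
    start_period: 30s   # grace period for model loading before failures count
    retries: 3
```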
## Availability
| Mechanism | Description |
|---|---|
| API replicas | 2 instances with Traefik load balancing |
| Restart policies | `unless-stopped` or `always` on all services |
| Healthchecks | Automatic detection of failing services |
| `depends_on` | Startup ordering (MySQL → Redis → API) |
| Start period | Grace delay for slow-starting services (batch-inference: 30s) |
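Combined on a single service, these mechanisms could look like the following sketch (the `api` service name is hypothetical; whether `depends_on` uses `condition: service_healthy` or plain ordering is an assumption):

```yaml
# docker-compose sketch — replicas, restart policy, and ordered startup
api:
  restart: unless-stopped
  deploy:
    replicas: 2          # load-balanced by Traefik
  depends_on:
    mysql:
      condition: service_healthy   # waits on the MySQL healthcheck
    redis:
      condition: service_healthy   # waits on the Redis healthcheck
```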