# RE Workflow Dashboard - Metrics Reference
## 📊 Complete KPI List with Data Sources
### **Section 1: API Overview**
| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Request Rate** | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | HTTP requests per second (all endpoints) |
| **Error Rate** | `sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m])) / sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | Percentage of failed HTTP requests |
| **P95 Latency** | `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))` | Backend metrics | 95th percentile response time (seconds) |
| **API Status** | `up{job="re-workflow-backend"}` | Prometheus | Backend service up/down status (1=up, 0=down) |
| **Request Rate by Method** | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m])) by (method)` | Backend metrics | Requests per method (GET, POST, etc.) |
| **Response Time Percentiles** | Same histogram query as P95 above, with the quantile set to `0.50`, `0.95`, or `0.99` | Backend metrics | Response time distribution (P50, P95, P99) |
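The abbreviated **Response Time Percentiles** entry follows the same pattern as the P95 query, with only the quantile changed, and the **Error Rate** ratio can be rescaled to a 0-100% value directly in PromQL. A minimal sketch using the metric names from the table above:
```promql
# P99 latency: same histogram as the P95 panel, quantile set to 0.99.
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))

# Error rate rescaled from a 0-1 ratio to a 0-100 percentage.
100 * (
  sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m]))
  /
  sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))
)
```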
---
### **Section 2: Logs**
| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Errors (Time Range)** | `count_over_time({job="re-workflow-backend", level="error"}[...])` | Loki logs | Total error log entries in selected time range |
| **Warnings (Time Range)** | `count_over_time({job="re-workflow-backend", level="warn"}[...])` | Loki logs | Total warning log entries in selected time range |
| **TAT Breaches (Time Range)** | Log filter for TAT breaches | Loki logs | TAT breach events logged |
| **Auth Failures (Time Range)** | Log filter for auth failures | Loki logs | Authentication failure events |
| **Recent Errors & Warnings** | `{job="re-workflow-backend"} \|~ "(?i)(error\|warn)"` | Loki logs | Live log stream of errors and warnings |
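The Loki panels use LogQL. A minimal sketch of the two query styles above, with an explicit `1m` window for the counting query and a case-insensitive regex for the live stream (adjust the selectors if your labels differ):
```logql
# Error log lines per minute (same selector as the "Errors (Time Range)" panel).
sum(count_over_time({job="re-workflow-backend", level="error"}[1m]))

# Live stream of errors and warnings, case-insensitive match on the line content.
{job="re-workflow-backend"} |~ "(?i)(error|warn)"
```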
---
### **Section 3: Node.js Runtime** (Process-Level Metrics)
| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Node.js Process Memory (Heap)** | `process_resident_memory_bytes{job="re-workflow-backend"}` <br> `nodejs_heap_size_used_bytes{job="re-workflow-backend"}` <br> `nodejs_heap_size_total_bytes{job="re-workflow-backend"}` | Node.js metrics (prom-client) | Node.js process memory usage: <br>- RSS (Resident Set Size) <br>- Heap Used <br>- Heap Total |
| **Node.js Event Loop Lag** | `nodejs_eventloop_lag_seconds{job="re-workflow-backend"}` | Node.js metrics | Event loop lag in seconds (high = performance issue) |
| **Node.js Active Handles & Requests** | `nodejs_active_handles_total{job="re-workflow-backend"}` <br> `nodejs_active_requests_total{job="re-workflow-backend"}` | Node.js metrics | Active file handles and pending async requests |
| **Node.js Process CPU Usage** | `rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])` | Node.js metrics | CPU usage by Node.js process only (0-1 = 0-100%) |
**Key Point**: These metrics track the **Node.js application process** specifically, not the entire host system.
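Two derived queries are often useful alongside these metrics, sketched here with the same metric names as the table: heap utilization as a percentage, and event loop lag in milliseconds (the units used by the alert thresholds later in this document).
```promql
# Heap utilization as a percentage of the currently allocated heap.
100 * (
  nodejs_heap_size_used_bytes{job="re-workflow-backend"}
  /
  nodejs_heap_size_total_bytes{job="re-workflow-backend"}
)

# Event loop lag converted to milliseconds (compare against the 100ms / 500ms thresholds).
nodejs_eventloop_lag_seconds{job="re-workflow-backend"} * 1000
```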
---
### **Section 4: Redis & Queue Status**
| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Redis Status** | `redis_up` | Redis Exporter | Redis server status (1=up, 0=down) |
| **Redis Connections** | `redis_connected_clients` | Redis Exporter | Number of active client connections to Redis |
| **Redis Memory** | `redis_memory_used_bytes` | Redis Exporter | Memory used by Redis (bytes) |
| **TAT Queue Waiting** | `queue_jobs_waiting{queue_name="tatQueue"}` | Backend queue metrics | Jobs waiting in TAT notification queue |
| **Pause/Resume Queue Waiting** | `queue_jobs_waiting{queue_name="pauseResumeQueue"}` | Backend queue metrics | Jobs waiting in pause/resume queue |
| **TAT Queue Failed** | `queue_jobs_failed{queue_name="tatQueue"}` | Backend queue metrics | Failed TAT notification jobs (should be 0) |
| **Pause/Resume Queue Failed** | `queue_jobs_failed{queue_name="pauseResumeQueue"}` | Backend queue metrics | Failed pause/resume jobs (should be 0) |
| **All Queues - Job Status** | `queue_jobs_waiting` <br> `queue_jobs_active` <br> `queue_jobs_delayed` | Backend queue metrics | Timeline of job status across all queues (stacked) |
| **Redis Commands Rate** | `rate(redis_commands_processed_total[1m])` | Redis Exporter | Redis commands executed per second |
**Key Point**: Queue metrics are collected by the backend every 15 seconds via the BullMQ queue API.
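A few aggregate queries over the queue metrics above are handy for ad-hoc checks (a sketch using the `queue_name` label shown in the table):
```promql
# Total jobs waiting across all queues (matches the stacked "All Queues" panel).
sum(queue_jobs_waiting)

# Waiting and failed jobs broken out per queue; failed counts should stay at 0.
sum by (queue_name) (queue_jobs_waiting)
sum by (queue_name) (queue_jobs_failed)
```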
---
### **Section 5: System Resources (Host)** (Host-Level Metrics)
| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Host CPU Usage (All Cores)** | `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | Node Exporter | Total CPU usage across all cores on host machine (%) |
| **Host Memory Usage (RAM)** | `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100` | Node Exporter | RAM usage on host machine (%) |
| **Host Disk Usage (/root)** | `100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100)` | Node Exporter | Disk usage of root filesystem (%) |
| **Disk Space Left** | `node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}` | Node Exporter | Available disk space on the root filesystem (raw value in bytes; the panel unit converts it to GB) |
**Key Point**: These metrics track the **entire host system**, not just the Node.js process.
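The raw `node_filesystem_avail_bytes` value is in bytes; it can be converted to GiB in the query itself, and PromQL's `predict_linear` gives a rough forecast of where it is heading (a sketch, not part of the current dashboard):
```promql
# Available space on the root filesystem, converted from bytes to GiB.
node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} / 1024 / 1024 / 1024

# Rough projection of available space 24 hours from now, based on the last 6 hours.
predict_linear(node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}[6h], 24 * 3600)
```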
---
## 🔍 Data Source Summary
| Exporter/Service | Port | Metrics Provided | Collection Interval |
|------------------|------|------------------|---------------------|
| **RE Workflow Backend** | 5000 | HTTP metrics, custom business metrics, Node.js runtime | 10s (Prometheus scrape) |
| **Node Exporter** | 9100 | Host system metrics (CPU, memory, disk, network) | 15s (Prometheus scrape) |
| **Redis Exporter** | 9121 | Redis server metrics (connections, memory, commands) | 15s (Prometheus scrape) |
| **Queue Metrics** | 5000 | BullMQ queue job counts (via backend) | 15s (internal collection) |
| **Loki** | 3100 | Application logs | Real-time streaming |
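A quick way to confirm that every scrape target in this table is actually being collected is to query `up` across the relevant jobs. Only `re-workflow-backend` is a job name confirmed above; the exporter job names below are assumptions and should be adjusted to match the `job_name` entries in your prometheus.yml.
```promql
# 1 = target up and scraped, 0 = target down. Exporter job names are assumed;
# replace them with the job_name values from your prometheus.yml.
up{job=~"re-workflow-backend|node-exporter|redis-exporter"}
```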
---
## 🎯 Renamed Panels for Clarity
### Before → After
**Node.js Runtime Section:**
- ❌ "Memory Usage" → ✅ "Node.js Process Memory (Heap)"
- ❌ "CPU Usage" → ✅ "Node.js Process CPU Usage"
- ❌ "Event Loop Lag" → ✅ "Node.js Event Loop Lag"
- ❌ "Active Handles & Requests" → ✅ "Node.js Active Handles & Requests"
**System Resources Section:**
- ❌ "System CPU Usage" → ✅ "Host CPU Usage (All Cores)"
- ❌ "System Memory Usage" → ✅ "Host Memory Usage (RAM)"
- ❌ "System Disk Usage" → ✅ "Host Disk Usage (/root)"
---
## 📈 Understanding the Difference
### **Process vs Host Metrics**
| Aspect | Node.js Process Metrics | Host System Metrics |
|--------|------------------------|---------------------|
| **Scope** | Single Node.js application | Entire server/container |
| **CPU** | CPU used by Node.js only | CPU used by all processes |
| **Memory** | Node.js heap memory | Total RAM on machine |
| **Purpose** | Application performance | Infrastructure health |
| **Example Use** | Detect memory leaks in app | Ensure server has capacity |
**Example Scenario:**
- **Node.js Process CPU**: 15% → Your app is using 15% of one CPU core
- **Host CPU Usage**: 75% → The entire server is at 75% CPU (all processes combined)
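To see both views side by side in Grafana or the Prometheus console, the two CPU measures can be put on the same 0-100 scale (the host query is the same one used by the host panel above):
```promql
# Node.js process CPU, as a percentage of a single core.
100 * rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])

# Host CPU, averaged across all cores.
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```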
---
## 🚨 Alert Thresholds (Recommended)
| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| **Node.js Process Memory** | 80% of heap | 90% of heap | Investigate memory leaks |
| **Host Memory Usage** | 70% | 85% | Scale up or optimize |
| **Host CPU Usage** | 60% | 80% | Scale horizontally |
| **Redis Memory** | 500MB | 1GB | Review Redis usage |
| **Queue Jobs Waiting** | >10 | >50 | Check worker health |
| **Queue Jobs Failed** | >0 | >5 | Immediate investigation |
| **Event Loop Lag** | >100ms | >500ms | Performance optimization needed |
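If these thresholds are wired into Prometheus alerting rules, the "Critical" column maps to expressions along the following lines (a sketch; each would go in a rule's `expr` field, usually paired with a `for:` duration to avoid flapping):
```promql
# Critical: heap usage above 90% of the allocated heap.
(nodejs_heap_size_used_bytes{job="re-workflow-backend"}
  / nodejs_heap_size_total_bytes{job="re-workflow-backend"}) > 0.90

# Critical: more than 5 failed jobs in any queue.
queue_jobs_failed > 5

# Critical: event loop lag above 500ms.
nodejs_eventloop_lag_seconds{job="re-workflow-backend"} > 0.5
```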
---
## 🔧 Troubleshooting
### No Data Showing?
1. **Check Prometheus Targets**: http://localhost:9090/targets
   - All targets should show "UP" status
2. **Test Metric Availability**:
   ```promql
   up{job="re-workflow-backend"}
   ```
   Should return `1`
3. **Check Time Range**: Set to "Last 15 minutes" in Grafana
4. **Verify Backend**: http://localhost:5000/metrics should show all metrics
### Metrics Not Updating?
1. **Backend**: Ensure backend is running with metrics collection enabled
2. **Prometheus**: Check scrape interval in prometheus.yml
3. **Queue Metrics**: Verify queue metrics collection started (check backend logs for "Queue Metrics ✅")
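Two quick staleness checks can be run in the Prometheus console; both rely on metrics Prometheus attaches to every target, so no extra instrumentation is needed:
```promql
# Seconds since Prometheus last stored a sample for the backend target;
# should stay close to the 10s scrape interval listed above.
time() - timestamp(up{job="re-workflow-backend"})

# How long the last scrape of the backend took.
scrape_duration_seconds{job="re-workflow-backend"}
```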
---
## 📚 Additional Resources
- **Prometheus Query Language**: https://prometheus.io/docs/prometheus/latest/querying/basics/
- **Grafana Dashboard Guide**: https://grafana.com/docs/grafana/latest/dashboards/
- **Node Exporter Metrics**: https://github.com/prometheus/node_exporter
- **Redis Exporter Metrics**: https://github.com/oliver006/redis_exporter
- **BullMQ Monitoring**: https://docs.bullmq.io/guide/metrics