# RE Workflow Dashboard - Metrics Reference
## 📊 Complete KPI List with Data Sources
### **Section 1: API Overview**
| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Request Rate** | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | HTTP requests per second (all endpoints) |
| **Error Rate** | `sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m])) / sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | Percentage of failed HTTP requests |
| **P95 Latency** | `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))` | Backend metrics | 95th percentile response time (seconds) |
| **API Status** | `up{job="re-workflow-backend"}` | Prometheus | Backend service up/down status (1=up, 0=down) |
| **Request Rate by Method** | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m])) by (method)` | Backend metrics | Requests per method (GET, POST, etc.) |
| **Response Time Percentiles** | Same histogram query as P95 above, with the quantile set to `0.50`, `0.95`, or `0.99` | Backend metrics | Response time distribution (P50, P95, P99) |
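The abbreviated **Response Time Percentiles** entry follows the same pattern as the P95 query, with only the quantile changed, and the **Error Rate** ratio can be rescaled to a 0-100% value directly in PromQL. A minimal sketch using the metric names from the table above:
```promql
# P99 latency: same histogram as the P95 panel, quantile set to 0.99.
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))

# Error rate rescaled from a 0-1 ratio to a 0-100 percentage.
100 * (
  sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m]))
  /
  sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))
)
```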
---
### **Section 2: Logs**
| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Errors (Time Range)** | `count_over_time({job="re-workflow-backend", level="error"}[...])` | Loki logs | Total error log entries in selected time range |
| **Warnings (Time Range)** | `count_over_time({job="re-workflow-backend", level="warn"}[...])` | Loki logs | Total warning log entries in selected time range |
| **TAT Breaches (Time Range)** | Log filter for TAT breaches | Loki logs | TAT breach events logged |
| **Auth Failures (Time Range)** | Log filter for auth failures | Loki logs | Authentication failure events |
| **Recent Errors & Warnings** | `{job="re-workflow-backend"} \|~ "(?i)(error\|warn)"` | Loki logs | Live log stream of errors and warnings |
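The Loki panels use LogQL. A minimal sketch of the two query styles above, with an explicit `1m` window for the counting query and a case-insensitive regex for the live stream (adjust the selectors if your labels differ):
```logql
# Error log lines per minute (same selector as the "Errors (Time Range)" panel).
sum(count_over_time({job="re-workflow-backend", level="error"}[1m]))

# Live stream of errors and warnings, case-insensitive match on the line content.
{job="re-workflow-backend"} |~ "(?i)(error|warn)"
```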
---
### **Section 3: Node.js Runtime** (Process-Level Metrics)
| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Node.js Process Memory (Heap)** | `process_resident_memory_bytes{job="re-workflow-backend"}` <br> `nodejs_heap_size_used_bytes{job="re-workflow-backend"}` <br> `nodejs_heap_size_total_bytes{job="re-workflow-backend"}` | Node.js metrics (prom-client) | Node.js process memory usage: <br>- RSS (Resident Set Size) <br>- Heap Used <br>- Heap Total |
| **Node.js Event Loop Lag** | `nodejs_eventloop_lag_seconds{job="re-workflow-backend"}` | Node.js metrics | Event loop lag in seconds (high = performance issue) |
| **Node.js Active Handles & Requests** | `nodejs_active_handles_total{job="re-workflow-backend"}` <br> `nodejs_active_requests_total{job="re-workflow-backend"}` | Node.js metrics | Active file handles and pending async requests |
| **Node.js Process CPU Usage** | `rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])` | Node.js metrics | CPU usage by Node.js process only (0-1 = 0-100%) |
**Key Point**: These metrics track the **Node.js application process** specifically, not the entire host system.
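Two derived queries are often useful alongside these metrics, sketched here with the same metric names as the table: heap utilization as a percentage, and event loop lag in milliseconds (the units used by the alert thresholds later in this document).
```promql
# Heap utilization as a percentage of the currently allocated heap.
100 * (
  nodejs_heap_size_used_bytes{job="re-workflow-backend"}
  /
  nodejs_heap_size_total_bytes{job="re-workflow-backend"}
)

# Event loop lag converted to milliseconds (compare against the 100ms / 500ms thresholds).
nodejs_eventloop_lag_seconds{job="re-workflow-backend"} * 1000
```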
---
### **Section 4: Redis & Queue Status**
| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Redis Status** | `redis_up` | Redis Exporter | Redis server status (1=up, 0=down) |
| **Redis Connections** | `redis_connected_clients` | Redis Exporter | Number of active client connections to Redis |
| **Redis Memory** | `redis_memory_used_bytes` | Redis Exporter | Memory used by Redis (bytes) |
| **TAT Queue Waiting** | `queue_jobs_waiting{queue_name="tatQueue"}` | Backend queue metrics | Jobs waiting in TAT notification queue |
| **Pause/Resume Queue Waiting** | `queue_jobs_waiting{queue_name="pauseResumeQueue"}` | Backend queue metrics | Jobs waiting in pause/resume queue |
| **TAT Queue Failed** | `queue_jobs_failed{queue_name="tatQueue"}` | Backend queue metrics | Failed TAT notification jobs (should be 0) |
| **Pause/Resume Queue Failed** | `queue_jobs_failed{queue_name="pauseResumeQueue"}` | Backend queue metrics | Failed pause/resume jobs (should be 0) |
| **All Queues - Job Status** | `queue_jobs_waiting` <br> `queue_jobs_active` <br> `queue_jobs_delayed` | Backend queue metrics | Timeline of job status across all queues (stacked) |
| **Redis Commands Rate** | `rate(redis_commands_processed_total[1m])` | Redis Exporter | Redis commands executed per second |
**Key Point**: Queue metrics are collected by the backend every 15 seconds via the BullMQ queue API.
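A few aggregate queries over the queue metrics above are handy for ad-hoc checks (a sketch using the `queue_name` label shown in the table):
```promql
# Total jobs waiting across all queues (matches the stacked "All Queues" panel).
sum(queue_jobs_waiting)

# Waiting and failed jobs broken out per queue; failed counts should stay at 0.
sum by (queue_name) (queue_jobs_waiting)
sum by (queue_name) (queue_jobs_failed)
```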
---
### **Section 5: System Resources (Host)** (Host-Level Metrics)
| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Host CPU Usage (All Cores)** | `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | Node Exporter | Total CPU usage across all cores on host machine (%) |
| **Host Memory Usage (RAM)** | `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100` | Node Exporter | RAM usage on host machine (%) |
| **Host Disk Usage (/root)** | `100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100)` | Node Exporter | Disk usage of root filesystem (%) |
| **Disk Space Left** | `node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}` | Node Exporter | Available disk space on the root filesystem (raw value in bytes; the panel unit converts it to GB) |
**Key Point**: These metrics track the **entire host system**, not just the Node.js process.
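The raw `node_filesystem_avail_bytes` value is in bytes; it can be converted to GiB in the query itself, and PromQL's `predict_linear` gives a rough forecast of where it is heading (a sketch, not part of the current dashboard):
```promql
# Available space on the root filesystem, converted from bytes to GiB.
node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"} / 1024 / 1024 / 1024

# Rough projection of available space 24 hours from now, based on the last 6 hours.
predict_linear(node_filesystem_avail_bytes{mountpoint="/", fstype!="tmpfs"}[6h], 24 * 3600)
```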
---
## 🔍 Data Source Summary
| Exporter/Service | Port | Metrics Provided | Collection Interval |
|------------------|------|------------------|---------------------|
| **RE Workflow Backend** | 5000 | HTTP metrics, custom business metrics, Node.js runtime | 10s (Prometheus scrape) |
| **Node Exporter** | 9100 | Host system metrics (CPU, memory, disk, network) | 15s (Prometheus scrape) |
| **Redis Exporter** | 9121 | Redis server metrics (connections, memory, commands) | 15s (Prometheus scrape) |
| **Queue Metrics** | 5000 | BullMQ queue job counts (via backend) | 15s (internal collection) |
| **Loki** | 3100 | Application logs | Real-time streaming |
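A quick way to confirm that every scrape target in this table is actually being collected is to query `up` across the relevant jobs. Only `re-workflow-backend` is a job name confirmed above; the exporter job names below are assumptions and should be adjusted to match the `job_name` entries in your prometheus.yml.
```promql
# 1 = target up and scraped, 0 = target down. Exporter job names are assumed;
# replace them with the job_name values from your prometheus.yml.
up{job=~"re-workflow-backend|node-exporter|redis-exporter"}
```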
---
## 🎯 Renamed Panels for Clarity
### Before → After
**Node.js Runtime Section:**
- ❌ "Memory Usage" → ✅ "Node.js Process Memory (Heap)"
- ❌ "CPU Usage" → ✅ "Node.js Process CPU Usage"
- ❌ "Event Loop Lag" → ✅ "Node.js Event Loop Lag"
- ❌ "Active Handles & Requests" → ✅ "Node.js Active Handles & Requests"
**System Resources Section:**
- ❌ "System CPU Usage" → ✅ "Host CPU Usage (All Cores)"
- ❌ "System Memory Usage" → ✅ "Host Memory Usage (RAM)"
- ❌ "System Disk Usage" → ✅ "Host Disk Usage (/root)"
---
## 📈 Understanding the Difference
### **Process vs Host Metrics**
| Aspect | Node.js Process Metrics | Host System Metrics |
|--------|------------------------|---------------------|
| **Scope** | Single Node.js application | Entire server/container |
| **CPU** | CPU used by Node.js only | CPU used by all processes |
| **Memory** | Node.js heap memory | Total RAM on machine |
| **Purpose** | Application performance | Infrastructure health |
| **Example Use** | Detect memory leaks in app | Ensure server has capacity |
**Example Scenario:**
- **Node.js Process CPU**: 15% → Your app is using 15% of one CPU core
- **Host CPU Usage**: 75% → The entire server is at 75% CPU (all processes combined)
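To see both views side by side in Grafana or the Prometheus console, the two CPU measures can be put on the same 0-100 scale (the host query is the same one used by the host panel above):
```promql
# Node.js process CPU, as a percentage of a single core.
100 * rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])

# Host CPU, averaged across all cores.
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```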
---
## 🚨 Alert Thresholds (Recommended)
| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| **Node.js Process Memory** | 80% of heap | 90% of heap | Investigate memory leaks |
| **Host Memory Usage** | 70% | 85% | Scale up or optimize |
| **Host CPU Usage** | 60% | 80% | Scale horizontally |
| **Redis Memory** | 500MB | 1GB | Review Redis usage |
| **Queue Jobs Waiting** | >10 | >50 | Check worker health |
| **Queue Jobs Failed** | >0 | >5 | Immediate investigation |
| **Event Loop Lag** | >100ms | >500ms | Performance optimization needed |
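If these thresholds are wired into Prometheus alerting rules, the "Critical" column maps to expressions along the following lines (a sketch; each would go in a rule's `expr` field, usually paired with a `for:` duration to avoid flapping):
```promql
# Critical: heap usage above 90% of the allocated heap.
(nodejs_heap_size_used_bytes{job="re-workflow-backend"}
  / nodejs_heap_size_total_bytes{job="re-workflow-backend"}) > 0.90

# Critical: more than 5 failed jobs in any queue.
queue_jobs_failed > 5

# Critical: event loop lag above 500ms.
nodejs_eventloop_lag_seconds{job="re-workflow-backend"} > 0.5
```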
---
## 🔧 Troubleshooting
### No Data Showing?
1. **Check Prometheus Targets**: http://localhost:9090/targets
   - All targets should show "UP" status
2. **Test Metric Availability**:
   ```promql
   up{job="re-workflow-backend"}
   ```
   Should return `1`
3. **Check Time Range**: Set to "Last 15 minutes" in Grafana
4. **Verify Backend**: http://localhost:5000/metrics should show all metrics
### Metrics Not Updating?
1. **Backend**: Ensure backend is running with metrics collection enabled
2. **Prometheus**: Check scrape interval in prometheus.yml
3. **Queue Metrics**: Verify queue metrics collection started (check backend logs for "Queue Metrics ✅")
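Two quick staleness checks can be run in the Prometheus console; both rely on metrics Prometheus attaches to every target, so no extra instrumentation is needed:
```promql
# Seconds since Prometheus last stored a sample for the backend target;
# should stay close to the 10s scrape interval listed above.
time() - timestamp(up{job="re-workflow-backend"})

# How long the last scrape of the backend took.
scrape_duration_seconds{job="re-workflow-backend"}
```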
---
## 📚 Additional Resources
- **Prometheus Query Language**: https://prometheus.io/docs/prometheus/latest/querying/basics/
- **Grafana Dashboard Guide**: https://grafana.com/docs/grafana/latest/dashboards/
- **Node Exporter Metrics**: https://github.com/prometheus/node_exporter
- **Redis Exporter Metrics**: https://github.com/oliver006/redis_exporter
- **BullMQ Monitoring**: https://docs.bullmq.io/guide/metrics