# RE Workflow Dashboard - Metrics Reference

## 📊 Complete KPI List with Data Sources

### **Section 1: API Overview**

| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Request Rate** | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | HTTP requests per second (all endpoints) |
| **Error Rate** | `sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m])) / sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | Percentage of failed HTTP requests |
| **P95 Latency** | `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))` | Backend metrics | 95th percentile response time (seconds) |
| **API Status** | `up{job="re-workflow-backend"}` | Prometheus | Backend service up/down status (1=up, 0=down) |
| **Request Rate by Method** | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m])) by (method)` | Backend metrics | Requests per method (GET, POST, etc.) |
| **Response Time Percentiles** | `histogram_quantile(0.50/0.95/0.99, ...)` | Backend metrics | Response time distribution (P50, P95, P99); expanded queries below |

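The percentiles panel runs three variants of the same query, one per quantile, expanded from the P95 query above:

```promql
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))
```
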
---

### **Section 2: Logs**

| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Errors (Time Range)** | `count_over_time({job="re-workflow-backend", level="error"}[...])` | Loki logs | Total error log entries in selected time range (range syntax below) |
| **Warnings (Time Range)** | `count_over_time({job="re-workflow-backend", level="warn"}[...])` | Loki logs | Total warning log entries in selected time range |
| **TAT Breaches (Time Range)** | Log filter for TAT breaches | Loki logs | TAT breach events logged |
| **Auth Failures (Time Range)** | Log filter for auth failures | Loki logs | Authentication failure events |
| **Recent Errors & Warnings** | `{job="re-workflow-backend"} \|= "error" or "warn"` | Loki logs | Live log stream of errors and warnings |

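In Grafana, the `[...]` range selector is typically filled by the dashboard's `$__range` variable, so the count follows the selected time range. A sketch of the error-count query in that form:

```logql
count_over_time({job="re-workflow-backend", level="error"}[$__range])
```
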
---

### **Section 3: Node.js Runtime** (Process-Level Metrics)

| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Node.js Process Memory (Heap)** | `process_resident_memory_bytes{job="re-workflow-backend"}` <br> `nodejs_heap_size_used_bytes{job="re-workflow-backend"}` <br> `nodejs_heap_size_total_bytes{job="re-workflow-backend"}` | Node.js metrics (prom-client) | Node.js process memory usage: <br>- RSS (Resident Set Size) <br>- Heap Used <br>- Heap Total |
| **Node.js Event Loop Lag** | `nodejs_eventloop_lag_seconds{job="re-workflow-backend"}` | Node.js metrics | Event loop lag in seconds (high = performance issue) |
| **Node.js Active Handles & Requests** | `nodejs_active_handles_total{job="re-workflow-backend"}` <br> `nodejs_active_requests_total{job="re-workflow-backend"}` | Node.js metrics | Active file handles and pending async requests |
| **Node.js Process CPU Usage** | `rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])` | Node.js metrics | CPU usage by the Node.js process only (0-1 = 0-100% of one core) |

**Key Point**: These metrics track the **Node.js application process** specifically, not the entire host system.

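Heap pressure is easiest to judge as a ratio of the two heap gauges above (a sketch; the dashboard panel may plot the raw series instead):

```promql
# Fraction of the allocated V8 heap currently in use (0-1)
nodejs_heap_size_used_bytes{job="re-workflow-backend"}
  / nodejs_heap_size_total_bytes{job="re-workflow-backend"}
```

This is the ratio that the heap alert thresholds below (80% / 90%) refer to.
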
---

### **Section 4: Redis & Queue Status**

| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Redis Status** | `redis_up` | Redis Exporter | Redis server status (1=up, 0=down) |
| **Redis Connections** | `redis_connected_clients` | Redis Exporter | Number of active client connections to Redis |
| **Redis Memory** | `redis_memory_used_bytes` | Redis Exporter | Memory used by Redis (bytes) |
| **TAT Queue Waiting** | `queue_jobs_waiting{queue_name="tatQueue"}` | Backend queue metrics | Jobs waiting in TAT notification queue |
| **Pause/Resume Queue Waiting** | `queue_jobs_waiting{queue_name="pauseResumeQueue"}` | Backend queue metrics | Jobs waiting in pause/resume queue |
| **TAT Queue Failed** | `queue_jobs_failed{queue_name="tatQueue"}` | Backend queue metrics | Failed TAT notification jobs (should be 0) |
| **Pause/Resume Queue Failed** | `queue_jobs_failed{queue_name="pauseResumeQueue"}` | Backend queue metrics | Failed pause/resume jobs (should be 0) |
| **All Queues - Job Status** | `queue_jobs_waiting` <br> `queue_jobs_active` <br> `queue_jobs_delayed` | Backend queue metrics | Timeline of job status across all queues (stacked) |
| **Redis Commands Rate** | `rate(redis_commands_processed_total[1m])` | Redis Exporter | Redis commands executed per second |

**Key Point**: Queue metrics are collected by the backend every 15 seconds via the BullMQ queue API.

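Because the job counts are plain gauges, queue health checks can be written directly in PromQL. A sketch using the warning thresholds from the alert table below:

```promql
# Any failed jobs, per queue (should stay at 0)
sum(queue_jobs_failed) by (queue_name) > 0

# Backlog building up in the TAT notification queue
queue_jobs_waiting{queue_name="tatQueue"} > 10
```
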
---

### **Section 5: System Resources (Host)** (Host-Level Metrics)

| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Host CPU Usage (All Cores)** | `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | Node Exporter | Total CPU usage across all cores on host machine (%) |
| **Host Memory Usage (RAM)** | `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100` | Node Exporter | RAM usage on host machine (%) |
| **Host Disk Usage (/root)** | `100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100)` | Node Exporter | Disk usage of root filesystem (%) |
| **Disk Space Left** | `node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}` | Node Exporter | Available disk space (bytes; shown as GB via the panel unit, see below) |

**Key Point**: These metrics track the **entire host system**, not just the Node.js process.

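The raw metric is in bytes; Grafana's unit setting usually handles the GB display, but the conversion can also be done in the query itself (a sketch):

```promql
# Available space on the root filesystem, in GiB (1024^3 bytes)
node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / 1024^3
```
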
---

## 🔍 Data Source Summary

| Exporter/Service | Port | Metrics Provided | Collection Interval |
|------------------|------|------------------|---------------------|
| **RE Workflow Backend** | 5000 | HTTP metrics, custom business metrics, Node.js runtime | 10s (Prometheus scrape) |
| **Node Exporter** | 9100 | Host system metrics (CPU, memory, disk, network) | 15s (Prometheus scrape) |
| **Redis Exporter** | 9121 | Redis server metrics (connections, memory, commands) | 15s (Prometheus scrape) |
| **Queue Metrics** | 5000 | BullMQ queue job counts (via backend) | 15s (internal collection) |
| **Loki** | 3100 | Application logs | Real-time streaming |

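To confirm that scrapes are actually happening at these intervals, the age of each target's last sample can be checked in PromQL (a sketch):

```promql
# Seconds since each target's last sample; should stay near the configured interval
time() - timestamp(up)
```
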
---

## 🎯 Renamed Panels for Clarity

### Before → After

**Node.js Runtime Section:**

- ❌ "Memory Usage" → ✅ "Node.js Process Memory (Heap)"
- ❌ "CPU Usage" → ✅ "Node.js Process CPU Usage"
- ❌ "Event Loop Lag" → ✅ "Node.js Event Loop Lag"
- ❌ "Active Handles & Requests" → ✅ "Node.js Active Handles & Requests"

**System Resources Section:**

- ❌ "System CPU Usage" → ✅ "Host CPU Usage (All Cores)"
- ❌ "System Memory Usage" → ✅ "Host Memory Usage (RAM)"
- ❌ "System Disk Usage" → ✅ "Host Disk Usage (/root)"

---

## 📈 Understanding the Difference

### **Process vs Host Metrics**

| Aspect | Node.js Process Metrics | Host System Metrics |
|--------|------------------------|---------------------|
| **Scope** | Single Node.js application | Entire server/container |
| **CPU** | CPU used by Node.js only | CPU used by all processes |
| **Memory** | Node.js heap memory | Total RAM on machine |
| **Purpose** | Application performance | Infrastructure health |
| **Example Use** | Detect memory leaks in app | Ensure server has capacity |

**Example Scenario:**

- **Node.js Process CPU**: 15% → Your app is using 15% of one CPU core
- **Host CPU Usage**: 75% → The entire server is at 75% CPU (all processes combined)

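The two readings come from different queries with different units, which is why they can differ so widely. Side by side (both taken from the panels above):

```promql
# Node.js process CPU: fraction of one core (0.15 = 15%)
rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])

# Host CPU: percentage averaged across all cores (75 = 75%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
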
---

## 🚨 Alert Thresholds (Recommended)

| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| **Node.js Process Memory** | 80% of heap | 90% of heap | Investigate memory leaks |
| **Host Memory Usage** | 70% | 85% | Scale up or optimize |
| **Host CPU Usage** | 60% | 80% | Scale horizontally |
| **Redis Memory** | 500MB | 1GB | Review Redis usage |
| **Queue Jobs Waiting** | >10 | >50 | Check worker health |
| **Queue Jobs Failed** | >0 | >5 | Immediate investigation |
| **Event Loop Lag** | >100ms | >500ms | Performance optimization needed |

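These thresholds map directly onto PromQL conditions; a sketch of the critical-level expressions (the actual alerting rules would live in the Prometheus rules file):

```promql
# Critical: heap usage above 90%
nodejs_heap_size_used_bytes{job="re-workflow-backend"}
  / nodejs_heap_size_total_bytes{job="re-workflow-backend"} > 0.9

# Critical: event loop lag above 500ms
nodejs_eventloop_lag_seconds{job="re-workflow-backend"} > 0.5

# Critical: Redis memory above 1GB
redis_memory_used_bytes > 1e9
```
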
---

## 🔧 Troubleshooting

### No Data Showing?

1. **Check Prometheus Targets**: http://localhost:9090/targets
   - All targets should show "UP" status

2. **Test Metric Availability**:

   ```promql
   up{job="re-workflow-backend"}
   ```

   Should return `1`.

3. **Check Time Range**: Set to "Last 15 minutes" in Grafana

4. **Verify Backend**: http://localhost:5000/metrics should show all metrics

### Metrics Not Updating?

1. **Backend**: Ensure the backend is running with metrics collection enabled
2. **Prometheus**: Check the scrape interval in prometheus.yml
3. **Queue Metrics**: Verify that queue metrics collection has started (check backend logs for "Queue Metrics ✅")

---

## 📚 Additional Resources

- **Prometheus Query Language**: https://prometheus.io/docs/prometheus/latest/querying/basics/
- **Grafana Dashboard Guide**: https://grafana.com/docs/grafana/latest/dashboards/
- **Node Exporter Metrics**: https://github.com/prometheus/node_exporter
- **Redis Exporter Metrics**: https://github.com/oliver006/redis_exporter
- **BullMQ Monitoring**: https://docs.bullmq.io/guide/metrics