# RE Workflow Dashboard - Metrics Reference

## 📊 Complete KPI List with Data Sources

### **Section 1: API Overview**

| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Request Rate** | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | HTTP requests per second (all endpoints) |
| **Error Rate** | `sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m])) / sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | Percentage of failed HTTP requests |
| **P95 Latency** | `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))` | Backend metrics | 95th percentile response time (seconds) |
| **API Status** | `up{job="re-workflow-backend"}` | Prometheus | Backend service up/down status (1 = up, 0 = down) |
| **Request Rate by Method** | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m])) by (method)` | Backend metrics | Requests per method (GET, POST, etc.) |
| **Response Time Percentiles** | `histogram_quantile(0.50/0.95/0.99, ...)` | Backend metrics | Response time distribution (P50, P95, P99) |

---

### **Section 2: Logs**

| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Errors (Time Range)** | `count_over_time({job="re-workflow-backend", level="error"}[...])` | Loki logs | Total error log entries in selected time range |
| **Warnings (Time Range)** | `count_over_time({job="re-workflow-backend", level="warn"}[...])` | Loki logs | Total warning log entries in selected time range |
| **TAT Breaches (Time Range)** | Log filter for TAT breaches | Loki logs | TAT breach events logged |
| **Auth Failures (Time Range)** | Log filter for auth failures | Loki logs | Authentication failure events |
| **Recent Errors & Warnings** | `{job="re-workflow-backend"} \|~ "error\|warn"` | Loki logs | Live log stream of errors and warnings |

---

### **Section 3: Node.js Runtime** (Process-Level Metrics)

| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Node.js Process Memory (Heap)** | `process_resident_memory_bytes{job="re-workflow-backend"}`<br>`nodejs_heap_size_used_bytes{job="re-workflow-backend"}`<br>`nodejs_heap_size_total_bytes{job="re-workflow-backend"}` | Node.js metrics (prom-client) | Node.js process memory usage: RSS (Resident Set Size), Heap Used, Heap Total |
| **Node.js Event Loop Lag** | `nodejs_eventloop_lag_seconds{job="re-workflow-backend"}` | Node.js metrics | Event loop lag in seconds (high = performance issue) |
| **Node.js Active Handles & Requests** | `nodejs_active_handles_total{job="re-workflow-backend"}`<br>`nodejs_active_requests_total{job="re-workflow-backend"}` | Node.js metrics | Active file handles and pending async requests |
| **Node.js Process CPU Usage** | `rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])` | Node.js metrics | CPU usage by the Node.js process only (0-1 = 0-100%) |

**Key Point**: These metrics track the **Node.js application process** specifically, not the entire host system.

---

### **Section 4: Redis & Queue Status**

| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Redis Status** | `redis_up` | Redis Exporter | Redis server status (1 = up, 0 = down) |
| **Redis Connections** | `redis_connected_clients` | Redis Exporter | Number of active client connections to Redis |
| **Redis Memory** | `redis_memory_used_bytes` | Redis Exporter | Memory used by Redis (bytes) |
| **TAT Queue Waiting** | `queue_jobs_waiting{queue_name="tatQueue"}` | Backend queue metrics | Jobs waiting in TAT notification queue |
| **Pause/Resume Queue Waiting** | `queue_jobs_waiting{queue_name="pauseResumeQueue"}` | Backend queue metrics | Jobs waiting in pause/resume queue |
| **TAT Queue Failed** | `queue_jobs_failed{queue_name="tatQueue"}` | Backend queue metrics | Failed TAT notification jobs (should be 0) |
| **Pause/Resume Queue Failed** | `queue_jobs_failed{queue_name="pauseResumeQueue"}` | Backend queue metrics | Failed pause/resume jobs (should be 0) |
| **All Queues - Job Status** | `queue_jobs_waiting`<br>`queue_jobs_active`<br>`queue_jobs_delayed` | Backend queue metrics | Timeline of job status across all queues (stacked) |
| **Redis Commands Rate** | `rate(redis_commands_processed_total[1m])` | Redis Exporter | Redis commands executed per second |

**Key Point**: Queue metrics are collected by the backend every 15 seconds via the BullMQ queue API.

---

### **Section 5: System Resources** (Host-Level Metrics)

| Panel Name | Metric Query | Data Source | What It Measures |
|------------|--------------|-------------|------------------|
| **Host CPU Usage (All Cores)** | `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | Node Exporter | Total CPU usage across all cores on the host machine (%) |
| **Host Memory Usage (RAM)** | `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100` | Node Exporter | RAM usage on the host machine (%) |
| **Host Disk Usage (/root)** | `100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100)` | Node Exporter | Disk usage of the root filesystem (%) |
| **Disk Space Left** | `node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}` | Node Exporter | Available disk space on the root filesystem (bytes; typically displayed as GB) |

**Key Point**: These metrics track the **entire host system**, not just the Node.js process.
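---

## 🧮 How `histogram_quantile` Estimates P95 (Sketch)

The P95 Latency panel relies on Prometheus's `histogram_quantile`, which linearly interpolates within cumulative histogram buckets. The JavaScript sketch below illustrates that estimate; the bucket values are made up for the example and are not real backend data:

```javascript
// Minimal sketch of histogram_quantile's interpolation over cumulative
// buckets (Prometheus convention: counts are cumulative, last bucket le=+Inf).
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // the observation rank the quantile falls on
  let prevLe = 0;
  let prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      if (le === Infinity) return prevLe; // fall back to last finite bound
      // Linear interpolation inside the bucket that contains the rank.
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount));
    }
    prevLe = le;
    prevCount = count;
  }
  return NaN;
}

// Hypothetical request-duration buckets (seconds), cumulative counts:
const buckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // ≈ 0.778 (seconds)
```

Note this is why P95 values are estimates: accuracy depends on how finely the `http_request_duration_seconds_bucket` boundaries are spaced.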
---

## 🔍 Data Source Summary

| Exporter/Service | Port | Metrics Provided | Collection Interval |
|------------------|------|------------------|---------------------|
| **RE Workflow Backend** | 5000 | HTTP metrics, custom business metrics, Node.js runtime | 10s (Prometheus scrape) |
| **Node Exporter** | 9100 | Host system metrics (CPU, memory, disk, network) | 15s (Prometheus scrape) |
| **Redis Exporter** | 9121 | Redis server metrics (connections, memory, commands) | 15s (Prometheus scrape) |
| **Queue Metrics** | 5000 | BullMQ queue job counts (via backend) | 15s (internal collection) |
| **Loki** | 3100 | Application logs | Real-time streaming |

---

## 🎯 Renamed Panels for Clarity

### Before → After

**Node.js Runtime Section:**
- ❌ "Memory Usage" → ✅ "Node.js Process Memory (Heap)"
- ❌ "CPU Usage" → ✅ "Node.js Process CPU Usage"
- ❌ "Event Loop Lag" → ✅ "Node.js Event Loop Lag"
- ❌ "Active Handles & Requests" → ✅ "Node.js Active Handles & Requests"

**System Resources Section:**
- ❌ "System CPU Usage" → ✅ "Host CPU Usage (All Cores)"
- ❌ "System Memory Usage" → ✅ "Host Memory Usage (RAM)"
- ❌ "System Disk Usage" → ✅ "Host Disk Usage (/root)"

---

## 📈 Understanding the Difference

### **Process vs Host Metrics**

| Aspect | Node.js Process Metrics | Host System Metrics |
|--------|------------------------|---------------------|
| **Scope** | Single Node.js application | Entire server/container |
| **CPU** | CPU used by Node.js only | CPU used by all processes |
| **Memory** | Node.js heap memory | Total RAM on machine |
| **Purpose** | Application performance | Infrastructure health |
| **Example Use** | Detect memory leaks in app | Ensure server has capacity |

**Example Scenario:**
- **Node.js Process CPU**: 15% → Your app is using 15% of one CPU core
- **Host CPU Usage**: 75% → The entire server is at 75% CPU (all processes combined)

---

## 🚨 Alert Thresholds (Recommended)

| Metric | Warning | Critical | Action |
|--------|---------|----------|--------|
| **Node.js Process Memory** | 80% of heap | 90% of heap | Investigate memory leaks |
| **Host Memory Usage** | 70% | 85% | Scale up or optimize |
| **Host CPU Usage** | 60% | 80% | Scale horizontally |
| **Redis Memory** | 500MB | 1GB | Review Redis usage |
| **Queue Jobs Waiting** | >10 | >50 | Check worker health |
| **Queue Jobs Failed** | >0 | >5 | Immediate investigation |
| **Event Loop Lag** | >100ms | >500ms | Performance optimization needed |

---

## 🔧 Troubleshooting

### No Data Showing?

1. **Check Prometheus Targets**: http://localhost:9090/targets (all targets should show "UP" status)
2. **Test Metric Availability**:
   ```promql
   up{job="re-workflow-backend"}
   ```
   Should return `1`
3. **Check Time Range**: Set to "Last 15 minutes" in Grafana
4. **Verify Backend**: http://localhost:5000/metrics should show all metrics

### Metrics Not Updating?

1. **Backend**: Ensure the backend is running with metrics collection enabled
2. **Prometheus**: Check the scrape interval in prometheus.yml
3. **Queue Metrics**: Verify queue metrics collection started (check backend logs for "Queue Metrics ✅")

---

## 📚 Additional Resources

- **Prometheus Query Language**: https://prometheus.io/docs/prometheus/latest/querying/basics/
- **Grafana Dashboard Guide**: https://grafana.com/docs/grafana/latest/dashboards/
- **Node Exporter Metrics**: https://github.com/prometheus/node_exporter
- **Redis Exporter Metrics**: https://github.com/oliver006/redis_exporter
- **BullMQ Monitoring**: https://docs.bullmq.io/guide/metrics
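
---

## 🧪 Queue Metrics Collection (Sketch)

Section 4 notes that queue metrics are collected by the backend every 15 seconds via the BullMQ queue API. The sketch below shows the shape of that collection loop; the real backend uses BullMQ's `getJobCounts()` and prom-client gauges, which are stubbed here with plain objects, and the sample counts are invented for illustration:

```javascript
// Stand-in for prom-client gauges: metric series name -> current value.
const metrics = {};

// Mirror one queue's BullMQ-style job counts into gauge values, using the
// same series names the dashboard queries (queue_jobs_waiting, etc.).
function recordQueueCounts(queueName, counts) {
  metrics[`queue_jobs_waiting{queue_name="${queueName}"}`] = counts.waiting;
  metrics[`queue_jobs_active{queue_name="${queueName}"}`] = counts.active;
  metrics[`queue_jobs_delayed{queue_name="${queueName}"}`] = counts.delayed;
  metrics[`queue_jobs_failed{queue_name="${queueName}"}`] = counts.failed;
}

// Hypothetical snapshot of what queue.getJobCounts() might resolve to:
recordQueueCounts('tatQueue', { waiting: 3, active: 1, delayed: 0, failed: 0 });
recordQueueCounts('pauseResumeQueue', { waiting: 0, active: 0, delayed: 2, failed: 0 });

console.log(metrics['queue_jobs_waiting{queue_name="tatQueue"}']); // 3

// In the real backend this would run on an interval, roughly:
// setInterval(() => queues.forEach(async (q) =>
//   recordQueueCounts(q.name, await q.getJobCounts())), 15_000);
```

Because the gauges are refreshed every 15 seconds while Prometheus scrapes every 10 seconds, consecutive scrapes can return the same queue values; that is expected.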