RE Workflow Dashboard - Metrics Reference
📊 Complete KPI List with Data Sources
Section 1: API Overview
| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Request Rate | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | HTTP requests per second (all endpoints) |
| Error Rate | `sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m])) / sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | Fraction of HTTP requests that failed (0-1; multiply by 100 for a percentage) |
| P95 Latency | `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))` | Backend metrics | 95th percentile response time (seconds) |
| API Status | `up{job="re-workflow-backend"}` | Prometheus | Backend service up/down status (1=up, 0=down) |
| Request Rate by Method | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m])) by (method)` | Backend metrics | Requests per second, broken down by HTTP method (GET, POST, etc.) |
| Response Time Percentiles | `histogram_quantile(0.50/0.95/0.99, ...)` | Backend metrics | Response time distribution (P50, P95, P99) |
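
To make the latency percentiles easier to interpret, here is a minimal Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general approach `histogram_quantile` takes. The function name, the bucket boundaries, and the counts are illustrative assumptions, not values from this dashboard, and the sketch ignores some edge cases that Prometheus handles.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last entry being the +Inf bucket (as in *_bucket{le="..."} series).
    """
    if not buckets or not 0 <= q <= 1:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # position of the requested observation
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since nothing more precise is known.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets (seconds -> cumulative request count).
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, example_buckets)
```

With these example buckets, the P95 lands between the 0.5 s and 1.0 s bounds, which illustrates why the accuracy of the panel depends on how finely the backend's histogram buckets are spaced.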
Section 2: Logs
| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Errors (Time Range) | `count_over_time({job="re-workflow-backend", level="error"}[...])` | Loki logs | Total error log entries in selected time range |
| Warnings (Time Range) | `count_over_time({job="re-workflow-backend", level="warn"}[...])` | Loki logs | Total warning log entries in selected time range |
| TAT Breaches (Time Range) | Log filter for TAT breaches | Loki logs | TAT breach events logged |
| Auth Failures (Time Range) | Log filter for auth failures | Loki logs | Authentication failure events |
| Recent Errors & Warnings | `{job="re-workflow-backend"} \|= "error" or "warn"` | Loki logs | Live log stream of errors and warnings |
Section 3: Node.js Runtime (Process-Level Metrics)
| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Node.js Process Memory (Heap) | `process_resident_memory_bytes{job="re-workflow-backend"}`<br>`nodejs_heap_size_used_bytes{job="re-workflow-backend"}`<br>`nodejs_heap_size_total_bytes{job="re-workflow-backend"}` | Node.js metrics (prom-client) | Node.js process memory usage: RSS (Resident Set Size), heap used, heap total |
| Node.js Event Loop Lag | `nodejs_eventloop_lag_seconds{job="re-workflow-backend"}` | Node.js metrics | Event loop lag in seconds (high = performance issue) |
| Node.js Active Handles & Requests | `nodejs_active_handles_total{job="re-workflow-backend"}`<br>`nodejs_active_requests_total{job="re-workflow-backend"}` | Node.js metrics | Active libuv handles (sockets, timers, etc.) and pending async requests |
| Node.js Process CPU Usage | `rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])` | Node.js metrics | CPU used by the Node.js process only, in cores (1.0 = 100% of one core) |
Key Point: These metrics track the Node.js application process specifically, not the entire host system.
Section 4: Redis & Queue Status
| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Redis Status | `redis_up` | Redis Exporter | Redis server status (1=up, 0=down) |
| Redis Connections | `redis_connected_clients` | Redis Exporter | Number of active client connections to Redis |
| Redis Memory | `redis_memory_used_bytes` | Redis Exporter | Memory used by Redis (bytes) |
| TAT Queue Waiting | `queue_jobs_waiting{queue_name="tatQueue"}` | Backend queue metrics | Jobs waiting in TAT notification queue |
| Pause/Resume Queue Waiting | `queue_jobs_waiting{queue_name="pauseResumeQueue"}` | Backend queue metrics | Jobs waiting in pause/resume queue |
| TAT Queue Failed | `queue_jobs_failed{queue_name="tatQueue"}` | Backend queue metrics | Failed TAT notification jobs (should be 0) |
| Pause/Resume Queue Failed | `queue_jobs_failed{queue_name="pauseResumeQueue"}` | Backend queue metrics | Failed pause/resume jobs (should be 0) |
| All Queues - Job Status | `queue_jobs_waiting`<br>`queue_jobs_active`<br>`queue_jobs_delayed` | Backend queue metrics | Timeline of job status across all queues (stacked) |
| Redis Commands Rate | `rate(redis_commands_processed_total[1m])` | Redis Exporter | Redis commands executed per second |
Key Point: Queue metrics are collected by the backend every 15 seconds via the BullMQ queue API.
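
For orientation, the following Python sketch shows roughly what the queue gauges could look like in the Prometheus text exposition format once the backend has collected job counts. The function `render_queue_gauges` and the sample counts are hypothetical; the real backend is a Node.js service, and its actual collection code is not shown in this document.

```python
def render_queue_gauges(counts):
    """Render per-queue job counts as Prometheus text-format gauge samples.

    counts: {queue_name: {state: job_count}}, for example
            {"tatQueue": {"waiting": 3, "failed": 0}}
    """
    # Samples of the same metric must be grouped, so iterate by state first.
    states = sorted({state for per_queue in counts.values() for state in per_queue})
    lines = []
    for state in states:
        metric = f"queue_jobs_{state}"
        lines.append(f"# TYPE {metric} gauge")
        for queue_name in sorted(counts):
            if state in counts[queue_name]:
                value = counts[queue_name][state]
                lines.append(f'{metric}{{queue_name="{queue_name}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical counts, as if read from the queues during one collection cycle.
sample_counts = {
    "tatQueue": {"waiting": 3, "failed": 0},
    "pauseResumeQueue": {"waiting": 1, "failed": 0},
}
exposition = render_queue_gauges(sample_counts)
```

Because the counts are only refreshed on each collection cycle, a job that is enqueued and completed between two cycles may never appear in the "waiting" panels.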
Section 5: System Resources (Host-Level Metrics)
| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Host CPU Usage (All Cores) | `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | Node Exporter | Total CPU usage across all cores on host machine (%) |
| Host Memory Usage (RAM) | `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100` | Node Exporter | RAM usage on host machine (%) |
| Host Disk Usage (/root) | `100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100)` | Node Exporter | Disk usage of the root filesystem `/` (%) |
| Disk Space Left | `node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}` | Node Exporter | Available disk space on the root filesystem (query returns bytes; shown in gigabytes) |
Key Point: These metrics track the entire host system, not just the Node.js process.
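
The memory and disk formulas above are simple ratios. The Python sketch below reproduces the arithmetic on made-up byte values so the percentages can be checked by hand. The function names are illustrative, and the conversion assumes decimal gigabytes (10^9 bytes), which may differ from the unit configured in the Grafana panel.

```python
def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Mirror of (1 - MemAvailable / MemTotal) * 100."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


def disk_usage_percent(avail_bytes, size_bytes):
    """Mirror of 100 - (avail / size) * 100 for one filesystem."""
    return 100 - (avail_bytes / size_bytes) * 100


def bytes_to_gigabytes(num_bytes):
    """Convert bytes to decimal gigabytes (use 1024**3 instead for GiB)."""
    return num_bytes / 10**9


# Hypothetical host: 16 GiB RAM with 4 GiB available, 80 GB disk with 20 GB free.
GIB = 1024**3
ram_percent = memory_usage_percent(4 * GIB, 16 * GIB)      # 75.0
disk_percent = disk_usage_percent(20 * 10**9, 80 * 10**9)  # 75.0
disk_left_gb = bytes_to_gigabytes(20 * 10**9)              # 20.0
```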
🔍 Data Source Summary
| Exporter/Service | Port | Metrics Provided | Collection Interval |
|---|---|---|---|
| RE Workflow Backend | 5000 | HTTP metrics, custom business metrics, Node.js runtime | 10s (Prometheus scrape) |
| Node Exporter | 9100 | Host system metrics (CPU, memory, disk, network) | 15s (Prometheus scrape) |
| Redis Exporter | 9121 | Redis server metrics (connections, memory, commands) | 15s (Prometheus scrape) |
| Queue Metrics | 5000 | BullMQ queue job counts (via backend) | 15s (internal collection) |
| Loki | 3100 | Application logs | Real-time streaming |
🎯 Renamed Panels for Clarity
Before → After
Node.js Runtime Section:
- ❌ "Memory Usage" → ✅ "Node.js Process Memory (Heap)"
- ❌ "CPU Usage" → ✅ "Node.js Process CPU Usage"
- ❌ "Event Loop Lag" → ✅ "Node.js Event Loop Lag"
- ❌ "Active Handles & Requests" → ✅ "Node.js Active Handles & Requests"
System Resources Section:
- ❌ "System CPU Usage" → ✅ "Host CPU Usage (All Cores)"
- ❌ "System Memory Usage" → ✅ "Host Memory Usage (RAM)"
- ❌ "System Disk Usage" → ✅ "Host Disk Usage (/root)"
📈 Understanding the Difference
Process vs Host Metrics
| Aspect | Node.js Process Metrics | Host System Metrics |
|---|---|---|
| Scope | Single Node.js application | Entire server/container |
| CPU | CPU used by Node.js only | CPU used by all processes |
| Memory | Node.js process memory (RSS and heap) | RAM in use across the whole machine |
| Purpose | Application performance | Infrastructure health |
| Example Use | Detect memory leaks in app | Ensure server has capacity |
Example Scenario:
- Node.js Process CPU: 15% → Your app is using 15% of one CPU core
- Host CPU Usage: 75% → The entire server is at 75% CPU (all processes combined)
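
The two numbers in this scenario come from different calculations. As a rough Python sketch (ignoring the extrapolation that PromQL `rate()` performs, and using invented sample values), the process figure is CPU seconds consumed per second of wall time, while the host figure is derived from the average idle fraction across cores:

```python
def process_cpu_cores(cpu_seconds_start, cpu_seconds_end, elapsed_seconds):
    """Average number of cores one process used over the window.

    Approximates rate(process_cpu_seconds_total[5m]): the counter increase
    divided by the elapsed wall-clock time.
    """
    return (cpu_seconds_end - cpu_seconds_start) / elapsed_seconds


def host_cpu_percent(idle_fractions):
    """Host CPU usage from per-core idle fractions (0.0 to 1.0 each).

    Mirrors 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100.
    """
    return 100 - (sum(idle_fractions) / len(idle_fractions)) * 100


# Hypothetical samples: the process counter grew by 45 CPU-seconds in 300 s,
# and each of the 4 host cores was idle 25% of the time.
app_cores = process_cpu_cores(100.0, 145.0, 300.0)  # about 0.15 -> 15% of one core
host_percent = host_cpu_percent([0.25, 0.25, 0.25, 0.25])  # 75.0
```

Note that the process value is expressed relative to a single core, so on a multi-core host a busy app and a busy server are not directly comparable without dividing by the core count.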
🚨 Alert Thresholds (Recommended)
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Node.js Process Memory | 80% of heap | 90% of heap | Investigate memory leaks |
| Host Memory Usage | 70% | 85% | Scale up or optimize |
| Host CPU Usage | 60% | 80% | Scale horizontally |
| Redis Memory | 500MB | 1GB | Review Redis usage |
| Queue Jobs Waiting | >10 | >50 | Check worker health |
| Queue Jobs Failed | >0 | >5 | Immediate investigation |
| Event Loop Lag | >100ms | >500ms | Performance optimization needed |
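
As one way to apply these thresholds, here is a small Python sketch that maps a metric value to a severity level. The dictionary keys and the `classify` function are hypothetical names, only a subset of the table is included, and the sketch assumes every threshold is a strict "greater than" comparison, which the table states explicitly only for the queue and event loop rows.

```python
# (warning, critical) pairs copied from the table above.
THRESHOLDS = {
    "host_memory_percent": (70, 85),
    "host_cpu_percent": (60, 80),
    "queue_jobs_waiting": (10, 50),
    "queue_jobs_failed": (0, 5),
    "event_loop_lag_ms": (100, 500),
}


def classify(metric, value):
    """Return "ok", "warning", or "critical" for a metric value."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


# A single failed job already counts as a warning; six or more is critical.
failed_job_levels = [classify("queue_jobs_failed", n) for n in (0, 1, 6)]
```

In practice these rules would more likely live in Prometheus alerting rules or Grafana alerts than in application code; the sketch only illustrates how the two-level thresholds are meant to be read.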
🔧 Troubleshooting
No Data Showing?
- Check Prometheus Targets: http://localhost:9090/targets
  - All targets should show "UP" status
- Test Metric Availability: `up{job="re-workflow-backend"}` should return `1`
- Check Time Range: Set to "Last 15 minutes" in Grafana
- Verify Backend: http://localhost:5000/metrics should show all metrics
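
The metric availability check can also be scripted against the Prometheus HTTP API (`/api/v1/query`). The Python sketch below uses only the standard library; the helper names are illustrative, and the base URL assumes Prometheus is reachable at `http://localhost:9090` as in the checklist above.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def targets_up(response_body):
    """Parse an instant-query response into {instance: is_up}."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    return {
        sample["metric"].get("instance", "unknown"): sample["value"][1] == "1"
        for sample in payload["data"]["result"]
    }


def check_backend(base_url="http://localhost:9090"):
    """Query up{job="re-workflow-backend"} and report each instance's status."""
    url = build_query_url(base_url, 'up{job="re-workflow-backend"}')
    with urllib.request.urlopen(url, timeout=5) as response:
        return targets_up(response.read())
```

Calling `check_backend()` returns an empty dictionary when Prometheus has no target with that job label, which usually points to a scrape configuration problem rather than a backend outage.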
Metrics Not Updating?
- Backend: Ensure backend is running with metrics collection enabled
- Prometheus: Check scrape interval in prometheus.yml
- Queue Metrics: Verify queue metrics collection started (check backend logs for "Queue Metrics ✅")
📚 Additional Resources
- Prometheus Query Language: https://prometheus.io/docs/prometheus/latest/querying/basics/
- Grafana Dashboard Guide: https://grafana.com/docs/grafana/latest/dashboards/
- Node Exporter Metrics: https://github.com/prometheus/node_exporter
- Redis Exporter Metrics: https://github.com/oliver006/redis_exporter
- BullMQ Monitoring: https://docs.bullmq.io/guide/metrics