
# RE Workflow Dashboard - Metrics Reference

## 📊 Complete KPI List with Data Sources

### Section 1: API Overview

| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Request Rate | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | HTTP requests per second (all endpoints) |
| Error Rate | `sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m])) / sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | Percentage of failed HTTP requests |
| P95 Latency | `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))` | Backend metrics | 95th percentile response time (seconds) |
| API Status | `up{job="re-workflow-backend"}` | Prometheus | Backend service up/down status (1=up, 0=down) |
| Request Rate by Method | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m])) by (method)` | Backend metrics | Requests per second by method (GET, POST, etc.) |
| Response Time Percentiles | `histogram_quantile(0.50/0.95/0.99, ...)` | Backend metrics | Response time distribution (P50, P95, P99) |
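
The "Response Time Percentiles" row abbreviates three separate queries. Spelled out against the same `http_request_duration_seconds_bucket` histogram used by the P95 Latency panel, they look like this (the error rate expressed as a percentage is included for reference):

```promql
# P50 (median) response time
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))

# P95 response time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))

# P99 response time
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))

# Error rate as a percentage (the dashboard query returns a 0-1 ratio)
100 * sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m]))
    / sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))
```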

### Section 2: Logs

| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Errors (Time Range) | `count_over_time({job="re-workflow-backend", level="error"}[...])` | Loki logs | Total error log entries in selected time range |
| Warnings (Time Range) | `count_over_time({job="re-workflow-backend", level="warn"}[...])` | Loki logs | Total warning log entries in selected time range |
| TAT Breaches (Time Range) | Log filter for TAT breaches | Loki logs | TAT breach events logged |
| Auth Failures (Time Range) | Log filter for auth failures | Loki logs | Authentication failure events |
| Recent Errors & Warnings | `{job="re-workflow-backend"} \|= "error" or "warn"` | Loki logs | Live log stream of errors and warnings |

### Section 3: Node.js Runtime (Process-Level Metrics)

| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Node.js Process Memory (Heap) | `process_resident_memory_bytes{job="re-workflow-backend"}`<br>`nodejs_heap_size_used_bytes{job="re-workflow-backend"}`<br>`nodejs_heap_size_total_bytes{job="re-workflow-backend"}` | Node.js metrics (prom-client) | Node.js process memory usage: RSS (Resident Set Size), Heap Used, Heap Total |
| Node.js Event Loop Lag | `nodejs_eventloop_lag_seconds{job="re-workflow-backend"}` | Node.js metrics | Event loop lag in seconds (high = performance issue) |
| Node.js Active Handles & Requests | `nodejs_active_handles_total{job="re-workflow-backend"}`<br>`nodejs_active_requests_total{job="re-workflow-backend"}` | Node.js metrics | Active file handles and pending async requests |
| Node.js Process CPU Usage | `rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])` | Node.js metrics | CPU usage by the Node.js process only (0-1 = 0-100%) |

Key Point: These metrics track the Node.js application process specifically, not the entire host system.
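
The heap thresholds listed later in this document (80% / 90% of heap) can be read straight from these two gauges; a minimal sketch of the ratio:

```promql
# Heap utilization as a percentage of the currently allocated heap
100 * nodejs_heap_size_used_bytes{job="re-workflow-backend"}
    / nodejs_heap_size_total_bytes{job="re-workflow-backend"}
```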


### Section 4: Redis & Queue Status

| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Redis Status | `redis_up` | Redis Exporter | Redis server status (1=up, 0=down) |
| Redis Connections | `redis_connected_clients` | Redis Exporter | Number of active client connections to Redis |
| Redis Memory | `redis_memory_used_bytes` | Redis Exporter | Memory used by Redis (bytes) |
| TAT Queue Waiting | `queue_jobs_waiting{queue_name="tatQueue"}` | Backend queue metrics | Jobs waiting in TAT notification queue |
| Pause/Resume Queue Waiting | `queue_jobs_waiting{queue_name="pauseResumeQueue"}` | Backend queue metrics | Jobs waiting in pause/resume queue |
| TAT Queue Failed | `queue_jobs_failed{queue_name="tatQueue"}` | Backend queue metrics | Failed TAT notification jobs (should be 0) |
| Pause/Resume Queue Failed | `queue_jobs_failed{queue_name="pauseResumeQueue"}` | Backend queue metrics | Failed pause/resume jobs (should be 0) |
| All Queues - Job Status | `queue_jobs_waiting`<br>`queue_jobs_active`<br>`queue_jobs_delayed` | Backend queue metrics | Timeline of job status across all queues (stacked) |
| Redis Commands Rate | `rate(redis_commands_processed_total[1m])` | Redis Exporter | Redis commands executed per second |

Key Point: Queue metrics are collected by the backend every 15 seconds via BullMQ queue API.
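
Because the queue gauges carry a `queue_name` label, the stacked "All Queues - Job Status" panel reduces to simple aggregations; an illustrative sketch:

```promql
# Waiting jobs per queue (one series per queue_name)
sum by (queue_name) (queue_jobs_waiting)

# Total jobs not yet completed, across all queues
sum(queue_jobs_waiting) + sum(queue_jobs_active) + sum(queue_jobs_delayed)
```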


### Section 5: System Resources (Host-Level Metrics)

| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Host CPU Usage (All Cores) | `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | Node Exporter | Total CPU usage across all cores on the host machine (%) |
| Host Memory Usage (RAM) | `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100` | Node Exporter | RAM usage on the host machine (%) |
| Host Disk Usage (/root) | `100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100)` | Node Exporter | Disk usage of the root filesystem (%) |
| Disk Space Left | `node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}` | Node Exporter | Available disk space on the root filesystem (metric is in bytes; panel displays GB) |

Key Point: These metrics track the entire host system, not just the Node.js process.
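
If you want "Disk Space Left" in gigabytes from the query itself rather than from the panel's unit setting, a byte-to-GiB conversion is enough; a sketch:

```promql
# Available space on the root filesystem, converted from bytes to GiB
node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / 1024 / 1024 / 1024
```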


## 🔍 Data Source Summary

| Exporter/Service | Port | Metrics Provided | Collection Interval |
|---|---|---|---|
| RE Workflow Backend | 5000 | HTTP metrics, custom business metrics, Node.js runtime | 10s (Prometheus scrape) |
| Node Exporter | 9100 | Host system metrics (CPU, memory, disk, network) | 15s (Prometheus scrape) |
| Redis Exporter | 9121 | Redis server metrics (connections, memory, commands) | 15s (Prometheus scrape) |
| Queue Metrics | 5000 | BullMQ queue job counts (via backend) | 15s (internal collection) |
| Loki | 3100 | Application logs | Real-time streaming |

## 🎯 Renamed Panels for Clarity

Before → After

Node.js Runtime Section:

  • "Memory Usage" → "Node.js Process Memory (Heap)"
  • "CPU Usage" → "Node.js Process CPU Usage"
  • "Event Loop Lag" → "Node.js Event Loop Lag"
  • "Active Handles & Requests" → "Node.js Active Handles & Requests"

System Resources Section:

  • "System CPU Usage" → "Host CPU Usage (All Cores)"
  • "System Memory Usage" → "Host Memory Usage (RAM)"
  • "System Disk Usage" → "Host Disk Usage (/root)"

## 📈 Understanding the Difference

### Process vs Host Metrics

| Aspect | Node.js Process Metrics | Host System Metrics |
|---|---|---|
| Scope | Single Node.js application | Entire server/container |
| CPU | CPU used by Node.js only | CPU used by all processes |
| Memory | Node.js heap memory | Total RAM on machine |
| Purpose | Application performance | Infrastructure health |
| Example Use | Detect memory leaks in the app | Ensure the server has capacity |

Example Scenario:

- Node.js Process CPU: 15% → your app is using 15% of one CPU core
- Host CPU Usage: 75% → the entire server is at 75% CPU (all processes combined)
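
Both readings come from queries already listed above; putting them side by side makes the scope difference concrete:

```promql
# Node.js process CPU: cores used by the backend process only (0-1 = 0-100% of one core)
rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])

# Host CPU: average utilization across all cores and all processes (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```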

| Metric | Warning | Critical | Action |
|---|---|---|---|
| Node.js Process Memory | 80% of heap | 90% of heap | Investigate memory leaks |
| Host Memory Usage | 70% | 85% | Scale up or optimize |
| Host CPU Usage | 60% | 80% | Scale horizontally |
| Redis Memory | 500MB | 1GB | Review Redis usage |
| Queue Jobs Waiting | >10 | >50 | Check worker health |
| Queue Jobs Failed | >0 | >5 | Immediate investigation |
| Event Loop Lag | >100ms | >500ms | Performance optimization needed |
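
These thresholds translate naturally into alert-style PromQL expressions. No alert rules are defined in this document, so treat the following as an illustrative sketch only:

```promql
# Warning: any failed jobs in a queue (critical above 5)
queue_jobs_failed > 0

# Warning: event loop lag above 100 ms (critical above 500 ms)
nodejs_eventloop_lag_seconds{job="re-workflow-backend"} > 0.1

# Critical: host memory usage above 85%
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
```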

## 🔧 Troubleshooting

### No Data Showing?

1. Check Prometheus Targets: open http://localhost:9090/targets
   - All targets should show "UP" status
2. Test Metric Availability: run `up{job="re-workflow-backend"}` in Prometheus; it should return `1`
3. Check Time Range: set it to "Last 15 minutes" in Grafana
4. Verify Backend: http://localhost:5000/metrics should show all metrics

### Metrics Not Updating?

  1. Backend: Ensure backend is running with metrics collection enabled
  2. Prometheus: Check scrape interval in prometheus.yml
  3. Queue Metrics: Verify queue metrics collection started (check backend logs for "Queue Metrics ")
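
If targets are up but panels still look stale, Prometheus itself can tell you how old the last backend sample is; a small sketch using the built-in `timestamp()` function:

```promql
# Seconds since the backend's metrics were last scraped
time() - timestamp(up{job="re-workflow-backend"})
```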

## 📚 Additional Resources