
# RE Workflow Dashboard - Metrics Reference

## 📊 Complete KPI List with Data Sources

### Section 1: API Overview

| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Request Rate | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | HTTP requests per second (all endpoints) |
| Error Rate | `sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m])) / sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))` | Backend metrics | Percentage of failed HTTP requests |
| P95 Latency | `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))` | Backend metrics | 95th percentile response time (seconds) |
| API Status | `up{job="re-workflow-backend"}` | Prometheus | Backend service up/down status (1=up, 0=down) |
| Request Rate by Method | `sum(rate(http_requests_total{job="re-workflow-backend"}[5m])) by (method)` | Backend metrics | Requests per second by method (GET, POST, etc.) |
| Response Time Percentiles | `histogram_quantile(0.50/0.95/0.99, ...)` | Backend metrics | Response time distribution (P50, P95, P99) |
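
The "Response Time Percentiles" row abbreviates three separate queries. Spelled out against the same `http_request_duration_seconds_bucket` histogram used by the P95 Latency panel, they look like this (the error rate expressed as a percentage is included for reference):

```promql
# P50 (median) response time
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))

# P95 response time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))

# P99 response time
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="re-workflow-backend"}[5m])) by (le))

# Error rate as a percentage (the dashboard query returns a 0-1 ratio)
100 * sum(rate(http_request_errors_total{job="re-workflow-backend"}[5m]))
    / sum(rate(http_requests_total{job="re-workflow-backend"}[5m]))
```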

### Section 2: Logs

| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Errors (Time Range) | `count_over_time({job="re-workflow-backend", level="error"}[...])` | Loki logs | Total error log entries in selected time range |
| Warnings (Time Range) | `count_over_time({job="re-workflow-backend", level="warn"}[...])` | Loki logs | Total warning log entries in selected time range |
| TAT Breaches (Time Range) | Log filter for TAT breaches | Loki logs | TAT breach events logged |
| Auth Failures (Time Range) | Log filter for auth failures | Loki logs | Authentication failure events |
| Recent Errors & Warnings | `{job="re-workflow-backend"} \|= "error" or "warn"` | Loki logs | Live log stream of errors and warnings |

### Section 3: Node.js Runtime (Process-Level Metrics)

| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Node.js Process Memory (Heap) | `process_resident_memory_bytes{job="re-workflow-backend"}`<br>`nodejs_heap_size_used_bytes{job="re-workflow-backend"}`<br>`nodejs_heap_size_total_bytes{job="re-workflow-backend"}` | Node.js metrics (prom-client) | Node.js process memory usage: RSS (Resident Set Size), Heap Used, Heap Total |
| Node.js Event Loop Lag | `nodejs_eventloop_lag_seconds{job="re-workflow-backend"}` | Node.js metrics | Event loop lag in seconds (high = performance issue) |
| Node.js Active Handles & Requests | `nodejs_active_handles_total{job="re-workflow-backend"}`<br>`nodejs_active_requests_total{job="re-workflow-backend"}` | Node.js metrics | Active file handles and pending async requests |
| Node.js Process CPU Usage | `rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])` | Node.js metrics | CPU usage by the Node.js process only (0-1 = 0-100%) |

Key Point: These metrics track the Node.js application process specifically, not the entire host system.
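
The heap thresholds listed later in this document (80% / 90% of heap) can be read straight from these two gauges; a minimal sketch of the ratio:

```promql
# Heap utilization as a percentage of the currently allocated heap
100 * nodejs_heap_size_used_bytes{job="re-workflow-backend"}
    / nodejs_heap_size_total_bytes{job="re-workflow-backend"}
```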


### Section 4: Redis & Queue Status

| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Redis Status | `redis_up` | Redis Exporter | Redis server status (1=up, 0=down) |
| Redis Connections | `redis_connected_clients` | Redis Exporter | Number of active client connections to Redis |
| Redis Memory | `redis_memory_used_bytes` | Redis Exporter | Memory used by Redis (bytes) |
| TAT Queue Waiting | `queue_jobs_waiting{queue_name="tatQueue"}` | Backend queue metrics | Jobs waiting in TAT notification queue |
| Pause/Resume Queue Waiting | `queue_jobs_waiting{queue_name="pauseResumeQueue"}` | Backend queue metrics | Jobs waiting in pause/resume queue |
| TAT Queue Failed | `queue_jobs_failed{queue_name="tatQueue"}` | Backend queue metrics | Failed TAT notification jobs (should be 0) |
| Pause/Resume Queue Failed | `queue_jobs_failed{queue_name="pauseResumeQueue"}` | Backend queue metrics | Failed pause/resume jobs (should be 0) |
| All Queues - Job Status | `queue_jobs_waiting`<br>`queue_jobs_active`<br>`queue_jobs_delayed` | Backend queue metrics | Timeline of job status across all queues (stacked) |
| Redis Commands Rate | `rate(redis_commands_processed_total[1m])` | Redis Exporter | Redis commands executed per second |

Key Point: Queue metrics are collected by the backend every 15 seconds via BullMQ queue API.
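
Because the queue gauges carry a `queue_name` label, the stacked "All Queues - Job Status" panel reduces to simple aggregations; an illustrative sketch:

```promql
# Waiting jobs per queue (one series per queue_name)
sum by (queue_name) (queue_jobs_waiting)

# Total jobs not yet completed, across all queues
sum(queue_jobs_waiting) + sum(queue_jobs_active) + sum(queue_jobs_delayed)
```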


### Section 5: System Resources (Host-Level Metrics)

| Panel Name | Metric Query | Data Source | What It Measures |
|---|---|---|---|
| Host CPU Usage (All Cores) | `100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)` | Node Exporter | Total CPU usage across all cores on the host machine (%) |
| Host Memory Usage (RAM) | `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100` | Node Exporter | RAM usage on the host machine (%) |
| Host Disk Usage (/root) | `100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}) * 100)` | Node Exporter | Disk usage of the root filesystem (%) |
| Disk Space Left | `node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}` | Node Exporter | Available disk space on the root filesystem (metric is in bytes; panel displays GB) |

Key Point: These metrics track the entire host system, not just the Node.js process.
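
If you want "Disk Space Left" in gigabytes from the query itself rather than from the panel's unit setting, a byte-to-GiB conversion is enough; a sketch:

```promql
# Available space on the root filesystem, converted from bytes to GiB
node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"} / 1024 / 1024 / 1024
```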


## 🔍 Data Source Summary

| Exporter/Service | Port | Metrics Provided | Collection Interval |
|---|---|---|---|
| RE Workflow Backend | 5000 | HTTP metrics, custom business metrics, Node.js runtime | 10s (Prometheus scrape) |
| Node Exporter | 9100 | Host system metrics (CPU, memory, disk, network) | 15s (Prometheus scrape) |
| Redis Exporter | 9121 | Redis server metrics (connections, memory, commands) | 15s (Prometheus scrape) |
| Queue Metrics | 5000 | BullMQ queue job counts (via backend) | 15s (internal collection) |
| Loki | 3100 | Application logs | Real-time streaming |

## 🎯 Renamed Panels for Clarity

Before → After

Node.js Runtime Section:

  • "Memory Usage" → "Node.js Process Memory (Heap)"
  • "CPU Usage" → "Node.js Process CPU Usage"
  • "Event Loop Lag" → "Node.js Event Loop Lag"
  • "Active Handles & Requests" → "Node.js Active Handles & Requests"

System Resources Section:

  • "System CPU Usage" → "Host CPU Usage (All Cores)"
  • "System Memory Usage" → "Host Memory Usage (RAM)"
  • "System Disk Usage" → "Host Disk Usage (/root)"

## 📈 Understanding the Difference

### Process vs Host Metrics

| Aspect | Node.js Process Metrics | Host System Metrics |
|---|---|---|
| Scope | Single Node.js application | Entire server/container |
| CPU | CPU used by Node.js only | CPU used by all processes |
| Memory | Node.js heap memory | Total RAM on machine |
| Purpose | Application performance | Infrastructure health |
| Example Use | Detect memory leaks in the app | Ensure the server has capacity |

Example Scenario:

- Node.js Process CPU: 15% → your app is using 15% of one CPU core
- Host CPU Usage: 75% → the entire server is at 75% CPU (all processes combined)
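
Both readings come from queries already listed above; putting them side by side makes the scope difference concrete:

```promql
# Node.js process CPU: cores used by the backend process only (0-1 = 0-100% of one core)
rate(process_cpu_seconds_total{job="re-workflow-backend"}[5m])

# Host CPU: average utilization across all cores and all processes (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```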

| Metric | Warning | Critical | Action |
|---|---|---|---|
| Node.js Process Memory | 80% of heap | 90% of heap | Investigate memory leaks |
| Host Memory Usage | 70% | 85% | Scale up or optimize |
| Host CPU Usage | 60% | 80% | Scale horizontally |
| Redis Memory | 500MB | 1GB | Review Redis usage |
| Queue Jobs Waiting | >10 | >50 | Check worker health |
| Queue Jobs Failed | >0 | >5 | Immediate investigation |
| Event Loop Lag | >100ms | >500ms | Performance optimization needed |
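
These thresholds translate naturally into alert-style PromQL expressions. No alert rules are defined in this document, so treat the following as an illustrative sketch only:

```promql
# Warning: any failed jobs in a queue (critical above 5)
queue_jobs_failed > 0

# Warning: event loop lag above 100 ms (critical above 500 ms)
nodejs_eventloop_lag_seconds{job="re-workflow-backend"} > 0.1

# Critical: host memory usage above 85%
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
```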

## 🔧 Troubleshooting

### No Data Showing?

1. Check Prometheus Targets: open http://localhost:9090/targets
   - All targets should show "UP" status
2. Test Metric Availability: run `up{job="re-workflow-backend"}` in Prometheus; it should return `1`
3. Check Time Range: set it to "Last 15 minutes" in Grafana
4. Verify Backend: http://localhost:5000/metrics should show all metrics

### Metrics Not Updating?

  1. Backend: Ensure backend is running with metrics collection enabled
  2. Prometheus: Check scrape interval in prometheus.yml
  3. Queue Metrics: Verify queue metrics collection started (check backend logs for "Queue Metrics ")
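
If targets are up but panels still look stale, Prometheus itself can tell you how old the last backend sample is; a small sketch using the built-in `timestamp()` function:

```promql
# Seconds since the backend's metrics were last scraped
time() - timestamp(up{job="re-workflow-backend"})
```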

## 📚 Additional Resources