laxmanhalaki 9089e8c035 email template flow added with test account and templates for all cenerios		2025-12-04 20:58:32 +05:30

RE Workflow Monitoring Stack

Complete monitoring solution with Grafana, Prometheus, Loki, and Promtail for the RE Workflow Management System.

🏗️ Architecture

┌────────────────────────────────────────────────────────────────────────┐
│                         RE Workflow System                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    │
│  │  Node.js API    │────│   PostgreSQL    │────│     Redis       │    │
│  │  (Port 5000)    │    │   (Port 5432)   │    │  (Port 6379)    │    │
│  └────────┬────────┘    └─────────────────┘    └─────────────────┘    │
│           │                                                             │
│           │ /metrics endpoint                                          │
│           │ Log files (./logs/)                                        │
│           ▼                                                             │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                    Monitoring Stack                               │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │  │
│  │  │  Prometheus │──│    Loki     │──│       Promtail          │  │  │
│  │  │  (Port 9090)│  │ (Port 3100) │  │ (Collects log files)    │  │  │
│  │  └──────┬──────┘  └──────┬──────┘  └─────────────────────────┘  │  │
│  │         │                │                                        │  │
│  │         └────────┬───────┘                                        │  │
│  │                  ▼                                                 │  │
│  │         ┌─────────────────┐                                       │  │
│  │         │    Grafana      │                                       │  │
│  │         │  (Port 3001)    │◄── Pre-configured Dashboards          │  │
│  │         └─────────────────┘                                       │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────┘

📦 What's Included

The monitoring stack includes:

Redis - In-memory data store for BullMQ job queues
Prometheus - Metrics collection and storage
Grafana - Visualization and dashboards
Loki - Log aggregation
Promtail - Log shipping agent
Node Exporter - Host system metrics
Redis Exporter - Redis server metrics
Alertmanager - Alert routing and notifications

🚀 Quick Start

Prerequisites

Docker Desktop installed and running
WSL2 enabled (recommended for Windows)
Backend API running on port 5000

Step 1: Start Monitoring Stack

# Navigate to the monitoring folder (adjust this path to your local checkout)
cd C:\Laxman\Royal_Enfield\Re_Backend\monitoring

# Start all monitoring services
docker-compose -f docker-compose.monitoring.yml up -d

# Check status
docker ps

Step 2: Configure Backend Environment

Add these to your backend .env file:

# Loki configuration (for direct log shipping from Winston)
LOKI_HOST=http://localhost:3100

# Optional: Basic auth if enabled
# LOKI_USER=your_username
# LOKI_PASSWORD=your_password
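As a rough illustration of how the backend might consume these variables, the sketch below reads them into a small settings object. The `LokiSettings` shape and the `readLokiSettings` helper are hypothetical names, not taken from the actual backend. The environment is passed in as a parameter (in the backend this would normally be `process.env`) so the function can be exercised in isolation.

```typescript
// Hypothetical settings shape; the real backend's logger options may differ.
interface LokiSettings {
  host: string;
  basicAuth?: string; // "user:password", present only when both values are set
}

// Reads the Loki-related variables from an environment-like object.
// Returns null when LOKI_HOST is missing, meaning direct log shipping is off.
function readLokiSettings(
  env: Record<string, string | undefined>
): LokiSettings | null {
  const host = env.LOKI_HOST?.trim();
  if (!host) return null;

  // Drop trailing slashes so paths can be appended safely later.
  const settings: LokiSettings = { host: host.replace(/\/+$/, "") };

  const user = env.LOKI_USER;
  const password = env.LOKI_PASSWORD;
  if (user && password) {
    settings.basicAuth = `${user}:${password}`;
  }
  return settings;
}
```

The resulting object could then be handed to whichever Loki transport the Winston logger uses.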

Step 3: Access Dashboards

Service	URL	Credentials
Grafana	http://localhost:3001	admin / REWorkflow@2024
Prometheus	http://localhost:9090	-
Loki	http://localhost:3100	-
Alertmanager	http://localhost:9093	-
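To confirm that each service is actually up before opening the dashboards, it can help to probe their health endpoints. The sketch below only builds the URLs; the base URLs come from the table above, and the paths are the standard health/readiness endpoints of each tool (`/api/health` for Grafana, `/-/healthy` for Prometheus and Alertmanager, `/ready` for Loki). The `services` list and `healthUrls` helper are illustrative names.

```typescript
interface MonitoredService {
  name: string;
  baseUrl: string;
  healthPath: string;
}

// Base URLs match the access table; adjust them if the ports are remapped.
const services: MonitoredService[] = [
  { name: "Grafana", baseUrl: "http://localhost:3001", healthPath: "/api/health" },
  { name: "Prometheus", baseUrl: "http://localhost:9090", healthPath: "/-/healthy" },
  { name: "Loki", baseUrl: "http://localhost:3100", healthPath: "/ready" },
  { name: "Alertmanager", baseUrl: "http://localhost:9093", healthPath: "/-/healthy" },
];

// Returns one "name -> url" entry per service, ready to pass to curl or fetch.
function healthUrls(list: MonitoredService[]): Map<string, string> {
  const urls = new Map<string, string>();
  for (const svc of list) {
    urls.set(svc.name, `${svc.baseUrl}${svc.healthPath}`);
  }
  return urls;
}
```

Note that Loki's `/ready` may report not ready for a short while after the container starts.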

📊 Available Dashboards

RE Workflow Overview (Enhanced!)

URL: http://localhost:3001/d/re-workflow-overview

Sections:

📊 API Overview
- Request rate, error rate, response times
- HTTP status codes distribution
🔴 Redis & Queue Status (NEW!)
- Redis connection status (Up/Down)
- Redis active connections
- Redis memory usage
- TAT Queue waiting/failed jobs
- Pause/Resume Queue waiting/failed jobs
- All queues job status timeline
- Redis commands rate
💻 System Resources (NEW!)
- System CPU Usage (gauge)
- System Memory Usage (gauge)
- System Disk Usage (gauge)
- Disk Space Left (GB available)
🔄 Business Metrics
- Workflow operations
- TAT breaches
- Node.js process metrics

Refresh Rate: Auto-refresh every 30 seconds

1. RE Workflow Overview

Pre-configured dashboard with:

API Metrics: Request rate, error rate, latency percentiles
Logs Overview: Error count, warnings, TAT breaches
Node.js Runtime: Memory usage, event loop lag, CPU

2. Custom LogQL Queries

Purpose	Query
All errors	`{app="re-workflow"} | json | level="error"`
TAT breaches	`{app="re-workflow"} | json | tatEvent="breached"`
Auth failures	`{app="re-workflow"} | json | authEvent="auth_failure"`
Slow requests (>3s)	`{app="re-workflow"} | json | duration>3000`
By user	`{app="re-workflow"} | json | userId="USER-ID"`
By request	`{app="re-workflow"} | json | requestId="REQ-XXX"`
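All of these queries share one pattern: a stream selector on the `app` label, the `json` parser stage, then one or more label filters. A small helper can assemble them programmatically, for example when generating Grafana Explore links. This is a minimal sketch; `buildLogQL` and `LabelFilter` are hypothetical names, and only the operators used in the table are covered.

```typescript
interface LabelFilter {
  field: string;
  op: "=" | "!=" | ">" | "<";
  value: string | number;
}

// Escapes backslashes and double quotes for use inside a LogQL string literal.
function escapeLogQLString(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Builds: {app="<app>"} | json | <filter> | <filter> ...
function buildLogQL(app: string, filters: LabelFilter[]): string {
  const stages = filters.map((f) =>
    typeof f.value === "number"
      ? `${f.field}${f.op}${f.value}` // numeric comparison, unquoted
      : `${f.field}${f.op}"${escapeLogQLString(f.value)}"`
  );
  return [`{app="${escapeLogQLString(app)}"}`, "json", ...stages].join(" | ");
}

// Example: errors for one user
const errorsForUser = buildLogQL("re-workflow", [
  { field: "level", op: "=", value: "error" },
  { field: "userId", op: "=", value: "USER-ID" },
]);
```

Chaining several filters narrows the result set, since each stage only passes lines that matched the previous one.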

3. PromQL Queries (Prometheus)

Purpose	Query
Request rate	`rate(http_requests_total{job="re-workflow-backend"}[5m])`
Error rate	`sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m]))`
P95 latency	`histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`
Memory usage	`process_resident_memory_bytes{job="re-workflow-backend"}`
Event loop lag	`nodejs_eventloop_lag_seconds{job="re-workflow-backend"}`

📁 File Structure

monitoring/
├── docker-compose.monitoring.yml    # Main compose file
├── prometheus/
│   ├── prometheus.yml               # Prometheus configuration
│   └── alert.rules.yml              # Alert rules
├── loki/
│   └── loki-config.yml              # Loki configuration
├── promtail/
│   └── promtail-config.yml          # Promtail log shipper config
├── alertmanager/
│   └── alertmanager.yml             # Alert notification config
└── grafana/
    ├── provisioning/
    │   ├── datasources/
    │   │   └── datasources.yml      # Auto-configure data sources
    │   └── dashboards/
    │       └── dashboards.yml       # Dashboard provisioning
    └── dashboards/
        └── re-workflow-overview.json # Pre-built dashboard

🔧 Configuration

Prometheus Scrape Targets

Edit prometheus/prometheus.yml to add/modify scrape targets:

scrape_configs:
  - job_name: 're-workflow-backend'
    static_configs:
      # For local development (backend outside Docker)
      - targets: ['host.docker.internal:5000']
      # For Docker deployment (backend in Docker)
      # - targets: ['re_workflow_backend:5000']

Log Retention

Edit loki/loki-config.yml:

limits_config:
  retention_period: 15d  # Adjust retention period (Loki generally enforces this only when compactor retention is enabled)

Alert Notifications

Edit alertmanager/alertmanager.yml to configure:

Email notifications
Slack webhooks
Custom webhook endpoints

🛠️ Common Commands

# Start services
docker-compose -f docker-compose.monitoring.yml up -d

# Stop services
docker-compose -f docker-compose.monitoring.yml down

# View logs
docker-compose -f docker-compose.monitoring.yml logs -f

# View specific service logs
docker-compose -f docker-compose.monitoring.yml logs -f grafana

# Restart a service
docker-compose -f docker-compose.monitoring.yml restart prometheus

# Check service health
docker ps

# Remove all data (fresh start)
docker-compose -f docker-compose.monitoring.yml down -v

⚡ Metrics Exposed by Backend

The backend exposes these metrics at /metrics:

HTTP Metrics

http_requests_total - Total HTTP requests (by method, route, status)
http_request_duration_seconds - Request latency histogram
http_request_errors_total - Error count (4xx, 5xx)
http_active_connections - Current active connections

Business Metrics

tat_breaches_total - TAT breach events
pending_workflows_count - Pending workflow gauge
workflow_operations_total - Workflow operation count
auth_events_total - Authentication events

Node.js Runtime

nodejs_heap_size_* - Heap memory metrics
nodejs_eventloop_lag_* - Event loop lag
process_cpu_* - CPU usage
process_resident_memory_bytes - RSS memory
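One way to verify that the backend really exposes the metrics listed above is to fetch `/metrics` and check the text output for the expected names. The sketch below parses the Prometheus text exposition format just enough to collect sample names. `metricNames` and `missingMetrics` are hypothetical helpers, and the input is assumed to be the raw body of the `/metrics` response.

```typescript
// Collects the sample names found in Prometheus text exposition output.
// Comment lines (# HELP / # TYPE) and blank lines are skipped.
function metricNames(exposition: string): Set<string> {
  const names = new Set<string>();
  for (const rawLine of exposition.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const match = /^[a-zA-Z_:][a-zA-Z0-9_:]*/.exec(line);
    if (match) names.add(match[0]);
  }
  return names;
}

// Returns the expected metrics that do not appear in the output.
// Histograms are exposed as <name>_bucket, <name>_sum and <name>_count,
// so those suffixes count as the base name being present.
function missingMetrics(exposition: string, expected: string[]): string[] {
  const present = metricNames(exposition);
  return expected.filter(
    (name) =>
      !present.has(name) &&
      !present.has(`${name}_bucket`) &&
      !present.has(`${name}_count`)
  );
}
```

An empty result means every expected metric was found. Counters that have never been incremented may be absent until the first event occurs, depending on how the backend registers them.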

🔒 Security Notes

Change default passwords in production
Enable TLS for external access
Configure firewall to restrict access to monitoring ports
Use a reverse proxy (nginx) for HTTPS

🐛 Troubleshooting

Prometheus can't scrape backend

Ensure backend is running on port 5000
Check /metrics endpoint: curl http://localhost:5000/metrics
For Docker: use host.docker.internal:5000

Logs not appearing in Loki

Check Promtail logs: docker logs re_promtail
Verify log file path in promtail-config.yml
Ensure backend has LOKI_HOST configured

Grafana dashboards empty

Wait 30-60 seconds for data collection
Check data source configuration in Grafana
Verify time range selection

Docker memory issues

# Increase Docker Desktop memory allocation
# Settings → Resources → Memory → 4GB+

📞 Support

For issues with the monitoring stack:

Check container logs: docker logs <container_name>
Verify the syntax of the configuration files
Ensure Docker Desktop is running with sufficient resources