# RE Workflow Monitoring Stack

Complete monitoring solution with **Grafana**, **Prometheus**, **Loki**, and **Promtail** for the RE Workflow Management System.

## 🏗️ Architecture

```
┌────────────────────────────────────────────────────────────────────────┐
│                         RE Workflow System                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    │
│  │  Node.js API    │────│   PostgreSQL    │────│     Redis       │    │
│  │  (Port 5000)    │    │   (Port 5432)   │    │  (Port 6379)    │    │
│  └────────┬────────┘    └─────────────────┘    └─────────────────┘    │
│           │                                                             │
│           │ /metrics endpoint                                          │
│           │ Log files (./logs/)                                        │
│           ▼                                                             │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                    Monitoring Stack                               │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │  │
│  │  │  Prometheus │──│    Loki     │──│       Promtail          │  │  │
│  │  │  (Port 9090)│  │ (Port 3100) │  │ (Collects log files)    │  │  │
│  │  └──────┬──────┘  └──────┬──────┘  └─────────────────────────┘  │  │
│  │         │                │                                        │  │
│  │         └────────┬───────┘                                        │  │
│  │                  ▼                                                 │  │
│  │         ┌─────────────────┐                                       │  │
│  │         │    Grafana      │                                       │  │
│  │         │  (Port 3001)    │◄── Pre-configured Dashboards          │  │
│  │         └─────────────────┘                                       │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────┘
```

## 📦 What's Included

The monitoring stack includes:
- **Redis** - In-memory data store for BullMQ job queues
- **Prometheus** - Metrics collection and storage
- **Grafana** - Visualization and dashboards
- **Loki** - Log aggregation
- **Promtail** - Log shipping agent
- **Node Exporter** - Host system metrics
- **Redis Exporter** - Redis server metrics
- **Alertmanager** - Alert routing and notifications

## 🚀 Quick Start

### Prerequisites

- **Docker Desktop** installed and running
- **WSL2** enabled (recommended for Windows)
- Backend API running on port 5000

### Step 1: Start Monitoring Stack

```powershell
# Navigate to the monitoring folder (adjust the path to match your checkout)
cd C:\Laxman\Royal_Enfield\Re_Backend\monitoring

# Start all monitoring services
docker-compose -f docker-compose.monitoring.yml up -d

# Check status
docker ps
```

### Step 2: Configure Backend Environment

Add these to your backend `.env` file:

```env
# Loki configuration (for direct log shipping from Winston)
LOKI_HOST=http://localhost:3100

# Optional: Basic auth if enabled
# LOKI_USER=your_username
# LOKI_PASSWORD=your_password
```

### Step 3: Access Dashboards

| Service | URL | Credentials |
|---------|-----|-------------|
| **Grafana** | http://localhost:3001 | admin / REWorkflow@2024 |
| **Prometheus** | http://localhost:9090 | - |
| **Loki** | http://localhost:3100 | - |
| **Alertmanager** | http://localhost:9093 | - |
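
To confirm that each service is reachable after startup, the health endpoints can be probed from a script. The following is a minimal sketch and not part of this repository. It assumes the default ports from the table above and the standard health paths of each tool (`/api/health` for Grafana, `/-/healthy` for Prometheus and Alertmanager, `/ready` for Loki). The names `health_urls` and `is_up` are hypothetical.

```python
import urllib.error
import urllib.request

# Assumed defaults: ports from the table above, standard health paths.
HEALTH_PATHS = {
    "grafana": (3001, "/api/health"),
    "prometheus": (9090, "/-/healthy"),
    "loki": (3100, "/ready"),
    "alertmanager": (9093, "/-/healthy"),
}


def health_urls(host="localhost"):
    """Return {service: health-check URL} for the monitoring stack."""
    return {
        name: f"http://{host}:{port}{path}"
        for name, (port, path) in HEALTH_PATHS.items()
    }


def is_up(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Calling `is_up(url)` for each entry of `health_urls()` gives a quick up/down summary. Loki may report as not ready for a short time after the container starts.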

## 📊 Available Dashboards

### 1. RE Workflow Overview
**URL**: http://localhost:3001/d/re-workflow-overview

Pre-configured dashboard with the following sections:

1. **📊 API Overview**
   - Request rate, error rate, response times (latency percentiles)
   - HTTP status code distribution

2. **🔴 Redis & Queue Status**
   - Redis connection status (Up/Down)
   - Redis active connections
   - Redis memory usage
   - TAT Queue waiting/failed jobs
   - Pause/Resume Queue waiting/failed jobs
   - All queues job status timeline
   - Redis commands rate

3. **💻 System Resources**
   - System CPU Usage (gauge)
   - System Memory Usage (gauge)
   - System Disk Usage (gauge)
   - Disk Space Left (GB available)

4. **🔄 Business Metrics**
   - Workflow operations
   - TAT breaches
   - Node.js process metrics (memory usage, event loop lag, CPU)

The dashboard also includes a **Logs Overview** with error count, warnings, and TAT breaches.

**Refresh Rate**: Auto-refresh every 30 seconds

### 2. Custom LogQL Queries

| Purpose | Query |
|---------|-------|
| All errors | `{app="re-workflow"} \| json \| level="error"` |
| TAT breaches | `{app="re-workflow"} \| json \| tatEvent="breached"` |
| Auth failures | `{app="re-workflow"} \| json \| authEvent="auth_failure"` |
| Slow requests (>3s) | `{app="re-workflow"} \| json \| duration>3000` |
| By user | `{app="re-workflow"} \| json \| userId="USER-ID"` |
| By request | `{app="re-workflow"} \| json \| requestId="REQ-XXX"` |
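
The same queries can be run outside Grafana through Loki's HTTP API. The sketch below builds a URL for the `query_range` endpoint. It assumes Loki is reachable at `http://localhost:3100` as configured above; the function name is hypothetical. Note that the `\|` in the table is only Markdown escaping, so the query sent to Loki uses a plain `|`.

```python
from urllib.parse import urlencode


def loki_query_range_url(logql, start_ns, end_ns,
                         base="http://localhost:3100", limit=100):
    """Build a URL for Loki's query_range endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base}/loki/api/v1/query_range?{params}"


# Example query from the table above (all errors):
#   loki_query_range_url('{app="re-workflow"} | json | level="error"',
#                        start_ns, end_ns)
```

The resulting URL can be fetched with any HTTP client; the response is JSON with the matching log streams.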

### 3. PromQL Queries (Prometheus)

| Purpose | Query |
|---------|-------|
| Request rate | `rate(http_requests_total{job="re-workflow-backend"}[5m])` |
| Error rate | `sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m]))` |
| P95 latency | `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` |
| Memory usage | `process_resident_memory_bytes{job="re-workflow-backend"}` |
| Event loop lag | `nodejs_eventloop_lag_seconds{job="re-workflow-backend"}` |

## 📁 File Structure

```
monitoring/
├── docker-compose.monitoring.yml    # Main compose file
├── prometheus/
│   ├── prometheus.yml               # Prometheus configuration
│   └── alert.rules.yml              # Alert rules
├── loki/
│   └── loki-config.yml              # Loki configuration
├── promtail/
│   └── promtail-config.yml          # Promtail log shipper config
├── alertmanager/
│   └── alertmanager.yml             # Alert notification config
└── grafana/
    ├── provisioning/
    │   ├── datasources/
    │   │   └── datasources.yml      # Auto-configure data sources
    │   └── dashboards/
    │       └── dashboards.yml       # Dashboard provisioning
    └── dashboards/
        └── re-workflow-overview.json # Pre-built dashboard
```

## 🔧 Configuration

### Prometheus Scrape Targets

Edit `prometheus/prometheus.yml` to add/modify scrape targets:

```yaml
scrape_configs:
  - job_name: 're-workflow-backend'
    static_configs:
      # For local development (backend outside Docker)
      - targets: ['host.docker.internal:5000']
      # For Docker deployment (backend in Docker)
      # - targets: ['re_workflow_backend:5000']
```

### Log Retention

Edit `loki/loki-config.yml`:

```yaml
limits_config:
  retention_period: 15d  # Adjust retention period
```

In current Loki releases, retention is enforced by the compactor, so `retention_period` only takes effect if the compactor section of the same file sets `retention_enabled: true`.

### Alert Notifications

Edit `alertmanager/alertmanager.yml` to configure:
- **Email** notifications
- **Slack** webhooks
- **Custom** webhook endpoints
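
For a custom webhook endpoint, Alertmanager sends a JSON body with an `alerts` list, where each alert carries `status`, `labels`, and `annotations`. The sketch below shows one way a receiver might turn that body into readable lines. It is an illustration only: the function name is hypothetical, and the `severity` label and `summary` annotation are common conventions that are assumed here, not something this repository's alert rules are known to set.

```python
import json


def summarize_alerts(body):
    """Turn an Alertmanager webhook body (JSON text) into one line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown")      # "firing" or "resolved"
        name = labels.get("alertname", "unknown")
        severity = labels.get("severity", "none")    # assumed label
        summary = annotations.get("summary", "")     # assumed annotation
        lines.append(f"[{status}] {name} ({severity}) {summary}".rstrip())
    return lines
```

A receiver would call this on the request body of each POST from Alertmanager and forward the lines to its own channel.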

## 🛠️ Common Commands

```powershell
# Start services
docker-compose -f docker-compose.monitoring.yml up -d

# Stop services
docker-compose -f docker-compose.monitoring.yml down

# View logs
docker-compose -f docker-compose.monitoring.yml logs -f

# View specific service logs
docker-compose -f docker-compose.monitoring.yml logs -f grafana

# Restart a service
docker-compose -f docker-compose.monitoring.yml restart prometheus

# Check service health
docker ps

# Remove all data (fresh start)
docker-compose -f docker-compose.monitoring.yml down -v
```

## ⚡ Metrics Exposed by Backend

The backend exposes these metrics at `/metrics`:

### HTTP Metrics
- `http_requests_total` - Total HTTP requests (by method, route, status)
- `http_request_duration_seconds` - Request latency histogram
- `http_request_errors_total` - Error count (4xx, 5xx)
- `http_active_connections` - Current active connections

### Business Metrics
- `tat_breaches_total` - TAT breach events
- `pending_workflows_count` - Pending workflow gauge
- `workflow_operations_total` - Workflow operation count
- `auth_events_total` - Authentication events

### Node.js Runtime
- `nodejs_heap_size_*` - Heap memory metrics
- `nodejs_eventloop_lag_*` - Event loop lag
- `process_cpu_*` - CPU usage
- `process_resident_memory_bytes` - RSS memory
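
To check that the backend actually exposes the metrics listed above, the text returned by `/metrics` can be scanned for their names. The sketch below is a minimal parser for the Prometheus text format; the function names are hypothetical. Keep in mind that a histogram such as `http_request_duration_seconds` appears in the output as `_bucket`, `_sum`, and `_count` series.

```python
def metric_names(exposition_text):
    """Return the set of sample names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # A sample line is `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def missing_metrics(exposition_text, expected):
    """Return the expected names that are absent from the output, sorted."""
    return sorted(set(expected) - metric_names(exposition_text))
```

Feeding it the output of `curl http://localhost:5000/metrics` together with the names from the lists above shows which metrics are not being exported.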

## 🔒 Security Notes

1. **Change default passwords** in production
2. **Enable TLS** for external access
3. **Configure firewall** to restrict access to monitoring ports
4. **Use reverse proxy** (nginx) for HTTPS

## 🐛 Troubleshooting

### Prometheus can't scrape backend
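1. Ensure the backend is running on port 5000
2. Check the `/metrics` endpoint from the host: `curl http://localhost:5000/metrics`
3. If Prometheus runs in Docker and the backend runs on the host, the scrape target must be `host.docker.internal:5000`, not `localhost:5000`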
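
Prometheus also reports why a scrape fails. Its `/api/v1/targets` endpoint (also visible under Status → Targets in the UI) lists each target with a `health` field and a `lastError` message. The sketch below filters that response for unhealthy targets; the function name is hypothetical, and the job name is taken from the scrape config shown earlier.

```python
import json


def unhealthy_targets(response_text, job="re-workflow-backend"):
    """List (scrape URL, last error) for targets of `job` that are not 'up'."""
    payload = json.loads(response_text)
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target["labels"].get("job") != job:
            continue
        if target["health"] != "up":
            problems.append((target["scrapeUrl"], target.get("lastError", "")))
    return problems
```

An empty result for a job that should exist usually means the job is not in `prometheus.yml` at all, so it is worth checking the job name as well.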

### Logs not appearing in Loki
1. Check Promtail logs: `docker logs re_promtail`
2. Verify log file path in `promtail-config.yml`
3. Ensure backend has `LOKI_HOST` configured
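
To separate shipping problems from ingestion problems, a test line can be pushed to Loki directly. The sketch below builds the JSON body for Loki's push endpoint (`/loki/api/v1/push`). The function name is hypothetical, and the `app="re-workflow"` label is assumed from the LogQL examples in this README.

```python
import json
import time


def loki_push_body(line, labels, ts_ns=None):
    """Build the JSON body for Loki's push endpoint (/loki/api/v1/push)."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    stream = {
        "stream": dict(labels),
        # Each entry is [timestamp in nanoseconds as a string, log line].
        "values": [[str(ts_ns), line]],
    }
    return json.dumps({"streams": [stream]})


# Example: loki_push_body("monitoring test line", {"app": "re-workflow"})
```

POST the body to `http://localhost:3100/loki/api/v1/push` with `Content-Type: application/json`. If the line then shows up for `{app="re-workflow"}` in Grafana Explore, Loki itself is working and the problem is on the Promtail or backend side.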

### Grafana dashboards empty
1. Wait 30-60 seconds for data collection
2. Check data source configuration in Grafana
3. Verify time range selection

### Docker memory issues
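Increase the Docker Desktop memory allocation to 4GB or more under Settings → Resources → Memory. With the WSL2 backend, Docker Desktop does not offer this slider; the limit is set through the `memory` option in `%UserProfile%\.wslconfig` instead.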

## 📞 Support

For issues with the monitoring stack:
1. Check container logs: `docker logs <container_name>`
2. Verify the syntax of the configuration files
3. Ensure Docker Desktop is running with sufficient resources