# RE Workflow Monitoring Stack
Complete monitoring solution with **Grafana**, **Prometheus**, **Loki**, and **Promtail** for the RE Workflow Management System.
## 🏗️ Architecture
```
┌────────────────────────────────────────────────────────────────────────┐
│                           RE Workflow System                           │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐     │
│  │   Node.js API   │────│   PostgreSQL    │────│      Redis      │     │
│  │   (Port 5000)   │    │   (Port 5432)   │    │   (Port 6379)   │     │
│  └────────┬────────┘    └─────────────────┘    └─────────────────┘     │
│           │                                                            │
│           │  /metrics endpoint                                         │
│           │  Log files (./logs/)                                       │
│           ▼                                                            │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                         Monitoring Stack                         │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐   │  │
│  │  │ Prometheus  │──│    Loki     │──│        Promtail         │   │  │
│  │  │ (Port 9090) │  │ (Port 3100) │  │  (Collects log files)   │   │  │
│  │  └──────┬──────┘  └──────┬──────┘  └─────────────────────────┘   │  │
│  │         │                │                                       │  │
│  │         └────────┬───────┘                                       │  │
│  │                  ▼                                               │  │
│  │         ┌─────────────────┐                                      │  │
│  │         │     Grafana     │                                      │  │
│  │         │   (Port 3001)   │◄── Pre-configured Dashboards         │  │
│  │         └─────────────────┘                                      │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────┘
```
## 🚀 Quick Start
### Prerequisites
- **Docker Desktop** installed and running
- **WSL2** enabled (recommended for Windows)
- Backend API running on port 5000
### Step 1: Start Monitoring Stack
```powershell
# Navigate to the monitoring folder (adjust the path to your local checkout)
cd C:\Laxman\Royal_Enfield\Re_Backend\monitoring
# Start all monitoring services
docker-compose -f docker-compose.monitoring.yml up -d
# Check status
docker ps
```
### Step 2: Configure Backend Environment
Add these to your backend `.env` file:
```env
# Loki configuration (for direct log shipping from Winston)
LOKI_HOST=http://localhost:3100
# Optional: Basic auth if enabled
# LOKI_USER=your_username
# LOKI_PASSWORD=your_password
```
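
How the backend ships logs is not shown in this README, so the following is only a rough sketch of what a direct push to Loki could look like, assuming Loki's HTTP push endpoint (`/loki/api/v1/push`). The helper name `buildLokiPush` and the `LokiEnv` shape are hypothetical; the `app="re-workflow"` label is taken from the LogQL queries in this document.

```typescript
// Hypothetical helper for illustration; not part of the backend codebase.
interface LokiEnv {
  LOKI_HOST?: string;
  LOKI_USER?: string;
  LOKI_PASSWORD?: string;
}

interface LokiPushRequest {
  url: string;
  auth?: { username: string; password: string };
  body: { streams: { stream: Record<string, string>; values: [string, string][] }[] };
}

function buildLokiPush(
  env: LokiEnv,
  labels: Record<string, string>,
  entry: Record<string, unknown>,
  timestampMs: number,
): LokiPushRequest {
  // Fall back to the local Loki port used by this stack and drop trailing slashes.
  const host = (env.LOKI_HOST ?? "http://localhost:3100").replace(/\/+$/, "");
  // Loki expects each timestamp as a string of nanoseconds since the Unix epoch.
  const timestampNs = `${Math.trunc(timestampMs)}000000`;
  const request: LokiPushRequest = {
    url: `${host}/loki/api/v1/push`,
    body: { streams: [{ stream: labels, values: [[timestampNs, JSON.stringify(entry)]] }] },
  };
  // Basic auth is only attached when both optional variables are set.
  if (env.LOKI_USER && env.LOKI_PASSWORD) {
    request.auth = { username: env.LOKI_USER, password: env.LOKI_PASSWORD };
  }
  return request;
}
```

Calling `buildLokiPush({ LOKI_HOST: "http://localhost:3100" }, { app: "re-workflow" }, { level: "error", message: "boom" }, Date.now())` returns the URL and JSON body that would be sent with `Content-Type: application/json`. Serializing the entry as JSON is what lets the `| json` stage in the LogQL queries extract fields such as `level`.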
### Step 3: Access Dashboards
| Service | URL | Credentials |
|---------|-----|-------------|
| **Grafana** | http://localhost:3001 | admin / REWorkflow@2024 |
| **Prometheus** | http://localhost:9090 | - |
| **Loki** | http://localhost:3100 | - |
| **Alertmanager** | http://localhost:9093 | - |
## 📊 Available Dashboards
### 1. RE Workflow Overview
Pre-configured dashboard with:
- **API Metrics**: Request rate, error rate, latency percentiles
- **Logs Overview**: Error count, warnings, TAT breaches
- **Node.js Runtime**: Memory usage, event loop lag, CPU
### 2. Custom LogQL Queries
| Purpose | Query |
|---------|-------|
| All errors | `{app="re-workflow"} \| json \| level="error"` |
| TAT breaches | `{app="re-workflow"} \| json \| tatEvent="breached"` |
| Auth failures | `{app="re-workflow"} \| json \| authEvent="auth_failure"` |
| Slow requests (>3s) | `{app="re-workflow"} \| json \| duration>3000` |
| By user | `{app="re-workflow"} \| json \| userId="USER-ID"` |
| By request | `{app="re-workflow"} \| json \| requestId="REQ-XXX"` |
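
The queries above can also be run outside Grafana through Loki's `query_range` HTTP endpoint. The sketch below only builds the request URL; `lokiQueryRangeUrl` is a hypothetical helper name. Note that the `\|` in the table is Markdown escaping, so the actual query uses a plain `|`.

```typescript
// Hypothetical helper: builds a Loki /loki/api/v1/query_range URL for a LogQL query.
function lokiQueryRangeUrl(
  lokiHost: string,
  logql: string,
  startMs: number,
  endMs: number,
  limit = 100,
): string {
  const base = lokiHost.replace(/\/+$/, "");
  // start and end are passed as nanosecond Unix timestamps.
  const params: [string, string][] = [
    ["query", logql],
    ["start", `${Math.trunc(startMs)}000000`],
    ["end", `${Math.trunc(endMs)}000000`],
    ["limit", String(limit)],
  ];
  const qs = params.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join("&");
  return `${base}/loki/api/v1/query_range?${qs}`;
}
```

For example, `lokiQueryRangeUrl("http://localhost:3100", '{app="re-workflow"} | json | level="error"', Date.now() - 3600000, Date.now())` gives a URL that returns the last hour of error logs when fetched with `curl`.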
### 3. PromQL Queries (Prometheus)
| Purpose | Query |
|---------|-------|
| Request rate | `rate(http_requests_total{job="re-workflow-backend"}[5m])` |
| Error rate | `sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m]))` |
| P95 latency | `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` |
| Memory usage | `process_resident_memory_bytes{job="re-workflow-backend"}` |
| Event loop lag | `nodejs_eventloop_lag_seconds{job="re-workflow-backend"}` |
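
To make the P95 query less of a black box, here is a simplified sketch of the estimate that `histogram_quantile` produces from cumulative `le` buckets: find the bucket containing the target rank, then interpolate linearly inside it. The function name and the sample bucket counts are made up for illustration, and Prometheus handles further edge cases that are omitted here.

```typescript
// Simplified illustration of quantile estimation over cumulative histogram buckets.
interface HistogramBucket {
  le: number; // upper bound of the bucket; Infinity for the "+Inf" bucket
  count: number; // cumulative number of observations <= le
}

function histogramQuantile(q: number, buckets: HistogramBucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // Find the first bucket whose cumulative count reaches the target rank.
  let i = 0;
  while (i < sorted.length - 1 && sorted[i].count < rank) i++;
  const upper = sorted[i].le;
  const lower = i === 0 ? 0 : sorted[i - 1].le;
  // A rank that falls into the +Inf bucket is reported as the highest finite bound.
  if (upper === Infinity) return lower;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  if (inBucket === 0) return upper;
  // Assume observations are spread evenly within the bucket.
  return lower + (upper - lower) * ((rank - below) / inBucket);
}
```

With buckets `0.1 → 50`, `0.5 → 90`, `1 → 98`, `+Inf → 100`, the 95th percentile has rank 95, which lands in the `(0.5, 1]` bucket and interpolates to 0.8125 seconds. The estimate can be no more precise than the bucket boundaries chosen in the backend.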
## 📁 File Structure
```
monitoring/
├── docker-compose.monitoring.yml        # Main compose file
├── prometheus/
│   ├── prometheus.yml                   # Prometheus configuration
│   └── alert.rules.yml                  # Alert rules
├── loki/
│   └── loki-config.yml                  # Loki configuration
├── promtail/
│   └── promtail-config.yml              # Promtail log shipper config
├── alertmanager/
│   └── alertmanager.yml                 # Alert notification config
└── grafana/
    ├── provisioning/
    │   ├── datasources/
    │   │   └── datasources.yml          # Auto-configure data sources
    │   └── dashboards/
    │       └── dashboards.yml           # Dashboard provisioning
    └── dashboards/
        └── re-workflow-overview.json    # Pre-built dashboard
```
## 🔧 Configuration
### Prometheus Scrape Targets
Edit `prometheus/prometheus.yml` to add or modify scrape targets:
```yaml
scrape_configs:
  - job_name: 're-workflow-backend'
    static_configs:
      # For local development (backend outside Docker)
      - targets: ['host.docker.internal:5000']
      # For Docker deployment (backend in Docker)
      # - targets: ['re_workflow_backend:5000']
```
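
A scrape target is only a `host:port` pair; Prometheus fills in the rest from the job settings, which default to the `http` scheme and the `/metrics` path. The sketch below shows how the final URL is assembled. `ScrapeJob` and `scrapeUrls` are illustrative names, not Prometheus APIs.

```typescript
// Illustrative model of a scrape job; field names mirror prometheus.yml keys.
interface ScrapeJob {
  job_name: string;
  scheme?: string; // Prometheus default: "http"
  metrics_path?: string; // Prometheus default: "/metrics"
  static_configs: { targets: string[] }[];
}

// Returns the URLs Prometheus would request for every static target of the job.
function scrapeUrls(job: ScrapeJob): string[] {
  const scheme = job.scheme ?? "http";
  const path = job.metrics_path ?? "/metrics";
  const urls: string[] = [];
  for (const cfg of job.static_configs) {
    for (const target of cfg.targets) {
      urls.push(`${scheme}://${target}${path}`);
    }
  }
  return urls;
}
```

For the job above this yields `http://host.docker.internal:5000/metrics`, which is the address that has to be reachable from inside the Prometheus container.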
### Log Retention
Edit `loki/loki-config.yml` (Loki only enforces this limit when retention is enabled on the compactor):
```yaml
limits_config:
  retention_period: 15d   # Adjust retention period
```
### Alert Notifications
Edit `alertmanager/alertmanager.yml` to configure:
- **Email** notifications
- **Slack** webhooks
- **Custom** webhook endpoints
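
For the custom webhook option, Alertmanager sends a JSON document containing an `alerts` array to the configured URL. The sketch below shows how a receiving service might condense that payload into one line per alert. Only a subset of the payload fields is modelled, and the function name, receiver name and alert names are hypothetical.

```typescript
// Subset of the Alertmanager webhook payload that this sketch relies on.
interface WebhookAlert {
  status: "firing" | "resolved";
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string;
}

interface AlertmanagerWebhook {
  status: "firing" | "resolved";
  receiver: string;
  alerts: WebhookAlert[];
}

// Produces one human-readable line per alert, e.g. for a chat message or a log entry.
function summarizeAlerts(payload: AlertmanagerWebhook): string[] {
  return payload.alerts.map((alert) => {
    const alertName = alert.labels["alertname"] || "unknown";
    const severity = alert.labels["severity"] || "none";
    const summary = alert.annotations["summary"] || "";
    return `[${alert.status.toUpperCase()}] ${alertName} (severity=${severity}) ${summary}`.trim();
  });
}
```

A firing alert labelled `alertname="HighErrorRate"` and `severity="critical"` with the summary annotation `Error rate above 5%` becomes `[FIRING] HighErrorRate (severity=critical) Error rate above 5%`.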
## 🛠️ Common Commands
```powershell
# Start services
docker-compose -f docker-compose.monitoring.yml up -d
# Stop services
docker-compose -f docker-compose.monitoring.yml down
# View logs
docker-compose -f docker-compose.monitoring.yml logs -f
# View specific service logs
docker-compose -f docker-compose.monitoring.yml logs -f grafana
# Restart a service
docker-compose -f docker-compose.monitoring.yml restart prometheus
# Check service health
docker ps
# Remove all data (fresh start)
docker-compose -f docker-compose.monitoring.yml down -v
```
## ⚡ Metrics Exposed by Backend
The backend exposes these metrics at `/metrics`:
### HTTP Metrics
- `http_requests_total` - Total HTTP requests (by method, route, status)
- `http_request_duration_seconds` - Request latency histogram
- `http_request_errors_total` - Error count (4xx, 5xx)
- `http_active_connections` - Current active connections
### Business Metrics
- `tat_breaches_total` - TAT breach events
- `pending_workflows_count` - Pending workflow gauge
- `workflow_operations_total` - Workflow operation count
- `auth_events_total` - Authentication events
### Node.js Runtime
- `nodejs_heap_size_*` - Heap memory metrics
- `nodejs_eventloop_lag_*` - Event loop lag
- `process_cpu_*` - CPU usage
- `process_resident_memory_bytes` - RSS memory
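
What Prometheus actually reads from `/metrics` is plain text in the Prometheus exposition format. The backend's metrics library generates it; the sketch below only illustrates how a labelled counter such as `http_requests_total` is laid out. `renderCounter` and the `/api/health` route are invented for the example.

```typescript
// Illustration of the Prometheus text exposition format for a counter.
interface CounterSample {
  labels: Record<string, string>;
  value: number;
}

// Label values must escape backslashes, double quotes and newlines.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function renderCounter(metricName: string, help: string, samples: CounterSample[]): string {
  const lines = [`# HELP ${metricName} ${help}`, `# TYPE ${metricName} counter`];
  for (const sample of samples) {
    const labelText = Object.keys(sample.labels)
      .sort()
      .map((key) => `${key}="${escapeLabelValue(sample.labels[key])}"`)
      .join(",");
    lines.push(
      labelText ? `${metricName}{${labelText}} ${sample.value}` : `${metricName} ${sample.value}`,
    );
  }
  return lines.join("\n") + "\n";
}
```

Rendering one sample with `method="GET"`, `route="/api/health"`, `status="200"` and value 42 produces a `# HELP` line, a `# TYPE ... counter` line, and `http_requests_total{method="GET",route="/api/health",status="200"} 42`. Each distinct label combination is a separate time series, which is why high-cardinality labels such as raw URLs or user IDs are best avoided.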
## 🔒 Security Notes
1. **Change default passwords** in production
2. **Enable TLS** for external access
3. **Configure firewall** to restrict access to monitoring ports
4. **Use reverse proxy** (nginx) for HTTPS
## 🐛 Troubleshooting
### Prometheus can't scrape backend
1. Ensure backend is running on port 5000
2. Check `/metrics` endpoint: `curl http://localhost:5000/metrics`
3. If Prometheus runs in Docker and the backend runs on the host, use `host.docker.internal:5000` as the scrape target
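
When the endpoint responds but panels stay empty, it can help to check that the expected metric names are actually present in the output. The sketch below scans the text returned by `/metrics`; `missingMetrics` is a hypothetical helper, and histogram metrics are matched through their `_bucket` or `_count` series.

```typescript
// Hypothetical check: which expected metric names are absent from /metrics output?
function missingMetrics(expositionText: string, expected: string[]): string[] {
  const seen: Record<string, boolean> = {};
  for (const rawLine of expositionText.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comments.
    if (line === "" || line.charAt(0) === "#") continue;
    // The metric name is everything before the first "{" or space.
    const match = /^[a-zA-Z_:][a-zA-Z0-9_:]*/.exec(line);
    if (match) seen[match[0]] = true;
  }
  return expected.filter(
    (metric) => !(seen[metric] || seen[`${metric}_bucket`] || seen[`${metric}_count`]),
  );
}
```

Passing the body of `curl http://localhost:5000/metrics` together with names such as `http_requests_total` and `tat_breaches_total` returns the ones that were not found. Depending on how the backend registers its metrics, a labelled counter may not appear until it has been incremented at least once, so a missing name is a hint rather than proof of a bug.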
### Logs not appearing in Loki
1. Check Promtail logs: `docker logs re_promtail`
2. Verify log file path in `promtail-config.yml`
3. Ensure backend has `LOKI_HOST` configured
### Grafana dashboards empty
1. Wait 30-60 seconds for data collection
2. Check data source configuration in Grafana
3. Verify time range selection
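
Before digging into dashboards, it is worth confirming that every service answers on its health or readiness endpoint. The sketch below lists those URLs for the ports used in this README; `healthCheckUrls` is a hypothetical helper, and the paths are the standard endpoints of each tool rather than anything defined by this stack.

```typescript
// Hypothetical helper listing health/readiness URLs for the services in this stack.
interface ServiceHealthCheck {
  service: string;
  url: string;
}

function healthCheckUrls(host = "localhost"): ServiceHealthCheck[] {
  return [
    { service: "Grafana", url: `http://${host}:3001/api/health` },
    { service: "Prometheus", url: `http://${host}:9090/-/ready` },
    { service: "Loki", url: `http://${host}:3100/ready` },
    { service: "Alertmanager", url: `http://${host}:9093/-/ready` },
  ];
}
```

Each URL can be requested with `curl`; an HTTP 200 response means the service is up, after which the remaining suspects are the Grafana data source settings and the selected time range.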
### Docker memory issues
```powershell
# Increase Docker Desktop memory allocation
# Settings → Resources → Memory → 4GB+
```
## 📞 Support
For issues with the monitoring stack:
1. Check container logs: `docker logs <container_name>`
2. Verify the syntax of the configuration files
3. Ensure Docker Desktop is running with sufficient resources