# RE Workflow Monitoring Stack

Complete monitoring solution with **Grafana**, **Prometheus**, **Loki**, and **Promtail** for the RE Workflow Management System.

## 🏗️ Architecture

```
┌────────────────────────────────────────────────────────────────────────┐
│                           RE Workflow System                           │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐     │
│  │   Node.js API   │────│   PostgreSQL    │────│      Redis      │     │
│  │   (Port 5000)   │    │   (Port 5432)   │    │   (Port 6379)   │     │
│  └────────┬────────┘    └─────────────────┘    └─────────────────┘     │
│           │                                                            │
│           │  /metrics endpoint                                         │
│           │  Log files (./logs/)                                       │
│           ▼                                                            │
│  ┌─────────────────────────────────────────────────────────────────┐   │
│  │                        Monitoring Stack                         │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │   │
│  │  │ Prometheus  │──│    Loki     │──│        Promtail         │  │   │
│  │  │ (Port 9090) │  │ (Port 3100) │  │  (Collects log files)   │  │   │
│  │  └──────┬──────┘  └──────┬──────┘  └─────────────────────────┘  │   │
│  │         │                │                                      │   │
│  │         └────────┬───────┘                                      │   │
│  │                  ▼                                              │   │
│  │         ┌─────────────────┐                                     │   │
│  │         │     Grafana     │                                     │   │
│  │         │   (Port 3001)   │◄── Pre-configured Dashboards        │   │
│  │         └─────────────────┘                                     │   │
│  └─────────────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────────────┘
```

## 📦 What's Included

The monitoring stack includes:

- **Redis** - In-memory data store for BullMQ job queues
- **Prometheus** - Metrics collection and storage
- **Grafana** - Visualization and dashboards
- **Loki** - Log aggregation
- **Promtail** - Log shipping agent
- **Node Exporter** - Host system metrics
- **Redis Exporter** - Redis server metrics
- **Alertmanager** - Alert routing and notifications

## 🚀 Quick Start

### Prerequisites

- **Docker Desktop** installed and running
- **WSL2** enabled (recommended for Windows)
- Backend API running on port 5000

### Step 1: Start Monitoring Stack

```powershell
# Navigate to monitoring folder
cd C:\Laxman\Royal_Enfield\Re_Backend\monitoring

# Start all monitoring services
docker-compose -f docker-compose.monitoring.yml up -d

# Check status
docker ps
```

### Step 2: Configure Backend Environment

Add these to your backend `.env` file:

```env
# Loki configuration (for direct log shipping from Winston)
LOKI_HOST=http://localhost:3100

# Optional: Basic auth if enabled
# LOKI_USER=your_username
# LOKI_PASSWORD=your_password
```

### Step 3: Access Dashboards

| Service | URL | Credentials |
|---------|-----|-------------|
| **Grafana** | http://localhost:3001 | admin / REWorkflow@2024 |
| **Prometheus** | http://localhost:9090 | - |
| **Loki** | http://localhost:3100 | - |
| **Alertmanager** | http://localhost:9093 | - |

## 📊 Available Dashboards

### **RE Workflow Overview** (Enhanced!)

**URL**: http://localhost:3001/d/re-workflow-overview

**Sections:**

1. **📊 API Overview**
   - Request rate, error rate, response times
   - HTTP status codes distribution

2. **🔴 Redis & Queue Status** (NEW!)
   - Redis connection status (Up/Down)
   - Redis active connections
   - Redis memory usage
   - TAT Queue waiting/failed jobs
   - Pause/Resume Queue waiting/failed jobs
   - All queues job status timeline
   - Redis commands rate

3. **💻 System Resources** (NEW!)
   - System CPU Usage (gauge)
   - System Memory Usage (gauge)
   - System Disk Usage (gauge)
   - Disk Space Left (GB available)

4. **🔄 Business Metrics**
   - Workflow operations
   - TAT breaches
   - Node.js process metrics

**Refresh Rate**: Auto-refresh every 30 seconds

### 1. RE Workflow Overview

Pre-configured dashboard with:

- **API Metrics**: Request rate, error rate, latency percentiles
- **Logs Overview**: Error count, warnings, TAT breaches
- **Node.js Runtime**: Memory usage, event loop lag, CPU

### 2. Custom LogQL Queries

| Purpose | Query |
|---------|-------|
| All errors | `{app="re-workflow"} \| json \| level="error"` |
| TAT breaches | `{app="re-workflow"} \| json \| tatEvent="breached"` |
| Auth failures | `{app="re-workflow"} \| json \| authEvent="auth_failure"` |
| Slow requests (>3s) | `{app="re-workflow"} \| json \| duration>3000` |
| By user | `{app="re-workflow"} \| json \| userId="USER-ID"` |
| By request | `{app="re-workflow"} \| json \| requestId="REQ-XXX"` |

### 3. PromQL Queries (Prometheus)

| Purpose | Query |
|---------|-------|
| Request rate | `rate(http_requests_total{job="re-workflow-backend"}[5m])` |
| Error rate (overall ratio) | `sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m]))` |
| P95 latency | `histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))` |
| Memory usage | `process_resident_memory_bytes{job="re-workflow-backend"}` |
| Event loop lag | `nodejs_eventloop_lag_seconds{job="re-workflow-backend"}` |

## 📁 File Structure

```
monitoring/
├── docker-compose.monitoring.yml      # Main compose file
├── prometheus/
│   ├── prometheus.yml                 # Prometheus configuration
│   └── alert.rules.yml                # Alert rules
├── loki/
│   └── loki-config.yml                # Loki configuration
├── promtail/
│   └── promtail-config.yml            # Promtail log shipper config
├── alertmanager/
│   └── alertmanager.yml               # Alert notification config
└── grafana/
    ├── provisioning/
    │   ├── datasources/
    │   │   └── datasources.yml        # Auto-configure data sources
    │   └── dashboards/
    │       └── dashboards.yml         # Dashboard provisioning
    └── dashboards/
        └── re-workflow-overview.json  # Pre-built dashboard
```

## 🔧 Configuration

### Prometheus Scrape Targets

Edit `prometheus/prometheus.yml` to add/modify scrape targets:

```yaml
scrape_configs:
  - job_name: 're-workflow-backend'
    static_configs:
      # For local development (backend outside Docker)
      - targets: ['host.docker.internal:5000']
      # For Docker deployment (backend in Docker)
      # - targets: ['re_workflow_backend:5000']
```

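
The stack also ships Node Exporter and Redis Exporter, so the same file would typically carry scrape jobs for them too. A minimal sketch, assuming the compose service names are `node-exporter` and `redis-exporter` (hypothetical; check `docker-compose.monitoring.yml` for the real names) and that both exporters listen on their upstream default ports:

```yaml
scrape_configs:
  # Host system metrics (Node Exporter listens on 9100 by default)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']   # assumed service name

  # Redis server metrics (redis_exporter listens on 9121 by default)
  - job_name: 'redis-exporter'
    static_configs:
      - targets: ['redis-exporter:9121']  # assumed service name
```

After editing, restart Prometheus and confirm on the Targets page (http://localhost:9090/targets) that each job reports `UP`.
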
### Log Retention

Edit `loki/loki-config.yml`:

```yaml
limits_config:
  retention_period: 15d  # Adjust retention period
```

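
In recent Loki releases `retention_period` is enforced by the compactor, so the setting may have no effect unless compactor retention is switched on. A minimal sketch, assuming a single-node filesystem setup; the `working_directory` path is an assumption, and some Loki versions require additional store settings under `compactor`, so check the reference for the version in use:

```yaml
compactor:
  working_directory: /loki/compactor  # assumed path inside the Loki container
  retention_enabled: true             # without this, expired chunks are not deleted

limits_config:
  retention_period: 15d               # global retention for all streams
```
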
### Alert Notifications

Edit `alertmanager/alertmanager.yml` to configure:

- **Email** notifications
- **Slack** webhooks
- **Custom** webhook endpoints

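
The three notification types listed above could be wired up roughly as follows. This is only a sketch: the SMTP host, addresses, Slack webhook URL, channel name, and webhook endpoint are all placeholders, and the routing on a `severity` label assumes the alert rules set that label.

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'   # placeholder SMTP server
  smtp_from: 'alerts@example.com'          # placeholder sender

route:
  receiver: email-team                     # default receiver
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Send critical alerts to Slack as well as the custom webhook
    - matchers:
        - 'severity="critical"'
      receiver: slack-and-webhook

receivers:
  - name: email-team
    email_configs:
      - to: 'ops@example.com'              # placeholder recipient

  - name: slack-and-webhook
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder
        channel: '#re-workflow-alerts'     # hypothetical channel
        send_resolved: true
    webhook_configs:
      - url: 'http://host.docker.internal:5000/alerts'  # hypothetical endpoint
```
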
## 🛠️ Common Commands

```powershell
# Start services
docker-compose -f docker-compose.monitoring.yml up -d

# Stop services
docker-compose -f docker-compose.monitoring.yml down

# View logs
docker-compose -f docker-compose.monitoring.yml logs -f

# View specific service logs
docker-compose -f docker-compose.monitoring.yml logs -f grafana

# Restart a service
docker-compose -f docker-compose.monitoring.yml restart prometheus

# Check service health
docker ps

# Remove all data (fresh start)
docker-compose -f docker-compose.monitoring.yml down -v
```

## ⚡ Metrics Exposed by Backend

The backend exposes these metrics at `/metrics`:

### HTTP Metrics

- `http_requests_total` - Total HTTP requests (by method, route, status)
- `http_request_duration_seconds` - Request latency histogram
- `http_request_errors_total` - Error count (4xx, 5xx)
- `http_active_connections` - Current active connections

### Business Metrics

- `tat_breaches_total` - TAT breach events
- `pending_workflows_count` - Pending workflow gauge
- `workflow_operations_total` - Workflow operation count
- `auth_events_total` - Authentication events

### Node.js Runtime

- `nodejs_heap_size_*` - Heap memory metrics
- `nodejs_eventloop_lag_*` - Event loop lag
- `process_cpu_*` - CPU usage
- `process_resident_memory_bytes` - RSS memory

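
These metrics can drive alerts in `prometheus/alert.rules.yml`. The rules below are an illustrative sketch, not the rules shipped with the stack: the group name, alert names, thresholds, and `for` durations are assumptions to adjust for your environment.

```yaml
groups:
  - name: re-workflow-example          # hypothetical group name
    rules:
      # More than 5% of requests failing over the last 5 minutes
      - alert: HighErrorRate
        expr: sum(rate(http_request_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP error ratio above 5% for 5 minutes"

      # Any TAT breach recorded in the last 15 minutes
      - alert: TatBreachDetected
        expr: increase(tat_breaches_total[15m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "At least one TAT breach in the last 15 minutes"

      # The backend scrape target is unreachable
      - alert: BackendDown
        expr: up{job="re-workflow-backend"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape the RE Workflow backend"
```
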
## 🔒 Security Notes

1. **Change default passwords** in production
2. **Enable TLS** for external access
3. **Configure a firewall** to restrict access to monitoring ports
4. **Use a reverse proxy** (nginx) for HTTPS

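
For the first and third points, one option is to stop hard-coding the Grafana password and to bind the published port to the loopback interface. A sketch of a compose fragment, assuming the Grafana service is named `grafana` and listens on its default container port 3000; `GRAFANA_ADMIN_PASSWORD` is a hypothetical variable you would define in your shell or an `.env` file:

```yaml
services:
  grafana:
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      # Read the password from the environment instead of committing it
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}
    ports:
      # Reachable only from the host itself; put the reverse proxy in front
      - "127.0.0.1:3001:3000"
```
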
## 🐛 Troubleshooting

### Prometheus can't scrape backend

1. Ensure backend is running on port 5000
2. Check `/metrics` endpoint: `curl http://localhost:5000/metrics`
3. When Prometheus runs in Docker and the backend runs on the host, set the scrape target to `host.docker.internal:5000`

### Logs not appearing in Loki

1. Check Promtail logs: `docker logs re_promtail`
2. Verify log file path in `promtail-config.yml`
3. Ensure backend has `LOKI_HOST` configured

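
When checking the log path, it helps to know what a working Promtail configuration roughly looks like. The sketch below is an assumption-laden example rather than the file shipped with the stack: the Loki service name `loki`, the in-container log path `/var/log/app/*.log`, and the job name are hypothetical, while the `app: re-workflow` label matches the LogQL queries used in this README.

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml          # where Promtail remembers read offsets

clients:
  - url: http://loki:3100/loki/api/v1/push   # assumed Loki service name

scrape_configs:
  - job_name: re-workflow-logs           # hypothetical job name
    static_configs:
      - targets: [localhost]
        labels:
          app: re-workflow
          # Must match where ./logs/ is mounted inside the Promtail container
          __path__: /var/log/app/*.log
    pipeline_stages:
      # Parse JSON log lines and promote the log level to a label
      - json:
          expressions:
            level: level
      - labels:
          level:
```

If `__path__` does not match the mounted directory, Promtail starts without errors but ships nothing, which is the most common cause of an empty Loki.
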
### Grafana dashboards empty

1. Wait 30-60 seconds for data collection
2. Check data source configuration in Grafana
3. Verify time range selection

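
For the data source check, compare what Grafana shows with `grafana/provisioning/datasources/datasources.yml`. A provisioning file for this stack would look roughly like the sketch below; the data source names and the `prometheus` and `loki` hostnames are assumptions based on typical compose service names.

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # assumed service name on the compose network
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100         # assumed service name on the compose network
```

Inside the compose network the URLs must use service names, not `localhost`, because `localhost` in the Grafana container refers to Grafana itself.
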
### Docker memory issues

```powershell
# Increase Docker Desktop memory allocation
# Settings → Resources → Memory → 4GB+
```

## 📞 Support

For issues with the monitoring stack:

1. Check container logs: `docker logs <container_name>`
2. Verify the syntax of the configuration files
3. Ensure Docker Desktop is running with sufficient resources