# Loki + Grafana Deployment Guide for RE Workflow
## Overview
This guide covers deploying **Loki with Grafana** for log aggregation in the RE Workflow application.
```
┌─────────────────────────┐           ┌─────────────────────────┐
│  RE Workflow Backend    │──────────▶│          Loki           │
│  (Node.js + Winston)    │   HTTP    │      (Log Storage)      │
└─────────────────────────┘   :3100   └───────────┬─────────────┘
                                      ┌───────────▼─────────────┐
                                      │         Grafana         │
                                      │ monitoring.cloudtopiaa  │
                                      │    (Your existing!)     │
                                      └─────────────────────────┘
```
**Why Loki + Grafana?**
- ✅ Lightweight - purpose-built for logs, with a smaller footprint than a full ELK stack
- ✅ Uses your existing Grafana instance
- ✅ LogQL query language, modeled on Prometheus's PromQL
- ✅ Cost-effective - indexes labels, not content
---
# Part 1: Windows Development Setup
## Prerequisites (Windows)
- Docker Desktop for Windows installed
- WSL2 enabled (recommended)
- 4GB+ RAM available for Docker
---
## Step 1: Install Docker Desktop
1. Download from: https://www.docker.com/products/docker-desktop/
2. Run installer
3. Enable WSL2 integration when prompted
4. Restart computer
---
## Step 2: Create Project Directory
Open PowerShell as Administrator:
```powershell
# Create directory
mkdir C:\loki
cd C:\loki
```
---
## Step 3: Create Loki Configuration (Windows)
Create file `C:\loki\loki-config.yaml`:
```powershell
# Using PowerShell
notepad C:\loki\loki-config.yaml
```
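**Paste this configuration.** Note that `retention_period` only takes effect when the compactor runs with `retention_enabled: true`, as in the Linux configuration in Part 2: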
```yaml
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_listen_port: 9096
common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100
schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
limits_config:
  retention_period: 7d
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
```
---
## Step 4: Create Docker Compose (Windows)
Create file `C:\loki\docker-compose.yml`:
```powershell
notepad C:\loki\docker-compose.yml
```
**Paste this configuration:**
```yaml
version: '3.8'
services:
  loki:
    image: grafana/loki:2.9.2
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3001:3000" # Using 3001 since 3000 is used by React frontend
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - loki
    restart: unless-stopped
volumes:
  loki-data:
  grafana-data:
```
---
## Step 5: Start Services (Windows)
```powershell
cd C:\loki
docker-compose up -d
```
**Allow about 30 seconds for the services to initialize.** Until then, `http://localhost:3100/ready` may report that Loki is not ready.

---
## Step 6: Verify Services (Windows)
```powershell
# Check containers are running
docker ps
# Test Loki health
Invoke-WebRequest -Uri http://localhost:3100/ready
# Or using curl.exe (in Windows PowerShell, plain `curl` can resolve to the Invoke-WebRequest alias)
curl.exe http://localhost:3100/ready
```
---
## Step 7: Configure Grafana (Windows Dev)
1. Open browser: `http://localhost:3001` *(port 3001 to avoid conflict with React on 3000)*
2. Login: `admin` / `admin123`
3. Go to: **Connections → Data Sources → Add data source**
4. Select: **Loki**
5. Configure:
   - URL: `http://loki:3100`
6. Click: **Save & Test**
---
## Step 8: Configure Backend .env (Windows Dev)
```env
# Development - Local Loki
LOKI_HOST=http://localhost:3100
```
---
## Windows Commands Reference
| Command | Purpose |
|---------|---------|
| `docker-compose up -d` | Start Loki + Grafana |
| `docker-compose down` | Stop services |
| `docker-compose logs -f loki` | View Loki logs |
| `docker-compose restart` | Restart services |
| `docker ps` | Check running containers |
---
# Part 2: Linux Production Setup (DevOps)
## Prerequisites (Linux)
- Ubuntu 20.04+ / CentOS 7+ / RHEL 8+
- Docker & Docker Compose installed
- 2GB+ RAM (4GB recommended)
- 10GB+ disk space
- Grafana running at `http://monitoring.cloudtopiaa.com/`
---
## Step 1: Install Docker (if not installed)
**Ubuntu/Debian:**
```bash
# Update packages
sudo apt update
# Install Docker
sudo apt install -y docker.io docker-compose
# Start Docker
sudo systemctl start docker
sudo systemctl enable docker
# Add user to docker group (takes effect after logging out and back in)
sudo usermod -aG docker $USER
```
**CentOS/RHEL:**
```bash
# Install Docker
sudo yum install -y docker docker-compose
# Start Docker
sudo systemctl start docker
sudo systemctl enable docker
```
---
## Step 2: Create Loki Directory
```bash
sudo mkdir -p /opt/loki
cd /opt/loki
```
---
## Step 3: Create Loki Configuration (Linux)
```bash
sudo nano /opt/loki/loki-config.yaml
```
**Paste this configuration:**
```yaml
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_listen_port: 9096
common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100
schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
ruler:
  alertmanager_url: http://localhost:9093
limits_config:
  retention_period: 30d
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
# Storage retention
compactor:
  working_directory: /tmp/loki/compactor
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem
```
---
## Step 4: Create Docker Compose (Linux Production)
```bash
sudo nano /opt/loki/docker-compose.yml
```
**Paste this configuration (Loki only - uses existing Grafana):**
```yaml
version: '3.8'
services:
  loki:
    image: grafana/loki:2.9.2
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/tmp/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - monitoring
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3100/ready || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
networks:
  monitoring:
    driver: bridge
volumes:
  loki-data:
    driver: local
```
---
## Step 5: Start Loki (Linux)
```bash
cd /opt/loki
sudo docker-compose up -d
```
**Allow about 30 seconds for Loki to initialize.** Until then, `http://localhost:3100/ready` may report that Loki is not ready.

---
## Step 6: Verify Loki (Linux)
```bash
# Check container
sudo docker ps | grep loki
# Test Loki health
curl http://localhost:3100/ready
# Query the labels API (the list stays empty until the first logs are ingested)
curl http://localhost:3100/loki/api/v1/labels
```
**Expected response:**
```json
{"status":"success","data":[]}
```
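Instead of relying on a fixed wait, the readiness check above can be polled until Loki answers. The following is a minimal sketch, not part of the original setup: the function name `wait_for_loki` and its defaults (12 attempts, 5 seconds apart) are illustrative assumptions. Call it as `wait_for_loki http://localhost:3100 12 5`.

```bash
# Poll Loki's /ready endpoint until it returns a successful HTTP status
# or the attempts run out. Name and defaults are illustrative assumptions.
wait_for_loki() {
  local url="${1:-http://localhost:3100}"
  local attempts="${2:-12}"
  local delay="${3:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503 (not ready yet)
    if curl -fsS -o /dev/null "$url/ready"; then
      echo "Loki ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Loki not ready after $attempts attempts" >&2
  return 1
}
```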
---
## Step 7: Open Firewall Port (Linux)
**Ubuntu/Debian:**
```bash
sudo ufw allow 3100/tcp
sudo ufw reload
```
**CentOS/RHEL:**
```bash
sudo firewall-cmd --permanent --add-port=3100/tcp
sudo firewall-cmd --reload
```
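Because this guide runs Loki with `auth_enabled: false`, it may be preferable to admit only the backend and Grafana hosts rather than opening port 3100 to every source. The sketch below only prints `ufw` rules for review and does not apply them. The function name `loki_ufw_rules` is hypothetical, and the source addresses are whatever you pass in, for example `loki_ufw_rules 3100 <backend-ip> <grafana-ip>`.

```bash
# Print ufw rules that admit only the listed source addresses on the given port.
# The commands are printed, not executed, so they can be reviewed first.
loki_ufw_rules() {
  local port="$1"
  shift
  local ip
  for ip in "$@"; do
    echo "sudo ufw allow from $ip to any port $port proto tcp"
  done
}
```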
---
## Step 8: Add Loki to Existing Grafana
1. **Open Grafana:** `http://monitoring.cloudtopiaa.com/`
2. **Login** with admin credentials
3. **Go to:** Connections → Data Sources → Add data source
4. **Select:** Loki
5. **Configure:**

   | Field | Value |
   |-------|-------|
   | Name | `RE-Workflow-Logs` |
   | URL | `http://<loki-server-ip>:3100` |
   | Timeout | `60` |
6. **Click:** Save & Test
7. **Should see:** ✅ "Data source successfully connected"
---
## Step 9: Configure Backend .env (Production)
```env
# Production - Remote Loki
LOKI_HOST=http://<loki-server-ip>:3100
# LOKI_USER= # Optional: if basic auth enabled
# LOKI_PASSWORD= # Optional: if basic auth enabled
```
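Once `LOKI_HOST` is set, one way to confirm that the backend host can actually write to Loki is to push a single line by hand through the HTTP push API (`/loki/api/v1/push`). This is a minimal sketch, not part of the RE Workflow codebase: the function names and the test message are made up for illustration, the `app="re-workflow"` label is borrowed from the queries in Part 3, and it assumes no basic auth in front of Loki. Loki answers a successful push with HTTP 204 and an empty body.

```bash
# Build a Loki push payload: one stream labelled app=<app> with one log line.
# Arguments: <timestamp in nanoseconds> <app label> <message>
# Assumes the message has no double quotes or backslashes (no JSON escaping is done).
build_push_payload() {
  printf '{"streams":[{"stream":{"app":"%s"},"values":[["%s","%s"]]}]}' "$2" "$1" "$3"
}

# Send one test line to the Loki instance named by LOKI_HOST.
push_test_log() {
  local host="${LOKI_HOST:-http://localhost:3100}"
  local ts_ns
  ts_ns="$(date +%s)000000000"   # epoch seconds padded to nanoseconds
  build_push_payload "$ts_ns" "re-workflow" "loki connectivity test" |
    curl -fsS -X POST -H "Content-Type: application/json" \
      --data-binary @- "$host/loki/api/v1/push"
}
```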
---
## Linux Commands Reference
| Command | Purpose |
|---------|---------|
| `sudo docker-compose up -d` | Start Loki |
| `sudo docker-compose down` | Stop Loki |
| `sudo docker-compose logs -f` | View logs |
| `sudo docker-compose restart` | Restart |
| `sudo docker ps` | Check containers |
---
## Step 10: Enable Basic Auth (Optional - Production)
For added security, enable basic auth:
```bash
# Install apache2-utils for htpasswd
sudo apt install apache2-utils
# Create password file
sudo htpasswd -c /opt/loki/.htpasswd lokiuser
# Update docker-compose.yml to use nginx reverse proxy with auth
```
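The reverse-proxy step is left open above, so here is one possible shape for it, written as a shell function that generates a small nginx configuration. Everything in it is an assumption rather than this project's actual setup: nginx is taken to run as a container on the same Docker network as Loki, to reach it as `loki:3100`, to have the `.htpasswd` file mounted at `/etc/nginx/.htpasswd`, and to publish port 3100 in place of the Loki container. The function name and default output path are hypothetical.

```bash
# Write a minimal nginx config that puts HTTP basic auth in front of Loki.
# Paths, hostnames, and ports are illustrative assumptions (see the text above).
write_loki_proxy_conf() {
  local out="${1:-/opt/loki/nginx.conf}"
  cat > "$out" <<'EOF'
events {}
http {
  server {
    listen 3100;
    location / {
      auth_basic "Loki";
      auth_basic_user_file /etc/nginx/.htpasswd;
      proxy_pass http://loki:3100;
    }
  }
}
EOF
}
```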
---
# Part 3: Grafana Dashboard Setup
## Create Dashboard
1. Go to: `http://monitoring.cloudtopiaa.com/dashboards` (or `http://localhost:3001` for dev)
2. Click: **New → New Dashboard**
3. Add panels as described below
---
### Panel 1: Error Count (Stat)
**Query (LogQL):**
```
sum(count_over_time({app="re-workflow"} |= "error" [24h]))
```
- Visualization: **Stat**
- Title: "Errors (24h)"
---
### Panel 2: Error Timeline (Time Series)
**Query (LogQL):**
```
sum by (level) (count_over_time({app="re-workflow"} | json | level=~"error|warn" [5m]))
```
- Visualization: **Time Series**
- Title: "Errors Over Time"
---
### Panel 3: Recent Errors (Logs)
**Query (LogQL):**
```
{app="re-workflow"} | json | level="error"
```
- Visualization: **Logs**
- Title: "Recent Errors"
---
### Panel 4: TAT Breaches (Stat)
**Query (LogQL):**
```
sum(count_over_time({app="re-workflow"} | json | tatEvent="breached" [24h]))
```
- Visualization: **Stat**
- Title: "TAT Breaches"
- Color: Red
---
### Panel 5: Workflow Events (Pie)
**Query (LogQL):**
```
sum by (workflowEvent) (count_over_time({app="re-workflow"} | json | workflowEvent!="" [24h]))
```
- Visualization: **Pie Chart**
- Title: "Workflow Events"
---
### Panel 6: Auth Failures (Table)
**Query (LogQL):**
```
{app="re-workflow"} | json | authEvent="auth_failure"
```
- Visualization: **Table**
- Title: "Authentication Failures"
---
## Useful LogQL Queries
| Purpose | Query |
|---------|-------|
| All errors | `{app="re-workflow"} \| json \| level="error"` |
| Specific request | `{app="re-workflow"} \| json \| requestId="REQ-2024-001"` |
| User activity | `{app="re-workflow"} \| json \| userId="user-123"` |
| TAT breaches | `{app="re-workflow"} \| json \| tatEvent="breached"` |
| Auth failures | `{app="re-workflow"} \| json \| authEvent="auth_failure"` |
| Workflow created | `{app="re-workflow"} \| json \| workflowEvent="created"` |
| API errors (5xx) | `{app="re-workflow"} \| json \| statusCode>=500` |
| Slow requests | `{app="re-workflow"} \| json \| duration>3000` |
| Error rate | `sum(rate({app="re-workflow"} \| json \| level="error"[5m]))` |
| By department | `{app="re-workflow"} \| json \| department="Engineering"` |
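
The queries in this table can also be run outside Grafana against Loki's `query_range` HTTP endpoint, which helps when checking whether a problem lies in Loki or in a dashboard. The helper below is a sketch under assumptions: the name `loki_query` and the default limit of 20 are illustrative, and no basic auth is configured. When no start or end is given, Loki searches the last hour. Example: `loki_query '{app="re-workflow"} | json | level="error"'`.

```bash
# Run a LogQL query against Loki's query_range endpoint.
# Arguments: <LogQL query> [max number of log lines, default 20]
loki_query() {
  local host="${LOKI_HOST:-http://localhost:3100}"
  local query="$1"
  local limit="${2:-20}"
  # -G sends the URL-encoded parameters as a GET query string
  curl -fsS -G "$host/loki/api/v1/query_range" \
    --data-urlencode "query=$query" \
    --data-urlencode "limit=$limit"
}
```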
---
# Part 4: Alerting Setup
## Alert 1: High Error Rate
1. Go to: **Alerting → Alert Rules → New Alert Rule**
2. Configure:
   - Name: `RE Workflow - High Error Rate`
   - Data source: `RE-Workflow-Logs`
   - Query: `sum(count_over_time({app="re-workflow"} | json | level="error" [5m]))` (wrapped in `sum` so the rule evaluates one total instead of one series per label set)
   - Condition: IS ABOVE 10
3. Add notification (Slack, Email)
## Alert 2: TAT Breach
1. Create new alert rule
2. Configure:
   - Name: `RE Workflow - TAT Breach`
   - Query: `sum(count_over_time({app="re-workflow"} | json | tatEvent="breached" [15m]))`
   - Condition: IS ABOVE 0
3. Add notification
## Alert 3: Auth Attack Detection
1. Create new alert rule
2. Configure:
   - Name: `RE Workflow - Auth Attack`
   - Query: `sum(count_over_time({app="re-workflow"} | json | authEvent="auth_failure" [5m]))`
   - Condition: IS ABOVE 20
3. Add notification to Security team
---
# Part 5: Troubleshooting
## Windows Issues
### Docker Desktop not starting
```powershell
# Restart the Docker Desktop service (run PowerShell as Administrator)
Restart-Service com.docker.service
# Or restart Docker Desktop from system tray
```
### Port 3100 already in use
```powershell
# Find process using port
netstat -ano | findstr :3100
# Kill process
taskkill /PID <pid> /F
```
### WSL2 issues
```powershell
# Update WSL
wsl --update
# Restart WSL
wsl --shutdown
```
---
## Linux Issues
### Loki won't start
```bash
# Check logs
sudo docker logs loki
# Common fix - permissions
sudo chown -R 10001:10001 /opt/loki
```
### Grafana can't connect to Loki
```bash
# Verify Loki is healthy
curl http://localhost:3100/ready
# Check network from the Grafana server
curl http://<loki-server-ip>:3100/ready
# Restart Loki
sudo docker-compose restart
```
### Logs not appearing in Grafana
1. Check application env has correct `LOKI_HOST`
2. Verify network connectivity from the backend host: `curl http://<loki-server-ip>:3100/ready`
3. Check labels: `curl http://localhost:3100/loki/api/v1/labels`
4. Wait for application to send first logs
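
The first three checks can be scripted against whatever host the backend is actually configured with. This is a rough sketch, assuming a dotenv file with plain `KEY=value` lines like the ones in this guide (quotes and inline comments are not handled) and no basic auth. The function names are hypothetical. Run it as `check_loki_from_env /path/to/.env`.

```bash
# Print the LOKI_HOST value from a dotenv-style file (last assignment wins).
read_loki_host() {
  local env_file="${1:-.env}"
  sed -n 's/^LOKI_HOST=//p' "$env_file" | tail -n 1
}

# Check readiness and the labels API on the host the backend is configured with.
check_loki_from_env() {
  local host
  host="$(read_loki_host "$1")"
  if [ -z "$host" ]; then
    echo "LOKI_HOST is not set in ${1:-.env}" >&2
    return 1
  fi
  echo "Checking $host"
  curl -fsS "$host/ready" || return 1
  curl -fsS "$host/loki/api/v1/labels" || return 1
}
```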
### High memory usage
```yaml
# Reduce retention period in loki-config.yaml
limits_config:
  retention_period: 7d # Reduce from 30d
```
---
# Quick Reference
## Environment Comparison
| Setting | Windows Dev | Linux Production |
|---------|-------------|------------------|
| LOKI_HOST | `http://localhost:3100` | `http://<loki-server-ip>:3100` |
| Grafana URL | `http://localhost:3001` | `http://monitoring.cloudtopiaa.com` |
| Config Path | `C:\loki\` | `/opt/loki/` |
| Retention | 7 days | 30 days |
## Port Reference
| Service | Port | URL |
|---------|------|-----|
| Loki | 3100 | `http://server:3100` |
| Grafana (Dev) | 3001 | `http://localhost:3001` |
| Grafana (Prod) | 80/443 | `http://monitoring.cloudtopiaa.com/` |
| React Frontend | 3000 | `http://localhost:3000` |
---
# Verification Checklist
## Windows Development
- [ ] Docker Desktop running
- [ ] `docker ps` shows loki and grafana containers
- [ ] `http://localhost:3100/ready` returns "ready"
- [ ] `http://localhost:3001` shows Grafana login
- [ ] Loki data source connected in Grafana
- [ ] Backend `.env` has `LOKI_HOST=http://localhost:3100`
## Linux Production
- [ ] Loki container running (`docker ps`)
- [ ] `curl localhost:3100/ready` returns "ready"
- [ ] Firewall port 3100 open
- [ ] Grafana connected to Loki
- [ ] Backend `.env` has correct `LOKI_HOST`
- [ ] Logs appearing in Grafana Explore
- [ ] Dashboard created
- [ ] Alerts configured
---
# Contact
For issues with this setup:
- Backend logs: Check Grafana dashboard
- Infrastructure: Contact DevOps team