# Loki + Grafana Deployment Guide for RE Workflow

## Overview

This guide covers deploying **Loki with Grafana** for log aggregation in the RE Workflow application.

```
┌─────────────────────────┐           ┌─────────────────────────┐
│  RE Workflow Backend    │──────────▶│          Loki           │
│  (Node.js + Winston)    │   HTTP    │      (Log Storage)      │
└─────────────────────────┘   :3100   └───────────┬─────────────┘
                                                  │
                                      ┌───────────▼─────────────┐
                                      │         Grafana         │
                                      │ monitoring.cloudtopiaa  │
                                      │    (Your existing!)     │
                                      └─────────────────────────┘
```

**Why Loki + Grafana?**
- ✅ Lightweight - designed for logs (unlike ELK)
- ✅ Uses your existing Grafana instance
- ✅ Query language (LogQL) modeled on Prometheus's PromQL
- ✅ Cost-effective - indexes labels, not content

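
The label-versus-content split shows up directly in the body of a request to Loki's HTTP push endpoint (`POST /loki/api/v1/push`), which is what the HTTP arrow in the diagram stands for. The Python sketch below builds such a body. The helper name `build_push_payload` and the sample label values are illustrative assumptions; the RE Workflow backend itself ships logs through Winston rather than through this code.

```python
import json
import time


def build_push_payload(labels, lines, ts_ns=None):
    """Build a JSON-serialisable body for Loki's /loki/api/v1/push endpoint.

    labels: dict of label name -> value (this is what Loki indexes).
    lines:  list of log line strings (stored, but not indexed).
    ts_ns:  Unix timestamp in nanoseconds; defaults to the current time.
    """
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Loki expects each entry as [<nanosecond timestamp as a string>, <line>].
    values = [[str(ts_ns), line] for line in lines]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload(
    {"app": "re-workflow", "level": "error"},  # hypothetical label set
    ['{"message": "TAT breached", "requestId": "REQ-2024-001"}'],
    ts_ns=1700000000000000000,
)
print(json.dumps(payload, indent=2))
```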
---

# Part 1: Windows Development Setup

## Prerequisites (Windows)

- Docker Desktop for Windows installed
- WSL2 enabled (recommended)
- 4GB+ RAM available for Docker

---

## Step 1: Install Docker Desktop

1. Download from: https://www.docker.com/products/docker-desktop/
2. Run installer
3. Enable WSL2 integration when prompted
4. Restart computer

---

## Step 2: Create Project Directory

Open PowerShell as Administrator:

```powershell
# Create directory
mkdir C:\loki
cd C:\loki
```

---

## Step 3: Create Loki Configuration (Windows)

Create file `C:\loki\loki-config.yaml`:

```powershell
# Using PowerShell
notepad C:\loki\loki-config.yaml
```

**Paste this configuration:**

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 7d
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
```

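
As a rough mental model, `ingestion_rate_mb` and `ingestion_burst_size_mb` behave like a token bucket: a sustained rate plus a short allowance above it. The Python sketch below is a conceptual illustration only and is not Loki's implementation. The class name `TokenBucket` and the simulated clock are assumptions made for the example.

```python
class TokenBucket:
    """Conceptual model of a rate limit with a burst allowance (sizes in MB)."""

    def __init__(self, rate_mb_per_s, burst_mb):
        self.rate = rate_mb_per_s
        self.capacity = burst_mb
        self.tokens = burst_mb  # start with a full bucket
        self.last = 0.0         # time of the last update, in seconds

    def allow(self, size_mb, now):
        """Return True if a push of size_mb at time `now` fits the limit."""
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size_mb <= self.tokens:
            self.tokens -= size_mb
            return True
        return False


# Values mirror the config above: 10 MB/s sustained, 20 MB burst.
bucket = TokenBucket(rate_mb_per_s=10, burst_mb=20)
first = bucket.allow(20, now=0.0)   # uses the whole burst allowance
second = bucket.allow(5, now=0.1)   # only about 1 MB has been refilled
third = bucket.allow(5, now=1.0)    # about 10 MB is available by now
print(first, second, third)
```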
---

## Step 4: Create Docker Compose (Windows)

Create file `C:\loki\docker-compose.yml`:

```powershell
notepad C:\loki\docker-compose.yml
```

**Paste this configuration:**

```yaml
version: '3.8'

services:
  loki:
    image: grafana/loki:2.9.2
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3001:3000" # Using 3001 since 3000 is used by React frontend
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - loki
    restart: unless-stopped

volumes:
  loki-data:
  grafana-data:
```

---

## Step 5: Start Services (Windows)

```powershell
cd C:\loki
docker-compose up -d
```

**Wait 30 seconds for services to initialize.**

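
A fixed wait can be replaced by polling Loki's `/ready` endpoint until it answers with HTTP 200. The Python sketch below shows one way to do that using only the standard library. The function names, the 60-second timeout, and the 2-second interval are assumptions chosen for illustration. Calling `wait_until(loki_is_ready)` from a script returns `True` as soon as Loki reports ready, or `False` once the timeout has passed.

```python
import time
import urllib.error
import urllib.request


def loki_is_ready(base_url="http://localhost:3100"):
    """Return True if GET <base_url>/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/ready", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer while starting up.
        return False


def wait_until(check, timeout_s=60.0, interval_s=2.0, sleep=time.sleep):
    """Call check() every interval_s seconds until it is true or time runs out."""
    waited = 0.0
    while True:
        if check():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```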
---

## Step 6: Verify Services (Windows)

```powershell
# Check containers are running
docker ps

# Test Loki health
Invoke-WebRequest -Uri http://localhost:3100/ready

# Or using curl (curl.exe bypasses the PowerShell alias for Invoke-WebRequest)
curl.exe http://localhost:3100/ready
```

---

## Step 7: Configure Grafana (Windows Dev)

1. Open browser: `http://localhost:3001` *(port 3001 to avoid conflict with React on 3000)*
2. Login: `admin` / `admin123`
3. Go to: **Connections → Data Sources → Add data source**
4. Select: **Loki**
5. Configure:
   - URL: `http://loki:3100`
6. Click: **Save & Test**

---

## Step 8: Configure Backend .env (Windows Dev)

```env
# Development - Local Loki
LOKI_HOST=http://localhost:3100
```

---

## Windows Commands Reference

| Command | Purpose |
|---------|---------|
| `docker-compose up -d` | Start Loki + Grafana |
| `docker-compose down` | Stop services |
| `docker-compose logs -f loki` | View Loki logs |
| `docker-compose restart` | Restart services |
| `docker ps` | Check running containers |

---

# Part 2: Linux Production Setup (DevOps)

## Prerequisites (Linux)

- Ubuntu 20.04+ / CentOS 7+ / RHEL 8+
- Docker & Docker Compose installed
- 2GB+ RAM (4GB recommended)
- 10GB+ disk space
- Grafana running at `http://monitoring.cloudtopiaa.com/`

---

## Step 1: Install Docker (if not installed)

**Ubuntu/Debian:**
```bash
# Update packages
sudo apt update

# Install Docker
sudo apt install -y docker.io docker-compose

# Start Docker
sudo systemctl start docker
sudo systemctl enable docker

# Add user to docker group (log out and back in for this to take effect)
sudo usermod -aG docker $USER
```

**CentOS/RHEL:**
```bash
# Install Docker
sudo yum install -y docker docker-compose

# Start Docker
sudo systemctl start docker
sudo systemctl enable docker
```

---

## Step 2: Create Loki Directory

```bash
sudo mkdir -p /opt/loki
cd /opt/loki
```

---

## Step 3: Create Loki Configuration (Linux)

```bash
sudo nano /opt/loki/loki-config.yaml
```

**Paste this configuration:**

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

limits_config:
  retention_period: 30d
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

# Storage retention
compactor:
  working_directory: /tmp/loki/compactor
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem
```

---

## Step 4: Create Docker Compose (Linux Production)

```bash
sudo nano /opt/loki/docker-compose.yml
```

**Paste this configuration (Loki only - uses existing Grafana):**

```yaml
version: '3.8'

services:
  loki:
    image: grafana/loki:2.9.2
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/tmp/loki
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - monitoring
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:3100/ready || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

networks:
  monitoring:
    driver: bridge

volumes:
  loki-data:
    driver: local
```

---

## Step 5: Start Loki (Linux)

```bash
cd /opt/loki
sudo docker-compose up -d
```

**Wait 30 seconds for Loki to initialize.**

---

## Step 6: Verify Loki (Linux)

```bash
# Check container
sudo docker ps | grep loki

# Test Loki health
curl http://localhost:3100/ready

# Query the labels API (the list stays empty until the first logs arrive)
curl http://localhost:3100/loki/api/v1/labels
```

**Expected response:**
```json
{"status":"success","data":[]}
```

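
Once the backend starts shipping logs, the same labels call should return a non-empty `data` array. The Python sketch below shows how a script might interpret the response body. The helper name `labels_from_response` and the populated sample response are hypothetical; only the empty response is taken from this guide.

```python
import json


def labels_from_response(body):
    """Return the label names from a /loki/api/v1/labels response body."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError(f"unexpected status: {doc.get('status')!r}")
    # "data" may be absent or empty before any logs have been ingested.
    return doc.get("data") or []


empty = labels_from_response('{"status":"success","data":[]}')
# Hypothetical response after the backend has shipped logs with these labels.
populated = labels_from_response('{"status":"success","data":["app","level"]}')
print(empty, populated)
```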
---

## Step 7: Open Firewall Port (Linux)

**Ubuntu/Debian:**
```bash
sudo ufw allow 3100/tcp
sudo ufw reload
```

**CentOS/RHEL:**
```bash
sudo firewall-cmd --permanent --add-port=3100/tcp
sudo firewall-cmd --reload
```

---

## Step 8: Add Loki to Existing Grafana

1. **Open Grafana:** `http://monitoring.cloudtopiaa.com/`
2. **Login** with admin credentials
3. **Go to:** Connections → Data Sources → Add data source
4. **Select:** Loki
5. **Configure:**

| Field | Value |
|-------|-------|
| Name | `RE-Workflow-Logs` |
| URL | `http://<loki-server-ip>:3100` |
| Timeout | `60` |

6. **Click:** Save & Test
7. **Should see:** ✅ "Data source successfully connected"

---

## Step 9: Configure Backend .env (Production)

```env
# Production - Remote Loki
LOKI_HOST=http://<loki-server-ip>:3100
# LOKI_USER=       # Optional: if basic auth enabled
# LOKI_PASSWORD=   # Optional: if basic auth enabled
```

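
To make the role of these variables concrete, the Python sketch below derives a push URL and request headers from `LOKI_HOST`, `LOKI_USER`, and `LOKI_PASSWORD`. It is an illustration under assumptions, not the backend's actual Winston transport code: the function name, the sample address `10.0.0.5`, and the sample credentials are all hypothetical.

```python
import base64


def loki_request_settings(env):
    """Derive the push URL and HTTP headers from LOKI_* settings.

    env: a mapping such as os.environ or a parsed .env file.
    """
    host = env["LOKI_HOST"].rstrip("/")
    headers = {"Content-Type": "application/json"}
    user = env.get("LOKI_USER")
    password = env.get("LOKI_PASSWORD")
    if user and password:
        # HTTP Basic auth header: base64 of "user:password".
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        headers["Authorization"] = f"Basic {token}"
    return host + "/loki/api/v1/push", headers


url, headers = loki_request_settings(
    {
        "LOKI_HOST": "http://10.0.0.5:3100/",  # hypothetical server address
        "LOKI_USER": "lokiuser",
        "LOKI_PASSWORD": "s3cret",              # hypothetical password
    }
)
print(url)
print(headers["Authorization"])
```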
---

## Linux Commands Reference

| Command | Purpose |
|---------|---------|
| `sudo docker-compose up -d` | Start Loki |
| `sudo docker-compose down` | Stop Loki |
| `sudo docker-compose logs -f` | View logs |
| `sudo docker-compose restart` | Restart |
| `sudo docker ps` | Check containers |

---

## Step 10: Enable Basic Auth (Optional - Production)

For added security, put Loki behind a reverse proxy that enforces HTTP basic auth. Start by creating the password file:

```bash
# Install apache2-utils for htpasswd
sudo apt install -y apache2-utils

# Create password file
sudo htpasswd -c /opt/loki/.htpasswd lokiuser
```

Then update `docker-compose.yml` to route traffic through an nginx reverse proxy that uses this file.

---

# Part 3: Grafana Dashboard Setup

## Create Dashboard

1. Go to: `http://monitoring.cloudtopiaa.com/dashboards` (or `http://localhost:3001` for dev)
2. Click: **New → New Dashboard**
3. Add panels as described below

---

### Panel 1: Error Count (Stat)

**Query (LogQL):**
```
sum(count_over_time({app="re-workflow"} | json | level="error" [24h]))
```
- Visualization: **Stat**
- Title: "Errors (24h)"

---

### Panel 2: Error Timeline (Time Series)

**Query (LogQL):**
```
sum by (level) (count_over_time({app="re-workflow"} | json | level=~"error|warn" [5m]))
```
- Visualization: **Time Series**
- Title: "Errors Over Time"

---

### Panel 3: Recent Errors (Logs)

**Query (LogQL):**
```
{app="re-workflow"} | json | level="error"
```
- Visualization: **Logs**
- Title: "Recent Errors"

---

### Panel 4: TAT Breaches (Stat)

**Query (LogQL):**
```
sum(count_over_time({app="re-workflow"} | json | tatEvent="breached" [24h]))
```
- Visualization: **Stat**
- Title: "TAT Breaches"
- Color: Red

---

### Panel 5: Workflow Events (Pie)

**Query (LogQL):**
```
sum by (workflowEvent) (count_over_time({app="re-workflow"} | json | workflowEvent!="" [24h]))
```
- Visualization: **Pie Chart**
- Title: "Workflow Events"

---

### Panel 6: Auth Failures (Table)

**Query (LogQL):**
```
{app="re-workflow"} | json | authEvent="auth_failure"
```
- Visualization: **Table**
- Title: "Authentication Failures"

---

## Useful LogQL Queries

| Purpose | Query |
|---------|-------|
| All errors | `{app="re-workflow"} \| json \| level="error"` |
| Specific request | `{app="re-workflow"} \| json \| requestId="REQ-2024-001"` |
| User activity | `{app="re-workflow"} \| json \| userId="user-123"` |
| TAT breaches | `{app="re-workflow"} \| json \| tatEvent="breached"` |
| Auth failures | `{app="re-workflow"} \| json \| authEvent="auth_failure"` |
| Workflow created | `{app="re-workflow"} \| json \| workflowEvent="created"` |
| API errors (5xx) | `{app="re-workflow"} \| json \| statusCode>=500` |
| Slow requests | `{app="re-workflow"} \| json \| duration>3000` |
| Error rate | `sum(rate({app="re-workflow"} \| json \| level="error" [5m]))` |
| By department | `{app="re-workflow"} \| json \| department="Engineering"` |

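
The log queries in the table share one shape: a stream selector, the `json` parser, then one label filter per field. The Python sketch below assembles such queries programmatically, which can help when generating dashboards or ad hoc searches. The helper `logql` is a hypothetical convenience, and it does not escape quotes inside values, so it is only suitable for trusted input.

```python
def logql(app, **fields):
    """Build a LogQL log query: stream selector, JSON parser, label filters.

    String values become equality filters; a (operator, number) tuple
    becomes a numeric comparison such as statusCode>=500.
    """
    query = f'{{app="{app}"}} | json'
    for name, value in fields.items():
        if isinstance(value, str):
            query += f' | {name}="{value}"'
        else:
            operator, number = value
            query += f" | {name}{operator}{number}"
    return query


print(logql("re-workflow", level="error"))
print(logql("re-workflow", statusCode=(">=", 500)))
```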
---

# Part 4: Alerting Setup

## Alert 1: High Error Rate

1. Go to: **Alerting → Alert Rules → New Alert Rule**
2. Configure:
   - Name: `RE Workflow - High Error Rate`
   - Data source: `RE-Workflow-Logs`
   - Query: `count_over_time({app="re-workflow"} | json | level="error" [5m])`
   - Condition: IS ABOVE 10
3. Add notification (Slack, Email)

## Alert 2: TAT Breach

1. Create new alert rule
2. Configure:
   - Name: `RE Workflow - TAT Breach`
   - Query: `count_over_time({app="re-workflow"} | json | tatEvent="breached" [15m])`
   - Condition: IS ABOVE 0
3. Add notification

## Alert 3: Auth Attack Detection

1. Create new alert rule
2. Configure:
   - Name: `RE Workflow - Auth Attack`
   - Query: `count_over_time({app="re-workflow"} | json | authEvent="auth_failure" [5m])`
   - Condition: IS ABOVE 20
3. Add notification to Security team

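
Each alert pairs a windowed count with a threshold. The Python sketch below is a simplified model of that logic, useful for reasoning about when a rule would fire. It is not how Loki or Grafana evaluate rules internally, and the timestamps are invented sample data.

```python
def count_over_time(timestamps, now, window_s):
    """Count events with now - window_s < t <= now (timestamps in seconds)."""
    return sum(1 for t in timestamps if now - window_s < t <= now)


def is_above(value, threshold):
    """Mirror the 'IS ABOVE' condition: strictly greater than the threshold."""
    return value > threshold


# Hypothetical error timestamps: 12 inside the last 5 minutes, 3 older ones.
now = 10_000
errors = [now - 30 * i for i in range(1, 10)] + [now - 10, now - 20, now - 299]
errors += [now - 400, now - 500, now - 600]

recent = count_over_time(errors, now, window_s=300)
print(recent, is_above(recent, 10))
```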
---

# Part 5: Troubleshooting

## Windows Issues

### Docker Desktop not starting
```powershell
# Restart the Docker Desktop service
Restart-Service com.docker.service

# Or restart Docker Desktop from system tray
```

### Port 3100 already in use
```powershell
# Find process using port
netstat -ano | findstr :3100

# Kill process
taskkill /PID <pid> /F
```

### WSL2 issues
```powershell
# Update WSL
wsl --update

# Restart WSL
wsl --shutdown
```

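
A port conflict can also be detected before starting the containers. The Python sketch below checks whether anything is accepting TCP connections on a port, using only the standard library. The function name is an assumption for illustration. A call such as `port_in_use(3100)` returns `True` if a process is already listening on the Loki port.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```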
---

## Linux Issues

### Loki won't start

```bash
# Check logs
sudo docker logs loki

# Common fix - permissions
sudo chown -R 10001:10001 /opt/loki
```

### Grafana can't connect to Loki

```bash
# Verify Loki is healthy
curl http://localhost:3100/ready

# Check network from Grafana server
curl http://<loki-server-ip>:3100/ready

# Restart Loki
sudo docker-compose restart
```

### Logs not appearing in Grafana

1. Check application env has correct `LOKI_HOST`
2. Verify network connectivity from the backend host: `curl http://<loki-server-ip>:3100/ready`
3. Check labels: `curl http://localhost:3100/loki/api/v1/labels`
4. Wait for application to send first logs

### High memory usage

Reduce the retention period in `loki-config.yaml`:

```yaml
limits_config:
  retention_period: 7d # Reduce from 30d
```

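
When logs do not appear in Grafana, it can help to query Loki directly and take Grafana out of the picture. The Python sketch below builds a URL for Loki's `/loki/api/v1/query_range` endpoint, which can then be fetched with `curl` or a browser. The helper name and the sample timestamps are assumptions for illustration.

```python
from urllib.parse import urlencode


def query_range_url(base_url, query, start_ns, end_ns, limit=100):
    """Build a URL for Loki's /loki/api/v1/query_range endpoint.

    start_ns and end_ns are Unix timestamps in nanoseconds.
    """
    params = urlencode(
        {"query": query, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return base_url.rstrip("/") + "/loki/api/v1/query_range?" + params


url = query_range_url(
    "http://localhost:3100",
    '{app="re-workflow"}',
    start_ns=1700000000000000000,  # sample window start
    end_ns=1700000300000000000,    # five minutes later
)
print(url)
```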
---

# Quick Reference

## Environment Comparison

| Setting | Windows Dev | Linux Production |
|---------|-------------|------------------|
| LOKI_HOST | `http://localhost:3100` | `http://<loki-server-ip>:3100` |
| Grafana URL | `http://localhost:3001` | `http://monitoring.cloudtopiaa.com` |
| Config Path | `C:\loki\` | `/opt/loki/` |
| Retention | 7 days | 30 days |

## Port Reference

| Service | Port | URL |
|---------|------|-----|
| Loki | 3100 | `http://<loki-server-ip>:3100` |
| Grafana (Dev) | 3001 | `http://localhost:3001` |
| Grafana (Prod) | 80/443 | `http://monitoring.cloudtopiaa.com/` |
| React Frontend | 3000 | `http://localhost:3000` |

---

# Verification Checklist

## Windows Development
- [ ] Docker Desktop running
- [ ] `docker ps` shows loki and grafana containers
- [ ] `http://localhost:3100/ready` returns "ready"
- [ ] `http://localhost:3001` shows Grafana login
- [ ] Loki data source connected in Grafana
- [ ] Backend `.env` has `LOKI_HOST=http://localhost:3100`

## Linux Production
- [ ] Loki container running (`docker ps`)
- [ ] `curl localhost:3100/ready` returns "ready"
- [ ] Firewall port 3100 open
- [ ] Grafana connected to Loki
- [ ] Backend `.env` has correct `LOKI_HOST`
- [ ] Logs appearing in Grafana Explore
- [ ] Dashboard created
- [ ] Alerts configured

---

# Contact

For issues with this setup:
- Backend logs: Check Grafana dashboard
- Infrastructure: Contact DevOps team