codenuk_backend_mine/services/multi-document-upload-service/README.md

# COMPLETE END-TO-END FLOW: Multi-Document Analysis to Report Generation

This document walks through the full pipeline with theory, diagrams, and a step-by-step breakdown.

## **🎯 SYSTEM OVERVIEW**

### **What We're Building:**

A system that takes 100+ documents (PDFs, DOCX, PPTX, images, etc.) and generates a comprehensive onboarding report by understanding causal relationships and connections across all documents.

### **Key Components:**

1. **Document Storage** - Store uploaded files
2. **Content Extraction** - Get text from different formats
3. **Causal Analysis** - Understand cause-effect relationships (with Claude)
4. **Knowledge Graph** - Store relationships in Neo4j
5. **Vector Database** - Enable semantic search in Qdrant
6. **Report Generation** - Create final report (with Claude)
## **📊 COMPLETE ARCHITECTURE DIAGRAM**
┌─────────────────────────────────────────────────────────────────────────────┐
│ USER INTERFACE │
│ ┌────────────────────────┐ ┌────────────────────────┐ │
│ │ Upload Documents │ │ Generate Report │ │
│ │ (100+ files) │ │ Button │ │
│ └───────────┬────────────┘ └────────────┬───────────┘ │
└──────────────┼───────────────────────────────────────┼─────────────────────┘
│ │
▼ ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ DOCUMENT UPLOAD SERVICE │ │
│ │ • Validate file types │ │
│ │ • Calculate file hash (deduplication) │ │
│ │ • Store metadata in PostgreSQL │ │
│ │ • Save files to storage (Local) │ │
│ └────────────────────────────┬────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ EXTRACTION ORCHESTRATOR │ │
│ │ • Routes files to appropriate extractors │ │
│ │ • Manages extraction queue │ │
│ │ • Handles failures and retries │ │
│ └─┬───────────────┬───────────────┬──────────────┬────────────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────┐ ┌──────┐ ┌──────┐ ┌───────┐ │
│ │ PDF │ │ DOCX │ │ PPTX │ │ Image │ │
│ │Extr.│ │Extr. │ │Extr. │ │Extr. │ │
│ └──┬──┘ └───┬──┘ └───┬──┘ └───┬───┘ │
│ │ │ │ │ │
│ └──────────────┴──────────────┴──────────────┘ │
│ │ │
│ ▼ │
│ [Extracted Text for each document] │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 🤖 CLAUDE AI - CAUSAL EXTRACTION │ │
│ │ For each document: │ │
│ │ Input: Extracted text + metadata │ │
│ │ Output: List of causal relationships │ │
│ │ │ │
│ │ Example Output: │ │
│ │ { │ │
│ │ "cause": "Budget cut by 30%", │ │
│ │ "effect": "ML features postponed", │ │
│ │ "confidence": 0.92, │ │
│ │ "entities": ["Finance Team", "ML Team"] │ │
│ │ } │ │
│ └────────────────────────────┬────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ [Causal Relationships Database] │
│ (Temporary PostgreSQL table) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 🤖 CLAUDE AI - ENTITY RESOLUTION │ │
│ │ Resolve entity mentions across all documents │ │
│ │ │ │
│ │ Input: All entity mentions ["John", "J. Smith", "John Smith"] │ │
│ │ Output: Resolved entities {"John Smith": ["John", "J. Smith"]} │ │
│ └────────────────────────────┬────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ KNOWLEDGE GRAPH BUILDER │ │
│ │ Build Neo4j graph from causal relationships │ │
│ └────────────────────────────┬────────────────────────────────────────┘ │
└────────────────────────────────┼──────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ STORAGE LAYER │
│ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ PostgreSQL │ │ Neo4j │ │ Qdrant │ │
│ │ │ │ │ │ │ │
│ │ • Metadata │ │ • Nodes: │ │ • Vectors │ │
│ │ • File paths │ │ - Events │ │ • Enriched │ │
│ │ • Status │ │ - Entities │ │ chunks │ │
│ │ │ │ - Documents │ │ • Metadata │ │
│ │ │ │ │ │ │ │
│ │ │ │ • Edges: │ │ │ │
│ │ │ │ - CAUSES │ │ │ │
│ │ │ │ - INVOLVES │ │ │ │
│ └────────────────┘ │ - MENTIONS │ │ │ │
│ └────────────────┘ └────────────────┘ │
│ │ │ │
└─────────────────────────────────┼─────────────────────┼───────────────────────┘
│ │
▼ ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ KG TO QDRANT ENRICHMENT PIPELINE │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ 1. Query Neo4j for causal chains │ │
│ │ MATCH (a)-[:CAUSES*1..3]->(b) │ │
│ │ │ │
│ │ 2. Convert to enriched text chunks │ │
│ │ "Budget cut → ML postponed → Timeline shifted" │ │
│ │ │ │
│ │ 3. Generate embeddings (OpenAI) │ │
│ │ │ │
│ │ 4. Store in Qdrant with metadata from KG │ │
│ │ - Original causal chain │ │
│ │ - Entities involved │ │
│ │ - Confidence scores │ │
│ │ - Source documents │ │
│ └────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ REPORT GENERATION PHASE │
│ │
│ User clicks "Generate Report" │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ RETRIEVAL ORCHESTRATOR │ │
│ │ │ │
│ │ Step 1: Semantic Search (Qdrant) │ │
│ │ Query: "project overview timeline decisions" │ │
│ │ Returns: Top 50 most relevant chunks │ │
│ │ │ │
│ │ Step 2: Graph Traversal (Neo4j) │ │
│ │ Query: Critical causal chains with confidence > 0.8 │ │
│ │ Returns: Important decision paths │ │
│ │ │ │
│ │ Step 3: Entity Analysis (Neo4j) │ │
│ │ Query: Key people, teams, projects │ │
│ │ Returns: Entity profiles │ │
│ └───────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ [Aggregated Context Package] │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ 🤖 CLAUDE AI - FINAL REPORT GENERATION │ │
│ │ │ │
│ │ Input: │ │
│ │ • 50 semantic chunks from Qdrant │ │
│ │ • 20 causal chains from Neo4j │ │
│ │ • Entity profiles │ │
│ │ • Report template │ │
│ │ │ │
│ │ Prompt: │ │
│ │ "You are creating an onboarding report. │ │
│ │ Based on 100+ documents, synthesize: │ │
│ │ - Project overview │ │
│ │ - Key decisions and WHY they were made │ │
│ │ - Critical causal chains │ │
│ │ - Timeline and milestones │ │
│ │ - Current status and next steps" │ │
│ │ │ │
│ │ Output: Comprehensive Markdown report │ │
│ └───────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PDF GENERATION │ │
│ │ • Convert Markdown to PDF │ │
│ │ • Add formatting, table of contents │ │
│ │ • Include citations to source documents │ │
│ └───────────────────────────┬─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ [Final PDF Report] │
│ │ │
│ ▼ │
│ Download to user │
└──────────────────────────────────────────────────────────────────────────────┘
## **📚 COMPLETE THEORY-WISE STEP-BY-STEP FLOW**

This section explains the entire system in pure theory: how it works, why each step exists, and what problem it solves.
### **🎯 THE BIG PICTURE (Theory)**

**The Problem:**

A new person joins a project that has 100+ documents (meeting notes, technical specs, design docs, emails, presentations). Reading all of them would take weeks. They need to understand:

- **WHAT** happened in the project
- **WHY** decisions were made (causal relationships)
- **WHO** is involved
- **WHEN** things happened
- **HOW** everything connects

**The Solution:**

Build an intelligent system that:

1. Reads all documents automatically
2. Understands cause-and-effect relationships
3. Connects related information across documents
4. Generates a comprehensive summary report
## **🔄 COMPLETE FLOW (Theory Explanation)**

## **STAGE 1: DOCUMENT INGESTION**

### **Theory: Why This Stage Exists**

**Problem:** We have 100+ documents in different formats (PDF, Word, PowerPoint, Excel, images). We need to get them into the system.

**Goal:**

- Accept all document types
- Organize them
- Prevent duplicates
- Track processing status
### **What Happens:**

```
USER ACTION:
└─> User uploads 100 files through web interface

SYSTEM ACTIONS:

Step 1.1: FILE VALIDATION
├─> Check: Is this a supported file type?
├─> Check: Is file size acceptable?
└─> Decision: Accept or Reject

Step 1.2: DEDUPLICATION
├─> Calculate unique hash (fingerprint) of file content
├─> Check: Have we seen this exact file before?
└─> Decision: Store as new OR link to existing

Step 1.3: METADATA STORAGE
├─> Store: filename, type, upload date, size
├─> Store: who uploaded it, when
└─> Assign: unique document ID

Step 1.4: PHYSICAL STORAGE
├─> Save file to disk/cloud storage
└─> Record: where file is stored

Step 1.5: QUEUE FOR PROCESSING
├─> Add document to processing queue
└─> Status: "waiting for extraction"
```
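The ingestion steps above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not the service itself: the function name `ingest_document`, the extension allowlist, and the in-memory `seen_hashes` dict are assumptions (the real system stores the metadata row in PostgreSQL).

```python
import hashlib
import uuid
from datetime import datetime, timezone

# Assumed allowlist for Step 1.1; the real service may accept more types.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".png", ".jpg"}

def ingest_document(filename: str, content: bytes, seen_hashes: dict) -> dict:
    """Validate, deduplicate, and register one uploaded file (Steps 1.1-1.5)."""
    # Step 1.1: file validation by extension
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        return {"status": "rejected", "reason": f"unsupported type {ext or 'unknown'}"}

    # Step 1.2: deduplication via a content fingerprint
    file_hash = hashlib.sha256(content).hexdigest()
    if file_hash in seen_hashes:
        return {"status": "duplicate", "existing_id": seen_hashes[file_hash]}

    # Step 1.3: metadata record (stands in for the PostgreSQL row)
    doc_id = str(uuid.uuid4())
    seen_hashes[file_hash] = doc_id
    return {
        "status": "queued",  # Step 1.5: waiting for extraction
        "document_id": doc_id,
        "filename": filename,
        "size_bytes": len(content),
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
        "file_hash": file_hash,
    }
```

Uploading the same bytes twice returns a `duplicate` result that links back to the first document ID, which is exactly the "store as new OR link to existing" decision in Step 1.2.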
## **STAGE 2: CONTENT EXTRACTION**

### **Theory: Why This Stage Exists**

**Problem:** Documents are in binary formats (PDF, DOCX, PPTX). We can't directly read them - we need to extract the text content.

**Goal:** Convert all documents into plain text that can be analyzed.

### **What Happens:**
```
PROCESSING QUEUE:
└─> System picks next document from queue

Step 2.1: IDENTIFY FILE TYPE
├─> Read: document.type
└─> Route to appropriate extractor

Step 2.2a: IF PDF
├─> Use: PyMuPDF library
├─> Process: Read each page
├─> Extract: Text content
└─> Output: Plain text string

Step 2.2b: IF DOCX (Word)
├─> Use: python-docx library
├─> Process: Read paragraphs, tables
├─> Extract: Text content
└─> Output: Plain text string

Step 2.2c: IF PPTX (PowerPoint)
├─> Use: python-pptx library
├─> Process: Read each slide
├─> Extract: Title, content, notes
└─> Output: Plain text string

Step 2.2d: IF CSV/XLSX (Spreadsheet)
├─> Use: pandas library
├─> Process: Read rows and columns
├─> Convert: To text representation
└─> Output: Structured text

Step 2.2e: IF IMAGE (PNG, JPG)
├─> Use: Claude Vision API (AI model)
├─> Process: Analyze image content
├─> Extract: Description of diagram/chart
└─> Output: Text description

Step 2.3: TEXT CLEANING
├─> Remove: Extra whitespace
├─> Fix: Encoding issues
├─> Preserve: Important structure
└─> Output: Clean text

Step 2.4: STORE EXTRACTED TEXT
├─> Save: To database
├─> Link: To original document
└─> Update status: "text_extracted"
```
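The routing (Step 2.1) and cleaning (Step 2.3) pieces can be sketched as below. The registry entries name the libraries the pipeline uses; the `route` helper, the `clean_text` regex, and the comment-level extractor calls are illustrative assumptions, with the actual extractor bodies elided.

```python
import re

def clean_text(raw: str) -> str:
    """Step 2.3: collapse extra whitespace while preserving paragraph breaks."""
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in raw.split("\n\n")]
    return "\n\n".join(p for p in paragraphs if p)

# Step 2.1: route each document to the right extractor by file extension.
# Each entry names the tool described above; real bodies would call e.g.
# fitz.open(...) for PDFs or Document(...).paragraphs for DOCX.
EXTRACTOR_FOR = {
    ".pdf":  "pymupdf",        # page-by-page text extraction
    ".docx": "python-docx",    # paragraphs and tables
    ".pptx": "python-pptx",    # slide titles, content, notes
    ".csv":  "pandas",         # rows/columns to text
    ".xlsx": "pandas",
    ".png":  "claude-vision",  # AI-generated description of the image
    ".jpg":  "claude-vision",
}

def route(filename: str) -> str:
    """Return the name of the extractor responsible for this file."""
    ext = "." + filename.rsplit(".", 1)[-1].lower()
    try:
        return EXTRACTOR_FOR[ext]
    except KeyError:
        raise ValueError(f"no extractor registered for {ext}")
```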
### **Example:**

**Input (PDF file):**

```
[Binary PDF data - cannot be read directly]
```

**Output (Extracted Text):**

```
"Project Alpha - Q3 Meeting Minutes
Date: August 15, 2024

Discussion:
Due to budget constraints, we decided to postpone
the machine learning features. This will impact
our December launch timeline.

Action Items:
- Revise project roadmap
- Notify stakeholders
- Adjust resource allocation"
```
### **Why This Stage?**

1. **Different formats need different tools** - One size doesn't fit all
2. **Extract only text** - Remove formatting, images (except for image docs)
3. **Standardize** - All docs become plain text for the next stage
4. **Images are special** - They need AI (Claude Vision) to understand
## **STAGE 3: CAUSAL RELATIONSHIP EXTRACTION** ⭐ (CRITICAL!)

### **Theory: Why This Stage Exists**

**Problem:** Having text is not enough. We need to understand WHY things happened.

**Example:**

- Just knowing "ML features postponed" is not useful
- Knowing "Budget cut → ML features postponed → Timeline delayed" is MUCH more useful

**Goal:** Extract cause-and-effect relationships from text.

### **What Is A Causal Relationship?**

A causal relationship has two parts: a cause and an effect.

```
CAUSE → EFFECT
```
**Example 1:**
- Cause: "Budget reduced by 30%"
- Effect: "ML features postponed"

**Example 2:**
- Cause: "John Smith left the company"
- Effect: "Sarah Chen became lead developer"

**Example 3:**
- Cause: "User feedback showed confusion"
- Effect: "We redesigned the onboarding flow"
### **How We Extract Them:**

```
INPUT: Extracted text from document

Step 3.1: BASIC NLP DETECTION (SpaCy)
├─> Look for: Causal keywords
│   Examples: "because", "due to", "as a result",
│   "led to", "caused", "therefore"
├─> Find: Sentences containing these patterns
└─> Output: Potential causal relationships (low confidence)

Step 3.2: AI-POWERED EXTRACTION (Claude API) ⭐
├─> Send: Full document text to Claude AI
├─> Ask Claude: "Find ALL causal relationships in this text"
├─> Claude analyzes:
│   • Explicit relationships ("because X, therefore Y")
│   • Implicit relationships (strongly implied)
│   • Context and background
│   • Who/what is involved
├─> Claude returns: Structured list of relationships
└─> Output: High-quality causal relationships (high confidence)

Step 3.3: STRUCTURE THE OUTPUT
For each relationship, extract:
├─> Cause: What triggered this?
├─> Effect: What was the result?
├─> Context: Additional background
├─> Entities: Who/what is involved? (people, teams, projects)
├─> Confidence: How certain are we? (0.0 to 1.0)
├─> Source: Which document and sentence?
└─> Date: When did this happen?

Step 3.4: STORE RELATIONSHIPS
├─> Save: To temporary database table
└─> Link: To source document
```
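Step 3.2 can be sketched as a prompt builder plus a validator for the model's JSON output. The prompt wording and the model name in the commented call are illustrative assumptions; the call itself uses the Anthropic Python SDK's `messages.create` and requires an `ANTHROPIC_API_KEY`.

```python
import json

# Illustrative prompt; the production prompt would be more detailed.
CAUSAL_PROMPT = """Find ALL causal relationships in the text below.
Return ONLY a JSON array; each item must have the keys:
cause, effect, context, entities, confidence, source_sentence, date.

TEXT:
{text}"""

def build_causal_prompt(doc_text: str) -> str:
    return CAUSAL_PROMPT.format(text=doc_text)

def parse_relationships(model_output: str) -> list[dict]:
    """Step 3.3: validate the model's JSON into structured relationships."""
    rels = json.loads(model_output)
    required = {"cause", "effect", "confidence"}
    # Drop any item missing the core fields.
    return [r for r in rels if required.issubset(r)]

# The actual API call (model name is an assumption):
# import anthropic
# client = anthropic.Anthropic()
# msg = client.messages.create(
#     model="claude-sonnet-4-5",
#     max_tokens=4096,
#     messages=[{"role": "user", "content": build_causal_prompt(doc_text)}],
# )
# relationships = parse_relationships(msg.content[0].text)
```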
### **Example: Claude's Analysis**

**Input Text:**

```
"In the Q3 review meeting, the CFO announced a 30%
budget reduction due to decreased market demand.
As a result, the engineering team decided to
postpone machine learning features for Project Alpha.
This means our December launch will be delayed
until March 2025."
```

**Claude's Output:**

```json
[
  {
    "cause": "Market demand decreased",
    "effect": "CFO reduced budget by 30%",
    "context": "Q3 financial review",
    "entities": ["CFO", "Finance Team"],
    "confidence": 0.95,
    "source_sentence": "30% budget reduction due to decreased market demand",
    "date": "Q3 2024"
  },
  {
    "cause": "Budget reduced by 30%",
    "effect": "Machine learning features postponed",
    "context": "Project Alpha roadmap adjustment",
    "entities": ["Engineering Team", "Project Alpha", "ML Team"],
    "confidence": 0.92,
    "source_sentence": "decided to postpone machine learning features",
    "date": "Q3 2024"
  },
  {
    "cause": "ML features postponed",
    "effect": "Launch delayed from December to March",
    "context": "Timeline impact",
    "entities": ["Project Alpha"],
    "confidence": 0.90,
    "source_sentence": "December launch will be delayed until March 2025",
    "date": "2024-2025"
  }
]
```
### **Why Use Both NLP AND Claude?**
| Method | Pros | Cons | Use Case |
|--------|------|------|----------|
| **NLP (SpaCy)** | Fast, cheap, runs locally | Misses implicit relationships, lower accuracy | Quick first pass, simple docs |
| **Claude AI** | Understands context, finds implicit relationships, high accuracy | Costs money, requires API | Complex docs, deep analysis |
**Strategy:** Use NLP first for quick scan, then Claude for deep analysis.
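The quick first pass can be approximated even without spaCy. A minimal sketch, assuming plain regexes (the keyword list comes from Step 3.1; the sentence splitting is deliberately naive, and spaCy's sentencizer would be more robust):

```python
import re

# Step 3.1 first pass: cheap keyword scan before spending API calls.
CAUSAL_MARKERS = re.compile(
    r"\b(because|due to|as a result|led to|caused|therefore|consequently)\b",
    re.IGNORECASE,
)

def candidate_causal_sentences(text: str) -> list[str]:
    """Return sentences containing a causal marker (low-confidence candidates)."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences if CAUSAL_MARKERS.search(s)]
```

Only the flagged sentences (or the documents containing them) then need the more expensive Claude pass.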
### **Why This Stage Is Critical:**
Without causal extraction, you just have a pile of facts:
- ❌ "Budget was cut"
- ❌ "ML features postponed"
- ❌ "Timeline changed"
With causal extraction, you understand the story:
- ✅ Market demand dropped → Budget cut → ML postponed → Timeline delayed
This is **the heart of your system** - it's what makes it intelligent.
---
## **STAGE 4: ENTITY RESOLUTION** 🤖
### **Theory: Why This Stage Exists**
**Problem:** Same people/things are mentioned differently across documents.
**Examples:**
- "John Smith", "John", "J. Smith", "Smith" → Same person
- "Project Alpha", "Alpha", "The Alpha Project" → Same project
- "ML Team", "Machine Learning Team", "AI Team" → Same team (maybe)
**Goal:** Identify that these different mentions refer to the same entity.
### **What Happens:**
```
INPUT: All causal relationships from all documents
Step 4.1: COLLECT ALL ENTITIES
├─> Scan: All causal relationships
├─> Extract: Every entity mentioned
└─> Result: List of entity mentions
["John", "John Smith", "J. Smith", "Sarah", "S. Chen",
"Project Alpha", "Alpha", "ML Team", ...]
Step 4.2: GROUP BY ENTITY TYPE
├─> People: ["John", "John Smith", "Sarah", ...]
├─> Projects: ["Project Alpha", "Alpha", ...]
├─> Teams: ["ML Team", "AI Team", ...]
└─> Organizations: ["Finance Dept", "Engineering", ...]
Step 4.3: AI-POWERED RESOLUTION (Claude API) ⭐
├─> Send: All entity mentions to Claude
├─> Ask Claude: "Which mentions refer to the same real-world entity?"
├─> Claude analyzes:
│ • Name similarities
│ • Context clues
│ • Role descriptions
│ • Co-occurrence patterns
└─> Claude returns: Grouped entities
Step 4.4: CREATE CANONICAL NAMES
├─> Choose: Best name for each entity
├─> Example: "John Smith" becomes canonical for ["John", "J. Smith"]
└─> Store: Mapping table
```
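Once Claude returns the grouped entities, applying them is straightforward: invert the `{canonical: [mentions]}` mapping and rewrite every relationship. A minimal sketch (the function name and data shapes are assumptions consistent with the examples in this stage):

```python
def apply_entity_resolution(relationships: list[dict], resolution: dict) -> list[dict]:
    """Step 4.4: rewrite every entity mention to its canonical name."""
    # Invert {canonical: [mentions]} into {mention: canonical} for O(1) lookup.
    alias_to_canonical = {
        mention: canonical
        for canonical, mentions in resolution.items()
        for mention in mentions
    }
    for rel in relationships:
        # Unknown mentions pass through unchanged; duplicates collapse via the set.
        rel["entities"] = sorted({
            alias_to_canonical.get(name, name) for name in rel.get("entities", [])
        })
    return relationships
```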
### **Example:**
**Input (mentions across all docs):**
```
Document 1: "John led the meeting"
Document 2: "J. Smith approved the budget"
Document 3: "John Smith will present next week"
Document 4: "Smith suggested the new approach"
```

**Claude's Resolution:**

```json
{
  "entities": {
    "John Smith": {
      "canonical_name": "John Smith",
      "mentions": ["John", "J. Smith", "John Smith", "Smith"],
      "type": "Person",
      "role": "Project Lead",
      "confidence": 0.95
    }
  }
}
```
### **Why This Matters:**
Without entity resolution:
- ❌ System thinks "John" and "John Smith" are different people
- ❌ Can't track someone's involvement across documents
- ❌ Relationships are fragmented
With entity resolution:
- ✅ System knows they're the same person
- ✅ Can see full picture of someone's involvement
- ✅ Relationships are connected
---
## **STAGE 5: KNOWLEDGE GRAPH CONSTRUCTION** 📊
### **Theory: Why This Stage Exists**
**Problem:** We have hundreds of causal relationships. How do we organize them? How do we find connections?
**Solution:** Build a **graph** - a network of nodes (things) and edges (relationships).
### **What Is A Knowledge Graph?**
Think of it like a map:
- **Nodes** = Places (events, people, projects)
- **Edges** = Roads (relationships between them)
```
Example Graph:
(Budget Cut)
│ CAUSES
(ML Postponed)
│ CAUSES
(Timeline Delayed)
│ AFFECTS
(Project Alpha)
│ INVOLVES
(Engineering Team)
```
### **What Happens:**
```
INPUT: Causal relationships + Resolved entities
Step 5.1: CREATE EVENT NODES
For each causal relationship:
├─> Create Node: Cause event
├─> Create Node: Effect event
└─> Properties: text, date, confidence
Example:
Node1: {type: "Event", text: "Budget reduced by 30%"}
Node2: {type: "Event", text: "ML features postponed"}
Step 5.2: CREATE ENTITY NODES
For each resolved entity:
├─> Create Node: Entity
└─> Properties: name, type, role
Example:
Node3: {type: "Person", name: "John Smith", role: "Lead"}
Node4: {type: "Project", name: "Project Alpha"}
Step 5.3: CREATE DOCUMENT NODES
For each source document:
└─> Create Node: Document
Properties: filename, date, type
Example:
Node5: {type: "Document", name: "Q3_meeting.pdf"}
Step 5.4: CREATE RELATIONSHIPS (Edges)
├─> CAUSES: Event1 → Event2
├─> INVOLVED_IN: Person → Event
├─> MENTIONS: Document → Entity
├─> AFFECTS: Event → Project
└─> Properties: confidence, source, date
Example Relationships:
(Budget Cut) -[CAUSES]-> (ML Postponed)
(John Smith) -[INVOLVED_IN]-> (Budget Cut)
(Q3_meeting.pdf) -[MENTIONS]-> (John Smith)
Step 5.5: STORE IN NEO4J
├─> Connect: To Neo4j database
├─> Create: All nodes
├─> Create: All relationships
└─> Index: For fast querying
```
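Steps 5.1 and 5.4 reduce to a `MERGE` statement per cause→effect pair. A hedged sketch: the helper name and property names are assumptions, and the commented execution uses the official Neo4j Python driver (connection URI and credentials are placeholders).

```python
def merge_causal_edge(rel: dict) -> tuple[str, dict]:
    """Build one parameterized MERGE statement for a cause→effect pair."""
    # MERGE is idempotent: re-running the pipeline won't duplicate nodes/edges.
    cypher = (
        "MERGE (c:Event {text: $cause}) "
        "MERGE (e:Event {text: $effect}) "
        "MERGE (c)-[r:CAUSES]->(e) "
        "SET r.confidence = $confidence, r.source = $source"
    )
    params = {
        "cause": rel["cause"],
        "effect": rel["effect"],
        "confidence": rel.get("confidence", 0.0),
        "source": rel.get("source", "unknown"),
    }
    return cypher, params

# Executing against Neo4j (connection details assumed):
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# with driver.session() as session:
#     for rel in relationships:
#         session.run(*merge_causal_edge(rel))
```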
### **Visual Example:**
**Before (Just Text):**
```
"Budget cut → ML postponed"
"ML postponed → Timeline delayed"
"John Smith involved in budget decision"
```
**After (Knowledge Graph):**
```
(John Smith)
│ INVOLVED_IN
(Budget Cut) ──MENTIONED_IN──> (Q3_meeting.pdf)
│ CAUSES
(ML Postponed) ──AFFECTS──> (Project Alpha)
│ CAUSES
(Timeline Delayed) ──INVOLVES──> (Engineering Team)
```
### **Why Use A Graph?**
| Question | Without Graph | With Graph |
|----------|---------------|------------|
| "Why was ML postponed?" | Search all docs manually | Follow CAUSES edge backwards |
| "What did budget cut affect?" | Re-read everything | Follow CAUSES edges forward |
| "What is John involved in?" | Search his name everywhere | Follow INVOLVED_IN edges |
| "How are events connected?" | Hard to see | Visual path through graph |
**Key Benefit:** The graph shows **HOW** everything connects, not just WHAT exists.
---
## **STAGE 6: GRAPH TO VECTOR DATABASE** 🔄
### **Theory: Why This Stage Exists**
**Problem:**
- Neo4j is great for finding relationships ("What caused X?")
- But it's NOT good for semantic search ("Find docs about machine learning")
**Solution:** We need BOTH:
- **Neo4j** = Find causal chains and connections
- **Qdrant** = Find relevant content by meaning
### **Why We Need Both:**
**Neo4j (Graph Database):**
```
Good for: "Show me the chain of events that led to timeline delay"
Answer: Budget Cut → ML Postponed → Timeline Delayed
```
**Qdrant (Vector Database):**
```
Good for: "Find all content related to machine learning"
Answer: [50 relevant chunks from across all documents]
```
### **What Happens:**
```
INPUT: Complete Knowledge Graph in Neo4j
Step 6.1: EXTRACT CAUSAL CHAINS
├─> Query Neo4j: "Find all causal paths"
│ Example: MATCH (a)-[:CAUSES*1..3]->(b)
├─> Get: Sequences of connected events
└─> Result: List of causal chains
Example chains:
1. Market demand ↓ → Budget cut → ML postponed
2. John left → Sarah promoted → Team restructured
3. User feedback → Design change → Timeline adjusted
Step 6.2: CONVERT TO NARRATIVE TEXT
Take each chain and write it as a story:
Before: [Node1] → [Node2] → [Node3]
After: "Due to decreased market demand, the CFO
reduced the budget by 30%. This led to the
postponement of machine learning features, which
ultimately delayed the December launch to March."
WHY? Because we need text to create embeddings!
Step 6.3: ENRICH WITH CONTEXT
Add information from the graph:
├─> Who was involved?
├─> When did it happen?
├─> Which documents mention this?
├─> What projects were affected?
└─> How confident are we?
Enriched text:
"[CAUSAL CHAIN]
Due to decreased market demand, the CFO reduced
the budget by 30%. This led to ML postponement.
[METADATA]
Date: Q3 2024
Involved: CFO, Engineering Team, Project Alpha
Sources: Q3_meeting.pdf, budget_report.xlsx
Confidence: 0.92"
Step 6.4: CREATE EMBEDDINGS
├─> Use: OpenAI Embedding API
├─> Input: Enriched text
├─> Output: Vector (1536 numbers)
│ Example: [0.123, -0.456, 0.789, ...]
└─> This vector represents the "meaning" of the text
Step 6.5: STORE IN QDRANT
For each enriched chunk:
├─> Vector: The embedding
├─> Payload: The original text + all metadata
│ {
│ "text": "enriched narrative",
│ "type": "causal_chain",
│ "entities": ["CFO", "Project Alpha"],
│ "sources": ["Q3_meeting.pdf"],
│ "confidence": 0.92,
│ "graph_path": "Node1->Node2->Node3"
│ }
└─> Store: In Qdrant collection
```
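Steps 6.2-6.3 (chain → enriched text) can be sketched as a pure function, with the embedding and upsert calls (Steps 6.4-6.5) shown in comments. The `enrich_chain` name and text layout are assumptions modeled on the example above; the commented calls use the OpenAI and Qdrant Python clients and assume running services.

```python
def enrich_chain(events: list[str], meta: dict) -> dict:
    """Steps 6.2-6.3: turn a causal chain into an enriched, embeddable chunk."""
    narrative = " → ".join(events)
    text = (
        f"[CAUSAL CHAIN]\n{narrative}\n"
        f"[METADATA]\n"
        f"Involved: {', '.join(meta.get('entities', []))}\n"
        f"Sources: {', '.join(meta.get('sources', []))}\n"
        f"Confidence: {meta.get('confidence', 0.0)}"
    )
    # The payload keeps both the text and the graph metadata for retrieval.
    return {"text": text, "type": "causal_chain", **meta}

# Steps 6.4-6.5 with the OpenAI and Qdrant clients (setup assumed):
# from openai import OpenAI
# from qdrant_client import QdrantClient
# from qdrant_client.models import PointStruct
# vector = OpenAI().embeddings.create(
#     model="text-embedding-3-small", input=chunk["text"]
# ).data[0].embedding
# QdrantClient("localhost").upsert(
#     collection_name="causal_chains",
#     points=[PointStruct(id=1, vector=vector, payload=chunk)],
# )
```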
### **What Are Embeddings?**
Think of embeddings as **coordinates in meaning-space**:
```
Text: "machine learning features"
Embedding: [0.2, 0.8, 0.1, -0.3, ...] ← 1536 numbers
Text: "AI capabilities"
Embedding: [0.19, 0.82, 0.09, -0.29, ...] ← Similar numbers!
Text: "budget reporting"
Embedding: [-0.6, 0.1, 0.9, 0.4, ...] ← Very different numbers
```
Similar meanings → Similar vectors → Qdrant finds them together!
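"Finds them together" concretely means cosine similarity between vectors. Using the example numbers above (truncated to 4 dimensions instead of 1536 for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

ml_features = [0.20, 0.80, 0.10, -0.30]   # "machine learning features"
ai_caps     = [0.19, 0.82, 0.09, -0.29]   # "AI capabilities"
budgets     = [-0.60, 0.10, 0.90, 0.40]   # "budget reporting"

# Similar meanings score near 1.0; unrelated ones score much lower.
assert cosine_similarity(ml_features, ai_caps) > 0.99
assert cosine_similarity(ml_features, budgets) < 0.2
```

This is the distance measure Qdrant computes at scale over the full 1536-dimensional vectors.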
### **Example Flow:**
**From Neo4j:**
```
Chain: (Budget Cut) → (ML Postponed) → (Timeline Delayed)
```
**Convert to Text:**
```
"Budget reduced by 30% → ML features postponed →
December launch delayed to March"
```
**Enrich:**
```
"[Causal Chain] Budget reduced by 30% led to ML
features being postponed, which delayed the December
launch to March 2025.
Involved: CFO, Engineering Team, Project Alpha
Sources: Q3_meeting.pdf, roadmap.pptx
Confidence: 0.91
Date: August-September 2024"
```
**Create Embedding:**
```
[0.234, -0.567, 0.891, 0.123, ...] ← 1536 numbers
```

**Store in Qdrant:**

```json
{
  "id": "chain_001",
  "vector": [0.234, -0.567, ...],
  "payload": {
    "text": "enriched narrative...",
    "type": "causal_chain",
    "entities": ["CFO", "Engineering Team"],
    "sources": ["Q3_meeting.pdf"],
    "confidence": 0.91
  }
}
```
### **Why This Stage?**
Now we have the **best of both worlds**:
| Need | Use |
|------|-----|
| "Find content about machine learning" | Qdrant semantic search |
| "Show me the causal chain" | Neo4j graph traversal |
| "Why did timeline delay?" | Start with Qdrant, then Neo4j for details |
| "Generate comprehensive report" | Pull from BOTH |
---
## **STAGE 7: REPORT GENERATION** 📝 (FINAL STAGE)
### **Theory: Why This Stage Exists**
**Goal:** Take everything we've learned from 100+ documents and create ONE comprehensive, readable report.
### **What Happens:**
```
USER ACTION:
└─> User clicks "Generate Onboarding Report"
Step 7.1: DEFINE REPORT REQUIREMENTS
What should the report include?
├─> Project overview
├─> Key decisions and WHY they were made
├─> Important people and their roles
├─> Timeline of events
├─> Current status
└─> Next steps
Step 7.2: SEMANTIC SEARCH (Qdrant)
Query 1: "project overview goals objectives"
├─> Qdrant returns: Top 20 relevant chunks
└─> Covers: High-level project information
Query 2: "timeline milestones dates schedule"
├─> Qdrant returns: Top 15 relevant chunks
└─> Covers: Timeline information
Query 3: "decisions architecture technical"
├─> Qdrant returns: Top 15 relevant chunks
└─> Covers: Technical decisions
Total: ~50 most relevant chunks from Qdrant
Step 7.3: GRAPH TRAVERSAL (Neo4j)
Query 1: Get critical causal chains
├─> MATCH (a)-[:CAUSES*2..4]->(b)
├─> WHERE confidence > 0.8
└─> Returns: Top 20 important decision chains
Query 2: Get key entities
├─> MATCH (e:Entity)-[:INVOLVED_IN]->(events)
├─> Count events per entity
└─> Returns: Most involved people/teams/projects
Query 3: Get recent timeline
├─> MATCH (e:Event) WHERE e.date > '2024-01-01'
├─> Order by date
└─> Returns: Chronological event list
Step 7.4: AGGREGATE CONTEXT
Combine everything:
├─> 50 semantic chunks from Qdrant
├─> 20 causal chains from Neo4j
├─> Key entities and their profiles
├─> Timeline of events
└─> Metadata about sources
Total Context Size: ~30,000-50,000 tokens
Step 7.5: PREPARE PROMPT FOR CLAUDE
Structure the prompt:
┌─────────────────────────────────────┐
│ SYSTEM: You are an expert technical │
│ writer creating an onboarding report│
│ │
│ USER: Based on these 100+ documents,│
│ create a comprehensive report. │
│ │
│ # SEMANTIC CONTEXT: │
│ [50 chunks from Qdrant] │
│ │
│ # CAUSAL CHAINS: │
│ [20 decision chains from Neo4j] │
│ │
│ # KEY ENTITIES: │
│ [People, teams, projects] │
│ │
│ # TIMELINE: │
│ [Chronological events] │
│ │
│ Generate report with sections: │
│ 1. Executive Summary │
│ 2. Project Overview │
│ 3. Key Decisions (with WHY) │
│ 4. Timeline │
│ 5. Current Status │
│ 6. Next Steps │
└─────────────────────────────────────┘
Step 7.6: CALL CLAUDE API ⭐
├─> Send: Complete prompt to Claude
├─> Claude processes:
│ • Reads all context
│ • Identifies key themes
│ • Synthesizes information
│ • Creates narrative structure
│ • Explains causal relationships
│ • Writes clear, coherent report
└─> Returns: Markdown-formatted report
Step 7.7: POST-PROCESS REPORT
├─> Add: Table of contents
├─> Add: Citations to source documents
├─> Add: Confidence indicators
├─> Format: Headings, bullet points, emphasis
└─> Result: Final Markdown report
Step 7.8: CONVERT TO PDF
├─> Use: Markdown-to-PDF library
├─> Add: Styling and formatting
├─> Add: Page numbers, headers
└─> Result: Professional PDF report
Step 7.9: DELIVER TO USER
├─> Save: PDF to storage
├─> Generate: Download link
└─> Show: Success message with download button
```

## **🔄 COMPLETE DATA FLOW SUMMARY**

```
Documents (100+)
    ↓
[Extract Text] → Plain Text
    ↓
[Claude: Causal Extraction] → Relationships List
    ↓
[Claude: Entity Resolution] → Resolved Entities
    ↓
[Build Graph] → Neo4j Knowledge Graph
    ↓
[Convert + Enrich] → Narrative Chunks
    ↓
[Create Embeddings] → Vectors
    ↓
[Store] → Qdrant Vector DB
    ↓
[User Request] → "Generate Report"
    ↓
[Query Qdrant] → Relevant Chunks
        +
[Query Neo4j] → Causal Chains
    ↓
[Claude: Synthesis] → Final Report
    ↓
[Convert] → PDF
    ↓
[Deliver] → User Downloads Report
```
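The context-aggregation step of Stage 7 (Steps 7.4-7.5) can be sketched as a prompt builder that merges the Qdrant chunks, Neo4j chains, entities, and timeline into one synthesis prompt. The function name, section titles, and instruction wording are illustrative assumptions based on the prompt box above:

```python
def build_report_prompt(chunks: list[str], chains: list[str],
                        entities: list[str], timeline: list[str]) -> str:
    """Steps 7.4-7.5: assemble retrieved context into one synthesis prompt."""
    sections = [
        ("SEMANTIC CONTEXT", chunks),    # ~50 chunks from Qdrant
        ("CAUSAL CHAINS", chains),       # ~20 chains from Neo4j
        ("KEY ENTITIES", entities),
        ("TIMELINE", timeline),
    ]
    body = "\n\n".join(
        f"# {title}:\n" + "\n".join(f"- {item}" for item in items)
        for title, items in sections
    )
    return (
        "Based on these 100+ documents, create a comprehensive onboarding report "
        "with sections: Executive Summary, Project Overview, Key Decisions (with WHY), "
        "Timeline, Current Status, Next Steps.\n\n" + body
    )
```

The resulting string is what Step 7.6 sends to Claude as the user message, alongside the system prompt.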