src/multi_document_upload_service/
    .dockerignore
    Dockerfile
    README.md
    requirements.txt
# COMPLETE END-TO-END FLOW: Multi-Document Analysis to Report Generation

Let me give you the most detailed explanation possible, with theory, diagrams, and a step-by-step breakdown.

## 🎯 SYSTEM OVERVIEW

**What We're Building:** A system that takes 100+ documents (PDFs, DOCX, PPT, images, etc.) and generates a comprehensive onboarding report by understanding causal relationships and connections across all documents.

**Key Components:**

1. **Document Storage** - Store uploaded files
2. **Content Extraction** - Get text from different formats
3. **Causal Analysis** - Understand cause-effect relationships (with Claude)
4. **Knowledge Graph** - Store relationships in Neo4j
5. **Vector Database** - Enable semantic search in Qdrant
6. **Report Generation** - Create final report (with Claude)
## 📊 COMPLETE ARCHITECTURE DIAGRAM

USER INTERFACE
├─> Upload Documents (100+ files)
└─> Generate Report button
        │
        ▼
APPLICATION LAYER
├─> DOCUMENT UPLOAD SERVICE
│     • Validate file types
│     • Calculate file hash (deduplication)
│     • Store metadata in PostgreSQL
│     • Save files to storage (local)
├─> EXTRACTION ORCHESTRATOR
│     • Routes files to the appropriate extractor (PDF, DOCX, PPTX, Image)
│     • Manages the extraction queue
│     • Handles failures and retries
│     → Output: extracted text for each document
├─> 🤖 CLAUDE AI - CAUSAL EXTRACTION
│     For each document:
│       Input: extracted text + metadata
│       Output: list of causal relationships, e.g.
│         {"cause": "Budget cut by 30%", "effect": "ML features postponed",
│          "confidence": 0.92, "entities": ["Finance Team", "ML Team"]}
│     → Stored in a temporary PostgreSQL table
├─> 🤖 CLAUDE AI - ENTITY RESOLUTION
│     Resolve entity mentions across all documents:
│       Input: all entity mentions, e.g. ["John", "J. Smith", "John Smith"]
│       Output: resolved entities, e.g. {"John Smith": ["John", "J. Smith"]}
└─> KNOWLEDGE GRAPH BUILDER
      • Builds the Neo4j graph from causal relationships
        │
        ▼
STORAGE LAYER
├─> PostgreSQL: metadata, file paths, processing status
├─> Neo4j: nodes (Events, Entities, Documents); edges (CAUSES, INVOLVES, MENTIONS)
└─> Qdrant: vectors, enriched chunks, metadata
        │
        ▼
KG TO QDRANT ENRICHMENT PIPELINE
1. Query Neo4j for causal chains: MATCH (a)-[:CAUSES*1..3]->(b)
2. Convert to enriched text chunks: "Budget cut → ML postponed → Timeline shifted"
3. Generate embeddings (OpenAI)
4. Store in Qdrant with metadata from the KG
   (original causal chain, entities involved, confidence scores, source documents)
        │
        ▼
REPORT GENERATION PHASE (user clicks "Generate Report")
├─> RETRIEVAL ORCHESTRATOR
│     Step 1: Semantic search (Qdrant)
│       Query: "project overview timeline decisions" → top 50 most relevant chunks
│     Step 2: Graph traversal (Neo4j)
│       Query: critical causal chains with confidence > 0.8 → important decision paths
│     Step 3: Entity analysis (Neo4j)
│       Query: key people, teams, projects → entity profiles
│     → Output: aggregated context package
├─> 🤖 CLAUDE AI - FINAL REPORT GENERATION
│     Input: 50 semantic chunks (Qdrant), 20 causal chains (Neo4j),
│            entity profiles, report template
│     Prompt: "You are creating an onboarding report. Based on 100+ documents,
│              synthesize: project overview, key decisions and WHY they were made,
│              critical causal chains, timeline and milestones, current status
│              and next steps."
│     Output: comprehensive Markdown report
├─> PDF GENERATION
│     • Convert Markdown to PDF
│     • Add formatting and table of contents
│     • Include citations to source documents
└─> Final PDF report → download to user
## 📚 COMPLETE THEORY-WISE STEP-BY-STEP FLOW

Let me explain the entire system in pure theory: how it works, why each step exists, and what problem it solves.
## 🎯 THE BIG PICTURE (Theory)

**The Problem:** A new person joins a project that has 100+ documents (meeting notes, technical specs, design docs, emails, presentations). Reading all of them would take weeks. They need to understand:

- WHAT happened in the project
- WHY decisions were made (causal relationships)
- WHO is involved
- WHEN things happened
- HOW everything connects

**The Solution:** Build an intelligent system that:

- Reads all documents automatically
- Understands cause-and-effect relationships
- Connects related information across documents
- Generates a comprehensive summary report
## 🔄 COMPLETE FLOW (Theory Explanation)

---

## **STAGE 1: DOCUMENT INGESTION**

### **Theory: Why This Stage Exists**

**Problem:** We have 100+ documents in different formats (PDF, Word, PowerPoint, Excel, images). We need to get them into the system.

**Goal:**

- Accept all document types
- Organize them
- Prevent duplicates
- Track processing status

### **What Happens:**
USER ACTION: └─> User uploads 100 files through web interface
SYSTEM ACTIONS:
Step 1.1: FILE VALIDATION ├─> Check: Is this a supported file type? ├─> Check: Is file size acceptable? └─> Decision: Accept or Reject
Step 1.2: DEDUPLICATION ├─> Calculate unique hash (fingerprint) of file content ├─> Check: Have we seen this exact file before? └─> Decision: Store as new OR link to existing
Step 1.3: METADATA STORAGE ├─> Store: filename, type, upload date, size ├─> Store: who uploaded it, when └─> Assign: unique document ID
Step 1.4: PHYSICAL STORAGE ├─> Save file to disk/cloud storage └─> Record: where file is stored
Step 1.5: QUEUE FOR PROCESSING ├─> Add document to processing queue └─> Status: "waiting for extraction"
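As a rough sketch of Steps 1.1-1.5, here is how the upload service might validate, deduplicate, and register a file. The storage directory, supported types, and in-memory stores are placeholders; the real service would use PostgreSQL and a proper processing queue.

```python
# Minimal sketch of Steps 1.1-1.5, assuming local disk storage and in-memory
# stand-ins for the metadata database and deduplication index.
import hashlib
import shutil
import uuid
from datetime import datetime, timezone
from pathlib import Path

SUPPORTED_TYPES = {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".png", ".jpg"}
STORAGE_DIR = Path("uploaded_documents")      # assumed storage location
metadata_store: dict[str, dict] = {}          # stand-in for PostgreSQL
seen_hashes: dict[str, str] = {}              # content hash -> document id


def ingest_file(path: Path) -> str:
    # Step 1.1: validate file type
    if path.suffix.lower() not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {path.suffix}")

    # Step 1.2: deduplicate by content hash
    content_hash = hashlib.sha256(path.read_bytes()).hexdigest()
    if content_hash in seen_hashes:
        return seen_hashes[content_hash]      # link to existing document

    # Step 1.3: store metadata and assign a unique document id
    doc_id = str(uuid.uuid4())
    metadata_store[doc_id] = {
        "filename": path.name,
        "size_bytes": path.stat().st_size,
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
        "status": "waiting_for_extraction",   # Step 1.5: queued for processing
    }

    # Step 1.4: physical storage
    STORAGE_DIR.mkdir(exist_ok=True)
    shutil.copy(path, STORAGE_DIR / f"{doc_id}{path.suffix}")
    seen_hashes[content_hash] = doc_id
    return doc_id
```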
## **STAGE 2: CONTENT EXTRACTION**

### **Theory: Why This Stage Exists**

**Problem:** Documents are in binary formats (PDF, DOCX, PPTX). We can't read them directly - we need to extract the text content.

**Goal:** Convert all documents into plain text that can be analyzed.

### **What Happens:**
PROCESSING QUEUE: └─> System picks next document from queue
Step 2.1: IDENTIFY FILE TYPE ├─> Read: document.type └─> Route to appropriate extractor
Step 2.2a: IF PDF ├─> Use: PyMuPDF library ├─> Process: Read each page ├─> Extract: Text content └─> Output: Plain text string
Step 2.2b: IF DOCX (Word) ├─> Use: python-docx library ├─> Process: Read paragraphs, tables ├─> Extract: Text content └─> Output: Plain text string
Step 2.2c: IF PPTX (PowerPoint) ├─> Use: python-pptx library ├─> Process: Read each slide ├─> Extract: Title, content, notes └─> Output: Plain text string
Step 2.2d: IF CSV/XLSX (Spreadsheet) ├─> Use: pandas library ├─> Process: Read rows and columns ├─> Convert: To text representation └─> Output: Structured text
Step 2.2e: IF IMAGE (PNG, JPG) ├─> Use: Claude Vision API (AI model) ├─> Process: Analyze image content ├─> Extract: Description of diagram/chart └─> Output: Text description
Step 2.3: TEXT CLEANING ├─> Remove: Extra whitespace ├─> Fix: Encoding issues ├─> Preserve: Important structure └─> Output: Clean text
Step 2.4: STORE EXTRACTED TEXT ├─> Save: To database ├─> Link: To original document └─> Update status: "text_extracted"
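Putting Steps 2.1-2.2 together, a routing function might look like the following sketch. It assumes PyMuPDF, python-docx, python-pptx, and pandas are installed, and omits the Claude Vision path for images.

```python
# Sketch of the extraction router (Steps 2.1-2.2); image handling via
# Claude Vision is left out here.
from pathlib import Path

import fitz                     # PyMuPDF
import pandas as pd
from docx import Document
from pptx import Presentation


def extract_text(path: Path) -> str:
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        with fitz.open(path) as pdf:
            return "\n".join(page.get_text() for page in pdf)
    if suffix == ".docx":
        doc = Document(path)
        return "\n".join(p.text for p in doc.paragraphs)
    if suffix == ".pptx":
        prs = Presentation(path)
        slides = []
        for slide in prs.slides:
            slides.append("\n".join(
                shape.text_frame.text
                for shape in slide.shapes if shape.has_text_frame
            ))
        return "\n\n".join(slides)
    if suffix in {".csv", ".xlsx"}:
        frame = pd.read_csv(path) if suffix == ".csv" else pd.read_excel(path)
        return frame.to_string(index=False)
    raise ValueError(f"No extractor registered for {suffix}")
```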
**Example:**

Input (PDF file): [Binary PDF data - cannot be read directly]

Output (Extracted Text):
"Project Alpha - Q3 Meeting Minutes Date: August 15, 2024
Discussion: Due to budget constraints, we decided to postpone the machine learning features. This will impact our December launch timeline.
Action Items:
- Revise project roadmap
- Notify stakeholders
- Adjust resource allocation"
### **Why This Stage?**

- **Different formats need different tools** - One size doesn't fit all
- **Extract only text** - Remove formatting and images (except for image docs)
- **Standardize** - All docs become plain text for the next stage
- **Images are special** - They need AI (Claude Vision) to understand
## **STAGE 3: CAUSAL RELATIONSHIP EXTRACTION** ⭐ (CRITICAL!)

### **Theory: Why This Stage Exists**

**Problem:** Having text is not enough. We need to understand WHY things happened.

**Example:**

- Just knowing "ML features postponed" is not useful
- Knowing "Budget cut → ML features postponed → Timeline delayed" is MUCH more useful

**Goal:** Extract cause-and-effect relationships from text.

### **What Is A Causal Relationship?**

A causal relationship links a cause to an effect: CAUSE → EFFECT
Example 1: Cause: "Budget reduced by 30%" Effect: "ML features postponed"
Example 2: Cause: "John Smith left the company" Effect: "Sarah Chen became lead developer"
Example 3: Cause: "User feedback showed confusion" Effect: "We redesigned the onboarding flow"
How We Extract Them:
INPUT: Extracted text from document
Step 3.1: BASIC NLP DETECTION (SpaCy) ├─> Look for: Causal keywords │ Examples: "because", "due to", "as a result", │ "led to", "caused", "therefore" ├─> Find: Sentences containing these patterns └─> Output: Potential causal relationships (low confidence)
Step 3.2: AI-POWERED EXTRACTION (Claude API) ⭐ ├─> Send: Full document text to Claude AI ├─> Ask Claude: "Find ALL causal relationships in this text" ├─> Claude analyzes: │ • Explicit relationships ("because X, therefore Y") │ • Implicit relationships (strongly implied) │ • Context and background │ • Who/what is involved ├─> Claude returns: Structured list of relationships └─> Output: High-quality causal relationships (high confidence)
Step 3.3: STRUCTURE THE OUTPUT For each relationship, extract: ├─> Cause: What triggered this? ├─> Effect: What was the result? ├─> Context: Additional background ├─> Entities: Who/what is involved? (people, teams, projects) ├─> Confidence: How certain are we? (0.0 to 1.0) ├─> Source: Which document and sentence? └─> Date: When did this happen?
Step 3.4: STORE RELATIONSHIPS ├─> Save: To temporary database table └─> Link: To source document
Example: Claude's Analysis
Input Text:
"In the Q3 review meeting, the CFO announced a 30% budget reduction due to decreased market demand As a result, the engineering team decided to postpone machine learning features for Project Alpha. This means our December launch will be delayed until March 2025."
Claude's Output:
[ { "cause": "Market demand decreased", "effect": "CFO reduced budget by 30%", "context": "Q3 financial review", "entities": ["CFO", "Finance Team"], "confidence": 0.95, "source_sentence": "30% budget reduction due to decreased market demand", "date": "Q3 2024" }, { "cause": "Budget reduced by 30%", "effect": "Machine learning features postponed", "context": "Project Alpha roadmap adjustment", "entities": ["Engineering Team", "Project Alpha", "ML Team"], "confidence": 0.92, "source_sentence": "decided to postpone machine learning features", "date": "Q3 2024" }, { "cause": "ML features postponed", "effect": "Launch delayed from December to March", "context": "Timeline impact", "entities": ["Project Alpha"], "confidence": 0.90, "source_sentence": "December launch will be delayed until March 2025", "date": "2024-2025" } ]
### **Why Use Both NLP AND Claude?**
| Method | Pros | Cons | Use Case |
|--------|------|------|----------|
| **NLP (SpaCy)** | Fast, cheap, runs locally | Misses implicit relationships, lower accuracy | Quick first pass, simple docs |
| **Claude AI** | Understands context, finds implicit relationships, high accuracy | Costs money, requires API | Complex docs, deep analysis |
**Strategy:** Use NLP first for quick scan, then Claude for deep analysis.
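A minimal sketch of the quick NLP pass (Step 3.1) with spaCy might look like this, assuming the en_core_web_sm model is available:

```python
# Rough sketch of Step 3.1: flag sentences that contain causal keywords.
import spacy

CAUSAL_MARKERS = ("because", "due to", "as a result", "led to",
                  "caused", "therefore")

nlp = spacy.load("en_core_web_sm")


def find_candidate_sentences(text: str) -> list[str]:
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents
            if any(marker in sent.text.lower() for marker in CAUSAL_MARKERS)]

# Candidates are low-confidence hints; Claude (Step 3.2) does the deep pass.
```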
### **Why This Stage Is Critical:**
Without causal extraction, you just have a pile of facts:
- ❌ "Budget was cut"
- ❌ "ML features postponed"
- ❌ "Timeline changed"
With causal extraction, you understand the story:
- ✅ Market demand dropped → Budget cut → ML postponed → Timeline delayed
This is **the heart of your system** - it's what makes it intelligent.
---
## **STAGE 4: ENTITY RESOLUTION** 🤖
### **Theory: Why This Stage Exists**
**Problem:** Same people/things are mentioned differently across documents.
**Examples:**
- "John Smith", "John", "J. Smith", "Smith" → Same person
- "Project Alpha", "Alpha", "The Alpha Project" → Same project
- "ML Team", "Machine Learning Team", "AI Team" → Same team (maybe)
**Goal:** Identify that these different mentions refer to the same entity.
### **What Happens:**
INPUT: All causal relationships from all documents
Step 4.1: COLLECT ALL ENTITIES ├─> Scan: All causal relationships ├─> Extract: Every entity mentioned └─> Result: List of entity mentions ["John", "John Smith", "J. Smith", "Sarah", "S. Chen", "Project Alpha", "Alpha", "ML Team", ...]
Step 4.2: GROUP BY ENTITY TYPE ├─> People: ["John", "John Smith", "Sarah", ...] ├─> Projects: ["Project Alpha", "Alpha", ...] ├─> Teams: ["ML Team", "AI Team", ...] └─> Organizations: ["Finance Dept", "Engineering", ...]
Step 4.3: AI-POWERED RESOLUTION (Claude API) ⭐ ├─> Send: All entity mentions to Claude ├─> Ask Claude: "Which mentions refer to the same real-world entity?" ├─> Claude analyzes: │ • Name similarities │ • Context clues │ • Role descriptions │ • Co-occurrence patterns └─> Claude returns: Grouped entities
Step 4.4: CREATE CANONICAL NAMES ├─> Choose: Best name for each entity ├─> Example: "John Smith" becomes canonical for ["John", "J. Smith"] └─> Store: Mapping table
### **Example:**
**Input (mentions across all docs):**
Document 1: "John led the meeting" Document 2: "J. Smith approved the budget" Document 3: "John Smith will present next week" Document 4: "Smith suggested the new approach"
**Claude's Resolution:**

{
  "entities": {
    "John Smith": {
      "canonical_name": "John Smith",
      "mentions": ["John", "J. Smith", "John Smith", "Smith"],
      "type": "Person",
      "role": "Project Lead",
      "confidence": 0.95
    }
  }
}
### **Why This Matters:**
Without entity resolution:
- ❌ System thinks "John" and "John Smith" are different people
- ❌ Can't track someone's involvement across documents
- ❌ Relationships are fragmented
With entity resolution:
- ✅ System knows they're the same person
- ✅ Can see full picture of someone's involvement
- ✅ Relationships are connected
---
## **STAGE 5: KNOWLEDGE GRAPH CONSTRUCTION** 📊
### **Theory: Why This Stage Exists**
**Problem:** We have hundreds of causal relationships. How do we organize them? How do we find connections?
**Solution:** Build a **graph** - a network of nodes (things) and edges (relationships).
### **What Is A Knowledge Graph?**
Think of it like a map:
- **Nodes** = Places (events, people, projects)
- **Edges** = Roads (relationships between them)
Example Graph:
(Budget Cut)
│
│ CAUSES
▼
(ML Postponed)
│
│ CAUSES
▼
(Timeline Delayed)
      │
      │ AFFECTS
      ▼
(Project Alpha)
      │
      │ INVOLVES
      ▼
(Engineering Team)
### **What Happens:**
INPUT: Causal relationships + Resolved entities
Step 5.1: CREATE EVENT NODES For each causal relationship: ├─> Create Node: Cause event ├─> Create Node: Effect event └─> Properties: text, date, confidence
Example: Node1: {type: "Event", text: "Budget reduced by 30%"} Node2: {type: "Event", text: "ML features postponed"}
Step 5.2: CREATE ENTITY NODES For each resolved entity: ├─> Create Node: Entity └─> Properties: name, type, role
Example: Node3: {type: "Person", name: "John Smith", role: "Lead"} Node4: {type: "Project", name: "Project Alpha"}
Step 5.3: CREATE DOCUMENT NODES For each source document: └─> Create Node: Document Properties: filename, date, type
Example: Node5: {type: "Document", name: "Q3_meeting.pdf"}
Step 5.4: CREATE RELATIONSHIPS (Edges) ├─> CAUSES: Event1 → Event2 ├─> INVOLVED_IN: Person → Event ├─> MENTIONS: Document → Entity ├─> AFFECTS: Event → Project └─> Properties: confidence, source, date
Example Relationships: (Budget Cut) -[CAUSES]-> (ML Postponed) (John Smith) -[INVOLVED_IN]-> (Budget Cut) (Q3_meeting.pdf) -[MENTIONS]-> (John Smith)
Step 5.5: STORE IN NEO4J ├─> Connect: To Neo4j database ├─> Create: All nodes ├─> Create: All relationships └─> Index: For fast querying
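Step 5.5 could be implemented with the official neo4j Python driver along these lines; the connection URI, credentials, and exact property names are placeholders:

```python
# Sketch of Step 5.5: write one causal relationship and its entities to Neo4j.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))


def store_relationship(rel: dict, source_doc: str) -> None:
    query = """
    MERGE (c:Event {text: $cause})
    MERGE (e:Event {text: $effect})
    MERGE (c)-[r:CAUSES]->(e)
      SET r.confidence = $confidence, r.source = $source
    WITH c, e
    UNWIND $entities AS entity_name
      MERGE (ent:Entity {name: entity_name})
      MERGE (ent)-[:INVOLVED_IN]->(c)
    """
    with driver.session() as session:
        session.run(query, cause=rel["cause"], effect=rel["effect"],
                    confidence=rel["confidence"], source=source_doc,
                    entities=rel.get("entities", []))
```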
### **Visual Example:**
**Before (Just Text):**
"Budget cut → ML postponed" "ML postponed → Timeline delayed" "John Smith involved in budget decision"
**After (Knowledge Graph):**
(John Smith)
│
│ INVOLVED_IN
▼
(Budget Cut) ──MENTIONED_IN──> (Q3_meeting.pdf)
│
│ CAUSES
▼
(ML Postponed) ──AFFECTS──> (Project Alpha)
│
│ CAUSES
▼
(Timeline Delayed) ──INVOLVES──> (Engineering Team)
### **Why Use A Graph?**
| Question | Without Graph | With Graph |
|----------|---------------|------------|
| "Why was ML postponed?" | Search all docs manually | Follow CAUSES edge backwards |
| "What did budget cut affect?" | Re-read everything | Follow CAUSES edges forward |
| "What is John involved in?" | Search his name everywhere | Follow INVOLVED_IN edges |
| "How are events connected?" | Hard to see | Visual path through graph |
**Key Benefit:** The graph shows **HOW** everything connects, not just WHAT exists.
---
## **STAGE 6: GRAPH TO VECTOR DATABASE** 🔄
### **Theory: Why This Stage Exists**
**Problem:**
- Neo4j is great for finding relationships ("What caused X?")
- But it's NOT good for semantic search ("Find docs about machine learning")
**Solution:** We need BOTH:
- **Neo4j** = Find causal chains and connections
- **Qdrant** = Find relevant content by meaning
### **Why We Need Both:**
**Neo4j (Graph Database):**
Good for: "Show me the chain of events that led to timeline delay" Answer: Budget Cut → ML Postponed → Timeline Delayed
**Qdrant (Vector Database):**
Good for: "Find all content related to machine learning" Answer: [50 relevant chunks from across all documents]
### **What Happens:**
INPUT: Complete Knowledge Graph in Neo4j
Step 6.1: EXTRACT CAUSAL CHAINS ├─> Query Neo4j: "Find all causal paths" │ Example: MATCH (a)-[:CAUSES*1..3]->(b) ├─> Get: Sequences of connected events └─> Result: List of causal chains
Example chains:
- Market demand ↓ → Budget cut → ML postponed
- John left → Sarah promoted → Team restructured
- User feedback → Design change → Timeline adjusted
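As a sketch, the Step 6.1 query could be issued through the same Neo4j driver; the node labels and path length here follow the stages above:

```python
# Sketch of Step 6.1: pull causal paths of length 1-3 from Neo4j and return
# them as ordered lists of event texts.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))


def fetch_causal_chains(limit: int = 100) -> list[list[str]]:
    query = """
    MATCH path = (a:Event)-[:CAUSES*1..3]->(b:Event)
    RETURN [node IN nodes(path) | node.text] AS chain
    LIMIT $limit
    """
    with driver.session() as session:
        return [record["chain"] for record in session.run(query, limit=limit)]
```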
Step 6.2: CONVERT TO NARRATIVE TEXT Take each chain and write it as a story:
Before: [Node1] → [Node2] → [Node3]
After: "Due to decreased market demand, the CFO reduced the budget by 30%. This led to the postponement of machine learning features, which ultimately delayed the December launch to March."
WHY? Because we need text to create embeddings!
Step 6.3: ENRICH WITH CONTEXT Add information from the graph: ├─> Who was involved? ├─> When did it happen? ├─> Which documents mention this? ├─> What projects were affected? └─> How confident are we?
Enriched text:

"[CAUSAL CHAIN] Due to decreased market demand, the CFO reduced the budget by 30%. This led to ML postponement.

[METADATA]
Date: Q3 2024
Involved: CFO, Engineering Team, Project Alpha
Sources: Q3_meeting.pdf, budget_report.xlsx
Confidence: 0.92"
Step 6.4: CREATE EMBEDDINGS ├─> Use: OpenAI Embedding API ├─> Input: Enriched text ├─> Output: Vector (1536 numbers) │ Example: [0.123, -0.456, 0.789, ...] └─> This vector represents the "meaning" of the text
Step 6.5: STORE IN QDRANT For each enriched chunk: ├─> Vector: The embedding ├─> Payload: The original text + all metadata │ { │ "text": "enriched narrative", │ "type": "causal_chain", │ "entities": ["CFO", "Project Alpha"], │ "sources": ["Q3_meeting.pdf"], │ "confidence": 0.92, │ "graph_path": "Node1->Node2->Node3" │ } └─> Store: In Qdrant collection
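Steps 6.4-6.5 might be wired up with the OpenAI and Qdrant clients roughly as follows; the embedding model, collection name, and host are assumptions:

```python
# Sketch of Steps 6.4-6.5: embed each enriched chunk and upsert it into Qdrant.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()                         # OPENAI_API_KEY from env
qdrant = QdrantClient(host="localhost", port=6333)

COLLECTION = "causal_chains"                     # assumed collection name
qdrant.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)


def store_chunk(chunk_id: int, enriched_text: str, payload: dict) -> None:
    # Step 6.4: embed the enriched narrative (1536 dimensions for this model).
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=enriched_text,
    ).data[0].embedding

    # Step 6.5: store vector + payload (text, entities, sources, confidence).
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[PointStruct(id=chunk_id, vector=embedding,
                            payload={"text": enriched_text, **payload})],
    )
```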
### **What Are Embeddings?**
Think of embeddings as **coordinates in meaning-space**:
Text: "machine learning features" Embedding: [0.2, 0.8, 0.1, -0.3, ...] ← 1536 numbers
Text: "AI capabilities" Embedding: [0.19, 0.82, 0.09, -0.29, ...] ← Similar numbers!
Text: "budget reporting" Embedding: [-0.6, 0.1, 0.9, 0.4, ...] ← Very different numbers
Similar meanings → Similar vectors → Qdrant finds them together!
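A toy illustration of that idea using cosine similarity; the 4-dimensional vectors here are made up, while real embeddings have 1536 dimensions:

```python
# Similar meanings → similar vectors → high cosine similarity.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


ml_features = np.array([0.2, 0.8, 0.1, -0.3])
ai_caps = np.array([0.19, 0.82, 0.09, -0.29])
budget = np.array([-0.6, 0.1, 0.9, 0.4])

print(cosine_similarity(ml_features, ai_caps))   # close to 1.0 (similar)
print(cosine_similarity(ml_features, budget))    # much lower (different)
```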
### **Example Flow:**
**From Neo4j:**
Chain: (Budget Cut) → (ML Postponed) → (Timeline Delayed)
**Convert to Text:**
"Budget reduced by 30% → ML features postponed → December launch delayed to March"
**Enrich:**
"[Causal Chain] Budget reduced by 30% led to ML features being postponed, which delayed the December launch to March 2025.
Involved: CFO, Engineering Team, Project Alpha
Sources: Q3_meeting.pdf, roadmap.pptx
Confidence: 0.91
Date: August-September 2024"
**Create Embedding:**
[0.234, -0.567, 0.891, 0.123, ...] ← 1536 numbers
**Store in Qdrant:**

{
  "id": "chain_001",
  "vector": [0.234, -0.567, ...],
  "payload": {
    "text": "enriched narrative...",
    "type": "causal_chain",
    "entities": ["CFO", "Engineering Team"],
    "sources": ["Q3_meeting.pdf"],
    "confidence": 0.91
  }
}
### **Why This Stage?**
Now we have the **best of both worlds**:
| Need | Use |
|------|-----|
| "Find content about machine learning" | Qdrant semantic search |
| "Show me the causal chain" | Neo4j graph traversal |
| "Why did timeline delay?" | Start with Qdrant, then Neo4j for details |
| "Generate comprehensive report" | Pull from BOTH |
---
## **STAGE 7: REPORT GENERATION** 📝 (FINAL STAGE)
### **Theory: Why This Stage Exists**
**Goal:** Take everything we've learned from 100+ documents and create ONE comprehensive, readable report.
### **What Happens:**
USER ACTION: └─> User clicks "Generate Onboarding Report"
Step 7.1: DEFINE REPORT REQUIREMENTS What should the report include? ├─> Project overview ├─> Key decisions and WHY they were made ├─> Important people and their roles ├─> Timeline of events ├─> Current status └─> Next steps
Step 7.2: SEMANTIC SEARCH (Qdrant) Query 1: "project overview goals objectives" ├─> Qdrant returns: Top 20 relevant chunks └─> Covers: High-level project information
Query 2: "timeline milestones dates schedule" ├─> Qdrant returns: Top 15 relevant chunks └─> Covers: Timeline information
Query 3: "decisions architecture technical" ├─> Qdrant returns: Top 15 relevant chunks └─> Covers: Technical decisions
Total: ~50 most relevant chunks from Qdrant
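A sketch of Step 7.2, reusing the clients and collection name assumed in the Stage 6 sketch:

```python
# Sketch of Step 7.2: embed each report query and pull the top chunks from Qdrant.
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(host="localhost", port=6333)

REPORT_QUERIES = {
    "project overview goals objectives": 20,
    "timeline milestones dates schedule": 15,
    "decisions architecture technical": 15,
}


def retrieve_report_context() -> list[dict]:
    chunks = []
    for query, top_k in REPORT_QUERIES.items():
        vector = openai_client.embeddings.create(
            model="text-embedding-3-small", input=query,
        ).data[0].embedding
        hits = qdrant.search(collection_name="causal_chains",
                             query_vector=vector, limit=top_k)
        chunks.extend(hit.payload for hit in hits)
    return chunks          # ~50 chunks feeding the report prompt
```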
Step 7.3: GRAPH TRAVERSAL (Neo4j) Query 1: Get critical causal chains ├─> MATCH (a)-[:CAUSES*2..4]->(b) ├─> WHERE confidence > 0.8 └─> Returns: Top 20 important decision chains
Query 2: Get key entities ├─> MATCH (e:Entity)-[:INVOLVED_IN]->(events) ├─> Count events per entity └─> Returns: Most involved people/teams/projects
Query 3: Get recent timeline ├─> MATCH (e:Event) WHERE e.date > '2024-01-01' ├─> Order by date └─> Returns: Chronological event list
Step 7.4: AGGREGATE CONTEXT Combine everything: ├─> 50 semantic chunks from Qdrant ├─> 20 causal chains from Neo4j ├─> Key entities and their profiles ├─> Timeline of events └─> Metadata about sources
Total Context Size: ~30,000-50,000 tokens
Step 7.5: PREPARE PROMPT FOR CLAUDE
Structure the prompt:

  SYSTEM: You are an expert technical writer creating an onboarding report.

  USER: Based on these 100+ documents, create a comprehensive report.

  # SEMANTIC CONTEXT:
  [50 chunks from Qdrant]

  # CAUSAL CHAINS:
  [20 decision chains from Neo4j]

  # KEY ENTITIES:
  [People, teams, projects]

  # TIMELINE:
  [Chronological events]

  Generate report with sections:
  1. Executive Summary
  2. Project Overview
  3. Key Decisions (with WHY)
  4. Timeline
  5. Current Status
  6. Next Steps
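One way to fold the aggregated context into that prompt (Step 7.5); the section headings mirror the structure above, and the field names are assumptions:

```python
# Sketch of Step 7.5: assemble the report prompt from the aggregated context.
def build_report_prompt(chunks: list[dict], chains: list[list[str]],
                        entities: list[dict], timeline: list[str]) -> str:
    sections = [
        "Based on these 100+ documents, create a comprehensive onboarding report.",
        "# SEMANTIC CONTEXT:",
        "\n".join(f"- {c['text']}" for c in chunks),
        "# CAUSAL CHAINS:",
        "\n".join(" → ".join(chain) for chain in chains),
        "# KEY ENTITIES:",
        "\n".join(f"- {e['canonical_name']} ({e.get('role', 'unknown role')})"
                  for e in entities),
        "# TIMELINE:",
        "\n".join(f"- {event}" for event in timeline),
        ("Generate the report with sections: Executive Summary, Project Overview, "
         "Key Decisions (with WHY), Timeline, Current Status, Next Steps."),
    ]
    return "\n\n".join(sections)
```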
Step 7.6: CALL CLAUDE API ⭐ ├─> Send: Complete prompt to Claude ├─> Claude processes: │ • Reads all context │ • Identifies key themes │ • Synthesizes information │ • Creates narrative structure │ • Explains causal relationships │ • Writes clear, coherent report └─> Returns: Markdown-formatted report
Step 7.7: POST-PROCESS REPORT ├─> Add: Table of contents ├─> Add: Citations to source documents ├─> Add: Confidence indicators ├─> Format: Headings, bullet points, emphasis └─> Result: Final Markdown report
Step 7.8: CONVERT TO PDF ├─> Use: Markdown-to-PDF library ├─> Add: Styling and formatting ├─> Add: Page numbers, headers └─> Result: Professional PDF report
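Step 7.8 could use the `markdown` and `weasyprint` packages, though any Markdown-to-PDF toolchain would do; this is just one possible path:

```python
# Sketch of Step 7.8: render the Markdown report as a PDF file.
import markdown
from weasyprint import HTML


def markdown_to_pdf(report_md: str,
                    output_path: str = "onboarding_report.pdf") -> str:
    # Convert Markdown (with tables and TOC extensions) to HTML, then to PDF.
    html_body = markdown.markdown(report_md, extensions=["tables", "toc"])
    HTML(string=f"<html><body>{html_body}</body></html>").write_pdf(output_path)
    return output_path
```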
Step 7.9: DELIVER TO USER ├─> Save: PDF to storage ├─> Generate: Download link └─> Show: Success message with download button
## 🔄 COMPLETE DATA FLOW SUMMARY
Documents (100+)
↓
[Extract Text] → Plain Text
↓
[Claude: Causal Extraction] → Relationships List
↓
[Claude: Entity Resolution] → Resolved Entities
↓
[Build Graph] → Neo4j Knowledge Graph
↓
[Convert + Enrich] → Narrative Chunks
↓
[Create Embeddings] → Vectors
↓
[Store] → Qdrant Vector DB
↓
[User Request] → "Generate Report"
↓
[Query Qdrant] → Relevant Chunks
+
[Query Neo4j] → Causal Chains
↓
[Claude: Synthesis] → Final Report
↓
[Convert] → PDF
↓
[Deliver] → User Downloads Report