
# COMPLETE END-TO-END FLOW: Multi-Document Analysis to Report Generation

Let me give you the most detailed explanation possible, with theory, diagrams, and a step-by-step breakdown.

## 🎯 SYSTEM OVERVIEW

### **What We're Building:**

A system that takes 100+ documents (PDFs, DOCX, PPT, images, etc.) and generates a comprehensive onboarding report by understanding causal relationships and connections across all documents.

### **Key Components:**

1. **Document Storage** - Store uploaded files
2. **Content Extraction** - Get text from different formats
3. **Causal Analysis** - Understand cause-effect relationships (with Claude)
4. **Knowledge Graph** - Store relationships in Neo4j
5. **Vector Database** - Enable semantic search in Qdrant
6. **Report Generation** - Create final report (with Claude)

## 📊 COMPLETE ARCHITECTURE DIAGRAM

┌──────────────────────────────────────────────────────────────┐
│                        USER INTERFACE                        │
│   Upload Documents (100+ files)      "Generate Report"      │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                      APPLICATION LAYER                       │
│                                                              │
│  DOCUMENT UPLOAD SERVICE                                     │
│    • Validate file types                                     │
│    • Calculate file hash (deduplication)                     │
│    • Store metadata in PostgreSQL                            │
│    • Save files to storage (local)                           │
│                              │                               │
│                              ▼                               │
│  EXTRACTION ORCHESTRATOR                                     │
│    • Routes files to the right extractor                     │
│      (PDF / DOCX / PPTX / Image)                             │
│    • Manages extraction queue                                │
│    • Handles failures and retries                            │
│                              │                               │
│                              ▼                               │
│  [Extracted text for each document]                          │
│                              │                               │
│                              ▼                               │
│  🤖 CLAUDE AI - CAUSAL EXTRACTION                            │
│    For each document:                                        │
│      Input:  extracted text + metadata                       │
│      Output: list of causal relationships, e.g.              │
│        {                                                     │
│          "cause": "Budget cut by 30%",                       │
│          "effect": "ML features postponed",                  │
│          "confidence": 0.92,                                 │
│          "entities": ["Finance Team", "ML Team"]             │
│        }                                                     │
│                              │                               │
│                              ▼                               │
│  [Causal relationships database]                             │
│  (temporary PostgreSQL table)                                │
│                              │                               │
│                              ▼                               │
│  🤖 CLAUDE AI - ENTITY RESOLUTION                            │
│    Resolve entity mentions across all documents              │
│      Input:  ["John", "J. Smith", "John Smith"]              │
│      Output: {"John Smith": ["John", "J. Smith"]}            │
│                              │                               │
│                              ▼                               │
│  KNOWLEDGE GRAPH BUILDER                                     │
│    Build Neo4j graph from causal relationships               │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                        STORAGE LAYER                         │
│                                                              │
│  PostgreSQL           Neo4j                 Qdrant           │
│  • Metadata           • Nodes:              • Vectors        │
│  • File paths           - Events            • Enriched       │
│  • Status               - Entities            chunks         │
│                         - Documents         • Metadata       │
│                       • Edges:                               │
│                         - CAUSES                             │
│                         - INVOLVES                           │
│                         - MENTIONS                           │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│               KG TO QDRANT ENRICHMENT PIPELINE               │
│                                                              │
│  1. Query Neo4j for causal chains                            │
│       MATCH (a)-[:CAUSES*1..3]->(b)                          │
│  2. Convert to enriched text chunks                          │
│       "Budget cut → ML postponed → Timeline shifted"         │
│  3. Generate embeddings (OpenAI)                             │
│  4. Store in Qdrant with metadata from the KG                │
│       - Original causal chain                                │
│       - Entities involved                                    │
│       - Confidence scores                                    │
│       - Source documents                                     │
└──────────────────────────────┬───────────────────────────────┘
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                   REPORT GENERATION PHASE                    │
│                                                              │
│  User clicks "Generate Report"                               │
│                              │                               │
│                              ▼                               │
│  RETRIEVAL ORCHESTRATOR                                      │
│    Step 1: Semantic search (Qdrant)                          │
│      Query:   "project overview timeline decisions"          │
│      Returns: top 50 most relevant chunks                    │
│    Step 2: Graph traversal (Neo4j)                           │
│      Query:   causal chains with confidence > 0.8            │
│      Returns: important decision paths                       │
│    Step 3: Entity analysis (Neo4j)                           │
│      Query:   key people, teams, projects                    │
│      Returns: entity profiles                                │
│                              │                               │
│                              ▼                               │
│  [Aggregated context package]                                │
│                              │                               │
│                              ▼                               │
│  🤖 CLAUDE AI - FINAL REPORT GENERATION                      │
│    Input:                                                    │
│      • 50 semantic chunks from Qdrant                        │
│      • 20 causal chains from Neo4j                           │
│      • Entity profiles                                       │
│      • Report template                                       │
│    Prompt:                                                   │
│      "You are creating an onboarding report.                 │
│       Based on 100+ documents, synthesize:                   │
│       - Project overview                                     │
│       - Key decisions and WHY they were made                 │
│       - Critical causal chains                               │
│       - Timeline and milestones                              │
│       - Current status and next steps"                       │
│    Output: comprehensive Markdown report                     │
│                              │                               │
│                              ▼                               │
│  PDF GENERATION                                              │
│    • Convert Markdown to PDF                                 │
│    • Add formatting, table of contents                       │
│    • Include citations to source documents                   │
│                              │                               │
│                              ▼                               │
│  [Final PDF report] → download to user                       │
└──────────────────────────────────────────────────────────────┘

## 📚 COMPLETE THEORY-WISE STEP-BY-STEP FLOW

Let me explain the entire system in pure theory - how it works, why each step exists, and what problem it solves.

## 🎯 THE BIG PICTURE (Theory)

### **The Problem:**

A new person joins a project that has 100+ documents (meeting notes, technical specs, design docs, emails, presentations). Reading all of them would take weeks. They need to understand:

- **WHAT** happened in the project
- **WHY** decisions were made (causal relationships)
- **WHO** is involved
- **WHEN** things happened
- **HOW** everything connects

### **The Solution:**

Build an intelligent system that:

1. Reads all documents automatically
2. Understands cause-and-effect relationships
3. Connects related information across documents
4. Generates a comprehensive summary report

## 🔄 COMPLETE FLOW (Theory Explanation)

## **STAGE 1: DOCUMENT INGESTION**

### **Theory: Why This Stage Exists**

**Problem:** We have 100+ documents in different formats (PDF, Word, PowerPoint, Excel, images). We need to get them into the system.

**Goal:**

- Accept all document types
- Organize them
- Prevent duplicates
- Track processing status

### **What Happens:**

USER ACTION:
└─> User uploads 100 files through web interface

SYSTEM ACTIONS:

Step 1.1: FILE VALIDATION
├─> Check: Is this a supported file type?
├─> Check: Is file size acceptable?
└─> Decision: Accept or Reject

Step 1.2: DEDUPLICATION
├─> Calculate unique hash (fingerprint) of file content
├─> Check: Have we seen this exact file before?
└─> Decision: Store as new OR link to existing
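A minimal sketch of the dedup check, assuming a PostgreSQL `documents` table with a `file_hash` column (the table and column names here are illustrative):

```python
import hashlib

def compute_file_hash(file_bytes: bytes) -> str:
    """Fingerprint the file content with SHA-256."""
    return hashlib.sha256(file_bytes).hexdigest()

def is_duplicate(cursor, file_hash: str) -> bool:
    """Check whether this exact content was uploaded before.
    Assumes a `documents` table with a `file_hash` column (illustrative schema)."""
    cursor.execute("SELECT id FROM documents WHERE file_hash = %s", (file_hash,))
    return cursor.fetchone() is not None
```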

Step 1.3: METADATA STORAGE
├─> Store: filename, type, upload date, size
├─> Store: who uploaded it, when
└─> Assign: unique document ID

Step 1.4: PHYSICAL STORAGE
├─> Save file to disk/cloud storage
└─> Record: where file is stored

Step 1.5: QUEUE FOR PROCESSING
├─> Add document to processing queue
└─> Status: "waiting for extraction"

## **STAGE 2: CONTENT EXTRACTION**

### **Theory: Why This Stage Exists**

**Problem:** Documents are in binary formats (PDF, DOCX, PPTX). We can't read them directly - we need to extract the text content.

**Goal:** Convert all documents into plain text that can be analyzed.

### **What Happens:**

PROCESSING QUEUE:
└─> System picks next document from queue

Step 2.1: IDENTIFY FILE TYPE
├─> Read: document.type
└─> Route to appropriate extractor

Step 2.2a: IF PDF
├─> Use: PyMuPDF library
├─> Process: Read each page
├─> Extract: Text content
└─> Output: Plain text string

Step 2.2b: IF DOCX (Word)
├─> Use: python-docx library
├─> Process: Read paragraphs, tables
├─> Extract: Text content
└─> Output: Plain text string

Step 2.2c: IF PPTX (PowerPoint)
├─> Use: python-pptx library
├─> Process: Read each slide
├─> Extract: Title, content, notes
└─> Output: Plain text string

Step 2.2d: IF CSV/XLSX (Spreadsheet)
├─> Use: pandas library
├─> Process: Read rows and columns
├─> Convert: To text representation
└─> Output: Structured text

Step 2.2e: IF IMAGE (PNG, JPG)
├─> Use: Claude Vision API (AI model)
├─> Process: Analyze image content
├─> Extract: Description of diagram/chart
└─> Output: Text description

Step 2.3: TEXT CLEANING
├─> Remove: Extra whitespace
├─> Fix: Encoding issues
├─> Preserve: Important structure
└─> Output: Clean text

Step 2.4: STORE EXTRACTED TEXT
├─> Save: To database
├─> Link: To original document
└─> Update status: "text_extracted"
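One way the routing in Steps 2.1-2.2 could look in Python, using the libraries named above (PyMuPDF imports as `fitz`); error handling and the Claude Vision path for images are omitted, and the function shape is a sketch rather than the service's actual code:

```python
import fitz                    # PyMuPDF, for PDFs
import pandas as pd            # for CSV/XLSX
from docx import Document      # python-docx, for Word files
from pptx import Presentation  # python-pptx, for PowerPoint

def extract_text(path: str, file_type: str) -> str:
    """Route a file to the right extractor and return plain text."""
    if file_type == "pdf":
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    if file_type == "docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if file_type == "pptx":
        return "\n".join(
            shape.text
            for slide in Presentation(path).slides
            for shape in slide.shapes
            if shape.has_text_frame
        )
    if file_type in ("csv", "xlsx"):
        df = pd.read_csv(path) if file_type == "csv" else pd.read_excel(path)
        return df.to_string()
    raise ValueError(f"Unsupported type: {file_type}")
```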

### **Example:**

**Input (PDF file):** [Binary PDF data - cannot be read directly]

**Output (Extracted Text):**

"Project Alpha - Q3 Meeting Minutes Date: August 15, 2024

Discussion: Due to budget constraints, we decided to postpone the machine learning features. This will impact our December launch timeline.

Action Items:

  • Revise project roadmap
  • Notify stakeholders
  • Adjust resource allocation"

### **Why This Stage?**

1. **Different formats need different tools** - One size doesn't fit all
2. **Extract only text** - Remove formatting, images (except for image docs)
3. **Standardize** - All docs become plain text for the next stage
4. **Images are special** - They need AI (Claude Vision) to understand

## **STAGE 3: CAUSAL RELATIONSHIP EXTRACTION (CRITICAL!)**

### **Theory: Why This Stage Exists**

**Problem:** Having the text is not enough. We need to understand WHY things happened.

**Example:**

- Just knowing "ML features postponed" is not useful
- Knowing "Budget cut → ML features postponed → Timeline delayed" is MUCH more useful

**Goal:** Extract cause-and-effect relationships from text.

### **What Is A Causal Relationship?**

A causal relationship links two parts: CAUSE → EFFECT

**Example 1:**
Cause: "Budget reduced by 30%"
Effect: "ML features postponed"

**Example 2:**
Cause: "John Smith left the company"
Effect: "Sarah Chen became lead developer"

**Example 3:**
Cause: "User feedback showed confusion"
Effect: "We redesigned the onboarding flow"
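To make the target record concrete before walking through extraction, here is a minimal sketch of the structure each relationship could be stored as (the class is illustrative; its fields mirror Step 3.3 below):

```python
from dataclasses import dataclass, field

@dataclass
class CausalRelationship:
    cause: str                 # what triggered this
    effect: str                # what was the result
    context: str               # additional background
    entities: list[str] = field(default_factory=list)  # people, teams, projects
    confidence: float = 0.0    # 0.0 to 1.0
    source_doc: str = ""       # which document
    source_sentence: str = ""  # which sentence
    date: str = ""             # when it happened
```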

### **How We Extract Them:**

INPUT: Extracted text from document

Step 3.1: BASIC NLP DETECTION (spaCy)
├─> Look for: Causal keywords
│     Examples: "because", "due to", "as a result",
│     "led to", "caused", "therefore"
├─> Find: Sentences containing these patterns
└─> Output: Potential causal relationships (low confidence)
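A minimal sketch of this first-pass keyword scan with spaCy (the keyword list comes from the step above; the flat 0.5 confidence value is an illustrative placeholder):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
CAUSAL_MARKERS = ("because", "due to", "as a result",
                  "led to", "caused", "therefore")

def find_candidate_sentences(text: str) -> list[dict]:
    """First pass: flag sentences containing causal keywords (low confidence)."""
    doc = nlp(text)
    return [
        {"sentence": sent.text, "confidence": 0.5}
        for sent in doc.sents
        if any(marker in sent.text.lower() for marker in CAUSAL_MARKERS)
    ]
```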

Step 3.2: AI-POWERED EXTRACTION (Claude API)
├─> Send: Full document text to Claude
├─> Ask Claude: "Find ALL causal relationships in this text"
├─> Claude analyzes:
│     • Explicit relationships ("because X, therefore Y")
│     • Implicit relationships (strongly implied)
│     • Context and background
│     • Who/what is involved
├─> Claude returns: Structured list of relationships
└─> Output: High-quality causal relationships (high confidence)

Step 3.3: STRUCTURE THE OUTPUT
For each relationship, extract:
├─> Cause: What triggered this?
├─> Effect: What was the result?
├─> Context: Additional background
├─> Entities: Who/what is involved? (people, teams, projects)
├─> Confidence: How certain are we? (0.0 to 1.0)
├─> Source: Which document and sentence?
└─> Date: When did this happen?

Step 3.4: STORE RELATIONSHIPS
├─> Save: To temporary database table
└─> Link: To source document
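A sketch of the Claude call for Step 3.2, using the `anthropic` Python SDK; the model id, prompt wording, and token budget are assumptions, and real code would validate the JSON before storing it:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_causal_relationships(document_text: str) -> list[dict]:
    """Ask Claude for every causal relationship in the document, as JSON."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # example model id
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "Find ALL causal relationships in this text. Return a JSON "
                "array of objects with keys: cause, effect, context, entities, "
                "confidence, source_sentence, date.\n\n" + document_text
            ),
        }],
    )
    return json.loads(response.content[0].text)
```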

### **Example: Claude's Analysis**

**Input Text:**

"In the Q3 review meeting, the CFO announced a 30% budget reduction due to decreased market demand As a result, the engineering team decided to postpone machine learning features for Project Alpha. This means our December launch will be delayed until March 2025."

**Claude's Output:**

[ { "cause": "Market demand decreased", "effect": "CFO reduced budget by 30%", "context": "Q3 financial review", "entities": ["CFO", "Finance Team"], "confidence": 0.95, "source_sentence": "30% budget reduction due to decreased market demand", "date": "Q3 2024" }, { "cause": "Budget reduced by 30%", "effect": "Machine learning features postponed", "context": "Project Alpha roadmap adjustment", "entities": ["Engineering Team", "Project Alpha", "ML Team"], "confidence": 0.92, "source_sentence": "decided to postpone machine learning features", "date": "Q3 2024" }, { "cause": "ML features postponed", "effect": "Launch delayed from December to March", "context": "Timeline impact", "entities": ["Project Alpha"], "confidence": 0.90, "source_sentence": "December launch will be delayed until March 2025", "date": "2024-2025" } ]


### **Why Use Both NLP AND Claude?**

| Method | Pros | Cons | Use Case |
|--------|------|------|----------|
| **NLP (SpaCy)** | Fast, cheap, runs locally | Misses implicit relationships, lower accuracy | Quick first pass, simple docs |
| **Claude AI** | Understands context, finds implicit relationships, high accuracy | Costs money, requires API | Complex docs, deep analysis |

**Strategy:** Use NLP first for quick scan, then Claude for deep analysis.

### **Why This Stage Is Critical:**

Without causal extraction, you just have a pile of facts:
- ❌ "Budget was cut"
- ❌ "ML features postponed"  
- ❌ "Timeline changed"

With causal extraction, you understand the story:
- ✅ Market demand dropped → Budget cut → ML postponed → Timeline delayed

This is **the heart of your system** - it's what makes it intelligent.

---

## **STAGE 4: ENTITY RESOLUTION** 🤖

### **Theory: Why This Stage Exists**

**Problem:** Same people/things are mentioned differently across documents.

**Examples:**
- "John Smith", "John", "J. Smith", "Smith" → Same person
- "Project Alpha", "Alpha", "The Alpha Project" → Same project
- "ML Team", "Machine Learning Team", "AI Team" → Same team (maybe)

**Goal:** Identify that these different mentions refer to the same entity.

### **What Happens:**

INPUT: All causal relationships from all documents

Step 4.1: COLLECT ALL ENTITIES
├─> Scan: All causal relationships
├─> Extract: Every entity mentioned
└─> Result: List of entity mentions
      ["John", "John Smith", "J. Smith", "Sarah", "S. Chen",
       "Project Alpha", "Alpha", "ML Team", ...]

Step 4.2: GROUP BY ENTITY TYPE
├─> People: ["John", "John Smith", "Sarah", ...]
├─> Projects: ["Project Alpha", "Alpha", ...]
├─> Teams: ["ML Team", "AI Team", ...]
└─> Organizations: ["Finance Dept", "Engineering", ...]

Step 4.3: AI-POWERED RESOLUTION (Claude API) - sketch below
├─> Send: All entity mentions to Claude
├─> Ask Claude: "Which mentions refer to the same real-world entity?"
├─> Claude analyzes:
│     • Name similarities
│     • Context clues
│     • Role descriptions
│     • Co-occurrence patterns
└─> Claude returns: Grouped entities

Step 4.4: CREATE CANONICAL NAMES
├─> Choose: Best name for each entity
├─> Example: "John Smith" becomes canonical for ["John", "J. Smith"]
└─> Store: Mapping table
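Step 4.3 could be wired up the same way as the causal-extraction call; a minimal sketch (model id and prompt wording are again illustrative):

```python
import json
import anthropic

client = anthropic.Anthropic()

def resolve_entities(mentions: list[str]) -> dict:
    """Ask Claude to group mentions that refer to the same real-world entity."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # example model id
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                "Group these entity mentions by the real-world entity they refer to. "
                "Return JSON shaped as {canonical_name: {mentions, type, confidence}}.\n"
                + json.dumps(mentions)
            ),
        }],
    )
    return json.loads(response.content[0].text)
```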


### **Example:**

**Input (mentions across all docs):**

Document 1: "John led the meeting"
Document 2: "J. Smith approved the budget"
Document 3: "John Smith will present next week"
Document 4: "Smith suggested the new approach"

**Claude's Resolution:**

{ "entities": { "John Smith": { "canonical_name": "John Smith", "mentions": ["John", "J. Smith", "John Smith", "Smith"], "type": "Person", "role": "Project Lead", "confidence": 0.95 } } }


### **Why This Matters:**

Without entity resolution:
- ❌ System thinks "John" and "John Smith" are different people
- ❌ Can't track someone's involvement across documents
- ❌ Relationships are fragmented

With entity resolution:
- ✅ System knows they're the same person
- ✅ Can see full picture of someone's involvement
- ✅ Relationships are connected

---

## **STAGE 5: KNOWLEDGE GRAPH CONSTRUCTION** 📊

### **Theory: Why This Stage Exists**

**Problem:** We have hundreds of causal relationships. How do we organize them? How do we find connections?

**Solution:** Build a **graph** - a network of nodes (things) and edges (relationships).

### **What Is A Knowledge Graph?**

Think of it like a map:
- **Nodes** = Places (events, people, projects)
- **Edges** = Roads (relationships between them)

Example Graph:

(Budget Cut)
     │
     │ CAUSES
     ▼
(ML Postponed)
     │
     │ CAUSES
     ▼

(Timeline Delayed)
     │
     │ AFFECTS
     ▼
(Project Alpha)
     │
     │ INVOLVES
     ▼
(Engineering Team)


### **What Happens:**

INPUT: Causal relationships + Resolved entities

Step 5.1: CREATE EVENT NODES
For each causal relationship:
├─> Create Node: Cause event
├─> Create Node: Effect event
└─> Properties: text, date, confidence

Example:
Node1: {type: "Event", text: "Budget reduced by 30%"}
Node2: {type: "Event", text: "ML features postponed"}

Step 5.2: CREATE ENTITY NODES
For each resolved entity:
├─> Create Node: Entity
└─> Properties: name, type, role

Example:
Node3: {type: "Person", name: "John Smith", role: "Lead"}
Node4: {type: "Project", name: "Project Alpha"}

Step 5.3: CREATE DOCUMENT NODES
For each source document:
└─> Create Node: Document
      Properties: filename, date, type

Example:
Node5: {type: "Document", name: "Q3_meeting.pdf"}

Step 5.4: CREATE RELATIONSHIPS (Edges)
├─> CAUSES: Event1 → Event2
├─> INVOLVED_IN: Person → Event
├─> MENTIONS: Document → Entity
├─> AFFECTS: Event → Project
└─> Properties: confidence, source, date

Example Relationships:
(Budget Cut) -[CAUSES]-> (ML Postponed)
(John Smith) -[INVOLVED_IN]-> (Budget Cut)
(Q3_meeting.pdf) -[MENTIONS]-> (John Smith)

Step 5.5: STORE IN NEO4J
├─> Connect: To Neo4j database
├─> Create: All nodes
├─> Create: All relationships
└─> Index: For fast querying
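A minimal sketch of Step 5.5 with the official `neo4j` Python driver; the URI, credentials, and property names are placeholders, and `MERGE` is used so re-runs don't duplicate nodes:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_relationship(rel: dict) -> None:
    """MERGE cause/effect events and the CAUSES edge between them."""
    with driver.session() as session:
        session.run(
            """
            MERGE (c:Event {text: $cause})
            MERGE (e:Event {text: $effect})
            MERGE (c)-[r:CAUSES]->(e)
            SET r.confidence = $confidence, r.source = $source
            """,
            cause=rel["cause"],
            effect=rel["effect"],
            confidence=rel["confidence"],
            source=rel["source_doc"],
        )
```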


### **Visual Example:**

**Before (Just Text):**

"Budget cut → ML postponed" "ML postponed → Timeline delayed" "John Smith involved in budget decision"


**After (Knowledge Graph):**
       (John Smith)
            │
            │ INVOLVED_IN
            ▼
      (Budget Cut) ──MENTIONED_IN──> (Q3_meeting.pdf)
            │
            │ CAUSES
            ▼
     (ML Postponed) ──AFFECTS──> (Project Alpha)
            │
            │ CAUSES
            ▼
  (Timeline Delayed) ──INVOLVES──> (Engineering Team)

### **Why Use A Graph?**

| Question | Without Graph | With Graph |
|----------|---------------|------------|
| "Why was ML postponed?" | Search all docs manually | Follow CAUSES edge backwards |
| "What did budget cut affect?" | Re-read everything | Follow CAUSES edges forward |
| "What is John involved in?" | Search his name everywhere | Follow INVOLVED_IN edges |
| "How are events connected?" | Hard to see | Visual path through graph |

**Key Benefit:** The graph shows **HOW** everything connects, not just WHAT exists.

---

## **STAGE 6: GRAPH TO VECTOR DATABASE** 🔄

### **Theory: Why This Stage Exists**

**Problem:** 
- Neo4j is great for finding relationships ("What caused X?")
- But it's NOT good for semantic search ("Find docs about machine learning")

**Solution:** We need BOTH:
- **Neo4j** = Find causal chains and connections
- **Qdrant** = Find relevant content by meaning

### **Why We Need Both:**

**Neo4j (Graph Database):**

Good for: "Show me the chain of events that led to timeline delay" Answer: Budget Cut → ML Postponed → Timeline Delayed


**Qdrant (Vector Database):**

Good for: "Find all content related to machine learning" Answer: [50 relevant chunks from across all documents]


### **What Happens:**

INPUT: Complete Knowledge Graph in Neo4j

Step 6.1: EXTRACT CAUSAL CHAINS - sketch below
├─> Query Neo4j: "Find all causal paths"
│     Example: MATCH (a)-[:CAUSES*1..3]->(b)
├─> Get: Sequences of connected events
└─> Result: List of causal chains

Example chains:

  1. Market demand ↓ → Budget cut → ML postponed
  2. John left → Sarah promoted → Team restructured
  3. User feedback → Design change → Timeline adjusted
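Step 6.1 as a sketch: run the path query shown above and flatten each path into an ordered list of event texts (labels and the `text` property follow the Stage 5 sketch):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def extract_causal_chains() -> list[list[str]]:
    """Return each causal path as an ordered list of event texts."""
    with driver.session() as session:
        result = session.run(
            "MATCH path = (a:Event)-[:CAUSES*1..3]->(b:Event) "
            "RETURN [n IN nodes(path) | n.text] AS chain"
        )
        return [record["chain"] for record in result]
```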

Step 6.2: CONVERT TO NARRATIVE TEXT
Take each chain and write it as a story:

Before: [Node1] → [Node2] → [Node3]

After: "Due to decreased market demand, the CFO reduced the budget by 30%. This led to the postponement of machine learning features, which ultimately delayed the December launch to March."

WHY? Because we need text to create embeddings!

Step 6.3: ENRICH WITH CONTEXT
Add information from the graph:
├─> Who was involved?
├─> When did it happen?
├─> Which documents mention this?
├─> What projects were affected?
└─> How confident are we?

Enriched text:
"[CAUSAL CHAIN] Due to decreased market demand, the CFO reduced
the budget by 30%. This led to ML postponement.

[METADATA]
Date: Q3 2024
Involved: CFO, Engineering Team, Project Alpha
Sources: Q3_meeting.pdf, budget_report.xlsx
Confidence: 0.92"

Step 6.4: CREATE EMBEDDINGS - combined sketch after Step 6.5
├─> Use: OpenAI Embedding API
├─> Input: Enriched text
├─> Output: Vector (1536 numbers)
│     Example: [0.123, -0.456, 0.789, ...]
└─> This vector represents the "meaning" of the text

Step 6.5: STORE IN QDRANT
For each enriched chunk:
├─> Vector: The embedding
├─> Payload: The original text + all metadata
│     {
│       "text": "enriched narrative",
│       "type": "causal_chain",
│       "entities": ["CFO", "Project Alpha"],
│       "sources": ["Q3_meeting.pdf"],
│       "confidence": 0.92,
│       "graph_path": "Node1->Node2->Node3"
│     }
└─> Store: In Qdrant collection
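Steps 6.4 and 6.5 together, as a sketch using the `openai` and `qdrant-client` SDKs; the collection name and Qdrant URL are assumptions, and `text-embedding-3-small` is one model that returns the 1536-dimensional vectors described above:

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")

# Create the collection once; 1536 dims matches the embedding model.
qdrant.create_collection(
    collection_name="causal_chains",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def store_chunk(chunk_id: int, enriched_text: str, metadata: dict) -> None:
    """Embed the enriched narrative and upsert it with its KG metadata."""
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=enriched_text,
    ).data[0].embedding
    qdrant.upsert(
        collection_name="causal_chains",
        points=[PointStruct(id=chunk_id, vector=embedding,
                            payload={"text": enriched_text, **metadata})],
    )
```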


### **What Are Embeddings?**

Think of embeddings as **coordinates in meaning-space**:

Text: "machine learning features" Embedding: [0.2, 0.8, 0.1, -0.3, ...] ← 1536 numbers

Text: "AI capabilities" Embedding: [0.19, 0.82, 0.09, -0.29, ...] ← Similar numbers!

Text: "budget reporting" Embedding: [-0.6, 0.1, 0.9, 0.4, ...] ← Very different numbers


Similar meanings → Similar vectors → Qdrant finds them together!
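"Similar vectors" here means high cosine similarity, which is what Qdrant computes when the collection uses `Distance.COSINE`. A toy illustration with the 4-number prefixes from the example above:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ml     = np.array([0.2, 0.8, 0.1, -0.3])      # "machine learning features"
ai     = np.array([0.19, 0.82, 0.09, -0.29])  # "AI capabilities"
budget = np.array([-0.6, 0.1, 0.9, 0.4])      # "budget reporting"

print(cosine_similarity(ml, ai))      # ≈ 1.0  → near neighbors
print(cosine_similarity(ml, budget))  # much lower → far apart
```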

### **Example Flow:**

**From Neo4j:**

Chain: (Budget Cut) → (ML Postponed) → (Timeline Delayed)


**Convert to Text:**

"Budget reduced by 30% → ML features postponed → December launch delayed to March"


**Enrich:**

"[Causal Chain] Budget reduced by 30% led to ML features being postponed, which delayed the December launch to March 2025.

Involved: CFO, Engineering Team, Project Alpha Sources: Q3_meeting.pdf, roadmap.pptx Confidence: 0.91 Date: August-September 2024"


**Create Embedding:**

[0.234, -0.567, 0.891, 0.123, ...] ← 1536 numbers

**Store in Qdrant:**

{ "id": "chain_001", "vector": [0.234, -0.567, ...], "payload": { "text": "enriched narrative...", "type": "causal_chain", "entities": ["CFO", "Engineering Team"], "sources": ["Q3_meeting.pdf"], "confidence": 0.91 } }


### **Why This Stage?**

Now we have the **best of both worlds**:

| Need | Use |
|------|-----|
| "Find content about machine learning" | Qdrant semantic search |
| "Show me the causal chain" | Neo4j graph traversal |
| "Why did timeline delay?" | Start with Qdrant, then Neo4j for details |
| "Generate comprehensive report" | Pull from BOTH |

---

## **STAGE 7: REPORT GENERATION** 📝 (FINAL STAGE)

### **Theory: Why This Stage Exists**

**Goal:** Take everything we've learned from 100+ documents and create ONE comprehensive, readable report.

### **What Happens:**

USER ACTION:
└─> User clicks "Generate Onboarding Report"

Step 7.1: DEFINE REPORT REQUIREMENTS
What should the report include?
├─> Project overview
├─> Key decisions and WHY they were made
├─> Important people and their roles
├─> Timeline of events
├─> Current status
└─> Next steps

Step 7.2: SEMANTIC SEARCH (Qdrant) - sketch after the query list

Query 1: "project overview goals objectives"
├─> Qdrant returns: Top 20 relevant chunks
└─> Covers: High-level project information

Query 2: "timeline milestones dates schedule" ├─> Qdrant returns: Top 15 relevant chunks └─> Covers: Timeline information

Query 3: "decisions architecture technical" ├─> Qdrant returns: Top 15 relevant chunks └─> Covers: Technical decisions

Total: ~50 most relevant chunks from Qdrant
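Step 7.2 as a sketch: embed each query with the same model used at index time, then search the collection (collection name and model follow the Stage 6 sketch):

```python
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

def semantic_search(query: str, limit: int = 20) -> list[dict]:
    """Embed the query and return the payloads of the closest chunks."""
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding
    hits = qdrant.search(
        collection_name="causal_chains",
        query_vector=query_vector,
        limit=limit,
    )
    return [hit.payload for hit in hits]

chunks = semantic_search("project overview goals objectives", limit=20)
```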

Step 7.3: GRAPH TRAVERSAL (Neo4j) - sketch after Query 3

Query 1: Get critical causal chains
├─> MATCH (a)-[:CAUSES*2..4]->(b)
├─> WHERE confidence > 0.8
└─> Returns: Top 20 important decision chains

Query 2: Get key entities
├─> MATCH (e:Entity)-[:INVOLVED_IN]->(events)
├─> Count events per entity
└─> Returns: Most involved people/teams/projects

Query 3: Get recent timeline
├─> MATCH (e:Event) WHERE e.date > '2024-01-01'
├─> Order by date
└─> Returns: Chronological event list
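The three Step 7.3 queries in sketch form (labels and properties follow the earlier stages; note that for a variable-length path the confidence filter has to apply to every hop, hence the `ALL(...)` predicate):

```python
def get_report_context(session) -> dict:
    """Pull high-confidence chains, most-involved entities, and a timeline."""
    chains = session.run(
        "MATCH path = (a:Event)-[:CAUSES*2..4]->(b:Event) "
        "WHERE ALL(rel IN relationships(path) WHERE rel.confidence > 0.8) "
        "RETURN [n IN nodes(path) | n.text] AS chain LIMIT 20"
    ).values()
    entities = session.run(
        "MATCH (e:Entity)-[:INVOLVED_IN]->(ev:Event) "
        "RETURN e.name AS name, count(ev) AS events "
        "ORDER BY events DESC LIMIT 10"
    ).values()
    timeline = session.run(
        "MATCH (e:Event) WHERE e.date > '2024-01-01' "
        "RETURN e.text, e.date ORDER BY e.date"
    ).values()
    return {"chains": chains, "entities": entities, "timeline": timeline}
```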

Step 7.4: AGGREGATE CONTEXT
Combine everything:
├─> 50 semantic chunks from Qdrant
├─> 20 causal chains from Neo4j
├─> Key entities and their profiles
├─> Timeline of events
└─> Metadata about sources

Total Context Size: ~30,000-50,000 tokens

Step 7.5: PREPARE PROMPT FOR CLAUDE
Structure the prompt:
┌──────────────────────────────────────┐
│ SYSTEM: You are an expert technical  │
│ writer creating an onboarding report │
│                                      │
│ USER: Based on these 100+ documents, │
│ create a comprehensive report.       │
│                                      │
│ # SEMANTIC CONTEXT:                  │
│ [50 chunks from Qdrant]              │
│                                      │
│ # CAUSAL CHAINS:                     │
│ [20 decision chains from Neo4j]      │
│                                      │
│ # KEY ENTITIES:                      │
│ [People, teams, projects]            │
│                                      │
│ # TIMELINE:                          │
│ [Chronological events]               │
│                                      │
│ Generate report with sections:       │
│ 1. Executive Summary                 │
│ 2. Project Overview                  │
│ 3. Key Decisions (with WHY)          │
│ 4. Timeline                          │
│ 5. Current Status                    │
│ 6. Next Steps                        │
└──────────────────────────────────────┘

Step 7.6: CALL CLAUDE API
├─> Send: Complete prompt to Claude
├─> Claude processes:
│     • Reads all context
│     • Identifies key themes
│     • Synthesizes information
│     • Creates narrative structure
│     • Explains causal relationships
│     • Writes clear, coherent report
└─> Returns: Markdown-formatted report

Step 7.7: POST-PROCESS REPORT
├─> Add: Table of contents
├─> Add: Citations to source documents
├─> Add: Confidence indicators
├─> Format: Headings, bullet points, emphasis
└─> Result: Final Markdown report

Step 7.8: CONVERT TO PDF
├─> Use: Markdown-to-PDF library
├─> Add: Styling and formatting
├─> Add: Page numbers, headers
└─> Result: Professional PDF report
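One way Step 7.8 could work, converting Markdown to HTML with the `markdown` package and rendering the PDF with WeasyPrint; both library choices and the inline CSS are assumptions, and any Markdown-to-PDF toolchain would do:

```python
import markdown
from weasyprint import HTML

def markdown_to_pdf(md_text: str, out_path: str) -> None:
    """Render the Markdown report as a styled PDF."""
    html_body = markdown.markdown(md_text, extensions=["tables", "toc"])
    css = "<style>body { font-family: sans-serif; margin: 2cm; }</style>"
    HTML(string=css + html_body).write_pdf(out_path)

report_markdown = "# Onboarding Report\n..."  # the report from Step 7.6
markdown_to_pdf(report_markdown, "onboarding_report.pdf")
```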

Step 7.9: DELIVER TO USER
├─> Save: PDF to storage
├─> Generate: Download link
└─> Show: Success message with download button

## 🔄 COMPLETE DATA FLOW SUMMARY

Documents (100+)
    ↓
[Extract Text] → Plain Text
    ↓
[Claude: Causal Extraction] → Relationships List
    ↓
[Claude: Entity Resolution] → Resolved Entities
    ↓
[Build Graph] → Neo4j Knowledge Graph
    ↓
[Convert + Enrich] → Narrative Chunks
    ↓
[Create Embeddings] → Vectors
    ↓
[Store] → Qdrant Vector DB
    ↓
[User Request] → "Generate Report"
    ↓
[Query Qdrant] → Relevant Chunks
    +
[Query Neo4j] → Causal Chains
    ↓
[Claude: Synthesis] → Final Report
    ↓
[Convert] → PDF
    ↓
[Deliver] → User Downloads Report