# COMPLETE END-TO-END FLOW: Multi-Document Analysis to Report Generation

Let me give you the most detailed explanation possible, with theory, diagrams, and a step-by-step breakdown.

---

## **🎯 SYSTEM OVERVIEW**

### **What We're Building:**

A system that takes 100+ documents (PDFs, DOCX, PPT, images, etc.) and generates a comprehensive onboarding report by understanding causal relationships and connections across all documents.

### **Key Components:**

1. **Document Storage** - Store uploaded files
2. **Content Extraction** - Get text from different formats
3. **Causal Analysis** - Understand cause-effect relationships (with Claude)
4. **Knowledge Graph** - Store relationships in Neo4j
5. **Vector Database** - Enable semantic search in Qdrant
6. **Report Generation** - Create the final report (with Claude)

---
## **📊 COMPLETE ARCHITECTURE DIAGRAM**

```
USER INTERFACE
    [Upload Documents (100+ files)]        [Generate Report button]
              │                                      │
              ▼                                      ▼
APPLICATION LAYER

    DOCUMENT UPLOAD SERVICE
      • Validate file types
      • Calculate file hash (deduplication)
      • Store metadata in PostgreSQL
      • Save files to storage (local)
              │
              ▼
    EXTRACTION ORCHESTRATOR
      • Routes files to appropriate extractors
      • Manages extraction queue
      • Handles failures and retries
              │
              ├──> PDF Extractor
              ├──> DOCX Extractor
              ├──> PPTX Extractor
              └──> Image Extractor
                       │
                       ▼
    [Extracted Text for each document]
              │
              ▼
    🤖 CLAUDE AI - CAUSAL EXTRACTION
      For each document:
        Input:  Extracted text + metadata
        Output: List of causal relationships
      Example output:
        {
          "cause": "Budget cut by 30%",
          "effect": "ML features postponed",
          "confidence": 0.92,
          "entities": ["Finance Team", "ML Team"]
        }
              │
              ▼
    [Causal Relationships Database]
    (temporary PostgreSQL table)
              │
              ▼
    🤖 CLAUDE AI - ENTITY RESOLUTION
      Resolve entity mentions across all documents
        Input:  All entity mentions   ["John", "J. Smith", "John Smith"]
        Output: Resolved entities     {"John Smith": ["John", "J. Smith"]}
              │
              ▼
    KNOWLEDGE GRAPH BUILDER
      Build the Neo4j graph from causal relationships
              │
              ▼
STORAGE LAYER
    • PostgreSQL: metadata, file paths, processing status
    • Neo4j:      nodes (Events, Entities, Documents),
                  edges (CAUSES, INVOLVES, MENTIONS)
    • Qdrant:     vectors, enriched chunks, metadata
              │
              ▼
KG TO QDRANT ENRICHMENT PIPELINE
    1. Query Neo4j for causal chains
         MATCH (a)-[:CAUSES*1..3]->(b)
    2. Convert to enriched text chunks
         "Budget cut → ML postponed → Timeline shifted"
    3. Generate embeddings (OpenAI)
    4. Store in Qdrant with metadata from the KG
         - Original causal chain
         - Entities involved
         - Confidence scores
         - Source documents
              │
              ▼
REPORT GENERATION PHASE
    User clicks "Generate Report"
              │
              ▼
    RETRIEVAL ORCHESTRATOR
      Step 1: Semantic Search (Qdrant)
        Query:   "project overview timeline decisions"
        Returns: Top 50 most relevant chunks
      Step 2: Graph Traversal (Neo4j)
        Query:   Critical causal chains with confidence > 0.8
        Returns: Important decision paths
      Step 3: Entity Analysis (Neo4j)
        Query:   Key people, teams, projects
        Returns: Entity profiles
              │
              ▼
    [Aggregated Context Package]
              │
              ▼
    🤖 CLAUDE AI - FINAL REPORT GENERATION
      Input:
        • 50 semantic chunks from Qdrant
        • 20 causal chains from Neo4j
        • Entity profiles
        • Report template
      Prompt:
        "You are creating an onboarding report.
         Based on 100+ documents, synthesize:
         - Project overview
         - Key decisions and WHY they were made
         - Critical causal chains
         - Timeline and milestones
         - Current status and next steps"
      Output: Comprehensive Markdown report
              │
              ▼
    PDF GENERATION
      • Convert Markdown to PDF
      • Add formatting, table of contents
      • Include citations to source documents
              │
              ▼
    [Final PDF Report]
              │
              ▼
    Download to user
```
---

## **📚 COMPLETE THEORY-WISE STEP-BY-STEP FLOW**

Let me explain the entire system in pure theory - how it works, why each step exists, and what problem it solves.

### **🎯 THE BIG PICTURE (Theory)**

**The Problem:**

A new person joins a project that has 100+ documents (meeting notes, technical specs, design docs, emails, presentations). Reading all of them would take weeks. They need to understand:

- **WHAT** happened in the project
- **WHY** decisions were made (causal relationships)
- **WHO** is involved
- **WHEN** things happened
- **HOW** everything connects

**The Solution:**

Build an intelligent system that:

1. Reads all documents automatically
2. Understands cause-and-effect relationships
3. Connects related information across documents
4. Generates a comprehensive summary report

---

## **🔄 COMPLETE FLOW (Theory Explanation)**
---

## **STAGE 1: DOCUMENT INGESTION**

### **Theory: Why This Stage Exists**

**Problem:** We have 100+ documents in different formats (PDF, Word, PowerPoint, Excel, images). We need to get them into the system.

**Goal:**

- Accept all document types
- Organize them
- Prevent duplicates
- Track processing status

### **What Happens:**

```
USER ACTION:
  └─> User uploads 100 files through web interface

SYSTEM ACTIONS:

Step 1.1: FILE VALIDATION
  ├─> Check: Is this a supported file type?
  ├─> Check: Is file size acceptable?
  └─> Decision: Accept or Reject

Step 1.2: DEDUPLICATION
  ├─> Calculate unique hash (fingerprint) of file content
  ├─> Check: Have we seen this exact file before?
  └─> Decision: Store as new OR link to existing

Step 1.3: METADATA STORAGE
  ├─> Store: filename, type, upload date, size
  ├─> Store: who uploaded it, when
  └─> Assign: unique document ID

Step 1.4: PHYSICAL STORAGE
  ├─> Save file to disk/cloud storage
  └─> Record: where file is stored

Step 1.5: QUEUE FOR PROCESSING
  ├─> Add document to processing queue
  └─> Status: "waiting for extraction"
```
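To make Step 1.2 concrete, here is a minimal sketch of content-hash deduplication. It uses SQLite as a stand-in for the PostgreSQL metadata store, and the table name, column names, and `uploads` directory are illustrative assumptions, not details of the actual system.

```python
import hashlib
import sqlite3
from pathlib import Path

# Illustrative metadata store; the real system would use PostgreSQL.
conn = sqlite3.connect("metadata.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS documents ("
    "id INTEGER PRIMARY KEY, filename TEXT, sha256 TEXT UNIQUE, "
    "size_bytes INTEGER, status TEXT)"
)

def file_sha256(path: Path) -> str:
    """Hash the file content in chunks so large files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def register_upload(path: Path) -> str:
    """Store metadata for a new file, or skip it if the hash already exists."""
    sha = file_sha256(path)
    existing = conn.execute(
        "SELECT id FROM documents WHERE sha256 = ?", (sha,)
    ).fetchone()
    if existing:
        return f"duplicate of document {existing[0]}"
    conn.execute(
        "INSERT INTO documents (filename, sha256, size_bytes, status) "
        "VALUES (?, ?, ?, ?)",
        (path.name, sha, path.stat().st_size, "waiting for extraction"),
    )
    conn.commit()
    return "queued for extraction"

if __name__ == "__main__":
    for upload in Path("uploads").glob("*"):
        print(upload.name, "->", register_upload(upload))
```

Hashing the raw bytes means two uploads of the same file map to the same fingerprint regardless of filename, which is exactly the "store as new OR link to existing" decision above.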
---

## **STAGE 2: CONTENT EXTRACTION**

### **Theory: Why This Stage Exists**

**Problem:** Documents are in binary formats (PDF, DOCX, PPTX). We can't directly read them - we need to extract the text content.

**Goal:** Convert all documents into plain text that can be analyzed.

### **What Happens:**

```
PROCESSING QUEUE:
  └─> System picks next document from queue

Step 2.1: IDENTIFY FILE TYPE
  ├─> Read: document.type
  └─> Route to appropriate extractor

Step 2.2a: IF PDF
  ├─> Use: PyMuPDF library
  ├─> Process: Read each page
  ├─> Extract: Text content
  └─> Output: Plain text string

Step 2.2b: IF DOCX (Word)
  ├─> Use: python-docx library
  ├─> Process: Read paragraphs, tables
  ├─> Extract: Text content
  └─> Output: Plain text string

Step 2.2c: IF PPTX (PowerPoint)
  ├─> Use: python-pptx library
  ├─> Process: Read each slide
  ├─> Extract: Title, content, notes
  └─> Output: Plain text string

Step 2.2d: IF CSV/XLSX (Spreadsheet)
  ├─> Use: pandas library
  ├─> Process: Read rows and columns
  ├─> Convert: To text representation
  └─> Output: Structured text

Step 2.2e: IF IMAGE (PNG, JPG)
  ├─> Use: Claude Vision API (AI model)
  ├─> Process: Analyze image content
  ├─> Extract: Description of diagram/chart
  └─> Output: Text description

Step 2.3: TEXT CLEANING
  ├─> Remove: Extra whitespace
  ├─> Fix: Encoding issues
  ├─> Preserve: Important structure
  └─> Output: Clean text

Step 2.4: STORE EXTRACTED TEXT
  ├─> Save: To database
  ├─> Link: To original document
  └─> Update status: "text_extracted"
```

### **Example:**

**Input (PDF file):**

```
[Binary PDF data - cannot be read directly]
```

**Output (Extracted Text):**

```
"Project Alpha - Q3 Meeting Minutes
Date: August 15, 2024

Discussion:
Due to budget constraints, we decided to postpone
the machine learning features. This will impact
our December launch timeline.

Action Items:
- Revise project roadmap
- Notify stakeholders
- Adjust resource allocation"
```

### **Why This Stage?**

1. **Different formats need different tools** - one size doesn't fit all
2. **Extract only text** - remove formatting, images (except for image docs)
3. **Standardize** - all docs become plain text for the next stage
4. **Images are special** - they need AI (Claude Vision) to understand
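Here is a minimal sketch of the extractor routing in Steps 2.1-2.2, assuming the PyMuPDF (`fitz`), python-docx, python-pptx, and pandas libraries named above; the dispatch table is illustrative, and the image branch is a stub where the real pipeline would call Claude Vision.

```python
from pathlib import Path

import fitz                      # PyMuPDF
import pandas as pd
from docx import Document        # python-docx
from pptx import Presentation    # python-pptx

def extract_pdf(path: Path) -> str:
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

def extract_docx(path: Path) -> str:
    return "\n".join(p.text for p in Document(path).paragraphs)

def extract_pptx(path: Path) -> str:
    lines = []
    for slide in Presentation(path).slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                lines.append(shape.text_frame.text)
    return "\n".join(lines)

def extract_table(path: Path) -> str:
    df = pd.read_csv(path) if path.suffix == ".csv" else pd.read_excel(path)
    return df.to_string(index=False)

def extract_image(path: Path) -> str:
    # Stub: the real pipeline would send the image to Claude Vision here.
    return f"[image {path.name}: description produced by Claude Vision]"

EXTRACTORS = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".pptx": extract_pptx,
    ".csv": extract_table,
    ".xlsx": extract_table,
    ".png": extract_image,
    ".jpg": extract_image,
}

def extract_text(path: Path) -> str:
    """Route a file to the extractor for its type and return cleaned plain text."""
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"unsupported file type: {path.suffix}")
    raw = extractor(path)
    # Light cleanup: trim each line and drop blank lines (Step 2.3).
    return "\n".join(line.strip() for line in raw.splitlines() if line.strip())
```

The routing table is the whole point: one entry per format, each delegating to the library that understands that binary layout.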
---

## **STAGE 3: CAUSAL RELATIONSHIP EXTRACTION** ⭐ (CRITICAL!)

### **Theory: Why This Stage Exists**

**Problem:** Having text is not enough. We need to understand WHY things happened.

**Example:**

- Just knowing "ML features postponed" is not useful
- Knowing "Budget cut → ML features postponed → Timeline delayed" is MUCH more useful

**Goal:** Extract cause-and-effect relationships from text.

### **What Is A Causal Relationship?**

A causal relationship links a cause to an effect:

```
CAUSE → EFFECT

Example 1:
  Cause:  "Budget reduced by 30%"
  Effect: "ML features postponed"

Example 2:
  Cause:  "John Smith left the company"
  Effect: "Sarah Chen became lead developer"

Example 3:
  Cause:  "User feedback showed confusion"
  Effect: "We redesigned the onboarding flow"
```

### **How We Extract Them:**

```
INPUT: Extracted text from document

Step 3.1: BASIC NLP DETECTION (SpaCy)
  ├─> Look for: Causal keywords
  │     Examples: "because", "due to", "as a result",
  │               "led to", "caused", "therefore"
  ├─> Find: Sentences containing these patterns
  └─> Output: Potential causal relationships (low confidence)

Step 3.2: AI-POWERED EXTRACTION (Claude API) ⭐
  ├─> Send: Full document text to Claude
  ├─> Ask Claude: "Find ALL causal relationships in this text"
  ├─> Claude analyzes:
  │     • Explicit relationships ("because X, therefore Y")
  │     • Implicit relationships (strongly implied)
  │     • Context and background
  │     • Who/what is involved
  ├─> Claude returns: Structured list of relationships
  └─> Output: High-quality causal relationships (high confidence)

Step 3.3: STRUCTURE THE OUTPUT
  For each relationship, extract:
  ├─> Cause: What triggered this?
  ├─> Effect: What was the result?
  ├─> Context: Additional background
  ├─> Entities: Who/what is involved? (people, teams, projects)
  ├─> Confidence: How certain are we? (0.0 to 1.0)
  ├─> Source: Which document and sentence?
  └─> Date: When did this happen?

Step 3.4: STORE RELATIONSHIPS
  ├─> Save: To temporary database table
  └─> Link: To source document
```
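To make Step 3.1 concrete, here is a minimal sketch of the keyword-based first pass, assuming spaCy with the `en_core_web_sm` model installed; the marker list and the fixed 0.5 confidence are illustrative placeholders rather than values from the actual system.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

CAUSAL_MARKERS = [
    "because", "due to", "as a result", "led to", "caused", "therefore",
]

def candidate_causal_sentences(text: str) -> list[dict]:
    """Flag sentences containing causal keywords (low-confidence first pass)."""
    doc = nlp(text)
    candidates = []
    for sent in doc.sents:
        lowered = sent.text.lower()
        hits = [m for m in CAUSAL_MARKERS if m in lowered]
        if hits:
            candidates.append({
                "sentence": sent.text.strip(),
                "markers": hits,
                "entities": [ent.text for ent in sent.ents],
                "confidence": 0.5,   # heuristic only; the Claude pass re-scores later
            })
    return candidates

sample = ("Due to budget constraints, we decided to postpone the machine "
          "learning features. This will impact our December launch timeline.")
for candidate in candidate_causal_sentences(sample):
    print(candidate)
```

This pass is cheap and local, which is why it runs before the more expensive Claude analysis shown next.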
### **Example: Claude's Analysis**

**Input Text:**

```
"In the Q3 review meeting, the CFO announced a 30%
budget reduction due to decreased market demand.
As a result, the engineering team decided to
postpone machine learning features for Project Alpha.
This means our December launch will be delayed
until March 2025."
```

**Claude's Output:**

```json
[
  {
    "cause": "Market demand decreased",
    "effect": "CFO reduced budget by 30%",
    "context": "Q3 financial review",
    "entities": ["CFO", "Finance Team"],
    "confidence": 0.95,
    "source_sentence": "30% budget reduction due to decreased market demand",
    "date": "Q3 2024"
  },
  {
    "cause": "Budget reduced by 30%",
    "effect": "Machine learning features postponed",
    "context": "Project Alpha roadmap adjustment",
    "entities": ["Engineering Team", "Project Alpha", "ML Team"],
    "confidence": 0.92,
    "source_sentence": "decided to postpone machine learning features",
    "date": "Q3 2024"
  },
  {
    "cause": "ML features postponed",
    "effect": "Launch delayed from December to March",
    "context": "Timeline impact",
    "entities": ["Project Alpha"],
    "confidence": 0.90,
    "source_sentence": "December launch will be delayed until March 2025",
    "date": "2024-2025"
  }
]
```

### **Why Use Both NLP AND Claude?**

| Method | Pros | Cons | Use Case |
|--------|------|------|----------|
| **NLP (SpaCy)** | Fast, cheap, runs locally | Misses implicit relationships, lower accuracy | Quick first pass, simple docs |
| **Claude AI** | Understands context, finds implicit relationships, high accuracy | Costs money, requires API | Complex docs, deep analysis |

**Strategy:** Use NLP first for a quick scan, then Claude for deep analysis.
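Here is a minimal sketch of what the Claude-based deep pass (Step 3.2) could look like, assuming the `anthropic` Python SDK with an `ANTHROPIC_API_KEY` in the environment; the model name, prompt wording, and sample text are illustrative, and a production version would parse and validate Claude's JSON more defensively.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = """Find ALL causal relationships in the document below.
Return ONLY a JSON array where each item has the keys:
cause, effect, context, entities, confidence, source_sentence, date.

Document:
{document}"""

def extract_causal_relationships(document_text: str) -> list[dict]:
    """Ask Claude for structured cause-effect pairs found in one document."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # illustrative model choice
        max_tokens=4000,
        messages=[{"role": "user",
                   "content": PROMPT.format(document=document_text)}],
    )
    # Assumes Claude answers with bare JSON because the prompt demands it.
    return json.loads(response.content[0].text)

relationships = extract_causal_relationships(
    "The CFO announced a 30% budget reduction due to decreased market demand. "
    "As a result, the engineering team decided to postpone ML features."
)
print(json.dumps(relationships, indent=2))
```

The key design choice is forcing a fixed JSON schema in the prompt so the output can flow straight into the temporary relationships table in Step 3.4.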
### **Why This Stage Is Critical:**

Without causal extraction, you just have a pile of facts:

- ❌ "Budget was cut"
- ❌ "ML features postponed"
- ❌ "Timeline changed"

With causal extraction, you understand the story:

- ✅ Market demand dropped → Budget cut → ML postponed → Timeline delayed

This is **the heart of your system** - it's what makes it intelligent.

---
## **STAGE 4: ENTITY RESOLUTION** 🤖

### **Theory: Why This Stage Exists**

**Problem:** Same people/things are mentioned differently across documents.

**Examples:**

- "John Smith", "John", "J. Smith", "Smith" → Same person
- "Project Alpha", "Alpha", "The Alpha Project" → Same project
- "ML Team", "Machine Learning Team", "AI Team" → Same team (maybe)

**Goal:** Identify that these different mentions refer to the same entity.

### **What Happens:**

```
INPUT: All causal relationships from all documents

Step 4.1: COLLECT ALL ENTITIES
  ├─> Scan: All causal relationships
  ├─> Extract: Every entity mentioned
  └─> Result: List of entity mentions
        ["John", "John Smith", "J. Smith", "Sarah", "S. Chen",
         "Project Alpha", "Alpha", "ML Team", ...]

Step 4.2: GROUP BY ENTITY TYPE
  ├─> People: ["John", "John Smith", "Sarah", ...]
  ├─> Projects: ["Project Alpha", "Alpha", ...]
  ├─> Teams: ["ML Team", "AI Team", ...]
  └─> Organizations: ["Finance Dept", "Engineering", ...]

Step 4.3: AI-POWERED RESOLUTION (Claude API) ⭐
  ├─> Send: All entity mentions to Claude
  ├─> Ask Claude: "Which mentions refer to the same real-world entity?"
  ├─> Claude analyzes:
  │     • Name similarities
  │     • Context clues
  │     • Role descriptions
  │     • Co-occurrence patterns
  └─> Claude returns: Grouped entities

Step 4.4: CREATE CANONICAL NAMES
  ├─> Choose: Best name for each entity
  ├─> Example: "John Smith" becomes canonical for ["John", "J. Smith"]
  └─> Store: Mapping table
```

### **Example:**

**Input (mentions across all docs):**

```
Document 1: "John led the meeting"
Document 2: "J. Smith approved the budget"
Document 3: "John Smith will present next week"
Document 4: "Smith suggested the new approach"
```

**Claude's Resolution:**

```json
{
  "entities": {
    "John Smith": {
      "canonical_name": "John Smith",
      "mentions": ["John", "J. Smith", "John Smith", "Smith"],
      "type": "Person",
      "role": "Project Lead",
      "confidence": 0.95
    }
  }
}
```
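A minimal sketch of Step 4.3, under the same `anthropic` SDK assumption as before; the prompt shape and sample mentions are illustrative, and the returned mapping would be stored in the canonical-name table from Step 4.4.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def resolve_entities(mentions: list[str]) -> dict:
    """Ask Claude to group mentions that refer to the same real-world entity."""
    prompt = (
        "Group the following entity mentions by the real-world entity they "
        "refer to. Return ONLY JSON shaped like: "
        '{"entities": {"<canonical_name>": {"canonical_name": "...", '
        '"mentions": ["..."], "type": "...", "confidence": 0.0}}}\n\n'
        f"Mentions: {json.dumps(mentions)}"
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # illustrative model choice
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.content[0].text)

resolved = resolve_entities(
    ["John", "J. Smith", "John Smith", "Smith",
     "Sarah", "S. Chen", "Project Alpha", "Alpha"]
)
print(json.dumps(resolved, indent=2))
```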
### **Why This Matters:**

Without entity resolution:

- ❌ System thinks "John" and "John Smith" are different people
- ❌ Can't track someone's involvement across documents
- ❌ Relationships are fragmented

With entity resolution:

- ✅ System knows they're the same person
- ✅ Can see the full picture of someone's involvement
- ✅ Relationships are connected

---
## **STAGE 5: KNOWLEDGE GRAPH CONSTRUCTION** 📊

### **Theory: Why This Stage Exists**

**Problem:** We have hundreds of causal relationships. How do we organize them? How do we find connections?

**Solution:** Build a **graph** - a network of nodes (things) and edges (relationships).

### **What Is A Knowledge Graph?**

Think of it like a map:

- **Nodes** = Places (events, people, projects)
- **Edges** = Roads (relationships between them)

```
Example Graph:

   (Budget Cut)
        │ CAUSES
        ▼
   (ML Postponed)
        │ CAUSES
        ▼
   (Timeline Delayed)
        │ AFFECTS
        ▼
   (Project Alpha)
        │ INVOLVES
        ▼
   (Engineering Team)
```

### **What Happens:**

```
INPUT: Causal relationships + Resolved entities

Step 5.1: CREATE EVENT NODES
  For each causal relationship:
  ├─> Create Node: Cause event
  ├─> Create Node: Effect event
  └─> Properties: text, date, confidence

  Example:
    Node1: {type: "Event", text: "Budget reduced by 30%"}
    Node2: {type: "Event", text: "ML features postponed"}

Step 5.2: CREATE ENTITY NODES
  For each resolved entity:
  ├─> Create Node: Entity
  └─> Properties: name, type, role

  Example:
    Node3: {type: "Person", name: "John Smith", role: "Lead"}
    Node4: {type: "Project", name: "Project Alpha"}

Step 5.3: CREATE DOCUMENT NODES
  For each source document:
  └─> Create Node: Document
        Properties: filename, date, type

  Example:
    Node5: {type: "Document", name: "Q3_meeting.pdf"}

Step 5.4: CREATE RELATIONSHIPS (Edges)
  ├─> CAUSES: Event1 → Event2
  ├─> INVOLVED_IN: Person → Event
  ├─> MENTIONS: Document → Entity
  ├─> AFFECTS: Event → Project
  └─> Properties: confidence, source, date

  Example relationships:
    (Budget Cut) -[CAUSES]-> (ML Postponed)
    (John Smith) -[INVOLVED_IN]-> (Budget Cut)
    (Q3_meeting.pdf) -[MENTIONS]-> (John Smith)

Step 5.5: STORE IN NEO4J
  ├─> Connect: To Neo4j database
  ├─> Create: All nodes
  ├─> Create: All relationships
  └─> Index: For fast querying
```
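As a concrete illustration of Steps 5.4-5.5, here is a minimal sketch using the official `neo4j` Python driver; the connection URI and credentials are placeholders, and the Cypher uses `MERGE` so that re-running the load does not duplicate nodes or edges.

```python
from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

MERGE_CAUSAL = """
MERGE (cause:Event {text: $cause})
MERGE (effect:Event {text: $effect})
MERGE (doc:Document {name: $source_doc})
MERGE (cause)-[r:CAUSES]->(effect)
  SET r.confidence = $confidence, r.date = $date
MERGE (doc)-[:MENTIONS]->(cause)
MERGE (doc)-[:MENTIONS]->(effect)
WITH cause
UNWIND $entities AS entity_name
  MERGE (e:Entity {name: entity_name})
  MERGE (e)-[:INVOLVED_IN]->(cause)
"""

def add_relationship(tx, rel: dict) -> None:
    """MERGE one cause-effect pair plus its entities and source document."""
    tx.run(
        MERGE_CAUSAL,
        cause=rel["cause"],
        effect=rel["effect"],
        source_doc=rel["source_doc"],
        confidence=rel["confidence"],
        date=rel.get("date", "unknown"),
        entities=rel.get("entities", []),
    )

relationship = {
    "cause": "Budget reduced by 30%",
    "effect": "ML features postponed",
    "entities": ["Engineering Team", "Project Alpha"],
    "confidence": 0.92,
    "date": "Q3 2024",
    "source_doc": "Q3_meeting.pdf",
}

with driver.session() as session:
    session.execute_write(add_relationship, relationship)
driver.close()
```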
### **Visual Example:**

**Before (Just Text):**

```
"Budget cut → ML postponed"
"ML postponed → Timeline delayed"
"John Smith involved in budget decision"
```

**After (Knowledge Graph):**

```
   (John Smith)
        │ INVOLVED_IN
        ▼
   (Budget Cut) ──MENTIONED_IN──> (Q3_meeting.pdf)
        │ CAUSES
        ▼
   (ML Postponed) ──AFFECTS──> (Project Alpha)
        │ CAUSES
        ▼
   (Timeline Delayed) ──INVOLVES──> (Engineering Team)
```

### **Why Use A Graph?**

| Question | Without Graph | With Graph |
|----------|---------------|------------|
| "Why was ML postponed?" | Search all docs manually | Follow CAUSES edge backwards |
| "What did budget cut affect?" | Re-read everything | Follow CAUSES edges forward |
| "What is John involved in?" | Search his name everywhere | Follow INVOLVED_IN edges |
| "How are events connected?" | Hard to see | Visual path through graph |

**Key Benefit:** The graph shows **HOW** everything connects, not just WHAT exists.

---
## **STAGE 6: GRAPH TO VECTOR DATABASE** 🔄

### **Theory: Why This Stage Exists**

**Problem:**

- Neo4j is great for finding relationships ("What caused X?")
- But it's NOT good for semantic search ("Find docs about machine learning")

**Solution:** We need BOTH:

- **Neo4j** = Find causal chains and connections
- **Qdrant** = Find relevant content by meaning

### **Why We Need Both:**

**Neo4j (Graph Database):**

```
Good for: "Show me the chain of events that led to the timeline delay"
Answer:   Budget Cut → ML Postponed → Timeline Delayed
```

**Qdrant (Vector Database):**

```
Good for: "Find all content related to machine learning"
Answer:   [50 relevant chunks from across all documents]
```

### **What Happens:**

```
INPUT: Complete Knowledge Graph in Neo4j

Step 6.1: EXTRACT CAUSAL CHAINS
  ├─> Query Neo4j: "Find all causal paths"
  │     Example: MATCH (a)-[:CAUSES*1..3]->(b)
  ├─> Get: Sequences of connected events
  └─> Result: List of causal chains

  Example chains:
    1. Market demand ↓ → Budget cut → ML postponed
    2. John left → Sarah promoted → Team restructured
    3. User feedback → Design change → Timeline adjusted

Step 6.2: CONVERT TO NARRATIVE TEXT
  Take each chain and write it as a story:

  Before: [Node1] → [Node2] → [Node3]

  After:  "Due to decreased market demand, the CFO
           reduced the budget by 30%. This led to the
           postponement of machine learning features, which
           ultimately delayed the December launch to March."

  WHY? Because we need text to create embeddings!

Step 6.3: ENRICH WITH CONTEXT
  Add information from the graph:
  ├─> Who was involved?
  ├─> When did it happen?
  ├─> Which documents mention this?
  ├─> What projects were affected?
  └─> How confident are we?

  Enriched text:
    "[CAUSAL CHAIN]
     Due to decreased market demand, the CFO reduced
     the budget by 30%. This led to ML postponement.

     [METADATA]
     Date: Q3 2024
     Involved: CFO, Engineering Team, Project Alpha
     Sources: Q3_meeting.pdf, budget_report.xlsx
     Confidence: 0.92"

Step 6.4: CREATE EMBEDDINGS
  ├─> Use: OpenAI Embedding API
  ├─> Input: Enriched text
  ├─> Output: Vector (1536 numbers)
  │     Example: [0.123, -0.456, 0.789, ...]
  └─> This vector represents the "meaning" of the text

Step 6.5: STORE IN QDRANT
  For each enriched chunk:
  ├─> Vector: The embedding
  ├─> Payload: The original text + all metadata
  │     {
  │       "text": "enriched narrative",
  │       "type": "causal_chain",
  │       "entities": ["CFO", "Project Alpha"],
  │       "sources": ["Q3_meeting.pdf"],
  │       "confidence": 0.92,
  │       "graph_path": "Node1->Node2->Node3"
  │     }
  └─> Store: In Qdrant collection
```
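Here is a minimal sketch of Steps 6.2-6.5, assuming the `openai` and `qdrant-client` packages, a running local Qdrant instance, and a 1536-dimension embedding model; the collection name, model choice, and sample chain are illustrative.

```python
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

openai_client = OpenAI()                       # reads OPENAI_API_KEY
qdrant = QdrantClient(url="http://localhost:6333")
COLLECTION = "causal_chains"                   # illustrative collection name

# Create the collection once; 1536 dims matches text-embedding-3-small.
if not qdrant.collection_exists(COLLECTION):
    qdrant.create_collection(
        COLLECTION,
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )

def chain_to_narrative(chain: dict) -> str:
    """Step 6.2-6.3: turn a graph path plus metadata into one enriched chunk."""
    return (
        "[CAUSAL CHAIN] " + " -> ".join(chain["events"]) + "\n"
        f"[METADATA] Involved: {', '.join(chain['entities'])}; "
        f"Sources: {', '.join(chain['sources'])}; "
        f"Confidence: {chain['confidence']}"
    )

def store_chain(point_id: int, chain: dict) -> None:
    """Steps 6.4-6.5: embed the enriched text and upsert it with its payload."""
    text = chain_to_narrative(chain)
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding
    qdrant.upsert(
        collection_name=COLLECTION,
        points=[PointStruct(id=point_id, vector=embedding,
                            payload={"text": text, **chain})],
    )

store_chain(1, {
    "events": ["Budget reduced by 30%", "ML features postponed",
               "Launch delayed to March"],
    "entities": ["CFO", "Engineering Team", "Project Alpha"],
    "sources": ["Q3_meeting.pdf", "roadmap.pptx"],
    "confidence": 0.91,
    "type": "causal_chain",
})
```

Keeping the full chain metadata in the payload is what lets the report stage cite sources and confidence without going back to Neo4j for every hit.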
### **What Are Embeddings?**

Think of embeddings as **coordinates in meaning-space**:

```
Text: "machine learning features"
Embedding: [0.2, 0.8, 0.1, -0.3, ...]        ← 1536 numbers

Text: "AI capabilities"
Embedding: [0.19, 0.82, 0.09, -0.29, ...]    ← Similar numbers!

Text: "budget reporting"
Embedding: [-0.6, 0.1, 0.9, 0.4, ...]        ← Very different numbers
```

Similar meanings → Similar vectors → Qdrant finds them together!
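A tiny sketch of that idea in practice: cosine similarity between embedding vectors is what the vector search ranks by. This assumes the same `openai` package and the same illustrative embedding model as above.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction in meaning-space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ml = embed("machine learning features")
ai = embed("AI capabilities")
budget = embed("budget reporting")

print("ML vs AI:     ", round(cosine(ml, ai), 3))      # expected: relatively high
print("ML vs budget: ", round(cosine(ml, budget), 3))  # expected: noticeably lower
```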
### **Example Flow:**

**From Neo4j:**

```
Chain: (Budget Cut) → (ML Postponed) → (Timeline Delayed)
```

**Convert to Text:**

```
"Budget reduced by 30% → ML features postponed →
 December launch delayed to March"
```

**Enrich:**

```
"[Causal Chain] Budget reduced by 30% led to ML
features being postponed, which delayed the December
launch to March 2025.

Involved: CFO, Engineering Team, Project Alpha
Sources: Q3_meeting.pdf, roadmap.pptx
Confidence: 0.91
Date: August-September 2024"
```

**Create Embedding:**

```
[0.234, -0.567, 0.891, 0.123, ...]   ← 1536 numbers
```

**Store in Qdrant:**

```json
{
  "id": "chain_001",
  "vector": [0.234, -0.567, ...],
  "payload": {
    "text": "enriched narrative...",
    "type": "causal_chain",
    "entities": ["CFO", "Engineering Team"],
    "sources": ["Q3_meeting.pdf"],
    "confidence": 0.91
  }
}
```

### **Why This Stage?**

Now we have the **best of both worlds**:

| Need | Use |
|------|-----|
| "Find content about machine learning" | Qdrant semantic search |
| "Show me the causal chain" | Neo4j graph traversal |
| "Why did the timeline delay?" | Start with Qdrant, then Neo4j for details |
| "Generate comprehensive report" | Pull from BOTH |

---
## **STAGE 7: REPORT GENERATION** 📝 (FINAL STAGE)

### **Theory: Why This Stage Exists**

**Goal:** Take everything we've learned from 100+ documents and create ONE comprehensive, readable report.

### **What Happens:**

```
USER ACTION:
  └─> User clicks "Generate Onboarding Report"

Step 7.1: DEFINE REPORT REQUIREMENTS
  What should the report include?
  ├─> Project overview
  ├─> Key decisions and WHY they were made
  ├─> Important people and their roles
  ├─> Timeline of events
  ├─> Current status
  └─> Next steps

Step 7.2: SEMANTIC SEARCH (Qdrant)
  Query 1: "project overview goals objectives"
  ├─> Qdrant returns: Top 20 relevant chunks
  └─> Covers: High-level project information

  Query 2: "timeline milestones dates schedule"
  ├─> Qdrant returns: Top 15 relevant chunks
  └─> Covers: Timeline information

  Query 3: "decisions architecture technical"
  ├─> Qdrant returns: Top 15 relevant chunks
  └─> Covers: Technical decisions

  Total: ~50 most relevant chunks from Qdrant

Step 7.3: GRAPH TRAVERSAL (Neo4j)
  Query 1: Get critical causal chains
  ├─> MATCH (a)-[:CAUSES*2..4]->(b)
  ├─> WHERE confidence > 0.8
  └─> Returns: Top 20 important decision chains

  Query 2: Get key entities
  ├─> MATCH (e:Entity)-[:INVOLVED_IN]->(events)
  ├─> Count events per entity
  └─> Returns: Most involved people/teams/projects

  Query 3: Get recent timeline
  ├─> MATCH (e:Event) WHERE e.date > '2024-01-01'
  ├─> Order by date
  └─> Returns: Chronological event list

Step 7.4: AGGREGATE CONTEXT
  Combine everything:
  ├─> 50 semantic chunks from Qdrant
  ├─> 20 causal chains from Neo4j
  ├─> Key entities and their profiles
  ├─> Timeline of events
  └─> Metadata about sources

  Total Context Size: ~30,000-50,000 tokens
```
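Before the generation steps, here is a minimal sketch of the retrieval side (Steps 7.2-7.3). It reuses the kinds of clients shown in the earlier sketches; the connection details, collection name, and exact Cypher are illustrative assumptions rather than the system's actual queries, and it assumes a recent `qdrant-client` with the `query_points` API.

```python
from neo4j import GraphDatabase
from openai import OpenAI
from qdrant_client import QdrantClient

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def semantic_search(query: str, limit: int = 20) -> list[dict]:
    """Step 7.2: embed the query and pull the most relevant enriched chunks."""
    vector = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    hits = qdrant.query_points("causal_chains", query=vector, limit=limit).points
    return [hit.payload for hit in hits]

CHAIN_QUERY = """
MATCH path = (a:Event)-[:CAUSES*2..4]->(b:Event)
WHERE ALL(rel IN relationships(path) WHERE rel.confidence > 0.8)
RETURN [n IN nodes(path) | n.text] AS chain
LIMIT 20
"""

def critical_chains() -> list[list[str]]:
    """Step 7.3: high-confidence causal chains from the knowledge graph."""
    with driver.session() as session:
        return [record["chain"] for record in session.run(CHAIN_QUERY)]

context = {
    "chunks": semantic_search("project overview timeline decisions", limit=50),
    "chains": critical_chains(),
}
```

The flow then continues with the generation steps: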
```
Step 7.5: PREPARE PROMPT FOR CLAUDE
  Structure the prompt:

    SYSTEM: You are an expert technical writer
            creating an onboarding report.

    USER: Based on these 100+ documents,
          create a comprehensive report.

          # SEMANTIC CONTEXT:
          [50 chunks from Qdrant]

          # CAUSAL CHAINS:
          [20 decision chains from Neo4j]

          # KEY ENTITIES:
          [People, teams, projects]

          # TIMELINE:
          [Chronological events]

          Generate report with sections:
          1. Executive Summary
          2. Project Overview
          3. Key Decisions (with WHY)
          4. Timeline
          5. Current Status
          6. Next Steps

Step 7.6: CALL CLAUDE API ⭐
  ├─> Send: Complete prompt to Claude
  ├─> Claude processes:
  │     • Reads all context
  │     • Identifies key themes
  │     • Synthesizes information
  │     • Creates narrative structure
  │     • Explains causal relationships
  │     • Writes clear, coherent report
  └─> Returns: Markdown-formatted report

Step 7.7: POST-PROCESS REPORT
  ├─> Add: Table of contents
  ├─> Add: Citations to source documents
  ├─> Add: Confidence indicators
  ├─> Format: Headings, bullet points, emphasis
  └─> Result: Final Markdown report

Step 7.8: CONVERT TO PDF
  ├─> Use: Markdown-to-PDF library
  ├─> Add: Styling and formatting
  ├─> Add: Page numbers, headers
  └─> Result: Professional PDF report

Step 7.9: DELIVER TO USER
  ├─> Save: PDF to storage
  ├─> Generate: Download link
  └─> Show: Success message with download button
```
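And a minimal sketch of Steps 7.6-7.8, assuming the `anthropic` SDK plus the `markdown` and `weasyprint` packages for the PDF step; the prompt skeleton follows Step 7.5, the model name is illustrative, and `context` is the dictionary built in the retrieval sketch above.

```python
import json
import anthropic
import markdown                     # Markdown -> HTML
from weasyprint import HTML         # HTML -> PDF

client = anthropic.Anthropic()

def generate_report(context: dict) -> str:
    """Step 7.6: ask Claude to synthesize the aggregated context into Markdown."""
    prompt = (
        "You are an expert technical writer creating an onboarding report.\n"
        "Based on these 100+ documents, create a comprehensive report with the "
        "sections: Executive Summary, Project Overview, Key Decisions (with WHY), "
        "Timeline, Current Status, Next Steps.\n\n"
        "# SEMANTIC CONTEXT:\n" + json.dumps(context["chunks"], indent=2) + "\n\n"
        "# CAUSAL CHAINS:\n" + json.dumps(context["chains"], indent=2)
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",   # illustrative model choice
        max_tokens=8000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def markdown_to_pdf(md_text: str, out_path: str = "onboarding_report.pdf") -> str:
    """Step 7.8: render the Markdown report as a styled PDF file."""
    html = markdown.markdown(md_text, extensions=["tables", "toc"])
    HTML(string=html).write_pdf(out_path)
    return out_path

report_md = generate_report(context)   # `context` comes from the retrieval sketch
print("Report saved to", markdown_to_pdf(report_md))
```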
---

## **🔄 COMPLETE DATA FLOW SUMMARY**

```
Documents (100+)
      ↓
[Extract Text] → Plain Text
      ↓
[Claude: Causal Extraction] → Relationships List
      ↓
[Claude: Entity Resolution] → Resolved Entities
      ↓
[Build Graph] → Neo4j Knowledge Graph
      ↓
[Convert + Enrich] → Narrative Chunks
      ↓
[Create Embeddings] → Vectors
      ↓
[Store] → Qdrant Vector DB
      ↓
[User Request] → "Generate Report"
      ↓
[Query Qdrant] → Relevant Chunks
      +
[Query Neo4j] → Causal Chains
      ↓
[Claude: Synthesis] → Final Report
      ↓
[Convert] → PDF
      ↓
[Deliver] → User Downloads Report
```