# File Chunking Process Diagram

## Overview: How Files Are Processed in the AI Analysis Service

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              LARGE FILE INPUT                               │
│                        (e.g., 5000-line Python file)                        │
└─────────────────────┬───────────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                             LANGUAGE DETECTION                              │
│  • Detect file extension (.py, .js, .ts, .java)                             │
│  • Load language-specific patterns for intelligent chunking                 │
└─────────────────────┬───────────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                            INTELLIGENT CHUNKING                             │
│                                                                             │
│      ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐      │
│      │ CHUNK 1:        │    │ CHUNK 2:        │    │ CHUNK 3:        │      │
│      │ IMPORTS         │    │ CLASSES         │    │ FUNCTIONS       │      │
│      │ • import os     │    │ • class User    │    │ • def auth()    │      │
│      │ • from db       │    │ • class Admin   │    │ • def save()    │      │
│      │ • typing        │    │ • methods       │    │ • def load()    │      │
│      └─────────────────┘    └─────────────────┘    └─────────────────┘      │
│                                                                             │
│      ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐      │
│      │ CHUNK 4:        │    │ CHUNK 5:        │    │ CHUNK 6:        │      │
│      │ UTILITIES       │    │ MAIN LOGIC      │    │ TESTS           │      │
│      │ • helpers       │    │ • main()        │    │ • test_*        │      │
│      │ • validators    │    │ • run()         │    │ • fixtures      │      │
│      │ • formatters    │    │ • execute()     │    │ • mocks         │      │
│      └─────────────────┘    └─────────────────┘    └─────────────────┘      │
└─────────────────────┬───────────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        CHUNK ANALYSIS WITH CLAUDE AI                        │
│                                                                             │
│  For each chunk:                                                            │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                          CHUNK 1 → CLAUDE AI                          │  │
│  │ Prompt: "Analyze this import section for..."                          │  │
│  │ Response: Issues found, recommendations, quality score                │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                          CHUNK 2 → CLAUDE AI                          │  │
│  │ Prompt: "Analyze this class definition for..."                        │  │
│  │ Response: Issues found, recommendations, quality score                │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                          CHUNK 3 → CLAUDE AI                          │  │
│  │ Prompt: "Analyze these functions for..."                              │  │
│  │ Response: Issues found, recommendations, quality score                │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ... (and so on for each chunk)                                             │
└─────────────────────┬───────────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                             RESULT COMBINATION                              │
│                                                                             │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                       COMBINED ANALYSIS RESULT                        │  │
│  │ • All issues from all chunks                                          │  │
│  │ • Overall quality score (average of chunk scores)                     │  │
│  │ • Comprehensive recommendations                                       │  │
│  │ • Chunking statistics (savings, efficiency)                           │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
└─────────────────────┬───────────────────────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                                FINAL REPORT                                 │
│  • File path and language                                                   │
│  • Total lines of code                                                      │
│  • Quality score (1-10)                                                     │
│  • Issues found (with line numbers)                                         │
│  • Recommendations for improvement                                          │
│  • Chunking efficiency metrics                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
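
The flow in the diagram can be sketched end to end in a few lines. This is a minimal illustration, not the service's actual code: `EXTENSION_TO_LANGUAGE`, `ChunkResult`, `detect_language`, `combine_results`, and `analyze_file` are hypothetical names, and the per-chunk analyzer is passed in as a plain callable so that no particular API client is assumed. The overall score is the average of the chunk scores, as the diagram states.

```python
from dataclasses import dataclass, field
from pathlib import Path
from statistics import mean
from typing import Callable

# Hypothetical extension map covering the extensions named in the diagram.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".java": "java",
}


@dataclass
class ChunkResult:
    """Assumed shape of one chunk's analysis response."""
    chunk_type: str
    issues: list[str] = field(default_factory=list)
    recommendations: list[str] = field(default_factory=list)
    quality_score: float = 0.0


def detect_language(path: str) -> str:
    """Map a file extension to a language name, or 'unknown'."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower(), "unknown")


def combine_results(path: str, results: list[ChunkResult]) -> dict:
    """Merge per-chunk results; the overall score is the mean chunk score."""
    return {
        "file": path,
        "language": detect_language(path),
        "issues": [issue for r in results for issue in r.issues],
        "recommendations": [rec for r in results for rec in r.recommendations],
        "quality_score": round(mean(r.quality_score for r in results), 2) if results else None,
        "chunks_analyzed": len(results),
    }


def analyze_file(
    path: str,
    chunks: dict[str, str],
    analyze_chunk: Callable[[str, str], ChunkResult],
) -> dict:
    """Analyze each chunk with the supplied callable, then combine the results."""
    results = [analyze_chunk(chunk_type, code) for chunk_type, code in chunks.items()]
    return combine_results(path, results)
```

Passing the analyzer in as an argument keeps the combination step testable without any network calls.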

## Key Benefits of This Approach

### 1. **Token Efficiency**
```
Original File: 50,000 tokens
Chunked Files: 15,000 tokens (70% savings)
```
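
The savings figure above is a plain percentage reduction. A small sketch of the arithmetic, using the example numbers from this section (`token_savings` is a hypothetical helper name, and the token counts are illustrative rather than measured):

```python
def token_savings(original_tokens: int, chunked_tokens: int) -> float:
    """Return the percentage of tokens saved relative to the original count."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 100 * (original_tokens - chunked_tokens) / original_tokens


savings = token_savings(50_000, 15_000)
print(f"{savings:.0f}% savings")
```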

### 2. **Focused Analysis**
- Each chunk gets specialized attention
- Prompts are tailored to each code type (imports, classes, functions, and so on)
- Higher-quality analysis for each section

### 3. **Cost Optimization**
- Smaller API calls cost less
- Chunks can be processed in parallel
- Results for individual chunks can be cached
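
One way to combine the caching and parallelism points is to key each chunk by a hash of its content and submit uncached chunks to a thread pool. The sketch below assumes an in-memory dictionary as the cache and a caller-supplied `analyze` callable; `chunk_key`, `analyze_cached`, and `analyze_all` are hypothetical names, not part of the service.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Assumed in-memory cache; a real service might persist this elsewhere.
_cache: dict[str, dict] = {}


def chunk_key(chunk_type: str, code: str) -> str:
    """Derive a stable cache key from the chunk type and its exact content."""
    return hashlib.sha256(f"{chunk_type}\n{code}".encode("utf-8")).hexdigest()


def analyze_cached(chunk_type: str, code: str, analyze: Callable[[str, str], dict]) -> dict:
    """Reuse a stored result when the same chunk content was analyzed before."""
    key = chunk_key(chunk_type, code)
    if key not in _cache:
        _cache[key] = analyze(chunk_type, code)
    return _cache[key]


def analyze_all(
    chunks: dict[str, str],
    analyze: Callable[[str, str], dict],
    max_workers: int = 4,
) -> dict[str, dict]:
    """Analyze chunks concurrently; each chunk is independent of the others."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            chunk_type: pool.submit(analyze_cached, chunk_type, code, analyze)
            for chunk_type, code in chunks.items()
        }
        return {chunk_type: future.result() for chunk_type, future in futures.items()}
```

Because the key depends only on content, editing one function invalidates only that chunk's cached result.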

### 4. **Scalability**
- Handles files too large to analyze in a single request
- Memory efficient
- Rate limit friendly

## Chunking Strategy by File Type

### Python Files
```
┌─────────────┬────────────────────┬─────────────────────────────────────────┐
│ Chunk Type  │ Pattern            │ Example Content                         │
├─────────────┼────────────────────┼─────────────────────────────────────────┤
│ Imports     │ ^import|^from      │ import os, json, requests               │
│ Classes     │ ^class             │ class User: def __init__(self):         │
│ Functions   │ ^def               │ def authenticate_user():                │
│ Main Logic  │ Other              │ if __name__ == "__main__":              │
└─────────────┴────────────────────┴─────────────────────────────────────────┘
```
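
The patterns in the table can be applied line by line. The sketch below is one possible reading, not the service's implementation: it classifies each unindented line with the table's patterns and attaches indented or blank lines to whichever chunk is currently open. `PYTHON_PATTERNS`, `chunk_python`, and `SAMPLE` are hypothetical names.

```python
import re

# Patterns taken from the table; anything else at top level is "main_logic".
PYTHON_PATTERNS = [
    ("imports", re.compile(r"^(import|from)\s")),
    ("classes", re.compile(r"^class\s")),
    ("functions", re.compile(r"^def\s")),
]


def chunk_python(source: str) -> dict[str, str]:
    """Group top-level blocks of a Python file by chunk type."""
    chunks: dict[str, list[str]] = {name: [] for name, _ in PYTHON_PATTERNS}
    chunks["main_logic"] = []
    current = "main_logic"
    for line in source.splitlines():
        # Only unindented, non-empty lines can start a new block.
        if line and not line[0].isspace():
            current = next(
                (name for name, pattern in PYTHON_PATTERNS if pattern.match(line)),
                "main_logic",
            )
        chunks[current].append(line)
    # Drop chunk types that ended up with no real content.
    return {
        name: "\n".join(lines).strip("\n")
        for name, lines in chunks.items()
        if any(line.strip() for line in lines)
    }


SAMPLE = '''import os
from typing import List

class User:
    def __init__(self):
        self.name = "x"

def authenticate_user():
    return True

if __name__ == "__main__":
    authenticate_user()
'''

for chunk_type, code in chunk_python(SAMPLE).items():
    print(f"--- {chunk_type} ---\n{code}")
```

A line-based approach like this is simple but has limits: decorators and top-level comments fall into the main-logic chunk, for instance. A parser-based splitter would handle those cases more precisely.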

### JavaScript/TypeScript Files
```
┌─────────────┬────────────────────┬─────────────────────────────────────────┐
│ Chunk Type  │ Pattern            │ Example Content                         │
├─────────────┼────────────────────┼─────────────────────────────────────────┤
│ Imports     │ ^import|^const     │ import React from 'react'               │
│ Classes     │ ^class             │ class Component extends React.Component │
│ Functions   │ ^function|^const   │ function myFunction() {                 │
│ Exports     │ ^export            │ export default MyComponent              │
└─────────────┴────────────────────┴─────────────────────────────────────────┘
```
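
In the table, `^const` appears under both Imports and Functions, so a classifier needs a rule for telling the two apart. The sketch below makes one assumption that the table does not state: a `const` line counts as an import only when it assigns from `require(...)`, and every other `const` line is treated as a function-style declaration. `JS_RULES` and `classify_js_line` are hypothetical names, and rules are checked in order.

```python
import re

# Ordered rules: the first matching pattern wins.
JS_RULES = [
    # Assumption: "const x = require(...)" is the import-style use of const.
    ("imports", re.compile(r"^import\b|^const\s+.*=\s*require\(")),
    ("classes", re.compile(r"^class\b")),
    ("functions", re.compile(r"^function\b|^const\b")),
    ("exports", re.compile(r"^export\b")),
]


def classify_js_line(line: str) -> str:
    """Return the chunk type for one top-level JavaScript/TypeScript line."""
    for name, pattern in JS_RULES:
        if pattern.match(line):
            return name
    return "other"


for example in [
    "import React from 'react'",
    "const fs = require('fs');",
    "const add = (a, b) => a + b;",
    "export default MyComponent",
]:
    print(f"{classify_js_line(example):<10} {example}")
```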

## Memory and Context Integration

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              CONTEXT AWARENESS                              │
│                                                                             │
│  Each chunk analysis includes:                                              │
│  • Similar code patterns from repository                                    │
│  • Best practices for that code type                                        │
│  • Previous analysis results                                                │
│  • Repository-specific patterns                                             │
│                                                                             │
│  Example:                                                                   │
│  "This function chunk is similar to 3 other functions in your repo          │
│   that had security issues. Consider implementing the same fix here."       │
└─────────────────────────────────────────────────────────────────────────────┘
```
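
As an illustration of how such context might reach the model, the sketch below assembles a per-chunk prompt from the chunk's code plus optional repository findings and best practices. The function name `build_chunk_prompt`, its parameters, and the prompt wording are all assumptions made for this example; the document does not specify how the service builds its prompts or retrieves similar code.

```python
from typing import Sequence


def build_chunk_prompt(
    chunk_type: str,
    code: str,
    similar_findings: Sequence[str] = (),
    best_practices: Sequence[str] = (),
) -> str:
    """Assemble an analysis prompt for one chunk, adding context when available."""
    sections = [
        f"Analyze this {chunk_type} chunk for issues, "
        "and give recommendations and a quality score (1-10)."
    ]
    if similar_findings:
        findings = "\n".join(f"- {item}" for item in similar_findings)
        sections.append(f"Findings from similar code in this repository:\n{findings}")
    if best_practices:
        practices = "\n".join(f"- {item}" for item in best_practices)
        sections.append(f"Best practices for {chunk_type} code:\n{practices}")
    sections.append(f"Code:\n{code}")
    return "\n\n".join(sections)


print(
    build_chunk_prompt(
        "function",
        "def save(): ...",
        similar_findings=["save_user(): SQL built by string concatenation"],
    )
)
```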

## Error Handling and Fallbacks

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              ROBUST PROCESSING                              │
│                                                                             │
│  If chunking fails:                                                         │
│  • Fall back to original file analysis                                      │
│  • Use content optimization instead                                         │
│  • Continue with other files                                                │
│                                                                             │
│  If Claude API fails:                                                       │
│  • Retry with exponential backoff                                           │
│  • Use cached results if available                                          │
│  • Provide fallback analysis                                                │
└─────────────────────────────────────────────────────────────────────────────┘
```
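
The API-failure path can be sketched as a retry loop whose delay doubles after each failed attempt, followed by the cached result and then a placeholder analysis. This is a minimal sketch under stated assumptions: `analyze_with_retry` is a hypothetical name, the API call is represented by a plain callable, and the attempt count, base delay, and shape of the fallback result are illustrative values.

```python
import time
from typing import Callable, Optional


def analyze_with_retry(
    call: Callable[[], dict],
    cached: Optional[dict] = None,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Try the call, backing off exponentially, then fall back gracefully."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Real code should catch only the client's retryable errors.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    if cached is not None:
        return cached
    # Assumed placeholder shape for the fallback analysis.
    return {"status": "fallback", "issues": [], "quality_score": None}
```

Injecting `sleep` as a parameter lets tests run the retry logic without actually waiting.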

Chunking lets the AI analysis service handle large files and codebases that would otherwise be too big to analyze in a single request, while keeping each request small, focused, and cheaper to run.