# Fixes Summary for AI Analysis Service

## ⚠️ CRITICAL: Service Restart Required

**After code changes, you MUST restart the service for fixes to take effect:**

```bash
cd /home/tech4biz/Desktop/prakash/codenuk/backend/codenuk_backend_mine
docker-compose restart ai-analysis-service
```

**Why:** Docker containers cache the Python code. Changes to `.py` files are NOT reflected until the service is restarted. If you're still seeing errors, it means the OLD code is running.

**Verify Restart:** Check logs for the restart timestamp:

```bash
docker-compose logs --tail=20 ai-analysis-service | grep "AI Analysis Service initialized"
```

---
## Issues Fixed

### 1. TypeError: sequence item 21: expected str instance, dict found

**Problem:** In `store_chunk_analysis_in_memory()`, the `ai_response_parts` list contained dicts at certain indices (like `module_architecture` and `module_security_assessment`), which caused `"\n".join()` to fail.

**Root Cause:** Dict values were being added to the list directly, without first being converted to strings.

**Fix Applied:**
- Converted all dict values to JSON strings before adding them to `ai_response_parts`
- Added explicit checks that convert `module_overview`, `module_architecture`, and `module_security_assessment` to strings before they are appended
- Added a final safety check that converts any remaining non-string items (dicts, lists, tuples) to strings before joining

**Location:** `server.py` lines 2419-2538
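
For illustration, the safety check boils down to coercing every item to a string before joining. This is a minimal sketch with hypothetical names (`join_response_parts` is not the actual function in `server.py`):

```python
import json

def join_response_parts(ai_response_parts):
    """Coerce every item to a string so the final join cannot fail on dicts or lists."""
    safe_parts = []
    for part in ai_response_parts:
        if isinstance(part, (dict, list, tuple)):
            # Structured values (e.g. module_architecture) are serialized to JSON text
            safe_parts.append(json.dumps(part, default=str))
        else:
            safe_parts.append(str(part))
    return "\n".join(safe_parts)
```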
---

### 2. File Content Being Stored in Database

**Problem:** File content was being stored in MongoDB/Redis even though it shouldn't be.

**Root Cause:**
- `FileAnalysis` objects have a `content` field used for in-memory analysis
- When storing results in MongoDB/Redis, that content was being included in the stored dicts

**Fixes Applied:**

1. **In `store_chunk_analysis_in_memory()`:**
   - Explicitly excluded `content` when creating `file_analyses_data` (lines 2547-2565)
   - Added a safety check that deletes `content` if it somehow gets included (lines 2563-2564)

2. **In `analyze_single_file_parallel()`:**
   - Removed `content` from cache storage (lines 4354-4371)
   - Set `content=""` when creating a `FileAnalysis` from cache (line 4316)
   - Added an explicit deletion check before caching (lines 4367-4369)

3. **General:**
   - All storage operations now explicitly exclude `content`
   - Added comments explaining that content should never be stored

**Why Content Was Being Stored:**
- `FileAnalysis` objects in memory have `content` for analysis purposes
- When converting them to dicts for storage, the code wasn't explicitly excluding `content`
- The fix ensures `content` is never included in any database storage operation

**Note:** `FileAnalysis` objects in memory may still have `content` for analysis, but it is never stored in any database (MongoDB, Redis, PostgreSQL).
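
The exclusion pattern itself is simple; here is a minimal sketch, assuming illustrative field names rather than the exact `FileAnalysis` schema:

```python
def file_analysis_to_storable_dict(file_analysis) -> dict:
    """Build the dict that goes to MongoDB/Redis; file content is never included."""
    data = {
        "file_path": file_analysis.file_path,
        "language": file_analysis.language,
        "analysis": file_analysis.analysis,
        # NOTE: 'content' is intentionally omitted here
    }
    # Safety net in case a later change reintroduces the field
    data.pop("content", None)
    return data
```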
---

### 3. No Synthesis Analysis Found (0 modules)

**Problem:** When generating reports, the system couldn't find the synthesis analysis or the modules stored in MongoDB.

**Root Cause:**

1. **Synthesis Analysis:** The `run_id` wasn't being stored correctly in the metadata
2. **Module Storage:** Modules were stored with a `run_id`, but retrieval might have been using a different `run_id`

**Fixes Applied:**

1. **In `store_synthesis_analysis_in_memory()`:**
   - Added explicit `run_id` retrieval from the analyzer (lines 3995-3998)
   - Stored `run_id` in the metadata for proper retrieval (line 4002)
   - This ensures the synthesis can be found using a `metadata.run_id` query

2. **Module Storage:**
   - Modules are already stored with `run_id` in their metadata (line 2591)
   - Retrieval uses `run_id` and `repository_id` (lines 3431-3435)
   - The issue was most likely that the `run_id` wasn't consistent between storage and retrieval

**How to Debug:**
- Check the logs for the `run_id` values used during storage and retrieval
- Verify that the same `run_id` is used for both storage and retrieval
- The `run_id` is set at the start of the analysis (lines 4416-4427) and should remain consistent throughout (see the sketch below)
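
For illustration, the storage/retrieval pairing looks roughly like this. It is a sketch assuming a `pymongo`-style collection and illustrative field names, not the exact document schema used by the service:

```python
def store_synthesis(collection, run_id: str, repository_id: str, synthesis: dict) -> None:
    """Store the synthesis with run_id in its metadata so reports can find it later."""
    collection.insert_one({
        "type": "synthesis_analysis",
        "synthesis": synthesis,
        "metadata": {
            "run_id": run_id,                # must match the run_id used at retrieval time
            "repository_id": repository_id,
        },
    })

def load_synthesis(collection, run_id: str, repository_id: str):
    """Retrieve the synthesis using the same metadata.run_id it was stored with."""
    return collection.find_one({
        "metadata.run_id": run_id,
        "metadata.repository_id": repository_id,
    })
```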
---

### 4. "0 patterns found" Message

**Log Message:**

```
📊 State updated: 3 modules analyzed, 0 patterns found
```

**Status:** ✅ **NOT AN ERROR** - this is normal, expected behavior

**What it means:**

The system looks for specific architectural pattern keywords in the AI's module architecture analysis, such as:
- "microservices"
- "layered architecture"
- "event-driven"
- "monolithic"
- "MVC"
- "REST API"
- "serverless"
- "hexagonal"
- etc.

**Why 0 patterns:**

The counter shows 0 when the AI's analysis doesn't mention any of these specific pattern keywords. This can happen when:

1. The AI uses different terminology (e.g., "service-based" instead of "microservices")
2. The code being analyzed doesn't have clear architectural patterns
3. The message appears early in the analysis, before enough context has been gathered
4. The patterns exist but aren't explicitly named in the AI response

**Code Location:** `server.py` lines 1708-1712:

```python
for pattern in pattern_keywords:
    if pattern.lower() in module_architecture.lower():
        if pattern not in analysis_state['architecture_patterns']:
            analysis_state['architecture_patterns'].append(pattern)
```

**What to do:** Nothing! This is informational only. The analysis quality is not affected, and patterns may still be detected as more modules are analyzed.

---
### 5. Performance: 35 Files Taking ~20 Minutes

**Problem:** 35 files taking approximately 20 minutes works out to ~34 seconds per file, which is too slow.

**Root Causes:**

1. **Sequential Processing:** Files are processed in chunks sequentially, not in parallel
2. **Delays:** There is a 0.1 second delay between chunks (line 4589)
3. **Rate Limiting:** API rate limiting might be too conservative
4. **No True Parallelization:** Despite the "parallel" naming, processing is not actually concurrent; chunks are handled one after another

**Current Behavior:**
- Files are grouped into intelligent chunks
- Each chunk is processed sequentially (one after another)
- Within each chunk, files are analyzed in a single API call (batch)
- There is a 0.1 second delay between chunks

**Potential Optimizations:**

1. **Reduce Delays:** The 0.1 second delay could be reduced or removed if rate limiting is handled properly
2. **Parallel Chunk Processing:** Process multiple chunks in parallel (if API rate limits allow)
3. **Increase Batch Size:** Chunks are currently created semantically, but larger batches would reduce the number of API calls
4. **Optimize Rate Limiting:** The current rate limiter might be too conservative

**Recommendations:**
- The current rate limit is set to 1000 requests/minute (line 5080)
- With 35 files in ~5-10 chunks, this should allow much faster processing
- Consider reducing the delay between chunks from 0.1s to 0.05s, or removing it entirely
- Monitor API rate limit errors - if none occur, the delays can be reduced

**Note:** The analysis itself (Claude API calls) takes time, so some delay is expected. However, 34 seconds per file suggests too much sequential processing.

---
## Summary of All Changes

1. ✅ Fixed TypeError in `store_chunk_analysis_in_memory()` by converting dicts to strings
2. ✅ Removed file content from all database storage operations (MongoDB, Redis, PostgreSQL)
3. ✅ Fixed synthesis analysis storage to include `run_id` for proper retrieval
4. ✅ Added explicit content exclusion checks throughout storage code
5. ✅ Improved error handling and logging

---
## Testing Recommendations

1. **Test TypeError Fix:**
   - Run an analysis and verify there are no `TypeError: sequence item X: expected str instance, dict found` errors
   - Check logs for successful chunk storage

2. **Test Content Storage:**
   - Verify MongoDB collections don't contain `content` fields (a query sketch follows this list)
   - Check that the Redis cache doesn't store file content
   - Verify PostgreSQL doesn't store content in any tables

3. **Test Module/Synthesis Retrieval:**
   - Run an analysis
   - Generate a report and verify modules and synthesis are found
   - Check logs for `run_id` consistency

4. **Test Performance:**
   - Monitor analysis time for 35 files
   - Should ideally take < 10 minutes with optimizations
   - Check for API rate limit errors - if none, reduce delays
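
For the MongoDB check, something along these lines can be run against the service's database (the URI, database name, and collection names below are placeholders, not the service's actual names):

```python
from pymongo import MongoClient

def assert_no_content_stored(mongo_uri: str, db_name: str, collection_names: list) -> None:
    """Fail loudly if any stored document still carries a 'content' field."""
    db = MongoClient(mongo_uri)[db_name]
    for name in collection_names:
        leaked = db[name].count_documents({"content": {"$exists": True}})
        print(f"{name}: {leaked} documents with a 'content' field (expected 0)")
        assert leaked == 0, f"File content leaked into MongoDB collection '{name}'"
```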
---

## Next Steps for Performance Optimization

1. **Reduce Delays:**

```python
# Change line 4589 from:
await asyncio.sleep(0.1)
# To:
await asyncio.sleep(0.05)  # or remove entirely
```
2. **Consider Parallel Chunk Processing:**
   - Process 2-3 chunks simultaneously if API rate limits allow
   - Use `asyncio.gather()` to run multiple chunk analyses in parallel (see the sketch after this list)

3. **Monitor and Adjust:**
   - Track the actual API call rate
   - Adjust the rate limiter if needed
   - Reduce delays if no rate limit errors occur
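
A minimal sketch of the parallel-chunk idea, assuming `analyze_chunk` stands in for the existing per-chunk analysis coroutine (the real function name and signature will differ):

```python
import asyncio

async def analyze_chunks_concurrently(chunks, analyze_chunk, max_concurrent: int = 3):
    """Run chunk analyses concurrently while capping the number of in-flight API calls."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(chunk):
        async with semaphore:  # keeps concurrency within API rate limits
            return await analyze_chunk(chunk)

    # gather() preserves input order, so results line up with the original chunks
    return await asyncio.gather(*(run_one(chunk) for chunk in chunks))
```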