CP_Assessment_engine/WORKFLOW_GUIDE.md
2026-02-10 12:59:40 +05:30
# Complete Workflow Guide - Simulated Assessment Engine
## Overview
This guide explains the complete 3-step workflow for generating simulated assessment data:
1. **Persona Preparation**: Merge persona factory output with enrichment data
2. **Simulation**: Generate assessment responses for all students
3. **Post-Processing**: Color headers, replace omitted values, verify quality
---
## Quick Start
### Automated Workflow (Recommended)
Run all 3 steps automatically:
```bash
# Full production run (3,000 students)
python run_complete_pipeline.py --all
# Dry run (5 students for testing)
python run_complete_pipeline.py --all --dry-run
```
### Manual Workflow
Run each step individually:
```bash
# Step 1: Prepare personas
python scripts/prepare_data.py
# Step 2: Run simulation
python main.py --full
# Step 3: Post-process
python scripts/comprehensive_post_processor.py
```
---
## Step-by-Step Details
### Step 1: Persona Preparation
**Purpose**: Create `merged_personas.xlsx` by combining:
- Persona factory output (from `FW_Pseudo_Data_Documents/cogniprism_persona_factory_0402.py`)
- 22 enrichment columns from `fixed_3k_personas.xlsx` (goals, interests, strengths, etc.)
- Student data from `3000-students.xlsx` and `3000_students_output.xlsx`
**Prerequisites** (all files within project):
- `support/fixed_3k_personas.xlsx` (enrichment data with 22 columns)
- `support/3000-students.xlsx` (student demographics)
- `support/3000_students_output.xlsx` (StudentCPIDs from database)
**Output**: `data/merged_personas.xlsx` (3,000 students, 79 columns)
**Run**:
```bash
python scripts/prepare_data.py
```
**What it does**:
1. Loads student data and CPIDs from `support/` directory
2. Merges on Roll Number
3. Adds 22 enrichment columns from `support/fixed_3k_personas.xlsx`:
   - `short_term_focus_1/2/3`
   - `long_term_focus_1/2/3`
   - `strength_1/2/3`
   - `improvement_area_1/2/3`
   - `hobby_1/2/3`
   - `clubs`, `achievements`
   - `expectation_1/2/3`
   - `segment`, `archetype`
   - `behavioral_fingerprint`
4. Validates and saves merged file
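The merge above can be sketched in a few lines of pandas. This is a hypothetical illustration, not the contents of `prepare_data.py`; every column name except the `Roll Number` join key is an assumption for the example.

```python
import pandas as pd

# Hypothetical sketch of the Step 1 merge. File roles follow the guide;
# column names other than "Roll Number" are illustrative assumptions.
def merge_personas(students, cpids, enrichment):
    """Join demographics, StudentCPIDs, and enrichment on Roll Number."""
    merged = students.merge(cpids, on="Roll Number", how="left")
    merged = merged.merge(enrichment, on="Roll Number", how="left")
    return merged
```

In the real script the three inputs would come from `pd.read_excel` on the `support/` files listed in the prerequisites.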
---
### Step 2: Simulation
**Purpose**: Generate assessment responses for all students across:
- 5 Survey Domains: Personality, Grit, Emotional Intelligence, Vocational Interest, Learning Strategies
- 12 Cognition Tests: Memory, Reaction Time, Reasoning, Attention tasks
**Prerequisites**:
- `data/merged_personas.xlsx` (from Step 1)
- `data/AllQuestions.xlsx` (question mapping)
- Anthropic API key in `.env` file
**Output**: 34 Excel files in `output/full_run/`
- 10 domain files (5 domains × 2 age groups)
- 24 cognition files (12 tests × 2 age groups)
**Run**:
```bash
# Full production (3,000 students, ~12-15 hours)
python main.py --full
# Dry run (5 students, ~5 minutes)
python main.py --dry
```
**Features**:
- ✅ Multithreaded processing (5 workers)
- ✅ Incremental saving (safe to interrupt)
- ✅ Resume capability (skips completed students)
- ✅ Fail-safe mechanisms (retry logic, sub-chunking)
**Progress Tracking**:
- Progress is saved after each student
- The run can resume after an interruption
- Check the `logs` file for detailed progress
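The incremental-save and resume behavior described above can be sketched as follows. This is an illustrative pattern, not `main.py` itself; the progress-file name and the `simulate()` hook are assumptions.

```python
import json
import os
import threading
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of the resume + incremental-save pattern; the
# progress-file name and simulate() callable are assumptions.
PROGRESS_FILE = "progress.json"

def load_completed():
    """Return the set of student IDs already finished in a prior run."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return set(json.load(f))
    return set()

def run_all(student_ids, simulate, workers=5):
    """Simulate pending students with a worker pool, saving after each one."""
    completed = load_completed()
    pending = [s for s in student_ids if s not in completed]  # skip done
    lock = threading.Lock()

    def worker(sid):
        simulate(sid)                      # may raise; caller can retry
        with lock:                         # serialize progress-file writes
            completed.add(sid)
            with open(PROGRESS_FILE, "w") as f:
                json.dump(sorted(completed), f)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(worker, pending))
    return completed
```

Because the progress file is rewritten after every student, an interrupted run loses at most the students in flight at the moment of interruption.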
---
### Step 3: Post-Processing
**Purpose**: Finalize output files with:
1. Header coloring (visual identification)
2. Omitted value replacement
3. Quality verification
**Prerequisites**:
- Output files from Step 2
- `data/AllQuestions.xlsx` (for mapping)
**Run**:
```bash
# Full post-processing (all 3 sub-steps)
python scripts/comprehensive_post_processor.py
# Skip specific steps
python scripts/comprehensive_post_processor.py --skip-colors
python scripts/comprehensive_post_processor.py --skip-replacement
python scripts/comprehensive_post_processor.py --skip-quality
```
**What it does**:
#### 3.1 Header Coloring
- 🟢 **Green headers**: Omission items (347 questions)
- 🔴 **Red headers**: Reverse-scoring items (264 questions)
- Priority: Red takes precedence over green
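The precedence rule can be sketched with openpyxl. The question-ID sets below are illustrative stand-ins for the mapping loaded from `AllQuestions.xlsx`.

```python
from openpyxl import Workbook
from openpyxl.styles import PatternFill

# Sketch of the header-coloring rule (red beats green); the question-ID
# sets are illustrative stand-ins for the AllQuestions.xlsx mapping.
GREEN = PatternFill("solid", start_color="00B050")  # omission items
RED = PatternFill("solid", start_color="FF0000")    # reverse-scored items

def color_headers(ws, omitted, reversed_items):
    """Color row-1 header cells; red takes precedence over green."""
    for cell in ws[1]:
        if cell.value in reversed_items:
            cell.fill = RED
        elif cell.value in omitted:
            cell.fill = GREEN
```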
#### 3.2 Omitted Value Replacement
- Replaces all values in omitted question columns with `"--"`
- Preserves header colors
- Processes all 10 domain files
#### 3.3 Quality Verification
- Data density check (>95% target)
- Response variance check (>0.5 target)
- Schema validation
- Generates `quality_report.json`
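The two numeric gates above can be sketched as a per-question check. The thresholds come from this guide; the input shape (a dict of question ID to response list, with `None` marking a missing value) is an assumption.

```python
import statistics

# Sketch of the density and variance gates; thresholds from the guide,
# the {question_id: responses} input shape is an assumption.
def quality_check(columns, density_target=0.95, variance_target=0.5):
    report = {}
    for qid, values in columns.items():
        filled = [v for v in values if v is not None]
        density = len(filled) / len(values)
        variance = statistics.pvariance(filled) if len(filled) > 1 else 0.0
        report[qid] = {
            "density_ok": density > density_target,
            "variance_ok": variance > variance_target,
        }
    return report
```

A report built this way is what a `quality_report.json` of pass/fail flags per question would serialize.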
**Output**:
- Processed files with colored headers and replaced omitted values
- Quality report: `output/full_run/quality_report.json`
---
## Pipeline Orchestrator
The `run_complete_pipeline.py` script orchestrates all 3 steps:
### Usage Examples
```bash
# Run all steps
python run_complete_pipeline.py --all
# Run specific step only
python run_complete_pipeline.py --step1
python run_complete_pipeline.py --step2
python run_complete_pipeline.py --step3
# Skip specific steps
python run_complete_pipeline.py --all --skip-prep
python run_complete_pipeline.py --all --skip-sim
python run_complete_pipeline.py --all --skip-post
# Dry run (5 students only)
python run_complete_pipeline.py --all --dry-run
```
### Options
| Option | Description |
|--------|-------------|
| `--step1` | Run only persona preparation |
| `--step2` | Run only simulation |
| `--step3` | Run only post-processing |
| `--all` | Run all steps (default if no step specified) |
| `--skip-prep` | Skip persona preparation |
| `--skip-sim` | Skip simulation |
| `--skip-post` | Skip post-processing |
| `--dry-run` | Run simulation with 5 students only |
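The flag handling in the table can be sketched with argparse. This mirrors the documented options; the "default to `--all` when no step is given" logic is an assumed reading of the table, not the orchestrator's actual source.

```python
import argparse

# Sketch of the orchestrator's flags, mirroring the options table above;
# the step-selection default is an assumption based on the table.
def parse_args(argv=None):
    p = argparse.ArgumentParser(description="Run the 3-step pipeline")
    for flag in ("--step1", "--step2", "--step3", "--all",
                 "--skip-prep", "--skip-sim", "--skip-post", "--dry-run"):
        p.add_argument(flag, action="store_true")
    args = p.parse_args(argv)
    if not (args.step1 or args.step2 or args.step3):
        args.all = True  # --all is the default when no step is specified
    return args
```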
---
## File Structure
```
Simulated_Assessment_Engine/
├── run_complete_pipeline.py              # Master orchestrator
├── main.py                               # Simulation engine
├── scripts/
│   ├── prepare_data.py                   # Step 1: Persona preparation
│   ├── comprehensive_post_processor.py   # Step 3: Post-processing
│   └── ...
├── data/
│   ├── merged_personas.xlsx              # Output from Step 1
│   └── AllQuestions.xlsx                 # Question mapping
└── output/
    └── full_run/
        ├── adolescense/
        │   ├── 5_domain/                 # 5 domain files
        │   └── cognition/                # 12 cognition files
        ├── adults/
        │   ├── 5_domain/                 # 5 domain files
        │   └── cognition/                # 12 cognition files
        └── quality_report.json           # Quality report from Step 3
```
---
## Troubleshooting
### Step 1 Issues
**Problem**: `fixed_3k_personas.xlsx` not found
- **Solution**: Ensure the file exists at `support/fixed_3k_personas.xlsx`, where Step 1 reads it (it originates from the persona factory in `FW_Pseudo_Data_Documents/`)
- **Note**: This file supplies the 22 enrichment columns used in Step 1
**Problem**: Student data files not found
- **Solution**: Check `3000-students.xlsx` and `3000_students_output.xlsx` in base directory or `support/` folder
### Step 2 Issues
**Problem**: API credit exhaustion
- **Solution**: Script will stop gracefully. Add credits and resume (it will skip completed students)
**Problem**: Simulation interrupted
- **Solution**: Re-run `python main.py --full`; it resumes from the last saved point
### Step 3 Issues
**Problem**: Header colors not applied
- **Solution**: Re-run post-processing: `python scripts/comprehensive_post_processor.py`
**Problem**: Quality check fails
- **Solution**: Review `quality_report.json` for specific issues. Most warnings are acceptable (e.g., Grit variance < 0.5)
---
## Best Practices
1. **Always run Step 1 first** to ensure `merged_personas.xlsx` is up-to-date
2. **Use dry-run for testing** before full production run
3. **Monitor API credits** during Step 2 (long-running process)
4. **Review quality report** after Step 3 to verify data quality
5. **Keep backups** of `merged_personas.xlsx` before regeneration
---
## Time Estimates
| Step | Duration | Notes |
|------|----------|-------|
| Step 1 | ~2 minutes | Persona preparation |
| Step 2 | 12-15 hours | Full 3,000 students (can be interrupted/resumed) |
| Step 3 | ~5 minutes | Post-processing |
**Total**: ~12-15 hours for complete pipeline
---
## Output Verification
After completing all steps, verify:
1. `data/merged_personas.xlsx` exists (3,000 rows, 79 columns)
2. `output/full_run/` contains 34 files (10 domain + 24 cognition)
3. Domain files have colored headers (green/red)
4. Omitted values are replaced with `"--"`
5. Quality report shows >95% data density
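Item 2 of the checklist can be automated with a short file count. The directory layout below follows the File Structure section of this guide (including the `adolescense` spelling used there).

```python
from pathlib import Path

# Sketch of an automated check for output-file counts; the layout
# follows the File Structure section of this guide.
def count_outputs(root="output/full_run"):
    root = Path(root)
    counts = {}
    for age in ("adolescense", "adults"):
        counts[age] = {
            "domain": len(list((root / age / "5_domain").glob("*.xlsx"))),
            "cognition": len(list((root / age / "cognition").glob("*.xlsx"))),
        }
    total = sum(v["domain"] + v["cognition"] for v in counts.values())
    return counts, total  # a complete run should total 34 files
```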
---
## Support
For issues or questions:
1. Check the `logs` file for detailed execution logs
2. Review `quality_report.json` for quality metrics
3. Check prerequisites for each step
4. Verify file paths and permissions
---
**Last Updated**: Final Production Version
**Status**: ✅ Production Ready