# Complete Workflow Guide - Simulated Assessment Engine

## Overview

This guide explains the complete 3-step workflow for generating simulated assessment data:

1. **Persona Preparation**: Merge persona factory output with enrichment data
2. **Simulation**: Generate assessment responses for all students
3. **Post-Processing**: Color headers, replace omitted values, verify quality

---

## Quick Start

### Automated Workflow (Recommended)

Run all 3 steps automatically:

```bash
# Full production run (3,000 students)
python run_complete_pipeline.py --all

# Dry run (5 students for testing)
python run_complete_pipeline.py --all --dry-run
```

### Manual Workflow

Run each step individually:

```bash
# Step 1: Prepare personas
python scripts/prepare_data.py

# Step 2: Run simulation
python main.py --full

# Step 3: Post-process
python scripts/comprehensive_post_processor.py
```

---

## Step-by-Step Details

### Step 1: Persona Preparation

**Purpose**: Create `merged_personas.xlsx` by combining:
- Persona factory output (from `FW_Pseudo_Data_Documents/cogniprism_persona_factory_0402.py`)
- 22 enrichment columns from `fixed_3k_personas.xlsx` (goals, interests, strengths, etc.)
- Student data from `3000-students.xlsx` and `3000_students_output.xlsx`

**Prerequisites** (all files within project):
- `support/fixed_3k_personas.xlsx` (enrichment data with 22 columns)
- `support/3000-students.xlsx` (student demographics)
- `support/3000_students_output.xlsx` (StudentCPIDs from database)

**Output**: `data/merged_personas.xlsx` (3,000 students, 79 columns)

**Run**:

```bash
python scripts/prepare_data.py
```

**What it does**:
1. Loads student data and CPIDs from the `support/` directory
2. Merges on Roll Number
3. Adds 22 enrichment columns from `support/fixed_3k_personas.xlsx`:
   - `short_term_focus_1/2/3`
   - `long_term_focus_1/2/3`
   - `strength_1/2/3`
   - `improvement_area_1/2/3`
   - `hobby_1/2/3`
   - `clubs`, `achievements`
   - `expectation_1/2/3`
   - `segment`, `archetype`
   - `behavioral_fingerprint`
4. Validates and saves the merged file

---

### Step 2: Simulation

**Purpose**: Generate assessment responses for all students across:
- 5 Survey Domains: Personality, Grit, Emotional Intelligence, Vocational Interest, Learning Strategies
- 12 Cognition Tests: Memory, Reaction Time, Reasoning, Attention tasks

**Prerequisites**:
- `data/merged_personas.xlsx` (from Step 1)
- `data/AllQuestions.xlsx` (question mapping)
- Anthropic API key in `.env` file

**Output**: 34 Excel files in `output/full_run/`
- 10 domain files (5 domains × 2 age groups)
- 24 cognition files (12 tests × 2 age groups)

**Run**:

```bash
# Full production (3,000 students, ~12-15 hours)
python main.py --full

# Dry run (5 students, ~5 minutes)
python main.py --dry
```

**Features**:
- ✅ Multithreaded processing (5 workers)
- ✅ Incremental saving (safe to interrupt)
- ✅ Resume capability (skips completed students)
- ✅ Fail-safe mechanisms (retry logic, sub-chunking)

**Progress Tracking**:
- Progress is saved after each student
- An interrupted run can be resumed
- Check the `logs` file for detailed progress

---

### Step 3: Post-Processing

**Purpose**: Finalize output files with:
1. Header coloring (visual identification)
2. Omitted value replacement
3. Quality verification

**Prerequisites**:
- Output files from Step 2
- `data/AllQuestions.xlsx` (for mapping)

**Run**:

```bash
# Full post-processing (all 3 sub-steps)
python scripts/comprehensive_post_processor.py

# Skip specific steps
python scripts/comprehensive_post_processor.py --skip-colors
python scripts/comprehensive_post_processor.py --skip-replacement
python scripts/comprehensive_post_processor.py --skip-quality
```

**What it does**:

#### 3.1 Header Coloring
- 🟢 **Green headers**: Omission items (347 questions)
- 🚩 **Red headers**: Reverse-scoring items (264 questions)
- Priority: Red takes precedence over green

#### 3.2 Omitted Value Replacement
- Replaces all values in omitted question columns with `"--"`
- Preserves header colors
- Processes all 10 domain files

#### 3.3 Quality Verification
- Data density check (>95% target)
- Response variance check (>0.5 target)
- Schema validation
- Generates `quality_report.json`

**Output**:
- Processed files with colored headers and replaced omitted values
- Quality report: `output/full_run/quality_report.json`

---

## Pipeline Orchestrator

The `run_complete_pipeline.py` script orchestrates all 3 steps:

### Usage Examples

```bash
# Run all steps
python run_complete_pipeline.py --all

# Run specific step only
python run_complete_pipeline.py --step1
python run_complete_pipeline.py --step2
python run_complete_pipeline.py --step3

# Skip specific steps
python run_complete_pipeline.py --all --skip-prep
python run_complete_pipeline.py --all --skip-sim
python run_complete_pipeline.py --all --skip-post

# Dry run (5 students only)
python run_complete_pipeline.py --all --dry-run
```

### Options

| Option | Description |
|--------|-------------|
| `--step1` | Run only persona preparation |
| `--step2` | Run only simulation |
| `--step3` | Run only post-processing |
| `--all` | Run all steps (default if no step specified) |
| `--skip-prep` | Skip persona preparation |
| `--skip-sim` | Skip simulation |
| `--skip-post` | Skip post-processing |
| `--dry-run` | Run simulation with 5 students only |

---

## File Structure

```
Simulated_Assessment_Engine/
├── run_complete_pipeline.py              # Master orchestrator
├── main.py                               # Simulation engine
├── scripts/
│   ├── prepare_data.py                   # Step 1: Persona preparation
│   ├── comprehensive_post_processor.py   # Step 3: Post-processing
│   └── ...
├── data/
│   ├── merged_personas.xlsx              # Output from Step 1
│   └── AllQuestions.xlsx                 # Question mapping
└── output/
    └── full_run/
        ├── adolescense/
        │   ├── 5_domain/                 # 5 domain files
        │   └── cognition/                # 12 cognition files
        ├── adults/
        │   ├── 5_domain/                 # 5 domain files
        │   └── cognition/                # 12 cognition files
        └── quality_report.json           # Quality report from Step 3
```

---

## Troubleshooting

### Step 1 Issues

**Problem**: `fixed_3k_personas.xlsx` not found
- **Solution**: Ensure the file exists in the `FW_Pseudo_Data_Documents/` directory
- **Note**: This file contains the 22 columns needed for persona enrichment

**Problem**: Student data files not found
- **Solution**: Check that `3000-students.xlsx` and `3000_students_output.xlsx` are in the base directory or the `support/` folder

### Step 2 Issues

**Problem**: API credit exhaustion
- **Solution**: The script stops gracefully. Add credits and resume (it will skip completed students)

**Problem**: Simulation interrupted
- **Solution**: Re-run `python main.py --full`. It resumes from the last saved point

### Step 3 Issues

**Problem**: Header colors not applied
- **Solution**: Re-run post-processing: `python scripts/comprehensive_post_processor.py`

**Problem**: Quality check fails
- **Solution**: Review `quality_report.json` for specific issues. Most warnings are acceptable (e.g., Grit variance < 0.5)

---

## Best Practices

1. **Always run Step 1 first** to ensure `merged_personas.xlsx` is up-to-date
2. **Use dry-run for testing** before a full production run
3. **Monitor API credits** during Step 2 (long-running process)
4. **Review quality report** after Step 3 to verify data quality
5. **Keep backups** of `merged_personas.xlsx` before regeneration

---

## Time Estimates

| Step | Duration | Notes |
|------|----------|-------|
| Step 1 | ~2 minutes | Persona preparation |
| Step 2 | 12-15 hours | Full 3,000 students (can be interrupted/resumed) |
| Step 3 | ~5 minutes | Post-processing |

**Total**: ~12-15 hours for complete pipeline

---

## Output Verification

After completing all steps, verify:

1. ✅ `data/merged_personas.xlsx` exists (3,000 rows, 79 columns)
2. ✅ `output/full_run/` contains 34 files (10 domain + 24 cognition)
3. ✅ Domain files have colored headers (green/red)
4. ✅ Omitted values are replaced with `"--"`
5. ✅ Quality report shows >95% data density

---

## Support

For issues or questions:

1. Check the `logs` file for detailed execution logs
2. Review `quality_report.json` for quality metrics
3. Check prerequisites for each step
4. Verify file paths and permissions

---

**Last Updated**: Final Production Version
**Status**: ✅ Production Ready
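
---

## Appendix: Scripted Output Check (Sketch)

The file-level items in the Output Verification checklist can be partly automated. The sketch below is not part of the project: `verify_outputs` is a hypothetical helper, and it assumes the layout shown under File Structure, with every result workbook stored as `.xlsx` somewhere under `output/full_run/`. It only checks that files exist, that the workbook count is 34, and that the quality report parses as JSON. It does not check row and column counts, header colors, or data density, which would need a spreadsheet library and knowledge of the report's schema.

```python
import json
from pathlib import Path

# 10 domain + 24 cognition files, as stated in this guide.
EXPECTED_WORKBOOKS = 34


def verify_outputs(root="."):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: paths follow the File Structure section of this guide.
    """
    root = Path(root)
    problems = []

    personas = root / "data" / "merged_personas.xlsx"
    if not personas.is_file():
        problems.append(f"missing {personas}")

    run_dir = root / "output" / "full_run"
    if not run_dir.is_dir():
        problems.append(f"missing directory {run_dir}")
        return problems

    # Count workbooks recursively, ignoring Excel lock files ("~$name.xlsx").
    workbooks = [p for p in run_dir.rglob("*.xlsx") if not p.name.startswith("~$")]
    if len(workbooks) != EXPECTED_WORKBOOKS:
        problems.append(
            f"expected {EXPECTED_WORKBOOKS} .xlsx files under {run_dir}, "
            f"found {len(workbooks)}"
        )

    report = run_dir / "quality_report.json"
    if not report.is_file():
        problems.append(f"missing {report}")
    else:
        try:
            json.loads(report.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{report} is not valid JSON: {exc}")

    return problems


if __name__ == "__main__":
    issues = verify_outputs()
    for issue in issues:
        print("FAIL:", issue)
    print("OK" if not issues else f"{len(issues)} check(s) failed")
```

Run from the project root, the script prints one `FAIL:` line per problem. Returning a list instead of raising on the first failure lets a single run report every missing piece.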