Complete Workflow Guide - Simulated Assessment Engine
Overview
This guide explains the complete 3-step workflow for generating simulated assessment data:
- Persona Preparation: Merge persona factory output with enrichment data
- Simulation: Generate assessment responses for all students
- Post-Processing: Color headers, replace omitted values, verify quality
Quick Start
Automated Workflow (Recommended)
Run all 3 steps automatically:
# Full production run (3,000 students)
python run_complete_pipeline.py --all
# Dry run (5 students for testing)
python run_complete_pipeline.py --all --dry-run
Manual Workflow
Run each step individually:
# Step 1: Prepare personas
python scripts/prepare_data.py
# Step 2: Run simulation
python main.py --full
# Step 3: Post-process
python scripts/comprehensive_post_processor.py
Step-by-Step Details
Step 1: Persona Preparation
Purpose: Create merged_personas.xlsx by combining:
- Persona factory output (from FW_Pseudo_Data_Documents/cogniprism_persona_factory_0402.py)
- 22 enrichment columns from fixed_3k_personas.xlsx (goals, interests, strengths, etc.)
- Student data from 3000-students.xlsx and 3000_students_output.xlsx
Prerequisites (all files within project):
- support/fixed_3k_personas.xlsx (enrichment data with 22 columns)
- support/3000-students.xlsx (student demographics)
- support/3000_students_output.xlsx (StudentCPIDs from database)
Output: data/merged_personas.xlsx (3,000 students, 79 columns)
Run:
python scripts/prepare_data.py
What it does:
- Loads student data and CPIDs from support/ directory
- Merges on Roll Number
- Adds 22 enrichment columns from support/fixed_3k_personas.xlsx:
  - short_term_focus_1/2/3
  - long_term_focus_1/2/3
  - strength_1/2/3
  - improvement_area_1/2/3
  - hobby_1/2/3
  - clubs, achievements
  - expectation_1/2/3
  - segment, archetype
  - behavioral_fingerprint
- Validates and saves merged file
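The merge described above can be sketched roughly as follows. This is a minimal, dependency-free illustration using plain dictionaries, not the actual logic in scripts/prepare_data.py. The helper name merge_personas, the inner-join behavior, the assumption that the enrichment data is also keyed by Roll Number, and the sample rows are all hypothetical.

```python
def merge_personas(students, cpids, enrichment, key="Roll Number"):
    """Inner-join three lists of row dicts on a shared key (hypothetical helper)."""
    cpid_by_key = {row[key]: row for row in cpids}
    enrich_by_key = {row[key]: row for row in enrichment}
    merged = []
    for student in students:
        k = student[key]
        # Assumption: keep only students present in all three sources.
        if k not in cpid_by_key or k not in enrich_by_key:
            continue
        merged.append({**student, **cpid_by_key[k], **enrich_by_key[k]})
    return merged


# Made-up sample rows, for illustration only.
students = [{"Roll Number": 1, "Name": "A"}, {"Roll Number": 2, "Name": "B"}]
cpids = [
    {"Roll Number": 1, "StudentCPID": "CP-001"},
    {"Roll Number": 2, "StudentCPID": "CP-002"},
]
enrichment = [{"Roll Number": 1, "strength_1": "curiosity", "archetype": "explorer"}]

merged = merge_personas(students, cpids, enrichment)
```

With real spreadsheets, a dataframe library would likely be the more natural choice; the dictionary version only shows the shape of the join.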
Step 2: Simulation
Purpose: Generate assessment responses for all students across:
- 5 Survey Domains: Personality, Grit, Emotional Intelligence, Vocational Interest, Learning Strategies
- 12 Cognition Tests: Memory, Reaction Time, Reasoning, Attention tasks
Prerequisites:
- data/merged_personas.xlsx (from Step 1)
- data/AllQuestions.xlsx (question mapping)
- Anthropic API key in .env file
Output: 34 Excel files in output/full_run/
- 10 domain files (5 domains × 2 age groups)
- 24 cognition files (12 tests × 2 age groups)
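The expected file counts can be checked with a short script. The sketch below assumes the directory layout shown in the File Structure section (5_domain/ and cognition/ folders under each age group) and that every output is an .xlsx file. The helper name count_output_files and the dummy file names are hypothetical, and the example builds a throwaway directory tree instead of reading real output.

```python
import tempfile
from pathlib import Path


def count_output_files(root):
    """Count .xlsx files per output folder type under root (hypothetical helper)."""
    counts = {"5_domain": 0, "cognition": 0}
    for path in Path(root).rglob("*.xlsx"):
        if path.parent.name in counts:
            counts[path.parent.name] += 1
    return counts


# Build a throwaway tree that mirrors the documented layout.
root = Path(tempfile.mkdtemp()) / "full_run"
for group in ("adolescense", "adults"):
    for folder, n in (("5_domain", 5), ("cognition", 12)):
        target = root / group / folder
        target.mkdir(parents=True)
        for i in range(n):
            (target / f"file_{i}.xlsx").touch()

counts = count_output_files(root)
all_present = counts == {"5_domain": 10, "cognition": 24}
```

Pointing count_output_files at output/full_run/ after Step 2 should give the same totals if the run completed.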
Run:
# Full production (3,000 students, ~12-15 hours)
python main.py --full
# Dry run (5 students, ~5 minutes)
python main.py --dry
Features:
- ✅ Multithreaded processing (5 workers)
- ✅ Incremental saving (safe to interrupt)
- ✅ Resume capability (skips completed students)
- ✅ Fail-safe mechanisms (retry logic, sub-chunking)
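One way retry logic and sub-chunking can fit together is sketched below. The guide does not describe how main.py implements these mechanisms, so this is an assumption: a chunk is retried a fixed number of times and then split in half. The names process_with_subchunking and flaky_call are hypothetical, and flaky_call is a stand-in for the real API call.

```python
def process_with_subchunking(items, call, max_retries=2):
    """Try a whole chunk; after repeated failures, split it in half and recurse."""
    for _ in range(max_retries):
        try:
            return call(items)
        except RuntimeError:
            continue  # retry the same chunk
    if len(items) == 1:
        raise RuntimeError(f"item failed after retries: {items[0]}")
    mid = len(items) // 2
    return (process_with_subchunking(items[:mid], call, max_retries)
            + process_with_subchunking(items[mid:], call, max_retries))


def flaky_call(items):
    """Stand-in for an API call that rejects chunks larger than two items."""
    if len(items) > 2:
        raise RuntimeError("chunk too large")
    return [f"answer-{q}" for q in items]


answers = process_with_subchunking(["q1", "q2", "q3", "q4", "q5"], flaky_call)
```

Splitting preserves the original order of the items, so the results can be written back without extra bookkeeping.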
Progress Tracking:
- Progress saved after each student
- Can resume from interruption
- Check logs file for detailed progress
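The combination of a worker pool, per-student saving, and resume-on-restart can be sketched as below. This is an illustration under assumptions, not the engine's actual code: the JSON progress file, the run_resumable function, and the simulate_student placeholder are hypothetical. Only the worker count of 5 and the "save after each student" behavior come from this guide.

```python
import json
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def simulate_student(student_id):
    """Placeholder for the real per-student simulation (hypothetical)."""
    return {"student": student_id, "responses": [1, 2, 3]}


def run_resumable(student_ids, progress_path, workers=5):
    """Process students in a thread pool, persisting progress after each one."""
    progress_path = Path(progress_path)
    done = json.loads(progress_path.read_text()) if progress_path.exists() else {}
    lock = threading.Lock()
    processed_now = []

    def work(student_id):
        result = simulate_student(student_id)
        with lock:
            done[str(student_id)] = result
            processed_now.append(student_id)
            # Save after every student so an interruption loses little work.
            progress_path.write_text(json.dumps(done))

    # Skip students that already have saved results.
    pending = [s for s in student_ids if str(s) not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, pending))
    return sorted(processed_now)


progress_file = Path(tempfile.mkdtemp()) / "progress.json"
first_run = run_resumable([1, 2, 3], progress_file)
second_run = run_resumable([1, 2, 3, 4], progress_file)
```

In this sketch the second call only processes student 4, because the first three are already recorded in the progress file.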
Step 3: Post-Processing
Purpose: Finalize output files with:
- Header coloring (visual identification)
- Omitted value replacement
- Quality verification
Prerequisites:
- Output files from Step 2
- data/AllQuestions.xlsx (for mapping)
Run:
# Full post-processing (all 3 sub-steps)
python scripts/comprehensive_post_processor.py
# Skip specific steps
python scripts/comprehensive_post_processor.py --skip-colors
python scripts/comprehensive_post_processor.py --skip-replacement
python scripts/comprehensive_post_processor.py --skip-quality
What it does:
3.1 Header Coloring
- 🟢 Green headers: Omission items (347 questions)
- 🚩 Red headers: Reverse-scoring items (264 questions)
- Priority: Red takes precedence over green
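The precedence rule can be expressed as a small lookup. The sketch below is illustrative only: the function name header_color, the use of sets of question IDs, and the sample IDs are assumptions, and it does not touch Excel formatting.

```python
def header_color(question_id, omitted_ids, reverse_ids):
    """Return 'red', 'green', or None for a question column header."""
    if question_id in reverse_ids:
        return "red"  # reverse-scoring wins when an item is in both sets
    if question_id in omitted_ids:
        return "green"
    return None


# Made-up question IDs; Q2 is both omitted and reverse-scored.
omitted = {"Q1", "Q2"}
reverse = {"Q2", "Q3"}
colors = {q: header_color(q, omitted, reverse) for q in ["Q1", "Q2", "Q3", "Q4"]}
```

Checking the reverse-scoring set first is what makes red take precedence for items that appear in both lists.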
3.2 Omitted Value Replacement
- Replaces all values in omitted question columns with "--"
- Preserves header colors
- Processes all 10 domain files
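As a rough illustration of the replacement step, the sketch below works on rows represented as dictionaries. The helper name replace_omitted and the sample data are hypothetical; the real post-processor operates on Excel files and also has to preserve header formatting, which this example does not cover.

```python
def replace_omitted(rows, omitted_columns, placeholder="--"):
    """Return copies of the rows with every omitted column set to the placeholder."""
    return [
        {col: (placeholder if col in omitted_columns else value)
         for col, value in row.items()}
        for row in rows
    ]


# Made-up rows; Q2 is treated as an omitted question.
rows = [
    {"StudentCPID": "CP-001", "Q1": 4, "Q2": 2},
    {"StudentCPID": "CP-002", "Q1": 1, "Q2": 5},
]
cleaned = replace_omitted(rows, {"Q2"})
```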
3.3 Quality Verification
- Data density check (>95% target)
- Response variance check (>0.5 target)
- Schema validation
- Generates quality_report.json
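The density and variance checks might look something like the sketch below. The guide states the targets (>95% density, >0.5 variance) but not how the metrics are computed, so the definitions here are assumptions: density is the share of non-empty cells, and variance is the population variance over all numeric responses. The function names and report keys are hypothetical.

```python
import json
from statistics import pvariance


def data_density(rows):
    """Fraction of cells that are neither None nor an empty string."""
    cells = [v for row in rows for v in row.values()]
    filled = [v for v in cells if v not in (None, "")]
    return len(filled) / len(cells) if cells else 0.0


def response_variance(rows, columns):
    """Population variance of the numeric responses in the given columns."""
    values = [row[c] for row in rows for c in columns
              if isinstance(row.get(c), (int, float))]
    return pvariance(values) if len(values) > 1 else 0.0


# Made-up responses with one missing cell.
rows = [
    {"Q1": 1, "Q2": 5},
    {"Q1": 3, "Q2": None},
    {"Q1": 5, "Q2": 1},
]
report = {
    "density": data_density(rows),
    "variance": response_variance(rows, ["Q1", "Q2"]),
}
report["density_ok"] = report["density"] > 0.95
report["variance_ok"] = report["variance"] > 0.5
report_json = json.dumps(report, indent=2)
```

In this toy data the variance check passes while the density check fails, since one of six cells is empty.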
Output:
- Processed files with colored headers and replaced omitted values
- Quality report: output/full_run/quality_report.json
Pipeline Orchestrator
The run_complete_pipeline.py script orchestrates all 3 steps:
Usage Examples
# Run all steps
python run_complete_pipeline.py --all
# Run specific step only
python run_complete_pipeline.py --step1
python run_complete_pipeline.py --step2
python run_complete_pipeline.py --step3
# Skip specific steps
python run_complete_pipeline.py --all --skip-prep
python run_complete_pipeline.py --all --skip-sim
python run_complete_pipeline.py --all --skip-post
# Dry run (5 students only)
python run_complete_pipeline.py --all --dry-run
Options
| Option | Description |
|---|---|
| --step1 | Run only persona preparation |
| --step2 | Run only simulation |
| --step3 | Run only post-processing |
| --all | Run all steps (default if no step specified) |
| --skip-prep | Skip persona preparation |
| --skip-sim | Skip simulation |
| --skip-post | Skip post-processing |
| --dry-run | Run simulation with 5 students only |
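How these flags could combine into a list of steps to run is sketched below with argparse. The flag names come from the table above; everything else is an assumption, including the helper names build_parser and resolve_steps and the rule that skip flags are applied after step selection. The real run_complete_pipeline.py may resolve them differently.

```python
import argparse


def build_parser():
    """Build a parser exposing the documented orchestrator flags."""
    parser = argparse.ArgumentParser(description="Pipeline orchestrator sketch")
    for flag in ("--step1", "--step2", "--step3", "--all",
                 "--skip-prep", "--skip-sim", "--skip-post", "--dry-run"):
        parser.add_argument(flag, action="store_true")
    return parser


def resolve_steps(args):
    """Return the ordered list of step numbers to run."""
    selected = [n for n, on in ((1, args.step1), (2, args.step2), (3, args.step3)) if on]
    if args.all or not selected:
        selected = [1, 2, 3]  # --all is the default when no step is given
    skipped = {n for n, on in ((1, args.skip_prep), (2, args.skip_sim),
                               (3, args.skip_post)) if on}
    return [n for n in selected if n not in skipped]


parser = build_parser()
plan_default = resolve_steps(parser.parse_args([]))
plan_skip = resolve_steps(parser.parse_args(["--all", "--skip-prep"]))
plan_single = resolve_steps(parser.parse_args(["--step3"]))
```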
File Structure
Simulated_Assessment_Engine/
├── run_complete_pipeline.py # Master orchestrator
├── main.py # Simulation engine
├── scripts/
│ ├── prepare_data.py # Step 1: Persona preparation
│ ├── comprehensive_post_processor.py # Step 3: Post-processing
│ └── ...
├── data/
│ ├── merged_personas.xlsx # Output from Step 1
│ └── AllQuestions.xlsx # Question mapping
└── output/
└── full_run/
├── adolescense/
│ ├── 5_domain/ # 5 domain files
│ └── cognition/ # 12 cognition files
├── adults/
│ ├── 5_domain/ # 5 domain files
│ └── cognition/ # 12 cognition files
└── quality_report.json # Quality report from Step 3
Troubleshooting
Step 1 Issues
Problem: fixed_3k_personas.xlsx not found
- Solution: Ensure file exists in FW_Pseudo_Data_Documents/ directory
- Note: This file contains 22 enrichment columns needed for persona enrichment
Problem: Student data files not found
- Solution: Check 3000-students.xlsx and 3000_students_output.xlsx in base directory or support/ folder
Step 2 Issues
Problem: API credit exhaustion
- Solution: Script will stop gracefully. Add credits and resume (it will skip completed students)
Problem: Simulation interrupted
- Solution: Simply re-run python main.py --full. It will resume from the last saved point
Step 3 Issues
Problem: Header colors not applied
- Solution: Re-run post-processing: python scripts/comprehensive_post_processor.py
Problem: Quality check fails
- Solution: Review quality_report.json for specific issues. Most warnings are acceptable (e.g., Grit variance < 0.5)
Best Practices
- Always run Step 1 first to ensure merged_personas.xlsx is up-to-date
- Use dry-run for testing before full production run
- Monitor API credits during Step 2 (long-running process)
- Review quality report after Step 3 to verify data quality
- Keep backups of merged_personas.xlsx before regeneration
Time Estimates
| Step | Duration | Notes |
|---|---|---|
| Step 1 | ~2 minutes | Persona preparation |
| Step 2 | 12-15 hours | Full 3,000 students (can be interrupted/resumed) |
| Step 3 | ~5 minutes | Post-processing |
Total: ~12-15 hours for complete pipeline
Output Verification
After completing all steps, verify:
- ✅ data/merged_personas.xlsx exists (3,000 rows, 79 columns)
- ✅ output/full_run/ contains 34 files (10 domain + 24 cognition)
- ✅ Domain files have colored headers (green/red)
- ✅ Omitted values are replaced with "--"
- ✅ Quality report shows >95% data density
Support
For issues or questions:
- Check logs file for detailed execution logs
- Review quality_report.json for quality metrics
- Check prerequisites for each step
- Verify file paths and permissions
Last Updated: Final Production Version
Status: ✅ Production Ready