# Complete Workflow Guide - Simulated Assessment Engine

## Overview

This guide explains the complete 3-step workflow for generating simulated assessment data:

1. **Persona Preparation**: Merge persona factory output with enrichment data
2. **Simulation**: Generate assessment responses for all students
3. **Post-Processing**: Color headers, replace omitted values, verify quality


---

## Quick Start

### Automated Workflow (Recommended)

Run all 3 steps automatically:

```bash
# Full production run (3,000 students)
python run_complete_pipeline.py --all

# Dry run (5 students for testing)
python run_complete_pipeline.py --all --dry-run
```

### Manual Workflow

Run each step individually:

```bash
# Step 1: Prepare personas
python scripts/prepare_data.py

# Step 2: Run simulation
python main.py --full

# Step 3: Post-process
python scripts/comprehensive_post_processor.py
```


---

## Step-by-Step Details

### Step 1: Persona Preparation

**Purpose**: Create `merged_personas.xlsx` by combining:
- Persona factory output (from `FW_Pseudo_Data_Documents/cogniprism_persona_factory_0402.py`)
- 22 enrichment columns from `fixed_3k_personas.xlsx` (goals, interests, strengths, etc.)
- Student data from `3000-students.xlsx` and `3000_students_output.xlsx`

**Prerequisites** (all files within project):
- `support/fixed_3k_personas.xlsx` (enrichment data with 22 columns)
- `support/3000-students.xlsx` (student demographics)
- `support/3000_students_output.xlsx` (StudentCPIDs from database)

**Output**: `data/merged_personas.xlsx` (3,000 students, 79 columns)

**Run**:
```bash
python scripts/prepare_data.py
```

**What it does**:
1. Loads student data and CPIDs from `support/` directory
2. Merges on Roll Number
3. Adds 22 enrichment columns from `support/fixed_3k_personas.xlsx`:
   - `short_term_focus_1/2/3`
   - `long_term_focus_1/2/3`
   - `strength_1/2/3`
   - `improvement_area_1/2/3`
   - `hobby_1/2/3`
   - `clubs`, `achievements`
   - `expectation_1/2/3`
   - `segment`, `archetype`
   - `behavioral_fingerprint`
4. Validates and saves merged file

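The merge described in the steps above can be pictured with a minimal, standard-library sketch. It is not the code in `prepare_data.py`: the function name `merge_on_roll_number`, the `StudentCPID` column name, the sample values, the inner-join behaviour, and the idea that the enrichment rows are also keyed by Roll Number are all assumptions made for illustration.

```python
def merge_on_roll_number(students, cpids, enrichment, key="Roll Number"):
    """Join three lists of row dicts on a shared key (inner join)."""
    cpid_by_key = {row[key]: row for row in cpids}
    enrich_by_key = {row[key]: row for row in enrichment}
    merged = []
    for student in students:
        roll = student[key]
        # Assumed behaviour: a student missing from any source is skipped.
        if roll not in cpid_by_key or roll not in enrich_by_key:
            continue
        row = dict(student)
        row.update(cpid_by_key[roll])
        row.update(enrich_by_key[roll])
        merged.append(row)
    return merged


# Hypothetical sample rows; the real inputs are the Excel files in support/.
students = [{"Roll Number": 1, "Name": "A"}, {"Roll Number": 2, "Name": "B"}]
cpids = [
    {"Roll Number": 1, "StudentCPID": "CP-001"},
    {"Roll Number": 2, "StudentCPID": "CP-002"},
]
enrichment = [{"Roll Number": 1, "strength_1": "focus", "hobby_1": "chess"}]

merged = merge_on_roll_number(students, cpids, enrichment)
print(len(merged), merged[0]["StudentCPID"], merged[0]["hobby_1"])
```

In this sketch the second student is dropped because no enrichment row exists for it; a validation step could instead report such rows.
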

---

### Step 2: Simulation

**Purpose**: Generate assessment responses for all students across:
- 5 Survey Domains: Personality, Grit, Emotional Intelligence, Vocational Interest, Learning Strategies
- 12 Cognition Tests: Memory, Reaction Time, Reasoning, Attention tasks

**Prerequisites**:
- `data/merged_personas.xlsx` (from Step 1)
- `data/AllQuestions.xlsx` (question mapping)
- Anthropic API key in `.env` file

**Output**: 34 Excel files in `output/full_run/`
- 10 domain files (5 domains × 2 age groups)
- 24 cognition files (12 tests × 2 age groups)

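The file count above follows from simple enumeration, sketched below. The age-group names are taken from the directory names in the File Structure section; the cognition test labels (`cognition_test_1` and so on) are placeholders, since this guide does not list the 12 individual tests.

```python
DOMAINS = [
    "Personality",
    "Grit",
    "Emotional Intelligence",
    "Vocational Interest",
    "Learning Strategies",
]
AGE_GROUPS = ["adolescense", "adults"]  # directory names under output/full_run/
COGNITION_TEST_COUNT = 12

# One file per (age group, domain) and per (age group, cognition test).
domain_files = [(age, domain) for age in AGE_GROUPS for domain in DOMAINS]
cognition_files = [
    (age, f"cognition_test_{i}")  # placeholder labels, not real test names
    for age in AGE_GROUPS
    for i in range(1, COGNITION_TEST_COUNT + 1)
]

total = len(domain_files) + len(cognition_files)
print(len(domain_files), len(cognition_files), total)
```
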
**Run**:
```bash
# Full production (3,000 students, ~12-15 hours)
python main.py --full

# Dry run (5 students, ~5 minutes)
python main.py --dry
```

**Features**:
- ✅ Multithreaded processing (5 workers)
- ✅ Incremental saving (safe to interrupt)
- ✅ Resume capability (skips completed students)
- ✅ Fail-safe mechanisms (retry logic, sub-chunking)

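As an illustration of the retry logic mentioned in the feature list, here is a generic retry-with-exponential-backoff sketch. It is an assumption about the general technique, not the engine's actual implementation: `call_with_retry`, its parameters, and the `flaky_request` stand-in are hypothetical names, and sub-chunking is not covered.

```python
import time


def call_with_retry(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


attempts = []


def flaky_request():
    """Hypothetical stand-in for an API call that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# A no-op sleep keeps the demonstration instant.
result = call_with_retry(flaky_request, sleep=lambda seconds: None)
print(result, len(attempts))
```
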
**Progress Tracking**:
- Progress saved after each student
- Can resume from interruption
- Check `logs` file for detailed progress

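One way to picture "progress saved after each student" combined with resume is a small checkpoint file that records completed student IDs. The sketch below assumes a JSON progress file and hypothetical names (`load_completed`, `run_with_resume`, `progress.json`, the `S001` style IDs); how `main.py` actually stores its progress is not described in this guide.

```python
import json
import tempfile
from pathlib import Path


def load_completed(progress_path):
    """Return the set of student IDs recorded as finished, if any."""
    path = Path(progress_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def run_with_resume(student_ids, simulate, progress_path):
    """Simulate each student once, checkpointing after every student."""
    completed = load_completed(progress_path)
    for student_id in student_ids:
        if student_id in completed:
            continue  # already done in an earlier run
        simulate(student_id)
        completed.add(student_id)
        Path(progress_path).write_text(json.dumps(sorted(completed)))
    return completed


processed = []
with tempfile.TemporaryDirectory() as tmp:
    progress_file = Path(tmp) / "progress.json"
    # First run handles two students, then "stops".
    run_with_resume(["S001", "S002"], processed.append, progress_file)
    # Second run sees the checkpoint and only simulates the new student.
    run_with_resume(["S001", "S002", "S003"], processed.append, progress_file)

print(processed)
```
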

---

### Step 3: Post-Processing

**Purpose**: Finalize output files with:
1. Header coloring (visual identification)
2. Omitted value replacement
3. Quality verification

**Prerequisites**:
- Output files from Step 2
- `data/AllQuestions.xlsx` (for mapping)

**Run**:
```bash
# Full post-processing (all 3 sub-steps)
python scripts/comprehensive_post_processor.py

# Skip specific steps
python scripts/comprehensive_post_processor.py --skip-colors
python scripts/comprehensive_post_processor.py --skip-replacement
python scripts/comprehensive_post_processor.py --skip-quality
```

**What it does**:

#### 3.1 Header Coloring
- 🟢 **Green headers**: Omission items (347 questions)
- 🚩 **Red headers**: Reverse-scoring items (264 questions)
- Priority: Red takes precedence over green

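The precedence rule can be expressed as a small lookup, sketched here with hypothetical question IDs. The function name `header_color` and the set-based inputs are assumptions; the post-processor presumably derives the two sets from `data/AllQuestions.xlsx`.

```python
def header_color(question_id, omission_ids, reverse_ids):
    """Pick a header color; reverse-scoring (red) wins over omission (green)."""
    if question_id in reverse_ids:
        return "red"
    if question_id in omission_ids:
        return "green"
    return None  # no special color


# Hypothetical IDs: Q2 is both an omission item and a reverse-scoring item.
omission_ids = {"Q1", "Q2"}
reverse_ids = {"Q2", "Q3"}

colors = {
    question: header_color(question, omission_ids, reverse_ids)
    for question in ["Q1", "Q2", "Q3", "Q4"]
}
print(colors)
```
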
#### 3.2 Omitted Value Replacement
- Replaces all values in omitted question columns with `"--"`
- Preserves header colors
- Processes all 10 domain files

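A minimal sketch of the replacement itself, using plain dictionaries in place of spreadsheet rows. The function name `replace_omitted` and the sample columns are hypothetical; the real script works on the Excel files and also has to preserve header formatting, which this sketch does not show.

```python
OMITTED_MARKER = "--"


def replace_omitted(rows, omitted_columns):
    """Return new rows where every value in an omitted column is '--'."""
    return [
        {
            column: (OMITTED_MARKER if column in omitted_columns else value)
            for column, value in row.items()
        }
        for row in rows
    ]


# Hypothetical responses: Q2 is an omitted question.
rows = [{"Q1": 3, "Q2": 5}, {"Q1": 1, "Q2": 4}]
cleaned = replace_omitted(rows, omitted_columns={"Q2"})
print(cleaned)
```
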
#### 3.3 Quality Verification
- Data density check (>95% target)
- Response variance check (>0.5 target)
- Schema validation
- Generates `quality_report.json`

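This guide does not define how density and variance are computed, so the sketch below uses one plausible reading: density as the share of non-empty cells, and variance as the population variance of the numeric responses. Both definitions, the function names, and the sample data are assumptions.

```python
import statistics


def data_density(rows):
    """Fraction of cells that are neither None nor an empty string."""
    cells = [value for row in rows for value in row]
    filled = [value for value in cells if value is not None and value != ""]
    return len(filled) / len(cells) if cells else 0.0


def response_variance(values):
    """Population variance of the numeric responses (0.0 if fewer than two)."""
    numeric = [v for v in values if isinstance(v, (int, float))]
    return statistics.pvariance(numeric) if len(numeric) > 1 else 0.0


# Hypothetical responses: one missing cell out of six.
rows = [[1, 5, 3], [2, None, 4]]
density = data_density(rows)
variance = response_variance([1, 2, 3, 4, 5])

report = {
    "density_ok": density > 0.95,   # target from the list above
    "variance_ok": variance > 0.5,  # target from the list above
}
print(round(density, 3), variance, report)
```
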
**Output**:
- Processed files with colored headers and replaced omitted values
- Quality report: `output/full_run/quality_report.json`


---

## Pipeline Orchestrator

The `run_complete_pipeline.py` script orchestrates all 3 steps:

### Usage Examples

```bash
# Run all steps
python run_complete_pipeline.py --all

# Run specific step only
python run_complete_pipeline.py --step1
python run_complete_pipeline.py --step2
python run_complete_pipeline.py --step3

# Skip specific steps
python run_complete_pipeline.py --all --skip-prep
python run_complete_pipeline.py --all --skip-sim
python run_complete_pipeline.py --all --skip-post

# Dry run (5 students only)
python run_complete_pipeline.py --all --dry-run
```

### Options

| Option | Description |
|--------|-------------|
| `--step1` | Run only persona preparation |
| `--step2` | Run only simulation |
| `--step3` | Run only post-processing |
| `--all` | Run all steps (default if no step specified) |
| `--skip-prep` | Skip persona preparation |
| `--skip-sim` | Skip simulation |
| `--skip-post` | Skip post-processing |
| `--dry-run` | Run simulation with 5 students only |

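How these flags might combine can be sketched with `argparse`. The flag names come from the table above; everything else (the `build_parser` and `resolve_steps` helpers, the internal step names `prep`, `sim`, `post`, and the rule that skip flags are applied after step selection) is an assumption rather than the orchestrator's actual code.

```python
import argparse


def build_parser():
    """Build a parser with the flags listed in the options table."""
    parser = argparse.ArgumentParser(description="Pipeline orchestrator sketch")
    for flag in ("--step1", "--step2", "--step3", "--all",
                 "--skip-prep", "--skip-sim", "--skip-post", "--dry-run"):
        parser.add_argument(flag, action="store_true")
    return parser


def resolve_steps(args):
    """Translate parsed flags into an ordered list of steps to run."""
    selected = [
        name
        for name, enabled in (("prep", args.step1), ("sim", args.step2), ("post", args.step3))
        if enabled
    ]
    if args.all or not selected:
        selected = ["prep", "sim", "post"]  # --all is the default
    skipped = {
        name
        for name, enabled in (("prep", args.skip_prep), ("sim", args.skip_sim), ("post", args.skip_post))
        if enabled
    }
    return [step for step in selected if step not in skipped]


parser = build_parser()
default_steps = resolve_steps(parser.parse_args([]))
skip_prep_steps = resolve_steps(parser.parse_args(["--all", "--skip-prep"]))
only_sim_steps = resolve_steps(parser.parse_args(["--step2"]))
print(default_steps, skip_prep_steps, only_sim_steps)
```
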

---

## File Structure

```
Simulated_Assessment_Engine/
├── run_complete_pipeline.py              # Master orchestrator
├── main.py                               # Simulation engine
├── scripts/
│   ├── prepare_data.py                   # Step 1: Persona preparation
│   ├── comprehensive_post_processor.py   # Step 3: Post-processing
│   └── ...
├── data/
│   ├── merged_personas.xlsx              # Output from Step 1
│   └── AllQuestions.xlsx                 # Question mapping
└── output/
    └── full_run/
        ├── adolescense/
        │   ├── 5_domain/                 # 5 domain files
        │   └── cognition/                # 12 cognition files
        ├── adults/
        │   ├── 5_domain/                 # 5 domain files
        │   └── cognition/                # 12 cognition files
        └── quality_report.json           # Quality report from Step 3
```


---

## Troubleshooting

### Step 1 Issues

**Problem**: `fixed_3k_personas.xlsx` not found
- **Solution**: Ensure the file exists at `support/fixed_3k_personas.xlsx`, the path listed in the Step 1 prerequisites
- **Note**: This file contains 22 enrichment columns needed for persona enrichment

**Problem**: Student data files not found
- **Solution**: Check that `3000-students.xlsx` and `3000_students_output.xlsx` are present in the base directory or the `support/` folder

### Step 2 Issues

**Problem**: API credit exhaustion
- **Solution**: Script will stop gracefully. Add credits and resume (it will skip completed students)

**Problem**: Simulation interrupted
- **Solution**: Simply re-run `python main.py --full`. It will resume from last saved point

### Step 3 Issues

**Problem**: Header colors not applied
- **Solution**: Re-run post-processing: `python scripts/comprehensive_post_processor.py`

**Problem**: Quality check fails
- **Solution**: Review `quality_report.json` for specific issues. Most warnings are acceptable (e.g., Grit variance < 0.5)


---

## Best Practices

1. **Always run Step 1 first** to ensure `merged_personas.xlsx` is up-to-date
2. **Use dry-run for testing** before full production run
3. **Monitor API credits** during Step 2 (long-running process)
4. **Review quality report** after Step 3 to verify data quality
5. **Keep backups** of `merged_personas.xlsx` before regeneration


---

## Time Estimates

| Step | Duration | Notes |
|------|----------|-------|
| Step 1 | ~2 minutes | Persona preparation |
| Step 2 | 12-15 hours | Full 3,000 students (can be interrupted/resumed) |
| Step 3 | ~5 minutes | Post-processing |

**Total**: ~12-15 hours for complete pipeline

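For planning purposes, the Step 2 estimate can be converted into a rough per-student figure. The calculation below uses only numbers stated in this guide (3,000 students, 5 workers, 12-15 hours) and assumes the workers are evenly loaded, which may not hold in practice.

```python
STUDENTS = 3000
WORKERS = 5


def seconds_per_student(total_hours, students=STUDENTS):
    """Average wall-clock seconds per student for a run of total_hours."""
    return total_hours * 3600 / students


estimates = {hours: seconds_per_student(hours) for hours in (12, 15)}
for hours, wall_clock in estimates.items():
    # With evenly loaded workers, each worker spends about WORKERS times longer
    # on a single student than the overall wall-clock average.
    per_worker = wall_clock * WORKERS
    print(f"{hours} h total: {wall_clock:.1f} s per student overall, "
          f"~{per_worker:.0f} s per student per worker")
```
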

---

## Output Verification

After completing all steps, verify:

1. ✅ `data/merged_personas.xlsx` exists (3,000 rows, 79 columns)
2. ✅ `output/full_run/` contains 34 Excel files (10 domain + 24 cognition)
3. ✅ Domain files have colored headers (green/red)
4. ✅ Omitted values are replaced with `"--"`
5. ✅ Quality report shows >95% data density

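Some of these checks can be automated. The sketch below covers only the file-level ones (items 1 and 2, plus the presence of the quality report); it does not open the workbooks or parse the report, whose internal layout this guide does not describe. The helper name `verify_outputs` and the placeholder file names in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path


def verify_outputs(root, expected_excel_files=34):
    """Run file-level checks against a project directory."""
    root = Path(root)
    full_run = root / "output" / "full_run"
    excel_files = list(full_run.rglob("*.xlsx"))
    return {
        "merged_personas_exists": (root / "data" / "merged_personas.xlsx").is_file(),
        "excel_file_count_ok": len(excel_files) == expected_excel_files,
        "quality_report_exists": (full_run / "quality_report.json").is_file(),
    }


# Demonstration against a throwaway directory that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "data").mkdir()
    (project / "data" / "merged_personas.xlsx").touch()
    for age_group in ("adolescense", "adults"):
        for folder, count in (("5_domain", 5), ("cognition", 12)):
            target = project / "output" / "full_run" / age_group / folder
            target.mkdir(parents=True)
            for index in range(count):
                (target / f"placeholder_{index}.xlsx").touch()
    (project / "output" / "full_run" / "quality_report.json").write_text("{}")
    checks = verify_outputs(project)

print(checks)
```
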

---

## Support

For issues or questions:
1. Check `logs` file for detailed execution logs
2. Review `quality_report.json` for quality metrics
3. Check prerequisites for each step
4. Verify file paths and permissions


---

**Last Updated**: Final Production Version
**Status**: ✅ Production Ready