CP_Assessment_engine/WORKFLOW_GUIDE.md
2026-02-10 12:59:40 +05:30
# Complete Workflow Guide - Simulated Assessment Engine
## Overview
This guide explains the complete 3-step workflow for generating simulated assessment data:
1. **Persona Preparation**: Merge persona factory output with enrichment data
2. **Simulation**: Generate assessment responses for all students
3. **Post-Processing**: Color headers, replace omitted values, verify quality
---
## Quick Start
### Automated Workflow (Recommended)
Run all 3 steps automatically:
```bash
# Full production run (3,000 students)
python run_complete_pipeline.py --all
# Dry run (5 students for testing)
python run_complete_pipeline.py --all --dry-run
```
### Manual Workflow
Run each step individually:
```bash
# Step 1: Prepare personas
python scripts/prepare_data.py
# Step 2: Run simulation
python main.py --full
# Step 3: Post-process
python scripts/comprehensive_post_processor.py
```
---
## Step-by-Step Details
### Step 1: Persona Preparation
**Purpose**: Create `merged_personas.xlsx` by combining:
- Persona factory output (from `FW_Pseudo_Data_Documents/cogniprism_persona_factory_0402.py`)
- 22 enrichment columns from `fixed_3k_personas.xlsx` (goals, interests, strengths, etc.)
- Student data from `3000-students.xlsx` and `3000_students_output.xlsx`
**Prerequisites** (all files within project):
- `support/fixed_3k_personas.xlsx` (enrichment data with 22 columns)
- `support/3000-students.xlsx` (student demographics)
- `support/3000_students_output.xlsx` (StudentCPIDs from database)
**Output**: `data/merged_personas.xlsx` (3,000 students, 79 columns)
**Run**:
```bash
python scripts/prepare_data.py
```
**What it does**:
1. Loads student data and CPIDs from `support/` directory
2. Merges on Roll Number
3. Adds 22 enrichment columns from `support/fixed_3k_personas.xlsx`:
   - `short_term_focus_1/2/3`
   - `long_term_focus_1/2/3`
   - `strength_1/2/3`
   - `improvement_area_1/2/3`
   - `hobby_1/2/3`
   - `clubs`, `achievements`
   - `expectation_1/2/3`
   - `segment`, `archetype`
   - `behavioral_fingerprint`
4. Validates and saves merged file
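The merge above can be sketched in a few lines of pandas. This is a hypothetical illustration, not the contents of `prepare_data.py`; every column name except the `Roll Number` join key is an assumption for the example.

```python
import pandas as pd

# Hypothetical sketch of the Step 1 merge. File roles follow the guide;
# column names other than "Roll Number" are illustrative assumptions.
def merge_personas(students, cpids, enrichment):
    """Join demographics, StudentCPIDs, and enrichment on Roll Number."""
    merged = students.merge(cpids, on="Roll Number", how="left")
    merged = merged.merge(enrichment, on="Roll Number", how="left")
    return merged
```

In the real script the three inputs would come from `pd.read_excel` on the `support/` files listed in the prerequisites.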
---
### Step 2: Simulation
**Purpose**: Generate assessment responses for all students across:
- 5 Survey Domains: Personality, Grit, Emotional Intelligence, Vocational Interest, Learning Strategies
- 12 Cognition Tests: Memory, Reaction Time, Reasoning, Attention tasks
**Prerequisites**:
- `data/merged_personas.xlsx` (from Step 1)
- `data/AllQuestions.xlsx` (question mapping)
- Anthropic API key in `.env` file
**Output**: 34 Excel files in `output/full_run/`
- 10 domain files (5 domains × 2 age groups)
- 24 cognition files (12 tests × 2 age groups)
**Run**:
```bash
# Full production (3,000 students, ~12-15 hours)
python main.py --full
# Dry run (5 students, ~5 minutes)
python main.py --dry
```
**Features**:
- ✅ Multithreaded processing (5 workers)
- ✅ Incremental saving (safe to interrupt)
- ✅ Resume capability (skips completed students)
- ✅ Fail-safe mechanisms (retry logic, sub-chunking)
**Progress Tracking**:
- Progress is saved after each student
- The run can resume after an interruption
- Check the `logs` file for detailed progress
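The incremental-save and resume behavior described above can be sketched as follows. This is an illustrative pattern, not `main.py` itself; the progress-file name and the `simulate()` hook are assumptions.

```python
import json
import os
import threading
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of the resume + incremental-save pattern; the
# progress-file name and simulate() callable are assumptions.
PROGRESS_FILE = "progress.json"

def load_completed():
    """Return the set of student IDs already finished in a prior run."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return set(json.load(f))
    return set()

def run_all(student_ids, simulate, workers=5):
    """Simulate pending students with a worker pool, saving after each one."""
    completed = load_completed()
    pending = [s for s in student_ids if s not in completed]  # skip done
    lock = threading.Lock()

    def worker(sid):
        simulate(sid)                      # may raise; caller can retry
        with lock:                         # serialize progress-file writes
            completed.add(sid)
            with open(PROGRESS_FILE, "w") as f:
                json.dump(sorted(completed), f)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(worker, pending))
    return completed
```

Because the progress file is rewritten after every student, an interrupted run loses at most the students in flight at the moment of interruption.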
---
### Step 3: Post-Processing
**Purpose**: Finalize output files with:
1. Header coloring (visual identification)
2. Omitted value replacement
3. Quality verification
**Prerequisites**:
- Output files from Step 2
- `data/AllQuestions.xlsx` (for mapping)
**Run**:
```bash
# Full post-processing (all 3 sub-steps)
python scripts/comprehensive_post_processor.py
# Skip specific steps
python scripts/comprehensive_post_processor.py --skip-colors
python scripts/comprehensive_post_processor.py --skip-replacement
python scripts/comprehensive_post_processor.py --skip-quality
```
**What it does**:
#### 3.1 Header Coloring
- 🟢 **Green headers**: Omission items (347 questions)
- 🔴 **Red headers**: Reverse-scoring items (264 questions)
- Priority: Red takes precedence over green
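The precedence rule can be sketched with openpyxl. The question-ID sets below are illustrative stand-ins for the mapping loaded from `AllQuestions.xlsx`.

```python
from openpyxl import Workbook
from openpyxl.styles import PatternFill

# Sketch of the header-coloring rule (red beats green); the question-ID
# sets are illustrative stand-ins for the AllQuestions.xlsx mapping.
GREEN = PatternFill("solid", start_color="00B050")  # omission items
RED = PatternFill("solid", start_color="FF0000")    # reverse-scored items

def color_headers(ws, omitted, reversed_items):
    """Color row-1 header cells; red takes precedence over green."""
    for cell in ws[1]:
        if cell.value in reversed_items:
            cell.fill = RED
        elif cell.value in omitted:
            cell.fill = GREEN
```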
#### 3.2 Omitted Value Replacement
- Replaces all values in omitted question columns with `"--"`
- Preserves header colors
- Processes all 10 domain files
#### 3.3 Quality Verification
- Data density check (>95% target)
- Response variance check (>0.5 target)
- Schema validation
- Generates `quality_report.json`
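The two numeric gates above can be sketched as a per-question check. The thresholds come from this guide; the input shape (a dict of question ID to response list, with `None` marking a missing value) is an assumption.

```python
import statistics

# Sketch of the density and variance gates; thresholds from the guide,
# the {question_id: responses} input shape is an assumption.
def quality_check(columns, density_target=0.95, variance_target=0.5):
    report = {}
    for qid, values in columns.items():
        filled = [v for v in values if v is not None]
        density = len(filled) / len(values)
        variance = statistics.pvariance(filled) if len(filled) > 1 else 0.0
        report[qid] = {
            "density_ok": density > density_target,
            "variance_ok": variance > variance_target,
        }
    return report
```

A report built this way is what a `quality_report.json` of pass/fail flags per question would serialize.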
**Output**:
- Processed files with colored headers and replaced omitted values
- Quality report: `output/full_run/quality_report.json`
---
## Pipeline Orchestrator
The `run_complete_pipeline.py` script orchestrates all 3 steps:
### Usage Examples
```bash
# Run all steps
python run_complete_pipeline.py --all
# Run specific step only
python run_complete_pipeline.py --step1
python run_complete_pipeline.py --step2
python run_complete_pipeline.py --step3
# Skip specific steps
python run_complete_pipeline.py --all --skip-prep
python run_complete_pipeline.py --all --skip-sim
python run_complete_pipeline.py --all --skip-post
# Dry run (5 students only)
python run_complete_pipeline.py --all --dry-run
```
### Options
| Option | Description |
|--------|-------------|
| `--step1` | Run only persona preparation |
| `--step2` | Run only simulation |
| `--step3` | Run only post-processing |
| `--all` | Run all steps (default if no step specified) |
| `--skip-prep` | Skip persona preparation |
| `--skip-sim` | Skip simulation |
| `--skip-post` | Skip post-processing |
| `--dry-run` | Run simulation with 5 students only |
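The flag handling in the table can be sketched with argparse. This mirrors the documented options; the "default to `--all` when no step is given" logic is an assumed reading of the table, not the orchestrator's actual source.

```python
import argparse

# Sketch of the orchestrator's flags, mirroring the options table above;
# the step-selection default is an assumption based on the table.
def parse_args(argv=None):
    p = argparse.ArgumentParser(description="Run the 3-step pipeline")
    for flag in ("--step1", "--step2", "--step3", "--all",
                 "--skip-prep", "--skip-sim", "--skip-post", "--dry-run"):
        p.add_argument(flag, action="store_true")
    args = p.parse_args(argv)
    if not (args.step1 or args.step2 or args.step3):
        args.all = True  # --all is the default when no step is specified
    return args
```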
---
## File Structure
```
Simulated_Assessment_Engine/
├── run_complete_pipeline.py              # Master orchestrator
├── main.py                               # Simulation engine
├── scripts/
│   ├── prepare_data.py                   # Step 1: Persona preparation
│   ├── comprehensive_post_processor.py   # Step 3: Post-processing
│   └── ...
├── data/
│   ├── merged_personas.xlsx              # Output from Step 1
│   └── AllQuestions.xlsx                 # Question mapping
└── output/
    └── full_run/
        ├── adolescense/
        │   ├── 5_domain/                 # 5 domain files
        │   └── cognition/                # 12 cognition files
        ├── adults/
        │   ├── 5_domain/                 # 5 domain files
        │   └── cognition/                # 12 cognition files
        └── quality_report.json           # Quality report from Step 3
```
---
## Troubleshooting
### Step 1 Issues
**Problem**: `fixed_3k_personas.xlsx` not found
- **Solution**: Ensure the file exists at `support/fixed_3k_personas.xlsx`, where Step 1 reads it (it originates from the persona factory in `FW_Pseudo_Data_Documents/`)
- **Note**: This file supplies the 22 enrichment columns used in Step 1
**Problem**: Student data files not found
- **Solution**: Check `3000-students.xlsx` and `3000_students_output.xlsx` in base directory or `support/` folder
### Step 2 Issues
**Problem**: API credit exhaustion
- **Solution**: Script will stop gracefully. Add credits and resume (it will skip completed students)
**Problem**: Simulation interrupted
- **Solution**: Re-run `python main.py --full`; it resumes from the last saved point
### Step 3 Issues
**Problem**: Header colors not applied
- **Solution**: Re-run post-processing: `python scripts/comprehensive_post_processor.py`
**Problem**: Quality check fails
- **Solution**: Review `quality_report.json` for specific issues. Most warnings are acceptable (e.g., Grit variance < 0.5)
---
## Best Practices
1. **Always run Step 1 first** to ensure `merged_personas.xlsx` is up-to-date
2. **Use dry-run for testing** before full production run
3. **Monitor API credits** during Step 2 (long-running process)
4. **Review quality report** after Step 3 to verify data quality
5. **Keep backups** of `merged_personas.xlsx` before regeneration
---
## Time Estimates
| Step | Duration | Notes |
|------|----------|-------|
| Step 1 | ~2 minutes | Persona preparation |
| Step 2 | 12-15 hours | Full 3,000 students (can be interrupted/resumed) |
| Step 3 | ~5 minutes | Post-processing |
**Total**: ~12-15 hours for complete pipeline
---
## Output Verification
After completing all steps, verify:
1. `data/merged_personas.xlsx` exists (3,000 rows, 79 columns)
2. `output/full_run/` contains 34 files (10 domain + 24 cognition)
3. Domain files have colored headers (green/red)
4. Omitted values are replaced with `"--"`
5. Quality report shows >95% data density
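Item 2 of the checklist can be automated with a short file count. The directory layout below follows the File Structure section of this guide (including the `adolescense` spelling used there).

```python
from pathlib import Path

# Sketch of an automated check for output-file counts; the layout
# follows the File Structure section of this guide.
def count_outputs(root="output/full_run"):
    root = Path(root)
    counts = {}
    for age in ("adolescense", "adults"):
        counts[age] = {
            "domain": len(list((root / age / "5_domain").glob("*.xlsx"))),
            "cognition": len(list((root / age / "cognition").glob("*.xlsx"))),
        }
    total = sum(v["domain"] + v["cognition"] for v in counts.values())
    return counts, total  # a complete run should total 34 files
```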
---
## Support
For issues or questions:
1. Check the `logs` file for detailed execution logs
2. Review `quality_report.json` for quality metrics
3. Check prerequisites for each step
4. Verify file paths and permissions
---
**Last Updated**: Final Production Version
**Status**: ✅ Production Ready