CP_Assessment_engine/WORKFLOW_GUIDE.md
2026-02-10 12:59:40 +05:30


Complete Workflow Guide - Simulated Assessment Engine

Overview

This guide explains the complete 3-step workflow for generating simulated assessment data:

  1. Persona Preparation: Merge persona factory output with enrichment data
  2. Simulation: Generate assessment responses for all students
  3. Post-Processing: Color headers, replace omitted values, verify quality

Quick Start

Run all 3 steps automatically:

# Full production run (3,000 students)
python run_complete_pipeline.py --all

# Dry run (5 students for testing)
python run_complete_pipeline.py --all --dry-run

Manual Workflow

Run each step individually:

# Step 1: Prepare personas
python scripts/prepare_data.py

# Step 2: Run simulation
python main.py --full

# Step 3: Post-process
python scripts/comprehensive_post_processor.py

Step-by-Step Details

Step 1: Persona Preparation

Purpose: Create merged_personas.xlsx by combining:

  • Persona factory output (from FW_Pseudo_Data_Documents/cogniprism_persona_factory_0402.py)
  • 22 enrichment columns from fixed_3k_personas.xlsx (goals, interests, strengths, etc.)
  • Student data from 3000-students.xlsx and 3000_students_output.xlsx

Prerequisites (all files within project):

  • support/fixed_3k_personas.xlsx (enrichment data with 22 columns)
  • support/3000-students.xlsx (student demographics)
  • support/3000_students_output.xlsx (StudentCPIDs from database)

Output: data/merged_personas.xlsx (3,000 students, 79 columns)

Run:

python scripts/prepare_data.py

What it does:

  1. Loads student data and CPIDs from support/ directory
  2. Merges on Roll Number
  3. Adds 22 enrichment columns from support/fixed_3k_personas.xlsx:
    • short_term_focus_1/2/3
    • long_term_focus_1/2/3
    • strength_1/2/3
    • improvement_area_1/2/3
    • hobby_1/2/3
    • clubs, achievements
    • expectation_1/2/3
    • segment, archetype
    • behavioral_fingerprint
  4. Validates and saves merged file
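The merge-and-validate steps above can be sketched with plain Python rows. This is a minimal illustration, not the code in scripts/prepare_data.py: the function name merge_on_roll_number and the "Name" column are hypothetical, and only the "Roll Number" join key and the StudentCPID values come from this guide.

```python
from typing import Dict, List


def merge_on_roll_number(
    students: List[Dict], cpids: List[Dict], key: str = "Roll Number"
) -> List[Dict]:
    """Join student rows with CPID rows on `key`; fail loudly on unmatched students."""
    cpid_by_roll = {row[key]: row for row in cpids}

    # Validation: every student must have a matching CPID row.
    missing = [row[key] for row in students if row[key] not in cpid_by_roll]
    if missing:
        raise ValueError(
            f"{len(missing)} students have no CPID row, e.g. {missing[:5]}"
        )

    # Student columns come first, followed by the extra CPID columns.
    return [{**row, **cpid_by_roll[row[key]]} for row in students]
```

Raising on unmatched roll numbers, rather than silently dropping rows, keeps the merged file at the expected 3,000 students or stops the run early.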

Step 2: Simulation

Purpose: Generate assessment responses for all students across:

  • 5 Survey Domains: Personality, Grit, Emotional Intelligence, Vocational Interest, Learning Strategies
  • 12 Cognition Tests: Memory, Reaction Time, Reasoning, Attention tasks

Prerequisites:

  • data/merged_personas.xlsx (from Step 1)
  • data/AllQuestions.xlsx (question mapping)
  • Anthropic API key in .env file
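A quick pre-flight check for these prerequisites might look like the sketch below. It assumes the key from the .env file has already been loaded into the process environment and that it is named ANTHROPIC_API_KEY; both the variable name and the helper missing_prerequisites are assumptions, not something this guide specifies.

```python
import os
from pathlib import Path

# File paths taken from the Step 2 prerequisites list.
REQUIRED_FILES = ["data/merged_personas.xlsx", "data/AllQuestions.xlsx"]


def missing_prerequisites(project_root=".", env=None, key_name="ANTHROPIC_API_KEY"):
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    root = Path(project_root)
    problems = [
        f"missing file: {rel}" for rel in REQUIRED_FILES if not (root / rel).is_file()
    ]
    # key_name is an assumed variable name; adjust it to match the .env file.
    if not env.get(key_name):
        problems.append(f"{key_name} is not set")
    return problems
```

Running this before python main.py --full avoids discovering a missing input after the long simulation has already started.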

Output: 34 Excel files in output/full_run/

  • 10 domain files (5 domains × 2 age groups)
  • 24 cognition files (12 tests × 2 age groups)

Run:

# Full production (3,000 students, ~12-15 hours)
python main.py --full

# Dry run (5 students, ~5 minutes)
python main.py --dry

Features:

  • Multithreaded processing (5 workers)
  • Incremental saving (safe to interrupt)
  • Resume capability (skips completed students)
  • Fail-safe mechanisms (retry logic, sub-chunking)
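To make the interplay of these features concrete, here is a minimal sketch of a resumable, multithreaded loop with retries. It is not the implementation in main.py: the names run_resumable and with_retry, the JSON progress file, and the simulate callback are all hypothetical, and sub-chunking is not shown.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def with_retry(fn, arg, attempts=3, delay=0.0):
    """Call fn(arg), retrying on any exception up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fn(arg)
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff


def run_resumable(student_ids, simulate, progress_path, workers=5):
    """Simulate each pending student in a thread pool, saving after every result."""
    path = Path(progress_path)
    done = json.loads(path.read_text()) if path.exists() else {}

    # Resume: skip students whose results are already on disk.
    pending = [s for s in student_ids if str(s) not in done]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(with_retry, simulate, s): s for s in pending}
        for future in as_completed(futures):
            done[str(futures[future])] = future.result()
            # Incremental save: written from the main thread only.
            path.write_text(json.dumps(done))
    return done
```

Because results are written only from the main thread as each future completes, no lock is needed, and an interrupted run loses at most the students that were still in flight.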

Progress Tracking:

  • Progress saved after each student
  • Can resume from interruption
  • Check the logs file for detailed progress

Step 3: Post-Processing

Purpose: Finalize output files with:

  1. Header coloring (visual identification)
  2. Omitted value replacement
  3. Quality verification

Prerequisites:

  • Output files from Step 2
  • data/AllQuestions.xlsx (for mapping)

Run:

# Full post-processing (all 3 sub-steps)
python scripts/comprehensive_post_processor.py

# Skip specific steps
python scripts/comprehensive_post_processor.py --skip-colors
python scripts/comprehensive_post_processor.py --skip-replacement
python scripts/comprehensive_post_processor.py --skip-quality

What it does:

3.1 Header Coloring

  • 🟢 Green headers: Omission items (347 questions)
  • 🚩 Red headers: Reverse-scoring items (264 questions)
  • Priority: Red takes precedence over green when a question is both an omission item and a reverse-scoring item
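The precedence rule can be expressed as a small lookup. This is only a sketch: the function name header_color, the hex color values, and the question codes in the usage note are hypothetical, and the actual post-processor may choose its fills differently.

```python
RED = "FF0000"    # assumed fill for reverse-scoring items
GREEN = "00FF00"  # assumed fill for omission items


def header_color(question_code, omission_items, reverse_items):
    """Return the header fill for a question, or None if it needs no color."""
    # Red is checked first, so it wins when a question is in both sets.
    if question_code in reverse_items:
        return RED
    if question_code in omission_items:
        return GREEN
    return None
```

For example, with omission_items = {"Q1", "Q2"} and reverse_items = {"Q2", "Q3"}, the question "Q2" receives the red fill because the reverse-scoring check runs first.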

3.2 Omitted Value Replacement

  • Replaces all values in omitted question columns with "--"
  • Preserves header colors
  • Processes all 10 domain files
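As an illustration of the replacement step, the sketch below works on rows held as dictionaries rather than on Excel files. The helper name replace_omitted and the sample column names in the usage note are hypothetical.

```python
def replace_omitted(rows, omitted_columns, placeholder="--"):
    """Return new rows where every value in an omitted column becomes the placeholder."""
    return [
        {col: (placeholder if col in omitted_columns else value)
         for col, value in row.items()}
        for row in rows
    ]
```

Called with rows such as [{"Q1": 4, "Q2": 2}] and omitted_columns = {"Q2"}, it leaves Q1 untouched and sets Q2 to "--". The input rows are not modified, which makes the step safe to re-run.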

3.3 Quality Verification

  • Data density check (>95% target)
  • Response variance check (>0.5 target)
  • Schema validation
  • Generates quality_report.json
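The two numeric checks can be sketched as follows. This guide does not define how density or variance is computed, so the definitions here are assumptions: density is the share of non-empty cells, and variance is the population variance of each question column, averaged over columns. The function names and report keys are hypothetical as well.

```python
import json
from pathlib import Path
from statistics import pvariance

DENSITY_TARGET = 0.95   # ">95%" target from the list above
VARIANCE_TARGET = 0.5   # ">0.5" target from the list above


def data_density(rows, columns):
    """Fraction of cells that are neither None nor an empty string."""
    total = len(rows) * len(columns)
    filled = sum(
        1 for row in rows for col in columns if row.get(col) not in (None, "")
    )
    return filled / total if total else 0.0


def mean_column_variance(rows, columns):
    """Average population variance of the numeric answers in each column."""
    variances = []
    for col in columns:
        values = [row[col] for row in rows if isinstance(row.get(col), (int, float))]
        if values:
            variances.append(pvariance(values))
    return sum(variances) / len(variances) if variances else 0.0


def build_report(rows, columns):
    """Collect both metrics and their pass/fail flags in one dictionary."""
    density = data_density(rows, columns)
    variance = mean_column_variance(rows, columns)
    return {
        "data_density": density,
        "density_ok": density > DENSITY_TARGET,
        "response_variance": variance,
        "variance_ok": variance > VARIANCE_TARGET,
    }


def write_report(report, path):
    """Save the report as indented JSON."""
    Path(path).write_text(json.dumps(report, indent=2))
```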

Output:

  • Processed files with colored headers and replaced omitted values
  • Quality report: output/full_run/quality_report.json

Pipeline Orchestrator

The run_complete_pipeline.py script orchestrates all 3 steps:

Usage Examples

# Run all steps
python run_complete_pipeline.py --all

# Run specific step only
python run_complete_pipeline.py --step1
python run_complete_pipeline.py --step2
python run_complete_pipeline.py --step3

# Skip specific steps
python run_complete_pipeline.py --all --skip-prep
python run_complete_pipeline.py --all --skip-sim
python run_complete_pipeline.py --all --skip-post

# Dry run (5 students only)
python run_complete_pipeline.py --all --dry-run

Options

Option        Description
--step1       Run only persona preparation
--step2       Run only simulation
--step3       Run only post-processing
--all         Run all steps (default if no step specified)
--skip-prep   Skip persona preparation
--skip-sim    Skip simulation
--skip-post   Skip post-processing
--dry-run     Run simulation with 5 students only
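One way these options could combine is sketched below with argparse. This is an assumption about the behavior, not the actual parsing code in run_complete_pipeline.py: the function resolve_steps is hypothetical, and in particular how the real script treats --all together with a --stepN flag is not stated in this guide.

```python
import argparse

FLAGS = (
    "--step1", "--step2", "--step3", "--all",
    "--skip-prep", "--skip-sim", "--skip-post", "--dry-run",
)


def resolve_steps(argv):
    """Return (sorted list of step numbers to run, dry_run flag)."""
    parser = argparse.ArgumentParser(description="Pipeline option sketch")
    for flag in FLAGS:
        parser.add_argument(flag, action="store_true")
    args = parser.parse_args(argv)

    selected = {n for n in (1, 2, 3) if getattr(args, f"step{n}")}
    # --all, or no step flag at all, selects every step.
    if args.all or not selected:
        selected = {1, 2, 3}

    skipped = {
        n
        for n, flag in ((1, args.skip_prep), (2, args.skip_sim), (3, args.skip_post))
        if flag
    }
    return sorted(selected - skipped), args.dry_run
```

Under these assumptions, resolve_steps(["--all", "--skip-sim", "--dry-run"]) selects steps 1 and 3 with the dry-run flag set, and an empty argument list selects all three steps.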

File Structure

Simulated_Assessment_Engine/
├── run_complete_pipeline.py          # Master orchestrator
├── main.py                            # Simulation engine
├── scripts/
│   ├── prepare_data.py               # Step 1: Persona preparation
│   ├── comprehensive_post_processor.py  # Step 3: Post-processing
│   └── ...
├── data/
│   ├── merged_personas.xlsx          # Output from Step 1
│   └── AllQuestions.xlsx             # Question mapping
└── output/
    └── full_run/
        ├── adolescense/
        │   ├── 5_domain/             # 5 domain files
        │   └── cognition/            # 12 cognition files
        ├── adults/
        │   ├── 5_domain/             # 5 domain files
        │   └── cognition/            # 12 cognition files
        └── quality_report.json       # Quality report from Step 3

Troubleshooting

Step 1 Issues

Problem: fixed_3k_personas.xlsx not found

  • Solution: Ensure the file exists at support/fixed_3k_personas.xlsx, the location listed under the Step 1 prerequisites
  • Note: This file contains 22 enrichment columns needed for persona enrichment

Problem: Student data files not found

  • Solution: Check 3000-students.xlsx and 3000_students_output.xlsx in base directory or support/ folder

Step 2 Issues

Problem: API credit exhaustion

  • Solution: The script stops gracefully. Add credits and re-run it; completed students are skipped

Problem: Simulation interrupted

  • Solution: Re-run python main.py --full. It resumes from the last saved point

Step 3 Issues

Problem: Header colors not applied

  • Solution: Re-run post-processing: python scripts/comprehensive_post_processor.py

Problem: Quality check fails

  • Solution: Review quality_report.json for specific issues. Most warnings are acceptable (e.g., Grit variance < 0.5)

Best Practices

  1. Always run Step 1 first to ensure merged_personas.xlsx is up-to-date
  2. Use dry-run for testing before full production run
  3. Monitor API credits during Step 2 (long-running process)
  4. Review quality report after Step 3 to verify data quality
  5. Keep backups of merged_personas.xlsx before regeneration

Time Estimates

Step     Duration      Notes
Step 1   ~2 minutes    Persona preparation
Step 2   12-15 hours   Full 3,000 students (can be interrupted/resumed)
Step 3   ~5 minutes    Post-processing

Total: ~12-15 hours for complete pipeline
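For planning purposes, the Step 2 estimate can be turned into a rough per-student figure. The arithmetic below assumes the 5 workers stay evenly busy for the whole run, which is an idealization; the helper name per_student_seconds is hypothetical.

```python
STUDENTS = 3000
WORKERS = 5


def per_student_seconds(total_hours, students=STUDENTS, workers=WORKERS):
    """Return (wall-clock seconds per student, seconds each worker spends per student)."""
    wall_clock = total_hours * 3600 / students
    return wall_clock, wall_clock * workers


low = per_student_seconds(12)   # (14.4, 72.0)
high = per_student_seconds(15)  # (18.0, 90.0)
```

Under that assumption, a 12-15 hour run corresponds to one finished student roughly every 14-18 seconds, or about 72-90 seconds of worker time per student.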


Output Verification

After completing all steps, verify:

  1. data/merged_personas.xlsx exists (3,000 rows, 79 columns)
  2. output/full_run/ contains 34 files (10 domain + 24 cognition)
  3. Domain files have colored headers (green/red)
  4. Omitted values are replaced with "--"
  5. Quality report shows >95% data density
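The file-count check in item 2 can be automated with a short script. This sketch assumes the folder layout shown under File Structure and that every output is an .xlsx file; the function names count_outputs and verify_outputs are hypothetical.

```python
from pathlib import Path

# Expected number of files per subfolder, per age group (5 + 12 = 17, times 2 = 34).
EXPECTED = {"5_domain": 5, "cognition": 12}
AGE_GROUPS = ("adolescense", "adults")  # folder names as spelled in File Structure


def count_outputs(root="output/full_run"):
    """Count .xlsx files in each age-group subfolder; missing folders count as 0."""
    counts = {}
    for group in AGE_GROUPS:
        for sub in EXPECTED:
            folder = Path(root) / group / sub
            counts[(group, sub)] = (
                len(list(folder.glob("*.xlsx"))) if folder.is_dir() else 0
            )
    return counts


def verify_outputs(root="output/full_run"):
    """Return a list of mismatches; an empty list means all 34 files are present."""
    return [
        f"{group}/{sub}: expected {EXPECTED[sub]}, found {found}"
        for (group, sub), found in count_outputs(root).items()
        if found != EXPECTED[sub]
    ]
```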

Support

For issues or questions:

  1. Check the logs file for detailed execution logs
  2. Review quality_report.json for quality metrics
  3. Check prerequisites for each step
  4. Verify file paths and permissions

Last Updated: Final Production Version
Status: Production Ready