# CPC-CSD module (re-workflow)

This module (formerly referred to as CPC-CDC in code comments) covers **CPC/CSD document upload, OCR/extraction, validation against MSD payloads, audit history, dashboards, and Excel reports**. It was consolidated from the standalone **CPC-CSD** app into this backend.

## HTTP API

**CPC-CSD-compatible URLs** (the same paths as `CPC-CSD/server/src/routes/index.js` and Postman `CPC-CSD-Full-Flow`): `POST /api/upload`, `GET /api/documents/*`, `POST /api/v1/ocr/validate`, `POST /api/v1/ocr/validate-upload` (field **`file`**), `POST /api/v1/ocr/upload` (field **`files`**, max 20), and report downloads under `/api/v1/ocr/report/...`. These routes are registered from `src/routes/cpc-csd-compat.mount.ts` before `/api/v1` is mounted; set **`CPC_LEGACY_COMPAT_ROUTES=false`** to disable them.
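Because the bulk endpoint accepts at most 20 files in the **`files`** field, a client with a larger set could split it into batches before uploading. The helper below is a minimal sketch of that idea; `chunkForBulkUpload` and `MAX_FILES_PER_BULK_UPLOAD` are hypothetical names, not part of this backend.

```typescript
// Limit taken from this document: POST /api/v1/ocr/upload accepts max 20 files.
const MAX_FILES_PER_BULK_UPLOAD = 20;

// Hypothetical client-side helper: splits a list into batches that each fit
// within the bulk upload limit. Order of the input is preserved.
function chunkForBulkUpload<T>(
  files: T[],
  maxPerRequest: number = MAX_FILES_PER_BULK_UPLOAD
): T[][] {
  if (!Number.isInteger(maxPerRequest) || maxPerRequest < 1) {
    throw new RangeError("maxPerRequest must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < files.length; i += maxPerRequest) {
    batches.push(files.slice(i, i + maxPerRequest));
  }
  return batches;
}
```

Each batch can then be sent as one multipart request with its files in the `files` field.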

**Namespaced API** — canonical prefix **`/api/v1/cpc-csd`**; legacy alias **`/api/v1/cpc-cdc`** (`src/routes/cpc-cdc.routes.ts`) mounts the same handlers and auth.

| Method | Path (prefix **`/api`** or **`/api/v1/cpc-csd`** or legacy **`/api/v1/cpc-cdc`**) | Purpose |
|--------|------|---------|
| POST | `/upload` | GCS-only: multipart field **`file`** → `{ gcsUrl }` (compat: **`/api/upload`**) |
| POST | `/v1/ocr/validate` | JSON URL mode — returns **400** with legacy message (use validate-upload) |
| POST | `/v1/ocr/validate-upload` | Single file field **`file`** + `claim_id` / `msd_payload` / … |
| POST | `/v1/ocr/upload` | Bulk: field **`files`** (max 20) + `metadata_queue` or `msd_payload` / `document_type` |
| GET | `/documents/analytics` | Totals, pass rate, distribution, `dailyVolume`, `topMismatchFields` |
| GET | `/documents/history` | `claimId` query — attempts grouped |
| GET | `/documents/recent` | Paginated list; query: `page`, `limit`, `search`, `status`, `type`, `sortBy`, `order` |
| GET | `/documents/:id/file` | Authenticated file bytes for preview (browser cannot use `gs://` directly) |
| GET | `/documents/:id` | Document + audit logs + `field_results` |
| PUT | `/documents/:id/status` | Manual status / corrected fields |
| DELETE | `/documents/:id` | Remove document row |
| GET | `/v1/ocr/report/:claimId/download` | Per-claim Excel |
| GET | `/v1/ocr/report/all/download` | Master Excel (supports `search`, `status`, `type`) |

Compat paths are under **`/api/...`**; namespaced routes are **`/api/v1/cpc-csd/...`** with **`/api/v1/cpc-cdc/...`** as an alias (same path suffixes as in the table’s second column).
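The prefix rule above can be sketched as a small URL builder. This is an illustration only: `cpcUrl` and the `CpcPrefix` labels are hypothetical, while the three prefixes and the query parameter names come from the table.

```typescript
// The three mount points described in this document.
type CpcPrefix = "compat" | "namespaced" | "legacyAlias";

const PREFIXES: Record<CpcPrefix, string> = {
  compat: "/api",
  namespaced: "/api/v1/cpc-csd",
  legacyAlias: "/api/v1/cpc-cdc",
};

// Hypothetical helper: joins a prefix with a path suffix from the table and
// appends any defined query parameters (undefined values are skipped).
function cpcUrl(
  prefix: CpcPrefix,
  suffix: string,
  query: Record<string, string | number | undefined> = {}
): string {
  const path = PREFIXES[prefix] + (suffix.startsWith("/") ? suffix : "/" + suffix);
  const pairs = Object.entries(query)
    .filter(([, value]) => value !== undefined)
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`);
  return pairs.length > 0 ? `${path}?${pairs.join("&")}` : path;
}

// Example: the paginated list under the canonical namespaced prefix.
const recentUrl = cpcUrl("namespaced", "/documents/recent", { page: 2, limit: 25 });
```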

## Database

Sequelize models: **`CpcDocument`** (`cpc_documents`), **`CpcAuditLog`** (`cpc_audit_logs`). Migration: `src/migrations/2026041300-create-cpc-cdc-tables.ts`.

**Admin viewer list** is stored under `admin_configurations.config_key = CPC_CSD_ADMIN_CONFIG` (migration `20260416120000-rename-cpc-cdc-admin-config-key.ts` renames the legacy `CPC_CDC_ADMIN_CONFIG` row when applied).

On **application startup**, `ensureCpcCdcSchema()` (`src/services/cpc-cdc/ensureCpcCdcSchema.ts`) runs after the DB connection is established and issues `CREATE TABLE IF NOT EXISTS`, so the tables exist even if migrations were skipped. You should still run `npm run migrate` to keep a full schema history.

Notable columns on `cpc_documents`: `booking_id`, `claim_id`, `attempt_no`, `document_type`, `document_gcp_url`, `provider`, JSONB `msd_payload`, `extracted_fields`, `field_confidence`, `validation_status`, `match_percentage`, `mismatch_reasons`, `field_results`, `ip_address`.

Unique index: `(claim_id, attempt_no, document_type)` — important when migrating legacy data with duplicates.
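Before importing legacy rows, it may help to check which of them would collide on this unique index. The sketch below assumes a simplified row shape (`LegacyRow` and its field names are hypothetical) and only reports the colliding keys; how to resolve them, for example by renumbering attempts, is left to the migration.

```typescript
// Hypothetical, simplified shape of a legacy row; real column names may differ.
interface LegacyRow {
  claimId: string;
  attemptNo: number;
  documentType: string;
}

// Returns the (claim_id, attempt_no, document_type) keys that occur more than
// once, serialized as JSON arrays, in order of first appearance.
function findDuplicateKeys(rows: LegacyRow[]): string[] {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const key = JSON.stringify([row.claimId, row.attemptNo, row.documentType]);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const duplicates: string[] = [];
  counts.forEach((count, key) => {
    if (count > 1) duplicates.push(key);
  });
  return duplicates;
}
```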

## Environment variables

Copy **`re-workflow-be/.env.example`** to `.env` and adjust. Typical keys (see `CpcCdcController` and `src/services/cpc-cdc/*`):

- **`GCP_PROJECT_ID`** — GCP project for Vertex / optional Document AI.
- **`VERTEX_AI_LOCATION`** — Vertex region (e.g. `asia-south1`).
- **`DOC_AI_PROCESSOR_ID`** — Optional; when set and valid, Document AI OCR may run before Gemini.
- **`GCP_LOCATION_DOC_AI`** — Document AI region (default `us`).
- **GCS** — Bucket/credentials as required by `CpcGcsService` (service account via `GOOGLE_APPLICATION_CREDENTIALS` or workload identity).
- **`CPC_ALLOW_DEGRADED_SAVE_WITHOUT_AI`** controls whether a document is still saved after failed or missing Vertex extraction:
  - **`true`**: always allow saving after failed/missing Vertex.
  - **`false`**: in **production** only, disallow degraded saves.
  - **Omitted in non-production**: degraded saves are **allowed** so local CPC works without GCP; set to **`false`** in dev to force strict Vertex.
  - **Omitted in production**: strict (Vertex required unless `RULES` provider).
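The degraded-save rules can be read as a small decision function. The sketch below is one interpretation, not the backend's actual code. It assumes production is detected via `NODE_ENV === "production"`, that `false` is strict in every environment, and that unrecognized values behave as if the variable were omitted. The `RULES` provider exception is not modeled, and `allowDegradedSave` is a hypothetical name.

```typescript
// Hypothetical helper mirroring the documented policy for
// CPC_ALLOW_DEGRADED_SAVE_WITHOUT_AI. Inputs are passed in explicitly so the
// function stays pure and easy to test.
function allowDegradedSave(
  flag: string | undefined,
  nodeEnv: string | undefined
): boolean {
  const normalized = flag?.trim().toLowerCase();
  if (normalized === "true") return true;   // always allow degraded saves
  if (normalized === "false") return false; // assumption: strict everywhere
  // Omitted (or unrecognized): allowed outside production, strict in production.
  return nodeEnv !== "production";
}
```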

**Extraction behaviour (upload response):**

- **`extraction_source`: `vertex_gemini`** — Fields came from the Vertex Gemini API (document bytes + optional Document AI OCR text).
- **`extraction_source`: `rules_engine`** — Provider was **`RULES`**; fields come from `CpcRuleExtractService` on OCR text only (no Gemini).
- **`extraction_source`: `degraded_empty`** — Extraction was skipped, failed, or (in **non-production**) hit a **Vertex auth / ADC** problem; the row is still stored with empty `extracted_fields` so you can test DB/history. In production this only happens when **`CPC_ALLOW_DEGRADED_SAVE_WITHOUT_AI=true`** or missing `GCP_PROJECT_ID` with degraded policy.
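A client can treat `extraction_source` as a closed set of three values. The following sketch assumes a minimal response shape (`UploadResult` is hypothetical; only `extraction_source` and `extracted_fields` are named in this document) and shows one way to branch on it exhaustively.

```typescript
// The three values documented above.
type ExtractionSource = "vertex_gemini" | "rules_engine" | "degraded_empty";

// Hypothetical subset of the upload response; the real payload has more fields.
interface UploadResult {
  extraction_source: ExtractionSource;
  extracted_fields: Record<string, unknown>;
}

// Returns a short human-readable note for each extraction source.
function describeExtraction(result: UploadResult): string {
  switch (result.extraction_source) {
    case "vertex_gemini":
      return "Fields extracted by Vertex Gemini";
    case "rules_engine":
      return "Fields extracted by the rules engine from OCR text";
    case "degraded_empty":
      return "Stored without extracted fields (degraded save)";
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a new
      // source value is added without a matching case.
      const unreachable: never = result.extraction_source;
      throw new Error(`Unknown extraction_source: ${String(unreachable)}`);
    }
  }
}
```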

## One-off data migration from legacy Prisma DB

If you still have the old **`Document`** / **`AuditLog`** tables (CPC-CSD Prisma schema) in PostgreSQL, run:

```bash
npm run migrate:cpc-csd
```

**`CPC_CSD_DATABASE_URL`** is optional. If set, rows are read from that database and written to the database in **`DATABASE_URL`** (re-workflow). If unset, both reads and writes use **`DATABASE_URL`**, so the legacy and new table sets must both exist there.
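The source/target selection described above amounts to a simple fallback. The sketch below illustrates it under the assumption that an empty or whitespace-only `CPC_CSD_DATABASE_URL` counts as unset; `resolveMigrationDbUrls` and `MigrationEnv` are hypothetical names, not the migration script's actual API.

```typescript
// Only the two variables named in this document are modeled.
interface MigrationEnv {
  DATABASE_URL?: string;
  CPC_CSD_DATABASE_URL?: string;
}

// Picks where the migration reads from and writes to.
// Writes always go to DATABASE_URL; reads fall back to it when
// CPC_CSD_DATABASE_URL is not provided.
function resolveMigrationDbUrls(env: MigrationEnv): { sourceUrl: string; targetUrl: string } {
  const targetUrl = env.DATABASE_URL;
  if (!targetUrl) {
    throw new Error("DATABASE_URL is required");
  }
  const legacyUrl = env.CPC_CSD_DATABASE_URL;
  const sourceUrl = legacyUrl && legacyUrl.trim() !== "" ? legacyUrl : targetUrl;
  return { sourceUrl, targetUrl };
}
```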

After migration, spot-check history, document detail, and Excel downloads, then decommission the legacy app.