10 December 2025
AI · Governance
8 min
Prompt governance in regulated AI environments
How to version, audit, and control prompts in financial and legal systems where every model interaction is a liability. Covers prompt registries, change gates, evaluation pipelines, and runtime attribution.
Prompt governance is the discipline of treating prompts as production artifacts — versioned, tested, audited, and attributable. In regulated environments this is not optional.
Why prompts are liabilities
Every prompt that reaches a model in a financial or legal system is a decision-making input. Regulators increasingly require that AI-assisted decisions be explainable and reproducible. A prompt that changed last Tuesday and produced a different answer today is an audit failure.
The standard response — "we'll add logging" — is insufficient. Logging records what happened. Governance controls what can happen.
The four layers of prompt governance
1. Registry
Every prompt in production lives in a versioned registry. Schema:
- id: canonical identifier (snake_case, stable across versions)
- version: semver
- status: draft | review | approved | deprecated
- owner: team or individual
- hash: content hash (detects silent mutations)
- created_at / approved_at / deprecated_at
No prompt executes in production without a registry entry in approved status.
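A minimal sketch of a registry entry in Python, assuming a sha256 content hash; the field names mirror the schema above, and `PromptStatus` is an illustrative enum rather than a prescribed API.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class PromptStatus(Enum):
    DRAFT = "draft"
    REVIEW = "review"
    APPROVED = "approved"
    DEPRECATED = "deprecated"


def content_hash(prompt_text: str) -> str:
    """Stable hash of the prompt body; detects silent mutations."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


@dataclass
class PromptRecord:
    id: str        # canonical snake_case identifier
    version: str   # semver, e.g. "2.1.0"
    text: str
    owner: str
    status: PromptStatus = PromptStatus.DRAFT
    hash: str = ""
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    approved_at: datetime | None = None
    deprecated_at: datetime | None = None

    def __post_init__(self) -> None:
        # The hash is computed at write time, never supplied by the caller.
        self.hash = content_hash(self.text)
```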
2. Change gates
Promotion from draft to approved requires:
- Diff review by a designated reviewer (not the author)
- Passing the evaluation suite with defined acceptance thresholds
- Sign-off captured in the registry (who, when, which evaluation run)
Emergency overrides exist but create an incident record.
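A sketch of the gate itself, reusing `PromptRecord` from the registry sketch; `EvalRun` and `Signoff` are hypothetical records standing in for whatever the registry actually stores.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    run_id: str
    passed: bool       # True if every threshold for this prompt was met


@dataclass
class Signoff:
    reviewer: str
    eval_run_id: str   # which evaluation run the reviewer signed off on
    timestamp: str     # ISO 8601


def can_promote(record: PromptRecord, signoff: Signoff, eval_run: EvalRun) -> bool:
    """Gate for promotion to approved: all three conditions must hold."""
    return (
        signoff.reviewer != record.owner            # reviewer is not the author
        and eval_run.passed                         # evaluation suite passed
        and signoff.eval_run_id == eval_run.run_id  # sign-off names this exact run
    )
```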
3. Evaluation pipelines
Each prompt version has an associated evaluation set: input-output pairs that define expected behavior. The pipeline runs on every promotion attempt and measures:
- Correctness (task-specific metrics)
- Refusal rate (for safety-relevant prompts)
- Consistency (same input → same output class across N runs)
- Latency (p50, p95, p99)
Thresholds are set at the prompt level, not globally — a legal extraction prompt and a summary prompt have different acceptance criteria.
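Per-prompt thresholds can be as simple as a mapping from metric to limit; a sketch with illustrative prompt IDs, metric names, and numbers, none of which are prescriptive.

```python
# Per-prompt thresholds: metric name -> (comparator, limit).
# IDs, metrics, and limits here are illustrative only.
THRESHOLDS = {
    "legal_extraction_v2": {
        "correctness": (">=", 0.98),
        "consistency": (">=", 0.99),
        "latency_p99_ms": ("<=", 2000),
    },
    "summary_v1": {
        "correctness": (">=", 0.90),
        "latency_p99_ms": ("<=", 800),
    },
}


def evaluate(prompt_id: str, metrics: dict[str, float]) -> bool:
    """Pass only if every metric defined for this prompt meets its threshold."""
    for name, (op, limit) in THRESHOLDS[prompt_id].items():
        value = metrics[name]
        ok = value >= limit if op == ">=" else value <= limit
        if not ok:
            return False
    return True
```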
4. Runtime attribution
Every model call in production carries:
- prompt_id + prompt_version in the request context
- outcome_hash in the response log
- caller_id (which system or user triggered the call)
This creates a complete causal chain from decision to prompt version to evaluation state at the time of approval.
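A sketch of the attribution envelope, assuming a generic `call_model` callable and an append-only log; the logged keys match the three fields above.

```python
import hashlib


def call_with_attribution(call_model, prompt: PromptRecord,
                          user_input: str, caller_id: str, log: list) -> str:
    """Wrap a model call so both request and response are attributable."""
    output = call_model(prompt.text, user_input)
    log.append({
        "prompt_id": prompt.id,
        "prompt_version": prompt.version,
        "caller_id": caller_id,
        "outcome_hash": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    })
    return output
```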
What this prevents
- Silent prompt drift (hash comparison catches it at registry write time; see the sketch below)
- Unapproved changes reaching production (gating)
- "We don't know which version produced this" (attribution)
- Inability to reproduce a past decision (version + evaluation history preserved)
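The drift check itself is small; a sketch reusing `content_hash` from the registry example, run whenever a registry entry is written or read back.

```python
def assert_no_drift(record: PromptRecord, stored_hash: str) -> None:
    """Reject any registry write whose content no longer matches its recorded hash."""
    if content_hash(record.text) != stored_hash:
        raise ValueError(
            f"Silent mutation detected for {record.id}@{record.version}: "
            "content hash does not match the registry entry"
        )
```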
Implementation starting point
The minimal viable governance stack: a Postgres table as the registry, a CI/CD step that runs the evaluation pipeline on every pull request, and a middleware layer that rejects calls to non-approved prompt IDs.
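A sketch of the two pieces that aren't plain CI config, assuming the dataclass fields from earlier; the DDL column types are one reasonable choice, not a mandated schema.

```python
# Minimal registry table; columns mirror the layer-1 schema.
REGISTRY_DDL = """
CREATE TABLE prompt_registry (
    id            text NOT NULL,
    version       text NOT NULL,
    text          text NOT NULL,
    status        text NOT NULL DEFAULT 'draft',
    owner         text NOT NULL,
    hash          text NOT NULL,
    created_at    timestamptz NOT NULL DEFAULT now(),
    approved_at   timestamptz,
    deprecated_at timestamptz,
    PRIMARY KEY (id, version)
);
"""


class PromptNotApproved(Exception):
    pass


def enforce_approved(lookup, prompt_id: str, version: str) -> PromptRecord:
    """Middleware check: only approved registry entries may execute."""
    record = lookup(prompt_id, version)  # e.g. a SELECT against prompt_registry
    if record is None or record.status is not PromptStatus.APPROVED:
        raise PromptNotApproved(f"{prompt_id}@{version} is not approved for production")
    return record
```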
More sophisticated implementations add a dedicated governance API (Prompt-Maker pattern), evaluation dashboards, and automated deprecation when successor versions are approved.
The infrastructure cost is low. The cost of not having it is measured in audit findings.