
10 December 2025 · AI · Governance · 8 min

Prompt governance in regulated AI environments

How to version, audit, and control prompts in financial and legal systems where every model interaction is a liability. Covers prompt registries, change gates, and evaluation pipelines.

Prompt governance is the discipline of treating prompts as production artifacts — versioned, tested, audited, and attributable. In regulated environments this is not optional.

Why prompts are liabilities

Every prompt that reaches a model in a financial or legal system is a decision-making input. Regulators increasingly require that AI-assisted decisions be explainable and reproducible. A prompt that changed last Tuesday and produced a different answer today is an audit failure.

The standard response — "we'll add logging" — is insufficient. Logging records what happened. Governance controls what can happen.

The four layers of prompt governance

1. Registry

Every prompt in production lives in a versioned registry. Schema:

  • id: canonical identifier (snake_case, stable across versions)
  • version: semver
  • status: draft | review | approved | deprecated
  • owner: team or individual
  • hash: content hash (detects silent mutations)
  • created_at / approved_at / deprecated_at

No prompt executes in production without a registry entry in approved status.

2. Change gates

Promotion from draft to approved requires:

  • Diff review by a designated reviewer (not the author)
  • Passing the evaluation suite with defined acceptance thresholds
  • Sign-off captured in the registry (who, when, which evaluation run)

Emergency overrides exist but create an incident record.
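A gate check like the one described can be expressed as a pure function that returns both a verdict and the reasons for rejection. This is a sketch under stated assumptions: the `Review` and `EvalRun` shapes and the `can_promote` name are invented for illustration, and thresholds are passed per prompt as the text requires:

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    reviewer: str
    approved: bool


@dataclass
class EvalRun:
    run_id: str
    scores: dict = field(default_factory=dict)  # metric name -> score


def can_promote(author: str, review: Review, eval_run: EvalRun,
                thresholds: dict) -> tuple:
    """Decide whether a draft may be promoted to approved.

    Returns (ok, reasons); an empty reasons list means all gates passed.
    """
    reasons = []
    # Gate 1: the reviewer must not be the author.
    if review.reviewer == author:
        reasons.append("reviewer must not be the author")
    if not review.approved:
        reasons.append("diff review not approved")
    # Gate 2: every prompt-level threshold must be met by the eval run.
    for metric, minimum in thresholds.items():
        score = eval_run.scores.get(metric)
        if score is None or score < minimum:
            reasons.append(f"metric {metric!r} missing or below threshold")
    return (not reasons, reasons)
```

Recording the returned reasons alongside the sign-off gives the registry a self-explaining rejection history, and the same function can back the emergency-override path by logging which gates were bypassed.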

3. Evaluation pipelines

Each prompt version has an associated evaluation set: input-output pairs that define expected behavior. The pipeline runs on every promotion attempt and measures:

  • Correctness (task-specific metrics)
  • Refusal rate (for safety-relevant prompts)
  • Consistency (same input → same output class across N runs)
  • Latency (p50, p95, p99)

Thresholds are set at the prompt level, not globally — a legal extraction prompt and a summary prompt have different acceptance criteria.
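Of the metrics above, consistency is the least standard, so a sketch may help: run the same input N times and report the fraction of runs that agree with the majority output class. The function name and signature here are hypothetical, and `classify` stands in for whatever model call the pipeline wraps:

```python
from collections import Counter


def consistency_rate(classify, text: str, n: int = 5) -> float:
    """Fraction of n runs whose output class matches the majority class.

    1.0 means fully deterministic behavior for this input; lower values
    quantify output drift across repeated calls.
    """
    outputs = [classify(text) for _ in range(n)]
    majority_count = Counter(outputs).most_common(1)[0][1]
    return majority_count / n
```

A per-prompt threshold then becomes a simple comparison, e.g. requiring `consistency_rate(...) >= 0.9` for a legal extraction prompt while accepting a looser bound for a summary prompt.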

4. Runtime attribution

Every model call in production carries:

  • prompt_id + prompt_version in the request context
  • outcome_hash in the response log
  • caller_id (which system or user triggered the call)

This creates a complete causal chain from decision to prompt version to evaluation state at the time of approval.
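One way to guarantee that every call carries these fields is to route all model invocations through a single wrapper. This is a minimal sketch, assuming a synchronous `model_fn(prompt_text, user_input)` callable and an append-only audit log; the function name `attributed_call` is invented for illustration:

```python
import hashlib


def attributed_call(model_fn, prompt_id: str, prompt_version: str,
                    prompt_text: str, user_input: str, caller_id: str,
                    audit_log: list) -> str:
    """Invoke the model and record the full attribution tuple."""
    response = model_fn(prompt_text, user_input)
    audit_log.append({
        "prompt_id": prompt_id,            # which registry entry
        "prompt_version": prompt_version,  # which approved version
        "caller_id": caller_id,            # which system or user triggered it
        # Hash of the response, so a logged outcome can later be matched
        # against a stored transcript without storing the text twice.
        "outcome_hash": hashlib.sha256(response.encode("utf-8")).hexdigest(),
    })
    return response
```

Because attribution happens inside the wrapper rather than at call sites, a missing field is impossible by construction rather than a code-review finding.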

What this prevents

  • Silent prompt drift (hash comparison catches it at registry write time)
  • Unapproved changes reaching production (gating)
  • "We don't know which version produced this" (attribution)
  • Inability to reproduce a past decision (version + evaluation history preserved)

Implementation starting point

The minimal viable governance stack: a Postgres table as the registry, a CI/CD step that runs the evaluation pipeline on PR, and a middleware layer that rejects calls to non-approved prompt IDs.
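The middleware piece of that stack reduces to a single lookup-and-reject check. A minimal sketch, assuming an in-memory registry keyed by (id, version) standing in for the Postgres table; the exception and function names are hypothetical:

```python
class PromptNotApproved(Exception):
    """Raised when a caller requests a prompt that is not approved."""


def enforce_approved(registry: dict, prompt_id: str, version: str) -> dict:
    """Return the registry record, or refuse the call outright.

    Anything not in 'approved' status -- drafts, deprecated versions,
    or unknown IDs -- is rejected before the model is ever invoked.
    """
    record = registry.get((prompt_id, version))
    if record is None or record["status"] != "approved":
        raise PromptNotApproved(f"{prompt_id}@{version} is not approved")
    return record
```

Failing closed here is the point: an unknown or deprecated prompt ID produces a hard error at the boundary, never a best-effort model call.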

More sophisticated implementations add a dedicated governance API (Prompt-Maker pattern), evaluation dashboards, and automated deprecation when successor versions are approved.

The infrastructure cost is low. The cost of not having it is measured in audit findings.