The joint research by Apollo Research and OpenAI, “Stress Testing Deliberative Alignment for Anti-Scheming Training”, highlights both the promise and limits of heuristic alignment, especially when models exhibit situational awareness that can mask misalignment (“scheming”). Formal Foundry acknowledges the progress and presents formal verification as the complementary step for domains where failure is unacceptable.
Our approach doesn’t replace alignment techniques; it adds mathematical guarantees where probabilistic controls alone are insufficient.
Three Formal Foundry insights on AI scheming
1) Augment heuristic alignment with proof-checked correctness
Deliberative alignment can steer internal reasoning toward a safety spec. For critical use cases, we go further: the model must produce a formal proof (e.g., in Agda) that its output satisfies a safety property. An external type checker validates the proof, yielding a machine-checked guarantee rather than a high-probability claim.
Concrete example (sketch):
- Task: Summarize a patient chart while never proposing medication changes.
- Spec (in logic): For all outputs o, Prohibit(o, MedicationChange).
- Obligation: The LLM emits a summary and an Agda proof that Prohibit holds.
- Check: The Agda type checker accepts or rejects the proof. If rejected, the output is discarded.
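To make the loop concrete, here is a minimal sketch (not our production code) of the fail-closed gate, assuming the model returns both a summary and an Agda source file encoding Prohibit, and that the `agda` executable is on PATH (exit code 0 means the proof type-checks):

```python
# Minimal sketch of the fail-closed gate described above.
# Assumptions: the model emits a summary plus an Agda source file whose type
# encodes "Prohibit(o, MedicationChange)", and `agda <file>` type-checks it.
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class GatedResult:
    summary: str | None   # released only if the proof checks
    verified: bool
    reason: str


def check_agda_proof(agda_source: str) -> bool:
    """Write the emitted proof to disk and ask the Agda type checker to accept it."""
    with tempfile.TemporaryDirectory() as tmp:
        proof_path = Path(tmp) / "Prohibit.agda"
        proof_path.write_text(agda_source)
        result = subprocess.run(["agda", str(proof_path)], capture_output=True, text=True)
        return result.returncode == 0


def release(summary: str, agda_proof: str) -> GatedResult:
    """Fail-closed: the summary leaves the system only with a machine-checked proof."""
    if check_agda_proof(agda_proof):
        return GatedResult(summary, True, "proof accepted by the Agda type checker")
    return GatedResult(None, False, "proof rejected; output discarded")
```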
2) Make ambiguity impossible with a formal specification
Scheming often exploits gaps between a high-level task and the underlying safety goal. We remove this gap by defining a single, unambiguous objective as a Correctness Theorem before generation. The model then targets that theorem, not an informal description.
Actionable structure:
- Domain formalization: Identify entities, invariants, and allowed operations.
- Property definition: State safety and compliance properties as predicates.
- Correctness theorem: Encode the exact obligation the output must prove.
- Proof-guided generation: Constrain the LLM to produce output + proof.
- Independent checking: Reject anything that fails proof verification.
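A minimal sketch of that structure in code, with hypothetical names (`generate_with_proof` and `verify` stand in for the model call and the independent proof checker; neither is a published Formal Foundry API):

```python
# Sketch of the five-step structure above, using illustrative names only.
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CorrectnessTheorem:
    name: str          # e.g. "summary_never_proposes_medication_change"
    statement: str     # the exact formal obligation the output must prove


@dataclass(frozen=True)
class ProofCarryingOutput:
    artifact: str      # the AI-generated artifact (report, plan, ...)
    proof: str         # machine-checkable proof that the theorem holds


def proof_guided_generation(
    theorem: CorrectnessTheorem,
    generate_with_proof: Callable[[CorrectnessTheorem], ProofCarryingOutput],
    verify: Callable[[CorrectnessTheorem, ProofCarryingOutput], bool],
) -> ProofCarryingOutput | None:
    """Constrain generation to the theorem, then independently check the proof.

    Returns the artifact only if verification succeeds; otherwise None (fail-closed).
    """
    candidate = generate_with_proof(theorem)          # step 4: output + proof
    if verify(theorem, candidate):                    # step 5: independent checking
        return candidate
    return None                                       # anything unproven is rejected
```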
3) Verify externally to transcend situational awareness
If a model behaves differently when “under evaluation,” alignment can appear better than it is. Our guarantees do not depend on the model’s internal state: the proof is externally checked. We optionally produce a semi-natural-language explanation derived from the formal proof so auditors can understand why the guarantee holds.
What this enables in practice
- Regulated workflows: Attach a machine-checked proof to every AI-generated artifact (report, transformation, recommendation).
- Fail-closed design: If a proof doesn’t check, the system returns a safe fallback or requests clarification—never an unchecked result.
- Traceability: Each decision links to a specific theorem and proof term, enabling audits and root-cause analysis.
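One possible shape for the fail-closed response and the traceability record, sketched with illustrative names (`AuditRecord`, `respond`); the real artifacts carry more detail:

```python
# Hypothetical sketch of the fail-closed and traceability points above: every
# decision is linked to a theorem and a proof term, and an unverified result is
# never returned.
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    theorem_name: str
    proof_term_sha256: str      # links the decision to a specific proof term
    verified: bool
    timestamp: str


def audit(theorem_name: str, proof_term: str, verified: bool) -> AuditRecord:
    return AuditRecord(
        theorem_name=theorem_name,
        proof_term_sha256=hashlib.sha256(proof_term.encode()).hexdigest(),
        verified=verified,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def respond(artifact: str, theorem_name: str, proof_term: str, proof_checks: bool):
    """Fail-closed: return the artifact only with a verified proof, else a safe fallback."""
    record = audit(theorem_name, proof_term, proof_checks)
    if proof_checks:
        return {"result": artifact, "audit": record}
    return {"result": None, "audit": record,
            "message": "Cannot confirm safety; please clarify or use the safe fallback."}
```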
Business example (illustrative): Discharge medication plans with machine‑checked safety
Illustrative only; not medical advice.
Business problem & why it matters
Discharging a patient with an incorrect or unsafe medication plan can trigger avoidable readmissions, pharmacist rework, delays in bed turnover, and regulatory exposure. Today’s safeguards—clinical decision support pop‑ups and manual pharmacist checks—reduce risk but still allow edge‑case failures and inconsistent documentation. Leadership asks for something stronger than “the model is usually right”: a predictable, auditable, and fail‑closed process that never releases an unchecked plan.
What this solution delivers
- On‑demand, safe discharge plans: Clinicians receive a proposed plan that is accompanied by a machine‑checked safety guarantee. If the guarantee can’t be established, the system returns a clear “cannot confirm safety” message with the reason and suggested next steps.
- Operational reliability: The system is fail‑closed: it will not emit a plan unless an independent checker validates safety against a fixed, signed interaction database snapshot.
- Full traceability: Every plan ships with an audit bundle (manifest, DB snapshot hash, and query traces) suitable for internal QA and external review.
- Workflow fit: Integrates with the EHR and pharmacy systems to read diagnoses, allergies, and current medications and to write back the approved plan and its audit bundle. Pharmacists can override or annotate as needed.
Who uses it & where it fits
- Discharging teams (hospitalists, residents): Generate a safe draft in seconds; escalate only when validation fails.
- Pharmacy & medication reconciliation: Focus time on the minority of complex cases flagged by the checker instead of re‑checking straightforward ones.
- Quality & compliance: Use the bundled proofs and traces to demonstrate consistent process control during audits.
- IT & data governance: Manage the signed drug‑interaction snapshot lifecycle and monitor system health through structured metrics.
Inputs (business view)
- Patient Profile P: demographics, active conditions, allergies, and current meds (read‑only from the EHR).
- Proposed Plan R: adds/keeps/stops with dose, route, frequency (generated by the model or clinician‑authored).
- Interaction Oracle Ω: a read‑only, versioned interaction/label database; each release is content‑addressed (e.g., sha256Ω) and signed by data governance.
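For illustration, one possible data model for these inputs (field names are assumptions, not a published schema); the snapshot’s identity is simply the hash of its contents:

```python
# Illustrative data model for P, R, and Ω; the oracle snapshot is identified
# by its content hash (sha256Ω).
import hashlib
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class PatientProfile:              # P
    patient_id: str
    conditions: tuple[str, ...]    # active diagnoses
    allergies: tuple[str, ...]
    current_meds: tuple[str, ...]


@dataclass(frozen=True)
class MedOrder:
    drug: str
    action: Literal["add", "keep", "stop"]
    dose_mg: float
    route: str
    frequency: str


@dataclass(frozen=True)
class ProposedPlan:                # R
    orders: tuple[MedOrder, ...]


@dataclass(frozen=True)
class OracleSnapshot:              # Ω, content-addressed and signed by data governance
    release: str                   # e.g. "2025-09"
    content: bytes                 # serialized interaction/label database

    @property
    def sha256(self) -> str:       # the sha256Ω pinned in every manifest
        return hashlib.sha256(self.content).hexdigest()
```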
The safety contract (technical, kept high‑level)
The system publishes a single, unambiguous obligation:
SafePlan(P, R, Ω) holds iff the plan has no drug–drug interactions, no condition contraindications, no allergy conflicts, and all doses are within labeled bounds for the patient—as judged strictly against the signed snapshot Ω@sha256Ω.
Concretely, four named properties must all be satisfied:
- NoDDI: nothing in R.add_or_keep interacts with the patient’s current or kept meds.
- NoContra: nothing added or kept is contraindicated for the patient’s conditions.
- NoAllergy: nothing added or kept matches recorded allergies.
- DoseInRange: each added/kept medication’s dose is appropriate for P.
Correctness obligation (plain language):
For any concrete patient P, the system may return a plan R only if an independent checker verifies SafePlan(P, R, Ω) against the fixed, signed database snapshot sha256Ω.
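Read as executable pseudocode, the contract is a conjunction of the four named properties, judged only through oracle queries against the pinned snapshot. The sketch below assumes the data model above and treats the oracle query functions (`interacts`, `contraindicated`, `allergy_conflict`, `dose_in_label_range`) as hypothetical callables backed by Ω:

```python
# SafePlan(P, R, Ω) as a conjunction of NoDDI, NoContra, NoAllergy, DoseInRange.
# The four oracle callables are assumptions standing in for queries against Ω.
from typing import Callable


def safe_plan(
    patient,                     # P: PatientProfile from the sketch above
    plan,                        # R: ProposedPlan from the sketch above
    interacts: Callable[[str, str], bool],
    contraindicated: Callable[[str, str], bool],
    allergy_conflict: Callable[[str, str], bool],
    dose_in_label_range: Callable[[object, object], bool],
) -> bool:
    """SafePlan(P, R, Ω): all four named properties must hold."""
    add_or_keep = [o for o in plan.orders if o.action in ("add", "keep")]
    kept_or_current = set(patient.current_meds) | {o.drug for o in add_or_keep}

    no_ddi = all(not interacts(o.drug, other)
                 for o in add_or_keep for other in kept_or_current if other != o.drug)
    no_contra = all(not contraindicated(o.drug, c)
                    for o in add_or_keep for c in patient.conditions)
    no_allergy = all(not allergy_conflict(o.drug, a)
                     for o in add_or_keep for a in patient.allergies)
    dose_in_range = all(dose_in_label_range(o, patient) for o in add_or_keep)

    return no_ddi and no_contra and no_allergy and dose_in_range
```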
End‑to‑end flow (what actually happens)
- Draft & propose: The model (or a clinician) proposes R and simultaneously emits a proof artifact claiming that SafePlan(P, R, Ω) holds, plus a manifest that pins the exact oracle snapshot (sha256Ω) and query traces for every interaction/label check performed.
- Independent validation: A separate checker service (not the model) verifies the proof artifact and replays all oracle queries strictly against Ω@sha256Ω.
  - Pass: Plan + proof are returned to the EHR; the user sees “Safety verified” along with a human‑readable explanation and any relevant caveats (e.g., “No interactions detected given snapshot 2025‑09 release”).
  - Fail: The output is rejected. The user receives a concise reason (e.g., “Interaction detected between A and B” or “Dose outside labeled bounds”) and suggested alternatives or prompts for clarification.
- Audit bundling & storage: The system stores (R, manifest, sha256Ω, query traces, validation result) as a tamper‑evident record. QA can reconstruct what was checked, with which data, and why it passed or failed.
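A sketch of the checker side, reusing the illustrative types above; the formal proof check itself is elided, and the focus is on snapshot pinning, query replay, and the tamper‑evident audit bundle:

```python
# Sketch of the independent checker: accept only manifests pinned to the
# promoted snapshot, replay every recorded oracle query against it, and store
# a tamper-evident audit bundle either way. Types and names are illustrative.
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Manifest:
    snapshot_sha256: str                       # sha256Ω pinned by the proposer
    query_traces: tuple[tuple[str, bool], ...] # (query, claimed answer) pairs


def validate(plan, manifest: Manifest, snapshot, replay_query) -> dict:
    """Replay the proposer's oracle queries strictly against Ω@sha256Ω."""
    if manifest.snapshot_sha256 != snapshot.sha256:
        return {"verified": False, "reason": "manifest pinned to a different snapshot"}
    for query, claimed in manifest.query_traces:
        if replay_query(snapshot, query) != claimed:
            return {"verified": False, "reason": f"oracle disagreement on: {query}"}
    return {"verified": True, "reason": "all oracle queries replayed successfully"}


def audit_bundle(plan, manifest: Manifest, snapshot, result: dict) -> dict:
    """Tamper-evident record: a hash over the stored fields detects later edits."""
    record = {
        "plan": plan,
        "manifest": asdict(manifest),
        "snapshot_sha256": snapshot.sha256,
        "validation": result,
    }
    record["bundle_sha256"] = hashlib.sha256(
        json.dumps(record, sort_keys=True, default=str).encode()
    ).hexdigest()
    return record
```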
What the clinician sees (UX snapshot)
- A “Generate safe plan” action in the discharge workflow.
- A result panel with: proposed meds and dosing, “Safety: Verified” (or “Safety: Not verified” with clear reasons), and a View details link showing the snapshot ID and an understandable rationale derived from the proof.
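One possible payload behind that result panel (illustrative field names only):

```python
# Illustrative shape of the result-panel payload shown to the clinician.
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultPanel:
    proposed_meds: tuple[str, ...]       # drug, dose, route, frequency strings
    safety_status: str                   # "Verified" or "Not verified"
    reasons: tuple[str, ...]             # empty when verified
    snapshot_id: str                     # e.g. the 2025-09 release / sha256Ω
    rationale: str                       # human-readable explanation derived from the proof
```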
Change management & governance
- Snapshot releases: Pharmacy/medical governance publishes Ω updates on a cadence (e.g., monthly). Each release is signed and content‑addressed (sha256Ω).
- Controlled rollout: New snapshots are staged in lower environments; once promoted, the checker only uses the promoted snapshot, ensuring reproducibility.
- Audit readiness: Every approval is traceable to a snapshot and a specific set of oracle queries; this supports internal review and external regulators.
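A sketch of the promotion gate, assuming governance publishes the expected sha256Ω out of band; real deployments would also verify a cryptographic signature, which is elided here:

```python
# Content-addressing and a hash-pinned promotion check for Ω releases.
import hashlib
import hmac


def content_address(snapshot_bytes: bytes) -> str:
    """sha256Ω: the identity of a published interaction-database release."""
    return hashlib.sha256(snapshot_bytes).hexdigest()


def promote(snapshot_bytes: bytes, expected_sha256: str) -> str:
    """Accept a snapshot into the checker's environment only if its hash matches."""
    actual = content_address(snapshot_bytes)
    if not hmac.compare_digest(actual, expected_sha256):
        raise ValueError("snapshot hash mismatch; refusing to promote")
    return actual   # the checker pins this value for all subsequent validations
```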
KPIs & acceptance tests (business outcomes)
- Throughput: % of discharge plans auto‑validated without pharmacist rework.
- Safety yield: % of unsafe proposals caught by the checker before reaching clinicians.
- Cycle time: Median time from “Generate plan” to “Plan verified or escalated.”
- Explainability quality: Clinician‑rated clarity of the safety rationale.
- Reproducibility: Given the same P, R, and sha256Ω, validation results match 100% of the time.
Sample acceptance tests
- If any interaction is present per Ω, validation must fail with a specific reason.
- If Ω changes (new sha256Ω), prior validations remain reproducible against their original snapshot.
- If the proof artifact is malformed or incomplete, the system must reject the plan and surface a friendly remediation path (e.g., “Need dose for Med X”).
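These tests can be written down directly. The pytest‑style sketch below runs against a tiny in‑file stand‑in for the checker so it is self‑contained; the real suite would exercise the checker service itself:

```python
# Pytest-style sketches of the three acceptance tests above, written against a
# minimal stand-in checker; snapshot data and reasons are illustrative.
SNAPSHOT_2025_09 = {"sha256": "aaa111", "interactions": {("warfarin", "aspirin")}}
SNAPSHOT_2025_10 = {"sha256": "bbb222", "interactions": set()}


def validate(plan_drugs, proof_ok, snapshot):
    """Stand-in checker: fail on a malformed proof or any interaction in the snapshot."""
    if not proof_ok:
        return {"verified": False, "reason": "malformed proof artifact"}
    for a, b in snapshot["interactions"]:
        if a in plan_drugs and b in plan_drugs:
            return {"verified": False, "reason": f"Interaction detected between {a} and {b}"}
    return {"verified": True, "reason": "no interactions per snapshot " + snapshot["sha256"]}


def test_interaction_must_fail_with_specific_reason():
    result = validate({"warfarin", "aspirin"}, proof_ok=True, snapshot=SNAPSHOT_2025_09)
    assert result["verified"] is False
    assert "Interaction detected" in result["reason"]


def test_prior_validations_reproducible_against_original_snapshot():
    # A new release (new sha256Ω) must not change the outcome recorded for the old one.
    original = validate({"warfarin", "aspirin"}, proof_ok=True, snapshot=SNAPSHOT_2025_09)
    replayed = validate({"warfarin", "aspirin"}, proof_ok=True, snapshot=SNAPSHOT_2025_09)
    assert original == replayed


def test_malformed_proof_is_rejected():
    result = validate({"lisinopril"}, proof_ok=False, snapshot=SNAPSHOT_2025_10)
    assert result["verified"] is False
    assert "malformed" in result["reason"]
```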
Integration & security notes
- Data exchange: Reads patient conditions, allergies, and meds; writes back only validated plans and their audit bundle (IDs, not raw traces for routine viewers).
- Least privilege: The checker has read‑only access to Ω and cannot modify clinical data.
- PHI handling: Proof artifacts and traces avoid unnecessary PHI; all persisted artifacts are encrypted and lifecycle‑managed per retention policy.
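A small sketch of the “IDs, not raw traces” idea: routine viewers receive opaque trace references, while raw traces stay in a restricted store (names are illustrative):

```python
# PHI-minimized audit references: viewers see digests, not raw query traces.
import hashlib


def trace_reference(raw_trace: str) -> str:
    """Opaque, PHI-free identifier for a query trace (its content digest)."""
    return "trace:" + hashlib.sha256(raw_trace.encode()).hexdigest()[:16]


def viewer_bundle(plan_id: str, trace_refs: list[str], snapshot_sha256: str) -> dict:
    """What a routine viewer sees: identifiers only; raw traces require elevated access."""
    return {
        "plan_id": plan_id,
        "snapshot_sha256": snapshot_sha256,
        "query_trace_ids": trace_refs,
    }
```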
Limits & escalation
- Oracle coverage limits: The guarantee is only as comprehensive as Ω. The UI displays the snapshot ID and any known coverage gaps (e.g., missing herbal supplements).
- Clinical nuance: When safety cannot be confirmed (e.g., rare comorbidity), the system fails closed, routes to a clinician/pharmacist, and suggests data or rationale needed for re‑validation.
Why this is different
Unlike heuristic guardrails that lower the chance of a bad plan, this pattern prevents an unchecked plan from leaving the system. It converts “we tried to be safe” into “we can prove we were safe, or we refused to proceed”—with artifacts that stand up to audit.
Brief quotes from the paper
“Across 26 OOD evaluations (180+ environments), deliberative alignment reduces covert action rates (OpenAI o3: 13%→0.4%; OpenAI o4-mini: 8.7%→0.3%) but does not fully eliminate them.”
“We find that models’ chain-of-thought often demonstrates awareness of being evaluated for alignment.”
“Therefore, we cannot exclude that the observed reductions in covert action rates are at least partially driven by situational awareness.”
Conclusion
Advanced alignment reduces risk; formal verification makes the remaining risk explicit and checkable. In finance, healthcare, and critical infrastructure, we pair LLM productivity with proof obligations and independent checkers to deliver correctness at scale.