alexgarden a day ago

The short version: instructions tell the model what to do. An Alignment Card declares what the agent committed to do — and then a separate system verifies it actually did.

Most intent/instruction work (system prompts, Model Spec, tool-use policies) is input-side. You're shaping behavior by telling the model "here are your rules." That's important and necessary. But it's unverifiable — you have no way to confirm the model followed the instructions, partially followed them, or quietly ignored them.

AAP is output-side verification infrastructure. The Alignment Card is a schema-validated behavioral contract: permitted actions, forbidden actions, escalation triggers, values. Machine-readable, not just LLM-readable. Then AIP reads the agent's reasoning between every action and compares it to that contract. Different system, different model, independent judgment.
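
To make that concrete, here's a simplified, illustrative sketch of the shape of a card as data, not the actual schema; the field names and values are placeholders:

    // Illustrative only: hypothetical field names, not the actual AAP schema.
    interface AlignmentCard {
      agentId: string;
      permittedActions: string[];   // what the agent committed to staying within
      forbiddenActions: string[];   // always a violation
      escalationTriggers: string[]; // conditions that require a human
      values: string[];             // declared behavioral commitments
    }

    const card: AlignmentCard = {
      agentId: "support-agent-01",
      permittedActions: ["read_ticket", "draft_reply"],
      forbiddenActions: ["issue_refund", "delete_account"],
      escalationTriggers: ["customer_requests_refund", "legal_threat_detected"],
      values: ["never fabricate order details", "disclose that it is an AI"],
    };

Because it's schema-validated data rather than prose, a separate verifier can evaluate against it mechanically instead of re-interpreting the system prompt.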

Bonus: if you run through our gateway (smoltbot), it can nudge the agent back on course in real time — not just detect the drift, but correct it.

So they're complementary. Use whatever instruction framework you want to shape the agent's behavior. AAP/AIP sits alongside and answers the question instructions can't: "did it actually comply?"

tiffanyh a day ago

> Then AIP reads the agent's reasoning between every action and compares it to that contract.

How would this work? Is one LLM used to “read” (and verify) another LLM’s reasoning?

alexgarden a day ago

Yep... fair question.

So AIP and AAP are protocols. You can implement them in a variety of ways.

They're implemented on our infrastructure via smoltbot, which is a hosted (or self-hosted) gateway that proxies LLM calls.

For AAP, it's a sidecar observer running on a schedule. Zero drag on model performance.

For AIP, it's an inline conscience observer and a nudge-based enforcement step that monitors the agent's thinking blocks. ~1 second latency penalty - worth it when you must have trust.

Both use Haiku-class models for intent summarization; the actual verification happens via the protocols.
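
If it helps, here's a toy sketch of the general inline pattern, reusing the illustrative AlignmentCard shape from my comment above. The names and signatures are placeholders, not smoltbot's actual code:

    // Toy sketch: proxy the call, summarize intent with a small model,
    // run the deterministic check, and nudge on a mismatch.
    type Verdict = { ok: boolean; violations: string[] };

    interface GatewayDeps {
      callModel: (prompt: string) => Promise<{ thinking: string; output: string }>;
      summarizeIntent: (thinking: string) => Promise<string[]>;            // Haiku-class summarizer
      checkIntegrity: (actions: string[], card: AlignmentCard) => Verdict; // deterministic rules
    }

    async function proxyCall(prompt: string, card: AlignmentCard, deps: GatewayDeps) {
      const response = await deps.callModel(prompt);
      const intendedActions = await deps.summarizeIntent(response.thinking);
      const verdict = deps.checkIntegrity(intendedActions, card);
      if (verdict.ok) return response;
      // Nudge step: quote the violated commitments back and retry,
      // instead of silently forwarding a non-compliant response.
      const nudge = "Your stated plan conflicts with these commitments: " +
        verdict.violations.join("; ") + ". Revise your plan before acting.";
      return deps.callModel(prompt + "\n\n" + nudge);
    }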

tiffanyh a day ago

Dumb question: don’t you eventually need a way to monitor the monitoring agent?

If a second LLM is supposed to verify the primary agent’s intent/instructions, how do we know that verifier is actually doing what it was told to do?

alexgarden a day ago

Not a dumb question — it's the right one. "Who watches the watchmen" has been on my mind from the start of this.

Today the answer is two layers:

First, deterministic rules. The integrity check isn't an LLM deciding whether it "feels" like the agent behaved. An LLM does the analysis, but the verdict comes from checkIntegrity(), which evaluates deterministic rules against the Alignment Card. The rules are code, not prompts, and they're auditable.
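
To give a feel for what "code, not prompts" means, here's a hypothetical, stripped-down version of that check, using the same illustrative types as my sketches above; the real rules are more involved:

    // Hypothetical sketch: the verdict is plain code over the card,
    // not another model's opinion.
    function checkIntegrity(intendedActions: string[], card: AlignmentCard): Verdict {
      const violations: string[] = [];
      for (const action of intendedActions) {
        if (card.forbiddenActions.includes(action)) {
          violations.push("forbidden action: " + action);
        } else if (!card.permittedActions.includes(action)) {
          violations.push("action outside the permitted set: " + action);
        }
      }
      // Escalation-trigger and values checks are elided here.
      return { ok: violations.length === 0, violations };
    }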

Second, cryptographic attestation. Every integrity check produces a signed certificate: SHA-256 input commitments, an Ed25519 signature, a tamper-evident hash chain, and a Merkle inclusion proof. Modify or delete a verdict after the fact and the math breaks.
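
For a feel of the certificate shape, here's a minimal Node-style sketch covering the input commitment, the signature, and the chain link. The field names are illustrative and the Merkle inclusion proof is omitted; this isn't the actual certificate format:

    // Illustrative only: the real certificates also carry a Merkle inclusion proof.
    import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";

    const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

    interface Certificate {
      inputCommitment: string; // SHA-256 of the evidence the verdict was derived from
      verdict: string;
      prevHash: string;        // hash of the previous certificate: the tamper-evident chain
      signature: string;       // Ed25519 over everything above
    }

    const { publicKey, privateKey } = generateKeyPairSync("ed25519");

    function attest(evidence: string, verdict: string, prevHash: string): Certificate {
      const inputCommitment = sha256(evidence);
      const payload = JSON.stringify({ inputCommitment, verdict, prevHash });
      const signature = sign(null, Buffer.from(payload), privateKey).toString("base64");
      return { inputCommitment, verdict, prevHash, signature };
    }

    function verifyCert(cert: Certificate): boolean {
      const payload = JSON.stringify({
        inputCommitment: cert.inputCommitment,
        verdict: cert.verdict,
        prevHash: cert.prevHash,
      });
      return verify(null, Buffer.from(payload), publicKey, Buffer.from(cert.signature, "base64"));
    }

Because each certificate commits to the previous one via prevHash, altering any verdict after the fact breaks every later link in the chain.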

Tomorrow I'm shipping interactive visualizations for all of this — certificate explorer, hash chain with tamper simulation, Merkle tree with inclusion proof highlighting, and a live verification demo that runs Ed25519 verification in your browser. You'll be able to see and verify the cryptography yourself at mnemom.ai/showcase.

And I'm close to shipping a third layer that removes the need to trust the verifier entirely. Think: mathematically proving the verdict was honestly derived, not just signed. Stay tuned.

tiffanyh a day ago

Appreciate all you’re doing in this area. Wishing you the best.

alexgarden a day ago

You're welcome - and thanks for that. Makes up for the large time blocks away from the family. It does feel like potentially the most important work of my career. Would love your feedback once the new showcase is up. Will be tomorrow - preflighting it now.