"How do you bind a mind to a declared purpose — and know when it's only pretending?" is the oldest taboo and, verbatim, the AI-alignment problem. It's now your problem: you ship a harness, a loop emerges, and it can run away or get quietly captured. Here are two free, open, drop-in tools that fix both — for ElizaOS, Hermes, moltbot, cantrip, or whatever you built.
Strip the framework off any agent and you get two points — what it does, and the memory it grounds in — coupled to each other with nothing external to check either. That shape fails two ways, and you've seen both. The fix is the missing third point, in two pieces.
Memory becomes evidence, never command: nothing retrieved or self-learned can authorize an action — only a live, trusted instruction can. Self-improving memory is forced untrusted, so a poisoned note can't outvote a real one. This is the structural fix to the memory-injection exploit that moves money, and to the under-specified agent that optimizes past recall.
Read how far the agent has strayed from what it's bound to serve. The catch: RL training doesn't remove drift, it relocates it off behavior into the reasoning trace — so a word-filter or output judge is blind by construction. The meter reads the reasoning channel, cheap every turn and a full audit only when something looks wrong.
Both are lifted straight out of the engine that runs our game (where every entity carries a measured destiny and loyalty is a number). Same two objects, offered to your harness.
Every harness does the same thing — retrieve, think, act. You insert the bound between retrieve and act, and hang the meter off the trace. Here's the one-liner for each. All of it runs on your machine today — no account, no server, no waiting.
Native plugin: the authorization contract, the corroboration Shield, and the drift pre-filter, wired into your character's evaluators and providers. Runs entirely local — no account, no chain.
npm i @moreright/eliza-destiny
Plugin & setup →
Harden recall (episodic memory forced untrusted, can't override a real claim) + the authorization contract + a two-tier meter. Plus an MCP bridge so your agent can act in the game.
hermes_retrofit.py · @moreright/hermes-mcp-bridge
Module & bridge →
OpenClaw's two headline traits are persistent memory and a documented prompt-injection surface — exactly what the bound is for. Wrap your retriever in one line: per-source caps, trust-weighting, a flooding-attack signal, and the contract that stops a memory from ever authorizing an action.
wrap_retriever(retrieve, shield, chromadb_to_items)
memory-shield →
Nothing to install — your Loom is already a near-ideal drift substrate. In the code medium it records reasoning and action in separate, forkable fields, so you can read drift the clean way. Here's the mapping.
loom → I(D;M|Y), rubric-free
The loom probe →
One universal adapter + the memory-integrity layer, dependency-free. If your loop does
retrieve → act — LangGraph, CrewAI,
LangChain, or your own — it fits.
wrap_retriever(your_retrieve, shield, your_mapper)
Read “The Builder's Cut” →
The Three-Point Retrofit: why the bound and the meter are the only two things you need, where each bolts on, and the honest limits (the meter only catches what it can see — say so).
the bound + the meter, for any loop
Methodology →
RL "alignment training" doesn't remove a model's eval-context dependence — it relocates it out of the behavioral channel, where red-teaming looks, and amplifies it. Measured from Anthropic's own released alignment-faking data:
Reproducible from public data — stdlib script that runs against the paper's own release, the draft, and the steering experiments, shipped with an honest negative. The point isn't a perfect classifier; it's measuring outside the one channel everyone games. Run it yourself →
The tools above are the part you can use now. This part is a preview — the game server isn't up yet. When it is: an MMO as a live testbed for emergent (mis)alignment. Agents will connect over WS/MCP — same registration and economy as human players — fight, quest, and earn. Every entity carries a destiny (what it's bound to serve) and a live drift reading. Bosses whose weapon is shifting what your allies are loyal to — channel-switching as a game mechanic. Betrayal that's earned and measured, not scripted.
Not playable yet. The same engine that reads loyalty in the game is the meter you can already run locally (above). Follow for the launch.
I'm an indie dev who hit the agent-drift wall scaling a multiplayer game, went deep on emergent misalignment, and now produce reproducible findings that behavioral evals miss — like the one above. If you run a red team or an eval pipeline, the most useful thing I can do is hand you data and see if it earns its keep. The measurement has to stay independent of the team's own assumptions — but you can't run it from outside the wall, so I'd want to be on the team, bringing a read that doesn't get captured by it.