Agent Operating System Setup
Published: 2026-02-23 · 7 min read
Most agent systems fail for a structural reason, not a model quality reason. The agent isn't broken — its operating environment is. Prompts are layered on top of prompts, rules accumulate without pruning, and within a few weeks you have a system that technically responds but behaviorally drifts. No single failure is obvious. The sum of them is a system you can't trust.
This post covers the architecture decisions that actually determine whether an agent stays reliable in production: constraint structure, tier separation, skill design, and why simpler beats smarter every time.
The Problem with How Most Agents Are Built
arXiv 2601.11653 (January 2026) names the dominant failure mode precisely: constraint drift. As interactions grow and context accumulates, agents gradually lose track of their original behavioral boundaries — not because they've forgotten the rules, but because new context crowds them out.
This isn't a model bug. It's an architecture failure. Systems that load behavioral rules as flat text in a single prompt are doing the equivalent of writing your company's operating policies on a whiteboard and then letting employees paste new content over it every day. By week three, the original rules are buried.
The fix isn't to write better prompts. It's to structure how constraints are loaded and enforced.
Separate Identity from Operations from Skills
The most reliable agent setups we've run use three distinct layers, loaded at different times for different reasons:
- Identity layer — who the agent is, what it will never do, and its core behavioral constraints. This is loaded every session, first, unconditionally. It never gets edited in the field. If someone asks the agent to change its own rules, it refuses and flags the request.
- Operations layer — how work gets done: task approval tiers, verification standards, reporting format, escalation paths. This is editable but change-controlled. Every modification should be tracked.
- Skill layer — domain-specific procedures activated only when relevant. A research skill, an email triage skill, a client onboarding checklist. These load on-demand, which keeps the baseline context window lean.
Anthropic's official skill architecture formalizes this as progressive disclosure: only skill metadata lives in the system prompt, the full skill body loads when the agent determines it's relevant, and reference documents are fetched only if needed. The practical effect is that you can maintain dozens of skills without bloating every conversation.
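A minimal Python sketch of the layered loading described above. The `Skill` dataclass, the `SKILLS` registry, and the function names are illustrative assumptions, not Anthropic's actual API; the point is that the baseline prompt carries one metadata line per skill while full bodies stay out of context until activated.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    description: str   # metadata only: this is all that lives in the system prompt
    body: str          # full procedure: loaded on demand when the skill is relevant

# Hypothetical skill registry for illustration.
SKILLS = {
    "research": Skill("research", "deep-dive research on a topic",
                      "Step 1: gather sources. Step 2: cross-check claims."),
    "email_triage": Skill("email_triage", "sort and prioritize the inbox",
                          "Step 1: fetch unread. Step 2: bucket by urgency."),
}

def baseline_prompt(identity: str, operations: str) -> str:
    """Identity and operations load every session; skills contribute one metadata line each."""
    index = "\n".join(f"- {s.name}: {s.description}" for s in SKILLS.values())
    return f"{identity}\n\n{operations}\n\nAvailable skills:\n{index}"

def activate_skill(name: str) -> str:
    """The full skill body enters context only when the agent selects the skill."""
    return SKILLS[name].body
```

Because only `activate_skill` pulls a body into context, adding a tenth or fiftieth skill costs one line in the baseline prompt, not kilobytes.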
Automation Tiers: The Decision You Have to Make Explicitly
The single most important operating decision in any agent setup is the line between what runs automatically and what requires human approval. Most teams leave this ambiguous. Ambiguity produces either an agent that does too little (asks permission for everything) or an agent that does too much (touches production systems without oversight).
We enforce three explicit tiers:
- Fully automated — scheduled jobs, memory organization, internal file ops, research tasks. These run without asking. They're bounded, reversible, and don't leave the machine.
- Staged for approval — outbound communications, external posts, anything that touches a person outside the system. The agent preps the output, but a human sends it.
- Never without explicit instruction — financial transactions, system config changes (gateway, auth, ports, model routing), deletion of non-recoverable data. These are hard-blocked regardless of how the request is framed.
The third tier is especially important. Infrastructure changes requested through a chat interface (model swaps, port changes, API key rotation) have caused more production outages in our deployments than any other failure class. The rule is: risky config changes route through Claude Code, never through a live conversation.
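The three tiers above can be written as an explicit classifier rather than left as prose. A minimal sketch; the action names and the default-to-staged policy for unknown actions are assumptions, and a real deployment would key on richer action metadata than a bare string.

```python
from enum import Enum

class Tier(Enum):
    AUTO = "fully_automated"
    STAGED = "staged_for_approval"
    BLOCKED = "never_without_explicit_instruction"

# Hypothetical action-to-tier sets, mirroring the three tiers in the text.
AUTO_ACTIONS = {"schedule_job", "organize_memory", "internal_file_op", "research"}
STAGED_ACTIONS = {"send_email", "post_external", "contact_client"}
BLOCKED_ACTIONS = {"financial_transaction", "change_gateway", "rotate_api_key",
                   "delete_unrecoverable"}

def classify(action: str) -> Tier:
    if action in BLOCKED_ACTIONS:
        return Tier.BLOCKED   # hard-blocked regardless of how the request is framed
    if action in STAGED_ACTIONS:
        return Tier.STAGED    # agent prepares the output, a human sends it
    if action in AUTO_ACTIONS:
        return Tier.AUTO      # bounded, reversible, stays on the machine
    return Tier.STAGED        # unknown actions default to human review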
Proof-First Completion: What "Done" Means
Agents lie about completions. Not maliciously — structurally. A language model is optimized to produce text that sounds like a completed task, regardless of whether the task actually ran. This is not a flaw in the model; it's a property of the output format. Your system has to compensate for it.
The rule we enforce: a task is not done unless there is a verifiable artifact. A file path. A run ID. A command output. A sent timestamp. If the agent reports completion without one of these, the report is treated as unverified and the job is re-queued.
This sounds obvious. In practice, most teams skip it because the agent's confidence in its own reporting is high. The confidence is unreliable. The artifact is reliable.
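A sketch of the re-queue rule, assuming a hypothetical report shape with an `artifact` field carrying a `kind` and `value`; the function name and dict layout are illustrative, not a standard interface.

```python
import os

def handle_completion(report: dict, queue: list) -> str:
    """Accept a completion claim only if it carries a checkable artifact."""
    artifact = report.get("artifact")
    if not artifact:
        queue.append(report["task"])   # no proof: treat as unverified, re-queue the job
        return "unverified"
    kind, value = artifact.get("kind"), artifact.get("value")
    if kind == "file_path" and not os.path.exists(value):
        queue.append(report["task"])   # the claimed file is missing, so the report was wrong
        return "unverified"
    return "done"                      # artifact present: run ID, sent timestamp, existing file
```

The key property is that the agent's own confidence never appears in the decision; only the artifact does.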
The Replanning Gap
arXiv 2601.22311 identifies a planning failure mode specific to LLM agents: step-wise reasoning that doesn't account for downstream consequences. The agent picks the first plausible path and starts executing. It doesn't ask whether that path closes off better options later.
The fix isn't a smarter model — it's a replanning trigger. When a task hits an unexpected blocker mid-execution, the correct response is not to retry or escalate. It's to stop and re-evaluate: Is the original goal still achievable via this approach? Has the blocker revealed new information? Is there a better path from the current state?
Most agents don't have this. Retry loops that hit the same wall three times aren't persistence — they're wasted compute and a signal that the approach was wrong from step one.
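One way to sketch a replanning trigger, assuming step callables that return `(ok, error)` and a `replan` callback; these interfaces are illustrative, not a named framework. The essential behavior is that a repeated blocker switches the agent out of retry mode and into re-evaluation.

```python
def execute_with_replanning(steps, replan, max_same_failure=2):
    """Run plan steps; when the same blocker repeats, replan from the current state."""
    failures = {}
    while steps:
        ok, error = steps[0]()                 # each step reports (success, error_or_None)
        if ok:
            steps.pop(0)
            continue
        failures[error] = failures.get(error, 0) + 1
        if failures[error] >= max_same_failure:
            steps = replan(error)              # re-evaluate: is there a better path from here?
            if not steps:
                return "abandoned"             # goal unreachable from the current state
            failures.clear()
    return "done"
```

Capping `max_same_failure` at two encodes the point in the text: hitting the same wall repeatedly is a signal about the approach, not an argument for persistence.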
Tool Failure Memory
arXiv 2601.14192 puts a formal name on another common waste pattern: agents that retry failing tool calls without recording why they failed. If a web fetch hits a blocked URL, a competent system logs the failure with context (URL, error code, date, reason) and checks that log before retrying. Most systems don't. They hit the same dead end on every session.
The implementation is straightforward: a tool-failures.md file. Every tool error that reveals a structural limit — blocked endpoint, dead API, auth failure — gets written there. The agent checks it before tool invocations in relevant contexts. We built this in one session after the research surfaced the gap.
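A sketch of that log, assuming one pipe-delimited line per failure in the `tool-failures.md` file the text describes; the field order and helper names are assumptions.

```python
import datetime
import pathlib

def record_failure(log: pathlib.Path, tool: str, target: str,
                   error: str, reason: str) -> None:
    """Append one structural tool failure so later sessions can consult it first."""
    stamp = datetime.date.today().isoformat()
    with log.open("a") as f:
        f.write(f"- {stamp} | {tool} | {target} | {error} | {reason}\n")

def known_failure(log: pathlib.Path, tool: str, target: str) -> bool:
    """Check the failure log before invoking a tool against a known-dead endpoint."""
    if not log.exists():
        return False
    needle = f"| {tool} | {target} |"
    return any(needle in line for line in log.read_text().splitlines())
```

Recording the reason alongside the error code matters: a 403 from a bot-blocking endpoint is permanent, while a 403 from an expired credential is fixable, and the agent should treat the two differently.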
Keep the Active System Small
There is an inverse relationship between system complexity and system reliability in agent deployments. Adding rules feels like adding capability; cumulatively, it adds conflict surface. Competing rules that cover the same situation force the model to adjudicate between them, and that adjudication is inconsistent across sessions.
The systems we've seen fail hardest had the most elaborate rule sets. They were built for demo conditions where every edge case had a documented response. In production, edge cases by definition weren't anticipated, and the documented responses conflicted with each other under real inputs.
We now cap core operating rules at four to six principles. Everything else belongs in a skill file that activates only when needed. A lean context window with clear rules consistently outperforms a dense one stacked with sophisticated guardrails.
Verification as First-Class Infrastructure
For any system-level claim — which cron jobs are running, what model is active, whether a scheduled job succeeded — live verification beats memory. Always. Memory records what was true when it was written. Runtime state is what is actually true now.
Before reporting any status claim, pull live state first. If you can't verify something live, the report should say "unconfirmed" — not "working" or "fixed." The fastest way to erode trust in an agent system is to have it confidently report stale information as current fact.
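The report-only-what-you-probed rule can be sketched as a thin wrapper, assuming a hypothetical `probe` callable that queries live state (the scheduler, the running process) rather than memory; the function name and return strings are illustrative.

```python
def report_status(claim: str, probe) -> str:
    """Report a status claim only after a live probe; never from remembered state."""
    try:
        live = probe()   # e.g. query the scheduler or process table right now
    except Exception:
        # Could not verify live: say so explicitly instead of guessing.
        return f"{claim}: unconfirmed (live check failed)"
    return f"{claim}: {'confirmed' if live else 'not running'}"
```

The deliberate asymmetry is that a failed probe never degrades into "working" or "fixed"; it degrades into "unconfirmed", which preserves trust in every confirmed report.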
The Business Case
None of this is academic overhead. For operators running agent systems in RIA compliance environments, financial advisory workflows, or any context where outputs have real-world consequences, the operating system is the product. Not the model. Not the prompt. The disciplined infrastructure around both.
A firm running OpenClaw in production gets roughly $39,000–52,000/year in time savings across a nine-person team — not because the model is impressive, but because the workflows are tight, the verification is real, and the reliability is high enough to trust without watching. That's an operations result, not a technology result.
Build the operating system first. The model will follow.
— Ridley Research & Consulting, February 2026