What We Learned Rebuilding an Agent Stack from Scratch
Published: 2026-02-23 · 6 min read
Last week we ran into a problem that most AI teams eventually hit: the system worked, then it didn't, and we couldn't immediately explain why. What followed was a forced education in model degradation, operational bloat, and why simplicity is actually a technical requirement — not a preference.
Here's what happened, what caused it, and what we changed.
What "Model Degradation" Actually Looks Like
Model degradation is when output quality drops over time, even though the underlying model hasn't changed. It's one of the most disorienting problems in agent operations because the system looks like it's running fine from the outside.
The signals we saw:
- Responses got longer and more verbose without getting more useful
- The agent started narrating its process instead of just executing
- Multiple messages per task — the "double reply" problem
- Instructions were followed to the letter but missed in spirit
- Context looked correct on inspection, but execution felt wrong
None of these are hardware failures. They're behavioral drift caused by accumulated complexity in the system's operating instructions.
The Root Cause: Instruction Bloat
We had spent over a week layering policies, protocols, and guardrails into the agent's operating files. Each addition felt like an improvement. Cumulatively, they created a system where:
- Multiple rules applied to the same situation and produced conflicting behavior
- The agent spent cognitive budget on policy compliance instead of task execution
- Session context filled up faster, degrading earlier instructions
- Every edge case had a documented response, which made normal cases harder to handle
The fix wasn't to add more rules. It was to remove most of them.
The Local LLM Problem
We had configured several scheduled jobs to run on local models — specifically qwen2.5-coder:14b running on an M4 Mac mini via Ollama. The theory: reduce API costs for lightweight, repetitive tasks like heartbeat checks.
The reality: local models introduced a class of failure that was harder to diagnose than a straightforward API error.
- Timeout behavior was inconsistent — jobs would silently fall back to cloud models without logging it clearly
- Fallback routing made it impossible to attribute behavior to a specific model
- When local inference slowed under memory pressure, jobs would hit timeout thresholds and cascade errors
- Some runs succeeded on local, some on cloud, some partially on both — debugging was a mess
The lesson: local LLMs are not ready for production-grade scheduled automation unless you have monitoring, alerting, and failure isolation built specifically around them. For most teams, the cost savings aren't worth the reliability cost.
We pulled all local models from the stack immediately. Everything now routes to cloud APIs with clear fallback chains.
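The attribution problem above comes down to routing that logs every hop. Here's a minimal sketch of what a cloud-only fallback chain with explicit attribution might look like; the model names, the `route` function, and the stub client are all illustrative, not our actual stack:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("routing")

# Cloud-only chain in priority order; model names are illustrative.
FALLBACK_CHAIN = ["primary-cloud-model", "backup-cloud-model"]

def route(prompt, call_model, chain=FALLBACK_CHAIN):
    """Try each model in order; return (model_used, response).

    Every fallback hop is logged, so a degraded response can always
    be attributed to the model that actually produced it.
    """
    for model in chain:
        try:
            response = call_model(model, prompt)
            log.info("served by %s", model)
            return model, response
        except Exception as exc:
            log.warning("%s failed (%s); falling back", model, exc)
    raise RuntimeError("all models in the fallback chain failed")

# Demo with a stub client: the primary times out, the backup answers.
def stub_call(model, prompt):
    if model == "primary-cloud-model":
        raise TimeoutError("simulated timeout")
    return f"{model}: ok"

used, reply = route("heartbeat check", stub_call)
```

The point isn't the chain itself but the logging: silent fallback is what made our incident undebuggable, and a warning per hop is cheap insurance.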
The Reset Protocol
Once we understood the problem, the fix was intentionally aggressive:
- Freeze changes. No new features, no "quick improvements," nothing new until the system was stable.
- Revert to the last known-good baseline. We went back to the original agent operating files from the working version of the stack and ran a diff.
- Strip policies to a minimum. We kept four operating rules: one reply per message, lead with the answer, proof before claims, ask before external/destructive actions. Everything else was removed or folded into those four.
- Pin model routing explicitly. Every scheduled job now has an explicit model assignment. No defaults. No ambiguity.
- Establish a no-local-LLM policy. Until we have proper monitoring infrastructure for local inference, all jobs run on cloud APIs.
- Verify before resuming. We did five consecutive behavior tests before calling the system stable.
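Step four ("pin model routing explicitly") is the easiest to enforce mechanically. A minimal sketch, assuming a dict-based job table (job and model names here are hypothetical):

```python
# Every scheduled job gets an explicit model assignment; loading the
# table fails loudly if any job would fall through to a default.
JOBS = {
    "heartbeat-check": {"model": "cloud-model-a", "schedule": "*/15 * * * *"},
    "daily-summary":   {"model": "cloud-model-b", "schedule": "0 6 * * *"},
}

def validate_routing(jobs):
    """Reject any job without a pinned model: no defaults, no ambiguity."""
    missing = [name for name, cfg in jobs.items() if not cfg.get("model")]
    if missing:
        raise ValueError(f"jobs with no pinned model: {missing}")

validate_routing(JOBS)  # passes only when every job is pinned
```

Running the validation at config-load time turns "no ambiguity" from a policy into a startup check.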
What We'd Do Differently
The biggest operational mistake wasn't any single change — it was making too many changes without verifying each one first. Agent systems are sensitive to instruction surface area. Every rule you add competes with every other rule in context.
The principles we're keeping going forward:
- One change, one verification. Never ship two config changes in the same session without testing the first.
- Simpler is more reliable. An agent with four clear rules outperforms one with forty conflicting ones.
- Model routing is infrastructure. Treat it with the same discipline as a network config — explicit, documented, and tested.
- Proof-first reporting. If a job ran and there's no artifact, it didn't run.
- When something feels worse, it usually is. Don't rationalize degradation. Fix it.
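The proof-first rule can be wired into the job runner itself. A sketch, assuming a local `artifacts/` directory; the runner interface and paths are illustrative:

```python
import json
import time
from pathlib import Path

ARTIFACT_DIR = Path("artifacts")

def run_with_proof(job_name, job_fn):
    """Run a job and persist its result as an on-disk artifact."""
    result = job_fn()
    ARTIFACT_DIR.mkdir(exist_ok=True)
    artifact = ARTIFACT_DIR / f"{job_name}-{int(time.time())}.json"
    artifact.write_text(json.dumps({"job": job_name, "result": result}))
    return artifact

def did_run(job_name):
    """No artifact means the job did not run, whatever the logs say."""
    return any(ARTIFACT_DIR.glob(f"{job_name}-*.json"))

run_with_proof("heartbeat-check", lambda: "ok")
```

With this shape, "did the job run?" is a filesystem query, not a judgment call about log output.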
The Business Case for Operational Discipline
None of this is academic. If the goal is building agent systems that run reliably for clients — in RIA compliance environments, in financial advisory workflows, in political and media operations — then the system has to work when you're not watching it.
That's the whole product. Not clever prompts. Not the latest model. The ability to run clean, unattended, and report accurately.
We learned that the hard way this week. The upside is we have a much better understanding of exactly where these systems break, and how to rebuild them faster when they do.
— Ridley Research & Consulting, February 2026