What We Learned Rebuilding an Agent Stack from Scratch

Published: 2026-02-23 · 6 min read

Last week we ran into a problem that most AI teams eventually hit: the system worked, then it didn't, and we couldn't immediately explain why. What followed was a forced education in model degradation, operational bloat, and why simplicity is actually a technical requirement — not a preference.

Here's what happened, what caused it, and what we changed.

What "Model Degradation" Actually Looks Like

Model degradation is when output quality drops over time, even though the underlying model hasn't changed. It's one of the most disorienting problems in agent operations because the system looks like it's running fine from the outside.

The signals we saw weren't crashes or error logs; output quality simply slipped while every component looked healthy. That's not a hardware failure. It's behavioral drift caused by accumulated complexity in the system's operating instructions.

The Root Cause: Instruction Bloat

We had spent over a week layering policies, protocols, and guardrails into the agent's operating files. Each addition felt like an improvement. Cumulatively, they created a system where every rule competed with every other rule for the model's attention, and no single instruction carried enough weight to be followed reliably.

The fix wasn't to add more rules. It was to remove most of them.

The Local LLM Problem

We had configured several scheduled jobs to run on local models — specifically qwen2.5-coder:14b running on an M4 Mac mini via Ollama. The theory: reduce API costs for lightweight, repetitive tasks like heartbeat checks.
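On paper the setup is trivial; a heartbeat job against a local Ollama server is only a few lines. Here's a minimal sketch using Ollama's default local endpoint (`/api/generate` on port 11434); the function names and prompt are illustrative, not from our actual stack:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_heartbeat_payload(model: str = "qwen2.5-coder:14b") -> dict:
    """Minimal non-streaming generate request, used as a liveness probe."""
    return {
        "model": model,
        "prompt": "Reply with the single word OK.",
        "stream": False,  # one JSON object back instead of a token stream
    }

def run_heartbeat(timeout: float = 30.0) -> str:
    """POST the heartbeat to the local server; raises on timeout or HTTP error."""
    data = json.dumps(build_heartbeat_payload()).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```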

The reality: local models introduced a class of failure that was harder to diagnose than a straightforward API error. A cloud API fails loudly, with an error code and a provider status page; a local model can keep running and quietly return degraded output, and nothing upstream notices.

The lesson: local LLMs are not ready for production-grade scheduled automation unless you have monitoring, alerting, and failure isolation built specifically around them. For most teams, the cost savings aren't worth the reliability cost.

We pulled all local models from the stack immediately. Everything now routes to cloud APIs with clear fallback chains.
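A fallback chain here means nothing clever: try the pinned model, and on failure walk down an explicit list. A minimal sketch, assuming a `call_model(model, prompt)` function that raises on failure (both names are hypothetical):

```python
def call_with_fallback(prompt, chain, call_model):
    """Try each model in order; return (model_used, result) from the first success."""
    errors = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append((model, repr(exc)))
    raise RuntimeError(f"all models in chain failed: {errors}")
```

Returning which model actually answered matters for auditability: a job that silently ran on its third-choice model is exactly the kind of drift that's hard to spot later.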

The Reset Protocol

Once we understood the problem, the fix was intentionally aggressive:

  1. Freeze changes. No new features, no "quick improvements," nothing until the system was stable.
  2. Revert to the last known-good baseline. We went back to the original agent operating files from the working version of the stack and ran a diff.
  3. Strip policies to a minimum. We kept four operating rules: one reply per message, lead with the answer, proof before claims, ask before external/destructive actions. Everything else was removed or folded into those four.
  4. Pin model routing explicitly. Every scheduled job now has an explicit model assignment. No defaults. No ambiguity.
  5. Establish a no-local-LLM policy. Until we have proper monitoring infrastructure for local inference, all jobs run on cloud APIs.
  6. Verify before resuming. We ran five consecutive behavior tests, all passing, before calling the system stable.
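Steps 4 and 5 reduce to a lookup table that refuses to guess. A sketch, with hypothetical job and model names:

```python
# Hypothetical pinned routing table: every scheduled job must have an explicit entry.
MODEL_ROUTING = {
    "heartbeat-check": "cloud-small",
    "daily-digest": "cloud-large",
}

def model_for(job: str) -> str:
    """Return the pinned model for a job. No defaults: unknown jobs fail loudly."""
    try:
        return MODEL_ROUTING[job]
    except KeyError:
        raise KeyError(f"no model pinned for job {job!r}; add an explicit entry") from None
```

Failing loudly on an unpinned job is the point: a missing entry surfaces at dispatch time instead of silently routing to whatever the framework's default happens to be.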

What We'd Do Differently

The biggest operational mistake wasn't any single change — it was making too many changes without verifying each one first. Agent systems are sensitive to instruction surface area. Every rule you add competes with every other rule in context.
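The "verify before resuming" discipline generalizes to a simple gate: don't resume until N consecutive checks pass, where the checks themselves are whatever behavior tests fit the job. A minimal sketch:

```python
def stable(checks, required=5):
    """Return True once `required` consecutive checks pass; any failure resets the streak."""
    streak = 0
    for check in checks:
        streak = streak + 1 if check() else 0
        if streak >= required:
            return True
    return False
```

Resetting the streak on any failure is what makes this stricter than "5 passes out of N": a flaky system can rack up passes, but only a stable one strings five together.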

The principles we're keeping going forward:

  1. One change at a time, verified before the next.
  2. The smallest instruction set that produces the right behavior.
  3. Explicit model routing for every job. No defaults.
  4. No local inference until the monitoring around it exists.

The Business Case for Operational Discipline

None of this is academic. If the goal is building agent systems that run reliably for clients — in RIA compliance environments, in financial advisory workflows, in political and media operations — then the system has to work when you're not watching it.

That's the whole product. Not clever prompts. Not the latest model. The ability to run clean, unattended, and report accurately.

We learned that the hard way this week. The upside is we have a much better understanding of exactly where these systems break, and how to rebuild them faster when they do.

— Ridley Research & Consulting, February 2026