The Platform I Had to Rebuild Mid-Flight

For most of last year I ran an agent platform I built called OpenClaw. It handled AI routing, tool calling, Telegram integrations, and client ops for around a dozen small businesses — a real estate agent, a couple of roofing contractors, a financial advisor, others. Multiple Mac Minis. Multiple agents running 24 hours a day, interacting with real clients in real time.

It worked. Until the way I'd built it made it a liability.

The platform lived only on the machines it ran on. Config files hand-edited over SSH. No version control for the state of any given machine. No audit trail for what had changed, when, or why. When something broke at 11pm, I was SSH'd in, editing YAML with vi, and hoping I'd remembered to back up the file I was about to overwrite. When it was working, I felt like I was running a tightly controlled operation. When it wasn't, I had no systematic way to know what the machine looked like an hour ago — or last Tuesday.

I spent six months building this. Then I spent the next stretch rebuilding it into something I could actually trust.

The incidents that forced the decision

Two incidents made it impossible to keep going the way I was.

The first was a CRM email automation I'd built for a real estate client. It was supposed to pull leads from a Gmail query and send a PDF summary of comparable market data to each one. The Gmail query was subject:"CMA:" — matching any email with those characters in the subject. The filter I meant to include — from:me, so it only fired on emails the agent itself had composed — was missing. The automation ran, matched incoming client emails, and sent 78 PDF reports to people who had never asked for them. Some of the reports had blank property addresses. Some had placeholder values where the price estimates should have been. Real clients. Real email addresses. Real damage to someone's professional credibility.
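The failure mode is easy to state: a Gmail-style query is an AND of predicates, so dropping one predicate can only widen the match set. A minimal sketch, using hypothetical message data and a made-up agent address, of how the missing `from:me` constraint changed what the automation fired on:

```python
# Sketch of the query mismatch. Gmail-style queries AND their predicates
# together, so omitting one predicate widens the set of matching messages.
# The agent address and messages below are illustrative, not real data.

AGENT_ADDRESS = "agent@example.com"  # hypothetical

def matches(message: dict, require_from_me: bool = False) -> bool:
    """Return True if the message matches the CMA automation's trigger."""
    if "CMA:" not in message["subject"]:
        return False
    if require_from_me and message["from"] != AGENT_ADDRESS:
        return False
    return True

inbox = [
    {"subject": "CMA: 123 Oak St", "from": AGENT_ADDRESS},          # agent-composed
    {"subject": "Re: CMA: 123 Oak St", "from": "client@example.com"},  # incoming reply
]

# Deployed query (subject only): both messages match, incoming replies included.
deployed = [m["from"] for m in inbox if matches(m)]

# Intended query (subject AND from:me): only the agent's own emails match.
intended = [m["from"] for m in inbox if matches(m, require_from_me=True)]
```

With the deployed query, every client reply containing "CMA:" in the subject became a trigger; with the intended one, only the agent's own outbound emails did.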

The part that still bothers me: I had no way to audit what had actually been deployed to that machine. I knew roughly what the config said. I couldn't prove it, because the config was a file on a Mac Mini that I'd edited several times in the preceding week.

The second incident was quieter but in some ways worse. Two separate sessions — me working from different contexts at different times — had both edited the same config file on a client machine within a few hours of each other. The writes didn't conflict at the filesystem level. They silently merged into a state that neither session had intended. The agent kept running. The behavior was subtly wrong in a way that took four days to surface, because the client wasn't alarmed enough to say anything — they just noticed their AI wasn't responding the way it used to.

Four days of degraded behavior, no error in any log, no alert. Just silent drift from a config collision I didn't know had happened.

The machine is not the source of truth

Both incidents had the same root cause: the machine was the source of truth for what the system was supposed to do. If you wanted to know what any client's agent was configured to do, you had to SSH into their Mac Mini and read the files. There was no other record. No history. No diff. No way to know what it had looked like before.

This is a common way to build infrastructure when you're moving fast and everything is working. You make a change, it works, you move on. The problem isn't any individual change. The problem is that after enough changes, the only way to understand the system is to look at it — and looking at a running system tells you what it is, not what it was, or what it should be, or what changed it into this.

Any system you can only understand by looking at the machine is a system waiting to fail you at 2am. By definition, when you most need to know what happened, the machine will be in a state you didn't expect — and you'll have no baseline to compare it against.

Declared state and the overlay model

The concept I built into the replacement system — Hermes — is what I now call declared state. Instead of the machine being the source of truth, there's a git-tracked overlay repository that declares what every machine is supposed to look like. Config files, agent behaviors, routing rules, environment structure — all of it lives in the overlay, version-controlled, with a commit history.

When I need to change something on a client's machine, the change goes into the overlay first. It gets reviewed. It gets committed. Then a patch process applies the overlay to the machine. The machine becomes the output of the declared state, not the source of it.
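The apply step itself can be almost trivially simple. A minimal sketch, assuming a hypothetical overlay layout of `overlay/<machine>/<relative path>` mapping onto that path on the machine; the function name and layout are illustrative, and the real process would run the copy over SSH or rsync rather than locally:

```python
# Minimal sketch of an overlay apply step, under an assumed layout:
#   overlay/<machine>/<relative path>  ->  that path on the target machine.
# Everything here is a hypothetical stand-in for the real deployment layer.
import shutil
from pathlib import Path

def apply_overlay(overlay_root: Path, machine: str, target_root: Path) -> list[str]:
    """Copy every file declared for `machine` into place; return paths written."""
    written = []
    machine_root = overlay_root / machine
    for src in machine_root.rglob("*"):
        if src.is_file():
            dest = target_root / src.relative_to(machine_root)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # preserve timestamps for later drift checks
            written.append(str(dest.relative_to(target_root)))
    return sorted(written)
```

The important property is not the copy logic; it's that this function is the only path by which config reaches a machine, so the overlay repo's history is, by construction, the machine's history.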

This sounds like bureaucracy when you write it out. In practice it means: I can git diff what changed between last Tuesday and now. I can roll back a bad change in under two minutes. I can look at a client's config and know with certainty that what I'm reading is what's actually deployed, because the deployment process is the only way the file gets there. When two sessions try to edit the same file at the same time, the conflict surfaces in version control — loudly, before it reaches the machine.

The CRM incident would have been caught at commit time. "You're deploying a query filter without a from:me constraint to a live client's email automation" is exactly the kind of thing a reviewer — or a pre-commit hook — can catch. The silent config collision would have been a merge conflict in git, not a mystery four days later.
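A sketch of the kind of pre-commit check that would have caught the CRM bug. It assumes a hypothetical convention where automation configs declare their Gmail query under a `gmail_query:` key; the key name and check are illustrative, and a real hook would walk the staged files rather than a single string:

```python
# Sketch of a pre-commit lint for email-automation configs. Assumes a
# hypothetical `gmail_query:` key; a real hook would scan staged files.

def check_gmail_query(config_text: str) -> list[str]:
    """Flag any query that matches on subject without pinning the sender."""
    errors = []
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("gmail_query:"):
            query = line.split(":", 1)[1].strip()
            if "subject:" in query and "from:me" not in query:
                errors.append(f"query lacks from:me constraint: {query}")
    return errors
```

The check is crude, but that's the point: once config changes flow through a repo, even a twenty-line lint gets a chance to veto a bad deploy that no amount of vi-over-SSH ever would.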

The migration itself

Migrating while keeping everyone's agents running was the hardest part. I couldn't take a client's system offline for a day to rebuild it. These are businesses — a real estate agent's AI going dark for 24 hours means missed leads. A roofing contractor's follow-up automation going silent means quotes that don't get sent.

The approach that worked was building the new system in parallel, porting one client at a time, validating behavior before cutting over, and keeping the old system running as a fallback on each machine until the new one had been stable for a week. Tedious. The kind of work that takes longer than you think and produces no visible output for days at a time. But it meant no client had a service gap, and it forced me to document every piece of each client's configuration as I moved it — which is exactly the documentation I should have had from the start.
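The validation step before each cutover can be sketched as a shadow comparison: replay the same inputs through both stacks and refuse to cut over while they disagree. The handler functions here are hypothetical stand-ins for the old and new systems:

```python
# Sketch of parallel-run validation before cutover. `old_handler` and
# `new_handler` are hypothetical stand-ins for the two stacks; the real
# check replayed recent client interactions through both.

def validate_cutover(inputs, old_handler, new_handler):
    """Return inputs where the systems disagree; an empty list means safe to cut over."""
    mismatches = []
    for item in inputs:
        old_out, new_out = old_handler(item), new_handler(item)
        if old_out != new_out:
            mismatches.append((item, old_out, new_out))
    return mismatches
```

Running the new system in shadow like this is slow and unglamorous, but it converts "I think the port is faithful" into a list of concrete disagreements to burn down to zero.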

By the time the last client was migrated, the overlay repo had a complete, auditable description of every machine in the fleet. That alone was worth the months of work.

What I'd tell someone building this now

The lesson isn't about OpenClaw versus Hermes, or about any specific tool. The lesson is that the gap between "working" and "trustworthy" is exactly the gap between "you can make it do what you want right now" and "you can prove what it was doing and why, and undo it if you were wrong."

That gap doesn't matter when everything is going well. It matters catastrophically when it isn't.

If you're building agent systems for real clients — people whose businesses depend on this stuff working — get the source of truth off the machines before you need it off the machines. Git-track your config. Build a deployment layer, even a simple one. Make the machine the output, not the record. It feels like overhead until the day you're trying to explain to a client why their AI sent confusing emails to nearly eighty people, and you realize you can't even show them a history of what changed.

The platforms that survive contact with production aren't necessarily the ones with the best features. They're the ones where you can tell, at 2am, exactly what the system is supposed to be doing and why.


© Ridley Research. All rights reserved.