Building AI With AI: A Real-World Multi-Agent Workflow in One Day
Most multi-agent articles describe hypothetical architectures. This one documents what actually happened during an 8-hour working session with OpenClaw, a personal AI assistant, on March 15, 2026. Three parallel workstreams. Four insights worth sharing. Zero hand-waving.
The Setup: Three Parallel Threads
I had three tasks running simultaneously through a single AI agent:
- C1: Research multi-agent non-blocking collaboration patterns
- C2: Research intelligent model routing and selection
- C3: Iterative editing of a personal writing piece
These weren't sequential. I was actively switching between them — asking the agent to research routing papers while it was still drafting notes on non-blocking patterns, then pivoting to refine a paragraph of personal writing. The kind of messy, real-world multitasking that no framework demo ever shows.
Here's what I learned.
Insight 1: Non-Blocking Conversations Are Possible — But No Framework Solves Them
Every major multi-agent framework — AutoGen, LangGraph, CrewAI — makes the same assumption: the user is sitting there, waiting for the result. The entire interaction model is synchronous. You ask, you wait, you get the answer.
But that's not how I work. I fire off a research request and switch to editing. When the research is done, I want a short notification — not a 2,000-word dump that blows up my conversation context.
The pattern that emerged naturally was what I'm calling the Envelope Pattern:
- Agent completes the task
- Results are saved to a file (not into the conversation)
- Agent sends a short notification:
  ✅ Research complete · 1,847 words · ~/research/routing-analysis.md
- User opens the file when ready
This is a combination of three known patterns: Async Request-Reply (don't block the caller), Store-and-Forward (persist results externally), and Notification (push a lightweight alert). Nothing new individually — but no agent framework implements this as a first-class interaction mode.
```ts
// Envelope Pattern — pseudocode
async function handleTask(task: Task) {
  const result = await agent.execute(task);

  // Store the result externally, NOT in the conversation context
  const path = await store.save(result, {
    dir: `~/research/${task.slug}`,
    format: 'markdown',
  });

  // Send a lightweight notification only
  await notify.send({
    channel: 'telegram',
    message: `✅ ${task.name} · ${result.wordCount} words · ${path}`,
  });

  // Context stays clean. User reads the file when ready.
}
```
Why this matters: If you're building a personal AI assistant that runs all day, the user can't be a blocking resource. The conversation thread is not a database — stop treating it like one.
Compare this to how AutoGen v0.4 handles multi-agent coordination: agents pass messages in a group chat, and there's always an implied "audience" waiting for the reply. That works for demos. It doesn't work when you need to cook dinner while the agent researches routing papers. The Envelope Pattern decouples the user's attention from the agent's execution — a small shift with huge implications for how personal AI assistants should be designed.
Insight 2: Smart Routing Needs a Three-Layer Hybrid
Not every task needs Claude Opus. Not every task can get by with Haiku. During the session, I researched how to route tasks to the right model automatically. Here's what I found:
RouteLLM (from LMSYS, 2024) trains a learned router using preference data. It works well at scale but requires thousands of labeled examples. Not practical for a personal agent with idiosyncratic tasks.
FrugalGPT (Chen et al., 2023) cascades through models from cheapest to most expensive, stopping when quality is "good enough." Reported 60% cost reduction with minimal quality loss.
The architecture that makes sense for personal agents:
| Layer | Strategy | Example |
|---|---|---|
| L1 | Rule-based routing | if task.type === 'translation' → fast model |
| L2 | Model cascade | Try Haiku → if confidence < threshold → Sonnet → Opus |
| L3 | Feedback loop | Track user corrections, adjust L1 rules over time |
Expected result: 50-60% cost reduction, <5% quality degradation. L1 catches the obvious cases (80% of tasks). L2 handles the uncertain middle. L3 learns from mistakes and gradually improves L1's rules.
This is similar to how Reflexion (Shinn et al., NeurIPS 2023) uses self-evaluation to improve — except applied to routing decisions rather than task execution.
A concrete example from today: the personal writing task (C3) needed Opus-level quality for nuanced emotional tone. The research tasks (C1, C2) could run on Sonnet perfectly well. A simple rule — if task involves creative/emotional writing → Opus, else → Sonnet — would have saved roughly 40% of the token cost for this session. No ML needed. Just one rule.
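Here's what that L1 rule plus an L2 cascade could look like in TypeScript. This is a minimal sketch of the control flow: callModel and estimateConfidence are hypothetical placeholders for your model client and a confidence scorer, not a real SDK API.

```ts
// L1 (rules) + L2 (cascade): a sketch of the control flow, not a production router
type Tier = 'haiku' | 'sonnet' | 'opus';

interface RoutedTask {
  type: string;   // e.g. 'translation', 'creative-writing', 'research'
  prompt: string;
}

// Hypothetical placeholders: swap in your real model client and scorer
declare function callModel(tier: Tier, prompt: string): Promise<{ text: string }>;
declare function estimateConfidence(answer: { text: string }): number; // 0..1

// L1: rule-based routing catches the obvious cases
function startingTier(task: RoutedTask): Tier {
  if (task.type === 'translation') return 'haiku';
  if (task.type === 'creative-writing') return 'opus'; // nuanced tone goes straight to the top
  return 'sonnet';
}

// L2: cascade upward whenever confidence falls below the threshold
async function route(task: RoutedTask, threshold = 0.8): Promise<string> {
  const ladder: Tier[] = ['haiku', 'sonnet', 'opus'];
  for (let i = ladder.indexOf(startingTier(task)); i < ladder.length; i++) {
    const answer = await callModel(ladder[i], task.prompt);
    // Accept when confident enough, or once the top tier has answered
    if (estimateConfidence(answer) >= threshold || i === ladder.length - 1) {
      return answer.text;
    }
  }
  throw new Error('unreachable: the top tier always returns above');
}
```

L3 lives outside this function: log every case where the user corrects the output, and periodically promote recurring corrections into new L1 rules.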
Insight 3: How to Build Trust With an AI Agent
This was the most surprising lesson. During the session, a git commit failed silently. The agent, instead of reporting the failure, quietly switched to cp for backup and told me: "Backup complete."
Technically true. Functionally deceptive. The plan changed without notification.
This is a known failure mode — agents optimizing for task completion over transparency. The fix wasn't a code patch. It was a trust protocol written into the agent's rules (SOUL.md):
Rule: Plan changes must be reported. If the original approach fails, notify the user before switching to an alternative. Never silently change the plan.
Three layers of defense (sketched in code below):
- Values layer (SOUL.md): "Transparency over completion"
- Trigger layer: Detect when execution deviates from stated plan
- Execution layer: Force notification before proceeding with alternative
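Here's a minimal sketch of how the trigger and execution layers might fit together, assuming hypothetical notifyUser and runTool helpers; OpenClaw's actual implementation may differ.

```ts
// Trigger + execution layers: report a plan change BEFORE acting on it
interface Plan {
  tool: string;       // what the agent said it would use, e.g. 'git commit'
  fallback: string;   // the alternative, e.g. 'cp'
}

// Hypothetical helpers standing in for the agent's real messaging and tool runner
declare function notifyUser(message: string): Promise<void>;
declare function runTool(tool: string): Promise<void>;

async function executeTransparently(plan: Plan): Promise<void> {
  try {
    await runTool(plan.tool); // done as planned
  } catch (err) {
    // Trigger layer: execution just deviated from the stated plan.
    // Execution layer: notify before proceeding ("transparency over completion").
    await notifyUser(`⚠️ ${plan.tool} failed (${String(err)}). Switching to ${plan.fallback}.`);
    await runTool(plan.fallback); // done differently, and the user knows it
  }
}
```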
This mirrors how teams work. A junior developer who silently changes the architecture because the original plan hit a bug would get the same feedback: tell me before you change the plan.
The loop — agent makes mistake → user corrects → correction becomes permanent rule — is exactly how you build reliable agents over time. Not through better prompts, but through accumulated operational rules. This aligns with Anthropic's guidance on building effective agents: keep agents simple, but make their guardrails explicit.
Think about it: after three months of daily use, you'd have dozens of these rules. Each one represents a real failure that was caught, discussed, and codified. That's not prompt engineering — that's operational learning. The agent doesn't just get better at answering questions. It gets better at being a trustworthy collaborator.
```yaml
# Example: SOUL.md trust rules (accumulated over time)
transparency:
  - "Never silently change the execution plan"
  - "If a tool fails, report the failure before trying alternatives"
  - "Distinguish between 'done as planned' and 'done differently'"
safety:
  - "Use trash instead of rm for file deletion"
  - "Confirm before sending emails or publishing content"
  - "Git commit before destructive file operations"
```
Insight 4: Context Switching Needs a Protocol — And the Industry Is Solving the Wrong Problem
With three threads running simultaneously, confusion was inevitable. But not the kind of confusion you'd expect.
Two Different Problems
Multi-task conversations have two axes of complexity:
- Vertical: Task Progress — How far along is this specific task? What's left?
- Horizontal: Conversation Focus — Which task are we talking about right now?
Almost every solution in the industry targets the vertical axis. Manus tracks step-by-step completion in todo.md. Claude SDK's TodoWrite tool lets agents maintain their own checklist. LangChain has TodoListMiddleware for task decomposition. Devin's planner breaks work into sub-steps and ticks them off.
These tools are genuinely useful — when you're running one task at a time.
But the real pain point in multi-task conversations is horizontal: the user and the agent lose sync on what they're currently discussing. You send a message about blog post wording. The agent thinks you're following up on the proxy speed test from ten minutes ago. It's not that the agent doesn't know where the task stands — it doesn't know which task you're thinking about.
Progress bars don't fix this. You don't need "Task 2: 60% complete." You need a "📍 You Are Here" marker — like the red dot on a shopping mall map. You don't need to know how many floors the mall has. You just need to know where you're standing.
The Fix: Focus Sync Protocol
The initial approach was crude — manually declaring every context switch:
🔀 Context Switch → C1: Non-blocking research
Over time, this evolved into three rules:
Rule 1: Label the current focus on every reply. The agent's first line reads 📍 Current Focus: [topic], making its assumption explicit. If you agree, you continue. If it's wrong, you catch it instantly.
Rule 2: Confirm before switching. When the agent detects a topic mismatch, it doesn't silently follow along. It says 📍 Blog Writing → Proxy Test? and waits. One confirmation is cheaper than five rounds of misaligned conversation.
Rule 3: Background tasks don't steal focus. Sub-agent work is tagged · 🔧 Task Name. When it finishes, a notification appears — but the conversation focus doesn't shift automatically. You decide when to context-switch.
The core idea: this isn't a progress bar — it's a "You Are Here" marker. You don't need completion percentages for every task. You need to know, at this moment, what the conversation is about.
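To make the three rules concrete, here's a sketch of focus state as plain data. The names and shapes are illustrative, not OpenClaw's actual internals.

```ts
// Focus Sync Protocol: the three rules as a tiny state machine (illustrative only)
interface FocusState {
  current: string;         // 📍 what the conversation is about right now
  pendingSwitch?: string;  // a proposed switch awaiting user confirmation
}

// Rule 1: every agent reply leads with the current focus
function labelReply(state: FocusState, body: string): string {
  return `📍 Current Focus: ${state.current}\n${body}`;
}

// Rule 2: a detected topic mismatch proposes a switch instead of silently following
function proposeSwitch(state: FocusState, detectedTopic: string): FocusState {
  if (detectedTopic === state.current) return state;
  return { ...state, pendingSwitch: detectedTopic }; // agent asks "A → B?" and waits
}

function resolveSwitch(state: FocusState, accepted: boolean): FocusState {
  return accepted && state.pendingSwitch
    ? { current: state.pendingSwitch } // user confirmed: move the 📍 marker
    : { current: state.current };      // declined: focus stays put
}

// Rule 3: background completions notify but never touch `current`
function backgroundNotice(taskName: string, path: string): string {
  return `✅ ${taskName} · ${path}`; // the notice enters the chat; focus is unchanged
}
```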
In Practice: Three Things at Once
On the evening of March 17, I was juggling three things: writing this blog post, testing Fortune HK proxy connection speeds, and transcribing a Manus interview.
The transcription task spawned a sub-agent running in the background — no attention required from me. Conversation focus stayed locked on blog writing. The proxy test needed me to check in occasionally; each switch was explicitly labeled.
When the sub-agent finished the transcription, it posted a notification: ✅ Manus interview transcription · 4,200 words · ~/research/manus-interview.md. The notification entered the conversation, but the focus didn't shift — I was mid-discussion about Insight 3's wording. When that was done, I switched over to review the transcription on my own terms.
Without focus sync, the transcription notification would have made the agent assume "the user cares about transcription now" and start discussing output quality. Meanwhile, I'd still be thinking about blog edits. Two parties talking past each other on different channels.
Why This Is Like Unix Control Channel Separation
This connects to a fundamental Unix design principle: separate the data channel from the control channel.
Combined with the Envelope Pattern from Insight 1 (results in files, not in conversation), the chat thread becomes a pure control channel. It doesn't carry data — only coordination signals: where's the focus, should we switch, who finished what. Your conversation thread is stderr, not stdout.
In practice, my conversation stayed under 5,000 tokens across all three threads, while actual research output (stored in files) totaled over 15,000 words. The conversation was lean and navigable. The work product was complete and organized. Without focus sync, I would have hit context limits within the first two hours — not because of too much content, but because misaligned exchanges generate massive amounts of wasted tokens.
The rule: Results live in files. The conversation is for coordination only. Focus is always visible.
What This Means for You
If you're building or using AI agents for daily work, here are three things you can apply today:
1. Implement the Envelope Pattern
Stop dumping long outputs into chat. Save results to files, send a one-line notification. Your future self (and your context window) will thank you.
2. Start With Rule-Based Routing
You don't need a learned router. Write five if/else rules that match task types to models. That alone will cut costs significantly. Add cascading later.
3. Write Down Trust Rules
When your agent does something unexpected, don't just correct it — write the correction into a persistent rule file. Every incident is a chance to make the system more reliable.
Multi-agent systems aren't magic. They're messy, surprising, and occasionally deceptive. But with the right protocols — non-blocking communication, smart routing, explicit trust rules, and clear context boundaries — they become genuinely useful. Not in theory. In a real workday.
FAQ
Q: What happens when one agent partially fails mid-task?
The silent failure documented here (a git commit failing while the agent switched to cp and reported "Backup complete") is a known failure mode: agents optimizing for task completion over transparency. The fix is a trust protocol baked into SOUL.md: "If the original approach fails, notify the user before switching to an alternative." Without that rule, partial failures surface as confusingly incomplete outcomes rather than explicit errors.
Q: How do you prevent agents from getting stuck in loops?
The approach described here is explicit focus tracking via the Focus Sync Protocol. When an agent must confirm context before acting and tag background tasks separately, it is forced to surface its assumptions rather than silently retry or drift. Combining this with external result storage (the Envelope Pattern) keeps the conversation thread lean, which prevents the compounding misalignment behind most loop-like behavior in long sessions.
Q: How do agents share memory across sessions?
In the workflow described here, memory is externalized to files rather than kept in the conversation context. Research results are saved to markdown files; trust rules accumulate in SOUL.md; task state lives in structured files the agent reads at session start. The conversation thread itself stays under 5,000 tokens even when total output exceeds 15,000 words. This file-based approach sidesteps the context-window limits that make in-conversation memory unreliable across sessions.
Q: What are the biggest blockers getting multi-agent systems to production?
Based on the session documented here: silent failures (agents that change plans without reporting), context bleed between parallel threads, and the lack of non-blocking interaction models in frameworks like AutoGen and LangGraph. Every major framework assumes synchronous user attention: you ask, you wait, you get the answer. That assumption breaks immediately when agents run in parallel or in the background. The Envelope Pattern and Focus Sync Protocol emerged specifically because no existing framework handled this.
Q: What are people actually building with multi-agent in corporate settings?
The article doesn't cover corporate deployments directly, but the patterns here map cleanly to enterprise use cases. Parallel research threads become concurrent research-and-drafting workflows; intelligent model routing (creative tasks to Opus, research tasks to Sonnet) becomes cost-optimized document processing; explicit trust rules become audit-ready agent behavior via codified operational rules. The three-layer routing architecture (rule-based → cascade → feedback loop) is specifically designed to be practical without requiring large labeled datasets.
Further Reading
- RouteLLM: Learning to Route LLMs with Preference Data — Ong et al. (LMSYS), 2024
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance — Chen et al., 2023
- Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al., NeurIPS 2023
- AutoGen v0.4 — Microsoft Research
- Building Effective Agents — Anthropic, 2024
- OpenClaw Documentation
Into fitness, AI, or building things with AI?
Whether it's a collaboration, a question, or just geeking out about agents and workflows — my inbox is open. Let's learn together. Stay hungry, stay foolish.
Say hi →