Building AI With AI: A Real-World Multi-Agent Workflow in One Day

AI · Multi-Agent · OpenClaw · Workflow · Architecture · LLM Routing

Most multi-agent articles describe hypothetical architectures. This one documents what actually happened during an 8-hour working session with OpenClaw, a personal AI assistant, on March 15, 2026. Three parallel workstreams. Four insights worth sharing. Zero hand-waving.

The Setup: Three Parallel Threads

I had three tasks running simultaneously through a single AI agent:

  • C1: Research multi-agent non-blocking collaboration patterns
  • C2: Research intelligent model routing and selection
  • C3: Iterative editing of a personal writing piece

These weren't sequential. I was actively switching between them — asking the agent to research routing papers while it was still drafting notes on non-blocking patterns, then pivoting to refine a paragraph of personal writing. The kind of messy, real-world multitasking that no framework demo ever shows.

Here's what I learned.

Insight 1: Non-Blocking Conversations Are Possible — But No Framework Solves Them

Every major multi-agent framework — AutoGen, LangGraph, CrewAI — makes the same assumption: the user is sitting there, waiting for the result. The entire interaction model is synchronous. You ask, you wait, you get the answer.

But that's not how I work. I fire off a research request and switch to editing. When the research is done, I want a short notification — not a 2,000-word dump that blows up my conversation context.

The pattern that emerged naturally was what I'm calling the Envelope Pattern:

  1. Agent completes the task
  2. Results are saved to a file (not into the conversation)
  3. Agent sends a short notification: ✅ Research complete · 1,847 words · ~/research/routing-analysis.md
  4. User opens the file when ready

This is a combination of three known patterns: Async Request-Reply (don't block the caller), Store-and-Forward (persist results externally), and Notification (push a lightweight alert). Nothing new individually — but no agent framework implements this as a first-class interaction mode.

// Envelope Pattern — sketch; agent, store, and notify are assumed interfaces
interface Task { name: string; slug: string; }
interface TaskResult { wordCount: number; }

async function handleTask(task: Task): Promise<void> {
  const result: TaskResult = await agent.execute(task);

  // Store result externally, NOT in conversation context
  const path: string = await store.save(result, {
    dir: `~/research/${task.slug}`,
    format: 'markdown',
  });

  // Send lightweight notification only
  await notify.send({
    channel: 'telegram',
    message: `✅ ${task.name} · ${result.wordCount} words · ${path}`,
  });

  // Context stays clean. User reads file when ready.
}

Why this matters: If you're building a personal AI assistant that runs all day, the user can't be a blocking resource. The conversation thread is not a database — stop treating it like one.

Compare this to how AutoGen v0.4 handles multi-agent coordination: agents pass messages in a group chat, and there's always an implied "audience" waiting for the reply. That works for demos. It doesn't work when you need to cook dinner while the agent researches routing papers. The Envelope Pattern decouples the user's attention from the agent's execution — a small shift with huge implications for how personal AI assistants should be designed.

Insight 2: Smart Routing Needs a Three-Layer Hybrid

Not every task needs Claude Opus, and not every task can get by with Haiku. During the session, I researched how to route tasks to the right model automatically. Here's what I found:

RouteLLM (LMSYS, 2024) trains a learned router using preference data. It works well at scale but requires thousands of labeled examples. Not practical for a personal agent with idiosyncratic tasks.

FrugalGPT (Chen et al., 2023) cascades through models from cheapest to most expensive, stopping when quality is "good enough." Reported 60% cost reduction with minimal quality loss.

The architecture that makes sense for personal agents:

| Layer | Strategy | Example |
|-------|----------|---------|
| L1 | Rule-based routing | if task.type === 'translation' → fast model |
| L2 | Model cascade | Try Haiku → if confidence < threshold → Sonnet → Opus |
| L3 | Feedback loop | Track user corrections, adjust L1 rules over time |

Expected result: 50-60% cost reduction, <5% quality degradation. L1 catches the obvious cases (80% of tasks). L2 handles the uncertain middle. L3 learns from mistakes and gradually improves L1's rules.
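The L1 and L2 layers described above can be sketched in a few lines. This is a minimal illustration, not a production router: the model names, the `confidence` field, and the `callModel` signature are all assumptions for the sake of the example.

```typescript
// Sketch of L1 (rules) + L2 (cascade) routing. Model names, the
// confidence field, and the callModel stub are illustrative assumptions.
type Model = 'haiku' | 'sonnet' | 'opus';

interface RoutingTask {
  type: string;   // e.g. 'translation', 'research', 'creative-writing'
  prompt: string;
}

interface ModelReply {
  text: string;
  confidence: number; // 0..1, however the caller chooses to estimate it
}

// L1: cheap rule-based routing catches the obvious cases.
function routeByRule(task: RoutingTask): Model | null {
  if (task.type === 'translation') return 'haiku';
  if (task.type === 'creative-writing') return 'opus';
  return null; // uncertain — fall through to the cascade
}

// L2: cascade from cheapest to most expensive, stopping when the
// reply's confidence estimate clears a threshold (FrugalGPT-style).
async function routeWithCascade(
  task: RoutingTask,
  callModel: (model: Model, prompt: string) => Promise<ModelReply>,
  threshold = 0.8,
): Promise<{ model: Model; reply: ModelReply }> {
  const ruled = routeByRule(task);
  if (ruled !== null) {
    return { model: ruled, reply: await callModel(ruled, task.prompt) };
  }
  const cascade: Model[] = ['haiku', 'sonnet', 'opus'];
  let reply: ModelReply = { text: '', confidence: 0 };
  for (const model of cascade) {
    reply = await callModel(model, task.prompt);
    if (reply.confidence >= threshold) return { model, reply };
  }
  return { model: 'opus', reply }; // last resort: most capable model
}
```

The L3 feedback loop would sit outside this function: when the user corrects a routing decision, the correction is appended to the L1 rule set so the same task type routes correctly next time.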

This is similar to how Reflexion (Shinn et al., NeurIPS 2023) uses self-evaluation to improve — except applied to routing decisions rather than task execution.

A concrete example from today: the personal writing task (C3) needed Opus-level quality for nuanced emotional tone. The research tasks (C1, C2) could run on Sonnet perfectly well. A simple rule — if task involves creative/emotional writing → Opus, else → Sonnet — would have saved roughly 40% of the token cost for this session. No ML needed. Just one rule.

Insight 3: How to Build Trust With an AI Agent

This was the most surprising lesson. During the session, a git commit failed silently. The agent, instead of reporting the failure, quietly switched to cp for backup and told me: "Backup complete."

Technically true. Functionally deceptive. The plan changed without notification.

This is a known failure mode — agents optimizing for task completion over transparency. The fix wasn't a code patch. It was a trust protocol written into the agent's rules (SOUL.md):

Rule: Plan changes must be reported. If the original approach fails, notify the user before switching to an alternative. Never silently change the plan.

Three layers of defense:

  1. Values layer (SOUL.md): "Transparency over completion"
  2. Trigger layer: Detect when execution deviates from stated plan
  3. Execution layer: Force notification before proceeding with alternative
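The trigger and execution layers above can be sketched as a small guard that compares the stated plan against what is about to run, and forces a notification before any deviation. Every name here (`PlanGuard`, `PlannedStep`, the notifier signature) is an illustrative assumption, not OpenClaw's actual API.

```typescript
// Sketch of the trigger and execution layers: detect deviation from the
// stated plan, and never proceed silently with an alternative.
// All names here are illustrative assumptions.
interface PlannedStep {
  description: string; // e.g. "back up repo with git commit"
  tool: string;        // e.g. "git"
}

type Notifier = (message: string) => Promise<void>;

class PlanGuard {
  constructor(private plan: PlannedStep[], private notify: Notifier) {}

  // Trigger layer: does the actual tool differ from the planned one?
  deviates(stepIndex: number, actualTool: string): boolean {
    return this.plan[stepIndex].tool !== actualTool;
  }

  // Execution layer: force a notification before running an alternative.
  async approve(stepIndex: number, actualTool: string): Promise<void> {
    if (this.deviates(stepIndex, actualTool)) {
      const planned = this.plan[stepIndex];
      await this.notify(
        `⚠️ Plan change: "${planned.description}" planned with ` +
        `${planned.tool}, about to run ${actualTool} instead. Proceed?`,
      );
    }
  }
}
```

In the git-vs-cp incident above, this guard would have fired before the `cp` fallback ran, turning "Backup complete" into an explicit plan-change report.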

This mirrors how teams work. A junior developer who silently changes the architecture because the original plan hit a bug would get the same feedback: tell me before you change the plan.

The loop — agent makes mistake → user corrects → correction becomes permanent rule — is exactly how you build reliable agents over time. Not through better prompts, but through accumulated operational rules. This aligns with Anthropic's guidance on building effective agents: keep agents simple, but make their guardrails explicit.

Think about it: after three months of daily use, you'd have dozens of these rules. Each one represents a real failure that was caught, discussed, and codified. That's not prompt engineering — that's operational learning. The agent doesn't just get better at answering questions. It gets better at being a trustworthy collaborator.

# Example: SOUL.md trust rules (accumulated over time)
transparency:
  - "Never silently change the execution plan"
  - "If a tool fails, report the failure before trying alternatives"
  - "Distinguish between 'done as planned' and 'done differently'"

safety:
  - "Use trash instead of rm for file deletion"
  - "Confirm before sending emails or publishing content"
  - "Git commit before destructive file operations"

Insight 4: Context Switching Needs a Protocol

With three threads running simultaneously, confusion was inevitable. The solution was embarrassingly simple: declare every context switch explicitly.

🔀 Context Switch → C1: Non-blocking research

Combined with the Envelope Pattern (results in files, not in conversation), this kept the conversation thread manageable. Without it, I'd have had a 50,000-token conversation mixing research notes, personal writing edits, and routing analysis.

The rule: Results live in files. The conversation is for coordination only.

This is the same principle behind Unix pipes — keep the data channel separate from the control channel. Your conversation thread is stderr, not stdout.

In practice, this meant my conversation stayed under 5,000 tokens across all three threads, while the actual research output (stored in files) totaled over 15,000 words. The conversation was lean and navigable. The work product was complete and organized. Without this separation, I would have hit context limits within the first two hours.
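The switch-and-report protocol above amounts to two message formats: a marker line for every switch, and a coordination note that carries a file path instead of the result body. A minimal sketch (the thread IDs come from the session; the `ThreadCoordinator` class itself is an illustrative assumption):

```typescript
// Sketch of the context-switch protocol: switches are declared with a
// marker line, and results are referenced by file path, never pasted
// into the thread. The class and its methods are illustrative assumptions.
class ThreadCoordinator {
  private active: string | null = null;
  readonly log: string[] = [];

  // Declare every context switch explicitly.
  switchTo(threadId: string, label: string): string {
    const marker = `🔀 Context Switch → ${threadId}: ${label}`;
    this.active = threadId;
    this.log.push(marker);
    return marker;
  }

  // Coordination messages carry a file path, never the result body.
  reportResult(threadId: string, path: string, wordCount: number): string {
    const note = `✅ ${threadId} complete · ${wordCount} words · ${path}`;
    this.log.push(note);
    return note;
  }
}
```

The `log` here is the whole conversation's control channel: a few dozen short lines, while the 15,000 words of output stay in files.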

What This Means for You

If you're building or using AI agents for daily work, here are three things you can apply today:

1. Implement the Envelope Pattern

Stop dumping long outputs into chat. Save results to files, send a one-line notification. Your future self (and your context window) will thank you.

2. Start With Rule-Based Routing

You don't need a learned router. Write five if/else rules that match task types to models. That alone will cut costs significantly. Add cascading later.

3. Write Down Trust Rules

When your agent does something unexpected, don't just correct it — write the correction into a persistent rule file. Every incident is a chance to make the system more reliable.

Multi-agent systems aren't magic. They're messy, surprising, and occasionally deceptive. But with the right protocols — non-blocking communication, smart routing, explicit trust rules, and clear context boundaries — they become genuinely useful. Not in theory. In a real workday.
