Most demos of AI agents end at the exciting part: the model says it called a tool. The unglamorous part — the one I spend my days on — is making sure something real, multi-tenant and rate-limited actually happens on the other end.
Here are field notes from building MCP servers that connect agents to real systems. Names generalised, lessons kept.
1. The protocol is the easy 10%
MCP gives you a clean contract: tools, resources, prompts, transport. You can stand up a server in an afternoon. That afternoon is not the project.
The project is everything the contract doesn't say:
- Which tenant is this agent acting for? A tool call without tenant context is a security incident with extra steps.
- What happens on retry? Agents retry enthusiastically. If your tool isn't idempotent, the agent will book three meetings and apologise for none of them.
- How do you say no? Rate limits, permission errors and validation failures need to come back as useful text, because your caller reads errors like a person — a bare 403 teaches the model nothing.
2. Tool descriptions are prompts, not docs
The description field is not documentation for developers. It's a prompt for the model deciding whether and how to call you.
// Before: written for a human who already knows the system
{ name: "sync_records", description: "Runs a sync." }
// After: written for a model that has never seen your system
{
name: "sync_records",
description:
"Sync CRM records for ONE tenant. Use after the user names a " +
"specific object type (contacts, orders). Never call for more " +
"than 10,000 records — use start_bulk_sync instead.",
}
The second version prevents real failures. Rewriting descriptions produced a visible drop in wrong-tool calls — the cheapest reliability work I've ever shipped.
3. Agents are chaos monkeys you invited
Every assumption you have about call order, an agent will eventually violate. Treat each tool call as if it arrived from the open internet:
- Validate everything, even fields "the model would never get wrong."
- Make destructive operations two-step: propose, then confirm.
- Log the full call chain. When something goes weird at 2 AM, the transcript is your black box recorder.
4. Five things I'd tell past me
- Start with two tools, not twenty. The model gets better when the menu gets shorter.
- Return structured results with a one-line human summary on top. Both audiences matter.
- Version your tool schemas from day one. Agents cache habits.
- Build the audit log before the tenth tool, not after the first incident.
- Test with the dumbest possible prompts. Users will not read your beautiful examples.
5. The runner's rule applies
Long systems and long races reward the same instinct: the boring reps are the ones that count. Idempotency keys, tenant scoping, structured errors — nobody demos them, everybody depends on them.
The flashy part of AI infrastructure is a model that sounds smart. The load-bearing part is a tool call that does exactly one correct thing, for the right customer, twice in a row, without drama.
That's the job. It's a good one.