How I Built an Agent Factory That Ships Code While I Sleep • a[7]t

We are past the hype phase of AI coding assistants. Copilot, Cursor-style autocomplete, chat-based code generation. Most teams have tried them. Some got real value, others did not.

The next move is the part fewer people are talking about. Agents that work autonomously and own the process from picking up a ticket to merging a pull request, with no human in the loop. Big shops like Stripe and Spotify are investing in that shift now.

I have been running a setup like this on IGNIO for ten days. 182 PRs merged, 206 issues closed. Most of it shipped while I was asleep or with my family. The technical write-up of the architecture lives in the Three-Body Agent deep dive. This piece is about the principles that made it work, because the architecture you choose matters less than the discipline you put around it.

Keep it boring

The setup is intentionally barebones. Cron jobs trigger agents on a schedule. A markdown rules file defines how each agent behaves. A CLI coding agent does the actual work. No custom framework wrapping the process, no elaborate harness, no twelve-thousand-line config trying to cover every edge case.
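A minimal sketch of that loop, under stated assumptions: `agent-cli` is a placeholder for whichever CLI coding agent you use, the agent is assumed to read its prompt from stdin, and the crontab line in the comment is an illustrative schedule, not the one used on IGNIO.

```python
import subprocess

# Hypothetical command: swap in your own CLI coding agent.
AGENT_COMMAND = ["agent-cli"]

# An assumed crontab entry that would call a script like this every 30 minutes:
#   */30 * * * * /usr/bin/python3 /opt/agents/run_once.py


def build_prompt(rules_text: str, issue_text: str) -> str:
    """Combine the markdown rules file and one issue into a single prompt."""
    return f"{rules_text.strip()}\n\n# Task\n{issue_text.strip()}\n"


def run_once(rules_text: str, issue_text: str, runner=subprocess.run) -> int:
    """Send the prompt to the agent on stdin and return its exit code."""
    prompt = build_prompt(rules_text, issue_text)
    # check=False: a failed run is reported through the exit code, not an exception.
    completed = runner(AGENT_COMMAND, input=prompt, text=True, check=False)
    return completed.returncode
```

The `runner` parameter exists only so the function can be exercised without a real agent installed. Everything else is a rules file, an issue, and one process call, which is what keeps the engine easy to swap.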

The simplicity is the strategy. Every model generation changes what agents can do, and complex setups become dead weight the moment a better tool arrives. Teams that built elaborate scaffolding around GPT-4 watched it become obsolete when Claude Code with native tool use shipped. The simpler your system, the faster you can swap the engine without rebuilding the car. The patterns that matter eventually get absorbed into the official tools. What survives long-term is the architecture around the agents, not the clever hacks inside them.

Context discipline

The instinct when setting up an agent is to give it everything. Full codebase context, all the docs, the issue description, conversation history, related PRs. The intuition is that more context produces better output. In practice the opposite is true. The more you dump on an agent, the worse it performs.

The prompt has to be surgical. A clear issue title, a focused description, the relevant files, the rules that govern behavior. That is the entire payload.

This is why breaking work into atomic issues matters. A ticket asking to “improve the transaction flow” is terrible for an agent: vague, multi-file, judgment-heavy, no clear stopping point. The same problem reframed as “email preferences default to false when subscription tiers are bypassed, causing reminders to silently not send” gives the agent a target it can hold in its context window without distraction. Smaller tickets produce less context pollution, fewer hallucinated connections, and a higher first-pass success rate.
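The payload can be modeled as a small data structure that refuses to grow. This is a sketch, not the actual setup: the class name, the `MAX_FILES` cap, and the file path in the usage example are all hypothetical.

```python
from dataclasses import dataclass, field

MAX_FILES = 5  # hypothetical cap; tune it for your codebase


@dataclass
class AgentPayload:
    title: str
    description: str
    relevant_files: list[str] = field(default_factory=list)
    rules: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the payload as plain text, refusing oversized context."""
        if len(self.relevant_files) > MAX_FILES:
            raise ValueError(
                f"{len(self.relevant_files)} files listed; split the issue "
                f"or trim to {MAX_FILES}"
            )
        lines = [f"Issue: {self.title}", "", self.description, "", "Relevant files:"]
        lines += [f"- {path}" for path in self.relevant_files]
        lines += ["", "Rules:"]
        lines += [f"- {rule}" for rule in self.rules]
        return "\n".join(lines)


payload = AgentPayload(
    title="Email preferences default to false when subscription tiers are bypassed",
    description="Reminders silently do not send for affected users.",
    relevant_files=["app/preferences.py"],  # hypothetical path
    rules=["Match existing patterns.", "Write tests for every behavioral change."],
)
print(payload.render())
```

Raising an error on too many files turns "this ticket is too big" into a signal you see before the agent starts, rather than after it has produced a sprawling PR.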

The core principle: separate research from execution. You decide what to build and how it should work. The agent builds it. If you find yourself explaining both the problem and the solution in the same prompt, you have already done half the work yourself.

“Done” is provable

Every agent run needs an explicit stopping condition the system can verify on its own. A task is done when tests pass, CI is green, and a PR is open with a description explaining the reasoning. Anything less is the agent declaring victory.

Without a binary exit condition, agents drift. They open PRs with stubs that compile but fail the moment a real user touches them. I saw this early on: PRs that passed linting but had zero behavioral coverage, looking complete when they were not. The mandatory-test step removed the problem entirely. If the tests do not prove the behavior changed, the agent keeps working. If CI fails, the ticket stays out of the review queue. There is no room for “looks right.”
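One way to make that exit condition binary is a check that returns the list of unmet conditions. The sketch below is illustrative: the field names are assumptions, and `tests_changed` is only a rough proxy for "the tests prove the behavior changed".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    tests_passed: bool
    ci_green: bool
    pr_open: bool
    pr_description: str
    tests_changed: int  # test files added or modified on the branch (a proxy)


def unmet_conditions(result: RunResult) -> list[str]:
    """Return every reason the run is not done; an empty list means done."""
    unmet = []
    if result.tests_changed == 0:
        unmet.append("no test covers the behavioral change")
    if not result.tests_passed:
        unmet.append("tests failing")
    if not result.ci_green:
        unmet.append("CI not green")
    if not result.pr_open:
        unmet.append("no open PR")
    elif not result.pr_description.strip():
        unmet.append("PR description is empty")
    return unmet


def is_done(result: RunResult) -> bool:
    return not unmet_conditions(result)
```

A PR that passes linting and CI but touches no tests comes back with one unmet condition, so it stays out of the review queue. That is the stub case described above.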

Short sessions over long ones

Each agent run gets a fresh session tied to a single issue and a single branch. When the work is done, the session dies. The next cron cycle starts clean.

I avoid long sessions. They accumulate context from earlier work and start hallucinating connections that do not exist. The agent references a variable from a file it edited three hours ago that has since been rebased, or applies a pattern from an unrelated ticket. The longer the run, the more phantom context leaks into the output. I learned this from a painful debugging session: agents running for two hours produced worse code than ones finishing in thirty minutes, simply because they had more time to drift.
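The one-issue, one-branch, one-session rule can be sketched as below. The branch naming scheme and the thirty-minute default budget are assumptions for illustration. The text above reports that shorter runs did better, but it does not say a hard time limit is enforced.

```python
from dataclasses import dataclass, field

DEFAULT_BUDGET_SECONDS = 30 * 60  # assumed budget, not a documented setting


@dataclass
class Session:
    issue_number: int
    branch: str
    deadline: float
    history: list[str] = field(default_factory=list)  # always starts empty


def start_session(issue_number: int, now: float,
                  budget_seconds: int = DEFAULT_BUDGET_SECONDS) -> Session:
    """Create a fresh session: one issue, one branch, no inherited context."""
    return Session(
        issue_number=issue_number,
        branch=f"agent/issue-{issue_number}",  # hypothetical naming scheme
        deadline=now + budget_seconds,
    )


def should_stop(session: Session, now: float) -> bool:
    """True once the session has used up its time budget."""
    return now >= session.deadline
```

Because every session is constructed from scratch, nothing an earlier run appended to its `history` can reach the next one.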

Adversarial code review

Before the Implementer (the agent that writes the code) pushes a PR, it runs a self-review through three agents with conflicting incentives. This is a pattern gaining traction in the agentic coding space, rooted in the observation that LLMs are inherently sycophantic. They want to agree with you. Instead of fighting that tendency, you can design a system where each agent’s desire to please works in your favor because they are trying to please different masters.

The Enthusiast is a hyper-aggressive bug hunter. It earns points for every issue it finds, scaled by severity. It produces a long list of potential problems, real and speculative, because it wants to maximize its score. Over-reporting is the goal here; cast a wide net.

The Adversary takes the list and tries to disprove every item. Points for a successful debunk, a harsh penalty for incorrectly dismissing a real issue. That asymmetric risk makes it aggressive about weak claims and cautious about anything legitimate.

The Referee evaluates both sides without bias. It reads the code, reads both arguments, and renders a verdict: real bug, false positive, or worth noting. It is rewarded for accuracy, so it has no incentive to side with either party.
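The incentive structure of the three roles can be written down as a scoring scheme. This is a sketch only: in practice the incentives would live in each agent's prompt, and every point value below is a made-up assumption chosen to show the asymmetry, not a number from the actual setup.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    REAL_BUG = "real bug"
    FALSE_POSITIVE = "false positive"
    WORTH_NOTING = "worth noting"


# Hypothetical point values.
SEVERITY_POINTS = {"low": 1, "medium": 3, "high": 5}
DEBUNK_REWARD = 2
WRONG_DISMISSAL_PENALTY = -10  # harsh on purpose: dismissing a real bug hurts


@dataclass(frozen=True)
class Finding:
    summary: str
    severity: str      # reported by the Enthusiast
    disputed: bool     # True if the Adversary tried to debunk it
    verdict: Verdict   # rendered by the Referee


def enthusiast_score(findings: list[Finding]) -> int:
    """Points for every reported issue, scaled by severity."""
    return sum(SEVERITY_POINTS[f.severity] for f in findings)


def adversary_score(findings: list[Finding]) -> int:
    """Small reward per correct debunk, large penalty per wrong dismissal."""
    score = 0
    for f in findings:
        if not f.disputed:
            continue
        if f.verdict is Verdict.FALSE_POSITIVE:
            score += DEBUNK_REWARD
        elif f.verdict is Verdict.REAL_BUG:
            score += WRONG_DISMISSAL_PENALTY
    return score


def blocking(findings: list[Finding]) -> list[Finding]:
    """Findings the Referee confirmed as real bugs."""
    return [f for f in findings if f.verdict is Verdict.REAL_BUG]
```

With these numbers, one wrongly dismissed bug wipes out five successful debunks, which is the kind of asymmetry that makes the Adversary careful around anything that looks legitimate.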

Running this in production catches issues that CI alone misses: type-safety gaps where a cast silently drops an error, edge cases in async handling, race conditions in queue processing. These are the bugs that usually surface only after release, when you are unlucky. Here they are caught before the code leaves the branch.

Rules are the operating system

Every time an agent does something wrong, I add a rule. The rules file is a living document that grows with every mistake. The same error never happens twice.

The list looks like this in any of my repos:

Match existing patterns.
No new dependencies without justification.
Write tests for every behavioral change.
Document the why, not just the what.

Each line exists because an agent once made a bad decision I had to fix manually. Over time the file becomes the best documentation the project has, not because anyone sat down to write docs, but because every lesson was captured in real time as a direct response to a failure.

Rules accumulate and eventually start contradicting each other. “Keep it simple” and “add comprehensive error handling” pull in opposite directions. The agent tries to satisfy both and produces over-engineered code handling errors nobody will hit. So you periodically consolidate. Remove what is obsolete. Merge overlap. Refactor the rules the way you refactor code.
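Part of that consolidation is mechanical. The sketch below, with a hypothetical function name, only merges rules that are literal duplicates once case, spacing, and a trailing period are ignored. Spotting rules that contradict each other still takes human judgment.

```python
def consolidate(rules: list[str]) -> list[str]:
    """Drop blank lines and near-identical duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for rule in rules:
        # Normalize: trim, lowercase, drop trailing period, collapse whitespace.
        key = " ".join(rule.strip().lower().rstrip(".").split())
        if key and key not in seen:
            seen.add(key)
            kept.append(rule.strip())
    return kept
```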

Own the outcome

Do not assume agents will magically adapt to your codebase. They will not read your mind, infer your architectural preferences, or understand your business context without explicit guidance. Every shortcut you take during setup costs you ten times more in cleanup later.

The agent writes the code, but you own the product. Review PRs when you need to. Open follow-up issues when the execution missed the mark. You still need to understand the architectural decisions and maintain the overall mental model. The agent is the workforce. You are the engineering lead.

This technology is early. It is not perfect. But it is already changing how I build, and the next year is going to be louder than the last one.