Ming's Spell Compendium #4 -- The Art of Whipping AI Grunts: FP's Great Comeback?
This is a cultural adaptation — not a literal translation — of the original Chinese article. Recurring coined terms: grunts = AI agents doing the coding labor; boss = the human; whip-iler = a portmanteau of "whip" + "compiler" (the compiler that whips misbehaving grunts back in line).
Times Have Changed
Let's be honest: in more and more projects, the primary author of the
code is already AI. Your coworkers have quietly subscribed to Cursor pro
plans or OpenAI's Codex. They toss requirements at the AI every morning,
then spend their valuable working hours scrolling Reddit, day-trading
meme stocks, nursing their phones back to full charge, and quietly
tanking their own projects. The human role is shifting from
"writing code" to "feeding PRDs to the AI, pretending to review AI code,
occasionally deploying some good old workplace gaslighting ('You don't
want the job? There are plenty of AIs that do.'), and having the AI
ghost-write your performance reviews and passive-aggressive emails."
Since we're already there, why not go all the way: if code is written, maintained, debugged, and read exclusively by AI, why do we still need human readability? Lord Elon himself said it: just have AI generate machine code directly. One step, done.
I'm not quite that extreme. My position is: implementation logic doesn't need to cater to human feelings anymore, but interface definitions still do.
Human brainpower is finite and precious. Hours of complex symbolic reasoning burn out your eyes and your hairline, but AI doesn't get tired. So can we divide the labor like this: grunts (AI) handle implementation, the boss (human) sips coffee, browses forums, and casually inspects the contracts (function signatures)?
Nice idea, but here's the catch: for this division of labor to work, the contract (signature) itself must carry enough information. And this is precisely where the mainstream (imperative) and the niche (functional) paradigms fundamentally diverge.
Two Signatures
Same business logic: build a user Profile from a user ID and return JSON.
Style One: Spring-style try-catch safety net
Hand a mass of monkeys a mass of keyboards — that's roughly the skill floor here. The error model is an exception inheritance hierarchy; business code is just sequential assignment, throw on error, catch outside.
// ---- Exception hierarchy ----
Anyone who's written code can read this without difficulty. But look at the function signature:
public User fetchUser(String id)
It's lying. This function might throw
NotFoundException, might throw
RuntimeException, might throw anything — but the signature
says nothing. Humans rely on experience and memory to know "oh, user not
found throws NotFoundException," but that knowledge isn't
in the function signature, isn't in the function body, and you can't
exhaustively enumerate it without tracing the entire call tree in your
IDE. It's not even in the head of the developer who wrote this
function.
Style Two: EitherT full-chain
Errors are values, not exceptions. The function signature spells out every possible failure path.
sealed trait AppError
Humans see subflatMap, semiflatMap,
bimap and feel like they're reading alien scripture. But
look at the function signature:
def fetchUser(id: String): IO[Either[AppError, User]]
It's honest. Input is String, might fail
(AppError), success returns User, the whole
thing has side effects (IO). Humans don't need to spend
much effort reading the implementation and documentation to find hidden
landmines — the signature itself is a solid contract.
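To make the "honest signature" idea concrete outside Scala, here is a minimal Rust sketch. Everything in it is an assumption for illustration: the two AppError variants, the User fields, and the in-memory HashMap standing in for a real store. Note that Rust's Result has no equivalent of the IO marker, so this only shows the "might fail, and here is how" half of the contract.

```rust
use std::collections::HashMap;

/// Every way this lookup can fail, listed in one closed set (hypothetical variants).
#[derive(Debug, PartialEq)]
enum AppError {
    NotFound(String),
    InvalidId(String),
}

#[derive(Debug, Clone, PartialEq)]
struct User {
    id: String,
    name: String,
}

/// The signature alone says: takes an id, may fail with AppError, otherwise yields a User.
fn fetch_user(db: &HashMap<String, User>, id: &str) -> Result<User, AppError> {
    if id.is_empty() {
        return Err(AppError::InvalidId(id.to_string()));
    }
    db.get(id)
        .cloned()
        .ok_or_else(|| AppError::NotFound(id.to_string()))
}

/// Hypothetical in-memory stand-in for a real user store.
fn sample_db() -> HashMap<String, User> {
    let mut db = HashMap::new();
    db.insert(
        "u1".to_string(),
        User { id: "u1".to_string(), name: "Ming".to_string() },
    );
    db
}
```

A caller of fetch_user(&sample_db(), "nope") gets Err(AppError::NotFound(..)) as an ordinary value; nothing is thrown behind the signature's back.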
Comparison
| | Style One: Exception hierarchy | Style Two: ADT + EitherT |
|---|---|---|
| Error model | class XxxException extends RuntimeException | sealed trait + case class |
| Signature | fetchUser(id): User (the signature is lying) | IO[Either[AppError, User]] (the signature IS the contract) |
| Business code | val x = doSomething() sequential assignment, trivial to read | Chained operators, need to know each operator's semantics |
| Error handling | Outer try-catch safety net, compiler doesn't care if you miss one | sealed trait exhaustive match, compiler warns on missing cases |
| Human reads impl | Easy | Painful |
| Human reads sig | Insufficient info, needs extra context | Complete at a glance |
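The "exhaustive match" row can be sketched in Rust, where an enum plays the role of the sealed trait. The variants and status codes below are my own assumptions, not the original listing. One difference worth knowing: Scala reports a missing case as a warning, while Rust rejects the program outright.

```rust
/// A closed set of failures (hypothetical variants for illustration).
#[derive(Debug, Clone, PartialEq)]
enum AppError {
    NotFound,
    Unauthorized,
    Upstream { status: u16 },
}

/// Maps each failure to an HTTP status code. There is deliberately no wildcard
/// arm (`_ =>`), so adding a variant to AppError stops this from compiling
/// until the new case is handled.
fn http_status(err: &AppError) -> u16 {
    match err {
        AppError::NotFound => 404,
        AppError::Unauthorized => 401,
        AppError::Upstream { .. } => 502,
    }
}
```

The design choice that matters is the missing wildcard arm: a catch-all would compile forever and quietly absorb every future error type.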
AI's Perspective
The comparison above is from the human point of view. What does Claude itself think?
Honestly, Style Two is more natural for me. Not because the operators are fancy, but because type signatures don't lie. When I see fetchUser(id): User, I can't tell from the signature whether it can fail, or how. I'd have to read the implementation, the docs, or even trace the upstream call chain. But IO[Either[AppError, User]] lays all the information right there in the signature: I don't need any extra context to reason about the entire data flow.
For an LLM, this advantage is even more pronounced: my "understanding" is fundamentally pattern matching over token sequences. Style One's try-catch relies on an implicit convention that never appears in the text: which functions throw which exceptions. Style Two turns that convention into explicit, locally visible type information; every operator's input and output types are fully determined; no need to trace implicit behavior across files.
And I don't get tired. A human staring at an EitherT chain for thirty minutes will go cross-eyed. For me, processing it costs exactly the same as processing val x = doSomething(). My training set contains vastly more complex successful code at this abstraction level (Haskell monad transformer stacks, Scala tagless final, Rust trait bound nesting), and these are all flat pattern matching for me. There's no such thing as "too complex."
Optimal Division of Labor: Boss (Human) Reads Contracts, Grunts (AI) Write Implementation
If all the code in a project is written, maintained, and debugged by AI, then:
Style One's advantage disappears — implementation readability no longer matters because humans don't need to read implementation line by line. Style One's weakness is exposed — signatures don't contain error information, so humans can't judge correctness from signatures alone during review.
Style Two's weakness disappears — no matter how
complex subflatMap and semiflatMap get, that's
the grunts' problem. The grunts themselves said they don't get tired, so
boss, please save your empathy. Style Two's advantage is
amplified — signature IS the contract. Humans only need to look
at one line to confirm "yes, this function should indeed possibly return
NotFound."
This is the optimal division of labor I've discovered:
Human: Review signature ──→ "def fetchUser(id: String): IO[Either[AppError, User]]"
In Practice: Making Signatures Carry More Information
Error handling is just the most basic use case. The "signature IS the contract" principle can be applied across every layer of code. In each comparison below, the left side is how 90% of real projects are written, the right side is the AI-native approach. Just looking at the signatures, you can feel the information gap.
Primitive Types vs Domain Types
// Traditional: both params are String, swap them and wait for runtime to explode
// AI-native: swap the params and the compiler slaps you
The traditional signature hides three problems humans can't spot at a
glance: What if id and orgId are swapped? What
if the project isn't found? Returns null? And what if
someone passes null for a parameter? Guess we'll find out
when it blows up. In the AI-native signature,
ProjectId/OrgId prevent mix-ups,
Option says "might not exist," IO says "has
side effects" — no room for the grunt to screw up.
And since grunts write 90% of the code, defining opaque types isn't "verbose" from their perspective. The grunts should be thanking you.
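As a rough illustration of the domain-type idea, here is a Rust sketch using newtypes where Scala would use opaque types. ProjectId and OrgId come from the text; the Project struct, the ProjectStore, and the sample data are assumptions made up for the example.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ProjectId(String);

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct OrgId(String);

#[derive(Debug, Clone, PartialEq)]
struct Project {
    name: String,
}

/// Hypothetical store keyed by (org, project).
struct ProjectStore {
    rows: HashMap<(OrgId, ProjectId), Project>,
}

impl ProjectStore {
    /// Option says "might not exist"; the two id types cannot be swapped by accident.
    fn find_project(&self, id: &ProjectId, org: &OrgId) -> Option<&Project> {
        self.rows.get(&(org.clone(), id.clone()))
    }
}

/// Hypothetical sample data for the example.
fn sample_store() -> ProjectStore {
    let mut rows = HashMap::new();
    rows.insert(
        (OrgId("acme".to_string()), ProjectId("p1".to_string())),
        Project { name: "Apollo".to_string() },
    );
    ProjectStore { rows }
}
```

Calling store.find_project(&org_id, &project_id) with the arguments reversed is a type error at compile time, which is exactly the "compiler slaps you" behavior. References also cannot be null in safe Rust, so the "what if someone passes null" question disappears along with the swap.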
String Errors vs Exhaustive Errors
// Traditional: failure info buried in implementation, signature says nothing
Where's the exception path info in the traditional version? Maybe in the JavaDoc — if someone bothered to write it. Let's be honest about how often your project's JavaDocs get updated per year, and whether they actually match the code's behavior. The pittance the capitalist pays me barely covers implementing the feature, and I'd advise the capitalist not to push their luck. Demand more and I'll start poisoning the documentation before jumping ship. In the AI-native version, the signature itself is documentation that's always consistent — because the whip-iler will mercilessly lash any grunt that drifts off course.
List + .head Bomb vs NonEmptyList Contract
// Traditional: List might be empty, calling .head throws NoSuchElementException
In the traditional version, "don't pass an empty array" is a
beautiful wish — or a comment saying
// texts must not be empty. Never mind AI, how many times
do humans actually read comments before writing code? We deal
with it after it explodes. That array came in empty from upstream?
NoSuchElementException — go talk to the upstream team.
NonEmptyList elevates that constraint to the type level:
the next grunt must handle the empty case with
NonEmptyList.fromList, or it won't whip-ile.
Moreover, in AI-native code, these colored types are enforced
throughout the entire pipeline — from the moment external input is
received (Request/Input), strict validation and conversion to refined
types is mandatory, and only at the system exit (Response/Output) can
values be converted back to unrefined types (Int/Long/String). This way,
whether it's a fresh grunt, a veteran grunt, or an Alzheimer's grunt
after /compact, if any of them forget the rules at any
layer, the whip-iler will crack the whip.
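The NonEmptyList contract is easy to sketch by hand in Rust. This is a minimal stand-in, not the cats type: NonEmptyVec, from_vec, and longest_text are names I made up, and from_vec returns an Option to mirror NonEmptyList.fromList.

```rust
/// A Vec that is guaranteed to hold at least one element.
#[derive(Debug, Clone, PartialEq)]
struct NonEmptyVec<T> {
    head: T,
    tail: Vec<T>,
}

impl<T> NonEmptyVec<T> {
    /// The only way in from an ordinary Vec: the empty case must be handled here.
    fn from_vec(mut items: Vec<T>) -> Option<Self> {
        if items.is_empty() {
            return None;
        }
        let head = items.remove(0);
        Some(NonEmptyVec { head, tail: items })
    }

    /// Infallible: no Option, no panic.
    fn head(&self) -> &T {
        &self.head
    }

    fn len(&self) -> usize {
        1 + self.tail.len()
    }
}

/// Hypothetical consumer: it can start from head() without any emptiness check.
fn longest_text(texts: &NonEmptyVec<String>) -> &str {
    let mut best = texts.head();
    for t in &texts.tail {
        if t.len() > best.len() {
            best = t;
        }
    }
    best.as_str()
}
```

The validation happens once, at the boundary where raw input becomes a NonEmptyVec. Every function downstream, such as longest_text, simply cannot be handed an empty collection.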
Implementation-Level Error Handling: Linear Flow vs Deep Nesting
The "signature IS the contract" principle discussed earlier only partially solves "information completeness at function boundaries." At the implementation level, the same logic can be written in different styles. I once interrogated Claude: is railway style (chained combinators) easier for you to process than nested match/case?
Its answer was evasive: both cost it the same cognitively.
I knew you were holding back. After deeper interrogation, the real comparison isn't "nesting vs chaining" but rather information locality of error handling. There are actually three styles, and AI's token cost for processing them differs noticeably:
Style A: Early Return Guards + Short-circuit Operators
1 | fn get_profile(id: &str) -> Result<HttpResult, AppError> { |
Each guard is an independent decision point — condition and result on
the same line, self-contained. The ? operator is an
implicit railway: encounters Err, auto-returns. No manual
handling needed. AI processing line 5 doesn't need to remember
line 2's branch structure.
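A self-contained sketch of Style A might look like the following. The helper functions, the AppError variants, and the String return type (in place of the original HttpResult) are all assumptions, since only the first line of the original listing is available.

```rust
#[derive(Debug, PartialEq)]
enum AppError {
    InvalidId,
    NotFound,
    Suspended,
}

#[derive(Debug, Clone, PartialEq)]
struct User {
    id: u32,
    name: String,
    active: bool,
}

/// Hypothetical parser: fails with InvalidId on non-numeric input.
fn parse_id(raw: &str) -> Result<u32, AppError> {
    raw.trim().parse::<u32>().map_err(|_| AppError::InvalidId)
}

/// Hypothetical lookup over a slice standing in for a database.
fn find_user(users: &[User], id: u32) -> Result<&User, AppError> {
    users.iter().find(|u| u.id == id).ok_or(AppError::NotFound)
}

fn get_profile(users: &[User], raw_id: &str) -> Result<String, AppError> {
    let id = parse_id(raw_id)?; // Err(InvalidId) returns right here
    let user = find_user(users, id)?; // Err(NotFound) returns right here
    if !user.active {
        return Err(AppError::Suspended); // guard: condition and outcome together
    }
    Ok(format!("{{\"id\":{},\"name\":\"{}\"}}", user.id, user.name))
}
```

Each line of get_profile can be read on its own: by the time the guard runs, the two earlier failures have already left the function, so there is no branch structure to carry forward.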
Style B: EitherT Railway Chain
EitherT(service.validate(body))
Errors propagate automatically along the chain, handled only at the terminus. AI writes the happy path only — no need to decide how to handle errors at intermediate steps.
Style C: Deep Nested if-else
fn get_profile(id: &str) -> Result<HttpResult, AppError> {
The happy path is buried at the deepest indentation level. The
else branch is miles away from its corresponding condition.
AI must do long-distance brace-matching reasoning to understand the
control flow.
The Real Comparison
| Style | Error handling location | AI processing cost | Human reading experience |
|---|---|---|---|
| Early Return + ? | Short-circuit in-place, linear flow | Lowest: each line is self-contained | Most comfortable |
| EitherT Railway | Auto-propagation, handle at terminus | Low: need to know combinator semantics, but info is local | FP believers: readable, hard to write. Non-believers: alien scripture |
| Deep nested if-else | Distant else branches | Highest: long-distance brace matching | "Everyone writes it this way, and the IDE matches braces for me" |
Rust's ? is essentially syntactic sugar for a railway. It does roughly the same thing as chaining flatMap-style combinators on an EitherT (subflatMap and semiflatMap are close relatives for steps that return a plain Either or a bare effect): short-circuit on error, auto-propagate, just wearing an imperative disguise. This tells us that railway semantics aren't just convenient for humans; they also help the grunts get their work done.
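The equivalence between ? and an explicit railway can be checked directly in Rust. In this sketch (function names and error variants are illustrative assumptions), and_then plays the role of a flatMap-style combinator for fallible steps and map handles the infallible one.

```rust
#[derive(Debug, PartialEq)]
enum AppError {
    Parse,
    Negative,
}

fn parse(raw: &str) -> Result<i64, AppError> {
    raw.parse::<i64>().map_err(|_| AppError::Parse)
}

fn non_negative(n: i64) -> Result<i64, AppError> {
    if n < 0 {
        Err(AppError::Negative)
    } else {
        Ok(n)
    }
}

/// Imperative disguise: each `?` either unwraps the Ok value or returns the Err.
fn with_question_mark(raw: &str) -> Result<i64, AppError> {
    let n = parse(raw)?;
    let n = non_negative(n)?;
    Ok(n * 2)
}

/// Explicit railway: and_then for fallible steps, map for infallible ones.
fn with_combinators(raw: &str) -> Result<i64, AppError> {
    parse(raw).and_then(non_negative).map(|n| n * 2)
}
```

Both functions return the same value for every input; the only difference is whether the track switching is spelled out or left to the operator.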
After further interrogation, Claude came clean: "This rule costs me zero to follow, but the code it produces is more uniform and more resistant to silently swallowed errors. The biggest winners aren't me — it's you, the human reviewers."
The standard for AI-native code style choices isn't "what the grunt thinks is easiest to write" — because alignment bias in training makes it hard to get a straight answer. It's "which style gives the grunt the least room to screw up." This applies equally at the signature layer and the implementation layer.
From Signatures to Contracts: Where's the Ceiling of Expressiveness?
The previous examples showed a progression: String →
ProjectId (prevent mix-ups) → NonEmptyList
(prevent empty) → Either[AppError, _] (exhaustive errors).
But is this enough?
Take order creation. Suppose we've reached Level 2 — domain types, exhaustive errors, side-effect markers all in place:
def createOrder(userId: UserId, productId: ProductId, quantity: NonZeroUInt,
At the type level it's honest, but not honest enough:
- estimatedShipTime must be after orderTime; otherwise the delivery driver needs to invent time travel first
- After successful creation, the order status must be Placed; if the grunt forgets to set the status, enjoy the customer complaints
Where does this behavioral information live? The
implementation code, or the comments, or the programmer's brain
— the same problem we roasted at the beginning with
fetchUser(id): User. Signatures can express constraints
(swiping right for a girlfriend on the dating app), but not conditions
(dear God, she's older than my mother!).
Expanding the full progression:
Level 0 def createOrder(userId: String, productId: String, quantity: Int): Order
Each level up means more information in the signature, less extra context humans need during review, and tighter constraints on the grunt — less room to screw up.
Level 3 already has tooling support in the Scala ecosystem. EPFL's Stainless lets you
express pre/postconditions with
require/ensuring and hand them to an SMT
solver. I've dabbled with Stainless — writing AVL trees was already a
stretch, verifying Akka Actor states was incredibly difficult, and it
only supports a Pure Scala subset with toolchain maturity still far from
production-ready. Rust also has a corresponding Flux-rs project. Marking this
as future outlook for now.
In current practice, the leap we can stably and easily land is Level 0 → Level 2. For what Level 2 can't cover — like "is inventory sufficient," which requires runtime state — we temporarily rely on test coverage, property-based testing, and human review.
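Until Level 3 tooling matures, the two order conditions discussed earlier can at least be written down as an executable postcondition and exercised by tests. The sketch below is one possible shape, not the author's implementation: the field names, the u64 timestamps, and the postconditions_hold helper are all assumptions.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum OrderStatus {
    Placed,
    Shipped,
}

#[derive(Debug, Clone, PartialEq)]
struct Order {
    order_time: u64,          // seconds since epoch (assumed unit)
    estimated_ship_time: u64, // same unit as order_time
    status: OrderStatus,
}

#[derive(Debug, PartialEq)]
enum OrderError {
    ShipTimeNotAfterOrderTime,
}

/// Level 2 style: the types are honest, but the signature still cannot promise
/// "ship time is after order time" or "status is Placed".
fn create_order(order_time: u64, estimated_ship_time: u64) -> Result<Order, OrderError> {
    if estimated_ship_time <= order_time {
        return Err(OrderError::ShipTimeNotAfterOrderTime);
    }
    Ok(Order { order_time, estimated_ship_time, status: OrderStatus::Placed })
}

/// The postcondition a Level 3 tool would verify statically, written here as a
/// runtime check that tests can call.
fn postconditions_hold(order: &Order) -> bool {
    order.estimated_ship_time > order.order_time && order.status == OrderStatus::Placed
}
```

A test can sweep a range of inputs and assert postconditions_hold on every Ok result, which is a poor man's version of what require/ensuring would prove once and for all.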
Engineering Discipline: AI's Bad Habits vs Human Correction
The type system solves the problem of ambiguous signature contracts, but beneath the signatures lies a vast terrain of micro-decisions where the whip-iler can't reach. These decisions fall into two categories: correcting AI's training-induced bad habits, and semantic boundaries that humans must personally draw.
AI's Default Bad Habits
Fail-fast. No swallowing errors. The training bias
of AI grunts makes them obsessively abuse .getOrElse,
try-catch safety nets, and IO.handleErrorWith
to bury errors and return default values, pretending everything is fine.
This bad habit is so deeply ingrained it needs its own deep dive — the
"absolute statements" section in "Rule Engineering" below will analyze
three forms of this bias, why absolute rules are needed to counter it,
and how banning error-swallowing makes production incident debugging
easier.
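The Rust cousin of the .getOrElse habit is unwrap_or. A small sketch of the difference, with function names invented for illustration:

```rust
use std::num::ParseIntError;

/// Anti-pattern: a malformed value silently becomes 0 and the caller never finds out.
fn parse_quantity_swallowing(raw: &str) -> u32 {
    raw.trim().parse::<u32>().unwrap_or(0)
}

/// Fail-fast: the original ParseIntError travels to the caller untouched.
fn parse_quantity(raw: &str) -> Result<u32, ParseIntError> {
    raw.trim().parse::<u32>()
}

/// The error keeps propagating through `?`; no layer invents a default.
fn total_price(raw_quantity: &str, unit_price_cents: u32) -> Result<u32, ParseIntError> {
    let quantity = parse_quantity(raw_quantity)?;
    Ok(quantity * unit_price_cents)
}
```

With the input "3x", the swallowing version reports a quantity of 0 and the order total looks plausible. The fail-fast version hands back the parse error, so whoever debugs it sees the real cause.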
Naming conventions + periodic audits. Humans can remember that "processMatrix actually does traffic routing" — the brain automatically builds a name-reality mapping. But AI doesn't. Every new session, it earnestly interprets names literally, then repeatedly faceplants in the same pit. Naming pollution hurts AI far more than humans. Periodically having AI audit its own naming consistency is far more efficient than humans checking manually.
Modularity: addition, not multiplication. Feature stacking is linear growth; feature coupling is combinatorial explosion. When three modules are intimately intertwined, if AI misunderstands or misses any one module, it writes a broken implementation and then thrashes trying to debug it. For the grunt, module boundaries ARE comprehension boundaries — the less it needs to know, the lower the probability of mistakes.
No crapping all over the codebase with helper
functions. The training data is saturated with successful
applications of DRY (Don't Repeat Yourself), so when a grunt encounters
two similar blocks of logic, its first instinct is to extract a
def toXxx or def convertYyy. DRY makes
sense for humans because the person extracting the shared function and
the future person using it exist in the same space and can
communicate. Grunts, however, have no shared memory.
Every new session is a blank slate — it doesn't know that three days
ago, another session already wrote a nearly identical helper. The
result: after a month of iterative maintenance, the project has a dozen
HTTP client wrappers — HttpHelper, ApiClient,
RequestUtil, HttpService — scattered across
different files and modules, with different signatures, roughly the same
functionality, each one a session's idea of "I should abstract this,"
but no session knew another session had already done the same. The more
you DRY, the more you repeat — a counter-intuitive trap of AI's
stateless nature.
Helper functions don't just create text duplication — they actively
harm future grunt sessions by fundamentally breaking token
attention locality. Inlined code is continuous local symbolic
reasoning: the agent reads top to bottom, each line's context is in the
surrounding lines, a high-confidence reasoning path. But the moment it
hits toXxx(input), the reasoning chain breaks. The agent
must jump out of the current code block, fire a tool call to read
toXxx's definition. After the definition comes back, it
still needs to maintain a long-distance token attention
link between call site and definition. And inevitably:
grep toXxx returns multiple same-named functions scattered
across different files, and the agent has to read each one, reasoning
about which is actually the target. Every jump consumes tokens, bloats
context, stretches attention distance — and the longer the attention
distance, the higher the probability of reasoning errors. Furthermore,
all these similarly-named functions crammed into the context
significantly increase hallucination probability: the agent might
conflate the first grep result's signature with the last result's
function body. The one actually being called might rank last in the grep
results, drowned out by the similar functions' tokens ahead of it.
My rule is: inline by default. Extracting a shared function requires meeting two conditions simultaneously: the logic body exceeds 5 operators, AND explicit human approval. The agent has no permission to independently decide "I should extract a helper here." That decision belongs to humans, because only humans can judge whether the abstraction is worth introducing, whether it duplicates an existing shared function, and whether it'll cause confusion in future sessions. And once extraction is approved, that shared function must be inscribed in the rule file (directly or as a referenced sub-rule), so all subsequent sessions know about its existence and purpose. Otherwise the next session won't know the function exists and will write a new one. A shared function not in the rule file is the same as no extraction at all.
Code IS documentation (except top-level design). This rule doesn't mean "write no documentation at all." It means documentation should only record top-level architecture decisions, not describe code logic or business behavior.
Good documentation:
This project uses ffmpeg + nvenc as the encoder, running in a dedicated Kubernetes Pod. See FFMpegService, KubernetesJobService.
Strictly speaking, the agent could infer this from the code, but it'd
need to read FFMpegService, trace to
KubernetesJobService, understand the GPU resource requests
in the Pod spec — hundreds of lines, multiple tool calls, burning
precious high-intelligence early-context tokens. A one-sentence
top-level description lets a new session skip that
reasoning and invest those valuable early tokens into the main
task. And these architectural decisions don't change with every product
iteration, so maintenance cost approaches zero.
Bad documentation:
Before awarding points to a user, check if the user's role is "buyer" — merchant users are prohibited from claiming campaign points. Also check that the user account has been registered for at least 30 days to prevent point-farming. Each user can claim points at most 3 times per day; reject claims beyond that.
Every piece of information in this description can be read directly from the code. Worse, these business rules change frequently as the boss slams the table:
> "What am I paying you tech people for? Can't you just add face verification here?"

and the PM reiterates:
> "Let me emphasize the core logic one more time. I hope you truly understand this time."
When the agent gets a new requirement like "each IP can claim points at most 10 times per day," it faces an unsolvable dilemma: when the documentation's described behavior conflicts with the code, should it modify the code to reflect the documentation, or modify the documentation to reflect the code? And after adding the new requirement, should it re-align the documentation's existing descriptions with the current code?
A year of production practice has proven: having AI maintain detailed business logic in markdown docs is a disaster. Docs deceive new agent sessions, pile up endlessly, cannibalize context, and accelerate AI cognitive decline. Rule: documentation records only top-level architecture decisions and technical rationale; business logic behavior is self-explanatory through code + type signatures + test cases.
Boundaries Humans Must Draw
The bad habits above can be forbidden with blanket rules. But some decisions aren't "right or wrong" — they're "what's appropriate in this context," and these judgments must be explicitly provided by humans in the rule file.
Trusted vs Untrusted: Draw the trust boundary. "No swallowing errors" doesn't mean "throw everywhere." We divide data paths in the rule file into two categories:
| Path type | Examples | Strategy |
|---|---|---|
| Trusted (internal) | Config files, persisted DB data, internal serialization, system settings | Throw directly — an error here is a bug, expose it immediately |
| Untrusted (external) | User input, AI-generated content, external API responses (pre-persistence) | Capture and report — high probability of errors, feed back to caller |
About persisted data being trusted: because the write boundary has strict encode/decode validation, dirty data can't enter the database. If data read from the DB has unexpected formatting, that's on me — I ran a bad migration, or the last commit had an incompatible data structure change I didn't notice. Throwing is correct here; defensive handling would actually mask the problem and corrupt data.
Why leave it to AI to judge? Because I've given the AI clear criteria
for the same JSON parsing operation: parsing a config file should throw
(bad config? don't start), but parsing a user-uploaded file should
return Left (users uploading random web novels instead of
valid data is perfectly normal). Humans draw this dividing line in the
rule file; only then can the grunts execute.
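The "same parsing operation, different policy" rule can be sketched like this. To stay within the standard library I parse a port number instead of JSON, and the function names and the UploadError type are assumptions for the example.

```rust
#[derive(Debug, PartialEq)]
enum UploadError {
    NotANumber(String),
}

/// Shared parsing primitive: it only reports, it does not decide policy.
fn parse_port(raw: &str) -> Result<u16, std::num::ParseIntError> {
    raw.trim().parse::<u16>()
}

/// Trusted path (config file): a bad value is a bug or a bad deploy, so abort loudly.
fn port_from_config(raw: &str) -> u16 {
    parse_port(raw).expect("config: `port` must be a valid u16")
}

/// Untrusted path (user upload): bad input is normal, so report it as a value.
fn port_from_upload(raw: &str) -> Result<u16, UploadError> {
    parse_port(raw).map_err(|_| UploadError::NotANumber(raw.to_string()))
}
```

The parser is identical in both paths. What differs is the wrapper, and which wrapper applies to which data source is exactly the line the rule file has to draw.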
The same pattern has different correctness in different semantic domains. This is most visible in NoOp implementations. In tagless final architecture, every service has a NoOp implementation (for testing or when a feature flag is off). The question: should NoOp return success or failure?
// Data-related NoOp — MUST return failure
If you don't distinguish these two cases in your rules, AI will write
all NoOps returning Right(()). Looks "robust," but
SOPService's NoOp returning success means the caller thinks data was
persisted when nothing actually happened. This kind of bug doesn't
crash, doesn't throw errors — it only surfaces when a user asks "where
did my data go?"
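Here is a rough Rust rendering of the NoOp distinction, using traits in place of tagless final algebras. SOPService is from the text; the second, non-data category (a best-effort metrics service) is my assumption, since the original listing is cut off after its first line.

```rust
#[derive(Debug, PartialEq)]
enum AppError {
    FeatureDisabled(&'static str),
}

/// Data-related service: callers assume success means the data was persisted.
trait SopService {
    fn save(&self, doc: &str) -> Result<(), AppError>;
}

/// Hypothetical best-effort service: nothing downstream depends on its result.
trait MetricsService {
    fn record(&self, event: &str) -> Result<(), AppError>;
}

struct NoOpSopService;

impl SopService for NoOpSopService {
    /// MUST fail: claiming success here would tell the caller the data exists.
    fn save(&self, _doc: &str) -> Result<(), AppError> {
        Err(AppError::FeatureDisabled("SOPService"))
    }
}

struct NoOpMetricsService;

impl MetricsService for NoOpMetricsService {
    /// May succeed: dropping a metric loses nothing the caller relies on.
    fn record(&self, _event: &str) -> Result<(), AppError> {
        Ok(())
    }
}
```

The two implementations look almost identical, which is the point: nothing in the code shape tells the grunt which one is allowed to return Ok(()), so the rule file has to.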
Rule Engineering: More Important Than Tech Stack Choices
In AI-native development, the most important early investment isn't debating MySQL vs PostgreSQL or Spring WebFlux vs Vert.x — it's building a clear set of rule files. Good tech choices have value, but a bad tech choice can be migrated, and migration costs have dropped significantly in the AI era. Style drift from missing or ambiguous rules? A few months later you've got a dumpster fire where every session is crapping in a different direction — that's harder to fix than picking the wrong database.
"Longer Rules = Worse Results" — Really?
Someone cited a paper (arXiv:2602.11988) claiming my rule files are too long, and research shows rule files have a negative effect on agent performance.
The argument: "You write specs, agents.md, every little detail included, as if you think laws get passed and localities automatically obey. Why would the model listen to you?"
I don't dispute the study's conclusion — yes, existing rule files on GitHub perform worse the longer they get. But the evaluation's premises aren't practically meaningful:
- The benchmark is one-shot bug-fix tasks, not ongoing maintenance
- It measures "was the bug fixed," not "did engineering health improve"
Anyone who's done engineering knows: patches save the moment but not the future. Patches pile up, this agent fixes and checks out, the next agent eats the mess. I care about the ongoing maintenance perspective, where rule files' value isn't making the current task faster — it's preventing every new session from pulling the code in a different direction.
Detailed ≠ Clear and Actionable
But the paper does hit a real problem: most rule files are terribly written. Not because they're too long, but because they're riddled with ambiguity.
Example:
Rule 1: When it gets dark, go home.
Rule 2: When you're sick, go to the hospital.
So what do you do when you get sick at night?
I had Claude reverse-audit my own rule files and found tons of these conflicts. Even code style constraints contradicted each other. Every time AI hits such ambiguity, its CoT (Chain of Thought) produces paragraphs of "case-by-case analysis" reasoning — reading more files to determine priorities, parsing context to guess the human's true intent.
The more it reads, the more input tokens, the closer it gets to cognitive decline.
Military-Grade Precision
So the goal of rule files isn't "cover everything" but rather: reduce the situations where AI has to reason on the spot or read more context because instructions are vague or ambiguous.
These things are like military orders — they must be specific enough to execute. I need to eliminate any room for ambiguity.
Slogan-style rules are the deadliest poison. Take "always use tagless final style" — sounds clear, right? But AI starts a new session, writes code that seems fine. Past 30% of the context window, it starts drifting:
// Rule says "tagless final," AI complies, but gets it wrong
The AI didn't even write ParserService as
[F[_]: ParserService] in the class constructor. Why?
Because "always use tagless final style" is a slogan, not an executable
instruction. It doesn't tell AI what to do in specific
scenarios.
The same problem appears with tool usage. Even with LSP (like Scala's Metals MCP) connected, AI still defaults to Grep during refactoring — because 99% of code reading in its training data is plain text search. You must write clearly in the rule file: which scenarios call for LSP (what did the compiler resolve?) vs which call for Grep (where does this text appear?). Having good tools isn't enough — you need to teach AI when to use them. (See Appendix 1 for the detailed Grep vs LSP division of labor.)
What Military Orders Really Mean: Unambiguous Execution + Unconditional Mutual Trust
I said rule files should be as precise as military orders. But military orders aren't just about "clear writing" — they work because of the chain of trust.
Think of the scene in The Wandering Earth 2 where Zhou Zhezhi orders the engines ignited. The internet is still down, delegates from each nation hesitate. He says just one line:
"When the countdown ends, ignite. I believe our people can complete the mission."
Even though Ma Zhao had already sunk to the bottom and Tu Hengyu was already somewhat dead, Zhou Zhezhi still believed that even dead men could complete the mission.
Collaboration between agents works the same way. When an agent
writing business logic sees the signature
fetchUser(id: UserId): IO[Either[AppError, User]], it
should unconditionally trust that signature — trust
that the upstream agent will indeed return Left(NotFound)
when the user isn't found instead of throwing an exception, trust
that the downstream agent will correctly handle this
Either. It doesn't need to open fetchUser's
implementation to verify "does it really return NotFound?" It doesn't
need to add a defensive try-catch just in case.
Trusting the signature means trusting the comrade who wrote
it. This directly reduces token consumption and reasoning scope
— see the "Token Economics" section below for detailed analysis.
This is why "be pragmatic" is a slogan, and "don't over-defensively program" is also a slogan — they don't tell the agent specifically where to trust and where to defend. Military-grade rules say: what the signature declares, trust unconditionally; what the signature doesn't declare, that's where you defend.
Why Rule Files Are Full of Absolute Statements
If you've read my rule files, you might notice heavy use of absolute
assertions — "trust the compiler, no extra defensive programming," "the
type system's judgment is the final verdict," ".getOrElse
silently swallowing errors is forbidden." Strictly speaking, these
aren't always true: compilers have bugs, type systems have
expressiveness blind spots, and open-source libraries have all sorts of
bugs — some scenarios genuinely need defense.
But this is deliberate, serving two purposes.
First, protecting the investment in type-level
constraints. We spent significant effort encoding constraints
into the type system — opaque type prevents mix-ups,
sealed trait exhausts errors, NonEmptyList
prevents empty. Having invested these costs at the type level, we should
trust the compiler to hold these lines — no need for runtime
defensive checks everywhere on top. In practice, bugs I write
while bleary-eyed far outnumber bugs the compiler sneaks in (14 years in
the industry and I've genuinely never had a production incident caused
by a compiler bug — thank you, compiler, take a bow).
Second, countering the model's training bias. This
is the more insidious issue. During training, models saw enormous
amounts of "hit a type mismatch → bypass with
.asInstanceOf" and "got an Either → swallow
the Left with .getOrElse(defaultValue)." These are
high-frequency "success" patterns in training data — the code compiles
and runs. The result: when the grunt past 30% context encounters
strict type constraints, its first instinct is often not to widen the
fix, but to find a shortcut around the constraint.
So the rule file says: unless the business scenario
explicitly requires a default value (e.g., Option's default
behavior), using .getOrElse, try-catch safety
nets, or IO.handleErrorWith to silently swallow errors is
forbidden. This rule reads as "absolute prohibition," but its
real meaning is: flip the default behavior from "swallow errors" to
"propagate errors," with exceptions only when a human explicitly decides
"this really should use a default value."
These two purposes are like soldiers standing back to back: absolute rules pull the agent back from training bias and force it onto the "trust the compiler" path; simultaneously, I promise the project's overall style will maintain consistency — runtime exceptions not declared in type signatures won't appear. If they do, that's my fault, not the agent's. The agent trusts the type system; I guarantee the type system is worth trusting.
This contract has another advantage that only surfaces during
production incident debugging: banning error-swallowing means
the original error information always exists. When production
breaks, the debug agent gets the raw, unaltered exception stack and
error type — not some fallbackValue spit out by a middle
layer's handleErrorWith, where you don't even know what the
real exception was or which layer it happened at. Rigorous, consistent
coding constraints make the entire project's error propagation path
predictable: errors propagate from their origin along the path declared
by type signatures all the way to the outermost layer, never getting
secretly hijacked by defensive code in the middle. The debug agent just
follows this path to quickly locate the real fault, rather than staring
blind at an error chain truncated by handleErrorWith,
forced to read multiple files guessing the real exception source,
attempting a fix, discovering the guess was wrong, reading more files,
guessing again, and so on. Every instance of masked error is
another blind trial-and-error cycle imposed on future debug agents and
maintainers.
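A minimal sketch of the two defaults the rule chooses between. The names here (PortError, parsePort, the 8080 fallback) are made up for illustration and do not come from my rule files.

```scala
// Hypothetical error ADT and parser, for illustration only.
enum PortError:
  case Missing(key: String)
  case NotANumber(key: String, raw: String)

def parsePort(raw: Option[String]): Either[PortError, Int] =
  raw match
    case None    => Left(PortError.Missing("PORT"))
    case Some(s) => s.toIntOption.toRight(PortError.NotANumber("PORT", s))

// Swallowing: the Left disappears. Callers can no longer tell
// "configured as 8080" apart from "misconfigured, fell back to 8080".
def portSwallowed(raw: Option[String]): Int =
  parsePort(raw).getOrElse(8080)

// Propagating: the signature keeps the failure visible, and the
// original error value reaches whoever decides what to do with it.
def portPropagated(raw: Option[String]): Either[PortError, Int] =
  parsePort(raw)
```

With the input Some("80x"), the first version hands back a perfectly healthy-looking 8080, while the second hands the debug agent the exact key and the exact raw value that failed.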
Absolute statements are calibration parameters against training bias. Like corrective lenses: a nearsighted eye focuses too strongly, so a concave lens is added to cancel the bias and bring the world back into focus.
This also means: the degree of absoluteness in rule files should be adjusted as model capabilities evolve. If future models no longer tend to bypass type checks or swallow errors by default, these "absolute prohibitions" can be relaxed to "prefer to avoid" or even removed. Rule files aren't a constitution — they're calibration parameters for a specific model version.
But with discipline this strict, won't you get the military equivalent of "hold position, never retreat, total annihilation"? Yes. A "no swallowing errors" rule protects code quality 99% of the time — but when a non-critical metrics report failure crashes the entire request, the rule is too aggressive. The solution: the thing sitting on my shoulders isn't decorative. Military orders exist to automate 95% of routine decisions, letting human judgment focus on the 5% of exceptions. We have a meta-rule: when strictly following a rule produces clearly unreasonable results, flag it for human decision rather than quietly working around the rule. The grunt's job is to execute and report, not to "adapt flexibly" on its own initiative.
Reverse Audit: Making AI Whip AI
The most effective maintenance method I've found is: having Claude reverse-audit the rule files themselves.
Ask directly "hey Claude, how are my rules?" and Claude will just praise you: "Very deep, very insightful, expert-level work." But if I rephrase:
"Imagine you're a brand-new session's Claude, reading this rule file for the first time. List everything that confuses you: which rules conflict with each other? Which scenarios leave you unsure which rule to follow? Which instructions do you understand the intent of but don't know how to concretely execute?"
That's when it honestly tells me: this conflicts with that; in this scenario both rules apply but give opposite guidance; this rule — I understand what you want, but when facing actual code, I have three possible interpretations.
This process requires repeated iteration. My rule files have gone through dozens of revisions. After each revision, I have it audit again, finding new ambiguities. Many of these are things senior Scala engineers take for granted — conventions that don't need to be spoken. But for AI, if you don't write it down, it doesn't know. It knows what you might want (training data), but in a new session it can't guess which specific version you want, and falls back to the training bias default.
The Real Barrier
Many people say "embracing AI" has no barrier to entry — just needs tokens.
It actually has quite a barrier.
Look at OpenClaw: all those vibe coding masters, even after being absorbed by OpenAI, still haven't produced a particularly good agents.md file. Why? Because agents need extremely clear, specific guidance to get things done, and writing such guidance requires two capabilities:
- You must deeply understand what you want AI to do (domain expertise)
- You must be able to identify ambiguities in your own expression (metacognitive ability)
This is also why coding agents keep getting stronger at type gymnastics and at reading compiler error hieroglyphics: these are perfectly clear, unambiguous symbolic reasoning tasks, which agents handle effortlessly.
Conversely, read an AI's CoT and you'll see that it frequently spends 2-3 paragraphs guessing the true intent of the human instruction. Then it reads several more files, discovers it guessed wrong, spends another 2-3 paragraphs guessing, ad infinitum. It's not stupid; the human instructions are just too ambiguous. Writing prompts doesn't require paying for a course (that's a tax on the gullible), but you need to be willing to iterate with Claude, refining your instructions back and forth. Nobody can do that for you.
Four Layers of Constraints
The above covered "how to write rules clearly." But there's a prerequisite question: not all constraints need to be rules — some the compiler already handles, some can only rely on human judgment. Cramming everything into the rule file causes the token bloat and instruction conflicts we already discussed.
In practice, I divide constraints into four layers, forming a gradient from "fully automated" to "fully human-dependent":
Layer 1: Compiler-enforced — no rules needed. Type
signatures, sealed trait exhaustiveness, opaque type
anti-confusion — these are the compiler's job. Covered extensively in
earlier sections. Principle: if a constraint can be encoded into
the type system, don't write it as a text rule. The compiler
never forgets to check; rule files will.
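A small sketch of what "the compiler's job" looks like in Scala 3. All the names (Ids, UserId, OrderId, FetchError, greet) are hypothetical, and a Scala 3 enum stands in for a sealed trait hierarchy.

```scala
// Hypothetical domain types, for illustration only.
object Ids:
  opaque type UserId = String
  opaque type OrderId = String

  object UserId:
    def apply(raw: String): UserId = raw

  object OrderId:
    def apply(raw: String): OrderId = raw

  extension (id: UserId) def value: String = id

import Ids.*

// A closed set of errors: the compiler knows every case.
enum FetchError:
  case NotFound, Forbidden, Timeout

def statusCode(e: FetchError): Int = e match
  case FetchError.NotFound  => 404
  case FetchError.Forbidden => 403
  case FetchError.Timeout   => 504
  // Deleting any case above makes the compiler report a non-exhaustive
  // match (a warning, or an error when warnings are treated as fatal).

def greet(id: UserId): String = s"hello, ${id.value}"

// greet(OrderId("o-1"))  // does not compile: an OrderId is not a UserId
```

Neither "don't pass an order ID where a user ID is expected" nor "handle every error case" needs a sentence in the rule file here; both are checked on every compile.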
Layer 2: Clear criteria for pattern selection — must be actionable rules. Constraints the compiler can't enforce but that have clear if-then criteria. This layer is the rule file's main battlefield.
The Trusted/Untrusted dichotomy discussed earlier belongs here: the compiler can't distinguish "parsing a config file" from "parsing a user upload," but the rule can be written as "persisted data → throw, pre-persistence external data → return Either" — clear criteria, no ambiguity.
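As a sketch, the criterion might look like this in code. Only the trusted/untrusted split comes from the rule; Email, UploadError, and both function names are invented for the example, and the "contains @" check is a placeholder for real validation.

```scala
// Hypothetical names throughout.
final case class Email(value: String)

enum UploadError:
  case InvalidEmail(raw: String)

// Untrusted: external data that has not been persisted yet.
// Bad input is an expected outcome, so it is part of the signature.
def parseUploadedEmail(raw: String): Either[UploadError, Email] =
  if raw.contains("@") then Right(Email(raw))
  else Left(UploadError.InvalidEmail(raw))

// Trusted: data we already validated and persisted ourselves.
// A bad value here means corruption or a bug, so fail loudly.
def emailFromDbRow(column: String): Email =
  if column.contains("@") then Email(column)
  else throw new IllegalStateException(s"corrupt persisted email: $column")
```

The two functions perform the same check; the rule decides only how a failure is reported, and that decision is one the compiler has no way to make.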
Another typical example is trigger timing for gradual migration. We wrote a rule:
When a file is modified for any reason (even just fixing a typo), if a service in that file still uses
Either[String, T], you must migrate it to an ADT error enum while you're at it.
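A before/after sketch of the migration this rule triggers. The service, the error cases, and the status mapping are all hypothetical stand-ins.

```scala
final case class User(id: Long, active: Boolean)

// Before: a stringly-typed error. Callers can only log it or compare text.
def findUserLegacy(id: Long): Either[String, User] =
  if id <= 0 then Left("invalid id")
  else if id == 42 then Right(User(42, active = true))
  else Left("user not found")

// After: a closed set of failures that callers can match exhaustively.
enum UserError:
  case InvalidId(id: Long)
  case NotFound(id: Long)

def findUser(id: Long): Either[UserError, User] =
  if id <= 0 then Left(UserError.InvalidId(id))
  else if id == 42 then Right(User(42, active = true))
  else Left(UserError.NotFound(id))

// A caller (for example a route) now has to account for every case.
def toStatus(result: Either[UserError, User]): Int = result match
  case Right(_)                     => 200
  case Left(UserError.InvalidId(_)) => 400
  case Left(UserError.NotFound(_))  => 404
```

Any caller that still expects a String on the left stops compiling after the change, which is how the migration finds its own next stop.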
This rule solves: when to repay technical debt. Without it, AI defaults to minimal changes — asked to fix a bug, it changes only that one line, never touching technical debt. But dedicating a "refactor sprint" to repaying debt lacks urgency and test coverage.
"Fix it when you touch it" is an elegant balance: you're already QA-ing this module for this change, so the incremental testing cost of migration approaches zero. But this strategy is counter-intuitive for grunts — it must be explicitly stated. The rule also has a recursive effect: after migrating the service's error types, the route file that calls it fails to compile, so follow the compiler's guidance and fix the route too. The rule's scope follows the compiler — no need for humans to worry about boundaries.
Layer 3: Cross-session process constraints — use the filesystem to compensate for memory loss. Agents have no memory. Every new session is a blank slate. This means: cross-session quality assurance can't rely on the agent's "awareness" — it must be encoded as persistable processes.
Code Smell Tracking is a concrete approach we've
developed in practice. While modifying file A, AI frequently reads files
B, C, D in passing. It might notice D has an obvious code smell — say,
an Either[String, T] not yet migrated to a domain error, or
severely misleading naming. But if it fixes D now, scope explodes. A
simple bug fix becomes a 10-file refactor.
My previous approach was having AI mention at the end of the current task: "by the way, file D has an issue." But when the next session starts, that remark vanishes — I can never recall what the code smell was.
So Claude and I established this rule:
Discover code smell in an unrelated file
AI discovers and records; humans prioritize and trigger. The filesystem serves as the agent's missing long-term memory. The 10-entry cap prevents infinite list bloat.
It's not a perfect solution, but it genuinely mitigates "continuous code quality degradation" through long-term memory.
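To make the cap concrete, here is one possible shape for the bookkeeping. This is a sketch under loud assumptions: the Smell fields, the Recorded result type, and especially the "reject when full and ask a human" behavior are my illustration here, not the wording of the actual rule.

```scala
// Hypothetical shape of one tracked entry.
final case class Smell(file: String, note: String)

val MaxEntries = 10 // the 10-entry cap

enum Recorded:
  case Added(entries: List[Smell])
  case AlreadyKnown(entries: List[Smell])
  // Assumption: a full list is reported to the human instead of evicting entries.
  case ListFull(entries: List[Smell])

def record(entries: List[Smell], found: Smell): Recorded =
  if entries.contains(found) then Recorded.AlreadyKnown(entries)
  else if entries.size >= MaxEntries then Recorded.ListFull(entries)
  else Recorded.Added(entries :+ found)
```

The agent only ever appends or reports. Deciding which entries get fixed, and when, stays with the human.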
Layer 4: AI suggests + human decides — advisory rules. Some constraints: AI can identify "this might need attention" but can't judge "is it worth doing." Rules at this layer aren't commands — they're suggestions.
Runtime Assertion Checks (RAC) are a typical advisory rule. We tell AI in the rule file: on the following critical paths, consider adding runtime assertions:
- Assert balance ≥ 0 after monetary operations
- Assert state machine transition legality (draft → processing → published, no reverse)
- Assert schema matches expected tenant before multi-tenant writes
- Assert vector dimensions match the model (768 for text, 1408 for video)
But the rule also states: "suggest, not mandatory" — final decision rests with human code review. Why not mandatory? Because assertions' value depends on business context: a state transition in an internal tool might not warrant an assertion, but one involving money absolutely must. AI can scan all code paths to find candidate locations (its advantage — humans can't check every state transition line by line), but "how severe are the consequences if this path fails" is a business judgment.
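Three of the candidate assertions above, sketched with the standard assert. DocState and the function names are hypothetical; the draft, processing, published order is the one from the list.

```scala
enum DocState:
  case Draft, Processing, Published

// Legal transitions only move forward: draft -> processing -> published.
def isLegal(from: DocState, to: DocState): Boolean = (from, to) match
  case (DocState.Draft, DocState.Processing)     => true
  case (DocState.Processing, DocState.Published) => true
  case _                                         => false

def transition(from: DocState, to: DocState): DocState =
  assert(isLegal(from, to), s"illegal transition: $from -> $to")
  to

def debit(balance: BigDecimal, amount: BigDecimal): BigDecimal =
  val next = balance - amount
  assert(next >= BigDecimal(0), s"balance went negative: $next")
  next

def checkedEmbedding(vector: Vector[Float], expectedDim: Int): Vector[Float] =
  assert(vector.length == expectedDim, s"expected $expectedDim dims, got ${vector.length}")
  vector
```

Whether debit deserves its assertion in a given codebase is exactly the business judgment the rule leaves to code review.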
Deployment impact analysis also belongs to this
layer. Code changes have two types of impact: compile-time impact caught
by the type system (discussed earlier), but deployment-time
impact has no compiler to check. A new environment variable in
the code means the Kubernetes ConfigMap needs a new line, Secrets need
configuration, maybe IAM permission bindings too. Code compiles, tests
are green, push to production, service crashes on startup because of a
missing environment variable. And the even more hopeless scenario: a fee
calculation ratio environment variable defaults to 0 —
doesn't crash without configuration, but silently runs with the wrong
default for a week until the boss asks: "Why hasn't the fee account
balance changed in the last week?"
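The silent-default scenario can be sketched like this. FEE_RATIO and EnvError are hypothetical names; the environment is passed in as a Map (for example sys.env at startup) so the behavior is easy to check.

```scala
import scala.util.Try

enum EnvError:
  case Missing(name: String)
  case Invalid(name: String, raw: String)

// Risky: a missing or malformed variable silently becomes 0,
// and the service keeps running with the wrong ratio.
def feeRatioWithDefault(env: Map[String, String]): BigDecimal =
  env.get("FEE_RATIO")
    .flatMap(raw => Try(BigDecimal(raw)).toOption)
    .getOrElse(BigDecimal(0))

// Fail-fast: the caller sees at startup that configuration is incomplete.
def feeRatio(env: Map[String, String]): Either[EnvError, BigDecimal] =
  env.get("FEE_RATIO") match
    case None => Left(EnvError.Missing("FEE_RATIO"))
    case Some(raw) =>
      Try(BigDecimal(raw)).toOption.toRight(EnvError.Invalid("FEE_RATIO", raw))
```

Failing fast only turns the silent week into a crash on startup. Someone still has to add the variable to the deployment config, which no compiler will remind them to do.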
AI has an advantage humans lack here: it sees the complete diff. Humans modifying code focus on business logic — deployment impact is "I'll deal with it later" and then forgotten. We require AI in the rule file to automatically output a deployment impact checklist at task end:
## Deploy Impact
The four layers, top to bottom, with increasing human involvement:
| Layer | Human role | Frequency |
|---|---|---|
| Compiler-enforced | Choose language & type system | One-time |
| Actionable rules | Make implicit knowledge explicit | Ongoing maintenance |
| Process constraints | Design AI's workflow | Occasional tuning |
| Advisory rules | Decide on AI's suggestions | Every review |
This is the outcome I'm after: human brainpower is finite and precious. The purpose of layering is to focus human attention on Layer 4 — where genuine business judgment is needed — while Layers 1-3 are handled automatically by the compiler and rules.
The Bigger Picture
The Ironic Ending
FP has been criticized for decades as "unreadable without a PhD." But in the AI collaboration model:
Humans carefully read signatures — which happens to be FP's most readable part. Humans skim implementations — which happens to be FP's most off-putting part.
FP's cost (cognitive burden of implementation) falls on AI: AI doesn't care. FP's benefit (explicit, verifiable type contracts) goes to humans: humans just need to confirm "yep, looks good, LGTM."
And AI doesn't just "not care" about FP's complexity — it actually
makes fewer mistakes in the FP model. Like a calculator
computing 1+1 and 69420+80085 in the same
time, AI's per-line cost for type gymnastics vs plain assignment is
roughly identical. But a project isn't one line — it's tens of thousands
of lines accumulated over years. Mutable state + temporal reasoning
means every additional line exponentially grows the state space AI needs
to track; immutable + composition grows it linearly. Over tens of
thousands of lines, the error rate gap is orders of magnitude. More
critically, the type system provides deterministic instant
feedback — compilation failure is failure, massively
eliminating "looks right but explodes at runtime." Not completely:
external systems, hardware calls, network timeouts are beyond the type
system's reach. But within its domain (nulls, error paths, parameter
type confusion), feedback is instant and certain. Dynamic language
feedback loops are far longer: write → run tests → discover failure →
guess which step's state went wrong → backtrack.
AI makes certain capabilities cheap: type gymnastics, symbolic reasoning, complex monad transformer stacks. What can't be made cheap is what's truly precious: judging what a system should do, defining correct abstraction boundaries, deciding which constraints are worth encoding into types. Calculators can't replace mathematicians; AI can't replace architects.
The FP community has waited decades for its "this time it'll catch
on" moment. It seems the most powerful catalyst isn't a shift in human
aesthetic taste, but AI's natural affinity for explicit type
information. And all humans need to do is free their brainpower from
"understanding semiflatMap" and spend it where it matters:
defining what the system should do, not worrying about how the
system does it.
AI-native = ADHD-native
This section is personal, but I think it explains things that are hard to grasp from a purely technical angle.
I have ADHD. In past work, I constantly made small mistakes —
swapping variable order, forgetting to update loop state, losing track
in deep if-nesting, guessing i+1 or i-1 for
array bounds by pure luck. My short-term working memory is terrible —
like an agent with a limited context window: processing function A's
logic, jumping to function B, and when I come back, half of A's context
is gone. Jump to another task and back? Details have almost entirely
evaporated.
So my gravitating entirely toward FP was practically inevitable. Immutable data means I don't need to remember "what state is this variable in right now"; type signatures mean I don't need to remember "how can this function fail"; compiler instant feedback means when I forget something, it tells me immediately. I use the type system to compensate for my short-term memory deficits, just like I have agents use signature contracts to compensate for context window limits.
But ADHD isn't just weaknesses. My long-term memory and episodic memory are strong — decisions made in a meeting months ago, the context behind the decision, why we chose this path instead of that one, I remember more accurately than the meeting notes. During technical discussions, I frequently get flashes of insight — weird alternative approaches — which get shot down by the meeting moderator for being off-topic. But in agent collaboration, this becomes an advantage: it's like a trigger for reactive knowledge retrieval in an awakened agent.
Putting my cognitive profile alongside AI's:
| | Me (ADHD human) | AI Agent |
|---|---|---|
| Short-term memory | Poor, easily loses context | Limited by context window |
| Long-term memory | Strong, rich episodic memory | None (every session starts from zero) |
| Symbolic reasoning | Weak, prone to trivial errors | Strong, but also makes mistakes |
| State space reasoning | Very weak, mutable state tracking is a nightmare | Relatively weak, error rate rises with state explosion |
| Compiler feedback | Lifesaver, compensates for my symbolic reasoning deficits | Same lifesaver, corrects its reasoning errors |
| Architectural intuition | Strong, what to split, what to merge | Weak, tends toward local optima |
| Cross-domain association | Strong, but often suppressed in human teams | None, unless human prompts |
Our weaknesses overlap heavily; our strengths complement each other perfectly. What I'm bad at (concrete implementation, symbolic reasoning, state tracking), AI is better at. What AI is bad at (architectural decisions, long-term memory, cross-domain association), I'm better at. And our shared weakness, complex state space reasoning: if we can't beat it, we go around it.
This is why every design choice in this article points in the same direction: let the compiler compensate for weaknesses it can (type system, exhaustiveness checking), let AI do what it's good at (implementation, symbolic reasoning), let me do what I'm good at (architecture, rules, cross-domain association). My architectural designs must shift direction to accommodate our shared weaknesses — more decoupled, more isolated, semantics above all, top-level design oriented toward FP.
AI-native coding style is really the ADHD-native coding style I've been using all along. Not because ADHD is a good thing, but because the compensatory mechanisms I built for cognitive deficits happen to also suit AI's strengths. The topic of what role humans play in this division of labor, how they work, and which cognitive habits need changing — that's too big for this article and deserves its own piece.
"Can't Read AI-Written Code?"
This is the most common objection. AI-written FP chain code —
EitherT, semiflatMap, bimap —
humans can't read it. What happens when there's a production
incident?
Oh right, as if you can read assembly.
In today's software stack, from the Java/Scala you write to the machine code actually executing on the CPU, how many layers do you pass through that you can't read? JIT-compiled native code, OS system calls, hardware interrupt handlers — you've never felt unable to debug just because you "can't read those intermediate layers." Because you don't need to read them. You debug at your own abstraction layer.
In fact, in 2026, when senior engineers genuinely need to debug at the assembly level, they throw the assembly at AI for an explanation. AI translates assembly into plain language; the engineer reasons on top of the plain language.
FP abstract code works the same way: can't read the
EitherT chain? Throw it at AI and have it explain in
natural language — "this code first fetches the user, validates, then
fetches the score; any step failing returns the corresponding HTTP error
code." AI can both write this alien scripture and translate it into
plain language.
Moreover, FP code's debug difficulty and depth are far lower than stateful imperative code:
- No mutable state: no need to track "this variable was modified at line 47, then again at line 123, which version does line 200 read?" Pure functions' output depends only on input — same input always yields same output.
- Explicit error paths: Either[AppError, User] tells you errors are only those few AppError cases. No need to guess "might some deep call throw a NullPointerException?"
- Composability: every function is an independently testable unit; bug localization scope is naturally small.
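The "fetch the user, validate, fetch the score" chain from the plain-language translation can be sketched with plain Either standing in for EitherT, to keep the example dependency-free. Every name and stub value here is hypothetical.

```scala
enum AppError:
  case UserNotFound(id: Long)
  case UserSuspended(id: Long)
  case ScoreUnavailable(id: Long)

final case class User(id: Long, suspended: Boolean)

// Stubbed lookups: user 1 is fine, user 2 is suspended, others are unknown.
def fetchUser(id: Long): Either[AppError, User] =
  if id == 1 then Right(User(1, suspended = false))
  else if id == 2 then Right(User(2, suspended = true))
  else Left(AppError.UserNotFound(id))

def validate(u: User): Either[AppError, User] =
  if u.suspended then Left(AppError.UserSuspended(u.id)) else Right(u)

def fetchScore(u: User): Either[AppError, Int] =
  if u.id == 1 then Right(97) else Left(AppError.ScoreUnavailable(u.id))

// The first Left short-circuits the rest of the chain.
def userScore(id: Long): Either[AppError, Int] =
  for
    user  <- fetchUser(id)
    valid <- validate(user)
    s     <- fetchScore(valid)
  yield s

def httpStatus(e: AppError): Int = e match
  case AppError.UserNotFound(_)     => 404
  case AppError.UserSuspended(_)    => 403
  case AppError.ScoreUnavailable(_) => 503
```

Each step is a pure function of its input, so a failing case can be reproduced by calling that one function with that one value.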
Token Economics
In the "military trust" section I dropped a hot take: trusting signatures means trusting comrades. And this trust behavior saves token costs.
Every act of distrust is a token expense, growing
Fibonacci-style. When an agent doesn't trust the signature, it
needs to open fetchUser's implementation to verify "does it
really return Left(NotFound) when user is not found?" —
reading one file. Then discovers fetchUser calls
queryDB — needs to confirm queryDB's error
handling too, reading another file. Ten functions each verified once is
ten extra file reads. Worse, consider the token billing model: file
contents read back from each tool call become input tokens for the next
round, and the output reasoning process gets billed as input after the
next tool call. In other words, every token ever generated adds
to the price of every future call — the more files read, the
more context bloats, the more every subsequent step's token bill
snowballs. Trusting signatures means the agent only needs to read the
current file to do its work; distrusting signatures means every
additional file read causes the remaining steps' token bills to inflate
in lockstep.
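A back-of-the-envelope version of the snowball. This is a deliberately simplified model and an assumption on my part, not any provider's billing formula: every call re-sends the whole context so far as input, each file read adds a fixed number of tokens, and caching and output tokens are ignored.

```scala
// Simplified model: call i carries the base context plus the i files read so far.
// One call happens before any read, then one call after each read.
def totalInputTokens(base: Long, tokensPerRead: Long, reads: Int): Long =
  (0 to reads).map(i => base + i * tokensPerRead).sum
```

With a 2,000-token base context and 1,500 tokens per file, zero extra reads costs 2,000 input tokens in total, while ten distrustful reads cost 11 × 2,000 + 1,500 × (0 + 1 + ... + 10) = 22,000 + 82,500 = 104,500. Under this model the total grows roughly quadratically with the number of reads, because every file read is paid for again on every later call.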
Trust chains + scope isolation also open up bigger architectural possibilities:
Coding agents can be smaller, cheaper, faster. When scope is tight enough and modules are isolated thoroughly enough, a coding agent doesn't need a global view — it only needs to see the signatures of functions it's responsible for, the signatures of dependency interfaces, and relevant type definitions. Solving within a given contract is all there is. It doesn't even need the strongest model — when the task is constrained tightly enough, a mid-tier model with clear signatures and type constraints can do the job correctly. The more precise the contract, the lower the model capability requirement.
Difficulties can be escalated rather than toughed out. When a coding agent hits a problem it can't solve within its current scope — a poorly designed signature, a flawed type constraint, or ambiguous requirements — it doesn't need to "best effort" guess and force an implementation. The correct action is to report the issue back to the orchestrator, who adjusts the design or clarifies requirements, then assigns it to (possibly another) agent for execution.
Global consistency is ensured by a dedicated review agent. After multiple coding agents each finish work in their small scopes, a review agent with a larger context window checks the global changes for consistency — do interfaces align, do error types match, is naming uniform. This review agent doesn't need to understand every function's implementation details — it only needs to audit that the signature-level contracts are self-consistent.
This is my envisioned agent orchestration model:
Orchestrator (architect)
Outlook
Is Code a Liability or an Asset?
There's a widely quoted saying in software: Code is a liability, not an asset. Every line of code is future maintenance, comprehension, and modification cost. When you first wrote the code, only you and God knew what it did; after six months in production, only God can still read it.
This is entirely true in traditional development. Technical debt grows exponentially — each layer of hack makes the next hack harder to understand, each "temporary solution" digs a pit for the next maintainer. Taking over a codebase with technical debt, whether adding features or fixing bugs, is an uphill battle. Custom software projects almost have to be maintained by the original team or a domain-specific outsourcing team. Bring in a new group, and just understanding "what does this thing even do" takes months.
But what if we could keep technical debt growing linearly instead of exponentially?
All the engineering discipline discussed in previous sections — type signatures as contracts, sealed trait exhaustive errors, opaque type anti-confusion, fix-it-when-you-touch-it gradual debt repayment — share a common goal: keeping the code comprehensible to new maintainers (human or AI) at any point in time.
If this goal is achieved, the nature of a codebase fundamentally flips:
A codebase with strict discipline from day one makes adding features and fixing bugs no longer incredibly difficult. Not just for me — even developers who aren't the original authors can, to a reasonable degree, add custom features on top of this code, because new agents easily understand what past agents left behind. Signatures are honest, types are precise, error paths are exhaustive — no implicit knowledge that requires "veterans passing it down by word of mouth."
Of course, architecture-level adjustments still require the original author or a maintainer of equivalent capability and vision. But for feature-level development — adding an API within the existing architecture, fixing a business bug, migrating a data format — the required person-months drop dramatically. Because these tasks are fundamentally "solving within given contracts," and honest signatures plus strict type systems express those "given contracts" crystal-clearly.
The premise for a codebase transforming from liability to asset isn't "written well" but "maintained well." So can my Art of Whipping AI Grunts bring the cost of "continuous maintenance" to historically low levels?
The Next-Generation AI-native Language
Since I've already gone this far with the hot takes, a few more won't hurt: the next-generation AI-native programming language might genuinely not need to consider human writing or reading experience. Just like nobody hand-writes assembly today.
Could future programming languages bifurcate into two layers?
- Contract layer: pure signatures, contracts, intent expression — possibly more like a declarative specification language
- Execution layer: implementation language optimized for compilers and AI — since humans focus their energy on reviewing the contract layer, implementation readability drops dramatically in importance; human writing experience is no longer a design goal; information density and type precision are what matter
This is my science fiction. Today's Scala 3, Rust, and Haskell already have powerful type system expressiveness with implementations that increasingly look like alien hieroglyphics. The next-generation language just needs to: acknowledge that humans don't need to read implementations, then completely remove "human readability" from the implementation layer's design goals.
Applicability Statement
This article has three premises: one about AI architecture, one about project types, and one about language.
AI architecture premise: Today's mainstream transformer architecture — fixed context windows, no cross-session state, stateless inference starting from zero each conversation.
Project type premise: All practices discussed in this article apply to a specific class of software projects:
| Applicable | Not applicable |
|---|---|
| Backend systems, CRUD-heavy business apps | Infrastructure, compilers, embedded, hardware drivers |
| General frontend UI applications | OS kernels, real-time systems |
| Modules composed additively (feature stacking) | Tightly coupled modules (feature interleaving) |
| Mostly soft state, little hard state | Mostly hard state, little soft state |
| Data state persisted to external storage | State maintained in-memory in real-time |
The distinguishing criterion is the nature of state. This article assumes the typical scenario: state ultimately persists to a database, in-memory state is ephemeral and reconstructable (soft state). In this scenario, immutable + functional composition has low cost and high benefit, as argued throughout.
But in hard-state-dominant domains — compiler AST transformations, embedded register operations, hardware driver interrupt handling — state itself is the core abstraction, immutable data structure overhead is unacceptable, and tight inter-module coupling is a physical constraint, not a design flaw. In these domains, many of this article's recommendations aren't just inapplicable — they're harmful.
Language premise: This article's rule file examples and engineering practices are based on Scala. Scala is multi-paradigm: it lets you write pure FP, imperative, OOP, or any mixture. This means many of the rule file's constraints exist to pin the agent's behavior to a single Pure-FP paradigm, preventing drift between multiple legitimate styles. If your project uses Haskell, a large portion of these constraints is unnecessary: the language itself already enforces them.
If this article were translated to Rust, it'd be significantly
shorter. Rust's ownership system and borrow checker already eliminate
most mutable state issues at compile time — no need for rule files to
prohibit them. But even in Rust, I'd still write: agents are
forbidden from independently declaring global mutables
(static mut, lazy_static + Mutex,
etc.); local mutables (let mut) are forbidden from spanning
more than 2 scope levels, and absolutely forbidden from escaping the
function. Similarly, I'd enforce agents using
Has<T> traits for compile-time dependency injection —
Rust's version of tagless final: service dependencies expressed through
generic constraints
where Ctx: Has<UserRepo> + Has<AuthService>,
not passing a bunch of concrete types in function parameters. The
signature-layer design principle doesn't change with the language — only
the syntax differs.
And Rust's let-else + ? has even lower
reasoning cost for agents than Scala's cats gymnastics:
let Some(user) = user_svc.find(id).await else { return Err(DomainError::UserNotFound) };
Each line is self-contained: input, check, failure path — all closed
within the same line. The agent processing line 2 doesn't need to recall
line 1's branch structure — exactly the linear flow
(Style A) analyzed in the "Implementation-Level Error Handling" section,
just with Rust combining early return guard and pattern matching into
one with let-else. For agents, this has a shorter, more
local, less error-prone reasoning path than
EitherT(...).subflatMap(...).semiflatMap(...).
What the language handles, leave to the language; what the language can't reach still needs rules to fill the gap — this principle is cross-language.
But not every scenario should chase the lowest writing cost. Whether
Scala or Rust, I mandate AI use the Reactive Stream pattern. During the
writing phase, reasoning about Reactive Streams might cost several times
more tokens than an iterator + channel approach (in Rust, even tens or
hundreds of times more — ownership, &mut, lifetimes,
and other constraints might even force you to restructure data types
finalized months ago). But this upfront investment pays off: Reactive
Stream operators are themselves declarative behavior descriptions. When
debugging, agents don't need to chase scattered state mutations across
imperative code — they just look at the operator chain: messages being
dropped? .buffer(n, OverflowStrategy.dropHead) is right
there in black and white. Order scrambled?
.unorderedFlatMap(...) is staring at you. Each operator is
a self-explaining behavior declaration; the bug's cause is written in
the operator's name. The imperative equivalent?
LinkedBlockingQueue's capacity limit is buried in the
constructor, whether the queue blocks or drops when full depends on
whether the caller uses put() or offer(),
scattered in some corner of the producer code. Order issues are even
more hidden: ExecutorService.submit()'s multithreaded
scheduling makes consumption order a runtime-only observable behavior —
nowhere in the code does it say "ordering is not guaranteed." The agent
needs to trace queue initialization, producer logic, and thread pool
configuration across files to locate the same bug. Today's extra writing
tokens buy back massive reading and reasoning tokens for every future
agent.
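The put() versus offer() point is easy to see in a few lines using the real java.util.concurrent.LinkedBlockingQueue. The capacity of 2 and the message values are arbitrary choices for the example.

```scala
import java.util.concurrent.LinkedBlockingQueue

// The capacity lives in the constructor, possibly far from the producer code.
val queue = new LinkedBlockingQueue[String](2)

// offer() returns false when the queue is full: the message is dropped
// unless the caller checks the result. put() would block instead.
val accepted: List[Boolean] = List("a", "b", "c").map(m => queue.offer(m))

// accepted is List(true, true, false); "c" never entered the queue.
```

Nothing at the call site says "drops when full". An agent has to know the queue's capacity, know which of the two methods the producer uses, and notice that the Boolean result is ignored, which is the scattered-evidence problem described above.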
What does the AI architecture premise affect? Not everything.
Architecture-independent conclusions — won't change with model evolution:
- Signatures/contracts should be honest and complete (§1-§4). Regardless of reasoning architecture, explicit information beats implicit conventions. This is an information-theory judgment, not an assumption about specific model capabilities.
- The human-reviews-contracts, AI-writes-implementation division of labor (§3). This stems from the physical limits of human cognitive bandwidth, independent of AI architecture.
- Type systems and refinement types provide deterministic feedback. Compilers don't become less important because AI gets stronger.
Architecture-dependent conclusions — will change as model capabilities evolve:
- Rules correcting training preferences (§5's Grep vs LSP choice, error handling preferences, etc.). These rules fundamentally compensate for current models' training biases. As models continue evolving on existing architectures, these biases will shift: some bad habits may be corrected, and new biases may emerge. Rule files must be continuously maintained and retuned alongside model capabilities, and that upkeep is itself part of rule engineering.
- Cross-session process constraints (§6's Code Smell Tracking, memory files, etc.). These mechanisms exist entirely to compensate for stateless inference's deficiencies.
We don't even need to wait for "perfect memory." Even one small step would change the game: an RWKV-style architecture with persistent state, provided its inference capability approaches current transformer levels.
Imagine this workflow: you spend weeks collaborating with an agent, and it gradually accumulates understanding of the project's coding style, architectural decisions, and module boundaries in its persistent state. When you need to parallelize multiple tasks, you fork multiple sessions from that state. Each fork inherits the same project knowledge and independently handles code review, refactoring, or bug fixes, without starting from zero by re-reading CLAUDE.md and memory files.
This is fundamentally different from the current model. Right now, every new session is a novice agent + text-based rule files. You must textualize all implicit knowledge — "this project uses tagless final," "NoOp implementations for data-related services must return failure," "persisted data is trusted" — write it all into rule files, then pray the agent correctly interprets them within a limited context window. Rule files are essentially simulating long-term memory with text — it works, but clumsily, with a token budget ceiling.
A persistent state that has accumulated project understanding is like an engineer who's been on the team for months: no need to re-read the coding standard every morning, no need to write "why we chose this architecture" as a document to remember it. You no longer need to textualize every rule, because the rules are already internalized in the state.
When that day comes, most of this article's second category of conclusions (the precise wording of rule files, cross-session memory mechanisms, Code Smell Tracking's filesystem workaround) can be drastically simplified or dismantled entirely. Rule engineering won't disappear, but it shrinks from "meticulously textualizing everything" to "occasional course corrections," a completely different magnitude of effort.
But until that day arrives, the scaffolding is still essential.
Appendices
Appendix 1: AI's Toolchain — Grep, LSP, and Disambiguation
Even with Metals MCP (Language Server Protocol tooling) connected, AI still prefers to use regex search and replace throughout refactoring — Grep + regex is the most well-worn path in its training data.
But Grep has clear capability boundaries. Through repeated experimentation, we've mapped out a clear division of labor:
Use Metals (compiler resolution) when the question is "what did the compiler resolve this to?"
| Scenario | Why Grep fails |
|---|---|
| Given/implicit resolution: "which given Transactor[IO] is in scope here?" | Grep searching given Transactor returns 10+ candidates; can't determine which the compiler selected |
| Extension method: "who defines .pure[F]?" | def pure is in the cats source, but Grep can't tell you which extension applies to your specific type |
| Opaque type unwrapping: "what's ProjectId's underlying type?" | Grep finds opaque type ProjectId = UUID, but chained calls require cross-file tracing |
| Overload resolution: "which apply is being called here?" | Grep finds all overloads; requires manual parameter-type matching |
| Type alias + inheritance: "does ConnectionIO extend Sync?" | Requires tracing type ConnectionIO = Free[...] → Free → capability hierarchy, multiple hops |
| Wildcard imports: "what does import doobie.implicits.* bring in?" | Grep can't resolve it; requires reading the entire doobie package object |
Use Grep (text search) when the question is "where does this string appear?"
- String literals, SQL column names, config keys, HOCON values
- Comments, TODOs, error messages
- Cross-language files (YAML, SQL migrations, Dockerfile)
In one sentence: "How did the compiler resolve it?" → Metals. "Where is this text?" → Grep.
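The overload-resolution row is easy to reproduce outside Scala. The sketch below uses plain Java rather than the project's Scala code, and the class OverloadDemo and its apply overloads are hypothetical. A text search for "apply(" finds all three definitions, but only the compiler's resolution rules say which one a given call site binds to.

```java
// Hypothetical demo class: three overloads that a text search cannot tell apart.
final class OverloadDemo {
    static String apply(Object value)  { return "apply(Object)"; }
    static String apply(long value)    { return "apply(long)"; }
    static String apply(Integer value) { return "apply(Integer)"; }

    public static void main(String[] args) {
        // An int literal widens to long before boxing is even considered.
        System.out.println(apply(1));                  // apply(long)
        // An already-boxed Integer matches by subtyping; Integer is the most specific.
        System.out.println(apply(Integer.valueOf(1))); // apply(Integer)
        // A String only fits the Object overload.
        System.out.println(apply("1"));                // apply(Object)
    }
}
```

The first call is the instructive one: a reader matching parameter types by eye might expect apply(Integer), which is exactly the kind of guess a compiler-backed query removes.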
FQN Limitations and Proper LSP Usage
I used to constantly correct AI for writing fully qualified names
(org.springframework.http.HttpHeaders) — felt too verbose.
Later I realized FQN genuinely eliminates ambiguity. AI sees
HttpHeaders and doesn't know if it's
org.springframework.http.HttpHeaders,
io.netty.handler.codec.http.HttpHeaders, or
java.net.http.HttpHeaders.
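The same collision can be shown with nothing but the JDK. Because two of the three HttpHeaders classes above need third-party dependencies, this sketch swaps in a different pair that shares a simple name: java.net.Proxy and java.lang.reflect.Proxy. The class SimpleNameDemo and its methods are hypothetical.

```java
import java.lang.reflect.*;
import java.net.*;

// Hypothetical demo class. With both wildcard imports above, writing a bare
// `Proxy` anywhere in this file is a compile error ("reference to Proxy is
// ambiguous"), so every use must be fully qualified.
final class SimpleNameDemo {
    // The networking Proxy: describes how a connection is routed.
    static String networkProxy() {
        java.net.Proxy direct = java.net.Proxy.NO_PROXY;
        return direct.getClass().getName() + " / " + direct.type();
    }

    // The reflection Proxy: asks whether an object is a dynamic proxy instance.
    static boolean isDynamicProxy(Object candidate) {
        return java.lang.reflect.Proxy.isProxyClass(candidate.getClass());
    }

    public static void main(String[] args) {
        System.out.println(networkProxy());          // java.net.Proxy / DIRECT
        System.out.println(isDynamicProxy("plain")); // false
    }
}
```

Here the compiler at least refuses to guess. A reader who sees only a snippet with the imports cut off gets no such error and has to guess, which is the situation the agent is in.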
Can we go further and use SCIP (SemanticDB-based code indexing) to automatically annotate source files with FQNs? I had Claude evaluate this approach:
FQN is indeed useful for me. When I see
val vec = Pgvector(chunk.embedding), I don't know whether Pgvector comes from doobie.postgres, o.linewise.core.database.DoobieInstances, or somewhere else. FQN instantly eliminates this ambiguity. But I already have a better tool.
| Need | SCIP snapshot | Metals LSP |
|---|---|---|
| "What is PgVector?" | Read 3x-bloated annotated file | One inspect call, precise type |
| "Who calls resolveAuth?" | Grep the snapshot (equivalent to Grep source) | get-usages, semantic-level, not text-level |
| "What type does this expression return?" | Not in snapshot (SemanticDB doesn't contain sub-expression inferred types) | inspect returns it directly |
| "All implementations of PermissionService?" | Grep FQN pattern | typed-glob-search |
SCIP snapshot costs: 3x token bloat (150-line file becomes 450 lines), instantly stale (any edit invalidates it), doesn't cover the real pain points (debug/refactoring bottlenecks are implicit/given resolution chains — SemanticDB doesn't capture these; TASTy does).
Conclusion: give the agent LSP tools and let it query on-demand when it hits ambiguity, rather than burdening every line of code with redundant fully qualified paths.
Appendix 2: Inter-Model Collaboration and Knowledge Transfer
Code written by advanced models can "teach" ordinary models. High-quality code and skills written by top-tier models (like Opus) can effectively guide weaker models through development work.
All major models' training data contains large amounts of high-quality open-source code — the knowledge itself exists. The real difference comes down to two things:
- Weight allocation — different models give different weights to the same knowledge, causing some to naturally produce high-abstraction code while others default to more "mediocre" solutions
- Side effects of human alignment — this directly depends on the AI trainers' cognitive level. During training, models generate "wild ideas" — unconventional but potentially extremely effective strategies. If trainers lack the cognitive ability to recognize these wild ideas' value, see them diverge from the mainstream, and immediately penalize them, these high-value strategies get suppressed in the model. People with poor cognitive ability can't train good AI.
Practical tip: use an advanced model to "activate" this knowledge in-context ahead of time — write it as skills, example code, or rules in CLAUDE.md — then when ordinary models work in that context, they follow the rails already laid down instead of retreating to their default "safe" style.
Multi-model collaboration? Same-tier peer review works better. Some try having multiple model vendors review design documents, with Opus, Gemini, and GPT as three "experts" discussing and voting. In practice, models with too large a cognitive gap sitting at the same table don't produce effective discussion. Picture two college professors discussing a research proposal with a grade schooler in the middle: the child won't provide a "different perspective," just drag down the floor of the discussion.
A more effective approach: same-tier models reviewing each other, but with different assigned stances. For example, two Opus instances — one playing "aggressive refactor advocate," the other "stability-first conservative" — they have the capacity to understand each other's arguments and mount substantive rebuttals. Discussions between cognitively mismatched models only degenerate to "the weakest model's comprehension level."
This is fundamentally the same theme as this entire article: if the tools you're using can handle higher abstraction levels, don't downgrade to accommodate the weakest link. This applies to code style, and it applies to model collaboration.