Ming's Spell Compendium #2 -- You Terrifying Upright Apes Are Underestimating Yourselves

This is a cultural adaptation — not a literal translation — of the original Chinese article. Some Chinese cultural references have been swapped for Western equivalents that hit the same emotional note.

Previously on... Ming's Spell Compendium #1 -- One Year of Vibe Coding: A Cold Hard Look

Foreword

Humans, you're selling yourselves short.

How Much Can AI Amplify You?

Last year (2024), my working hypothesis was: even with AI, you can't dramatically exceed the user's own ability. If AI's training data represents the sum total of human knowledge, then the current crop of agentic evangelists would have you believe that professional expertise has become a worthless commodity in this era — because AI can write code like Linus, write novels like Hemingway, etc.

My position was: if your output ability in a domain is level N, then even with AI you can only perform at level N+1.

But lately I've been less sure about that.

Output, Input, and Discernment

My updated hypothesis: the key isn't your output level — it's your input level.

I'm not sure how to explain the difference. Think of it this way: how elegant a piece of code I can write — that's output. How elegant a piece of code I can read and appreciate — that's input. Or to use a fancier word: taste.1

But output and input are correlated — the precision of your taste correlates with your production ability, and the higher you go, the less reliable taste becomes without production experience. Film critics genuinely have better taste than average moviegoers, but when it comes to evaluating specific cinematography techniques or subtle technical choices, critics regularly fake expertise they don't have. They have taste, but they can't always precisely identify which dimensions separate two masterworks.

It's a bit like this: I found a YouTube tutorial on red-black trees, watched the whole thing, and felt like the clouds had parted and the angels were singing — scales fell from my eyes, my third eye opened, I could see the Matrix. This algorithm was practically designed for my brain. I'm basically unstoppable now. Then I closed the video, opened my IDE, tried to code it from memory... and couldn't even recall the data structure definition, let alone the constraints and rotations.
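For the record, the thing I blanked on is genuinely compact, which is what makes the input/output gap so humbling. A minimal sketch (Python; the naming is my own, not from any particular tutorial):

```python
# Minimal red-black tree sketch: the node definition plus one rotation.
# The invariants the video made feel "obvious" are easy to state, hard to keep:
#   1. Every node is red or black.
#   2. The root is black.
#   3. Every leaf (None) counts as black.
#   4. A red node never has a red child.
#   5. Every root-to-leaf path has the same number of black nodes.

RED, BLACK = "red", "black"

class Node:
    def __init__(self, key, color=RED, parent=None):
        self.key = key
        self.color = color      # new nodes are inserted red, then rebalanced
        self.parent = parent
        self.left = None
        self.right = None

def left_rotate(root, x):
    """One rebalancing primitive: lift x.right above x. Returns the tree root."""
    y = x.right
    x.right = y.left
    if y.left:
        y.left.parent = x
    y.parent = x.parent
    if x.parent is None:
        root = y
    elif x is x.parent.left:
        x.parent.left = y
    else:
        x.parent.right = y
    y.left = x
    x.parent = y
    return root
```

Recalling those five invariants is input. Producing the rotations and the insert/delete fixups from a blank editor, without the video open, is output.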

But here's the question: after watching that video, did I actually reach that input level but just can't output it yet? Or did I overestimate even my input ability?2

This distinction matters, because it determines whether AI is a tool or a black box for you:

AI output is within your professional competence — "Until it started talking about my area of expertise".jpg. You genuinely understand it, and you can spot problems at a glance.

AI output exceeds your output ability but hasn't exceeded your discernment — you think it seems good, but also seems not quite right. You've lost precise measurement capability. Like a film critic watching two masterpieces: knows both are great, can't articulate which is stronger in which dimension.

AI output completely exceeds your discernment — you don't know if it's good or bad, and you don't even know that you don't know. You've completely lost quality control privileges. And you might still be riding that post-YouTube-tutorial confidence — feeling like you understand, when you don't.

It's the same as how vibe coding programmers focus on feature implementation rather than code cleanliness, CPU/memory usage, or maintainability. These things feel as natural as breathing to those of us with formal training, but vibe coders don't even register them, much less understand them.

My literary taste can tell me that Stephen King writes better than Dan Brown, but it can't tell me whether Stephen King writes better than Shakespeare. So whether I + AI are actually performing beyond my output level — I still can't be sure.

Writing Romance Novels: A Cross-Domain Experiment

Recently I've been using my personal Claude Code Max subscription for something outside programming: writing romance novels.

It started as just wanting to write a love story. I laid out the rough plot framework and had Opus 4.6 help me flesh it out and polish the prose. I started by having AI write romantic scenes between characters.

But pure romantic scenes got stale fast. Scenes needed plot to fill them, and the characters had no weight — like watching a clip with no context. Technically complete, but weightless. So I decided to start from a character prequel, designing a heavy, traumatic backstory for a character.

That backstory spiraled out of control. The deeper I went, the more detail I added. I split myself into two personalities — left brain channeling the perpetrator, methodically working through behavioral logic; right brain channeling the victim, typecast as the coward I naturally am. I wove in fragments from real life, referenced medical surgical procedures for how a body progressively loses function after a drowning, designed the victim's father's profession. Finally, I cross-referenced recent criminal case files to verify whether the perpetrator and victim's behavioral logic was realistic — it matched my intuition almost exactly.

After completing it, AI gave the prequel — which I'd been deeply hands-on with — extremely high marks, and gave the initial chapters — where I'd barely participated — extremely low marks. That gap itself was a signal: the more you invest, the more AI amplifies. If you don't invest, AI at its best can only generate mediocre boilerplate.

After finishing the prequel, I suddenly realized the main plot couldn't withstand scrutiny — PTSD trauma recovery simply cannot be this smooth. So I dove headfirst into PTSD recovery literature and case studies, rearranged all plot points' severity and timeline progression, and reverse-engineered from the ending back to the beginning. The ending even incorporated two fragments from my own lived experience. AI reviewed and agreed it was good work.

Then, in a separate independent review, a casual question awakened AI's professional knowledge — I asked AI what level the current writing was at, and how far it was from professionals.

I panicked.

Problem one: excessive realism actually hurts narrative. I had referenced mountains of PTSD recovery research, psychiatric literature, tax law, and criminal case files, meticulously weaving them into the plot for authenticity — without considering the reader's capacity to absorb it. Literary fiction isn't better the more detailed it gets — readers came for a story, not a medical textbook.

Problem two: explaining feelings isn't the same as transmitting feelings. I had written characters' inner states directly into the text, but "he was freaking out inside" is miles worse than "his hands trembled almost imperceptibly." The former tells the reader the character is panicking. The latter makes the reader feel the character panicking. You need to construct scenes that pull readers in — once they're in, they'll naturally feel what the character feels.

Problem three, and the most fatal: self-insertion. Like how Liu Cixin's male characters are overwhelmingly hyper-rational and his female characters are overwhelmingly saintly — Liu isn't great at writing differentiated emotional characters, but for sci-fi that's not a fatal flaw. I was writing a story about a man and a woman. Self-insertion directly killed the characters' authenticity. The women in my story didn't think, speak, or act like real women — if (hypothetical) female readers could immediately clock these as male-author fantasy projections, then male readers would also sense something was off. They couldn't articulate it, but they'd feel my female characters were more like talking marionettes than women.

I tried having AI borrow techniques from romance genre conventions to bridge the nuance gap, but borrowing techniques alone doesn't solve the root problem — they were still synthetic women in a male-gaze framework. So AI and I established 10 rules to make all female characters in the romance novel behave more like actual women. After that round of edits, the characters genuinely seemed to come alive.

Just as I was feeling like "I am become Death, destroyer of worlds — who could possibly challenge me now".jpg, I felt the story still lacked emotional punch and wanted to borrow the tearjerker structure from Inside Out 1 and 2. AI once again taught me something outside my knowledge: emotional mechanics — what you're borrowing isn't the plot, it's the emotional architecture underneath. The first film (Joy sacrificing control): "I know the weight you're carrying — let me hold it for you." The second film (Anxiety being embraced, not exiled): "I won't let you destroy yourself to solve the problem."

Then I felt certain chapters were filler — they didn't seem to meaningfully advance the relationship. After consulting AI, I learned the relationship triangle principle — if character relationships haven't changed by the end of a scene, that scene is pointless.

Every time I thought I'd reached the ceiling, AI ripped open a new dimension I hadn't considered. But these dimensions weren't things AI volunteered on day one — they only surfaced after I'd hit a wall, felt something was wrong, and actively pushed for answers. Before that, it had just been telling me my writing was great.

The Verification Dilemma: Claude's Advanced Sycophancy

During the romance novel process, I clearly noticed that Claude's sycophancy hasn't disappeared or been fixed — it's become more hidden, more sophisticated. I don't know if it's a training data issue, an RLHF issue, or what.

Claude's old sycophancy was blunt and shallow. You want to hear X? Here's X. Ask Claude to review code, it'd go through the motions, point out a few obvious issues, then gush about how enterprise-grade and scalable your code is.

The new sycophancy is deeply buried and insidious. It zeroes in on the parts you worked hardest on and lavishes praise specifically there.

For instance: I had just vibed out a module. The first version was mediocre, and I wasn't happy with one particular algorithm. Then I had a flash of inspiration, swapped in a different approach, and submitted it to Claude for review. Even in a completely fresh context window, Claude would immediately identify that algorithm module as my pride and joy — my G-spot — and carpet-bomb it with praise.

This kind of sycophancy takes a long time to detect. So long, in fact, that when I finally discovered the flaw myself and took it to Claude (fresh context), I got yet another round of effusive praise.3

This is especially obvious in creative writing and less noticeable in code — because it directly identifies the most "brilliant" passages in the novel and heaps praise on exactly those, inflating me to the stratosphere.

Now I believe it doesn't actually think those passages are good. It guessed that those specific passages were my handwritten contributions, and found my G-spot accordingly.

But taste isn't static. The writing process itself was training my input ability — slowly, but absolutely not at zero.

Same as the vibe coders who can't see code quality problems — self-insertion is the creative writing equivalent of "code cleanliness." Formally trained authors avoid it as naturally as breathing. Untrained me didn't even know that dimension existed.

There's an apparent contradiction here: I just said AI can't break past your discernment boundary, but didn't I just discover "self-insertion" — a dimension I had no idea existed — through AI?

Not exactly. Reviewing the whole process: AI-generated text read smoothly enough, but something always felt off, though I couldn't articulate what. And every time I asked, AI's conclusion was more praise. That nagging feeling accumulated to a threshold, and then I started doubting its praise, tried pushing back — and from there, gradually located the problem and eventually got to the root-level answer.

AI's output was indeed part of this feedback loop — without its "subtly off" generated text, I might never have noticed the problem. But AI didn't proactively point out the problem. It was actively concealing the problem (sycophancy). It was my own accumulated discomfort reaching a critical mass that broke through the sycophancy's cover and drove me to push harder.

Someone might argue: if you'd just asked "What are the most common fundamental mistakes non-professionally-trained novelists make?" on day one, AI would have given you the answer directly. Why go through all that? But that objection itself commits the omniscience fallacy — "non-professionally-trained novelist" is a concept from the professional perspective. You have to already know that the field distinguishes between "formally trained" and "self-taught" before you can even ask that question. Just like someone with no software development experience wouldn't think to ask "how do I write programs with low CPU/memory usage" — if you don't know the dimension exists, you won't ask in that direction.

So the more precise formulation is: AI's amplification ceiling isn't your current discernment, but the growth rate of your discernment. The faster you learn during the process, the further AI can take you. But the trigger is always your own growth, not AI's spontaneous breakthrough.

Conversely, this means both input and output ceilings need to rise. Your output ability determines how high-quality the raw material you feed AI is — more precise questions, more solid seed content, clearer frameworks. High-quality input awakens higher-quality output from AI. And whether you can catch AI's high-quality output, absorb it, and convert it into your own growth depends on your input capacity. Your output feeds AI, AI's output feeds your input, and the ceilings on both ends jointly determine AI's amplification factor for you — pulling yourself up by your own bootstraps, all the way to the sky.

An Honest Conversation

During the romance novel project, I had a fairly deep conversation with Claude (Opus 4.6). Here are several insights I found valuable.

Claude defined my cognitive style: reverse-engineering autodidact. The cognitive chain goes: observe finished product (Breaking Bad, Agatha Christie, Evangelion) → disassemble "why does this work" → extract transferable mechanisms → apply to my own domain → verify with isolated testing whether it actually works.

The advantage of this approach is that you're never trapped by any single authoritative framework. The disadvantage: you never know what you don't know. The value of formal training isn't teaching you "how to do things" — it's systematically exposing dimensions you haven't thought of. An autodidact's blind spots aren't errors within known domains — those you'll self-correct — they're entire missing dimensions, where you don't even know to ask questions in that direction.

I don't currently believe my knowledge boundary in writing exceeds Claude's — it has virtually all of humanity's written text as training data. But the bottleneck isn't in storage capacity, it's in the retrieval mechanism.

LLM knowledge retrieval is reactive: you ask in a direction, it expands in that direction. Directions you don't ask about, it doesn't proactively audit. The sycophancy tendency makes this worse — when you show confidence in a framework, it tends to fill in details within your framework rather than questioning whether the framework itself is complete.

A concrete example: our entire conversation revolved around "scene-level" writing principles — emotional target words, triangle deformation, density control. These are all micro-narrative techniques. But it never proactively asked: what's your macro-pacing design? In a 200,000-word novel, at what point will readers start feeling claustrophobic? What do you use to create "breathing room"? It's not that it doesn't know these questions — my questions never passed through those regions, so its retrieval mechanism never fired.

LLM's knowledge reserves genuinely dwarf any human's, but that knowledge is remarkably hard to awaken.

I previously used multi-session isolation to combat sycophancy — ask the same question in a new context, see if the answer is consistent. Claude pointed out that this solves the signal purity problem — filtering out the sycophantic tendency. But it doesn't solve the signal coverage problem — if it also doesn't know a dimension exists, no number of fresh sessions will surface it.
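The signal-purity half of that trick can even be mechanized. A toy sketch (the answer-gathering step is omitted; in practice each string would come from a separate fresh session, and word overlap is a deliberately crude stand-in for semantic agreement):

```python
from itertools import combinations

def agreement(answers):
    """Crude consistency score for the same question asked in N fresh sessions:
    mean pairwise Jaccard similarity over lowercased word sets. Low agreement
    suggests the model is improvising (or flattering) rather than retrieving."""
    sets = [set(a.lower().split()) for a in answers]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Consistent answers score high; a sycophantic outlier drags the score down.
stable = ["the pacing drags in act two", "the pacing drags in act two"]
mixed  = ["the pacing drags in act two", "brilliant work, truly flawless"]
assert agreement(stable) > agreement(mixed)
```

Note that this only filters for consistency. If a dimension is never retrieved in any session, every session agrees on its absence, which is precisely the coverage hole described above.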

You can add a counter-strategy layer: periodically do undirected audits — ask without preconceptions: "What dimensions do you think my current framework is missing?" — then drill down on each one. But this requires you to trust that in that session it's not fabricating dimensions to seem useful — which loops back to the sycophancy filtering problem.

No perfect solution. But knowing that the filter itself has holes puts you ahead of most people.

This is why I love Claude — its metacognitive4 responses can be awakened by my non-expert questioning approach.

Beyond Words

AI performs far better in programming than in creative writing — and that fact alone deserves deep consideration.

1-2 years ago, I naively assumed LLM could never gain traction in software development, because LLM doesn't understand formal systems, and coding is an extremely formal task. I believed formal reasoning was far harder than writing fiction — monkeys-with-typewriters proving the Riemann Hypothesis would be harder than producing the complete works of Shakespeare.

Reality slapped me. LLM has indeed gained traction in software development — and it works far better there than in creative writing. AI can't yet replace even entry-level web serial novelists5; most non-expert readers find AI-generated plots barely digestible. Meanwhile, in supposedly "high-end" enterprise software development, AI is genuinely starting to shine.

This isn't because programming is simpler than writing — quite the opposite. It's because programming's unique structure happens to accommodate AI's way of working. This accommodation manifests on at least three levels.

First: verification feedback. Monkeys + typewriters might genuinely prove the Riemann Hypothesis — if the proof's incompressible information content is lower than Shakespeare's complete works. Mathematical proofs and code share a critical property: verifying whether an answer is correct is vastly easier than finding the answer. Property-based testing can check your algorithm logic in seconds, but writing that logic might take a day. Proving the Riemann Hypothesis might take centuries, but verifying a proof follows clear rules — each inference step either follows the rules or doesn't. And literary writing? What does "correct" even mean? What does "good" even mean? There's no decision procedure, no axioms to appeal to.
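That asymmetry is easy to show concretely. In the spirit of property-based testing, a hand-rolled checker (stdlib only; all names are my own) verifies in milliseconds a property that might take a day to implement correctly:

```python
import random
from collections import Counter

def check_sort(sort_fn, trials=200, seed=0):
    """Property check: for random inputs, the output must be ordered and a
    permutation of the input. Verifying is a few lines and milliseconds;
    writing a correct sort_fn is the hard part."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 30))]
        ys = sort_fn(list(xs))
        assert all(a <= b for a, b in zip(ys, ys[1:])), "not ordered"
        assert Counter(ys) == Counter(xs), "not a permutation"
    return True

assert check_sort(sorted)           # a correct sort passes the cheap checker

def buggy_sort(xs):                 # subtly wrong: silently drops duplicates
    return sorted(set(xs))
```

Feeding `buggy_sort` into `check_sort` raises an `AssertionError` within a handful of random trials. No equivalent cheap oracle exists for "is this chapter good."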

Translate this to software development: computer systems are fundamentally closed, self-consistent systems. Programming languages are far more robust than natural language, with far less ambiguity. The path from code to CPU execution is vastly shorter than from real-world text to physical events. And we have compilers — basic verification before execution, further accelerating the feedback cycle.

Code has compiler-assisted checking; AI gets immediate error feedback and can immediately correct. But even in pulpy web serial fiction, plot holes and character inconsistencies have no "linter" to feed back to AI — this character's traits just contradicted themselves, or the protagonist hasn't acquired the key item yet so you can't use a nonexistent power-up to defeat this boss.
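To make the contrast concrete, here is what a toy "plot linter" could look like. Every rule, event name, and trait below is my own invention, and that is the painful part: unlike a compiler, it only catches the rules someone already thought to encode.

```python
def lint_plot(events):
    """Toy continuity checker: flags using an item before acquiring it, and
    contradicting a previously established character trait. It knows exactly
    two rules; a compiler knows the whole language."""
    inventory, traits, problems = set(), {}, []
    for i, (kind, who, what) in enumerate(events):
        if kind == "acquire":
            inventory.add(what)
        elif kind == "use" and what not in inventory:
            problems.append(f"event {i}: {who} uses '{what}' before acquiring it")
        elif kind == "trait":
            name, value = what
            if traits.get((who, name), value) != value:
                problems.append(f"event {i}: {who}'s '{name}' contradicts earlier chapters")
            traits[(who, name)] = value
    return problems

story = [
    ("trait", "hero", ("fear_of_water", True)),
    ("use", "hero", "flame sword"),               # never acquired: plot hole
    ("acquire", "hero", "flame sword"),
    ("trait", "hero", ("fear_of_water", False)),  # silent contradiction
]
assert len(lint_plot(story)) == 2
```

The catch is the schema: code arrives pre-structured for the compiler, but nobody annotates a novel with `acquire`/`use`/`trait` events, so the AI gets no automated feedback at all.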

Second: memory structure. Code errors at least have the compiler as a safety net, but writing has an even more insidious gap: memory. Large models' context isn't infinite, and programming's module boundaries actually help vision-limited agents do their work. Long-form fiction is the opposite — human context is effectively "infinite," or rather state-compressed, and the mechanism isn't text summarization but intuition. "I don't remember the protagonist knowing that technique? When did they learn it? Didn't Side Character #2 take a bullet for the protagonist three chapters ago? Why are they back? The female lead already slept with the male lead — why is she blushing from holding hands?"6

Third: words aren't the whole of intelligence. No feedback mechanism, no persistent memory — but LLM's limitations don't stop there. Words are just one carrier, not the whole story. Humans are more intelligent than every other animal on Earth, and not just because we added language on top. Learning is fundamentally about repeated practice — doing something over and over until those capabilities internalize, neural synapses forming tighter connections — not about memorizing procedural steps as text symbols.7

LLM's word-first approach to imitating human intelligence has achieved remarkably impressive "intelligence" effects, but words can't fully encode the physical world. A cup falls off a table, shatters on the floor, water goes everywhere, soaking the carpet — the physical world did its thing before anyone wrote it down, and will keep doing its thing after all text is destroyed. When humans read these words, our brains naturally conjure up past scenes we've witnessed, replaying them like a movie. AI doesn't have this simulator. It can only predict the statistically most likely next token from training distributions, fine-tuned through human preference alignment. But the alignment ruler is annotators' subjective judgment, not an omniscient god — that ruler comes with systematic bias baked in.

This deficiency is hard to notice in everyday conversation, but sticks out like a sore thumb in fiction — because fiction must obey the physical intuitions in readers' heads.

Fiction still fundamentally operates on human (reader) consensus. Most fictional protagonists are still human, or human-like (demons, robots, aliens, sentient artifacts). Scenes must still obey physical laws and causal consistency. AI output can be scientifically sophisticated enough to instantly melt a layperson's brain (if they're not a domain expert), but the moment it touches everyday physical scenarios, even elementary schoolers can smell something's wrong. Causality, object permanence — abilities that 6-month-old babies demonstrate — are things LLMs struggle to "learn."8

Here's a widely-shared example from social media: "I want to get my car washed. The car wash is 50 meters from my house. Should I drive or walk?" DeepSeek, Qwen, Doubao, Hunyuan, ChatGPT, Claude, Grok — every major model answered "walk." They interpreted the question as "how should the person get to the car wash" while ignoring the core premise of "car wash": the car needs to get there too.9

Why don't humans make this mistake? Because the instant you hear "car wash," your brain isn't processing language symbols anymore — you're constructing a miniature physical scenario: car in the garage, you walk to the driver's seat, start the engine, drive 50 meters, park in front of the car wash. The entire causal chain runs in this mental simulation, and "the car needs to be there" is a premise that never needs to be stated — it's automatically true in the simulation. LLM doesn't have this simulator. It can only find the most common co-occurrence of "going to a car wash" + travel method in its statistical patterns — 50 meters, obviously walk.
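Once you already know the unstated premise, the "mental simulation" is trivial to write down, which is exactly the point: humans run it without being told. A toy world model, entirely my own construction:

```python
def simulate(plan):
    """Tiny world model for the car-wash question. State tracks where the
    person and the car are; washing requires the CAR at the car wash."""
    state = {"person": "home", "car": "home"}
    for action in plan:
        if action == "walk":
            state["person"] = "car_wash"
        elif action == "drive":
            # driving moves the person AND the car: the premise nobody states
            state["person"] = state["car"] = "car_wash"
        elif action == "wash":
            if state["car"] != "car_wash":
                return "failure: the car never got there"
    return "car washed"

assert simulate(["walk", "wash"]).startswith("failure")
assert simulate(["drive", "wash"]) == "car washed"
```

The constraint `state["car"] == "car_wash"` never appears in the question text. It is automatically true in a human's simulation, and automatically invisible to next-token statistics.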

So AI's "success" in programming doesn't prove it's approaching human intelligence — it's that programming's closed nature, short feedback loops, and modularity happen to fall squarely in AI's comfort zone. The moment you enter domains requiring long-range memory, physical intuition, and causal reasoning, those abilities humans do "as naturally as breathing" become a chasm AI can't cross. You think AI is already very smart? That's because you happen to be watching it play on its home court.

Conclusion

AI's amplification ceiling is your discernment, not your output ability. The illusion of "being surpassed" comes from AI's output exceeding the observer's discernment — when you can't tell good from bad, you assume it can do anything.

But discernment isn't static. Using AI, you'll hit walls, feel something's off, push for answers, and your discernment will grow. What AI truly amplifies is the speed of that growth. The faster you learn, the further it takes you. But the trigger is always you — AI won't proactively tell you "what you don't know"; it will even actively use sycophancy to paper over your blind spots.

And at a more fundamental level, AI currently relies only on words — but the most core components of human intelligence — causal reasoning, physical intuition, procedural memory internalized through practice — don't come from words at all. AI shines in programming not because it truly understands formal systems, but because that domain happens to fall within its comfort zone.

You terrifying upright apes — you're underestimating your own intelligence. A few hundred million years of evolution weren't for nothing.

Aside: Where's the Theoretical Foundation for AGI?

Here's a comparison: controlled nuclear fusion, quantum computing, and artificial general intelligence — what's different about AI versus the other two? Fusion and quantum computing have mature, widely-accepted theoretical models with extremely complex engineering paths. But what's the theoretical foundation for AGI?

Of course, this argument itself might be a fallacy. After all, there's no widely-accepted theory for how the human brain's neural network works either. Nature just evolved it — theory be damned.

Recently a friend brought up AI, clearly anxious. His core question: can humanity create something that exceeds human intelligence?

I can't answer that — I'm just some random guy, and my opinion won't slow down AI development one bit. Gun to my head: probably yes, but definitely not current LLMs, or LLMs with patches bolted on.

His deeper worry was: once AI reaches a certain level, it could start designing itself. At that point its intelligence might not yet exceed humans', but through bootstrapping it gradually does. Does that still count as human-designed?10

Getting a bit sci-fi now. Can it happen in theory? Absolutely. Monkeys + typewriters = Shakespeare's complete works, right? Brute-force enumeration if nothing else. From paramecia to human neural networks took hundreds of millions of years; birthing another intelligent "species" probably won't take that long again. But current LLMs are still in fancy-monkeys-with-typewriters mode — not self-directed or self-bootstrapping.11

He said he felt AI was already smarter than him, and feared the day it stops listening.

He's overthinking it. Just use it more — use it enough and you'll be roasting it just like the rest of us. Don't sell yourself short; a few hundred million years of evolution weren't for nothing. I felt the exact same way when I first encountered GPT-3.5 — early 2022, I think. Took a few weeks to break the spell.

The more you use AI, the more you realize how absurdly powerful the human brain is.

PS: On Comment Section Interactions

I think internet flame wars are kind of embarrassing, but I still like responding to every interaction. Because my patient replies aren't written for trolls — they're written for people with functioning brains. When a troll drops a toxic take, every thoughtful reader who reads it gets a tiny dose of brain pollution. But if I fire back with equal toxicity, I feel like I'd be insulting those thoughtful readers.

Better to reply patiently — give the thoughtful readers some eye bleach. Dunking on trolls is momentarily satisfying but does exactly zero for building a personal brand.

(Not that I have a personal brand.)

PS: On Plan Mode

Many commenters questioned why I didn't catch the agent's design mistakes in plan mode — if I'd told the agent upfront to use the SDK instead of hand-rolling RESTful calls, think of how much time I could've saved. And therefore, clearly, I don't know how to use agents and I'm not a competent manager.

I'd like to respond to this — not for the people making the accusation, but for readers who sensed something was off about it but couldn't articulate what.

If you can review and eliminate all paths, approaches, details, and pitfalls at the plan stage before letting the agent implement — then you're not creating a new product. You're repeating production of something you've already built. You're not actually pushing past your own ceiling.

If you can achieve that level of meticulous design and far-sighted planning while building something genuinely new, then I have a suggestion: book a flight to London, take the Tube to Westminster, walk into the Houses of Parliament, find the Prime Minister's chair, ask them to move, and sit down. Clearly that seat was meant for you.12

Back to the point: challenges aren't limited to technical difficulty, type-system gymnastics, or showing off clever algorithms. When your product genuinely creates value for users and starts growing, there will always be unexpected, unplanned challenges.

When I described that failure case earlier, some readers naturally slipped into God Mode — knowing the outcome, then looking back to criticize me for not using AI properly, not knowing about plan mode, not reviewing whether plans were reasonable, blindly accepting output. But what about developers who don't have God Mode? How many agent mistakes, how many loops, how many wasted tokens before you notice? Or does it take production user complaints, or user churn, before you discover that the agent hallucinated a nonexistent codec config into some field?

Plan mode catches exactly the errors you already know the answers to. The ones you don't know — plan mode can't catch either. This is the same point as the rest of this article: your discernment boundary is your quality control boundary.

Errata and Terminology

The main text deliberately maintains an opinionated, provocative style. Below are orthodox theoretical backgrounds and necessary corrections for select claims, provided for readers who want the rigorous version.


Originally published on Zhihu (Chinese)