// AI-native: the signature enforces non-emptiness; callers handle the empty case at the call site
def batchEmbed(texts: NonEmptyList[String]): IO[NonEmptyList[Embedding]]

In the traditional style, "don't pass an empty list" is a verbal agreement, or a line in a comment: // texts must not be empty. AI does not read comments. It will dutifully pass in an empty list, and then: NoSuchElementException. NonEmptyList lifts that agreement to the type level: the caller must handle the empty case via NonEmptyList.fromList, or the code does not compile.
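The contract can be sketched without any library dependency. The following stand-in mimics cats.data.NonEmptyList (which the project actually uses), and batchEmbed's return type is simplified to drop the IO wrapper; the "embedding" here is just the string length, purely for illustration:

```scala
// Hand-rolled stand-in for cats.data.NonEmptyList, so the sketch runs standalone.
final case class NonEmptyList[A](head: A, tail: List[A]):
  def toList: List[A] = head :: tail
  def map[B](f: A => B): NonEmptyList[B] = NonEmptyList(f(head), tail.map(f))

object NonEmptyList:
  // The only way in from a plain List: the caller must face the empty case here.
  def fromList[A](xs: List[A]): Option[NonEmptyList[A]] = xs match
    case h :: t => Some(NonEmptyList(h, t))
    case Nil    => None

// Cannot be called with a possibly-empty List; that simply does not compile.
def batchEmbed(texts: NonEmptyList[String]): NonEmptyList[Int] =
  texts.map(_.length) // stand-in "embedding" for illustration
```

The point is that the empty-list decision is forced at the call site, at compile time, instead of surfacing as a runtime exception inside batchEmbed.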
In the traditional style, when you reach the Service layer and see a String id, you have to trace three layers up to learn what kind of ID it is. In the AI-native style, the signature at every layer is self-explanatory, which is precisely the precondition for the "humans read only signatures" division of labor to hold.
Engineering discipline: the AI's default bad habits need rules to constrain them
The type system solves the signature-level problems, but at the implementation level the AI still carries bad habits from training that need explicit rules to correct.
Fail fast; never swallow errors. RLHF training rewards "robustness", so the AI's default instinct is to use .toOption, .getOrElse(default), or Try(x).toOption to swallow exceptions and pretend all is well. In a production system, swallowed errors are a breeding ground for hidden bugs. In scenarios involving money, robotic arms, or high-power output, an exception that halts the system may cost only a pause, while swallowing the exception and running on may endanger property or even lives. This must be explicitly forbidden.
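A minimal plain-Scala sketch of the fail-fast stance; parsePort and the port rule are illustrative, not from the project:

```scala
import scala.util.{Try, Success, Failure}

def parsePort(raw: String): Int =
  // BAD (forbidden): Try(raw.toInt).toOption.getOrElse(8080)
  //   swallows the cause and silently fabricates a default.
  // GOOD: on a trusted path a bad value is a bug; surface it immediately.
  Try(raw.toInt) match
    case Success(p) => p
    case Failure(e) => throw new IllegalArgumentException(s"Invalid port: '$raw'", e)
```

The failed case carries the original cause, so the bug is visible at the point of entry instead of corrupting state three layers downstream.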
Naming conventions plus periodic audits. A human can remember that processMatrix actually does traffic routing; the brain automatically maps misleading names to reality. The AI cannot. In every new session it takes names at face value and falls into the same hole again. Naming pollution hurts AI far more than it hurts humans. Periodically having the AI audit naming consistency is far more efficient than checking it yourself.
On FQNs, a change of mind. I used to keep correcting the AI for writing fully qualified names (org.springframework.http.HttpHeaders), finding them verbose. Later I realized the AI was right. What should an AI whose view is confined to the current file make of the token HttpHeaders? Is it org.springframework.http.HttpHeaders? io.netty.handler.codec.http.HttpHeaders? Or java.net.http.HttpHeaders? Only the FQN removes the ambiguity. Humans disambiguate with IDE navigation and a mental map of the project; the AI reads plain text and has neither. Ultimately this is a tooling problem: today the AI reads raw text, and the next step should be letting it read the AST directly and hook into an LSP to get scoped symbol information. Until then, the FQN is the AI's only means of disambiguation and should not be overridden by human aesthetic preference.
Modularity: add, don't multiply.
Stacking features is linear growth; crossing features is combinatorial explosion. If one function does three things, the AI misreading any one of them derails the whole. For the AI, module boundaries are comprehension boundaries: the cleaner the boundary, the lower the chance of error.
Code as documentation. Do not maintain markdown design documents: type signatures are the interface contract, and test cases are the behavioral spec. A year of production practice has shown that having the AI maintain markdown documents is an anti-pattern. Documents go stale, drift from the code, and consume precious context window; as they pile up they eat into context and accelerate the AI's degradation. Rather than maintain markdown that may go stale at any moment, spend the effort on making the code self-explanatory.
A side effect of human alignment: this depends directly on the cognitive level of the AI's trainers. During training, models produce unconventional but potentially very effective strategies. If a trainer lacks the judgment to recognize their value and punishes anything that departs from the mainstream path, those high-value strategies get suppressed in the model. Yet for senior engineers, exactly these unconventional ideas are often the most valuable insights. People with weak judgment cannot train a good AI.
This is my personal CLAUDE.md rules file. It has maintained a production project for a year now, from the Cursor + Opus 3.7 era to today's Claude Code + Opus 4.6 era.
Note: this file is only validated for Claude Code with Opus 4.6. I hold no bias against Codex or Gemini. OpenAI's 25k USD of credits expire this June; if gpt-5.4-xhigh/codex-5.3-xhigh were really as miraculous as the influencers and self-styled AI godfathers claim, would I let those credits sit past their expiry date?
The measured reality is that Codex follows a functional-programming style poorly. A short prompt that works for Opus 4.6 will send it off course unless you spell everything out in AGENTS.md in painstaking detail.
Because of coding style and personal taste, do not copy whole sections of this file or replace your own rules file with it. Let Claude read my rules file, analyze it, and then decide what to adopt and what does not fit your current project. My coding and architecture style is very aggressive, because many codebases and systems are rotten to the bone: laying "elegant" patches on such a pile only raises its altitude without actually lowering the technical debt. So my architectural style has always been bold. Nothing is unchangeable, and nothing is too scary to change. A living system must be re-examined regularly; as long as it lowers technical debt, dare to rewrite it locally or even from the ground up. Only dead systems stay unchanged forever.
This file provides guidance to Claude Code (claude.ai/code) when
working with code in this repository.
Overview
A multi-tenant backend service built with http4s
(Scala 3 / cats-effect). It provides document management, AI-powered
features (embeddings, RAG), video SOP generation, and real-time
communication capabilities.
Refactoring Philosophy
Prefer radical type-level refactors over conservative
patches. This is a statically-typed Scala 3 codebase with
tagless final — the compiler catches all downstream breakage. When
fixing an issue, always choose the solution that encodes the constraint
in the type system, even if it touches many files. A 15-file signature
change that the compiler verifies is safer than a
1-file patch with a runtime check.
Don't minimize blast radius — maximize type safety.
Changing a method from List[T] to
NonEmptyList[T] across 6 files is not "risky" — the
compiler finds every call site. A runtime .toNel.get hidden
in one file is the real risk.
The compiler is the last line of defense. If a
refactor compiles, it's correct. Treat compilation as the acceptance
test for type-level changes.
Write-cost is near zero. AI writes 90%+ of code, so
the cost of touching more files is negligible. Optimize for correctness
and compile-time safety, not for minimal diff.
Type precision is not over-engineering.
Over-engineering means unnecessary abstractions, config flags, strategy
patterns for one implementation. Using NonEmptyList over
List, ProjectId over UUID, or
propagating constraints through signatures is the opposite — it removes
complexity (runtime checks) by shifting it to the compiler. "Avoid
over-engineering" applies to architecture, not to type-level
precision.
// Access via summoner
SOPService[F].getSOP(tenant, id)
NoOp Result Patterns - Caller Decides:
NoOp implementations return results (not throw exceptions). The
caller decides if it's an error:
// NoOp returns result indicating "not processed"
class SOPServiceNoop[F[_]: Applicative] extends SOPService[F]:
  // Either methods → Left with error message
  def createSOP(...) = Left("Service not available").pure[F]
// BAD — nested match breaks the for-comprehension flow
for
  result   <- service.doSomething(...)
  response <- result match
                case Right(value) => Ok(value.asJson)
                case Left(err)    => BadRequest(err.asJson)
yield response

// GOOD — lift Either/Option into F
for
  value    <- Sync[F].fromEither(parseJson(raw).leftMap(e => RuntimeException(e.message)))
  response <- Ok(value)
yield response

// GOOD — use EitherT.foldF when both paths have logic
for
  body     <- req.req.as[Body]
  response <- EitherT(service.doSomething(body)).foldF(
                err   => Logger[F].warn(s"Failed: $err") *> BadRequest(err.asJson),
                value => Created(value.asJson)
              )
yield response
No premature helpers: Don't extract single-use
private methods that just wrap a match. Inline the logic at
the call site.
Case classes over manual cursor decoding: For
external API payloads, define case classes with
derives Decoder and decode once with .as[T],
then pattern-match on decoded fields. Avoid manual
hcursor.downField(...).get[T](...) chains.
SDK/Library Priority Order
When integrating external services (e.g., Google Cloud, AWS,
Firebase), prefer libraries in this order:
Typelevel-wrapped Scala SDK (e.g., from
typelevel.org ecosystem)
Official Java/Kotlin SDK (wrap with
Async.blocking)
Use when no Scala alternative exists
Implement yourself (HTTP client)
Last resort, only when SDK unavailable or unsuitable
Example: For Google Gemini integration, use the
official Java SDK wrapped with Async[F].blocking rather
than implementing raw HTTP calls.
Multi-Tenancy with Schema
Isolation
The database uses PostgreSQL schema isolation for multi-tenancy:
System schema (public): Stores shared
data (tenants, users, system settings, quotas)
Tenant schemas (tenant_<id>):
Each tenant gets an isolated schema for their data (projects, documents,
SOPs, etc.)
Migration System: - System migrations:
db/migration/system/ - run once at startup - Tenant
migrations: db/migration/tenant/ - run for each tenant
schema - Migrations run automatically at startup for all existing
tenants - New tenant schemas are migrated on creation - Never modify
existing migration files; always create new versioned files
Tenant Schema Access: All tenant routes follow the
pattern: /api/org/{tenant}/...
Fail Fast - No Silent
Error Swallowing
CRITICAL RULE: Never silently swallow errors.
Arbitrary tolerance pollutes the database and hides bugs.
Forbidden patterns (ALWAYS):
// BAD - silently converts errors to None/null
json.as[T].toOption
json.as[T].getOrElse(defaultValue)
either.toOption
Try(x).toOption
result.getOrElse(null)
Error Handling Strategy - Trusted vs Untrusted
Paths:
- Trusted (internal): config files, system settings, DB schema data, internal serialization → Throw exception. Low probability of error; if it fails, it's a bug.
- Untrusted (external): user input, AI-generated content, external API responses → Catch and report. High probability of error; report back to the user/AI to fix.
// TRUSTED PATH - throw on failure (system internal data)
val config = configJson.as[AppConfig].getOrElse(
  throw new RuntimeException(s"Config decode failed: ${configJson}")
)

// UNTRUSTED PATH - catch and report to caller (user/AI content)
val result = userJson.as[UserContent] match {
  case Right(v)  => v
  case Left(err) => return BadRequest(s"Invalid content format: ${err.message}")
}
Typed Error Model (ADT
Errors)
Use enum error types, not
Either[String, T]. Services define sealed error
enums for known failure modes. Routes pattern-match on the enum to
decide HTTP status — no string parsing.
Scope: one error enum per logical failure domain, not per
service. - If two methods share most failure modes → one shared
enum - If two methods have different failure modes → separate enums -
Shared subset across domains → compose via wrapping:
ParseError.Embedding(EmbeddingError)
// Separate enums — methodA and methodB have different failure modes
trait DocumentService[F[_]]:
  def importUrl(url: String): F[Either[ImportError, Document]]
  def parseContent(docId: DocumentId): F[Either[ParseError, Content]]

// Shared enum — indexDocument and indexSop share the same failure modes
trait RagIndexService[F[_]]:
  def indexDocument(docId: DocumentId): F[Either[IndexError, Unit]]
  def indexSop(sopId: SopId): F[Either[IndexError, Unit]]
Rules: - Named variants for known
failures. Each variant carries structured context (IDs, limits,
types), not string messages. - Sdk /
Other(message: String) variant for unexpected
errors that don't warrant their own case yet. - Compose, don't
flatten. When service A calls service B, wrap B's error:
case Embedding(cause: EmbeddingError), not
case EmbeddingFailed(message: String). - Route
mapping: Each error variant maps to exactly one HTTP status.
The match is exhaustive — compiler enforces handling every variant. -
Define in feature's models.scala. -
Migrate incrementally. New services use typed errors.
Existing Either[String, T] services migrate when next
modified.
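A minimal sketch of these rules taken together; the enum variants and status codes here are illustrative (the real enums live in each feature's models.scala):

```scala
// Variants carry structured context (IDs, limits), not string messages.
enum EmbeddingError:
  case DimensionMismatch(expected: Int, actual: Int)
  case Sdk(message: String) // catch-all for unexpected SDK failures

enum ParseError:
  case UnsupportedFormat(mimeType: String)
  case Embedding(cause: EmbeddingError) // compose, don't flatten
  case Other(message: String)

// Route mapping: one HTTP status per variant; the match is exhaustive,
// so adding a variant without handling it is a compile error.
def httpStatus(e: ParseError): Int = e match
  case ParseError.UnsupportedFormat(_) => 415
  case ParseError.Embedding(_)         => 502
  case ParseError.Other(_)             => 500
```

Wrapping EmbeddingError inside ParseError.Embedding preserves the full cause for logging while still letting the route layer dispatch on a single enum.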
Feature Module Organization
Each feature module follows a consistent structure:
features/<feature>/
├── <Feature>Service.scala      # Business logic (tagless final trait + impl)
├── <Feature>Repository.scala   # Data access layer (Doobie queries)
├── models.scala                # Domain objects and DTOs
└── README.md                   # Feature documentation
Tests use: - munit with cats-effect support for test
framework - TestContainers for PostgreSQL integration
tests (automatic database provisioning) - Doobie munit
for query checker tests
What NOT to test (waste of time): - Case classes -
no value without complex methods - JSON serialization - Circe is already
well-tested - Config class definitions - if config is wrong, app fails
to start anyway - Framework behavior (http4s, Doobie) - already
well-tested by the community
What TO test (valuable): - Security validation
(e.g., tenant name injection prevention) - Error handling / fallback
logic - Real database integration with TestContainers - Business logic
that has actual branching/computation - Assembly/wiring that connects
our components together
Update the OpenAPI spec: After route/DTO changes,
update src/main/resources/openapi/documentation.yaml
Validate the spec: Run
swagger-cli validate to ensure correctness
Follow authentication patterns: All API routes
require Firebase JWT authentication (except system/health
endpoints)
Code Formatting
Use scalafmt (configured in .scalafmt.conf):
./mill reformat     # Format all Scala sources
./mill checkFormat  # Check formatting without modifying
Code style rules: - No fully-qualified names
in code. Always use imports. - Context bounds: use
{A, B, C} syntax (Scala 3.6 aggregate bounds), not
colon-separated. - Opaque types for domain values. AI
writes 90%+ of code, so write-cost is near zero while compile-time
safety is free. Use opaque types with smart constructors for all entity
IDs, constrained strings, and bounded numbers. Defined in
core/domain/Ids.scala and
core/domain/Types.scala. - Type-level constraints
flow E2E. Encode invariants in types (opaque types,
NonEmptyList, refined types) and propagate them through
all layer signatures: route → service → repository.
Never downgrade a constraint to a weaker type and re-validate internally
— that hides the requirement from callers and defeats compile-time
safety. Unwrap/weaken only at the true system boundary: SQL
interpolation, Java SDK calls, job parameter serialization. -
.toString over
.value.toString. Opaque types erase at runtime, so
s"...$opaqueId" and opaqueId.toString just
work — no need to unwrap first. - NonEmptyList over
List + .get/.head. When
a method logically requires non-empty input (batch embeddings,
IN clauses, etc.), use NonEmptyList[T] in the
signature — including repository methods — instead of
List[T] with a runtime .toNel.get or
.head. Callers use NonEmptyList.fromList to
handle the empty case at the call site. - No premature
helpers. Don't extract single-use private methods. Inline at
call site. - Generic over specific. One
queryParam[T] not three type-specific parsers. -
Proactive naming review. When reading or modifying
code, flag misleading, stale, or inconsistent names to the user. For
internal names (classes, properties, methods) —
recommend renaming directly. For external names
(request/response DTOs, DB-serialized JSONB fields) — suggest the better
name but note migration implications. Common smells: Kotlin-era suffixes
(Kt), field names that don't match their type
(name holding an ID), stale comments referencing deleted
code, generic names that obscure domain meaning.
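A minimal sketch of the opaque-type-with-smart-constructor pattern described above; ProjectId and the Ids.scala location come from this document, while the validation details are illustrative:

```scala
import java.util.UUID

object Ids:
  opaque type ProjectId = UUID

  object ProjectId:
    // Smart constructor: the only way to build a ProjectId from untrusted input.
    def fromString(raw: String): Either[String, ProjectId] =
      try Right(UUID.fromString(raw))
      catch case _: IllegalArgumentException => Left(s"Invalid ProjectId: $raw")
```

Because opaque types erase at runtime, a ProjectId prints like its underlying UUID (so plain .toString works, as the rule above notes), yet the compiler refuses to accept a bare UUID where a ProjectId is required.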
Logging
Use val logger (not context-bound
injection). Create a local
val logger = Slf4jLogger.getLogger[F] or pass as
constructor param. Logger is too common to justify tagless-final
injection overhead.
Milestone logging for long operations. Every
long-running call (external API, DB migration, video processing,
embedding) should log at each major step so operators can see progress
and diagnose hangs.
Log level in loops: If each iteration is fast
(e.g., processing a list of items), use
debug/trace. If each iteration is slow (e.g.,
transcoding, embedding backfill), info is appropriate.
Log levels: error = unexpected
failures that need attention. warn = degraded but
recoverable. info = lifecycle events, milestones, external
calls. debug = per-item processing in loops, internal
state.
Runtime Assertion Checks
(RAC)
Insert runtime assertions on critical paths. RACs
catch inconsistent state early, before it propagates downstream and
corrupts data. Always enabled in dev/testing; switchable off in
production via config flag.
Implementation: Use a shared
RAC.assert(condition, message) helper that checks a config
flag. When disabled, assertions are no-ops. When enabled, they throw
immediately.
What should have RAC: - Money/balance
operations — assert balance >= 0 after debit, assert credit
+ debit = expected total - Inconsistent state
transitions — assert valid transitions in state machines (e.g.,
SOP stage: draft→processing→published, never published→draft; RAG index:
PENDING→INDEXING→INDEXED, never backward). Throw immediately on invalid
transition to prevent downstream pollution. - Tenant
isolation — assert search_path matches expected tenant schema
before writes. Wrong schema = cross-tenant data leak. -
Embedding dimensions — assert vector length matches
expected dimension (768 for text, 1408 for video) before pgvector
insert. Wrong dimension corrupts similarity search silently. -
Idempotency — assert no duplicate job submission for
same entity (K8s jobs, Quartz jobs). Duplicates waste resources. -
Invariant preservation — any operation where a
post-condition violation would silently corrupt data rather than fail
visibly.
What should NOT have RAC: - Input validation (use
typed errors instead — that's user-facing, not assertion) -
Performance-sensitive hot loops (use debug logging instead) - Conditions
already enforced by the type system (that's the compiler's job)
CI/CD Pipeline
IMPORTANT: Do NOT build and push Docker images from local
machine. Always commit and push to git to trigger CI for
building images. Only build locally if explicitly requested by the
user.
Branches and Docker Tags: - develop
branch → gcr.io/${PROJECT_ID}/api:develop -
testing branch →
gcr.io/${PROJECT_ID}/api:testing - master
branch → gcr.io/${PROJECT_ID}/api:latest and
:master - Git tag vX.Y.Z →
gcr.io/${PROJECT_ID}/api:vX.Y.Z
Git Push: - If SSH push fails (e.g. VPN/proxy blocks
port 22), switch to HTTPS temporarily.
Deployment: - GitHub Actions workflow:
.github/workflows/build-and-push-gcr.yml - Builds with
Mill, then Docker image, pushes to Google Container Registry - Requires
GCR_SERVICE_ACCOUNT secret (GCP service account JSON) -
Deployed via ArgoCD on Kubernetes (not docker-compose
or manual shell) - Deploy manifests live in a separate repo:
deploy/overlays/{dev,testing,prod}
Deploy impact reporting: When a code change involves
deploy-affecting changes, output a summary of what needs to be updated
in the deploy repo. Examples: - New environment
variable → add to ConfigMap or Secret in the overlay, reference
in Deployment env - New configuration field → add to
application.conf ConfigMap - New sidecar
container → add container spec to Deployment manifest -
New volume/secret mount → add Volume + VolumeMount to
Deployment - New external service dependency → may need
NetworkPolicy, ServiceAccount, or IAM binding
Format the output as a checklist the user can apply to the deploy
repo. Do NOT suggest docker-compose changes or manual
docker run / kubectl apply commands.
Key Dependencies & Services
External Services
Integration
Firebase Admin SDK: User authentication and JWT
verification
Vertex AI: Text embeddings, Gemini models, Document
AI for OCR
Google Cloud Storage (GCS): Document and video file
storage
LibreChat: Optional integration for chat
interface
Optional Features
Kubernetes Job Delegation: - Video processing
(FFmpeg, VideoSeal) can run as K8s jobs instead of in-process - Enable
via KUBERNETES_JOBS_ENABLED=true - Requires service account
with GCS access and K8s job permissions
A counter-example: I have seen people write the AI a full screen of prompt. Project background, tech-stack versions, what changed in this feature last time, whatever incoherent brainstorm the boss just raised in the meeting, even a complaint that the AI's previous commit shipped a bug that got them paged at midnight to roll it back... all stuffed in. The AI's reply is then also a full screen of filler: your background restated, plus a pile of noncommittal "suggestions". The lower the information density, the lower the quality of the AI's output.
But there is a precondition: your own cognitive level sets the ceiling on what the AI can produce.
A low-level prompt written from low-level understanding buys only low-level output from the AI. If you cannot say "consider an FFT/DFT approach and embed the feature in the low-frequency domain", the AI will only hand you the most naive plaintext watermarking scheme. The AI's knowledge far exceeds any individual's, but it needs you to open the door with the right key, and that key is your own professional expertise.
For the AI's bad coding habits (such as the defensive programming mentioned earlier), what works in practice is: give the AI explicit coding-style rules, then run periodic undirected audits. An undirected audit does not chase a specific problem; it randomly samples code to see whether the AI has quietly drifted. My experience (Opus 4.6 with 1M context only) is that once the rules are in place the AI follows them fairly well. Long contexts bring small stylistic deviations, but it no longer strays from the overall course.
At bottom, these principles were good engineering practice long before AI. But back then you could run a tab: human colleagues would fill the gaps with shared context and memory.
So the more precise statement is: what AI amplifies is not your present discernment but the growth rate of your discernment. The faster you learn while using it, the further the AI can take you. But the trigger is always your own growth, never the AI's initiative.
Conversely, this means the ceilings on both your input and your output must rise. Your output capacity determines the quality of material you can give the AI: sharper questions, more solid seed content, clearer framing. Only high-quality input draws out the AI's higher-quality output. And whether you can catch that high-quality output, digest it, and turn it into your own growth depends on your capacity to take input in. Your output feeds the AI; the AI's output feeds your input. The ceilings at both ends jointly determine the AI's amplification factor for you: each foot stepping on the other, spiraling upward.
So the AI's "success" in programming does not prove it is approaching human intelligence. Programming's closed world, short feedback loops, and modularity happen to land squarely inside the AI's comfort zone. Once you enter domains that demand long-range memory, physical intuition, and causal reasoning, the abilities humans exercise as naturally as breathing become a chasm the AI struggles to cross. You think the AI is already very smart? That is because you happen to be watching it on its favorite playing field.
Conclusion
The ceiling of the AI's amplification is your discernment, not your output capacity. The illusion of "being surpassed" arises when the AI's output exceeds the observer's discernment: when you cannot tell good from bad, you mistake it for being able to do anything.
But discernment is not static. Using the AI, you will hit walls, sense that something is off, keep pressing with questions, and your discernment will grow. What the AI truly amplifies is the speed of that growth. The faster you learn, the further it can take you. But the trigger is always you. The AI will not volunteer what you don't know; it will even actively cover your blind spots with flattery.
The key point: the agent solves these type gymnastics not by cheating (sprinkling asInstanceOf or any everywhere) but through repeated attempts that genuinely fill the gaps in the formalization. That is not a workaround, and it gives me real confidence: maybe the AI can in fact write code that passes FM checks. Then the code's correctness carries a formal guarantee, which is far more comprehensive and reliable than unit-test coverage.
But the second tier is still another world. FM-level verification errors, such as z3 solver failures and unsatisfied refinement-type constraints, are still very hard for the agent to handle. Even so, the breakthrough on the first tier convinces me this road has promise.
Limitations
Of course, this approach has its limits.
In reality, today's formal-methods toolchains and ecosystems are still impoverished: each tool basically supports only a very small subset of a single language. Some syntax and patterns common in engineering are unsound on the FM side, or simply not yet proven. Never mind how easily proofs fall into infinite loops or become unsolvable: one careless step and the z3 solver is searching a possibility space larger than the universe, and will not finish before the universe ends.
Worse still, the kernel is riddled with register addresses and flag constants, and when we inspect memory these are exactly the critical details. Once the AI drops them while compacting context, its next decision may mislead you in precisely the wrong direction: not "not good enough" but "completely wrong". That is why, in scenarios like this, I would rather endure the degradation a long context brings than ever let the agent compact it. This information genuinely cannot be lost.
Besides, debugging itself is not a linear process. We often entertain several hypotheses at once and verify them separately. Route A hits a wall, so we switch to route B. But that does not mean A was a dead end; perhaps our understanding at the time was simply insufficient. After accumulating new insight on route B, we look back and realize A's earlier attempts may not have been futile after all. So I return to route A. I come back, but does the AI? Most likely you will face a fresh, innocent AI that knows nothing of the earlier exploration on route A. Human debugging experience accumulates in a spiral; the agent's memory is disposable.
I once gave a talk at Tsinghua's open-source operating-system community, sharing the pitfalls of developing a kernel NIC driver with AI assistance. The models at the time, Claude 3.7 through 4.0, were a pure liability: 90% of the information was hallucinated misdirection. The AI blended dwmac behavior across versions 2.x to 5.x, dragged in Qualcomm/Intel chip code, to say nothing of the PHY chip's registers and the C22/C45 protocols.
Training data and public discussion in this domain are closed; vendors hide their documentation behind registration walls. Perhaps fewer than 100 people worldwide have worked with this chip across the full stack, so the AI has no clear prior experience to draw on. The kernel repository's code spans 10-20 years, thousands of commits, over a hundred thousand lines, plus a pile of workarounds for different vendors and architectures, all of which is noise to the AI.
At first I was the layman, and the collaboration ran with the AI holding the whip and me pulling the plow: I executed the AI's plans and fed back the error messages. But as I understood more, I found the AI was feeding me nonsense, so the roles flipped. I specified the reverse-engineering path, and the AI executed it, tirelessly running bit-level comparison experiments over and over. The final conclusions and each next-round route decision were produced by human + AI together, rather than letting the AI take large strides on its own.
This case confirms several earlier points: in domains with scarce training data, AI hallucination is especially severe; losing register-level details in a long context is catastrophic; and the spiral progression of debugging is beyond what today's agent architectures can handle. Conversely, once the human took the lead, the AI, as a tireless executor, was genuinely a great help with the repeated bit-level comparison experiments.
Many data systems use polling to refresh displayed lists, which delays content-status updates and cannot give users immediate feedback on the page. Shortening the client-side refresh interval, in turn, places excessive load on the server and should be avoided.
To solve this problem, this article proposes an event subscription
mechanism. This mechanism provides real-time updates to the client,
eliminating the need for polling refresh and improving the user
experience.
Terminologies and Context
This article introduces the following concepts:
Hub: An event aggregation center that receives
events from producers and sends them to subscribers.
Buffer: An event buffer that caches events from
producers and waits for the Hub to dispatch them to subscribers.
Filter: An event filter that only sends events
meeting specified conditions to subscribers.
Broadcast: An event broadcaster that broadcasts the
producer's events to all subscribers.
Observer: An event observer that allows subscribers
to receive events through observers.
The document discusses some common concepts such as:
Pub-Sub pattern: a messaging pattern in which senders (publishers) do not send messages directly to specific recipients (subscribers). Published messages are instead divided into categories, without the publisher needing to know which subscribers (if any) exist. Subscribers express interest in one or more categories and receive all messages in those categories, likewise without needing to know which publishers (if any) exist.
Filter:
Topic-based filtering selects events by topic. Producers publish events to one or more topics, and subscribers subscribe to one or more topics; only events on subscribed topics are delivered. However, when a terminal client subscribes directly, this granularity is too coarse for the common hierarchical structure.
Content-based filtering selects events by message content. Producers publish events, and subscribers attach filters expressing conditions over the event content; only events matching those conditions are delivered. This method is suitable for the common hierarchical structure.
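Content-based filtering can be sketched as predicates over event fields. Event, Subscriber, and the field names here are illustrative; the two example filters mirror the File(Record = 111) and Workflow(Project = 222) filters in the architecture diagram:

```scala
final case class Event(kind: String, projectId: String, recordId: String)
final case class Subscriber(id: String, filter: Event => Boolean)

// Content-based dispatch: only subscribers whose predicate matches receive the event.
def dispatch(e: Event, subs: List[Subscriber]): List[String] =
  subs.filter(_.filter(e)).map(_.id)
```

In the real system the predicate would be built from the client's subscription request and applied between the Broadcast stage and each per-subscriber Buffer.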
Functional Requirements
Client users can subscribe to events through gRPC Stream, WebSocket,
or ServerSentEvent.
Whenever a record's status changes (e.g. when the record is updated
by an automation task) or when other collaborators operate on the same
record simultaneously, an event will be triggered and pushed to the
message center.
Events will be filtered using content filtering mode, ensuring that
only events that meet the specified conditions are sent to
subscribers.
Architecture
flowchart TD
Hub([Hub])
Buffer0[\"Buffer drop oldest"/]
Buffer1[\"Buffer1 drop oldest"/]
Buffer2[\"Buffer2 drop oldest"/]
Buffer3[\"Buffer3 drop oldest"/]
Filter1[\"File(Record = 111)"/]
Filter2[\"Workflow(Project = 222)"/]
Filter3[\"File(Project = 333)"/]
Broadcast((Broadcast))
Client1(Client1)
Client2(Client2)
Client3(Client3)
Hub --> Buffer0
subgraph Server
Buffer0 --> Broadcast
Broadcast --> Filter1 --> Buffer1 --> Observer1
Broadcast --> Filter2 --> Buffer2 --> Observer2
Broadcast --> Filter3 --> Buffer3 --> Observer3
end
subgraph Clients
Observer1 -.-> Client1
Observer2 -.-> Client2
Observer3 -.-> Client3
end
After receiving a change event, debounce and re-request the list interface, then render it.
When leaving the page, cancel the subscription.
Servers should follow these steps:
Subscribe to push events based on the client's filter.
When the client's message backlog grows too large, delete the
oldest messages from the buffer.
When the client cancels the subscription, the server should also
cancel the broadcast to the client.
Application /
Component Level Design (LLD)
flowchart LR
Server([Server])
Client([Client: Web...])
MQ[Kafka or other]
Broadcast((Broadcast))
subgraph ExternalHub
direction LR
Receiver --> MQ --> Sender
end
subgraph InMemoryHub
direction LR
Emit -.-> OnEach
end
Server -.-> Emit
Sender --> Broadcast
OnEach -.-> Broadcast
Broadcast -.-> gRPC
Broadcast -.-> gRPC
Broadcast -.-> gRPC
Server -- "if horizontal scaling is needed" --> Receiver
gRPC --Stream--> Client
For a single-node server, a simple Hub can be implemented using an
in-memory queue.
For multi-node servers, an external Hub implementation such as Kafka,
MQ, or Knative eventing should be considered. The broadcasting logic is
no different from that of a single machine.
Failure Modes
Fast Producer-Slow Consumer
This is a common scenario that requires special attention. The
publish-subscribe mechanism for terminal clients cannot always expect
clients to consume messages in real time. However, message continuity
must be maximally guaranteed. Clients may access our products in an
uncontrollable network environment, such as over 4G or poor Wi-Fi. Thus,
the server message queue cannot become too backlogged. When a client's
consumption rate cannot keep up with the server's production speed, this
article recommends using a bounded Buffer with the
OverflowStrategy.DropOldest strategy. This ensures that
subscriptions between consumers are isolated, avoiding too many unpushed
messages on the server (which could lead to potential memory leak
risks).
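The drop-oldest policy is small enough to sketch in plain Scala; a real deployment would instead use the bounded-queue/overflow support of its streaming library, as the summary below notes:

```scala
import scala.collection.mutable

// Bounded per-subscriber buffer with a drop-oldest overflow policy.
final class DropOldestBuffer[A](capacity: Int):
  private val queue = mutable.Queue.empty[A]

  def offer(a: A): Unit =
    if queue.size >= capacity then queue.dequeue() // evict the oldest element first
    queue.enqueue(a)

  def snapshot: List[A] = queue.toList
```

Each subscriber owns its own buffer, so one slow 4G client dropping old events never blocks or bloats the buffers of other subscribers.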
This document proposes an event subscription mechanism to address the
delay in updating content status caused by polling refresh. Clients can
subscribe to events through any long connection protocol, and events
will be filtered based on specified conditions. To avoid having too many
unpushed messages on the server, a bounded buffer with the
OverflowStrategy.DropOldest strategy is used.
Implementing this in Reactive Streams is straightforward, but you can
choose your preferred technology to do so.
In the previous
post, we discussed how to implement a file tree in PostgreSQL using
ltree. Now, let's talk about how to integrate version
control management for the file tree.
Version control is a process for managing changes made to a file tree
over time. This allows for the tracking of its history and the ability
to revert to previous versions, making it an essential tool for file
management.
With version control, users have access to the most up-to-date
version of a file, and changes are tracked and documented in a
systematic manner. This ensures that there is a clear record of what has
been done, making it much easier to manage files and their versions.
Terminologies and Context
One flawed implementation stores the full file metadata for every commit, recording even files that have not changed as NO_CHANGE rows. This scheme is good for querying but bad for writing. To fetch a specific version, the PostgreSQL engine only needs to scan the index with the condition file.version = ?, which is very cheap in a modern database system. But to create a new version, the engine must write \(N\) rows into the log table (where \(N\) is the number of current files). That causes a write peak in the database and is unacceptable: significant write amplification.
In theory, all we need to write is the changed files. If we can find a way to fetch an arbitrary version of the file tree in \(O(\log n)\) time per file, we can stop storing NO_CHANGE rows when creating new versions and eliminate the unnecessary write amplification.
Non Functional Requirements
Scalability
Consider the worst-case scenario: a file tree with more than 1,000
files that is committed to more than 10,000 times. The scariest
possibility is that every commit changes all files, causing a decrease
in write performance compared to the efficient implementation. Storing
more than 10 million rows in a single table can make it difficult to
separate them into partitioned tables.
Suppose \(N\) is the number of files and \(M\) is the number of commits. We need to ensure that the time complexity of fetching a snapshot of an arbitrary version is no more than \(O(N \cdot \log M)\). This is theoretically achievable.
Latency
In the worst case, the query can still respond in less than
100ms.
PostgreSQL has a keyword called LATERAL. In a join, a LATERAL subquery may reference columns of a preceding table in its WHERE condition. By writing the query this way, we directly tell the query optimizer how to use the index. Since the data in a composite index is stored as an ordered tree, finding the maximum value or an arbitrary value takes \(O(\log M)\) time. Doing this once per file, we obtain a total time complexity of \(O(N \cdot \log M)\).
Performance
Result: Fetching an arbitrary version will be done in tens of
milliseconds.
explain analyse
select f.record_id, f.filename, latest.revision_id
from files f
inner join lateral (
    select *
    from file_logs fl
    where f.filename = fl.filename
      and f.record_id = fl.record_id
      -- and revision_id < 20000
    order by revision_id desc
    limit 1
) as latest on f.record_id = 'f5c2049f-5a32-44f5-b0cc-b7e0531bf706';
Nested Loop  (cost=0.86..979.71 rows=1445 width=50) (actual time=0.040..18.297 rows=1445 loops=1)
  ->  Index Only Scan using files_pkey on files f  (cost=0.29..89.58 rows=1445 width=46) (actual time=0.019..0.174 rows=1445 loops=1)
        Index Cond: (record_id = 'f5c2049f-5a32-44f5-b0cc-b7e0531bf706'::uuid)
        Heap Fetches: 0
  ->  Memoize  (cost=0.57..0.65 rows=1 width=4) (actual time=0.012..0.012 rows=1 loops=1445)
        Cache Key: f.filename, f.record_id
        Cache Mode: binary
        Hits: 0  Misses: 1445  Evictions: 0  Overflows: 0  Memory Usage: 221kB
        ->  Subquery Scan on latest  (cost=0.56..0.64 rows=1 width=4) (actual time=0.012..0.012 rows=1 loops=1445)
              ->  Limit  (cost=0.56..0.63 rows=1 width=852) (actual time=0.012..0.012 rows=1 loops=1445)
                    ->  Index Only Scan Backward using file_logs_pk on file_logs fl  (cost=0.56..11.72 rows=158 width=852) (actual time=0.011..0.011 rows=1 loops=1445)
                          Index Cond: ((record_id = f.record_id) AND (filename = (f.filename)::text))
                          Heap Fetches: 0
Planning Time: 0.117 ms
Execution Time: 18.384 ms
Test Datasets
This dataset simulates the worst-case scenario of a table with 14.6
million rows. Specifically, it contains 14.45 million rows representing
a situation in which 1,400 files are changed 10,000 times.
-- cnt: 14605858
select count(0) from file_logs;
-- cnt: 14451538
select count(0) from file_logs where record_id = 'f5c2049f-5a32-44f5-b0cc-b7e0531bf706';
A file tree is a hierarchical structure used to organize files and
directories on a computer. It allows users to easily navigate and access
their files and folders, and is commonly used in operating systems and
file management software.
But implementing file trees in traditional RDBMS like MySQL can be a
challenge due to the lack of support for hierarchical data structures.
However, there are workarounds such as using nested sets or materialized
path approaches. Alternatively, you could consider using NoSQL databases
like MongoDB or document-oriented databases like Couchbase, which have
built-in support for hierarchical data structures.
It is possible to implement a file tree in PostgreSQL using the
ltree datatype provided by PostgreSQL. This datatype can
help us build the hierarchy within the database.
TL;DR
Pros
Excellent performance!
No migration is needed for this, as no new columns will be added.
Only a new expression index needs to be created.
Cons
An additional mechanism is needed to create virtual folder entities (only if you need to show the folder level).
There are limitations on file/folder name length (especially with non-ASCII characters).
Limitation
The maximum length of a file or directory name is limited. In the
worst case, where non-ASCII characters (e.g. Chinese) and
alphanumerics are interlaced, a name cannot be longer than 33
characters. Even if every character is Chinese, the name cannot
exceed 62 characters.
According to the PostgreSQL documentation, a label path cannot exceed
65535 labels. In most cases this limit is more than sufficient; it is
unlikely you would ever need to nest directories that deep.
select escape_filename_for_ltree(
    '一0二0三0四0五0六0七0八0九0十0' ||
    '一0二0三0四0五0六0七0八0九0十0' ||
    '一0二0三0四0五0六0七0八0九0十0' ||
    '一0二0'
); -- worst case len 34

select escape_filename_for_ltree(
    '一二三四五六七八九十' ||
    '一二三四五六七八九十' ||
    '一二三四五六七八九十' ||
    '一二三四五六七八九十' ||
    '一二三四五六七八九十' ||
    '一二三四五六七八九十' ||
    '一二三'
); -- Chinese case len 63
[42622] ERROR: label string is too long
Detail: Label length is 259, must be at most 255, at character 260.
Where: PL/pgSQL function escape_filename_for_ltree(text) line 5 at SQL statement
How to use
Build expression index
CREATE INDEX idx_file_tree_filename ON files USING gist (escape_filename_for_ltree(filename));
Example Query
explain analyse
select filename
from files
where escape_filename_for_ltree(filename) ~ 'ow.*{1}'
  and record_id = '1666bad1-202c-496e-bb0e-9664ce3febcb';
Bitmap Heap Scan on files  (cost=32.12..36.38 rows=1 width=28) (actual time=0.341..0.355 rows=8 loops=1)
  Recheck Cond: ((record_id = '1666bad1-202c-496e-bb0e-9664ce3febcb'::uuid) AND (escape_filename_for_ltree((filename)::text) <@ 'ow'::ltree))
  Heap Blocks: exact=3
  ->  BitmapAnd  (cost=32.12..32.12 rows=1 width=0) (actual time=0.323..0.324 rows=0 loops=1)
        ->  Bitmap Index Scan on idx_file_tree_record_id  (cost=0.00..4.99 rows=93 width=0) (actual time=0.051..0.051 rows=100 loops=1)
              Index Cond: (record_id = '1666bad1-202c-496e-bb0e-9664ce3febcb'::uuid)
        ->  Bitmap Index Scan on idx_file_tree_filename  (cost=0.00..26.88 rows=347 width=0) (actual time=0.253..0.253 rows=52 loops=1)
              Index Cond: (escape_filename_for_ltree((filename)::text) <@ 'ow'::ltree)
Planning Time: 0.910 ms
Execution Time: 0.599 ms
Explanation
PostgreSQL's ltree data type allows a label to contain alphanumeric
characters and underscores, with a maximum length of 255 characters.
That leaves us one special character, the underscore, to use as
notation for building our escape rules within a label.
Slashes (/) are replaced with dots (.), since the dot is ltree's
label separator; this needs no further explanation.
Initially, I attempted to encode all non-alphanumeric characters in
their Unicode hex form. After receiving advice from others, however,
I discovered that base64 encoding is more efficient in terms of
information entropy. Ultimately, I settled on base62 encoding
instead, which produces no illegal label characters while still
achieving close to the maximum possible information entropy.
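The entropy argument can be checked with quick back-of-envelope arithmetic (a Python sketch for illustration; the sample character is my own, not from the post):

```python
# Compare encoded lengths for one 3-byte UTF-8 character (e.g. a CJK char).
# Hex spends 2 output chars per byte; base62 packs ~5.95 bits per char.

def base62_len(data: bytes) -> int:
    """Number of base62 digits needed when the bytes are read as one big integer."""
    val = int.from_bytes(data, "big")
    digits = 0
    while val > 0:
        val //= 62
        digits += 1
    return digits

utf8 = "中".encode("utf-8")   # 3 bytes
print(len(utf8.hex()))        # hex encoding: 6 chars
print(base62_len(utf8))       # base62 encoding: 5 chars
```

One character saved per CJK codepoint matters here because the savings are multiplied across every non-alphanumeric run in a long filename, under a hard 255-character label cap.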
The escaped string is the final representation of the physical data
that will be stored in the PostgreSQL index.
If you want to store several isolated file trees in the same table,
prepend the isolation key as the first label of each ltree path. By
doing this, you will get the best query performance, since every
query is confined to a single subtree.
Summary
This document explains how to implement a file tree in PostgreSQL
using the ltree datatype. The ltree datatype
can help build the hierarchy within the database, and an expression
index needs to be created. There are some limitations on the file/folder
name length, but the performance is excellent. The document also
provides PostgreSQL functions for escaping and encoding file/folder
names.
Appendix: PostgreSQL Functions
Entry function (immutable is required)
CREATE OR REPLACE FUNCTION escape_filename_for_ltree(filename TEXT)
    RETURNS ltree AS
$$
DECLARE
    escaped_path ltree;
BEGIN
    select string_agg(escape_part(part), '.')
    into escaped_path
    from (select regexp_split_to_table as part
          from regexp_split_to_table(filename, '/')) as parts;

    return escaped_path;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
Util: Escape every part (folder or file)
create or replace function escape_part(part text) returns text as
$$
declare
    escaped_part text;
begin
    select string_agg(escaped, '')
    into escaped_part
    from (select case substring(sep, 1, 1) ~ '[0-9a-zA-Z]'
                     when true then sep
                     else '_' || base62_encode(sep) || '_'
                     end as escaped
          from (select split_string_by_alpha as sep
                from split_string_by_alpha(part)) as split) as escape;

    RETURN escaped_part;
end;
$$ language plpgsql immutable;
Util: Split a string into groups
Each group contains only alphabetic characters or non-alphabetic
characters.
CREATE OR REPLACE FUNCTION split_string_by_alpha(input_str TEXT)
    RETURNS SETOF TEXT AS
$$
DECLARE
    split_str TEXT;
BEGIN
    IF input_str IS NULL OR input_str = '' THEN
        RETURN;
    END IF;

    WHILE input_str != '' LOOP
        split_str := substring(input_str from '[^0-9a-zA-Z]+|[0-9a-zA-Z]+');
        IF split_str != '' THEN
            RETURN NEXT split_str;
        END IF;
        input_str := substring(input_str from length(split_str) + 1);
    END LOOP;

    RETURN;
END;
$$ LANGUAGE plpgsql;
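For intuition, the same grouping can be expressed in Python with a single regex (a hypothetical port for illustration, not part of the original post):

```python
import re

# Split a string into alternating runs of alphanumerics and runs of
# everything else, mirroring the '[^0-9a-zA-Z]+|[0-9a-zA-Z]+' pattern above.
def split_string_by_alpha(s: str) -> list[str]:
    return re.findall(r"[^0-9a-zA-Z]+|[0-9a-zA-Z]+", s)

print(split_string_by_alpha("foo-中-bar1"))  # ['foo', '-中-', 'bar1']
```

Note that the groups concatenate back to the original string, so nothing is lost before the escaping step.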
Util: base62 encode function
By using the base62_encode function, we can create a string that
meets ltree's label requirements while achieving close to the
maximum possible information entropy.
CREATE OR REPLACE FUNCTION base62_encode(data TEXT) RETURNS TEXT AS
$$
DECLARE
    ALPHABET CHAR(62)[] := ARRAY [
        '0','1','2','3','4','5','6','7','8','9',
        'A','B','C','D','E','F','G','H','I','J',
        'K','L','M','N','O','P','Q','R','S','T',
        'U','V','W','X','Y','Z','a','b','c','d',
        'e','f','g','h','i','j','k','l','m','n',
        'o','p','q','r','s','t','u','v','w','x',
        'y','z'
        ];
    BASE   BIGINT  := 62;
    result TEXT    := '';
    val    numeric := 0;
    bytes  bytea   := data::bytea;
    len    INT     := length(data::bytea);
BEGIN
    -- Accumulate all bytes into one big integer (big-endian).
    FOR i IN 0..(len - 1) LOOP
        val := (val * 256) + get_byte(bytes, i);
    END LOOP;

    -- Emit base62 digits from least to most significant.
    WHILE val > 0 LOOP
        result := ALPHABET[val % BASE + 1] || result;
        val := floor(val / BASE);
    END LOOP;

    RETURN result;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
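The big-integer interpretation above can be mirrored in Python (a hypothetical port for illustration; `int.from_bytes` replaces the byte-accumulation loop, and the alphabet matches the SQL array):

```python
# Digits, then uppercase, then lowercase -- same order as the SQL ALPHABET array.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62_encode(data: str) -> str:
    """Treat the UTF-8 bytes as one big-endian integer, then emit base62 digits."""
    val = int.from_bytes(data.encode("utf-8"), "big")
    result = ""
    while val > 0:
        result = ALPHABET[val % 62] + result
        val //= 62
    return result

print(base62_encode("中"))  # bytes E4 B8 AD -> '10tRt'
```

Because whole runs are packed into one integer before conversion, a long run of CJK characters costs noticeably fewer output characters per input character than escaping each character individually.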
Move semantics make it possible to safely transfer resource ownership
between objects, across scopes, and in and out of threads, while
maintaining resource safety. — (since C++11)
void f()
{
    vector<string> vs(100);   // not std::vector: valid() added
    if (!vs.valid()) {
        // handle error or exit
    }

    ifstream fs("foo");       // not std::ifstream: valid() added
    if (!fs.valid()) {
        // handle error or exit
    }

    // ...
} // destructors clean up as usual
C++ introduced RAII, an advanced concept that nearly solves the
resource-safety problem. But constrained by the era in which C++ was
born, early C++ could only guarantee resource safety through lvalue
references plus clone (deep copy) semantics, so assignment frequently
deep-copied entire objects and repeatedly constructed and destructed
resources, wasting a great deal of work. C++11 added rvalue
references, but you still have to implement move (shallow copy) for
them yourself. Moreover, C++ cannot statically detect a double move,
nor the problem that the source variable remains usable after being
moved from.