Agent Skills as Files vs Database: What 5,000 Skills Look Like in Production
I built an open-source cognitive memory database for AI agents called YantrikDB. Over the past several months I noticed something while watching agents write their own skills at runtime: the formats everyone uses to store skills — Anthropic’s SKILL.md files with YAML frontmatter, Voyager’s skill_library/*.py files — were optimized for a use case that doesn’t exist anymore.
Those formats were built for humans to author, review, version-control, and edit skills. Git diffs. Code review. IDE editing. They are filesystem-shaped because the workflow was filesystem-shaped.
But the actual production workload looks different. Skills get written by agents at runtime via define-style APIs. Skills get retrieved by agents at inference, not opened in editors. Outcome events get emitted by agent runs, not authored by humans. The “skill catalog” of an autonomous learner is not a folder of files anyone is grooming — it’s memory.
So I measured what happens when you scale that mismatch. The paper is on Zenodo: Skill as Memory, Not Document. The code, scripts, and raw CSVs needed to reproduce every number are at github.com/yantrikos/yantrikdb-server. The numbers below are what came out.
What I measured
A 5,000-skill corpus, deterministic seed, 100 ground-truth queries, top-K=5 retrieval, cl100k_base tokenizer. Three things:
- Token cost of delivering relevant skills into the LLM’s context: full-catalog disclosure vs. progressive filesystem retrieval vs. a database-native substrate (a measurement sketch follows this list).
- Retrieval latency at 5K-skill scale.
- Invalid-skill admission rate — what happens when an agent writes a malformed skill into the catalog.
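For concreteness, the token-cost measurement is nothing exotic. Here is a minimal sketch using tiktoken’s cl100k_base encoding, as in the paper; the catalog contents and the retrieval step are hypothetical stand-ins, not YantrikDB’s actual API.

```python
import tiktoken

# cl100k_base is the encoding used for every token count in the paper.
enc = tiktoken.get_encoding("cl100k_base")

def context_cost(skill_texts: list[str]) -> int:
    """Tokens consumed by injecting these skill texts into the LLM context."""
    return sum(len(enc.encode(text)) for text in skill_texts)

# Hypothetical stand-ins for the disclosure patterns being compared:
catalog = ["# skill body ..." for _ in range(5000)]  # full-catalog dump
top_k = catalog[:5]                                  # indexed top-K=5 retrieval

print("full catalog:", context_cost(catalog))
print("top-K=5:     ", context_cost(top_k))
```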
Result 1: token cost
| Disclosure pattern | Mean tokens per query | Notes |
|---|---|---|
| Full-catalog dump (the naive baseline) | 919,200 | Exceeds Claude 3.7’s 200K window and GPT-4 Turbo’s 128K — literally cannot fit |
| SKILL.md filesystem retrieval (top-K=5 with frontmatter) | 549 | Indexed; this is the realistic comparison |
| SKILL.md retrieval, frontmatter stripped (body only) | 369 | Ablation: what happens if you strip the YAML |
| Database-native substrate (top-K=5) | 369 | Metadata stored as indexed columns, never enters context |
The 919,200 number is real but it’s not the point. Nobody actually dumps full catalogs at 5K skills — it physically doesn’t fit. The honest comparison is against indexed filesystem retrieval: a 1.49× ratio.
What surprised me is what the ablation showed. When I stripped the YAML frontmatter from the retrieved SKILL.md content, the gap collapsed to 1.000×, perfectly identical to the substrate. The entire 1.49× difference is YAML frontmatter overhead (about 36 tokens per retrieved skill). The substrate’s token-cost win is architectural: it stores metadata as indexed columns that the database queries against but never returns to the LLM. Filesystem retrieval can’t do that without building a separate metadata index, which is exactly what the “fair indexed baseline” already assumes.
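The ablation itself is mechanically simple: drop the leading YAML block and count again. A sketch, assuming the conventional ---‑delimited frontmatter at the top of a SKILL.md file (the example skill is invented):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def strip_frontmatter(skill_md: str) -> str:
    """Drop a leading ----delimited YAML frontmatter block, keep the body."""
    if skill_md.startswith("---"):
        # Split on the closing delimiter; everything after it is the body.
        _, _, body = skill_md.partition("\n---\n")
        return body.strip() if body else skill_md
    return skill_md

skill = """---
name: summarize_pdf
triggers: ["summarize", "pdf"]
applies_to: documents
---
When asked to summarize a PDF, extract text page by page, then condense."""

print(len(enc.encode(skill)))                     # with frontmatter
print(len(enc.encode(strip_frontmatter(skill))))  # body only
```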
So: small but architecturally clean. Not a hero number. The substrate’s actual story is something else.
Result 2: retrieval latency
p50 = 87.3 ms. p95 = 106.3 ms. At 5,000 skills, single-node deployment. Measured at engine commit c886e9e on Windows 11 + Docker Desktop with a bundled MiniLM-L6-v2 embedder.
This is “fast enough that latency stops being the constraint” but not “10× faster than a filesystem walk.” Filesystem walks at 5K skills are also fast. The substrate wins on consistency under load (Raft-replicated, doesn’t degrade with concurrent writes) but the single-node p50 isn’t where the architectural difference lives.
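If you want to sanity-check the latency numbers against your own deployment, the percentile math is the standard kind. A sketch, where retrieve() is a hypothetical stand-in for whatever client call hits your retrieval endpoint:

```python
import statistics
import time

def retrieve(query: str, k: int = 5) -> list[str]:
    """Hypothetical stand-in for the substrate's top-K retrieval call."""
    raise NotImplementedError("point this at your own endpoint")

def latency_percentiles(queries: list[str]) -> tuple[float, float]:
    """Return (p50, p95) retrieval latency in milliseconds."""
    samples = []
    for q in queries:
        t0 = time.perf_counter()
        retrieve(q, k=5)
        samples.append((time.perf_counter() - t0) * 1000.0)
    # quantiles(n=100) yields 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples, n=100)
    return cuts[49], cuts[94]
```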
Result 3: invalid-skill admission
This is where the actual story is. I generated 90 adversarially malformed skills covering 18 failure classes: typos in trigger phrases, malformed applies_to, oversized bodies, missing required fields, schema violations, broken regex in triggers. These are the things an LLM will plausibly produce when authoring skills.
| Substrate | Invalid skills admitted | Rate |
|---|---|---|
| Filesystem (SKILL.md + YAML-parseable acceptance) | 68 of 70 | 97% |
| Database-native (schema validation at write time) | 0 of 70 | 0% |
Filesystem catalogs accept almost everything that parses as YAML. The substrate rejects everything that violates the schema at write time. The 97% / 0% gap isn’t a tuning difference — it’s the difference between “validate semantics later when an agent tries to use the skill” and “validate at the API boundary on write.”
This matters because of where the cost of admission lands. Filesystem catalogs accept malformed skills silently and they show up later as agent failures at inference time, with no clean attribution back to the bad write. Schema validation catches the problem at the source.
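To make the two acceptance policies concrete, here is a toy contrast using PyYAML for “anything that parses is admitted” and jsonschema for write-time validation. The schema below is illustrative, not YantrikDB’s actual skill schema:

```python
import yaml
from jsonschema import Draft202012Validator

# Illustrative skill schema -- not YantrikDB's actual one.
SKILL_SCHEMA = {
    "type": "object",
    "required": ["name", "triggers", "body"],
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "triggers": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "body": {"type": "string", "maxLength": 4000},
    },
    "additionalProperties": False,
}
validator = Draft202012Validator(SKILL_SCHEMA)

# A malformed skill: triggers is a bare string, required 'body' is missing.
malformed = "name: fix_imports\ntriggers: fix my imports\n"

# Policy 1: filesystem-style acceptance. It parses as YAML, so it's admitted.
doc = yaml.safe_load(malformed)
print("admitted by YAML parse:", doc)

# Policy 2: schema validation at the write boundary. The same skill is rejected.
for err in validator.iter_errors(doc):
    print("rejected at write time:", err.message)
```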
So what’s the actual claim
The framing I landed on after three rounds of adversarial red-teaming is: skill as memory, not document.
Filesystem catalogs are documents — they have YAML frontmatter humans read, body sections humans format, file organizations humans browse. Memory has different optimization criteria: retrieval cost (indexed disclosure, projected-out metadata), write safety (schema enforcement at the boundary), and machine consumption (compact body, no human-readable wrappers).
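A toy illustration of the “projected-out metadata” criterion, using sqlite3 purely as an analogy (YantrikDB’s storage engine is its own thing): the metadata column is filtered on by the query but never selected, so it never reaches the context window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE skills (
        name     TEXT PRIMARY KEY,
        triggers TEXT,   -- metadata: queried against, never returned to the LLM
        body     TEXT    -- the only column that enters the context window
    )
""")
conn.execute(
    "INSERT INTO skills VALUES (?, ?, ?)",
    ("summarize_pdf", "summarize pdf document",
     "Extract text page by page, then condense."),
)

# The WHERE clause consumes the metadata; the SELECT projects it out.
bodies = conn.execute(
    "SELECT body FROM skills WHERE triggers LIKE ?", ("%pdf%",)
).fetchall()
print(bodies)  # only skill bodies are disclosed
```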
I don’t claim YantrikDB is a new database architecture. All the individual primitives it ships — typed records, vector index, append-only logs, schema validation, Raft replication — exist in prior work. What I’m pointing at is a category mismatch: using human-editorial formats as agent memory substrates produces three coupled failure modes (token burn, slowdown, invalid-skill admission) that compound at scale, and the cleanest fix is treating skill storage as a database problem rather than a filesystem problem.
What I left out (deliberately)
The honest version of this paper has explicit limits. I document them because they’re the things a careful reader would catch:
- The 5,000-skill corpus is synthetic. Real catalogs with human-authored jargon, near-duplicates, and naming drift will degrade retrieval differently. A follow-up paper measuring real catalogs is pre-specified for 2026-08-04.
- I did not yet benchmark against a competently built Postgres + pgvector + JSON-schema + audit-table baseline. That is the most important deferred comparison; it’s also in the follow-up.
- End-to-end agent task success (does the substrate actually make the agent better at downstream tasks?) is not measured here; everything in this paper is benchmarked at the substrate level, not the agent level. Also in the follow-up.
The follow-up paper is a 12-week commitment, not vapor.
Reproducibility
Everything is reproducible. The 5K-skill corpus and 100 ground-truth queries live at benchmarks/skill_recall/ in the public repo. Four benchmark scripts re-generate every table in the paper:
```bash
git clone https://github.com/yantrikos/yantrikdb-server.git
cd yantrikdb-server/benchmarks/skill_recall
pip install tiktoken matplotlib
python token_cost_bench.py
python kill_test_ablation.py
python k_sensitivity_and_tokenizer_bench.py
python hallucination_admission_bench.py
```

Raw CSV outputs are in the Zenodo bundle alongside the paper. If your numbers differ from mine, I want to know: open an issue or email me.
Discussion question
Where are you storing your agent’s skills today? Filesystem because it was the default? Vector store because you needed retrieval? Custom database because you hit one of these failure modes already? I’m collecting field reports for the follow-up paper — if you have a real-world catalog and any of this resonates (or contradicts your experience), I’d genuinely like to hear it.
Cite as: Sarkar, P. (2026). Skill as Memory, Not Document: A Database-Native Substrate for Agent Skill Catalogs. Zenodo. https://doi.org/10.5281/zenodo.20128887
Author: Pranab Sarkar, Independent Researcher · ORCID 0009-0009-8683-1481
License: CC BY 4.0
Related software: YantrikDB v0.8.13