Agent Skills as Files vs Database: What 5,000 Skills Look Like in Production
I built an open-source cognitive memory database for AI agents called YantrikDB. Over the past several months I noticed something while watching agents write their own skills at runtime: the formats everyone uses to store skills — Anthropic’s SKILL.md files with YAML frontmatter, Voyager’s skill_library/*.py files — were optimized for a use case that doesn’t exist anymore.
Those formats were built for humans to author, review, version-control, and edit skills. Git diffs. Code review. IDE editing. They are filesystem-shaped because the workflow was filesystem-shaped.
But the actual production workload looks different. Skills get written by agents at runtime via define-style APIs. Skills get retrieved by agents at inference, not opened in editors. Outcome events get emitted by agent runs, not authored by humans. The “skill catalog” of an autonomous learner is not a folder of files anyone is grooming — it’s memory.
So I measured what happens when you scale that mismatch. The paper is on Zenodo: Skill as Memory, Not Document. The code, scripts, and raw CSVs needed to reproduce every number are at github.com/yantrikos/yantrikdb-server. The numbers below are what came out.
What I measured
A 5,000-skill corpus, deterministic seed, 100 ground-truth queries, top-K=5 retrieval, cl100k_base tokenizer. Three things:
- Token cost of delivering relevant skills into the LLM’s context: full-catalog disclosure vs. progressive filesystem retrieval vs. a database-native substrate (a measurement sketch follows this list).
- Retrieval latency at 5K-skill scale.
- Invalid-skill admission rate — what happens when an agent writes a malformed skill into the catalog.
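For concreteness, the token-cost measurement is nothing exotic. Here is a minimal sketch using tiktoken’s cl100k_base encoding, as in the paper; the catalog contents and the retrieval step are hypothetical stand-ins, not YantrikDB’s actual API.

```python
import tiktoken

# cl100k_base is the encoding used for every token count in the paper.
enc = tiktoken.get_encoding("cl100k_base")

def context_cost(skill_texts: list[str]) -> int:
    """Tokens consumed by injecting these skill texts into the LLM context."""
    return sum(len(enc.encode(text)) for text in skill_texts)

# Hypothetical stand-ins for the disclosure patterns being compared:
catalog = ["# skill body ..." for _ in range(5000)]  # full-catalog dump
top_k = catalog[:5]                                  # indexed top-K=5 retrieval

print("full catalog:", context_cost(catalog))
print("top-K=5:     ", context_cost(top_k))
```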
Result 1: token cost
| Disclosure pattern | Mean tokens per query | Notes |
|---|---|---|
| Full-catalog dump (the naive baseline) | 919,200 | Exceeds Claude 3.7’s 200K window and GPT-4 Turbo’s 128K — literally cannot fit |
| SKILL.md filesystem retrieval (top-K=5 with frontmatter) | 549 | Indexed; this is the realistic comparison |
| SKILL.md retrieval, frontmatter stripped (body only) | 369 | Ablation: what happens if you strip the YAML |
| Database-native substrate (top-K=5) | 369 | Metadata stored as indexed columns, never enters context |
The 919,200 number is real but it’s not the point. Nobody actually dumps full catalogs at 5K skills — it physically doesn’t fit. The honest comparison is against indexed filesystem retrieval: a 1.49× ratio.
What surprised me is what the ablation showed. When I stripped the YAML frontmatter from the retrieved SKILL.md content, the gap collapsed to 1.000×, perfectly identical to the substrate. The entire 1.49× difference is YAML frontmatter overhead (about 36 tokens per retrieved skill). The substrate’s token-cost win is architectural: it stores metadata as indexed columns that the database queries against but never returns to the LLM. Filesystem retrieval can’t do that without building a separate metadata index, which is exactly what the “fair indexed baseline” already assumes.
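The ablation itself is mechanically simple: drop the leading YAML block and count again. A sketch, assuming the conventional ---‑delimited frontmatter at the top of a SKILL.md file (the example skill is invented):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def strip_frontmatter(skill_md: str) -> str:
    """Drop a leading ----delimited YAML frontmatter block, keep the body."""
    if skill_md.startswith("---"):
        # Split on the closing delimiter; everything after it is the body.
        _, _, body = skill_md.partition("\n---\n")
        return body.strip() if body else skill_md
    return skill_md

skill = """---
name: summarize_pdf
triggers: ["summarize", "pdf"]
applies_to: documents
---
When asked to summarize a PDF, extract text page by page, then condense."""

print(len(enc.encode(skill)))                     # with frontmatter
print(len(enc.encode(strip_frontmatter(skill))))  # body only
```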
So: small but architecturally clean. Not a hero number. The substrate’s actual story is something else.
Result 2: retrieval latency
p50 = 87.3 ms. p95 = 106.3 ms. At 5,000 skills, single-node deployment. Measured at engine commit c886e9e on Windows 11 + Docker Desktop with a bundled MiniLM-L6-v2 embedder.
This is “fast enough that latency stops being the constraint” but not “10× faster than a filesystem walk.” Filesystem walks at 5K skills are also fast. The substrate wins on consistency under load (Raft-replicated, doesn’t degrade with concurrent writes) but the single-node p50 isn’t where the architectural difference lives.
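If you want to sanity-check the latency numbers against your own deployment, the percentile math is the standard kind. A sketch, where retrieve() is a hypothetical stand-in for whatever client call hits your retrieval endpoint:

```python
import statistics
import time

def retrieve(query: str, k: int = 5) -> list[str]:
    """Hypothetical stand-in for the substrate's top-K retrieval call."""
    raise NotImplementedError("point this at your own endpoint")

def latency_percentiles(queries: list[str]) -> tuple[float, float]:
    """Return (p50, p95) retrieval latency in milliseconds."""
    samples = []
    for q in queries:
        t0 = time.perf_counter()
        retrieve(q, k=5)
        samples.append((time.perf_counter() - t0) * 1000.0)
    # quantiles(n=100) yields 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples, n=100)
    return cuts[49], cuts[94]
```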
Result 3: invalid-skill admission
This is where the actual story is. I generated 90 adversarially malformed skills covering 18 failure classes: typos in trigger phrases, malformed applies_to, oversized bodies, missing required fields, schema violations, broken regex in triggers. These are the things an LLM will plausibly produce when authoring skills.
| Substrate | Invalid skills admitted | Rate |
|---|---|---|
| Filesystem (SKILL.md + YAML-parseable acceptance) | 68 of 70 | 97% |
| Database-native (schema validation at write time) | 0 of 70 | 0% |
Filesystem catalogs accept almost everything that parses as YAML. The substrate rejects everything that violates the schema at write time. The 97% / 0% gap isn’t a tuning difference — it’s the difference between “validate semantics later when an agent tries to use the skill” and “validate at the API boundary on write.”
This matters because of where the cost of admission lands. Filesystem catalogs accept malformed skills silently and they show up later as agent failures at inference time, with no clean attribution back to the bad write. Schema validation catches the problem at the source.
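To make the two acceptance policies concrete, here is a toy contrast using PyYAML for “anything that parses is admitted” and jsonschema for write-time validation. The schema below is illustrative, not YantrikDB’s actual skill schema:

```python
import yaml
from jsonschema import Draft202012Validator

# Illustrative skill schema -- not YantrikDB's actual one.
SKILL_SCHEMA = {
    "type": "object",
    "required": ["name", "triggers", "body"],
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "triggers": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "body": {"type": "string", "maxLength": 4000},
    },
    "additionalProperties": False,
}
validator = Draft202012Validator(SKILL_SCHEMA)

# A malformed skill: triggers is a bare string, required 'body' is missing.
malformed = "name: fix_imports\ntriggers: fix my imports\n"

# Policy 1: filesystem-style acceptance. It parses as YAML, so it's admitted.
doc = yaml.safe_load(malformed)
print("admitted by YAML parse:", doc)

# Policy 2: schema validation at the write boundary. The same skill is rejected.
for err in validator.iter_errors(doc):
    print("rejected at write time:", err.message)
```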
So what’s the actual claim
The framing I landed on after three rounds of adversarial red-teaming is: skill as memory, not document.
Filesystem catalogs are documents — they have YAML frontmatter humans read, body sections humans format, file organizations humans browse. Memory has different optimization criteria: retrieval cost (indexed disclosure, projected-out metadata), write safety (schema enforcement at the boundary), and machine consumption (compact body, no human-readable wrappers).
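A toy illustration of the “projected-out metadata” criterion, using sqlite3 purely as an analogy (YantrikDB’s storage engine is its own thing): the metadata column is filtered on by the query but never selected, so it never reaches the context window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE skills (
        name     TEXT PRIMARY KEY,
        triggers TEXT,   -- metadata: queried against, never returned to the LLM
        body     TEXT    -- the only column that enters the context window
    )
""")
conn.execute(
    "INSERT INTO skills VALUES (?, ?, ?)",
    ("summarize_pdf", "summarize pdf document",
     "Extract text page by page, then condense."),
)

# The WHERE clause consumes the metadata; the SELECT projects it out.
bodies = conn.execute(
    "SELECT body FROM skills WHERE triggers LIKE ?", ("%pdf%",)
).fetchall()
print(bodies)  # only skill bodies are disclosed
```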
I don’t claim YantrikDB is a new database architecture. All the individual primitives it ships — typed records, vector index, append-only logs, schema validation, Raft replication — exist in prior work. What I’m pointing at is a category mismatch: using human-editorial formats as agent memory substrates produces three coupled failure modes (token burn, slowdown, invalid-skill admission) that compound at scale, and the cleanest fix is treating skill storage as a database problem rather than a filesystem problem.
What I left out (deliberately)
The honest version of this paper has explicit limits. I document them because they’re the things a careful reader would catch:
- The 5,000-skill corpus is synthetic. Real catalogs with human-authored jargon, near-duplicates, and naming drift will degrade retrieval differently. A follow-up paper measuring real catalogs is pre-specified for 2026-08-04.
- I did not yet benchmark against a competently built Postgres + pgvector + JSON-schema + audit-table baseline. That is the most important deferred comparison; it’s also in the follow-up.
- End-to-end agent task success (does the substrate actually make the agent better at downstream tasks?) is not measured here; everything in this paper is benchmarked at the substrate level, not the agent level. Also in the follow-up.
The follow-up paper is a 12-week commitment, not vapor.
Reproducibility
Everything is reproducible. The 5K-skill corpus and 100 ground-truth queries live at benchmarks/skill_recall/ in the public repo. Four benchmark scripts re-generate every table in the paper:
```bash
git clone https://github.com/yantrikos/yantrikdb-server.git
cd yantrikdb-server/benchmarks/skill_recall
pip install tiktoken matplotlib
python token_cost_bench.py
python kill_test_ablation.py
python k_sensitivity_and_tokenizer_bench.py
python hallucination_admission_bench.py
```

Raw CSV outputs are in the Zenodo bundle alongside the paper. If your numbers differ from mine, I want to know: open an issue or email me.
Discussion question
Where are you storing your agent’s skills today? Filesystem because it was the default? Vector store because you needed retrieval? Custom database because you hit one of these failure modes already? I’m collecting field reports for the follow-up paper — if you have a real-world catalog and any of this resonates (or contradicts your experience), I’d genuinely like to hear it.
Cite as: Sarkar, P. (2026). Skill as Memory, Not Document: A Database-Native Substrate for Agent Skill Catalogs. Zenodo. https://doi.org/10.5281/zenodo.20128887
Author: Pranab Sarkar, Independent Researcher · ORCID 0009-0009-8683-1481
License: CC BY 4.0
Related software: YantrikDB v0.8.13