Analysis & Commentary
Dhyaneesh DS
March 2026
Field Report No. 3

SKILL.md Ablation: What Makes Agent Skills Reliable

A corpus of 40 skills shows that the performance gap is not about topic coverage. It is about whether the file encodes the decisions the model would otherwise improvise: when to activate, how to branch, what never to do, what to disclose, where state lives, and how results should return.

01 /

The Quality Gap Is Structural, Not Topical

The strongest skills in the corpus do not merely expose a tool. They remove runtime ambiguity. They turn decisions the model would otherwise infer on the fly into explicit operating surfaces: activation boundaries, routing rules, error recovery, negative constraints, output formats, and state lifecycles.

The weakest skills fail in the opposite direction. They look usable because the topic is present, but the agent is left to improvise everything that matters after the happy path. That gap compounds with longer context windows, shifting model behavior, and pressure-filled situations where the base model is most tempted to guess.

Corpus: 40. A full read of 40 public skills across distinct packages.
Dimensions: 10. Activation, routing, failure handling, state, format, and more.
Quality gap: 10×–20×. Top-tier skills compress ambiguity; low-tier skills expand it.
Largest delta: 6. The missing decision surfaces that most wrappers never specify.
Top Tier
Encode decisions the model can execute

Reliable skills narrow the action space. They tell the agent when to activate, how to branch, where to write, what to return, and what failures should trigger escalation instead of another blind attempt.

Bottom Tier
Expose tools but outsource judgment

Weak skills read like wrappers or man pages. They describe capabilities, but the model still has to infer the operational contract in real time, which is exactly where drift and unsafe improvisation show up.

"The best skill files are not richer descriptions. They are executable decision systems written in natural language."

02 /

Ten Dimensions Separate Prompt Injections From Operating Manuals

The 10 dimensions below are the recurring fault lines in the corpus. Together they describe whether a skill behaves like a precise runtime contract or a loose essay about a tool.

01

Activation Boundaries

Clear "use this / do not use this" logic prevents accidental invocation in the wrong context.

02

Selective Loading

Good skills route to sub-files or references instead of forcing every deep detail into every invocation.

03

Intent and Complexity Routing

The skill should branch explicitly between direct execution, adaptive loops, and longer multi-step or research work.

04

Error Handling

Reliable skills specify recovery paths, stop conditions, and when to escalate instead of retrying blindly.

05

Negative Constraints

High-risk actions need explicit "never" and "do not" rules, not just optimistic examples.

06

Security Disclosure

The skill should make the data boundary legible: what leaves the machine, what stays local, and what is retained elsewhere.

07

State Architecture

Any persistence mechanism needs location, size, promotion, and demotion rules or it decays into unbounded context sludge.

08

Cross-Skill Contracts

Related skills should declare typed reads, writes, and preconditions instead of leaving coordination implicit.

09

Output Formats

Different modes should return different structures. A browse action and a deep-dive report should not share one fuzzy template.

10

Frontmatter Metadata

Bins, environment variables, OS constraints, and install steps should be machine-readable rather than buried in prose.
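Item 10 above can be made concrete as a mechanical check. This is a minimal sketch, assuming hypothetical frontmatter field names (`bins`, `env`, `os`, `install`); no standard schema is implied.

```python
# Sketch: verify that a skill's frontmatter declares its runtime
# requirements as structured data rather than prose. The field names
# checked here are illustrative assumptions, not a standard.

REQUIRED_KEYS = {"name", "description"}
MACHINE_READABLE_KEYS = {"bins", "env", "os", "install"}

def audit_frontmatter(meta: dict) -> list[str]:
    """Return a list of findings; an empty list means the contract is explicit."""
    findings = []
    for key in REQUIRED_KEYS - meta.keys():
        findings.append(f"missing required field: {key}")
    declared = MACHINE_READABLE_KEYS & meta.keys()
    if not declared:
        findings.append("no machine-readable requirements declared")
    for key in declared:
        if isinstance(meta[key], str):
            # A free-text string like "git and gh" defeats the point.
            findings.append(f"{key} is prose, not a structured list")
    return findings
```

A linter built on a check like this could gate skill submissions before any model ever has to parse install steps out of paragraphs.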

Most of the quality gap in the corpus is not evenly distributed across all ten dimensions. It clusters around the dimensions that make the model's decision process observable and stable under pressure.

03 /

Where the Gaps Actually Show Up

Selective loading beats inline encyclopedias

The corpus repeatedly shows that long, always-loaded skill bodies are not a sign of rigor. They are usually a sign that the author has not separated routing from reference material. The best files stay lightweight at invocation time and defer depth to targeted sub-files or reference sections.

Git by Ivan G Davila. Always-loaded footprint: 141 lines. What works: quick-reference rules route the agent to deeper files only when a task actually needs them. Depth is conditional, not paid up front.

Anthropic production skills. Always-loaded footprint: concise shell plus references. What works: core behavior stays short, while richer examples sit behind secondary references. The model can stay oriented without carrying dead weight.

API Gateway by byungkyu. Always-loaded footprint: 664 lines. Failure mode: broad routing coverage exists, but it is paid on every invocation regardless of relevance. The model loads a service directory before it knows whether the service matters.

Routing and failure handling remove improvisation

The most technically sophisticated skills in the corpus succeed because they make branching explicit without pretending the boundary is perfectly crisp. In practice, the useful distinction is not "simple versus deep" by call count. It is whether uncertainty, branching, and execution shape imply a direct path, an adaptive loop, or a longer research workflow.

Fig. 01 Route by decision criteria, not by a false binary

Decision cues: uncertainty, branching, and execution shape. The routing surface should explain why a task needs a direct path, an adaptive loop, or a research workflow. It should not collapse to call count.

Direct path (low ambiguity, bounded steps):
  • Known schema and narrow failure surface.
  • Short deterministic chain with explicit exit criteria.

Adaptive path (conditional loop with checkpoints):
  • Moderate ambiguity or stateful tool interaction.
  • Plan → act → inspect → adapt with bounded retries.

Research path (longer horizon, synthesis-heavy work):
  • Open questions, multiple sources, wider monitoring surface.
  • Progress reports, stop conditions, and explicit delivery format.

The key move is not to force work into a binary split. It is to expose the cues and operating shape that determine how the model should proceed.
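The cues above can be sketched as a small decision function. The cue names and thresholds here are illustrative assumptions, not a prescribed interface:

```python
# Sketch: route a task by visible cues instead of a call-count binary.
# The parameter names and thresholds are illustrative assumptions.

def choose_route(ambiguity: str, stateful: bool, open_questions: int) -> str:
    """Map decision cues to a direct, adaptive, or research path."""
    if open_questions > 0:
        return "research"      # synthesis-heavy; needs progress reports and stop conditions
    if ambiguity == "low" and not stateful:
        return "direct"        # bounded steps with explicit exit criteria
    return "adaptive"          # plan -> act -> inspect -> adapt with bounded retries
```

The point is not these particular thresholds; it is that the skill file, not the model at runtime, decides which cues matter.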

Robust Pattern
Explicit failure ladders

The best examples specify what to try first, what to try next, what not to repeat, and when to escalate. They turn error recovery into a protocol rather than a vibe.
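A failure ladder of this kind can be sketched as a bounded protocol. The rung list, attempt budget, and escalation signal below are illustrative assumptions:

```python
# Sketch: error recovery as a ladder of rungs tried in order, with a
# hard attempt budget and an explicit escalation result instead of an
# open-ended retry loop. Rung names and the budget are assumptions.

def run_with_ladder(task, rungs, max_attempts=3):
    """Try each recovery rung once, in order; escalate instead of looping."""
    errors = []
    for rung in rungs[:max_attempts]:
        try:
            return {"status": "ok", "result": rung(task)}
        except Exception as exc:
            # Record the failure and move on; a rung is never repeated.
            errors.append(f"{rung.__name__}: {exc}")
    return {"status": "escalate", "errors": errors}
```

The escalation result carries the full error trail, so the human (or a supervising skill) sees what was already tried rather than a bare failure.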

Common Failure
Optimistic prose

Weak skills often document only the happy path. When something breaks, the model is left to invent retries, workarounds, or unsafe shortcuts in real time.

State needs lifecycle, not vibes

Stateful skills are useful only if they also describe lifecycle. memory.md is not a strategy. The strongest pattern in the corpus separates state into tiers with load rules and promotion criteria, which keeps long-running skills from degenerating into arbitrary context accumulation.

HOT: lives in memory.md, 100 lines max, always loaded. Only the most frequently used facts stay here.
WARM: lives in projects/ and domains/, 200 lines each, loaded only on context match. Promote after repeated use; demote after inactivity.
COLD: lives in archive/, unbounded but queried, loaded on explicit request only. Never delete without user confirmation.
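The lifecycle rules above can be sketched as an explicit transition function. The use-count thresholds are illustrative assumptions; the tier names and limits follow the table:

```python
# Sketch: the HOT/WARM/COLD lifecycle as explicit promotion and
# demotion rules. The weekly use-count thresholds are assumptions;
# the tier names and line limits mirror the table above.

LINE_LIMITS = {"HOT": 100, "WARM": 200}  # COLD is unbounded but only queried

def next_tier(current: str, uses_this_week: int) -> str:
    """Promote on repeated use, demote on inactivity; never silently delete."""
    if current == "WARM" and uses_this_week >= 3:
        return "HOT"
    if current == "HOT" and uses_this_week == 0:
        return "WARM"
    if current == "WARM" and uses_this_week == 0:
        return "COLD"  # archived; loaded on explicit request only
    return current
```

Deletion is deliberately absent from the transition table: removal from COLD requires user confirmation, so it is not a state the function can reach on its own.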
Cross-Skill Contracts
Typed reads and writes

Related skills work better when they declare what they consume, what they mutate, and which preconditions must already hold. That turns multi-skill coordination from social guesswork into a protocol.
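A typed contract of this kind can be sketched as declared data. The skill names and file paths in the example are hypothetical:

```python
# Sketch: a cross-skill contract declared as data instead of implied
# in prose. Field names and the handoff rule are assumptions.

from dataclasses import dataclass, field

@dataclass
class SkillContract:
    name: str
    reads: set = field(default_factory=set)          # files it consumes
    writes: set = field(default_factory=set)         # files it mutates
    preconditions: set = field(default_factory=set)  # facts that must already hold

def handoff_ok(producer: SkillContract, consumer: SkillContract) -> bool:
    """A handoff is safe when everything the consumer reads is produced."""
    return consumer.reads <= producer.writes
```

For example, a hypothetical `research` skill that writes `notes.md` can safely hand off to a `report` skill that reads only `notes.md`; if the report skill also reads `data.csv`, the check fails before anything runs.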

Output Formats
Different modes, different returns

Browse, analyze, and deep-dive should not collapse into one generic answer template. The output contract should shift with the action mode, audience, and data shape.
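A mode-specific output contract can be sketched as a required-fields table. The mode names and field lists are illustrative assumptions:

```python
# Sketch: one return contract per action mode instead of a single
# fuzzy template. Mode names and field lists are assumptions.

OUTPUT_CONTRACTS = {
    "browse":    ["summary", "top_links"],
    "analyze":   ["summary", "findings", "caveats"],
    "deep_dive": ["summary", "findings", "caveats", "sources", "next_steps"],
}

def check_output(mode: str, payload: dict) -> list[str]:
    """Report which fields the mode's contract requires but the payload lacks."""
    required = OUTPUT_CONTRACTS.get(mode)
    if required is None:
        return [f"unknown mode: {mode}"]
    return [f"missing: {k}" for k in required if k not in payload]
```

Encoding the contract as data also makes it auditable: a reviewer can see at a glance that a deep dive owes sources and next steps while a browse does not.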

04 /

Tiering the Corpus

Applied as a heuristic rubric, the 10 dimensions separate the corpus into four contract tiers. The tiers do not measure raw implementation polish in isolation; they measure how much of the runtime operating contract is carried by the skill file instead of inferred on the fly.

Fig. 02 Count, score band, and contract profile belong in one frame

Count, score band, and contract profile are different signals: count describes how the sample is distributed, the score band is a heuristic bucket on the 10-point rubric, and reliability here means how explicit the runtime contract is.

Tier 1 (production-grade contract): 5 skills, score band 9–10.
Tier 2 (mostly reliable shell): 12 skills, score band 7–8.
Tier 3 (useful wrapper, weak contract): 15 skills, score band 4–6.
  • Often works on the obvious path, then degrades when branching or failure handling matters.
  • Examples: API Gateway, Tushare Pro.
Tier 4 (capability blurb): 8 skills, score band 0–3.
  • The model must invent activation logic, recovery, and safe output structure almost from scratch.
  • Examples: Baidu Search, Discord.

The important divide is not "has functionality" versus "does not." It is whether the file itself carries enough of the operating contract that the model can execute without inventing the missing rules at runtime.

05 /

The Core Meta-Pattern Is Explicit Decision Surfaces

Across the corpus, the same deeper principle keeps reappearing. Reliable skills do not merely tell the model what tool is available. They precompute the decisions the model would otherwise have to infer under runtime pressure.

Surface 01

When to activate

Positive triggers alone are insufficient. Strong descriptions also declare the contexts in which the skill should stay dormant.

Surface 02

How to branch

Explicit route selection prevents the model from using the same strategy for bounded tasks, adaptive loops, and long-horizon research.

Surface 03

How to recover

Good skills turn failure into a bounded state machine instead of an open-ended retry loop.

Surface 04

What never to do

Negative constraints catch the "plausibly helpful but wrong" actions that base-model alignment cannot reliably prevent.

Surface 05

Where state lives

Without load rules and lifecycles, persistence becomes an unbounded context leak.

Surface 06

What format returns

Output structure should depend on intent. The skill should not force the model to guess the deliverable at the end of the pipeline.

Interpretation

The broad pattern is simple: weak skills tell the model what to infer; strong skills hand the model the inference outputs directly. That is why the latter are more stable, more auditable, and less sensitive to context drift.

06 /

Anthropic Production Skills Solve a Different Half of the Problem

The Anthropic production skills for docx, pptx, xlsx, and pdf share several Tier 1 and Tier 2 traits: concise activation descriptions, negative guards, quick-reference structure, and separate reference files for depth. Their relative weakness is different: they skew toward execution scaffolding rather than rich decision scaffolding.

Anthropic Strength
Working implementation discipline

These skills are concise, install-aware, reference-backed, and designed for predictable execution inside a real product environment.

Scraped Tier 1 Strength
Richer decision scaffolding

The strongest scraped skills often go further on routing, decision trees, failure ladders, and multi-skill contracts than the production baselines do.

Fig. 03 A qualitative map of decision scaffolding and execution scaffolding

Decision scaffolding means the skill precomputes routing, failure handling, constraints, and output logic. Execution scaffolding means setup, tool usage, and verification are operationally concrete.

Tier 1 custom skills (high decision scaffolding): strong routing, failure ladders, and mode-specific returns. These files are decision-rich, but they do not always ship with the same install-aware execution discipline as productized baselines.

Anthropic production baseline (higher execution scaffolding): lean, install-aware, reference-backed shells. These skills are operationally disciplined, but often less branching-rich than the strongest custom exemplars.

Thin wrappers (low on both axes): capability blurbs with little contract structure. They announce features without making either the decision process or the execution discipline concrete enough to trust under pressure.

Target synthesis (high on both axes): explicit contract plus operational discipline. The strongest pattern combines rich decision surfaces with typed handoffs, monitoring hooks, and predictable execution.

The strongest future pattern is not a choice between the two families. It is a synthesis: rich decision scaffolding plus disciplined execution scaffolding.

Takeaway

The strongest skill is not merely longer, shorter, stricter, or more complete. It combines the decision scaffolding of the best custom skills with the operational discipline of production-grade references.

07 /

A Practical Scoring Rubric

One useful way to operationalize the analysis is a simple heuristic score: one point for each of the ten dimensions described in section 02. In this corpus, 9–10 landed in Tier 1, 7–8 in Tier 2, 4–6 in Tier 3, and 0–3 in Tier 4.
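The rubric can be sketched directly. The check identifiers mirror the ten dimensions from section 02; the tier bands follow the corpus mapping:

```python
# Sketch: the heuristic rubric as a function. One point per dimension;
# tier bands are 9-10, 7-8, 4-6, and 0-3 as observed in the corpus.

CHECKS = [
    "activation_boundaries", "selective_loading", "intent_routing",
    "error_handling", "negative_constraints", "security_disclosure",
    "state_architecture", "cross_skill_contracts", "output_formats",
    "frontmatter_metadata",
]

def score_skill(passed: set) -> tuple[int, int]:
    """Return (score, tier) for the set of checks a skill passes."""
    score = sum(1 for c in CHECKS if c in passed)
    if score >= 9:
        tier = 1
    elif score >= 7:
        tier = 2
    elif score >= 4:
        tier = 3
    else:
        tier = 4
    return score, tier
```

Scoring is deliberately crude: the value is not the number itself but that each point forces a yes-or-no answer to "does the file carry this decision surface?"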

The six checks that fail most often (when to activate, how to branch, how to recover, what never to do, where state lives, and what format returns) are where most of the corpus quality gap actually lives. They are also the surfaces where long-context drift and safety regressions are hardest to catch if the file leaves them implicit.