Analysis & Commentary
Dhyaneesh DS
March 2026
Field Report No. 3

SKILL.md Ablation: What Makes Agent Skills Reliable

A corpus of 40 skills shows that the performance gap is not about topic coverage. It is about whether the file encodes the decisions the model would otherwise improvise: when to activate, how to branch, what never to do, what to disclose, where state lives, and how results should return.

01 /

The Quality Gap Is Structural, Not Topical

The strongest skills in the corpus do not merely expose a tool. They remove runtime ambiguity. They turn decisions the model would otherwise infer on the fly into explicit operating surfaces: activation boundaries, routing rules, error recovery, negative constraints, output formats, and state lifecycles.

The weakest skills fail in the opposite direction. They look usable because the topic is present, but the agent is left to improvise everything that matters after the happy path. That gap compounds with longer context windows, shifting model behavior, and pressure-filled situations where the base model is most tempted to guess.

Corpus: 40. A full read of 40 public skills across distinct packages.
Dimensions: 10. Activation, routing, failure handling, state, format, and more.
Quality gap: 10×–20×. Top-tier skills compress ambiguity; low-tier skills expand it.
Largest delta: 6. The missing decision surfaces that most wrappers never specify.
Top Tier
Encode decisions the model can execute

Reliable skills narrow the action space. They tell the agent when to activate, how to branch, where to write, what to return, and what failures should trigger escalation instead of another blind attempt.

Bottom Tier
Expose tools but outsource judgment

Weak skills read like wrappers or man pages. They describe capabilities, but the model still has to infer the operational contract in real time, which is exactly where drift and unsafe improvisation show up.

"The best skill files are not richer descriptions. They are executable decision systems written in natural language."

02 /

Ten Dimensions Separate Prompt Injections From Operating Manuals

The 10 dimensions below are the recurring fault lines in the corpus. Together they describe whether a skill behaves like a precise runtime contract or a loose essay about a tool.

01

Activation Boundaries

Clear "use this / do not use this" logic prevents accidental invocation in the wrong context.

02

Selective Loading

Good skills route to sub-files or references instead of forcing every deep detail into every invocation.

03

Intent and Complexity Routing

The skill should branch explicitly between direct execution, adaptive loops, and longer multi-step or research work.

04

Error Handling

Reliable skills specify recovery paths, stop conditions, and when to escalate instead of retrying blindly.

05

Negative Constraints

High-risk actions need explicit "never" and "do not" rules, not just optimistic examples.

06

Security Disclosure

The skill should make the data boundary legible: what leaves the machine, what stays local, and what is retained elsewhere.

07

State Architecture

Any persistence mechanism needs location, size, promotion, and demotion rules or it decays into unbounded context sludge.

08

Cross-Skill Contracts

Related skills should declare typed reads, writes, and preconditions instead of leaving coordination implicit.

09

Output Formats

Different modes should return different structures. A browse action and a deep-dive report should not share one fuzzy template.

10

Frontmatter Metadata

Bins, environment variables, OS constraints, and install steps should be machine-readable rather than buried in prose.
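Item 10 above can be made concrete as a mechanical check. This is a minimal sketch, assuming hypothetical frontmatter field names (`bins`, `env`, `os`, `install`); no standard schema is implied.

```python
# Sketch: verify that a skill's frontmatter declares its runtime
# requirements as structured data rather than prose. The field names
# checked here are illustrative assumptions, not a standard.

REQUIRED_KEYS = {"name", "description"}
MACHINE_READABLE_KEYS = {"bins", "env", "os", "install"}

def audit_frontmatter(meta: dict) -> list[str]:
    """Return a list of findings; an empty list means the contract is explicit."""
    findings = []
    for key in REQUIRED_KEYS - meta.keys():
        findings.append(f"missing required field: {key}")
    declared = MACHINE_READABLE_KEYS & meta.keys()
    if not declared:
        findings.append("no machine-readable requirements declared")
    for key in declared:
        if isinstance(meta[key], str):
            # A free-text string like "git and gh" defeats the point.
            findings.append(f"{key} is prose, not a structured list")
    return findings
```

A linter built on a check like this could gate skill submissions before any model ever has to parse install steps out of paragraphs.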

Most of the quality gap in the corpus is not evenly distributed across all ten dimensions. It clusters around the dimensions that make the model's decision process observable and stable under pressure.

03 /

Where the Gaps Actually Show Up

Selective loading beats inline encyclopedias

The corpus repeatedly shows that long, always-loaded skill bodies are not a sign of rigor. They are usually a sign that the author has not separated routing from reference material. The best files stay lightweight at invocation time and defer depth to targeted sub-files or reference sections.

Git by Ivan G Davila. Always-loaded footprint: 141 lines. What works: quick-reference rules route the agent to deeper files only when a task actually needs them. Depth is conditional, not paid up front.

Anthropic production skills. Always-loaded footprint: concise shell plus references. What works: core behavior stays short, while richer examples sit behind secondary references. The model can stay oriented without carrying dead weight.

API Gateway by byungkyu. Always-loaded footprint: 664 lines. Failure mode: broad routing coverage exists, but it is paid on every invocation regardless of relevance. The model loads a service directory before it knows whether the service matters.

Routing and failure handling remove improvisation

The most technically sophisticated skills in the corpus succeed because they make branching explicit without pretending the boundary is perfectly crisp. In practice, the useful distinction is not "simple versus deep" by call count. It is whether uncertainty, branching, and execution shape imply a direct path, an adaptive loop, or a longer research workflow.

Fig. 01 Route by decision criteria, not by a false binary

Decision cues: uncertainty, branching, and execution shape. The routing surface should explain why a task needs a direct path, an adaptive loop, or a research workflow. It should not collapse to call count.

Direct path (low ambiguity, bounded steps):
  • Known schema and narrow failure surface.
  • Short deterministic chain with explicit exit criteria.

Adaptive path (conditional loop with checkpoints):
  • Moderate ambiguity or stateful tool interaction.
  • Plan → act → inspect → adapt with bounded retries.

Research path (longer horizon, synthesis-heavy work):
  • Open questions, multiple sources, wider monitoring surface.
  • Progress reports, stop conditions, and explicit delivery format.

The key move is not to force work into a binary split. It is to expose the cues and operating shape that determine how the model should proceed.
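The cues above can be sketched as a small decision function. The cue names and thresholds here are illustrative assumptions, not a prescribed interface:

```python
# Sketch: route a task by visible cues instead of a call-count binary.
# The parameter names and thresholds are illustrative assumptions.

def choose_route(ambiguity: str, stateful: bool, open_questions: int) -> str:
    """Map decision cues to a direct, adaptive, or research path."""
    if open_questions > 0:
        return "research"      # synthesis-heavy; needs progress reports and stop conditions
    if ambiguity == "low" and not stateful:
        return "direct"        # bounded steps with explicit exit criteria
    return "adaptive"          # plan -> act -> inspect -> adapt with bounded retries
```

The point is not these particular thresholds; it is that the skill file, not the model at runtime, decides which cues matter.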

Robust Pattern
Explicit failure ladders

The best examples specify what to try first, what to try next, what not to repeat, and when to escalate. They turn error recovery into a protocol rather than a vibe.
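A failure ladder of this kind can be sketched as a bounded protocol. The rung list, attempt budget, and escalation signal below are illustrative assumptions:

```python
# Sketch: error recovery as a ladder of rungs tried in order, with a
# hard attempt budget and an explicit escalation result instead of an
# open-ended retry loop. Rung names and the budget are assumptions.

def run_with_ladder(task, rungs, max_attempts=3):
    """Try each recovery rung once, in order; escalate instead of looping."""
    errors = []
    for rung in rungs[:max_attempts]:
        try:
            return {"status": "ok", "result": rung(task)}
        except Exception as exc:
            # Record the failure and move on; a rung is never repeated.
            errors.append(f"{rung.__name__}: {exc}")
    return {"status": "escalate", "errors": errors}
```

The escalation result carries the full error trail, so the human (or a supervising skill) sees what was already tried rather than a bare failure.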

Common Failure
Optimistic prose

Weak skills often document only the happy path. When something breaks, the model is left to invent retries, workarounds, or unsafe shortcuts in real time.

State needs lifecycle, not vibes

Stateful skills are useful only if they also describe lifecycle. memory.md is not a strategy. The strongest pattern in the corpus separates state into tiers with load rules and promotion criteria, which keeps long-running skills from degenerating into arbitrary context accumulation.

HOT: lives in memory.md, 100 lines max, always loaded. Only the most frequently used facts stay here.
WARM: lives in projects/ and domains/, 200 lines each, loaded only on context match. Promote after repeated use; demote after inactivity.
COLD: lives in archive/, unbounded but queried, loaded on explicit request only. Never delete without user confirmation.
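The lifecycle rules above can be sketched as an explicit transition function. The use-count thresholds are illustrative assumptions; the tier names and limits follow the table:

```python
# Sketch: the HOT/WARM/COLD lifecycle as explicit promotion and
# demotion rules. The weekly use-count thresholds are assumptions;
# the tier names and line limits mirror the table above.

LINE_LIMITS = {"HOT": 100, "WARM": 200}  # COLD is unbounded but only queried

def next_tier(current: str, uses_this_week: int) -> str:
    """Promote on repeated use, demote on inactivity; never silently delete."""
    if current == "WARM" and uses_this_week >= 3:
        return "HOT"
    if current == "HOT" and uses_this_week == 0:
        return "WARM"
    if current == "WARM" and uses_this_week == 0:
        return "COLD"  # archived; loaded on explicit request only
    return current
```

Deletion is deliberately absent from the transition table: removal from COLD requires user confirmation, so it is not a state the function can reach on its own.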
Cross-Skill Contracts
Typed reads and writes

Related skills work better when they declare what they consume, what they mutate, and which preconditions must already hold. That turns multi-skill coordination from social guesswork into a protocol.
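A typed contract of this kind can be sketched as declared data. The skill names and file paths in the example are hypothetical:

```python
# Sketch: a cross-skill contract declared as data instead of implied
# in prose. Field names and the handoff rule are assumptions.

from dataclasses import dataclass, field

@dataclass
class SkillContract:
    name: str
    reads: set = field(default_factory=set)          # files it consumes
    writes: set = field(default_factory=set)         # files it mutates
    preconditions: set = field(default_factory=set)  # facts that must already hold

def handoff_ok(producer: SkillContract, consumer: SkillContract) -> bool:
    """A handoff is safe when everything the consumer reads is produced."""
    return consumer.reads <= producer.writes
```

For example, a hypothetical `research` skill that writes `notes.md` can safely hand off to a `report` skill that reads only `notes.md`; if the report skill also reads `data.csv`, the check fails before anything runs.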

Output Formats
Different modes, different returns

Browse, analyze, and deep-dive should not collapse into one generic answer template. The output contract should shift with the action mode, audience, and data shape.
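A mode-specific output contract can be sketched as a required-fields table. The mode names and field lists are illustrative assumptions:

```python
# Sketch: one return contract per action mode instead of a single
# fuzzy template. Mode names and field lists are assumptions.

OUTPUT_CONTRACTS = {
    "browse":    ["summary", "top_links"],
    "analyze":   ["summary", "findings", "caveats"],
    "deep_dive": ["summary", "findings", "caveats", "sources", "next_steps"],
}

def check_output(mode: str, payload: dict) -> list[str]:
    """Report which fields the mode's contract requires but the payload lacks."""
    required = OUTPUT_CONTRACTS.get(mode)
    if required is None:
        return [f"unknown mode: {mode}"]
    return [f"missing: {k}" for k in required if k not in payload]
```

Encoding the contract as data also makes it auditable: a reviewer can see at a glance that a deep dive owes sources and next steps while a browse does not.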

04 /

Tiering the Corpus

Applied as a heuristic rubric, the 10 dimensions separate the corpus into four contract tiers. The tiers do not measure raw implementation polish in isolation; they measure how much of the runtime operating contract is carried by the skill file instead of inferred on the fly.

Fig. 02 Count, score band, and contract profile belong in one frame

Count, score band, and contract profile are different signals: count describes how the sample is distributed, the score band is a heuristic bucket on the 10-point rubric, and reliability here means how explicit the runtime contract is.

Tier 1 (production-grade contract): 5 skills, score band 9–10.
Tier 2 (mostly reliable shell): 12 skills, score band 7–8.
Tier 3 (useful wrapper, weak contract): 15 skills, score band 4–6.
  • Often works on the obvious path, then degrades when branching or failure handling matters.
  • Examples: API Gateway, Tushare Pro.
Tier 4 (capability blurb): 8 skills, score band 0–3.
  • The model must invent activation logic, recovery, and safe output structure almost from scratch.
  • Examples: Baidu Search, Discord.

The important divide is not "has functionality" versus "does not." It is whether the file itself carries enough of the operating contract that the model can execute without inventing the missing rules at runtime.

05 /

The Core Meta-Pattern Is Explicit Decision Surfaces

Across the corpus, the same deeper principle keeps reappearing. Reliable skills do not merely tell the model what tool is available. They precompute the decisions the model would otherwise have to infer under runtime pressure.

Surface 01

When to activate

Positive triggers alone are insufficient. Strong descriptions also declare the contexts in which the skill should stay dormant.

Surface 02

How to branch

Explicit route selection prevents the model from using the same strategy for bounded tasks, adaptive loops, and long-horizon research.

Surface 03

How to recover

Good skills turn failure into a bounded state machine instead of an open-ended retry loop.

Surface 04

What never to do

Negative constraints catch the "plausibly helpful but wrong" actions that base-model alignment cannot reliably prevent.

Surface 05

Where state lives

Without load rules and lifecycles, persistence becomes an unbounded context leak.

Surface 06

What format returns

Output structure should depend on intent. The skill should not force the model to guess the deliverable at the end of the pipeline.

Interpretation

The broad pattern is simple: weak skills tell the model what to infer; strong skills hand the model the inference outputs directly. That is why the latter are more stable, more auditable, and less sensitive to context drift.

06 /

Anthropic Production Skills Solve a Different Half of the Problem

The Anthropic production skills for docx, pptx, xlsx, and pdf share several Tier 1 and Tier 2 traits: concise activation descriptions, negative guards, quick-reference structure, and separate reference files for depth. Their relative weakness is different: they skew toward execution scaffolding rather than rich decision scaffolding.

Anthropic Strength
Working implementation discipline

These skills are concise, install-aware, reference-backed, and designed for predictable execution inside a real product environment.

Scraped Tier 1 Strength
Richer decision scaffolding

The strongest scraped skills often go further on routing, decision trees, failure ladders, and multi-skill contracts than the production baselines do.

Fig. 03 A qualitative map of decision scaffolding and execution scaffolding

Decision scaffolding means the skill precomputes routing, failure handling, constraints, and output logic. Execution scaffolding means setup, tool usage, and verification are operationally concrete.

Tier 1 custom skills (high decision scaffolding): strong routing, failure ladders, and mode-specific returns. These files are decision-rich, but they do not always ship with the same install-aware execution discipline as productized baselines.

Anthropic production baseline (higher execution scaffolding): lean, install-aware, reference-backed shells. These skills are operationally disciplined, but often less branching-rich than the strongest custom exemplars.

Thin wrappers (low on both axes): capability blurbs with little contract structure. They announce features without making either the decision process or the execution discipline concrete enough to trust under pressure.

Target synthesis (high on both axes): explicit contract plus operational discipline. The strongest pattern combines rich decision surfaces with typed handoffs, monitoring hooks, and predictable execution.

The strongest future pattern is not a choice between the two families. It is a synthesis: rich decision scaffolding plus disciplined execution scaffolding.

Takeaway

The strongest skill is not merely longer, shorter, stricter, or more complete. It combines the decision scaffolding of the best custom skills with the operational discipline of production-grade references.

07 /

A Practical Scoring Rubric

One useful way to operationalize the analysis is a simple heuristic score: one point for each of the ten dimensions described in section 02. In this corpus, 9–10 landed in Tier 1, 7–8 in Tier 2, 4–6 in Tier 3, and 0–3 in Tier 4.
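The rubric can be sketched directly. The check identifiers mirror the ten dimensions from section 02; the tier bands follow the corpus mapping:

```python
# Sketch: the heuristic rubric as a function. One point per dimension;
# tier bands are 9-10, 7-8, 4-6, and 0-3 as observed in the corpus.

CHECKS = [
    "activation_boundaries", "selective_loading", "intent_routing",
    "error_handling", "negative_constraints", "security_disclosure",
    "state_architecture", "cross_skill_contracts", "output_formats",
    "frontmatter_metadata",
]

def score_skill(passed: set) -> tuple[int, int]:
    """Return (score, tier) for the set of checks a skill passes."""
    score = sum(1 for c in CHECKS if c in passed)
    if score >= 9:
        tier = 1
    elif score >= 7:
        tier = 2
    elif score >= 4:
        tier = 3
    else:
        tier = 4
    return score, tier
```

Scoring is deliberately crude: the value is not the number itself but that each point forces a yes-or-no answer to "does the file carry this decision surface?"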

The six checks that fail most often (when to activate, how to branch, how to recover, what never to do, where state lives, and what format returns) are where most of the corpus quality gap actually lives. They are also the surfaces where long-context drift and safety regressions are hardest to catch if the file leaves them implicit.