agent-skills-eval: Best AI Agent Evaluation for Devs in 2026

agent-skills-eval proves whether a `SKILL.md` improves model output by running a baseline-vs-skill A/B eval, grading both with a judge model, and writing machine-readable artifacts for CI.

Pricing: Open-Source
Tech Stack: TypeScript, Node.js, OpenAI-compatible chat APIs, JSON/JSONL artifacts
Target: developers shipping Agent Skills
Category: AI Agent Evaluation

What Is agent-skills-eval?

agent-skills-eval is an open-source AI Agent Evaluation tool for devs shipping Agent Skills. Built by darkrishabh, it runs each eval twice, once as with_skill and once as without_skill, then uses a judge model to score both outputs against assertions, so you can measure whether a SKILL.md actually improves results. It targets engineers using Anthropic's Agent Skills spec, but it also works with any OpenAI-compatible backend or local server that speaks the chat API.

Quick Overview

Attribute | Details
Type | AI Agent Evaluation
Best For | developers shipping Agent Skills
Language/Stack | TypeScript, Node.js, OpenAI-compatible chat APIs, JSON/JSONL artifacts
License | MIT
GitHub Stars | N/A
Pricing | Open-Source
Last Release | N/A

Who Should Use agent-skills-eval?

  • Agent Skills authors validating a SKILL.md before they merge it into a production workflow. agent-skills-eval is built for the exact question, "does this skill improve the model or just add prompt bloat?"
  • Platform and QA teams that need repeatable, artifact-backed checks in CI. The workspace layout, judge outputs, and benchmark.json make it easy to fail builds on regression instead of eyeballing transcripts.
  • Indie hackers shipping AI assistants who want a cheap but disciplined eval loop. agent-skills-eval gives you a baseline run, a skill-enabled run, and a report without forcing you into a hosted observability suite.
  • Teams running OpenAI-compatible or local models that need the same evaluator across providers. If your target is OpenAI, Anthropic via a compat layer, Groq, Together, or a local Llama endpoint, agent-skills-eval can still drive the comparison.

Not ideal for:

  • Teams that only want raw prompt logging and no evaluation logic.
  • Workflows with no judge model available, since agent-skills-eval depends on a scorer for pass/fail decisions.
  • Purely deterministic unit tests where LLM judgment adds noise instead of signal.

Key Features of agent-skills-eval

  • Baseline-vs-skill A/B runs — Every eval is executed twice with the same prompt, once with the skill loaded and once with the skill stripped out. That makes the lift attributable to the skill, not to prompt variance or model luck.
  • Judge-graded scoring — The judge model sees the eval's expected_output and assertions, then grades each arm independently. This gives you pass/fail results with evidence instead of a single subjective score.
  • OpenAI-compatible provider layer — agent-skills-eval can talk to any backend that exposes the OpenAI chat shape. That includes OpenAI, Anthropic through compat layers, Groq, Together, and local Llama servers without special casing the evaluator.
  • TypeScript SDK plus CLI — You can run a one-liner in CI with npx, or embed the evaluator in a custom TypeScript pipeline with evaluateSkills(). The SDK is the path for dashboards, multi-skill rollups, and custom reporters.
  • Portable artifacts — The workspace outputs JSON, JSONL, and static HTML. That means you can diff iteration-N results, archive them in CI, or ship the report to any static host without standing up a database.
  • Tool-call assertions — agent-skills-eval is not limited to text similarity. It can validate whether an agent called the right tool, which matters for workflows where function calling is the actual product behavior.
  • Spec-compliant file layout — The evaluator follows the agentskills.io spec, including SKILL.md validation, evals/evals.json, iteration-N artifact structure, and frontmatter rules. That lowers the chance of passing local checks while failing in another runtime.
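
To make the tool-call assertion idea concrete, here is a minimal TypeScript sketch. It assumes the agent's reply follows the OpenAI chat-completions message shape (tool_calls entries with a function name and JSON-string arguments). The ToolCallAssertion type and checkToolCall helper are hypothetical names chosen for illustration, not agent-skills-eval's own API.

```typescript
// Hypothetical sketch: these names are illustrative, not agent-skills-eval's API.
// The message shape follows the OpenAI chat-completions tool_calls format.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
}

interface AssistantMessage {
  role: "assistant";
  content: string | null;
  tool_calls?: ToolCall[];
}

interface ToolCallAssertion {
  tool: string; // expected function name
  args?: Record<string, unknown>; // expected top-level arguments (subset match)
}

// Returns true when some tool call matches the name and every expected argument.
function checkToolCall(msg: AssistantMessage, expected: ToolCallAssertion): boolean {
  return (msg.tool_calls ?? []).some((call) => {
    if (call.function.name !== expected.tool) return false;
    if (!expected.args) return true;
    let parsed: unknown;
    try {
      parsed = JSON.parse(call.function.arguments);
    } catch {
      return false; // malformed arguments count as a failed assertion
    }
    if (typeof parsed !== "object" || parsed === null) return false;
    const actual = parsed as Record<string, unknown>;
    return Object.entries(expected.args).every(
      ([key, value]) => JSON.stringify(actual[key]) === JSON.stringify(value),
    );
  });
}

// Example reply with one tool call (made-up data).
const reply: AssistantMessage = {
  role: "assistant",
  content: null,
  tool_calls: [
    {
      id: "call_1",
      type: "function",
      function: { name: "get_weather", arguments: '{"city":"Paris","unit":"c"}' },
    },
  ],
};

console.log(checkToolCall(reply, { tool: "get_weather", args: { city: "Paris" } }));
```

The subset match is a design choice for this sketch: the assertion only pins the arguments it names, so extra arguments from the model do not fail the check.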

agent-skills-eval vs Alternatives

Tool | Best For | Key Differentiator | Pricing
agent-skills-eval | Validating Anthropic-style Agent Skills | Dual-run with_skill vs without_skill comparison with judge grading | Open-Source
promptfoo | Broad prompt and model regression testing | Wider matrix testing across prompts, providers, and assertions | Freemium / Open-Source
OpenAI Evals | OpenAI-centered eval pipelines | Tight fit with OpenAI workflows and model evaluation conventions | Open-Source
LangSmith Evaluations | Tracing-centric AI QA and dataset management | Strong observability and dataset workflows around LLM apps | Paid / Freemium

Pick agent-skills-eval when the unit of value is a skill package and you need to know whether that package changes behavior. Pick promptfoo when you want a more general-purpose regression harness across lots of prompts and providers, even if the skill concept is not central.

Pick OpenAI Evals when your stack is already centered on OpenAI and you want a familiar evaluation surface. Pick LangSmith Evaluations when tracing, datasets, and app-level observability matter more than the specific SKILL.md baseline test.

If you need trace-level debugging while you tune an eval, pair agent-skills-eval with OpenTrace. If the skill lives inside a multi-agent pipeline, OpenSwarm handles orchestration while agent-skills-eval handles verification.

How agent-skills-eval Works

agent-skills-eval treats a skill directory as a benchmark package, not as a loose collection of prompts. It validates SKILL.md, reads the eval definitions, and expands each case into a workspace with iteration-N artifacts so every run is reproducible and diffable. The core data model is a two-arm comparison: the same prompt goes through the target model with the skill in context, then through the same model without the skill as the baseline.
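
The two-arm comparison described above can be sketched in a few lines of TypeScript. This is an illustration under stated assumptions, not the tool's internals: the Complete function type and runBothArms helper are hypothetical, and injecting the skill as a system message is an assumption about how the skill reaches the model's context.

```typescript
// Hypothetical names for illustration; not agent-skills-eval's internals.
type ChatMessage = { role: "system" | "user"; content: string };
type Complete = (messages: ChatMessage[]) => Promise<string>;

interface ArmOutputs {
  with_skill: string;
  without_skill: string;
}

// Same prompt, same model: one arm has the skill text in context, one does not.
// Assumption: the skill is supplied as a system message.
async function runBothArms(
  complete: Complete,
  prompt: string,
  skillText: string,
): Promise<ArmOutputs> {
  const user: ChatMessage = { role: "user", content: prompt };
  const with_skill = await complete([{ role: "system", content: skillText }, user]);
  const without_skill = await complete([user]);
  return { with_skill, without_skill };
}

// Stub model so the sketch runs offline: it reports how many messages it saw.
const stub: Complete = async (messages) => `saw ${messages.length} message(s)`;

runBothArms(stub, "Summarize this changelog.", "# Skill: changelog-writer").then((out) =>
  console.log(out),
);
```

Because the only difference between the two calls is the skill message, any difference a judge finds between the outputs can be traced to the skill rather than to a changed prompt.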

The evaluator is provider-driven rather than model-specific. A Provider implementation wraps anything that can answer an OpenAI-style chat request, which is why the same runner can work with hosted APIs, compat gateways, or local inference servers. The judge model then scores both arms against the same assertions, so the result is based on criteria you defined instead of a raw completion length or a vibes-based review.
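
As a rough illustration of what "provider-driven" can mean, the sketch below wraps an OpenAI-style chat endpoint behind a small interface. The ChatProvider, PostJson, and openAICompatible names are hypothetical and are not the Provider type that agent-skills-eval exports; the localhost URL and model name are placeholders. The transport is injected so the adapter can run without a network.

```typescript
// Hypothetical sketch: ChatProvider, PostJson, and openAICompatible are
// illustrative names, not the Provider type that agent-skills-eval exports.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatProvider {
  complete(messages: ChatMessage[]): Promise<string>;
}

// Transport is injected so the adapter can be exercised without a network.
type PostJson = (
  url: string,
  headers: Record<string, string>,
  body: unknown,
) => Promise<unknown>;

function openAICompatible(
  baseUrl: string,
  apiKey: string,
  model: string,
  postJson: PostJson,
): ChatProvider {
  return {
    async complete(messages: ChatMessage[]): Promise<string> {
      const raw = await postJson(
        `${baseUrl}/chat/completions`,
        { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
        { model, messages },
      );
      // Read the first choice in the OpenAI chat-completions response shape.
      const data = raw as { choices?: { message?: { content?: string | null } }[] };
      return data.choices?.[0]?.message?.content ?? "";
    },
  };
}

// Fake transport standing in for a hosted API, gateway, or local server.
const seen: string[] = [];
const fakePost: PostJson = async (url) => {
  seen.push(url);
  return { choices: [{ message: { content: "ok" } }] };
};

// Placeholder base URL and model name, chosen only for this example.
const local = openAICompatible("http://localhost:8080/v1", "unused", "local-model", fakePost);
local.complete([{ role: "user", content: "ping" }]).then((text) => console.log(text));
```

Swapping backends then means changing the base URL, key, and model name while the eval content and runner stay the same.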

npx agent-skills-eval ./skills \
  --target gpt-4o-mini \
  --judge gpt-4o-mini \
  --baseline \
  --strict

The command above runs the skill folder as an eval suite, enables the baseline comparison, and forces strict validation so bad metadata or malformed eval files fail early. Expect a workspace folder with meta.json, benchmark.json, per-eval subfolders, and a static report you can open directly in a browser or publish to GitHub Pages.
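
One way to turn those artifacts into a CI gate is to compare the two arms' pass rates and fail the build when the skill stops helping. The sketch below is a hedged example: the BenchmarkSummary shape and its field names are assumptions made for illustration, so check the real benchmark.json before relying on them.

```typescript
// Hypothetical summary shape: the real benchmark.json fields may be named differently.
interface BenchmarkSummary {
  with_skill: { pass_rate: number };
  without_skill: { pass_rate: number };
}

interface GateResult {
  ok: boolean;
  lift: number;
  reason: string;
}

// Lift is the with_skill pass rate minus the without_skill pass rate.
// The gate passes when the lift is at least minLift.
function gateOnLift(summary: BenchmarkSummary, minLift = 0): GateResult {
  const lift = summary.with_skill.pass_rate - summary.without_skill.pass_rate;
  const ok = lift >= minLift;
  const verdict = ok ? "meets" : "is below";
  return {
    ok,
    lift,
    reason: `lift ${lift.toFixed(2)} ${verdict} threshold ${minLift.toFixed(2)}`,
  };
}

// Example with inline, made-up numbers; in CI you would parse the artifact file instead.
const result = gateOnLift(
  { with_skill: { pass_rate: 0.9 }, without_skill: { pass_rate: 0.7 } },
  0.1,
);
console.log(result.reason);
```

In a Node.js CI step, the summary could come from JSON.parse on the file contents, and the step could set process.exitCode to 1 when result.ok is false.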

Pros and Cons of agent-skills-eval

Pros:

  • Direct skill attribution — The with_skill and without_skill split makes it clear whether the skill changed behavior.
  • CI-friendly artifacts — JSON and JSONL outputs fit build pipelines, diff tools, and custom dashboards without scraping HTML.
  • Fast provider swapping — OpenAI-compatible support means you can move between cloud models and local inference without rewriting the evaluator.
  • Good fit for tool-use agents — Tool-call assertions catch failures that text-only evals miss.
  • Low operational overhead — Static HTML reports and file-based artifacts mean no database, no queue, and no hosted backend.

Cons:

  • Judge quality matters — If your judge model is sloppy, the eval result will be sloppy too. agent-skills-eval does not fix weak scoring criteria.
  • Needs disciplined eval authoring — A bad SKILL.md or vague assertions produce noisy signals and weak conclusions.
  • Not a full observability suite — If you need tracing, lineage, and long-term telemetry, agent-skills-eval should sit beside OpenTrace, not replace it.
  • Baseline runs cost extra tokens — The --baseline mode doubles model execution for each eval, which matters on expensive models.
  • Strict mode can be unforgiving — The --strict flag is useful in CI, but it will surface schema and layout mistakes that casual local runs might ignore.
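
To budget for the extra baseline tokens, a back-of-the-envelope estimate helps. The sketch below is hypothetical arithmetic, not something the tool reports: token counts and prices are inputs you supply, and it assumes the judge grades each arm once, so judge usage also doubles when the baseline is enabled.

```typescript
// Hypothetical helper; token counts and prices are inputs you supply.
interface CostInput {
  evals: number; // number of eval cases
  avgTargetTokens: number; // average target-model tokens per arm
  avgJudgeTokens: number; // average judge tokens per graded arm
  targetUsdPerMTok: number; // target price per million tokens
  judgeUsdPerMTok: number; // judge price per million tokens
  baseline: boolean; // true when the without_skill arm also runs
}

interface CostEstimate {
  targetTokens: number;
  judgeTokens: number;
  usd: number;
}

function estimateCost(input: CostInput): CostEstimate {
  // Assumption: one target call and one judge call per arm.
  const arms = input.baseline ? 2 : 1;
  const targetTokens = input.evals * arms * input.avgTargetTokens;
  const judgeTokens = input.evals * arms * input.avgJudgeTokens;
  const usd =
    (targetTokens / 1_000_000) * input.targetUsdPerMTok +
    (judgeTokens / 1_000_000) * input.judgeUsdPerMTok;
  return { targetTokens, judgeTokens, usd };
}

// Made-up numbers for illustration only.
const base: CostInput = {
  evals: 50,
  avgTargetTokens: 2000,
  avgJudgeTokens: 1500,
  targetUsdPerMTok: 1,
  judgeUsdPerMTok: 1,
  baseline: false,
};

console.log(estimateCost(base));
console.log(estimateCost({ ...base, baseline: true }));
```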

Getting Started with agent-skills-eval

npm install agent-skills-eval
OPENAI_API_KEY=... npx agent-skills-eval ./skills --target gpt-4o-mini --judge gpt-4o-mini --baseline --strict

After the run, agent-skills-eval writes a workspace with the raw outputs, judge decisions, and a static report under the current iteration folder. If your backend is not OpenAI, configure the provider settings in YAML or the SDK so the evaluator can reach your OpenAI-compatible endpoint without changing the eval content.

Verdict

agent-skills-eval is the strongest option for validating Agent Skills when you need a baseline-vs-skill comparison instead of a single raw score. Its main strength is evidence-backed regression testing; its caveat is that judge quality and assertion quality still control the result. Use it if you want a repeatable answer, not a demo.
