Major Innovation in Agent Skills! Anthropic Upgrades Skill Factory with Nuclear-Level Evals System, Developers: Old Skills Revived
Intelligent Gorilla AI Organization | Editor: Xixi
In the field of AI agents, anyone who has used Agent Skills will be familiar with skill-creator, the no-code skill-building tool Anthropic released in 2025.
However, once a skill was built, there was no way to know whether it was actually useful, whether it still worked with newer models, whether it ran accurately, or how effective it was...
On March 3, Anthropic's official blog quietly published a significant update titled "Improving skill-creator: Test, measure, and refine Agent Skills." This upgrade finally matures Claude's "skill factory."
From "seemingly usable" to "testable, measurable, and iterable," it tackles the biggest pain point for skill authors: "how good is the skill I created?"
01 - Review of Agent Skills: A Key Step from General Assistant to Professional Agent
In October 2025, Anthropic officially launched Agent Skills, a modular, reusable "skill package" system: a folder containing SKILL.md instructions, scripts, and resources that Claude loads automatically when needed, significantly improving performance in document generation, data analysis, brand compliance, and other scenarios.
Skills are now available across Claude.ai, Claude Code, and the API, and the open GitHub repository has accumulated over 80,000 stars. The biggest limitation of the early version, however, was that non-technical users could only iterate on intuition, with no way to quantify or verify a skill's effectiveness.
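The skill-package layout described above centers on SKILL.md, whose YAML frontmatter (a `name` and a `description`) tells Claude when to load the skill. A minimal sketch of such a file — the skill name, steps, and referenced `assets/` and `scripts/` paths here are purely illustrative, not from Anthropic's examples:

```markdown
---
name: quarterly-report
description: Generates quarterly financial reports in the company template. Use when the user asks for a quarterly report or revenue summary.
---

# Quarterly Report Skill

1. Read the raw figures from the attached spreadsheet.
2. Fill in the report template at `assets/template.docx`.
3. Run `scripts/validate.py` to check the totals before returning the document.
```

The `description` matters most in practice: it is what Claude matches against the user's request to decide whether to load the skill — which is exactly what the new Description Tuning feature optimizes.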
There are two types of Skills:
1. Capability Enhancement
Skills inject specific techniques and patterns to make tasks the model originally could not do, or could not do reliably, work consistently.
2. Preference Encoding
The model can already perform every step, but must be made to follow the team's specific process exactly.
Five Highlights of This Upgrade:
- Evals (Automated Evaluation): Users describe a test prompt and the expected output, and skill-creator runs the verification automatically.
- Benchmark Mode: Batch run standardized tests, output pass rates, time taken, token consumption, and other hard metrics.
- Multi-Agent Parallel Execution: Independent clean contexts to avoid contamination, significantly increasing testing speed.
- Comparator (Blind Comparison): A/B testing of two skill versions.
- Description Tuning (Trigger Description Optimization): Automatically analyzes sample prompts and suggests modifications to descriptions.
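The "test prompt + expected output" idea behind Evals and Benchmark Mode can be sketched as a plain Python loop. This is a hypothetical illustration, not skill-creator's actual interface: `run_skill` is a stub standing in for "Claude with the skill loaded," and the case list and report fields are made up for the example.

```python
import time

def run_skill(prompt: str) -> str:
    # Stub standing in for a real model call with the skill loaded.
    canned = {
        "Summarize Q3 revenue": "Q3 revenue grew 12% year over year.",
        "Format this as a brand-compliant memo": "MEMO: ...",
    }
    return canned.get(prompt, "")

def run_evals(cases: list[tuple[str, str]]) -> dict:
    # Run every (prompt, expected) pair and report the hard metrics
    # Benchmark Mode surfaces: pass count, pass rate, wall-clock time.
    start = time.perf_counter()
    passed = sum(1 for prompt, expected in cases
                 if expected in run_skill(prompt))
    return {
        "total": len(cases),
        "passed": passed,
        "pass_rate": passed / len(cases),
        "seconds": time.perf_counter() - start,
    }

cases = [
    ("Summarize Q3 revenue", "12%"),
    ("Format this as a brand-compliant memo", "MEMO"),
    ("Translate to French", "Bonjour"),  # deliberately failing case
]
report = run_evals(cases)
print(f"pass rate: {report['passed']}/{report['total']}")  # → pass rate: 2/3
```

A real harness would call the model in a fresh context per case — the point of the Multi-Agent Parallel Execution feature — and a Comparator-style A/B run is just this same loop executed once per skill version, comparing the two reports.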
02 - No Reason Not to Install! This Update Revives Old Skills
Anthropic's update to skill-creator has quickly sparked discussions among AI agent practitioners and developers.
03 - The CI/CD Moment for AI Agents: From Artwork to Engineering Product
Anthropic's upgrade to skill-creator essentially brings software engineering's most mature "test-benchmark-iterate" closed loop to ordinary users and enterprise teams at a low barrier to entry. Agent Skills are no longer write-once, throw-away prompt projects, but "living assets" that can be continuously maintained, kept compatible across model versions, and optimized with data.
In the short term, the biggest beneficiaries are developers and enterprise users who have accumulated a large number of custom skills in Claude Code / Cowork.
From a broader perspective, this update further solidifies Anthropic's "toolchain moat" in the Agent ecosystem.