SkillOpt lifts agent scores by 23.5 points without changing weights

Better agent performance may not require a new model if the procedure around the model can be trained. Microsoft Research published SkillOpt on June 30, 2026 at 16:50:02 UTC, showing a way to optimize natural-language skill files as if they were trainable parameters while leaving model weights untouched.

The headline number is unusually large for a method outside the model. With GPT-5.5 in direct chat, SkillOpt raised the average across six benchmarks from 58.8 to 82.3, a +23.5-point absolute gain. Microsoft says it evaluated the method across SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, and ALFWorld, using seven target models from GPT-5.5 down to Qwen3.5-4B, and three execution modes: direct chat, Codex, and Claude Code. Across 52 evaluation cells, SkillOpt was best or tied for best.

The result matters because agent work often fails in the workflow, not only in raw reasoning. SkillOpt searches over edits to a skill file, keeps a rejected-edit buffer, uses validation splits, and performs slower updates to avoid unstable prompt drift. The final artifacts are not giant hidden prompts. Microsoft reports a median final skill length of roughly 920 tokens across six case studies, with only one to four accepted edits in the final file.
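To make that loop concrete, here is a minimal Python sketch of the general idea: propose an edit, score it on a validation split, keep it only if the score improves, and remember rejected edits so they are not retried. This is an illustration under stated assumptions, not Microsoft's implementation. The names `optimize_skill`, `toy_validation_score`, and `min_gain`, the line-append edit format, and the toy scoring rule are all hypothetical, and the sketch omits the slower-update mechanism the article mentions.

```python
from typing import Callable, List, Set, Tuple

Skill = List[str]                  # a skill file modeled as a list of instruction lines
Scorer = Callable[[Skill], float]  # hypothetical evaluator run on a validation split


def optimize_skill(
    skill: Skill,
    candidate_edits: List[str],
    score_on_validation: Scorer,
    min_gain: float = 0.0,
) -> Tuple[Skill, Set[str]]:
    """Greedy edit search with a rejected-edit buffer (illustrative only)."""
    rejected: Set[str] = set()
    best = list(skill)
    best_score = score_on_validation(best)
    for edit in candidate_edits:
        # Skip edits that were already rejected or are already in the skill.
        if edit in rejected or edit in best:
            continue
        trial = best + [edit]
        trial_score = score_on_validation(trial)
        # Accept only if the validation score improves by more than min_gain.
        if trial_score > best_score + min_gain:
            best, best_score = trial, trial_score
        else:
            rejected.add(edit)
    return best, rejected


def toy_validation_score(skill: Skill) -> float:
    """Stand-in for a real verifier: rewards two useful lines, penalizes length."""
    useful = {"check units", "verify formula"}
    return sum(1.0 for line in skill if line in useful) - 0.1 * len(skill)


if __name__ == "__main__":
    final_skill, rejected_edits = optimize_skill(
        ["read the task"],
        ["check units", "be verbose", "verify formula", "be verbose"],
        toy_validation_score,
    )
    print(final_skill)
    print(rejected_edits)
```

In this toy run the two useful lines are accepted and the unhelpful one lands in the rejected buffer, so its second appearance is skipped without another evaluation. A real setup would replace the toy scorer with a benchmark verifier and the fixed candidate list with model-proposed edits.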

The most interesting detail is transfer. A spreadsheet skill trained inside Codex and then moved into Claude Code lifted the no-skill baseline from 22.1 to 81.8, slightly above the 80.4 achieved by training directly in Claude Code. That suggests the learned procedure is not merely memorizing one harness’s tool names. It is capturing a reusable way to solve a class of tasks.

There are limits. SkillOpt needs reliable evaluators or verifiers, so it fits domains where success can be measured. But for enterprise agents, that is often the point: spreadsheets, document QA, search, coding workflows, and internal operations usually have tests, answer keys, or review gates. If the result holds outside benchmarks, the agent stack gets a new adaptation layer: small, auditable skill files that can be trained, versioned, rolled back, and moved across models. Source: Microsoft Research, June 30, 2026.

