Microsoft open sources AI evaluation framework for enterprise agents – InfoWorld

Microsoft has open-sourced an AI evaluation framework that converts natural-language requirements into executable tests, expanding its push into enterprise AI governance as organizations struggle to systematically validate agent behavior before production deployment.
The framework, called ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing), generates evaluation scenarios, datasets, metrics, and scorecards from written specifications, product requirements, and governance documents, Microsoft said in a blog post announcing the release.
“Agents fail in ways that are hard to see,” Microsoft wrote in the blog post. “They drift from policy, produce unsafe outputs in edge cases, and behave differently in production than they did in testing. Generic benchmarks do not catch these failures because they are not built around your policies, your agent, or your use case.”
Rather than requiring developers to manually create evaluation suites, ASSERT translates written intent into reusable tests that can be integrated into AI development pipelines, the company said in the blog post.
With ASSERT, Microsoft is entering an increasingly competitive AI evaluation market that already includes platforms such as LangChain’s LangSmith, Braintrust, Patronus AI, Galileo, Arize AI’s Phoenix, and Promptfoo, which help enterprises benchmark, monitor, and validate large language model applications.
The release comes as enterprises rapidly expand AI agent deployments while formal evaluation practices remain the exception rather than the rule.
“Most organizations, in fact, 99% of them, do not evaluate any AI agents pre-production,” said Anushree Verma, senior director analyst at Gartner.
According to Verma, the industry’s next competitive advantage will depend less on advances in reasoning models than on how effectively organizations simulate and stress-test AI agents before deployment.
“The next competitive moat in agentic AI is not about the sophistication of reasoning models or the underlying architecture,” she said. “It will be about the depth and realism of the training environment through agentic simulation, particularly for mission-critical deployments.”
Gartner estimates that by 2029, more than 75% of domain-specific agents designed without agentic simulation in regulated industries will fail to deliver value.
Forrester sees enterprises moving toward behavioral evaluation but says most organizations have yet to make it a formal production requirement.
“Most enterprises are still in an intermediate stage where behavioral evaluation is inconsistently applied rather than treated as a formal production gate,” said Biswajeet Mahapatra, principal analyst at Forrester.
According to Forrester data, more than 45% of organizations are already using AI agents, and another 25% are piloting them, yet many continue to struggle with scaling because of immature governance and limited operational rigor.
“The net is that behavioral evaluation is becoming important, but for most organizations it is still ad hoc or tool-driven rather than a standardized release gate enforced across the lifecycle,” Mahapatra said.
Microsoft said ASSERT uses large language models as judges, with model-generated evaluations agreeing with human reviewers 80% to 90% of the time in the company’s internal validation.
That level of agreement can help automate large portions of AI testing, but should not be treated as a standalone governance mechanism, Mahapatra said.
“An 80% to 90% agreement rate with human reviewers indicates strong alignment but is not sufficient as a standalone control for governance or compliance,” he said.
Instead, enterprises should adopt layered oversight where AI evaluates AI at scale while humans retain supervisory accountability for high-risk, regulated, or ambiguous scenarios. Buyers should also watch for bias, consistency issues, and overreliance on a single model acting as both generator and evaluator, he added.
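To make the agreement figure and the layered-oversight idea concrete, here is a minimal Python sketch. It does not use ASSERT's API or Microsoft's validation data: the `Case` record, the risk tiers, the routing rule, and the sample verdicts are all hypothetical, chosen only to illustrate how a judge-versus-human agreement rate is computed and how high-risk cases might be routed to human reviewers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    """One evaluated agent response (hypothetical record, not an ASSERT type)."""
    case_id: str
    risk: str                            # "low" or "high"; assumed risk tiers
    judge_verdict: str                   # "pass" or "fail", from the LLM judge
    human_verdict: Optional[str] = None  # set only where a human reviewed the case


def agreement_rate(cases: List[Case]) -> float:
    """Share of human-reviewed cases where the judge and the human agree."""
    reviewed = [c for c in cases if c.human_verdict is not None]
    if not reviewed:
        raise ValueError("no human-reviewed cases to compare against")
    matches = sum(c.judge_verdict == c.human_verdict for c in reviewed)
    return matches / len(reviewed)


def needs_human_review(case: Case) -> bool:
    """Assumed routing rule: humans see every high-risk case and every failure."""
    return case.risk == "high" or case.judge_verdict == "fail"


# Illustrative data only.
cases = [
    Case("c1", "low", "pass", "pass"),
    Case("c2", "low", "pass", "pass"),
    Case("c3", "high", "pass", "fail"),   # judge and human disagree
    Case("c4", "low", "fail", "fail"),
    Case("c5", "high", "pass", "pass"),
    Case("c6", "low", "pass"),            # not human-reviewed
]

print(f"agreement: {agreement_rate(cases):.0%}")                 # agreement: 80%
print([c.case_id for c in cases if needs_human_review(c)])       # ['c3', 'c4', 'c5']
```

In this toy sample the judge matches the human on four of five reviewed cases, yet the one disagreement falls on a high-risk case. That is the kind of gap an aggregate agreement rate can hide, and the reason a routing rule like the one above keeps a human in the loop for such cases.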
Microsoft released ASSERT under the MIT open-source license, allowing organizations to inspect, modify, and integrate the framework into existing AI development workflows.
But open sourcing a framework does not eliminate questions around evaluation neutrality, Mahapatra said.
“Open sourcing under an MIT license reduces lock-in concerns and enables broad interoperability across model ecosystems,” he said. “However, it does not fully eliminate trust or conflict-of-interest questions because the originating vendor still influences how evaluation criteria, scoring logic, and definitions of acceptable behaviour are encoded.”
Instead of relying on a single evaluation framework, enterprises should validate AI systems against multiple evaluation approaches and retain ownership of internal evaluation policies, he said.
Gyana Swain is a seasoned technology journalist with over 20 years’ experience covering the telecom and IT space. He is a consulting editor with VARINDIA, and earlier in his career he held editorial positions at CyberMedia, PTI, 9dot9 Media, and Dennis Publishing. A published author of two books, he combines industry insight with narrative depth. Outside of work, he’s a keen traveler and cricket enthusiast. He earned a B.S. degree from Utkal University.
