AI Testing

35 items tagged with this topic

Official Sources, from Microsoft Research Blog

AutoAdapt: Automated domain adaptation for large language models

Deploying large language models (LLMs) in real-world, high-stakes settings is harder than it should be. In high-stakes settings like law, medicine, and cloud incident response, performance and reliability can quickly break down because ada…

Podcasts & Newsletters, from ChinaTalk

Not Another Dev Tool

My Favorite Person Needs a Cofounder

Watchlist, from Anthropic Engineering

Quantifying infrastructure noise in agentic coding evals

Agentic coding benchmarks like SWE-bench and Terminal-Bench are commonly used to compare the software engineering capabilities of frontier models—with top spots on leaderboards often separated by just a few percentage points. These scores…

Official Sources, from Microsoft Research Blog

AsgardBench: A benchmark for visually grounded interactive planning

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don't go as expected — for example, when the mug it was asked to wash is already clean, or the sink is ful…

Podcasts & Newsletters, from Latent Space

Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun

Speaker 1 | 00:00 - 00:20 I think this whole space is extremely difficult as things are emerging now. And, I mean, it's not only for world models. I think it's for everything, including text based models. Right? Because, you know, in the e…