AI Testing

35 items tagged with this topic

Recent

Official Sources from Microsoft Research Blog, May 5

Microsoft at NSDI 2026: Advances in large-scale networked systems

Microsoft researchers share advances in building and operating large-scale distributed systems, spanning datacenters, networking, and the growing intersection with AI during NSDI ’26.

Research, Speed & Cost, AI Testing

Builders from X, May 7

Nice qmd vs gbrain benchmarking post. GBrain won 8.3x on this particular corpus https://t.co/UcX…

AI Testing

Official Sources from Together AI Blog, Apr 30

Announcing Together AI and Adaption Partnership

Together AI and Adaption partner to bring Together Fine-Tuning natively into Adaptive Data, helping teams optimize datasets, run fine-tuning, evaluate results, and deploy stronger open models.

Research, Custom AI, AI Testing

Older

Podcasts & Newsletters from Latent Space Newsletter, Apr 29, 2026

[AINews] not much happened today

a quiet day.

Builders from X, May 1, 2026

3 weeks ago GBrain and Mempalace looked kind of similar. 3 weeks later, it's pretty clear GBrain is its own category that is ideal for OpenClaw/Hermes personal AI scenarios, not trying to solve needle-in-a-haystack retri…

Official Sources from Microsoft Research Blog, Apr 22, 2026

AutoAdapt: Automated domain adaptation for large language models

Deploying large language models (LLMs) in real-world, high-stakes settings is harder than it should be. In high-stakes settings like law, medicine, and cloud incident response, performance and reliability can quickly break down because ada…

Podcasts & Newsletters from Latent Space Newsletter, Apr 23, 2026

AIE Europe Debrief + Agent Labs Thesis: Unsupervised Learning x Latent Space Crossover Special (2026)

Note: This episode was recorded just after AIE Europe, but before the Cursor-xAI deal.

Builders from X, Apr 27, 2026

Wanted a truly local storage for my tweets so built birdclaw. Imports your archive, backs it up on github, has jobs so you can import your x bookmarks daily (since they are not fully accessible via the api). https://t.c…

Builders from X, Apr 26, 2026

Releasing GBrain v0.22 - lots of fixes to search and retrieval and a new eval system (gbrain-evals is now a separate repo to prevent checkout bloat) https://t.co/qVAQjQeReV

Podcasts & Newsletters from Import AI, Apr 20, 2026

Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe. Huawei’s HiFloat4 training format beats Western-developed MXFP4 in Asce…

Builders from X, Apr 23, 2026

GPT-5.5 is live. We’ve been testing the model over the last couple of weeks at Box on our most complex knowledge work evals, and the model saw a 10 percentage point jump on accuracy of these enterprise content tasks vs.…

Builders from X, Apr 24, 2026

Next, I finally created new evals for GBrain which show how much more awesome GBrain is when you have graph AND vector search on top of grep on knowledge wikis https://t.co/eiVHZWvStL

Podcasts & Newsletters from Unsupervised Learning, Apr 23, 2026

Ep 85: Has AI Infra Stabilized, FM Vibe Shift, & What's Next for Coding Agents

Speaker 1 | 00:00 - 00:03 Isn't that crazy? That number is just mind boggling.
Speaker 2 | 00:03 - 00:06 What is the state of the AI coding wars today?
Speaker 1 | 00:06 - 00:16 We're in a phase of sort of, like, capability exploration. Th…

Podcasts & Newsletters from ChinaTalk, Apr 19, 2026

Not Another Dev Tool

My Favorite Person Needs a Cofounder

Official Sources from Hugging Face Blog, Apr 21, 2026

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

Builders from X, Apr 20, 2026

Kudos to the folks from Tencent for working with us and providing evals to improve OpenClaw's harness performance! We're also working with them to bring fixes/improvements back to the open source repo. Great option for…

Builders from X, Apr 19, 2026

When you ask Claude to evaluate a potential investment and it replies “cute, just not as good as me” lol https://t.co/HL1aByBAoj

Official Sources from Hugging Face Blog, Apr 16, 2026

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Watchlist from Anthropic Engineering

Quantifying infrastructure noise in agentic coding evals

Agentic coding benchmarks like SWE-bench and Terminal-Bench are commonly used to compare the software engineering capabilities of frontier models—with top spots on leaderboards often separated by just a few percentage points. These scores…

Builders from X, Apr 17, 2026

Today is my last day at OpenAI, as OpenAI for Science is being decentralized into other research teams. It’s been a mind-expanding two years, from Chief Product Officer to joining the research team and starting OpenAI f…

Builders from X, Apr 17, 2026

When I build skills now I always ask AI to spin up a separate eval agent to do yes/no checks to grade the first agent's output. And if the output isn't "yes" across the board it asks the first agent to keep working. Bui…
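
The grade-and-retry loop this post describes could look roughly like the Python sketch below. It is only an illustration, not the poster's setup: `refine_until_all_yes`, `toy_worker`, and `toy_grader` are hypothetical names, and the two toy functions stand in for real agent calls. The round cap is also an assumption, added so the loop always terminates.

```python
from typing import Callable, Dict, Tuple

Checks = Dict[str, bool]  # check name -> True ("yes") or False ("no")


def refine_until_all_yes(
    worker: Callable[[str, Checks], str],
    grader: Callable[[str], Checks],
    task: str,
    max_rounds: int = 5,
) -> Tuple[str, Checks, int]:
    """Run the worker, grade its output, and retry until every check is yes."""
    feedback: Checks = {}
    output = ""
    for round_no in range(1, max_rounds + 1):
        output = worker(task, feedback)   # first agent produces or revises
        feedback = grader(output)         # separate agent answers yes/no checks
        if feedback and all(feedback.values()):
            return output, feedback, round_no
    # Give up after max_rounds and return the last attempt with its grades.
    return output, feedback, max_rounds


# Toy stand-ins for the two agents (hypothetical, for illustration only).
def toy_worker(task: str, feedback: Checks) -> str:
    if feedback.get("has_summary") is False:
        return "Summary: " + task
    return task


def toy_grader(output: str) -> Checks:
    return {
        "non_empty": bool(output.strip()),
        "has_summary": output.startswith("Summary:"),
    }


if __name__ == "__main__":
    print(refine_until_all_yes(toy_worker, toy_grader, "release notes"))
```

Keeping the grader a separate callable with its own yes/no checklist means the worker never grades itself.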

Builders from X, Apr 15, 2026

The roadmap for Claude Cowork is… one month long. Things are moving too fast in AI to plan any further. Ship, evaluate, iterate. @felixrieseberg, who heads up Cowork at @AnthropicAI https://t.co/8YYpbM2yXU https://t.co/…

Official Sources from Hugging Face Blog, Apr 9, 2026

Multimodal Embedding & Reranker Models with Sentence Transformers

Builders from X, Apr 14, 2026

GBrain v0.9.3 now available. Search tuning, search evals, CJK queries, and better health checks! Plus lots of security hotfixes. Ask your claw or hermes to upgrade. https://t.co/ywYaRruXEZ

Builders from X, Apr 11, 2026

The more I meet enterprise CIOs and AI leaders outside of tech, the more it’s obvious that if you’re building software that doesn’t have a great headless mode, you’re going to be at risk in the coming years. Asked a gro…

Builders from X, Apr 10, 2026

Sometimes VCs will ask questions just to test your depth (market, technical, your obsession). The best founders (imo) are able to argue both sides of an argument with extreme clarity. There’s rarely something VCs ask th…

Builders from X, Apr 9, 2026

In evals, Sonnet with an Opus advisor scored 2.7 percentage points higher on SWE-bench Multilingual than Sonnet alone, while costing 11.9% less per task. https://t.co/pV4nftN1wz

Builders from X, Apr 8, 2026

I'm working on character evals and noticed that Claude would constantly pick itself as #1, so I removed the model names from the judge and changed things. https://t.co/Y9SqqJSYRc
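
One way to hide model names from a judge, as this post describes, is to relabel each answer with a neutral tag and redact any self-mentions before the judge sees it. The Python sketch below is an assumption about how that might be done, not the poster's code: `blind_candidates` and the `model-x` / `model-y` names are hypothetical.

```python
import random
import re
from typing import Dict, Tuple


def blind_candidates(
    answers: Dict[str, str], seed: int = 0
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (blinded answers by neutral label, label -> real model name)."""
    rng = random.Random(seed)
    models = list(answers)
    rng.shuffle(models)  # so label order does not reveal the original order

    # Redact every known model name wherever it appears in any answer.
    name_pattern = re.compile(
        "|".join(re.escape(name) for name in answers), flags=re.IGNORECASE
    )

    blinded: Dict[str, str] = {}
    key: Dict[str, str] = {}
    for i, model in enumerate(models):
        label = "Candidate " + chr(ord("A") + i)
        blinded[label] = name_pattern.sub("[model]", answers[model])
        key[label] = model  # kept privately to unblind the judge's ranking
    return blinded, key


if __name__ == "__main__":
    answers = {
        "model-x": "As model-x, I would open with a joke.",
        "model-y": "I would open with a question.",
    }
    blinded, key = blind_candidates(answers)
    print(blinded)  # what the judge sees
    print(key)      # what the evaluator keeps
```

The judge receives only `blinded`. The evaluator uses `key` afterwards to map the judge's ranking back to real model names.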

Builders from X, Apr 8, 2026

@jackbutcher @thisiskp_ @danlovesproofs @bfirsh @akshat_b here is @simonw on the difference in self evident naming between “prompt injection” and “lethal trifecta” https://t.co/wSSVgEZeWM https://t.co/ovDqwQRvq5

Official Sources from Hugging Face Blog, Apr 2, 2026

Welcome Gemma 4: Frontier multimodal intelligence on device

Official Sources from Google DeepMind Blog, Mar 17, 2026

Measuring progress toward AGI: A cognitive framework

We’re introducing a framework to measure progress toward AGI, and launching a Kaggle hackathon to build the relevant evaluations.

Official Sources from Hugging Face Blog, Mar 31, 2026

Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

Official Sources from Microsoft Research Blog, Mar 26, 2026

AsgardBench: A benchmark for visually grounded interactive planning

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is ful…

Podcasts & Newsletters from Latent Space Newsletter, Apr 3, 2026

[AINews] Gemma 4: The best small Multimodal Open Models, dramatically better than Gemma 3 in every way

A welcome update from Google!

Podcasts & Newsletters from Latent Space, Apr 2, 2026

Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun

Speaker 1 | 00:00 - 00:20 I think this whole space is extremely difficult as things are emerging now. And, I mean, it's not only for world models. I think it's for everything, including text based models. Right? Because, you know, in the e…
