Helping build shared standards for advanced AI
OpenAI helps build shared standards for advanced AI, supporting evaluation frameworks, safety practices, and global cooperation through the Appia Foundation.
50 items tagged with this topic
OpenAI board member Zico Kolter and Gray Swan CEO Matt Fredrikson join swyx to explain why AI security is not just “cybersecurity with AI”
With GLM-5.2 passing everyone's vibe check, the open models story finally becomes a real frontier story.
a policy framework for derisking success
The biggest problem with AI is that priors need to be reset every few weeks, and it seems like most people are incapable of doing that. I talk to so many people who say xyz doesn’t work and when I ask when was the last…
Learn how GPT-5.5 Instant improves ChatGPT’s health and wellness responses with stronger reasoning, better context, clearer communication, and physician-informed evaluations.
Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv, cappuccinos, and feedback from readers. AI researchers launch new safety startup because “alignme…
Congress's Push to Take the Lead on China
Introducing LifeSciBench, an expert-authored, expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions.
The real prize in the SpaceX-Cursor deal is the agentic harness that will become the core for automating all knowledge work at scale. Here’s what SpaceX is getting: 1. Production-grade agentic harness -planning, context…
OpenAI introduces Deployment Simulation, a method to predict AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy.
I just reached 70k followers on X 🙏🏼 X is where I learn in public & build in public. I've learned so much and met so many wonderful people here. If you're curious about how I grew my followers while being completely authe…
Having been through many frontier model launch reviews, I have empathy for everyone involved. Launching an LLM isn't like shipping traditional software - you're making a decision about a black box with effectively infin…
Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv, cappuccinos, and feedback from readers. Society can be reward-hacked, just like cyber environment…
We made a thing!
When the whole Tokenmaxxing craze started some of our enterprise customers asked us for a leaderboard. We said no. Would’ve been “great” for business but we’re not in the business of selling tokens for the sake of tokens.…
So proud of @datacurve (YC W24) - building THE defining software engineering benchmark in DeepSWE Tired? SWE-Bench Pro Wired? Datacurve DeepSWE https://t.co/ZoftIrEGKc
This is a super exciting release - Claude Fable 5 is the same underlying model as Mythos but with added safeguards. The benchmarks are great and it's SOTA on everything by a margin but I'll add that *qualitatively* also…
Great post. So much about model performance is a function of how much compute you’re doing at inference time. This means compute-normalized benchmarks are the only logical path forward. And yet, the challenge is it’s a l…
@METR_Evals previously on @cognition_labs https://t.co/PdMrqxtuV0
It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents over 1,000 hours of maintainer-validated software engineering work most frontier models cannot y…
We talk with the VendingBench authors on evaling Claudes from Haiku to Mythos, and how they build leading, and lasting, frontier evals from scratch.
Routing to models is genuinely hard. It means mapping each task to the right model - which requires benchmarking models against your product's specific tasks and dialing in the quality/cost trade-off. And there is an op…
How to build AI skills that check their own work and improve over time: 1. Give it context Ask AI: "Create a skill for this [repeated task]. Here are examples of good output so it knows what good looks like." 2. Make it…
Also as great as Codex is (and I'm really starting to love it) the frontend design still leaves a lot to be desired. I have a /slides skill and you can guess which one Codex made vs. Claude. Yes I know I can make an imag…
Excited to share how Anthropic's data team has automated 95% of business analytics queries with Claude. Blog post covers how we approach evals, ablations, and online validation! https://t.co/sMPtM0GscN
The legendary Microsoft CEO makes his first Latent Space appearance!
SWE benchmarks don’t necessarily capture app building capabilities. ViBench does. https://t.co/zh663pe79v
As token budgets take on a larger part of operating expenses over time, model routing is the inevitable conclusion. This is also one of the biggest areas of differentiation for the applied AI layer over time. By underst…
GBrain is the agentic swiss army knife for retrieval and memory https://t.co/EgXmcs6ZOS
Almost a week later! What are your thoughts on Opus 4.8? We were extremely bullish on it in testing—it seems the response was more tepid once y'all got your hands on it. If you disagreed with our take I'm curious why so…
MiniMax M3 is now the leading open model on the Next.js agent evaluations (https://t.co/SnZ54XoRWV). Right behind Opus & GPT5, but 10× cheaper (And 20× cheaper right now on ▲ AI Gateway!) https://t.co/z9ts1NZDyu
a quiet day lets us highlight the new AIE WF focuses
every evals/analytics startup is going through a onetime generational upgrade into a continual learning platform in 2026 many will fail but as always the tasteful ones win
OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new path to build with OpenAI through the AWS environments, controls, and procurement workflows they already use. Customers can get started with Open…
OpenAI shares guidance on third-party AI evaluations, covering how to assess model capabilities, safeguards, and validity for frontier systems.
Do you still trust benchmarks or do you just listen to your friends? What makes you try a new model?
F.
a quiet day lets us feature fundraises!
By evals I mean literally tell the agent: given what we discussed about what we are doing and why and what happened, use three different frontier models to look at inputs and outputs of your skill file calling the code,…
a quiet day but a nice result in AI x mathematics
Spent this weekend building a pretty rad (imo) travel CLI product that I'd love to test in public. Anyone have a trip coming up soon AND has a boatload of points? Reply here with origin, destination, point breakdown an…
Real-world inference benchmarks for coding agents: 31% more TPS than TensorRT-LLM, 2× better TTFT at saturation, and 76% lower cost than Claude Opus 4.6.
My newest gbrain-evals just dropped - this is how gbrain does vs other options. https://t.co/yRWm72QEgf is SOTA for reranking and embedding cost, speed, and retrieval success. GBrain beats MemPalace by 1% on LongMemEval…
Thinking Machines is impressive. In a couple of hours this afternoon I fine-tuned my own Qwen3.5-397B model. Fast usable multimodal is also going to enable very mind-blowing personal AI. https://t.co/mm3laZb766
btw we did a bake off of Exa vs competitors and it took all of 1.5 hrs for the team to unanimously converge on exa lol. so proud to see my former landlords crush it - time travel back to last year and listen to a pre pm…
Our recent paper, “LLMs Corrupt Your Documents When You Delegate”, has generated discussion about the reliability of AI systems in delegated workflows. We appreciate the interest in this work and want to clarify several important points ab…
Gemini 3.5 Flash is out, and it's a major jump over Gemini 3 Flash in model capability for knowledge work. We've been evaluating it on our Box AI Complex Work Eval in early release, and the model delivers a 12 percentag…
Genuinely impressive release by Google today (remember when they were behind?) Gemini 3.5 Flash perf: * Building on prior strengths (83.6% of MMMU-Pro for multimodal), * big jump on agentic coding (76.2% on Terminal-Ben…
Databricks uses GPT-5.5 for enterprise agent workflows after the model set a new state of the art on the OfficeQA Pro benchmark.