OpenAI shipped a cybersecurity model and a planet-scale patching program yesterday. Box CEO Aaron Levie responded by explaining why none of it matters if enterprises can't measure whether any of it works. Both things are true at once, which tells you where the real bottleneck is right now.
OpenAI goes on offense in cybersecurity
Sam Altman announced GPT-5.5-Cyber, a specialized model claiming top performance on the CyberGym benchmark, alongside two new programs: Patch The Planet and Codex Security. The framing is notable. Previous AI security tools helped you *find* vulnerabilities. OpenAI is pitching these as tools that actually *fix* them, automatically, at scale, working alongside the US government and the broader security industry.
Why it matters: Every mid-sized company has a backlog of known security patches that nobody has gotten around to deploying. If Codex Security can actually close that gap rather than just generate longer vulnerability reports, your security team's job description changes significantly. The benchmark performance is easy to claim. The patching program is where this either becomes real or becomes a press release.
Box CEO: evals are the enterprise AI competency nobody is building yet
Aaron Levie posted a blunt observation: almost everything that determines whether AI models and agents improve, from domain-specific fine-tuning to enterprise deployments that actually help people work, comes down to evals. Evals are the tests companies use to measure whether an AI is doing the right thing in their specific context. His prediction is that the ability to design and run good evals will become a core enterprise competency, and the companies that figure it out first will pull ahead of everyone using the same models and getting worse results.
Why it matters: Right now most companies evaluating AI tools are asking "did it feel useful?" That's not an eval, that's a vibe check. The companies that build structured, repeatable ways to measure whether AI is actually doing the job will make better purchasing decisions, train better custom models, and catch failures before they become expensive. If your team doesn't have anyone who owns this, that's the gap worth filling before the next model cycle.
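For a sense of what "structured and repeatable" can look like at its smallest, here is a minimal sketch in Python. It is an illustration, not Levie's method or any vendor's tooling: `EvalCase`, `run_eval`, `toy_model`, and the two sample questions are all hypothetical, and the pass rule (the answer must contain an expected phrase) is an assumption chosen for simplicity. Real evals usually need richer scoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str    # question sent to the model
    expected: str  # phrase a correct answer must contain (assumed pass rule)


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the expected phrase."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(
        1 for case in cases
        if case.expected.lower() in model(case.prompt).lower()
    )
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned answers."""
    canned = {
        "What is our refund window?": "Refunds are accepted within 30 days.",
        "Who approves travel over $5,000?": "Your manager approves it.",
    }
    return canned.get(prompt, "I don't know.")


# Hypothetical company-specific cases: the same list is rerun on every model change.
CASES = [
    EvalCase("What is our refund window?", "30 days"),
    EvalCase("Who approves travel over $5,000?", "VP of Finance"),
]

score = run_eval(toy_model, CASES)
print(f"pass rate: {score:.0%}")  # prints "pass rate: 50%"
```

The point of the sketch is the shape, not the scoring rule: a fixed set of cases drawn from your own context, a number that comes out the same way every time, and a result you can compare across models or vendors instead of relying on how a demo felt.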
Swyx on SpaceX's quietly brilliant AI business model
Swyx floated a take worth sitting with: SpaceX has already recouped roughly half its investment in Cursor through compute deals alone, with the other half covered if xAI's Composer 3 performs well. His broader point is that no other player is simultaneously a leading model lab and a GPU cloud provider, and that combination is unusually capital-efficient if you've planned your GPU supply correctly for both the "training goes great" and "training doesn't go great" scenarios.
Why it matters: The standard framing is that AI infrastructure is expensive and risky. SpaceX appears to be running an arbitrage where funding developer tools pays for its own compute costs. If that math holds, it's a structural advantage that pure-play clouds and pure-play model labs don't have.
Quick hit: OpenAI's Thibault Sottiaux on Patch The Planet
Thibault Sottiaux, who works on Codex at OpenAI, celebrated the cybersecurity launch with a post linking to the full announcement. More signal than substance on its own, but the 1,386 likes suggest the security community is paying closer attention to this than most OpenAI releases.
---
Peter Yang posted a thread about the mechanics of viral ragebait content, specifically how clipping the most inflammatory moment of any conversation and spreading it out of context causes real harm even when the viewer knows it's manipulative. He mentions the Corgi founder clip as an example. The post is less about AI and more about platform dynamics, but Yang's point that "if even one parent watches this clip and decides to neglect their kids more, the damage is done" captures something honest about how information spreads at scale.