Saturday, June 6, 2026

5 stories · 3 min read

The infrastructure for AI-powered work is getting serious fast. Between Anthropic hiring for model performance at scale, new evaluations that run for up to 100 hours, and writing tools built to be called by agents, we're past the demo phase.

Anthropic is hiring a PM to make Claude Code actually work at scale

Cat Wu from Anthropic posted that they're looking for a product manager focused on Claude Code's model performance, specifically someone with experience writing "agentic evals" who can bridge research and core products. The role signals that Anthropic is moving beyond basic coding assistance toward systematically measuring and improving agent performance.

Why it matters: When the company behind Claude hires specifically for agent evaluations, it is preparing for enterprise customers who need AI that works reliably, not just impressively in demos.

Source →

Every launches Spiral 4.0 with AI agent integration

Dan Shipper's Every released a major update to their writing tool that extracts your brand voice using stylometry and works directly with coding agents like Cursor and Claude Code. The tool can now be called by agents through MCP and CLI to automatically generate landing pages, tweets, and marketing emails in your specific voice.

Why it matters: This is what the agent economy actually looks like. Instead of replacing writers, AI is becoming a writing partner that knows how you sound. Companies could end up spending more on voice-consistent AI writing than they ever spent on copywriters.

Source →

Cog ships 100-hour AI evaluations with financial guarantees

AI developer Swyx highlighted that Cog just launched enterprise AI evaluations that run up to 100 hours, compared to METR's 16-hour cap. The company is confident enough in these extended tests to offer financial guarantees on the results. The evaluations cover machine learning engineering, GPU kernels, and cybersecurity tasks.

Why it matters: If you're betting your business on an AI system, you need to know it won't break after running for days. Marathon evaluations like these could become the new standard for enterprise AI deployment.

Source →

Recursive self-improvement goes from theory to lab

According to Latent Space's latest newsletter, Sakana AI launched a dedicated RSI (Recursive Self-Improvement) Lab in Tokyo, building on their AI Scientist and Darwin Gödel Machine projects. The focus is on creating self-improving AI systems under compute constraints rather than requiring hyperscale resources. Meanwhile, Anthropic's Claude Mythos continues generating strong reactions for desktop workflows.

Why it matters: RSI isn't science fiction anymore. When companies are opening dedicated labs and talking about "1 or 2 hard problems" remaining before AGI, the timeline starts to look dramatically shorter.

Source →

Josh Woodward shows off Gemini macOS integration

Woodward shared a brief demo of Gemini features working within his macOS app. The post offered no details beyond a video link.

Source →