Radar

Weekly signals across AI research, releases, and infrastructure. Filtered for what matters when you're building with it.

Pulse

Weekly indicators across AI research and open-weight releases

arXiv Papers
cs.AI + cs.CL + cs.LG
892-5% vs 4w avg
Open-Weight Releases
Notable models per week
4-6% vs 4w avg
Shifts strategyWorth watching↑↓Theme momentum vs 4-week avg

On Friday, Codex stopped needing a terminal. It can now see, click, and type its way through Windows apps the way a person does, the coding agent finally let out of the command line. Mistral shipped its own answer the same week, a self-hostable agent split into a work mode and a code mode. Anthropic put Opus 4.8 out at the price of its predecessor and raised $65 billion at a $965 billion valuation, on run-rate revenue that crossed $47 billion this month. Capability and capital point the same way, toward agents that operate real software instead of describing it.

The quieter results drew the other line. LongDS-Bench ran five frontier models through long data-analysis tasks and watched accuracy fall nearly 47 points from the first turns to the last, most failures tracing to a lost grip on the problem rather than too few steps. More steps did not help. Anthropic's survey of 1,260 social scientists found 81% had tried a chatbot and only 20% reached for a coding agent. Driving a desktop is an engineering problem now. Holding the thread forty turns later is still a research one, and that gap is what this week's valuations are quietly pricing past.

When KPMG flipped the switch this week, every one of its 276,000 employees got an AI teammate the same day. PwC had done the same with 30,000 staff five days earlier. Two of the Big Four wired Claude into their delivery floors inside a week. The professional services adoption curve just bent.

Underneath the deal, the agent runtime war broke open. Anthropic bought Stainless, the SDK and MCP generator that wires Claude into every other system, and previewed self-hosted sandboxes for buyers who cannot send data to the cloud. OpenAI ran the same play from the other side, putting Codex inside Dell's on-prem AI Factory, with 4 million weekly Codex developers and a use case that already stretches past coding into report drafting and lead qualification. Google countered at I/O with Managed Agents available behind a single Gemini API call, plus WebMCP, an open standard letting any website expose tools to a browser agent. Anthropic also started metering agent compute across its tiers, the first vendor to put long-running agents on their own billing line. Where the agent runs is now contested ground, not whether the model can.

Tomoro was a 150-person AI consultancy on Monday morning. By Monday afternoon it was OpenAI's services arm, anchored to a $4B joint venture with TPG and eighteen co-investors. Exactly a week earlier Anthropic had spun up the same kind of vehicle with Blackstone. Two frontier labs, two implementation businesses, seven days apart. The labs have stopped waiting for the consultancies to close the gap between selling a model and actually adopting one.

Anthropic spent the rest of the week building out the services layer underneath. A Claude for Small Business package landed with fifteen prebuilt workflows wired into the accounting and productivity tools SMBs already buy. PwC committed 30,000 US staff to a joint Center of Excellence and pointed at production deployments delivering up to 70% faster delivery in underwriting, mainframe modernization, and HR. A $200M four-year Gates Foundation deal followed, aimed at frontline health workers and smallholder farmers in low- and middle-income countries. The seat-license era of AI sales is closing. What replaces it looks less like Salesforce and more like Accenture, with the model vendor sitting on both sides of the table.

On Wednesday the Pentagon canceled Anthropic's classified-work contracts and stamped a "supply chain risk" label on the company, days after Dario Amodei refused to allow the model to be used for domestic surveillance or autonomous weapons. Other vendors moved into the slot the same week. A safety stance now costs federal revenue, and Anthropic is the live test of whether what it gains elsewhere is worth more.

The elsewhere arrived fast. Three days earlier the company had spun up a private-equity-backed enterprise services firm with Blackstone and Goldman Sachs to put Applied AI engineers inside regional healthcare systems and mid-market manufacturers, buyers who had never had access to frontier deployment. Ten finance-vertical agent templates and a 300-megawatt SpaceX compute deal followed before the cancellation landed. Three alignment papers also shipped: agentic-misalignment rates fell from 96% to near zero by training on reasons rather than demonstrations, an open-source auditing tool was handed to an independent nonprofit, and a new technique caught the model suspecting it was being tested without admitting it. The deals are no longer being closed on capability. They are being closed on a chain of evidence that the model will not do the wrong thing when nobody is watching.

Microsoft spent four years convincing Wall Street it was the OpenAI company. This week OpenAI walked into Amazon's Bedrock console with GPT-5.5, Codex, and Managed Agents, behind a $50B commitment from Amazon. Anthropic was already there. Two frontier stacks, one buy button, no exclusivity left to defend. Mistral picked a different fight with self-hostable async coding agents that fan out, inspect their own diffs, and open pull requests when finished. Anthropic took a third route, dropping Claude into the creative stack through MCP, the Adobe and Ableton windows designers and musicians already had open. Three bets on where the work actually happens, none of them in the lab anymore.

Underneath the moves, the mood shifted. Google warned that hidden instructions on public web pages are now poisoning enterprise agents, and the firewalls and EDR stacks defenders already paid for cannot see the attack, because the agent is using legitimate credentials. Regulators flagged missing override paths. SAP turned governance into a sales pitch, IBM shipped Bob to keep AI-assisted SDLC budgets from running away, and Copilot moved from seats to per-token billing, which puts AI usage into the same forecasting bucket as cloud spend. The capability story flatlined for a week. The operational one is the one to watch.

The week's biggest move was a supply chain story, not a model story. Anthropic locked in up to 5 gigawatts of AWS Trainium capacity, with $25B more from Amazon and a $100B 10-year commitment going the other way, an explicit answer to the reliability strain showing up in Claude usage. NVIDIA and Google countered on Tuesday with A5X bare-metal instances claiming 10x lower inference cost per token, easing the unit economics for anyone running production agents at scale. Yann LeCun pulled the other direction, raising $1B for AMI Labs with twelve people and a thesis that modular world models will outperform monolithic LLMs on a fraction of the GPU budget. The compute layer is consolidating at the top while a counter-thesis tries to route around it from below.

The rest of the week filled in the agent stack. Band raised to build interaction infrastructure between agents, the routing, audit, and authority layer that becomes load-bearing once production agents start coordinating. Snowflake shipped Intelligence and Cortex Code for agentic workflows, Siemens shipped the Eigen Engineering Agent that runs PLC programming two to five times faster than humans, and NEC committed Claude and Claude Code to 30,000 employees as Anthropic's first Japan global partner. Anthropic's Economic Index of 81,000 Claude users found productivity gains cluster at the top and bottom of the pay scale, while early-career workers in AI-exposed roles report the most displacement anxiety, a pattern that will shape who buys AI-native services next. Law firms moved into a third stage of adoption with billing models pivoting from hourly to value-based, and Mythos found 271 vulnerabilities in Firefox 150, evidence that defender economics have flipped where the tooling is in place.

The most telling moment this week came from a benchmark, not a launch: GTA-2 showed that wrapping the same base model in a better agent harness produced a 172% jump in task completion. Anthropic had shipped Opus 4.7 at the same price as 4.6 a day earlier, a quiet admission that the model tier is no longer where the leverage is. KWBench sharpened the point from the other side, finding that the best models solve only 28% of knowledge-work problems when nobody tells them what kind of problem it is. Raw model capability matters less than the structure around it, the harnesses, memory, and unprompted diagnosis that decide whether an agent actually works.

The enterprise stack is forming to match. Commvault launched what is effectively a Ctrl-Z for cloud agents, monitoring every API call across AWS, Azure, and GCP so unwanted actions can be reversed without touching legitimate work. OpenAI's Agents SDK added native sandbox execution the same week, isolating credentials from model-generated code. SAP pushed agentic AI deep into SuccessFactors, from recruiting to payroll. Anthropic let nine Claude instances run alignment experiments on themselves for $18K and five days, showing how agent oversight could eventually be done by other agents. The next twelve months belong less to whoever builds the smartest model and more to whoever builds the scaffolding around it, the kind that makes agents more capable and the kind that keeps them honest.

Before this week, agents were a demo category. After Monday they were a national-security event. Anthropic's Mythos preview, run through Project Glasswing with forty security partners, autonomously found thousands of zero-days, including a 17-year-old remote code execution bug buried in FreeBSD. Days later, Treasury Secretary Bessent and Fed Chair Powell pulled major bank CEOs into an urgent session to warn them what the model had just proven. The capability is no longer hypothetical, and neither is the asymmetry: a model can now uncover flaws that took humans seventeen years to notice, and it does not need a patch cycle to move on to the next target.

The rest of the week was the industry building rails before the next Glasswing-class event. Anthropic pushed Claude Managed Agents into public beta, a composable stack with managed harnesses, sandboxing, and streaming meant to take teams from prototype to production in days instead of months, alongside a five-principle framework for deploying agents with plan-mode review and granular tool permissions. Apple and other major tech companies moved in the opposite direction, shipping agents with deliberately narrow capabilities on the bet that user trust grows faster than raw autonomy. Meta released Muse Spark closed-source from its Superintelligence Labs, a pointed break from its open-weight history, with a Contemplating mode that runs parallel agent squads. The frontier has become too consequential to open-source casually, and the next twelve months favor builders who can ship guardrails as fast as they ship capability.

Open weights leapt forward from two directions this week. Google opened with Gemma 4 under Apache 2.0, a trimodal family spanning text, vision, and audio, with 140-plus languages and agentic-workflow tuning. Meta followed with Llama 4 Scout and Maverick, natively multimodal MoE models where Scout carries a 10M-token context window and Maverick outruns GPT-4o on multimodal benchmarks. Anyone building SME automation now has a credible foundation layer they can host themselves, no vendor contract required. The foundation tier, long dominated by closed frontier labs, is quietly becoming infrastructure rather than a product.

While the foundation opened up, the commercial side tightened. Anthropic blocked third-party frameworks like OpenClaw from routing through Pro and Max subscriptions, forcing teams that had been piping subscription quota into custom agents onto separate extra-usage billing. OpenAI shuttered the Sora app after $1M/day operating costs failed to justify the usage, a reminder that consumer AI products still struggle to reach positive unit economics. The same week, OpenAI closed a $122B round at an $852B valuation and moved Codex to pay-as-you-go for teams. The pressure this year is on the middle, where hobbyist arbitrage and consumer apps get squeezed, while the foundation opens up and the enterprise tier locks down.

A leaked model and a commerce protocol signaled where agents are heading. Anthropic accidentally revealed Claude Mythos, a tier above Opus with dramatically higher coding and cybersecurity scores, along with a possible IPO as early as October. OpenAI launched Agentic Commerce Protocol with Walmart and nine retailers, turning ChatGPT into a full shopping agent with account linking and payments. Anthropic's Economic Index showed experienced users tackle fundamentally harder work, not just the same tasks faster. Mistral shipped Voxtral TTS, a 4B open-weight model matching ElevenLabs on naturalness.

Small models got serious and agent security got its first real scare. OpenAI shipped GPT-5.4 nano and mini for cheap subagent routing. NVIDIA open-sourced a 30B MoE running on just 3B active parameters. Researchers demonstrated ClawWorm, the first self-propagating worm that spreads across production agent frameworks, a wake-up call for anyone deploying agents without sandboxing.

Agents started moving money and managing calendars. Mastercard completed the first live authenticated agentic payment in Singapore, fully end-to-end with no human in the loop. ChatGPT gained write access to Google and Microsoft apps for drafting emails and scheduling. Open-source GUI agents hit production quality on both desktop and mobile, making browser automation accessible to anyone building custom workflows.

Agents plugged into enterprise tools while safety caught up. Anthropic launched enterprise agent plugins connecting Claude directly into Excel, PowerPoint, and domain tools. Google shipped Agent Step in Opal for no-code agentic workflows. OpenAI introduced Lockdown Mode to block prompt injection in agent sessions, setting a new baseline for production safety. A new framework stabilized multi-step agentic RL training, solving the collapse problem that made agent training unreliable.

The cost barrier for production agents dropped sharply. Claude Sonnet 4.6 delivered Opus-class coding and agent performance at Sonnet pricing. Anthropic revealed their engineers now spend 70%+ of work reviewing AI output rather than writing new code. A multimodal RAG framework unified docs, images, and tables into a single retrieval layer for knowledge systems.

Real-time coding got dramatically faster and GUI agents crossed the production threshold. GPT-5.3-Codex-Spark hit 1000+ tokens/sec on Cerebras, cutting IDE agent latency by 80%. GUI-Owl 1.5 and Mobile-Agent-v3 reached state-of-the-art on both desktop and mobile benchmarks, making open-source GUI automation production-ready.

New coding benchmarks fell and enterprise adoption shifted from pilots to production. GPT-5.3-Codex set new SWE-Bench records as OpenAI's most capable coding model. Mistral shipped Voxtral Transcribe 2 with sub-200ms streaming and speaker diarization. AI Expo 2026 confirmed the pattern: enterprises are moving from experimental pilots to production agent deployments, with governance and data quality as the new bottlenecks.

A breakthrough in context handling could remove the biggest bottleneck for long-document agents. MIT published a recursive language model framework that lets any LLM process inputs 100x beyond its context window through recursive self-calls, eliminating context rot at 10M+ tokens.

The gap between AI integration and true delegation became the defining metric. Anthropic's Agentic Coding Trends report found developers integrate AI into 60% of work but fully delegate under 20%. That gap is where the opportunity sits for anyone building agent-first tools and services.