At the Google I/O 2026 developer keynote, during the segment on the infrastructure behind the agentic era, Google revisited news it had first announced at Cloud Next a few days earlier: the eighth TPU generation is not one chip but two. It is a clear break with how Google has built the Tensor Processing Unit in the past.
TPU 8t: built to train frontier models
TPU 8t is built to compress the frontier-model development cycle from months to weeks. It balances raw compute, shared memory and inter-chip bandwidth to maximize power efficiency. A single pod scales to 9,600 chips and 121 ExaFLOPS of FP4 compute, with a stated 97 percent "goodput" (the share of time the system spends making productive training progress) and nearly three times the per-pod compute of Ironwood, the previous generation. It is the machine on which Google will train Gemini 3.5 Pro and its successors.
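The pod figures can be turned into a rough per-chip number with simple arithmetic. The sketch below assumes the 121 ExaFLOPS figure is peak FP4 throughput spread evenly across all 9,600 chips. It also treats the 97 percent goodput as a plain time fraction, which is a simplification: goodput is not the same thing as hardware utilization. The function names are illustrative only.

```python
# Back-of-the-envelope arithmetic using only the TPU 8t pod figures quoted above.
# Assumptions (not from Google): peak compute is spread evenly across chips,
# and goodput can be applied as a simple fraction of time.

POD_CHIPS = 9_600
POD_FP4_EXAFLOPS = 121.0
GOODPUT = 0.97  # stated share of time spent on productive training progress


def per_chip_petaflops(pod_exaflops: float, chips: int) -> float:
    """Peak FP4 throughput per chip in PetaFLOPS (1 ExaFLOPS = 1,000 PetaFLOPS)."""
    return pod_exaflops * 1_000 / chips


def goodput_adjusted_exaflops(pod_exaflops: float, goodput: float) -> float:
    """Peak pod compute discounted by goodput: a rough ceiling on useful training compute."""
    return pod_exaflops * goodput


if __name__ == "__main__":
    print(f"Peak FP4 per chip: {per_chip_petaflops(POD_FP4_EXAFLOPS, POD_CHIPS):.1f} PFLOPS")
    print(f"Goodput-adjusted pod: {goodput_adjusted_exaflops(POD_FP4_EXAFLOPS, GOODPUT):.1f} EFLOPS")
```

Under these assumptions each chip delivers about 12.6 PetaFLOPS of peak FP4. Discounting the pod by goodput leaves roughly 117 ExaFLOPS as an upper bound on compute spent on useful training. Real model FLOPs utilization will be lower than this.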
TPU 8i: built to serve agents
TPU 8i is a different beast: it is built for inference, specifically the kind of inference in which agents reason across many steps, call tools and keep state. It pairs 288 GB of high-bandwidth memory with 384 MB of on-chip SRAM, three times as much SRAM as the previous generation, so more of a reasoning model's working set stays on-chip instead of shuttling in and out of external memory. Google claims 80 percent better performance-per-dollar than Ironwood, which works out to roughly 1.8 times (nearly twice) the customer volume served for the same spend.
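The performance-per-dollar claim can be read in two equivalent ways: more volume at the same spend, or lower cost at the same volume. The sketch below assumes that served volume scales linearly with delivered performance and that spend stays fixed. Google does not state either assumption. The helper names are hypothetical.

```python
# Converting the stated 80 percent performance-per-dollar gain of TPU 8i over
# Ironwood into volume and cost terms. Assumption (not from Google): served
# volume scales linearly with performance, and spend is held constant.

PERF_PER_DOLLAR_GAIN = 0.80  # "80 percent better performance-per-dollar"


def volume_multiplier(gain: float) -> float:
    """Volume served for the same spend, relative to the previous generation."""
    return 1.0 + gain


def relative_cost_per_unit(gain: float) -> float:
    """Cost of serving one unit of volume, relative to the previous generation."""
    return 1.0 / (1.0 + gain)


if __name__ == "__main__":
    print(f"Volume at equal spend: {volume_multiplier(PERF_PER_DOLLAR_GAIN):.1f}x")
    print(f"Cost per unit of volume: {relative_cost_per_unit(PERF_PER_DOLLAR_GAIN):.0%} of Ironwood")
```

Under these assumptions, an 80 percent gain means 1.8 times the volume for the same budget. Equivalently, serving the same volume costs about 56 percent of what it did on Ironwood.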
Why it matters for developers
The architectural choice makes an explicit statement. Google has stopped thinking "one chip for everything" and has acknowledged that training a model and serving an agent are two different problems: training is an exercise in bandwidth and raw compute, while agentic inference is an exercise in memory and latency. For anyone building on the Gemini API or Antigravity, the practical outcome is that agents should get cheaper and faster over the course of the year. For Google, it is the same logic Apple followed when it gave machine-learning workloads a dedicated Neural Engine alongside the CPU: specialize the silicon once the workload stabilizes.