LiteRT-LM is Google AI Edge's inference runtime for Gemma 4 models on local devices. Already running inside Chrome, ChromeOS, and Pixel Watch, it arrives at I/O 2026 as a developer-available stack — with two new platforms: iOS (Swift API) and the open web (JavaScript API + WebGPU).
The numbers that matter
Running Gemma 4 E2B without Multi-Token Prediction:

- Android (OpenCL GPU): **52 tokens/sec** decode, tested on Samsung S26 Ultra
- iOS (Metal GPU): **56 tokens/sec**, tested on iPhone 17 Pro
- Web (WebGPU, Chrome): **76 tokens/sec** on MacBook Pro 2024 with Apple M4 Max
Enable Multi-Token Prediction (MTP), a speculative decoding architecture integrated into the pipeline, and throughput climbs by up to **2.2x**, according to published benchmarks measured on a Samsung S25 Ultra.
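To make these rates concrete, here is a small back-of-the-envelope sketch that converts tokens/sec into per-token latency and into the time needed for a full response. The 256-token response length is a made-up example value, and applying the 2.2x figure to every platform is an assumption for illustration only: it is an "up to" ceiling reported on a different device, not a measured result for these three setups.

```python
# Decode rates quoted in the article (tokens/sec, Gemma 4 E2B, MTP disabled).
DECODE_TPS = {"android_opencl": 52.0, "ios_metal": 56.0, "web_webgpu": 76.0}

# "Up to" speedup quoted for MTP. Treating it as a ceiling on every platform
# is an assumption made for illustration, not a benchmark result.
MTP_MAX_SPEEDUP = 2.2

# Hypothetical response length, chosen only for the example.
RESPONSE_TOKENS = 256


def ms_per_token(tokens_per_sec: float) -> float:
    """Average decode latency per token, in milliseconds."""
    return 1000.0 / tokens_per_sec


def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Time to decode `tokens` tokens at a steady rate (prefill excluded)."""
    return tokens / tokens_per_sec


for name, tps in DECODE_TPS.items():
    baseline = seconds_for(RESPONSE_TOKENS, tps)
    ceiling = seconds_for(RESPONSE_TOKENS, tps * MTP_MAX_SPEEDUP)
    print(
        f"{name}: {ms_per_token(tps):.1f} ms/token, "
        f"{baseline:.2f} s baseline, {ceiling:.2f} s at the 2.2x ceiling"
    )
```

The estimate ignores prefill time and assumes a steady decode rate, so real end-to-end latency will be higher.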
How Multi-Token Prediction works
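Classic LLM inference is memory-bandwidth bound: the processor spends most of its time streaming model weights from memory rather than computing. MTP attacks this with speculative decoding: a small drafter proposes several tokens ahead, and the main model verifies them in a single pass, so one sweep over the weights can yield more than one token. LiteRT-LM runs both the main Gemma 4 model and the MTP drafter on the same hardware IP (e.g. the GPU), so the shared KV cache stays in local memory. That means no cross-IP synchronization penalties and no redundant transfers.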
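The draft-and-verify idea behind speculative decoding can be sketched in a few lines. This is a generic greedy variant with toy stand-in models, not LiteRT-LM's implementation: `target`, `drafter`, `speculative_decode`, and `greedy_decode` are hypothetical names, and the "models" are simple functions over integer tokens. A real runtime would score all drafted positions in one batched forward pass; here that pass is only counted.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(
    target: NextToken, drafter: NextToken, prompt: List[Token],
    max_new: int, k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify. Returns (tokens, number of target passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap drafter proposes k tokens autoregressively.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(out + draft))
        # 2. The target checks every drafted position. In a real runtime this
        #    is a single batched pass, so it is counted once here.
        passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + draft[:i])
            if expected == draft[i]:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # keep the target's correction
                break
        else:
            # All k drafts accepted: the same pass also yields a bonus token.
            accepted.append(target(out + draft))
        out.extend(accepted)
    return out[: len(prompt) + max_new], passes


# Toy models: the target counts modulo 10; the drafter agrees except after 6.
def target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def drafter(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 6 else (ctx[-1] + 1) % 10


tokens, passes = speculative_decode(target, drafter, [0], max_new=12, k=4)
baseline = greedy_decode(target, [0], max_new=12)
print(tokens == baseline, passes)
```

In this toy run the output matches plain greedy decoding exactly, but the target is invoked 3 times instead of 12. The saving depends entirely on how often the drafter agrees with the target, which is why quoted speedups are given as "up to" figures.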
Session management and agentic capabilities
LiteRT-LM supports native session save and restore (serialized KV cache), useful both for user continuity and for reducing compute: a resumed session skips the heavy prefill phase. The runtime also supports:

- **Thinking Mode** (Gemma 4 native): internal reasoning scratchpad before the final output
- **Constrained Decoding**: enforcing JSON schemas or output grammar, handy in agents
- **Native function calling**: the runtime pauses execution, returns a structured tool-call to the app layer, and resumes when the result comes back
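The pause, tool-call, resume cycle described for function calling can be illustrated with a minimal app-layer loop. Everything here is a hypothetical stand-in: `ScriptedRuntime`, `ToolCall`, `FinalText`, `run_turn`, and the `get_battery_level` tool are invented for the example and are not the LiteRT-LM API. The scripted runtime simply requests one tool call and then answers from the result.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class ToolCall:
    """Structured request handed to the app layer while generation is paused."""
    name: str
    arguments: dict


@dataclass
class FinalText:
    """Final answer; no further tool calls are needed."""
    text: str


class ScriptedRuntime:
    """Hypothetical stand-in for an on-device runtime (NOT the LiteRT-LM API)."""

    def __init__(self) -> None:
        self.history: List[dict] = []

    def send(self, role: str, content: str) -> Union[ToolCall, FinalText]:
        self.history.append({"role": role, "content": content})
        if role == "user":
            # Pretend the model decided it needs a tool before answering.
            return ToolCall("get_battery_level", {"unit": "percent"})
        result = json.loads(content)
        return FinalText(f"Battery is at {result['level']}%.")


def run_turn(
    runtime: ScriptedRuntime, user_text: str,
    tools: Dict[str, Callable[..., dict]], max_steps: int = 4,
) -> str:
    """Drive one user turn, executing tool calls until a final answer arrives."""
    step = runtime.send("user", user_text)
    for _ in range(max_steps):
        if isinstance(step, FinalText):
            return step.text
        handler = tools[step.name]            # look up the requested tool
        result = handler(**step.arguments)    # the app layer runs it
        step = runtime.send("tool", json.dumps(result))  # resume generation
    if isinstance(step, FinalText):
        return step.text
    raise RuntimeError("tool-call budget exhausted")


# Hypothetical tool registered by the app.
tools = {"get_battery_level": lambda unit: {"level": 81, "unit": unit}}
runtime = ScriptedRuntime()
answer = run_turn(runtime, "How much battery is left?", tools)
print(answer)
```

Capping the loop with `max_steps` is a design choice for the sketch: it keeps a misbehaving model from requesting tools indefinitely on a battery-powered device.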
Lean memory footprint
Gemma 4 E2B (~2.58 GB on disk) runs with a physical footprint of just **607 MB on Apple mobile CPUs** thanks to XNNPACK's weight caching. Image and audio encoders are loaded dynamically only when a task requires them.
Why it matters for developers
The point isn't just speed: LiteRT-LM brings a complete production pattern — local inference + function calling + session continuity + cloud fallback — across all three relevant environments (Android, iOS, web) from a single library. For anyone building agents that need to work offline or with minimal latency, this meaningfully lowers the barrier to entry.
The code is open source on GitHub. The desktop CLI and the Google AI Edge Gallery mobile app are already available for testing.