Gavin Guo (国振)


zguo0525 AT mit.edu
Google Scholar
Github
LinkedIn
Twitter


At WWDC 2024, Apple unveiled a flashier Siri, and the demo dazzled—yet the hard work started afterward: turning a staged moment into reliable, everyday behavior. As The Information reported, the problems were real. The simple lesson is that, in AI, teams that connect the product end to end and learn quickly outperform teams with bigger raw models but slower integration.

What I mean by a few terms, stated clearly:

Key Takeaways

Pricing note: ChatGPT normalized a $20 per month Plus plan and introduced a $200 per month Pro tier for heavier use (TechCrunch).

The Anatomy of an AI Challenge

Siri’s big demos got the buzz, but notification summaries delivered the deeper lesson. A feature meant to condense alerts exposed the hard problems that matter in practice: tone, context, and constraints.

Think about a family group chat full of inside jokes. A useful summary needs recent history, who said what, and the social tone. When signals stay on device and are not retained, the model sees only a short window. Layer on tight token budgets (limited input length) and strict latency budgets (limited time to respond), and truncation, flat tone, and missed nuance become common—even with a strong base model. Without weekly evaluations and updates, these errors persist.
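To make those budgets concrete, here is a minimal sketch (Python, with made-up names and limits, not Apple's actual pipeline) of fitting a chat thread into a fixed token budget before summarization. Whatever falls outside the window never reaches the model, no matter how strong that model is.

```python
# Hypothetical sketch: fit a group-chat thread into a fixed token budget
# before handing it to an on-device summarizer. Names and limits are
# illustrative, not Apple's actual pipeline.

def rough_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly one token per word.
    return len(text.split())

def trim_to_budget(messages: list[dict], max_tokens: int = 512) -> list[dict]:
    """Keep the most recent messages that fit; everything older is dropped."""
    kept, used = [], 0
    for msg in reversed(messages):                 # newest first
        cost = rough_token_count(msg["text"]) + 4  # small per-message overhead
        if used + cost > max_tokens:
            break                                  # older context is truncated away
        kept.append(msg)
        used += cost
    return list(reversed(kept))                    # restore chronological order

thread = [
    {"sender": "Mom", "text": "Don't forget Sunday!"},
    {"sender": "Alex", "text": "The 'lasagna incident' must never be repeated."},
    # ... hundreds of earlier messages full of inside jokes ...
]
window = trim_to_budget(thread, max_tokens=64)
# The summarizer only ever sees `window`; the jokes that give the thread
# its tone may already be gone before the model runs.
```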

Independent reviews backed this up: summaries were inconsistent—off on tone, weak with sarcasm, prone to context loss, and sometimes simply wrong (see Ars Technica). Apple paused some categories, then re‑introduced them with disclaimers and per‑app controls. The true constraint was context and iteration speed (how quickly teams can test and fix issues), not a missing billion parameters.

Internally, strategy swung between “Mini Mouse” (small, on‑device) and “Mighty Mouse” (large, cloud). Leaders later favored one big model—more cloud and more privacy tradeoffs—and delivery slowed.

The Human Element

The bigger blocker was organizational: software’s default was “ship,” while AI’s default was “explore.” Incentives, working rhythms, and ownership clashed. Federighi’s “Intelligent Systems” team trained models and shipped demos, sometimes bypassing Siri—fueling turf tensions. On Vision Pro, “Link” reduced scope when Siri could not meet the needed quality and speed. AI quality depends on a closed loop—data → evaluations → model tweaks → shipped behavior → new data—and split ownership breaks that loop. The result is predictable: hot demos, cold delivery.
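As a sketch of that loop, the toy cycle below uses entirely hypothetical function names; the substance is that one owner runs every step, so each week's field reports feed the next week's evaluations.

```python
# Hypothetical sketch of the closed loop described above:
# data -> evaluations -> model tweaks -> shipped behavior -> new data.
# Every name is an illustrative stand-in, not any team's real pipeline.

def weekly_cycle(generate, eval_cases, field_reports):
    """One turn of the loop. `generate` is the current model as a callable,
    `eval_cases` are (prompt, check) pairs, and `field_reports` are new
    cases harvested from shipped behavior."""
    # 1. Evaluations: find where the shipped model falls short.
    failures = [(p, check) for p, check in eval_cases if not check(generate(p))]

    # 2. Model tweaks: in a real system this is fine-tuning, prompt changes,
    #    or guardrail updates driven by the failures.
    if failures:
        generate = patch_model(generate, failures)

    # 3. Shipped behavior produces 4. new data, which becomes next week's evals.
    eval_cases = eval_cases + field_reports
    return generate, eval_cases

def patch_model(generate, failures):
    # Placeholder for whatever "fix the model" means this week.
    return generate
```

When ownership of any one arrow sits with a different team, the cycle stalls at that arrow, which is exactly the "hot demos, cold delivery" pattern.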

What This Means for AI

The next wave will be won by integration rather than leaderboard scores. One path is to start from scratch (as ChatGPT did). Apple chose the harder path: upgrading Siri for a billion people without breakage. Software teams prototyped fast while AI teams moved more cautiously. Apple first barred third‑party models, then partnered with OpenAI so Siri could hand off requests.
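For illustration, a handoff policy can be as small as the sketch below. The names, the confidence threshold, and the consent step are assumptions made for the example, not Apple's or OpenAI's actual interfaces.

```python
# Illustrative sketch of a hand-off policy between an on-device model and a
# third-party cloud model. Names, the 0.8 threshold, and the consent step
# are assumptions for the example only.

from dataclasses import dataclass

@dataclass
class LocalAnswer:
    text: str
    confidence: float   # how sure the on-device model is that it handled this

def route(request: str, on_device, cloud, user_consents) -> str:
    local = on_device(request)            # fast, private first attempt
    if local.confidence >= 0.8:           # good enough: stay on device
        return local.text
    if user_consents(request):            # ask before sending anything off-device
        return cloud(request)             # hand the request off
    return local.text                     # otherwise degrade gracefully

# Example wiring with stand-in models:
reply = route(
    "Summarize my unread messages",
    on_device=lambda r: LocalAnswer("Here's a short summary...", 0.55),
    cloud=lambda r: "A richer answer from the partner model.",
    user_consents=lambda r: True,
)
```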

Two implications:

What ChatGPT Got Right (and Why It Matters)

Here is ChatGPT’s playbook, in plain terms:

Why this matters for Siri: Apple optimizes for privacy, stability, and platform fit. OpenAI optimizes for speed, breadth, and model iteration. Both are valid. But if a feature depends on long‑tail context and social tone, you must either (a) collect the needed signals and iterate quickly, or (b) narrow the scope and tighten evaluations and on‑device limits. The middle—broad scope with heavy limits—tends to underdeliver.

A practical playbook for AI inside a mature product


These are my own views, not Apple’s. I’m grateful for my time in the AIML Residency and the chance to work with exceptional engineers.

Liked this? Follow on X: @Zhen4good. Collaborations/advising: zguo0525@mit.edu · LinkedIn

