Designing Embedded AI Experiences Inside ChatGPT and Claude

What I learned leading design for TurboTax's embedded AI experiences during one of the first large-scale launches inside major AI platforms

Case study

Last October, I started leading design work for a new category of TurboTax experience: embedded AI applications running directly inside ChatGPT and Claude. The work began as a relatively small MVP ahead of tax season, with the first release launching in December. We followed with a larger V2 release in January, and then a major expansion in April alongside the Claude Connector launch.

Over the course of those months, the product evolved from a lightweight embedded experience into something much deeper: a connected tax preparation workflow that allowed users to begin preparing their taxes almost entirely from inside AI platforms they were already using daily.

Users could:

Connect accounts
Go through personalized conversational tax interviews
Upload and extract tax documents
Generate dynamic filing checklists
Synchronize data directly back into the core TurboTax experience

It was one of the first times TurboTax had operated natively inside major consumer AI ecosystems at this level of depth, and the work ultimately went on to win a Webby Award. More importantly, though, it gave our team an unusually early look into what designing embedded AI applications inside systems like ChatGPT and Claude actually feels like in practice.

One product, many environments

One of the most interesting aspects of the work was that we were not designing a traditional standalone product. We were designing an embedded application that had to exist coherently across multiple AI ecosystems simultaneously. Using the UI kits and platform guidance provided by OpenAI and Anthropic, we built what was effectively a centralized MCP-powered application layer for TurboTax. The core experience remained structurally similar regardless of where it launched, but through tokens, platform-specific theming, and custom component layers, the application could dynamically adapt itself to feel native inside ChatGPT or native inside Claude while still operating from the same underlying system architecture.
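To make the token layering concrete, here is a minimal Python sketch of one shared token set with per-platform overrides. It is an illustration under stated assumptions rather than the production system: the token names, the values, and the `resolve_tokens` and `card_style` helpers are all hypothetical.

```python
# Hypothetical shared design tokens; names and values are illustrative only.
BASE_TOKENS = {
    "color.surface": "#FFFFFF",
    "color.text": "#1A1A1A",
    "radius.card": "8px",
    "font.family": "system-ui",
}

# Hypothetical per-platform overrides layered on top of the shared set.
PLATFORM_OVERRIDES = {
    "chatgpt": {"radius.card": "12px"},
    "claude": {"color.surface": "#FAF9F5", "font.family": "serif"},
}


def resolve_tokens(platform: str) -> dict:
    """Merge the shared tokens with one platform's overrides."""
    if platform not in PLATFORM_OVERRIDES:
        raise ValueError(f"unknown platform: {platform}")
    # Later keys win, so platform values replace shared defaults.
    return {**BASE_TOKENS, **PLATFORM_OVERRIDES[platform]}


def card_style(platform: str) -> str:
    """Build an inline style string for the same card on a given platform."""
    t = resolve_tokens(platform)
    return (
        f"background:{t['color.surface']};color:{t['color.text']};"
        f"border-radius:{t['radius.card']};font-family:{t['font.family']}"
    )


if __name__ == "__main__":
    for name in PLATFORM_OVERRIDES:
        print(name, "->", card_style(name))
```

The point of the design is that the component logic never branches on the platform. Only the resolved token values differ, so a change to a shared token reaches every deployment.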

In practice, it was one product with multiple deployments, multiple orchestration environments, and multiple interaction models depending on where the user entered the experience. The application also had to support bidirectional data synchronization between systems. Users were authorizing secure connections between TurboTax and the AI platforms themselves, allowing conversational workflows to persist information, extract documents, generate preparation states, and synchronize that information back into the primary TurboTax product.

Figure: The TurboTax connector in the ChatGPT app directory. Users could connect directly before entering the embedded experience.

Building at the speed of the platforms

The work moved faster than anything I had shipped before. A first release was in market within weeks of kicking off, followed by two major expansions before the end of tax season. Moving that fast required rethinking how design and engineering operated together, and it meant the team had to adopt the same kind of agentic workflow we were building products on top of.

Rather than treating design and engineering as sequential handoffs, we ran them in parallel. The loop was tight: gather sources and context, prompt and iterate to generate and refine, build and publish to live environments, then share and align before cycling again. Every pass through that loop produced something shippable.

Figure: The agentic build loop: Gather, Prompt / iterate, Build / publish, Share / align. The same agentic model we were designing for became the way we worked.

Figma became more than a design tool; it became the MCP entry point. Connecting the component library directly via Figma MCP turned live components and tokens into queryable context. No exporting, no redlining, no translation layer between what was designed and what could be generated. The library was both the source material and the generation scaffold.

I also contributed front-end components directly to the build. Rather than handing off specs and waiting, I worked alongside the engineering team in code — writing and shipping real widgets for the Claude MCP app, not concepts for someone else to implement. The line between design and engineering stopped being meaningful. What mattered was that the right artifact shipped.

Figure: TT Patterns → Claude via Figma MCP. Live components and tokens as queryable context, no exporting needed. The library is both the source material and the MCP entry point for generation.

One structural decision that made this possible was building a single Figma component set that combined the ChatGPT and Claude platform patterns. Rather than maintaining two separate design files that would inevitably drift apart, I unified the UI kit into one component set with platform-aware theming baked in, so any change propagated to both deployments at once and both platforms could be designed for and reviewed side by side.

Figure: A single Figma component set covering both ChatGPT and Claude patterns. One change, both platforms updated at once.

When the platform owns the orchestrator

At a high level, running one product across multiple AI platforms sounds relatively straightforward. In reality, it introduced an entirely new category of product and UX challenges that I do not think the industry fully appreciates yet. Once you begin designing applications inside systems like ChatGPT and Claude, you are no longer fully designing the orchestration layer yourself. OpenAI and Anthropic own the orchestrator, and that changes almost everything about how product design behaves.

Traditional software assumes a relatively controlled environment. Product teams usually own the interface, the navigation model, the interaction sequencing, and most of the surrounding system behavior. Inside AI ecosystems, many of those assumptions no longer hold. The conversation itself becomes the navigation layer. The AI platform partially controls discovery, invocation, memory, rendering behavior, and context retention. Your application stops behaving like a standalone destination and instead becomes one capability among many inside a much larger intelligence environment.

That fundamentally changes the nature of the design problem. Users fluidly move between the native model, embedded applications, uploaded documents, conversational context, generated outputs, and external systems without necessarily perceiving hard boundaries between them. As a result, the work starts looking less like traditional screen design and more like designing orchestration systems. You are shaping continuity, conversational state transitions, interoperability, dynamic interfaces, and trust boundaries that exist across ecosystems you do not fully control.

One of the largest conceptual shifts for me personally was realizing how much the orchestrator itself becomes part of the user experience. In traditional software, if you carefully design a workflow, users generally experience that workflow consistently. Inside AI ecosystems, orchestration itself becomes probabilistic. The same user intent may surface differently depending on conversational history, memory state, model interpretation, invocation timing, or competing tools inside the ecosystem. Product teams are no longer designing fully deterministic flows. They are designing adaptive systems that cooperate with another intelligence layer operating above them.

Designing inside these ecosystems increasingly feels less like designing software and more like designing protocols between multiple layers of intelligence.

This creates a very unusual dynamic because parts of the experience become emergent rather than explicitly authored. The platform determines how apps are surfaced, how tools are called, how memory behaves, how transitions occur between systems, and how much conversational continuity exists from one interaction to the next.

Concrete platform constraints

There are also surprisingly concrete UX limitations that emerge from this model:

In ChatGPT, product teams do not fully control the canvas behavior if a user opens it. If there is critical information you always want visible up front, there may not be a deterministic way to guarantee its visibility — the platform ultimately controls how the canvas is rendered and expanded.
In Claude, the now-familiar “human in the loop” approval cards are similarly orchestrator-controlled. As a product team, you do not fully own how or when those interaction patterns appear.

There are moments where you cannot deterministically control how users answer certain questions or progress through sensitive workflows, even when those workflows are deeply connected to your application logic. Those constraints fundamentally change how you think about product design. A large part of the work becomes designing around platform restrictions, orchestration constraints, and interaction systems you do not entirely own. Instead of fully controlling experiences, you are often designing resilient systems that can adapt to different orchestration behaviors while still maintaining continuity and trust.

Figure: Document upload embedded inside Claude: drag-and-drop, document recognition, and live progress within a single artifact the platform controls.

Conversation didn't replace interfaces

One of the biggest assumptions I revised during this work was that conversational interfaces would simply replace traditional interfaces outright. Conversation is incredibly effective for onboarding, ambiguity reduction, contextual intake, organization, and guidance. Users naturally prefer conversational interaction when they are uncertain, unfamiliar with a workflow, or trying to navigate complexity. But once workflows become denser, more stateful, more document-heavy, or more verification-oriented, users begin demanding structure again.

That does not mean conversation failed. It simply means human cognition still benefits from:

Visibility
Side-by-side review
Structured confirmation
Previews
Persistent state
Auditability

Figure: Dynamic filing checklist, with an empty state alongside a populated state. Users wanted structured verification, not just conversation.

One of the strongest patterns we observed was that users loved conversational intake but still wanted highly structured verification before committing actions — especially in workflows involving money, legal implications, identity, or irreversible outcomes. The future likely is not “everything becomes chat.” It feels much more likely that conversational orchestration will coexist with interfaces that dynamically materialize around the conversation itself depending on context and intent.
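One way to picture "conversational intake, structured verification" is a draft that accepts values freely from the conversation but refuses to commit until each one has been explicitly confirmed. The sketch below is a hypothetical Python illustration of that pattern. `DraftAction`, `UnconfirmedFieldsError`, and the field names are assumptions made for this example, not part of any TurboTax, ChatGPT, or Claude API.

```python
from dataclasses import dataclass, field


class UnconfirmedFieldsError(Exception):
    """Raised when a commit is attempted before every field is confirmed."""


@dataclass
class CapturedField:
    label: str
    value: str
    confirmed: bool = False


@dataclass
class DraftAction:
    name: str
    fields: dict = field(default_factory=dict)

    def capture(self, key: str, label: str, value: str) -> None:
        # Values can arrive loosely from conversation. Re-capturing a key
        # replaces the field, which also resets its confirmation.
        self.fields[key] = CapturedField(label, value)

    def confirm(self, key: str) -> None:
        # Called only from an explicit, structured review step.
        self.fields[key].confirmed = True

    def pending(self) -> list:
        """Labels of fields the user has not yet confirmed."""
        return [f.label for f in self.fields.values() if not f.confirmed]

    def commit(self) -> dict:
        """Return the confirmed values, or refuse if any remain unreviewed."""
        missing = self.pending()
        if missing:
            raise UnconfirmedFieldsError(", ".join(missing))
        return {key: f.value for key, f in self.fields.items()}
```

A short usage example: after `draft.capture("wages", "Wages", "52,000")`, calling `draft.commit()` raises `UnconfirmedFieldsError` until `draft.confirm("wages")` has run. The structured review step, not the conversation, is what unlocks the action.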

Continuity is the new expectation

Another thing that became immediately obvious during research was how quickly embedded AI systems change user expectations around continuity. As soon as the system begins remembering context, organizing information, understanding intent, and reducing friction, users start expecting that continuity everywhere. The moment the system loses context or breaks continuity, the experience suddenly feels fragmented — not necessarily because the technology is broken, but because the user's mental model has already shifted from “I'm using tools” to “I'm operating inside an intelligent environment.”

That transition happens remarkably fast. One of the biggest UX challenges we encountered was not whether users liked the AI interactions themselves. Most did. The challenge was what happened when continuity stopped. Users expected the intelligence layer to persist across onboarding, uploads, interviews, transitions, filing workflows, and product boundaries. From a technical perspective, those boundaries are understandable. From a user perspective, they increasingly feel artificial because the intelligence layer creates an expectation of seamlessness that traditional systems were never designed to support.

Trust in probabilistic systems

Traditional software is largely deterministic. Embedded AI systems are probabilistic. Users are constantly trying to understand:

What the AI knows
What it inferred
What was verified
Which system is authoritative
How confident the output is
Who is accountable if something goes wrong

Those questions become especially important in workflows involving finance, healthcare, legal systems, or identity. The challenge becomes less about designing interactions and more about designing confidence calibration. Users need to understand uncertainty, provenance, verification, accountability, and the boundaries between systems that may all appear unified from the outside.
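As a rough illustration of designing for provenance and verification, the Python sketch below attaches a source and a verified flag to every value, then derives the label a user would see. It is a sketch under assumptions: the `Source` categories, the `Fact` shape, and the label wording are hypothetical and are not drawn from the shipped product.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    # Hypothetical provenance categories, phrased the way a user would read them.
    USER_STATED = "stated by you"
    DOCUMENT = "read from an uploaded document"
    INFERRED = "inferred by the assistant"


@dataclass(frozen=True)
class Fact:
    name: str
    value: str
    source: Source
    verified: bool = False


def trust_label(fact: Fact) -> str:
    """Describe where a value came from and whether anyone has checked it."""
    status = "verified" if fact.verified else "needs review"
    return f"{fact.name}: {fact.value} ({fact.source.value}, {status})"


def review_queue(facts: list) -> list:
    """Unverified facts, with model inferences listed first."""
    unverified = [f for f in facts if not f.verified]
    # sorted() is stable, so facts keep their original order within each group.
    return sorted(unverified, key=lambda f: f.source is not Source.INFERRED)


if __name__ == "__main__":
    facts = [
        Fact("Employer", "Acme Corp", Source.DOCUMENT, verified=True),
        Fact("Wages", "52,000", Source.DOCUMENT),
        Fact("Filing status", "Single", Source.INFERRED),
    ]
    for fact in review_queue(facts):
        print(trust_label(fact))
```

Keeping provenance on the data rather than in the interface means any surface, whether a chat message or a structured checklist, can answer "what was inferred" and "what was verified" the same way.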

Why open-ended prompting often falls short

One of the clearest lessons from the work was how much the industry currently overestimates open-ended prompting. Blank prompt boxes assume users already possess:

Vocabulary
Confidence
Process understanding
Domain knowledge
Awareness of what the system is capable of doing

Many users do not. Especially in high-complexity workflows, what consistently performed better was structured guidance, contextual next steps, conversational scaffolding, progressive disclosure, intelligent suggestions, and dynamically generated interfaces around user intent.

Good AI UX often reduces the amount of prompting required rather than increasing it.

A new design category

After spending the last year working in this space, I genuinely believe embedded AI application design is becoming its own category. It sits somewhere between systems design, conversational UX, orchestration design, platform design, and traditional product design, but it is not fully any one of them. The work increasingly involves orchestrating intelligence, managing probabilistic systems, designing continuity, balancing automation with oversight, shaping trust boundaries, and coordinating dynamic interfaces across ecosystems you do not fully control.

Figure: Design delivery overview: component specifications and screen flows across ChatGPT and Claude. One system, two platform deployments.

What makes this moment particularly exciting is that very few established patterns exist yet. Most teams are still discovering these interaction models in real time. It feels far less like optimizing mature UX conventions and much more like helping define an entirely new computing paradigm while it is still forming.