Fortune 500 insurer (innovation team) · 2026 · UX & service design lead

The prototype that proved the brief wrong

A working AI service prototype that replaced a sequential intake interview with eight task-shaped tools — and made a funded beta defensible.

The case in seven moves

  1. Eight months in, and a decision no mockup could answer
  2. Building an AI that had to ask the right questions in the right order
  3. Seven sessions in, the structure was the problem
  4. From interview pipeline to eight task-shaped tools
  5. The executive stopped evaluating and started using it
  6. What we got wrong about what caregivers needed
  7. An interaction model the next phase inherited
12 caregivers tested across two structural versions
v2 model adopted as production specification
Beta greenlit and funded for fall 2026

An innovation team had spent eight months developing an AI caregiving service. The concept had legs. But a concept is not a service, and positive reactions to a concept are not signal from real users. This case is about what it took to generate that signal — and what happened when the signal was sharper than expected. The prototype exposed a design assumption so fundamental that fixing it required rebuilding the interaction model from scratch, mid-study, after seven sessions. The rebuilt version converted the executive holding the budget. The test of a prototype isn't whether it proves the concept works. Sometimes the test is whether it finds the thing that doesn't.

Chapter 01

Eight months in, and a decision no mockup could answer

The team had spent eight months developing a concept for an AI-powered caregiving benefit aimed at employed family caregivers — people working full time while managing the care of an aging parent. Foundational research existed. Archetypes had been built. Concept testing had shown positive reactions. But the team was approaching a real decision point: was there enough signal to justify the next step — building the actual technology, running a two-month beta, committing real resources to a product that hadn't yet been proven?

To answer that question, you need something closer to the actual experience. Static mockups can't tell you whether caregivers will trust an AI, engage with its tone, or find its questions useful rather than invasive. The prototype had to behave like the service. Only then could you learn what was actually wrong with it. I came in eight months after the foundational work was done. My scope: design and build the prototype, test it with real caregivers, synthesize what we learned.

image: service blueprint overview — blurred/redacted version showing the full arc from caregiver onboarding through forecast through ongoing guidance
The service blueprint — designed to run on a custom GPT, not a wireframe. The prototype had to behave like the real thing.

Chapter 02

Building an AI that had to ask the right questions in the right order

The first instinct was to go deep on medical conditions — building a knowledge base around common comorbidities and their projected care trajectories. A few iterations in, I stopped. The custom GPT was a predictive language model sitting on top of ChatGPT, not a statistical engine. Medical specificity at that level was technically wrong for the medium and unnecessary for the question. The forecast didn't need to be statistically precise; it needed to be close enough to be reactable. I cut the comorbidity layer and refocused on what mattered: what to ask, in what order, with what rules.

The vocabulary work proved as consequential as the architecture. Early testing with colleagues before the study surfaced two hard limits. Clinical language that named what someone was afraid of before they were ready stopped people cold. And the word "caregiver" itself was a barrier — many people managing a parent's finances and medical appointments didn't identify with the label yet. Both became hard-coded constraints. I organized the question set into tiers: priority questions always asked, secondary questions that unlocked based on answers, context questions that filled the forecast without adding to the emotional load.
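A minimal sketch of how that tiering could be expressed, assuming a simple question registry in Python: the three tiers and the idea of hard-coded vocabulary limits come from the work described above, while the example questions, unlock rules, and banned-term list are hypothetical stand-ins rather than the production content.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: the tiers (priority / secondary / context) mirror
# the case study; the questions, unlock rules, and banned terms are hypothetical.

BANNED_TERMS = {"dementia", "decline", "caregiver"}  # hard-coded vocabulary limits

@dataclass
class Question:
    id: str
    text: str
    tier: str                                        # "priority" | "secondary" | "context"
    unlocked_by: dict = field(default_factory=dict)  # prior answers that reveal it

    def passes_vocabulary_check(self) -> bool:
        # Never surface wording that names what someone may not be ready to hear.
        return not any(term in self.text.lower() for term in BANNED_TERMS)

QUESTIONS = [
    Question("living", "Does your parent live with you, nearby, or farther away?", "priority"),
    Question("daily_help", "Are you helping with day-to-day tasks right now?", "priority"),
    Question("finances", "Are you also handling bills or appointments?", "secondary",
             unlocked_by={"daily_help": "yes"}),
    Question("work", "How flexible is your own work schedule?", "context"),
]

def next_questions(answers: dict) -> list:
    """Priority questions are always asked; secondary questions unlock from
    earlier answers; context questions fill the forecast quietly."""
    upcoming = []
    for q in QUESTIONS:
        if q.id in answers or not q.passes_vocabulary_check():
            continue
        if q.tier == "secondary" and not all(
            answers.get(k) == v for k, v in q.unlocked_by.items()
        ):
            continue
        upcoming.append(q)
    return upcoming
```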

The forecast didn't need to be statistically precise — it needed to be close enough to be reactable.

— Design principle for the prototype build, January 2026

Chapter 03

Seven sessions in, the structure was the problem

The original plan was twelve caregivers, one version of the prototype. Several sessions in, it was clear that wasn't going to work. The v1 system moved through six sequential phases: welcome, questionnaire, more questionnaire, forecast, guidance, return. By the time participants reached the forecast, they had given a great deal and received nothing. One participant said it before the forecast even appeared: "I just gave you all that information." The disappointment was already in the room.

A second problem was structural. The GPT presented a numbered list of screening questions. The conversation continued. A second numbered list appeared further down. A participant answered "1" — responding to the second list — but the GPT interpreted it against the first and surfaced a condition that didn't apply. The conversational format had a vulnerability that sequential design couldn't fix.

The last session of the first cohort pointed at something deeper. A participant's mother was in the final stage of dementia. The system kept moving through the questionnaire: what kind of help was her mother getting, who else was involved, what had changed recently. Each question took her further from what she needed in that moment. We aborted the session. v1 had no graceful path for a user already most of the way through a caregiving journey. The design had assumed a user looking forward. That assumption had to change.

image: questionnaire tiers diagram — three-tier architecture (priority / secondary / context) with branching logic. Or: v1 conversation flow showing the sequential pipeline and where it broke.
v1 architecture — six sequential phases, value deferred to the end. The structure that failed seven caregivers before being redesigned.

Chapter 04

From interview pipeline to eight task-shaped tools

v2 wasn't a re-sequenced v1. It was a different interaction model.

v1 asked. v2 helped.

The new version replaced the six-phase pipeline with eight task-shaped modes, each designed to deliver an actionable artifact within about ten minutes. The Dispatcher: triage what needs attention first. The Money Map: model what care will cost. The Translator: build a prep sheet for a hard conversation with a doctor. The Rehearsal Studio: practice that conversation before it happens. Each mode delivered value before asking for more input.

The forecast moved from the destination of a long sequential intake to a global action available at any point. A caregiver could complete two modes, type "forecast," and receive the same comprehensive output v1 had built toward — but only after the system had earned the right to ask for the additional inputs it needed. Five non-negotiable rules governed every mode, each operationalized from a specific v1 observation.
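For readers who want the mechanics, a minimal sketch of the v2 shape under stated assumptions: the four mode names are the ones described above, while the artifact descriptions, input fields, and the forecast's required inputs are hypothetical stand-ins, not the production specification.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the v2 shape: each mode is a self-contained tool that
# returns an artifact, and the forecast is a global action that asks only for
# what completed modes have not already supplied. Mode names come from the
# case study; fields and the forecast's required inputs are hypothetical.

@dataclass
class Mode:
    name: str
    artifact: str                                 # what the caregiver walks away with
    inputs_gathered: set = field(default_factory=set)

MODES = [
    Mode("The Dispatcher", "triage list of what needs attention first", {"care_stage"}),
    Mode("The Money Map", "cost model for the likely care scenario",
         {"budget_range", "living_situation"}),
    Mode("The Translator", "prep sheet for a hard doctor conversation", {"care_stage"}),
    Mode("The Rehearsal Studio", "practice run of that conversation", set()),
]

FORECAST_NEEDS = {"care_stage", "living_situation", "budget_range"}  # hypothetical

def forecast_gaps(completed_modes: list) -> set:
    """'forecast' can be typed at any point; whatever the completed modes have
    not yet supplied is what the forecast asks for before it renders."""
    gathered = set()
    for mode in completed_modes:
        gathered |= mode.inputs_gathered
    return FORECAST_NEEDS - gathered
```

The point the sketch makes is the direction of the gating: each mode stands alone and returns its artifact immediately, and the forecast, rather than sitting at the end of a pipeline, asks only for what the completed modes have not already earned.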

image: interaction model comparison — v1 (sequential pipeline, 6 phases, value at end) vs. v2 (modal tool, 8 task-shaped modes, value immediate)
v1 → v2: not re-sequenced but rebuilt. The change was at the model level, not the flow level.

Chapter 05

The executive stopped evaluating and started using it

The second cohort ran on v2. The difference was immediate. Participants leaned in. In almost every session, we had to prompt them to wrap up because they kept asking the AI more questions. Nobody asked what they were getting in exchange for sharing personal information. The value exchange resolved itself.

After the study, the executive who would decide whether the concept justified further investment asked to try the rebuilt prototype. He used his own family's caregiving experience from several years earlier. Within minutes he stopped evaluating and started using it — moving through modes, asking for a forecast, asking follow-up questions about doctors, about situations beyond memory loss. We had to tell him we needed to wrap up. A recruited participant engaging deeply is signal. The person controlling the budget engaging with it as a real user — unable to disengage after the scheduled session ended — is a different kind of evidence.

He also exposed the forecast's biggest remaining content problem: his family's situation had involved ambiguous early symptoms, and the forecast projected a five-year trajectory of progressive deterioration without acknowledging that ambiguity. "This would have freaked us out at the time." I flagged it as a content problem the next phase has to solve at the model level.

The reframe

What we got wrong about what caregivers needed

We built v1 assuming an AI caregiving service should work the way a thorough intake interview works: gather everything first, then deliver value at the end. The assumption was architecturally reasonable — it's how most onboarding flows are designed. What we didn't account for was who our user actually was. Caregivers are not patients about to receive care; they are people already managing a crisis, carrying a daily cognitive and logistical load, perpetually short on time and emotional reserves. An onboarding that asks them to give before it gives back doesn't just underperform — it replicates the dynamic they're already exhausted by. v2 fixed this at the model level, not the flow level. The difference matters: re-sequencing v1 would have produced a slightly less frustrating version of the same wrong thing.

What stays behind

An interaction model the next phase inherited

Three things came out of this work that weren't in the original brief. The interaction model from v2 — modal, artifact-first, value before intake — is now a design principle for the service going forward, not a one-time fix to a flow. The mode definitions, guidance content, and example outputs I produced are serving as the experience specification for the production AI system the client is building. And when the client suggested evaluating the next version internally rather than with real users, the argument for continued user testing had already been made: the next phase has user testing built into its plan.

The prototype is what made the beta defensible. A two-month beta with real users is a meaningful investment. It required proof that caregivers would actually engage, that the interaction model would hold, and that the service concept was more than positive reactions to an idea. The work produced that proof — and the proof came from building the real thing, discovering it was wrong, and rebuilding it before the study ended.