I recently conducted a document summarization study. One participant said, “This tool reads my mind.” Another said, “It missed the point entirely.”
Were they wrong? No. They were both right, because each participant was experiencing a different version of the product.
By design. No bugs. No A/B tests. No experimental flags.
Just a learning, adaptive, personalized system doing exactly what it was built to do, and masking consistent patterns along the way.
Welcome to the world of AI: enormous, dynamic, complex, and somehow always in beta.
Why AI research is uniquely messy
As researchers, we thrive on patterns. We design for consistency. We analyze for common themes.
Our old methods have worked because they’re based on deterministic technologies: you can expect a consistent output each time. These technologies are like a recipe: If you follow the exact same steps with the same ingredients, you’ll get the same result every time.
AI tools, on the other hand, are probabilistic. Outputs can differ from one interaction to the next because the experience is tailored by a number of variables (e.g., technological shifts, system adaptation, who is asking, how they’re asking, what happened in previous sessions, safety guardrails, location). These technologies are like rolling dice: Even if you roll the same die in the same way, you can’t predict the exact number that will come up. There’s a range of possible outcomes, and each roll is independent of the previous ones.
While this is delightful in theory, it presents a nightmare for research.
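To make the recipe-versus-dice contrast concrete, here is a toy sketch in Python. The `toy_llm` function is purely illustrative, not a real model API: the point is simply that a deterministic function returns the same output for the same input every time, while a sampling-based one does not.

```python
import random

# Deterministic "recipe": the same input always produces the same output.
def word_count(text: str) -> int:
    return len(text.split())

# Probabilistic "dice roll": the same prompt can yield different outputs.
# toy_llm is a stand-in that samples from a fixed list; real LLMs sample
# from a learned distribution shaped by context, history, guardrails, etc.
def toy_llm(prompt: str) -> str:
    possible_responses = [
        "Here's a concise summary of your document.",
        "Let me explain this with a metaphor.",
        "I can't help with that request.",
    ]
    return random.choice(possible_responses)

print(word_count("the same result every time"))  # always 5
print(toy_llm("Summarize this document"))        # varies from run to run
```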
Four factors behind AI tools combine to create a uniquely messy challenge for researchers:
- Algorithmic personalization: Each user’s experience is often shaped by their past interactions, but not always. Generally, the more they use it, the more it morphs to their style, preferences, and needs. (But not always! For instance, new users interacting with a system for the first time may not have sufficient interaction history to influence personalization. Additionally, some AI systems prioritize real-time context over historical data, especially in scenarios requiring immediate responsiveness.)
- Learning systems: Most AI systems evolve rapidly, with their behavior potentially changing from day to day due to fine-tuning via direct interaction (e.g., feedback thumbs, revised prompts), model enhancements, retraining, or updates to underlying algorithms and training data. A feature might behave one way on Monday and another by Thursday. (But not always! Not all AI systems are designed for continuous learning. Some are deployed in static states, particularly in applications where consistency and predictability are paramount, such as in certain HR, regulatory, or safety-critical environments.)
- Context sensitivity: LLMs don’t just look at what’s written. They infer intent, tone, and ideally, cultural norms. But here’s the catch: what’s “normal” varies dramatically depending on local formats, honorifics, dialects, idioms, historical references, geography, and other factors. The same prompt in Korean vs. Japanese vs. Modern Standard Arabic vs. dialect Arabic may trigger very different interpretations, formality levels, and safety guardrails. Ask for a “simple explanation” and the system might return a metaphor in one language, a technical summary in another, and a shrug in a third. Same user intent. Different model behavior. Different results.
- Interface and model updates: As the inner workings of an AI system change, so does its exterior. Unlike fixed-UI products, many AI tools serve different UI/UX tweaks, A/B tests, feature rollouts, and capabilities to different users, often without any notification. Because of this, some users see beta features; some do not. Some see new buttons or toggles; some do not. Some are in dark mode with autocomplete; others use a simplified mobile UI with no formatting. All of these differences influence perceived quality, trust, and ease of use.
3 changes to get better results when researching in AI environments
Whether we’re conducting usability testing, customer interviews, diary studies, or longitudinal tracking, we’re no longer observing humans interacting with a single, shared product. We’re observing dynamic human-AI interactions, shaped by both participants and the system itself.
This isn’t just a technical wrinkle; it fundamentally changes how we should collect data, ask questions, and interpret what we find.
My mental model is that AI is fluid. It adapts to its container: the user, the prompt, the context. Our job as researchers isn’t to control the experience; it’s to understand it, characterize it, and create frameworks that hold water (pun intended) even as the product changes shape.
So what can we do about it? I’ve found these approaches to be helpful to my PMs and in my own AI research:
1. Document everything
Experience mapping, in this context, means capturing the user’s journey, use cases, and the system’s behavior. It requires documenting not just what users do, but how the AI adapts in response and what the user is trying to accomplish.
One way to think of it is that the AI is a participant, too. Yes. Read that again. Because AI systems are constantly learning, adapting, and occasionally acting in unpredictable ways (like a human would), you need to capture as many variables about it as you can, just like you would for a human participant.
Because of this, contextual metadata—like prompt history, time of day, language settings, device type, model version, system latency—becomes the backbone of research reliability.
If one user ran into a hallucination at 10pm using Model A in Portuguese on mobile and another didn’t, that context matters. These aren't edge cases. They are central variables.
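One lightweight way to put this into practice is to log a structured context record alongside every observation, so those conditions can be recovered at analysis time. Below is a minimal sketch in Python; the field names are illustrative choices for this example, not a standard schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class SessionContext:
    """Contextual metadata captured alongside each research observation.
    Field names are illustrative; adapt them to your own study."""
    participant_id: str
    model_version: str                  # build or model label served to this user
    language_setting: str               # UI / prompt language, e.g. "pt-BR"
    device_type: str                    # "mobile", "desktop", ...
    timestamp: datetime = field(default_factory=datetime.now)
    latency_ms: float | None = None     # observed system latency, if measured
    prompt_history: list[str] = field(default_factory=list)
    notes: str = ""

# The 10pm hallucination case above stays analyzable because the surrounding
# conditions were recorded rather than lost.
observation = SessionContext(
    participant_id="P07",
    model_version="model-A",
    language_setting="pt-BR",
    device_type="mobile",
    latency_ms=2300,
    prompt_history=["Summarize this contract", "Try again, shorter"],
    notes="Hallucinated a clause that does not appear in the source document.",
)
```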
2. Shift from clean-room methods to more forensic ones
The clean-room model of “one task, one environment” no longer applies. AI systems adapt, users adapt to them, and researchers need to adapt, too—by shifting from control to curiosity, and from observation to interpretation.
In Blindspot, a series I’m currently devouring on Netflix, every tattoo held meaning if you knew how to decode it. Today’s behavioral data is no different. What participants say or do on the surface is just one part of the picture. The real insight often comes from interpreting the patterns beneath.
In AI-driven research, context is a moving target—and much of the meaningful data isn’t surface-level quotes or behaviors, but rather ambient, implicit, or buried in interaction patterns.
Here are a few things researchers can do to start thinking more like detectives and decode these signals:
- Ask participants about their expectations: What did they think the AI would do? What was their end goal? Was it eventually achieved? If so, what steps were required? This helps identify mismatches between mental models and system behavior.
- Track prompt evolution: People revise their inputs. Understanding how and why they change prompts offers insight into usability and comprehension. What was the language, format, and density of the prompt and the response? Was it in paragraphs, bullet points, or prose? Was the tone too professional or too snarky? Too brief or too detailed? Did the model misinterpret an idiom, a formality level, an acronym, or honorifics? Examine session logs for edits, re-prompts, or backtracks (see the code sketch after this list).
- Analyze failure tolerance: What do people do when the model is wrong? Retry, abandon, or go manual? Analyze timing, tooltips, UI paths, and how users adjust inputs to “get it right.” These responses reveal trust thresholds and tool resilience.
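As referenced in the prompt-evolution item above, here is one way that analysis could look in code. This is a heuristic sketch in Python that uses simple text similarity to separate revisions from fresh requests; the sample log and the threshold are made up for illustration.

```python
from difflib import SequenceMatcher

# Toy session log: one participant's prompts, in order (hypothetical data).
prompts = [
    "Summarize this report",
    "Summarize this report in bullet points",
    "Summarize this report in bullet points, plain language, no jargon",
    "What were the Q3 revenue figures?",
]

def classify_turns(prompts: list[str], revision_threshold: float = 0.6) -> list[str]:
    """Label each prompt as a fresh request or a revision of the previous one,
    using simple string similarity. A heuristic, not a validated metric."""
    labels = ["new request"]
    for prev, curr in zip(prompts, prompts[1:]):
        similarity = SequenceMatcher(None, prev, curr).ratio()
        labels.append("revision" if similarity >= revision_threshold else "new request")
    return labels

for prompt, label in zip(prompts, classify_turns(prompts)):
    print(f"{label:12s} | {prompt}")
```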
3. Cluster experiences instead of comparing individuals
While all the details of individual experiences matter, it’d be misleading to judge any single output in isolation, because LLM evaluation works by judging outputs at a vast scale. Because of this, you need to look at the full range of behavior and examine how it’s distributed.
In practice, this looks like comparing groups who experienced similar model behavior (e.g., Cluster A’s system state vs Cluster B’s) rather than analyzing participants individually (e.g., P1 vs P2).
This means five or seven participants per segment likely won't cut it any longer. A larger sample size allows for pattern-matching across personalized variations and better detection of emerging behaviors.
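In code, the shift from individuals to clusters can be as simple as grouping sessions by the system state each participant actually experienced, then summarizing within each group. A toy sketch with made-up data follows; it clusters by model version, but any recorded context variable could serve as the grouping key.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical study data: one row per participant session, tagged with the
# system state they actually experienced.
sessions = [
    {"participant": "P01", "model_version": "A", "task_success": 1, "satisfaction": 4},
    {"participant": "P02", "model_version": "A", "task_success": 0, "satisfaction": 2},
    {"participant": "P03", "model_version": "B", "task_success": 1, "satisfaction": 5},
    {"participant": "P04", "model_version": "B", "task_success": 1, "satisfaction": 4},
    {"participant": "P05", "model_version": "B", "task_success": 0, "satisfaction": 3},
]

# Group by system state (here, model version) instead of comparing P1 vs P2.
clusters = defaultdict(list)
for session in sessions:
    clusters[session["model_version"]].append(session)

# Summarize each cluster's distribution rather than judging single outputs.
for version, group in sorted(clusters.items()):
    success_rate = mean(s["task_success"] for s in group)
    avg_satisfaction = mean(s["satisfaction"] for s in group)
    print(f"Cluster model {version}: n={len(group)}, "
          f"success rate={success_rate:.0%}, avg satisfaction={avg_satisfaction:.1f}")
```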
Embrace the weirdness
Taken together, these shifts aren’t just methodological; they’re mindset changes. We need to recalibrate how we think about consistency, reliability, and even “truth” in our findings.
AI systems today are like that one friend who insists they’re “just being helpful” but somehow always makes things weird. You ask for a simple summary, and suddenly you’re deep into a TED Talk about Greek mythology.
Helpful? Maybe. Consistent? Efficient? Not exactly.
As researchers, we need to embrace this weirdness. Because inside the mess is a new frontier of insight. And if we do this right, we won’t just be keeping up with AI; we’ll help shape how it meets real human needs.
And if you find yourself asking, “Is it me, or is the AI different this time?” the answer is probably yes, to both.