Generative AI is being applied to almost every field of business, so it’s no surprise that people are trying to use it in experience research. Most of the experiments have been around creating questionnaires and summarizing data, but a startup called Synthetic Users has gone further, completely replacing the needs discovery process with synthetic customer interviews created by AI. The company uses AI to both create the questions and create the customer responses to them, as if you had interviewed real people.
The idea is appealing because discovery interviews are painful and tedious for most companies: the traditional approach takes a lot of time and money to recruit and schedule participants, conduct the interviews, and then organize the results. Companies often tell me the process is so burdensome that they skimp on discovery or skip it entirely. So the idea of dramatically compressing that process is intriguing.
Because there are no actual human beings involved in the Synthetic Users process, it is faster and cheaper than other forms of discovery research. The benefits claimed by Synthetic Users include:
A growing number of studies have tested the ability of generative AI to simulate the results of opinion surveys and behavioral tests (examples here and here). But there hasn’t been as much work on AI in qualitative research. Most of the articles I could find focused on anecdotes and opinion (examples here and here), plus one sharp-edged comparison by a Berkeley professor (here). I decided to do a real-world test to gather some data. I ran the same study on Synthetic Users and on UserTesting, to compare synthesized interviews with real ones. (I also tried asking the same questions directly to ChatGPT; you’ll see those results at the end of the article.)
Full disclosure: I’m a UserTesting employee and can’t claim to be completely unbiased, but I did my best to be objective in this comparison. I also saved the raw output from all three platforms, so you can make your own comparisons. You can find links to the raw files at the end of this report. I’m very interested in your thoughts and followup questions. Feel free to dig into the data on your own.
In the beta version of the Synthetic Users system, you enter a few lines of information about your target customer and product idea, and within minutes the AI makes up the users and synthesizes interviews with them. The “interviews” read as if you had done very detailed discovery interviews on the product idea. You receive a transcript of the interviews, demographics and psychographics on the synthesized users, and a summary. You can share this information in the same ways as a real interview.
UserTesting is a research platform that enables you to give questions and tasks to real people who are members of an online panel. Using their computers or smartphones, they record themselves doing tasks and answering your questions. You receive a transcript and video of their responses, plus an AI-generated summary of the results. Depending on how you configure the test, the process can take a few hours to a couple of days.
I wanted to test the AI’s ability to extrapolate by studying a product idea that isn’t currently on the market but could be easily understood and would likely generate discussion. The product idea I chose was a rideshare service that uses flying cars.
I asked for six interviews with people who have to commute to work at least three days a week, have a commute longer than an hour each way, and have at least two school-age kids at home. (Those requirements turned out to be a challenge on UserTesting, which I’ll describe below.)
Here’s how the test creation process worked in Synthetic Users. First the system asked me to describe the target customer:
Here’s the text I entered: “You are a parent living in the United States with two school-age children living at home. You have a full-time job. In an average week, you have to commute to work three or more days a week. The commute takes more than an hour in each direction by your current form of transit.”
Synthetic Users then asked me to specify the pains and needs of those customers. This confused me a little, because one of the purposes of discovery interviews is to discover the problems of users rather than assume them. This turned out to be an important difference between Synthetic Users and real interviews, although I didn’t realize it at the time. I followed instructions and added four customer pains.
Here’s the text I used:
Finally, I was asked to input my product idea:
Here’s what I wrote: A rideshare service in which an aircraft lands at your house, picks you up, and takes you to your office. The rids takes no more than 20 minutes door to door. There is no pilot in the aircraft; it is controlled by an artificial intelligence. You can return home the same way.
(Yeah, there is a misspelling in there. I’m embarrassed, but it didn’t seem to confuse the AI, which in itself is interesting.)
After that, the “test” was ready to go.
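To make the mechanics a little more concrete, here’s a rough sketch of how you could generate a similar “synthetic interview” yourself with a general-purpose LLM API. This is my own approximation, not how Synthetic Users actually works under the hood: the persona and product text are the strings I entered above, the pain points shown are placeholders rather than my real entries, and the model name and helper function are my assumptions.

```python
# Rough approximation of a synthetic-interview pipeline using the OpenAI
# Python client. This is NOT Synthetic Users' implementation; it only
# illustrates the kind of prompt the three setup fields could feed.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONA = (
    "You are a parent living in the United States with two school-age "
    "children living at home. You have a full-time job. In an average week, "
    "you have to commute to work three or more days a week. The commute "
    "takes more than an hour in each direction by your current form of transit."
)

# Placeholder pain points; stand-ins for the four pains I actually entered.
PAINS = [
    "The commute cuts into family time",
    "Chronic lack of sleep",
]

PRODUCT = (
    "A rideshare service in which an aircraft lands at your house, picks you "
    "up, and takes you to your office. The ride takes no more than 20 minutes "
    "door to door. There is no pilot in the aircraft; it is controlled by an "
    "artificial intelligence. You can return home the same way."
)

def synthesize_interview(question: str) -> str:
    """Ask the model to answer one discovery question in character."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption; any chat model would do
        messages=[
            {"role": "system",
             "content": f"{PERSONA}\nYour daily pains include: {'; '.join(PAINS)}."},
            {"role": "user",
             "content": f"Consider this product idea: {PRODUCT}\n\n{question}"},
        ],
    )
    return response.choices[0].message.content

print(synthesize_interview("Walk me through your current commute and how you feel about it."))
```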
As much as I could, I duplicated this process in the UserTesting study. The customer description was turned into screener questions, and the product description was included in the test plan. I asked a series of questions that matched as much as possible the topics covered by the Synthetic Users reports. And I turned the customer problems into rating scale questions so I could measure how common they are. You can see my full test plan here.
As advertised, Synthetic Users was extremely fast and convenient, and the interview “transcripts” were remarkably fluent and easy to read. But the more I dug into the details, the less comfortable I was with the synthesized interviews. I didn’t get the surprises and quirkiness you get in real-world interviews, the system made mistakes, and the interviews made up by Synthetic Users sounded slick and repetitive to me. Here are the specifics:
Speed: Synthetic Users is faster. It took less than half an hour to create my test by filling in the blanks in a web form (most of the time was thinking time to be sure I got the prompts right). Once I finished entering my data, the six synthesized interviews were generated and summarized in about three minutes. The whole process took less than an hour.
In UserTesting, the process took much of the day. I chose to do self-interviews (in which you write out the questions in advance and have people read and respond to them) because that’s faster than live interviews. To run the test, I needed to write test plan questions and screener questions (a couple of hours), wait for responses to arrive (a couple of hours, although I could do other work while I waited), and then analyze the results (potentially several hours, though that was helped by a new AI-driven summarization feature in the UT platform).
I ran into one problem with the UserTesting study, though: at first the test was very slow to fill. I checked the responses to my screener and found that the question about length of commute was eliminating almost all participants. Apparently there aren’t a lot of people who commute to work three or more times a week, have two school-age kids, and have a commute over an hour each way. I relaxed the commute requirement to half an hour each way, and the test filled quickly.
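To spell out the screening logic, it boiled down to a simple predicate like the sketch below. This is just my illustration; UserTesting screeners are configured as multiple-choice questions in the web interface, not as code. Loosening one threshold was all it took to go from an unfillable study to a full one.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    commute_days_per_week: int
    one_way_commute_minutes: int
    school_age_kids_at_home: int

def qualifies(c: Candidate, min_commute_minutes: int = 60) -> bool:
    """Screening criteria used in the study; the commute threshold was later relaxed."""
    return (
        c.commute_days_per_week >= 3
        and c.one_way_commute_minutes > min_commute_minutes
        and c.school_age_kids_at_home >= 2
    )

# Original screener: almost nobody passed with the 60-minute threshold.
# Relaxed screener: qualifies(candidate, min_commute_minutes=30) filled quickly.
```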
This raises an interesting issue about Synthetic Users. In their system, you enter the exact demographics and problems you want people to have, and the system faithfully creates those people – even if they don’t actually exist in the real world. To verify that, I ran a separate test with Synthetic Users that asked for participants who had walked on the Moon during the Apollo landings, but were now in their 40s and working in people management roles. Synthetic Users cheerfully returned six people with those characteristics. Three of them were women (the Apollo astronauts were all men), and all of them were in their 40s – even though no one currently in their 40s had even been born yet at the time of the last Apollo mission in 1972.
This means that you can’t currently count on Synthetic Users to test the reality of the customers you’re asking for. You’ll need an external source of truth for your customer definition, one that you know is correct in all of its specifics.
In other words, you need to do research before you can use Synthetic Users to synthesize research. This is especially important for the customer problems that you enter into the Synthetic Users system – Synthetic Users is not testing those problems for you, it’s assuming they exist, and using them to synthesize the users.
Quality of transcripts: Synthetic Users was perfect. The transcripts from the synthetic interviews were beautifully formatted, in perfect English, very well expressed, and had no grammatical errors or mis-transcribed words. They read like professional first-person essays; you could share them with your company without hesitation. The UT transcripts read the way people actually speak, with many repeated words, half-completed sentences, “ums,” and the occasional mis-transcribed word. They are not as easy to read, and you’d probably want to do some cleanup before sharing them.
Which version you prefer may depend on your needs: Do you need something that reflects the speech of real people, or do you need something that’s easy to read?
Quality of video: Synthetic Users did not participate. UserTesting delivers videos of the people taking the test. In this case, I turned on face recording so you could see the participants and judge their expressions. Since there aren’t any real people being interviewed by Synthetic Users, there is no video to watch and no clips to share.
In the absence of videos, it’s hard to get the sort of empathy and intuitive understanding of customers that you get from real interviews. To get a feel for what that’s like, here’s a short highlight reel from my test, in which people discuss the problems with their current commutes:
Quality of responses: Synthetic Users feels repetitive and sometimes artificial. Synthetic Users gave me what looked like six different interviews, but the “users” all sounded very similar to one another. Many of them raised the same issues with almost the same phrasing. For example, here are quotes from five different Synthetic Users interviews about overwork and lack of sleep:
Synthetic Users excerpts on overwork and sleep
It is very unusual to get that sort of unanimity and repeated phrases in a test of real users. If you found this sort of repetition in real interviews, it would indicate one of those rare situations where there was a deep consensus across society.
In contrast, the UserTesting responses were more diverse in both content and wording. They were also less grammatically correct. Here are the raw, uncorrected transcripts from when I asked real people if they get enough sleep at night:
UserTesting interview excerpts on sleep
Synthetic Users also made up some bizarre characteristics for its participants. Each of my six “participants” used a different form of transit: carpool, car, bicycle, bus, train, and subway. Who rides a bike for more than an hour each way to and from work? I’m sure someone in America does, but you wouldn’t find them in a random sample of six people. The UserTesting participants were more like typical people: five of them drive, one uses a mix of car and public transit.
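If you wanted to go beyond eyeballing the transcripts, one rough way to put a number on this sameness is average pairwise text similarity. Neither platform offers anything like this; the sketch below is just an illustration of a check you could run on exported transcripts, using standard TF-IDF vectors from scikit-learn.

```python
# Rough check of how "same-y" a set of interview transcripts is: average
# pairwise cosine similarity over TF-IDF vectors. Uniformly high scores
# suggest the interviews are paraphrasing one another rather than
# reflecting six distinct people.
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def average_pairwise_similarity(transcripts: list[str]) -> float:
    vectors = TfidfVectorizer(stop_words="english").fit_transform(transcripts)
    sims = cosine_similarity(vectors)
    pairs = list(combinations(range(len(transcripts)), 2))
    return sum(sims[i, j] for i, j in pairs) / len(pairs)

# Usage: load the six transcripts from each platform into lists of strings
# and compare the two averages. (Illustrative only; for this article I
# compared the transcripts by reading them, not by running this.)
```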
Quality of insights: a big difference. To me, the responses from Synthetic Users seemed extremely reasonable, but also unrealistically self-aware. I thought they read a bit like the recommendations you’d get from a self-help book. For an example, see the quotes above about creating a bedtime routine and sleep schedule. Those sound like great things to do, but in the real world people are rarely so systematic and well organized.
Why do we do discovery research in the first place? When I worked at Apple, they taught us not to focus on the things customers ask for, but on new ideas that could delight them if we made them possible. Discovery research enables that ideation by helping us develop intuition about customers’ thinking and reactions. The quirkiness, diversity of answers, and variety in phrasing that you get from human interviews feed your intuition. In contrast, the synthetic participants feel averaged out, like a stone sculpture that’s been sandblasted until you can only see its basic outline.
I’m sure an AI system can be configured to make its results sound more quirky, but since the system is all based on averages, I think it’ll be very hard to know if the quirks you’re seeing are genuine or random things hallucinated by the model. As a result, any intuition you create from those details will also be suspect. You’ll think you know your customers when you really don’t.
There was also a big difference between the platforms in numerical scoring. In both systems, participants were asked to rate the product idea on a 1-5 scale, five being best. There was a dramatic difference in the responses. As shown in the chart below, the Synthetic Users scores clustered tightly around 2.5, while the UserTesting ratings were more diverse and tended toward five. Because of the small sample size, you don’t want to read too much into any numerical result, but this sort of rating question is usually a strong directional indicator of the potential demand for a product. In the controlled atmosphere of an experience test, people usually overstate their willingness to buy something. So if you can’t get a lot of them to give it a five, you’re in trouble.
The UserTesting result gives a hint that there may be hope for a flying rideshare system. The Synthetic Users result, on the other hand, seems to indicate that even people with long commutes would be ambivalent about the product. If I saw those scores in a test of real people, I’d say the product was in deep trouble.
Followup questions: Interesting but sometimes prone to errors. An intriguing feature of the Synthetic Users system is that you can ask instant followup questions. Since the “users” are artificial, you can type in followup questions at any time, and the system will synthesize immediate answers. I tried some followup questions, and received fluent-sounding results. However, the information delivered by the system sometimes got muddled:
You can ask followup questions with the UserTesting system, but the process takes a lot more time (hours at least) and sometimes the participants may not respond at all.
The results summaries: Surprising similarities. Synthetic Users and UserTesting can both use AI to generate a summary of the test results. In Synthetic Users, the system summarizes the synthetic interviews it created. In UserTesting, the system summarizes the transcripts of the interviews.
There were some differences between the summaries, but I was surprised by the similarities. Here’s how Synthetic Users summarized what’s appealing and concerning about a flying rideshare service:
Synthetic Users summary of what’s appealing and concerning
Appealing:
Concerns:
Here’s how UserTesting summarized what’s appealing and concerning:
UserTesting summary of what’s appealing and concerning
Appealing:
Concerns:
The two platforms both identified customers’ main concerns: safety and price. You could say that this shows Synthetic Users is a good proxy for real discovery interviews, but on the other hand you don’t really need any research at all to guess that people worry about safety and price when flying. If you’re looking for non-obvious insights, the platforms had different results:
As a final test, I was curious how the results from Synthetic Users and UserTesting would compare with what I could get for free from a generative AI service. I logged into a free ChatGPT 3.5 account and entered the following query:
Query to ChatGPT
There is a new rideshare service in which an aircraft lands at your house, picks you up, and takes you to your office. The ride takes no more than 20 minutes door to door. There is no pilot in the aircraft; it is controlled by an artificial intelligence. You can return home the same way. What about this service, if anything, would be appealing to customers, and why? Would customers view it as improving their lives? What concerns would customers have about this service? On a 5-point rating scale, with 1 meaning not at all appealing and 5 meaning extremely appealing, how would an average customer rate the appeal of this service to them? Describe the demographics of a typical customer for this service. How often do they commute to work? How long is their commute? Are they likely to have school-age children at home? What other demographics make them distinct from the average person?
Here’s the response I received in a few seconds:
ChatGPT 3.5 response
The concept of a rideshare service using AI-controlled aircraft for a quick and efficient commute presents several potential appealing aspects for customers:
Concerns customers might have about this service include:
On a 5-point rating scale, the appeal of this service would likely vary among individuals, but it could be moderately high, with many customers likely rating it between 3 and 4 due to the time savings and convenience factors. The demographics of a typical customer for this service might include:
It’s important to note that customer demographics can vary, and the success of such a service would depend on addressing both the appealing aspects and potential concerns mentioned above.
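For what it’s worth, you don’t need the chat interface to run this kind of query; the same prompt can be sent through the API. Here’s a minimal sketch using the OpenAI Python client (I used the free ChatGPT 3.5 web interface for the results above, so the model name here is an approximation).

```python
# Sending the same one-shot query through the API instead of the ChatGPT web UI.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUERY = """There is a new rideshare service in which an aircraft lands at your house,
picks you up, and takes you to your office. The ride takes no more than 20 minutes
door to door. There is no pilot in the aircraft; it is controlled by an artificial
intelligence. You can return home the same way. What about this service, if anything,
would be appealing to customers, and why? Would customers view it as improving their
lives? What concerns would customers have about this service? On a 5-point rating
scale, with 1 meaning not at all appealing and 5 meaning extremely appealing, how
would an average customer rate the appeal of this service to them? Describe the
demographics of a typical customer for this service. How often do they commute to
work? How long is their commute? Are they likely to have school-age children at home?
What other demographics make them distinct from the average person?"""

reply = client.chat.completions.create(
    model="gpt-3.5-turbo",  # API analogue of the free ChatGPT 3.5 I used
    messages=[{"role": "user", "content": QUERY}],
)
print(reply.choices[0].message.content)
```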
You should judge for yourself, but to me, the issues and ideas I got for free from ChatGPT seemed very similar to what I got from Synthetic Users. ChatGPT also flagged some issues that came up in real interviews but not in Synthetic Users (for example, customers might enjoy using the flying commute time to do other tasks). It also predicted some things I didn’t expect (for example, that the core customers would be people without kids). I’d want to test those predictions through real research.
Overall, I think I learned about as much through a free ChatGPT query as I did through the Synthetic Users process. I didn’t get the detailed simulated interviews that Synthetic Users produces, but in terms of ideas, the results were about the same.
Is generative AI in its current form a good substitute for discovery interviews? Based on this test, I think the answer is no:
Synthetic Users and ChatGPT are definitely much faster than traditional research, but that gap is narrowing. Old school in-person interviews typically take weeks to recruit and conduct. That timeline has already been cut to days through online services and panels. Self-interviews paired with AI-generated summaries are making another big cut in turnaround time, by freeing researchers from the need to participate live in every moment of the interviews.
Although I don’t think generative AI in its current form is a good replacement for discovery research, I think it is a good supplement to the process. I would be comfortable using it to brainstorm issues to explore in research. For example, how many people worry about the environmental impact of a flying car? That’s a good question that I hadn’t thought to ask on my own.
I think generative AI is also useful for creating summaries of interview data. I was very comfortable with the quality of the AI-generated summaries created by all three platforms.
Ironically, I think the most problematic element of Synthetic Users is its headline feature: the synthetic interviews themselves. In attempting to create convincing fake interviews, the system introduces hidden inaccuracies that could easily lead a product team in the wrong direction. I think it’s better to treat generative AI as something new, with its own strengths and weaknesses — and to focus on the things it does uniquely well — rather than making it pretend to be something it isn’t.
If you want to do your own comparisons, here’s the source information:
About The Center for Human Insight
We created this resource to help you use human insight for business decision-making.