Methodology

AIMIs vs. Static Online Surveys: A Comparative Study by the University of Mannheim

AUTHOR
Aylin Idrizovski
December 3, 2025

AI-Moderated Interviews vs. Static Online Surveys: A Comparative Study on Qualitative Response Quality

Online surveys are efficient, scalable, and familiar. However, when the research question relies on open-ended responses (motivations, barriers, lived experiences), many survey answers are one-word, short, generic, or unusable. AI-moderated interviews (AIMIs) promise a different experience: a conversational format that can adapt follow-up questions based on what the participant actually said.

To test whether that promise results in clearly better data, Aylin Idrizovski, with support from Dr. Florian Stahl of the Chair of Quantitative Marketing & Consumer Analytics at the University of Mannheim, conducted a comparative study in which AIMIs and static surveys gathered responses to the same questionnaire from comparable samples under controlled conditions. We at Glaut Research only provided the platform free of charge and covered the panel costs.

Research Goals

The study had a clear methodological aim: to assess whether AI-moderated interviews (AIMIs) produce higher-quality data than static online surveys when both formats collect answers to the same questionnaire.

More specifically, the team evaluated differences across four dimensions:

  • Linguistic richness (how much people say, and how varied their language is)
  • Thematic breadth (how many distinct ideas appear in responses)
  • Response validity (how often responses are nonsensical / “gibberish”)
  • Participant experience (how people feel about taking part)

Experimental Design

Study structure and sample

This was a between-subjects experiment (participants completed one format only), chosen to avoid learning or carryover effects. Participants were recruited via PureSpectrum and screened to include U.S. citizens aged 18–55, with a balanced gender distribution and an interest in health and fitness.

  • N = 200 total
  • Random assignment to:
    • AI-moderated interview (AIMI): n = 100
    • Static online survey (SoSci Survey): n = 100

Questionnaire content

Both groups answered the same questionnaire on healthy lifestyle choices:

  • Six open-ended questions, each followed by two follow-ups
  • Three structured questions (including difficulty rating, a multi-select behavior item, and a participant-experience scale)

The topic itself was not the object of analysis; the study focused on the methodological differences between the formats.

Follow-up questions: predefined vs. adaptive

This is the core manipulation:

  • Survey (SoSci): two predefined probes after each open question (e.g., “Can you tell me more?”, “Can you provide some examples?”)
  • AIMI: dynamic, context-driven probes generated from the participant’s prior response, creating a more conversational flow

Because AIMIs can go beyond two follow-ups, the analysis used only the first two AIMI follow-ups to keep comparability high.
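
To make the contrast concrete, here is a minimal sketch of the two probing strategies. It is illustrative only, not Glaut's actual implementation, and ask_llm is a hypothetical stand-in for whatever model generates the adaptive probe:

PREDEFINED_PROBES = ["Can you tell me more?", "Can you provide some examples?"]

def survey_probe(probe_index: int) -> str:
    """Static survey: the same two probes after every open question."""
    return PREDEFINED_PROBES[probe_index]

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a language-model call."""
    return "What usually gets in the way of that routine?"

def aimi_probe(question: str, prior_answer: str) -> str:
    """AIMI: a context-driven probe generated from the prior answer."""
    prompt = (
        f"The participant was asked: {question}\n"
        f"They answered: {prior_answer}\n"
        "Ask one short, neutral follow-up question."
    )
    return ask_llm(prompt)

print(survey_probe(0))
print(aimi_probe("What does a healthy day look like for you?",
                 "I try to cook at home, but weekdays are hectic."))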

Interface differences

  • Static survey: traditional form layout with text boxes
  • AIMI: sequential chat-style interface
  • Even though Glaut supports voice input, the study restricted responses to text only to ensure a fair comparison.

Data quality handling

Both formats flagged incomplete/abandoned sessions and retained only complete responses. Additionally, the AIMI condition used an uncooperative-response detector and excluded low-quality inputs during cleaning, whereas the SoSci condition retained gibberish to reflect real-world vulnerability to invalid entries.

Measures

The study evaluated response quality using three measurement blocks:

1) Linguistic metrics

Computed from open-ended responses (including follow-ups), the linguistic measures included:

  • Verbosity (total words)
  • Unique words (distinct word types)
  • Lexical diversity (type-token ratio, TTR)
  • Content-word share (share of nouns/verbs/adjectives/adverbs)
  • Readability (Flesch Reading Ease)
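
A minimal sketch of how the first three measures can be computed for a single response (the study's exact tokenization and tooling are not specified here; content-word share and readability would additionally need a POS tagger and a readability formula):

import re

def linguistic_metrics(response: str) -> dict:
    """Verbosity, unique words, and type-token ratio for one response."""
    tokens = re.findall(r"[a-z']+", response.lower())
    verbosity = len(tokens)                  # total words
    unique_words = len(set(tokens))          # distinct word types
    ttr = unique_words / verbosity if verbosity else 0.0  # lexical diversity
    return {"verbosity": verbosity, "unique_words": unique_words, "ttr": ttr}

# Content-word share (nouns/verbs/adjectives/adverbs) and Flesch Reading Ease
# would require a POS tagger and a readability library on top of this.

print(linguistic_metrics("I meal-prep on Sundays so I do not order takeout during the week."))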

2) Thematic variety

Using an inductive codebook (Appendix C), the team measured:

  • Theme count (total mentions, including repeats)
  • Unique themes (how many distinct themes appear)
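
A toy example with hypothetical codebook labels shows how the two metrics diverge:

from collections import Counter

# Hypothetical codebook labels assigned to one participant's answers
coded = ["exercise", "diet", "exercise", "sleep", "diet", "exercise"]

theme_count = len(coded)          # total mentions, including repeats -> 6
unique_themes = len(set(coded))   # distinct themes -> 3
print(theme_count, unique_themes, Counter(coded))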

3) Participant experience

A 7-item Likert scale captured:

  • ease of expression
  • comfort
  • repetitiveness (reverse-coded)
  • conversational quality
  • feeling understood
  • trust in data handling
  • willingness to recommend

Key Results

1) Higher linguistic quality in AI interviews

Across multiple measures, AIMI responses were richer:

  • +39% more words (AIMI mean 131.52 vs. 94.25; Figure 1)
  • +51% more unique words (83.69 vs. 55.31; Figure 2)
  • +12% higher lexical diversity (TTR: 0.704 vs. 0.626; Figure 4)
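
The reported lifts follow directly from the group means; for verbosity, for example:

aimi_mean, survey_mean = 131.52, 94.25
lift = (aimi_mean - survey_mean) / survey_mean
print(f"+{lift:.1%}")   # +39.5%, reported as +39%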

Importantly, this was not “more text at any cost.” On clarity-related measures, the formats were statistically equivalent:

  • Content-word share: no meaningful difference (0.618 vs. 0.634)
  • Readability (Flesch Reading Ease): no meaningful difference (77.76 vs. 79.60)

Interpretation: AIMIs generated longer, more varied responses without sacrificing readability or content density, which is essential when “richer” responses could otherwise become rambling.

2) Broader & deeper thematic coverage

The thematic analysis separates quantity from diversity.

  • Unique themes were +36% higher in AIMI (8.76 vs. 6.42)
  • Theme count was comparable (19.77 vs. 18.68; not significant)

Why this is important: If AIMI had simply added more words, we would see an increase in theme count (more words → more mentions). However, the total mentions remained roughly the same while the variety of ideas widened. This indicates that AIMI’s conversational probing encouraged a wider exploration of topics, rather than just producing longer responses.

3) Better data validity

A common issue with open-ended responses is the proportion that are not interpretable, such as random characters, repeated words, non-English strings, or extremely short nonsense answers.

In this dataset:

Group   n     Gibberish (n)   Rate (%)
AIMI    100   0               0.0
SoSci   100   10              10.0

Interpretation: nonsensical input appeared exclusively in the survey condition. This difference has practical benefits, reducing cleaning time and minimizing lost completes.
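
The study's uncooperative-response detector is not reproduced here, but a rough sketch of the kind of heuristic a research team might use when screening open ends looks like this:

import re

def looks_like_gibberish(text: str) -> bool:
    """Rough heuristic flag for uninterpretable answers (illustrative only)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return True                                    # no recognizable words
    if len(tokens) < 3 and max(len(t) for t in tokens) <= 2:
        return True                                    # e.g. "dd", "ok ok"
    if len(set(tokens)) == 1 and len(tokens) > 3:
        return True                                    # one word repeated over and over
    letters = sum(c.isalpha() for c in text)
    return letters / len(text) < 0.5                   # mostly symbols or digits

print(looks_like_gibberish("dd"))                                       # True
print(looks_like_gibberish("I walk 30 minutes before work most days."))  # False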

4) Better participant experience

Participant experience improved overall in the AIMI condition:

  • Overall experience score: +6% (4.22 vs. 3.98), a statistically significant difference (p = .02, d = 0.33; Figure 5)
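
As a back-of-envelope check (the summary does not report standard deviations), Cohen's d is the mean difference divided by the pooled standard deviation, so d = 0.33 implies a pooled SD of roughly 0.73 on the experience scale:

mean_aimi, mean_survey, cohens_d = 4.22, 3.98, 0.33
implied_pooled_sd = (mean_aimi - mean_survey) / cohens_d
print(round(implied_pooled_sd, 2))   # ~0.73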

At the item level, the advantages concentrate around interaction quality:

  • More conversational (4.26 vs. 3.80)
  • More “listened to and understood” (4.38 vs. 3.93)
  • Higher trust in data handling (4.45 vs. 3.95)
  • Less repetitive (3.30 vs. 3.97)
  • More recommendable (4.41 vs. 3.88)

Ease of expression and comfort were consistent across formats, indicating that AIMI's improvements did not stem from making people feel pressured or uncomfortable, but from interactions that felt more responsive and varied.

What this means

When two comparable samples answered the same questions:

  • AIMIs generated more linguistically rich responses (more words, more unique vocabulary, higher lexical diversity).
  • AIMIs produced more diverse ideas, not just longer answers.
  • AIMIs yielded cleaner data (no gibberish in the AI condition; 10% in the survey condition).
  • Participants experienced AIMIs as more conversational, less repetitive, more trustworthy, and overall more positive.

In conclusion, this controlled comparison suggests that AIMI’s conversational and adaptive qualities improve data quality and participant experience while maintaining clarity.

Questions?

Read the full paper here, or contact us at hello@glaut.com