In my previous post, I shared the surprising results from a classroom experiment comparing manual, AI-only, and hybrid approaches to literature reviews. The finding that challenged many readers’ assumptions was that AI actually outperformed humans on factual accuracy.
As promised, this post explains what I actually did and addresses the questions readers have raised since that post went live.
From the start, let me clarify that I did not set out to design this experiment (though going forward, I will). It emerged during my methods seminar when a student asked whether the sentiment that AI hallucinates and is "not good enough yet" was really a gatekeeping strategy used by us academics to keep students away from a technology many of us clearly benefit from. His question took the discussion in an interesting direction. We all agreed that most AI tool reviews follow a predictable pattern: someone tries a tool, likes what they see, and declares it revolutionary; or they encounter a hallucination and warn everyone to avoid it. We also agreed that neither approach offered useful guidance for deciding whether the student's claim was right.
To put the debate to bed, we agreed to turn the question into an experiment, which happened to align well with that week's assignment: a literature review paper synthesizing 10 articles. The experiment therefore tested one of the most common research tasks there is, synthesizing findings from academic literature.
The experimental design
I worked with 22 graduate students enrolled in a research methods seminar. All participants had completed undergraduate research training and were familiar with literature review conventions. Experience levels ranged from first-year master’s students to advanced doctoral candidates.
Each student wrote a literature review on the impact of green spaces on mental health. Everyone used exactly the same 10 peer-reviewed articles that I provided as instructor. The constraints were straightforward: maximum 5 pages of text, APA citation format, and they recorded the time spent on the task.
The 10 articles covered diverse methodologies (experimental, observational, systematic reviews), geographical contexts (urban planning in North America, European health studies, Asian environmental research), and populations (children, elderly, general adults).
Three Conditions
Manual Only (Baseline)
To establish a baseline, students read the 10 articles and wrote the literature review entirely on their own, using traditional methods like reading, note-taking, and writing. This served as our control condition.
AI Only
I fed the same 10 articles into two AI tools (ChatGPT-5.2 and Claude 4.5 Sonnet) with identical prompts. I asked each tool to write a 5-page literature review on the impact of green spaces on mental health using these 10 articles, following APA format and synthesizing the findings thematically.
I evaluated both AI outputs independently to determine which performed better according to the 5-point framework detailed below. Students did not participate in this condition. This was purely AI-generated content.
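For readers who want to reproduce this condition programmatically rather than through the chat interfaces (which is how one would normally "feed in" articles), here is a minimal sketch that sends one identical prompt to both providers via their official Python SDKs. The model identifiers are placeholders, not the versions named above, and the article text is assumed to be loaded into `articles_text`:

```python
# pip install openai anthropic
from openai import OpenAI
import anthropic

# Placeholder: full text of the 10 assigned articles, prepared elsewhere.
articles_text = open("articles.txt").read()

prompt = (
    "Write a 5-page literature review on the impact of green spaces on "
    "mental health using only the 10 articles below. Follow APA format "
    "and synthesize the findings thematically.\n\n" + articles_text
)

# ChatGPT via the OpenAI SDK (model ID is a placeholder).
openai_client = OpenAI()
gpt_review = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

# Claude via the Anthropic SDK (model ID is a placeholder).
claude_client = anthropic.Anthropic()
claude_review = claude_client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=4096,
    messages=[{"role": "user", "content": prompt}],
).content[0].text
```

Scripting the calls this way makes the "identical prompts" constraint trivially easy to enforce and log.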
Hybrid Approach
After completing their manual literature reviews, students received the AI outputs from both ChatGPT and Claude. They selected which AI tool produced the better output, then collaboratively edited their original paper using the selected tool. They refined sections, improved synthesis, verified citations, and disclosed AI assistance in acknowledgments.
5-Point evaluation framework
I developed this framework specifically to assess literature reviews in the age of AI. Each criterion addresses a different dimension of quality and integrity. This framework was developed using a thematic analysis of the key concerns I gathered mainly from Reddit, X, and Facebook. In a future blog, I hope to share more on the computational text analysis process that yielded this framework.
1. Accuracy of facts and citations (Weight: 30%)
This measures whether the claims made in the literature review are actually supported by the source articles and whether citations are correct.
For scoring, I randomly selected 10 factual claims from each literature review. I traced each claim back to the original article, verified citation accuracy (correct author, year, page numbers), and checked for hallucinated references or findings.
The common errors I found varied by approach. For manual reviews, students often misinterpreted statistical significance, claiming strong effects when results were marginal, or overgeneralized from one study’s specific population to all populations. For AI reviews, ChatGPT fabricated citations like a study on “Urban forests and dopamine levels” that doesn’t exist, while Claude occasionally merged findings from multiple studies into one citation.
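To make the spot-check above concrete, here is a minimal sketch of the sampling step, assuming each review's factual claims have already been extracted into a list with their citations (the structure and entries are hypothetical; the verification itself was done by hand):

```python
import random

# Hypothetical structure: every factual claim extracted from one review,
# paired with the source it cites.
claims = [
    {"claim": "Green space access reduced anxiety scores.", "citation": ("Chen", 2021)},
    # ...one entry per factual claim in the review
]

random.seed(42)  # fixed seed so the audit sample is reproducible
audit_sample = random.sample(claims, k=min(10, len(claims)))

for item in audit_sample:
    # Each sampled claim is then traced back to the cited article by hand:
    # correct author/year/page numbers, and no hallucinated finding.
    print(item["claim"], "->", item["citation"])
```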
2. Variability in prose and sentence structure (Weight: 15%)
This criterion assesses whether the writing sounds natural and varied, or repetitive and formulaic.
I analyzed sentence length distribution (standard deviation of words per sentence), checked for repeated sentence structures or transition phrases, and identified robotic patterns like “Additionally,” “Furthermore,” and “Moreover” used mechanically.
Human writing showed natural variability, with sentence lengths mostly between 15 and 30 words and a high standard deviation. AI writing was noticeably formulaic; ChatGPT in particular repeated structures like “Research shows that… This suggests…” Hybrid writing maintained human variability while improving clarity.
Another interesting pattern was in punctuation. Both ChatGPT and Claude used em dashes and colons consistently, whereas students rarely, if ever, used either.
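For those curious how checks like these can be automated, here is a minimal sketch of the style profile described above, assuming the review is available as plain text (the function name and the specific opener list are my own, not part of the original analysis):

```python
import re
from statistics import mean, stdev

FORMULAIC_OPENERS = {"Additionally", "Furthermore", "Moreover"}

def style_profile(text: str) -> dict:
    # Naive sentence split: a ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return {}
    lengths = [len(s.split()) for s in sentences]
    return {
        "sentences": len(sentences),
        "mean_words_per_sentence": round(mean(lengths), 1),
        "stdev_words_per_sentence": round(stdev(lengths), 1) if len(lengths) > 1 else 0.0,
        # Mechanical transitions at the start of a sentence.
        "formulaic_openers": sum(
            s.split()[0].rstrip(",") in FORMULAIC_OPENERS for s in sentences
        ),
        # Punctuation tells: em dashes and colons.
        "em_dashes": text.count("\u2014"),
        "colons": text.count(":"),
    }

print(style_profile(open("review.txt").read()))
```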
3. Depth of analysis and synthesis (Weight: 25%)
This measures whether the review merely summarizes studies or identifies patterns, contradictions, and gaps.
I counted thematic connections made across studies, evaluated identification of contradictions in the literature, and assessed recognition of methodological limitations and research gaps.
Here’s what surprised me. AI (especially Claude) excelled at identifying cross-study patterns that humans missed. For example, AI consistently noticed that studies on green space access and mental health outcomes showed stronger effects for low-income populations. This theme appeared across 6 of the 10 articles but was only mentioned in 3 of the 22 manual reviews.
However, AI struggled to recognize when contradictions might stem from methodological differences rather than substantive disagreement, and it rarely proposed novel theoretical frameworks to explain patterns. Indeed, its underlying explanations at times felt forced.
4. Originality and critical insight (Weight: 20%)
This examines whether the review offers unique perspectives or critical evaluation, or just regurgitates summaries.
I looked for original arguments or interpretations not explicitly stated in the source articles, assessed critical evaluation of study quality, methodology, or generalizability, and identified integration with broader theoretical or practical contexts.
Both manual and AI-only approaches scored identically (7 out of 10) but for different reasons. Some students offered genuinely original insights, but many simply summarized without critical evaluation. AI consistently provided competent but generic synthesis and rarely offered truly novel perspectives.
The hybrid approach scored higher (8.5 out of 10) because students used AI to identify patterns, then added their own critical interpretation.
5. Author engagement with the text (Weight: 10%)
This criterion addresses whether the author can clearly explain and defend what’s written. It tackles the risk of students submitting AI-generated text they don’t understand.
After submitting their reviews, I randomly selected 5 students from each condition for a brief 10-minute Q&A. I asked them to explain the main argument of their review, followed by three content-specific questions I prepared from reading the submissions. For the AI-only condition, I asked the 5 students to each read the AI-generated content and be prepared to answer questions as though the work were their own.
Students working manually could easily explain their work and scored 9 out of 10. Students defending the AI-only work scored 3 out of 10 and later explained the work felt “foreign and hard to understand”. Students using the hybrid approach demonstrated strong understanding and scored 8 out of 10. They could explain what AI contributed and how they refined it.
AI vs. manual literature review: systematic comparison of ChatGPT, Claude, and human researchers
| Evaluation Criteria | Manual Only | AI Only | Hybrid |
|---|---|---|---|
| Accuracy of Facts & Citations | 6.5/10 | 7/10 | 9.5/10 |
| Variability in Prose & Structure | 10/10 | 5.5/10 | 8/10 |
| Depth of Analysis & Synthesis | 6/10 | 8/10 | 8.5/10 |
| Originality & Critical Insight | 7/10 | 7/10 | 8.5/10 |
| Author Engagement with Text | 9/10 | 3/10 | 8/10 |
| OVERALL SCORE | 7.7/10 | 6.1/10 | 8.5/10 |
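A note on the arithmetic: the OVERALL row matches the simple (unweighted) average of the five criterion scores rather than a weighted composite. Here is a minimal sketch that reproduces it, with the weighted composite under the stated weights shown alongside for comparison (key names are mine):

```python
WEIGHTS = {"accuracy": 0.30, "prose": 0.15, "depth": 0.25,
           "originality": 0.20, "engagement": 0.10}

SCORES = {
    "Manual Only": {"accuracy": 6.5, "prose": 10.0, "depth": 6.0,
                    "originality": 7.0, "engagement": 9.0},
    "AI Only":     {"accuracy": 7.0, "prose": 5.5, "depth": 8.0,
                    "originality": 7.0, "engagement": 3.0},
    "Hybrid":      {"accuracy": 9.5, "prose": 8.0, "depth": 8.5,
                    "originality": 8.5, "engagement": 8.0},
}

for condition, s in SCORES.items():
    simple = sum(s.values()) / len(s)                     # matches the OVERALL row
    weighted = sum(s[c] * w for c, w in WEIGHTS.items())  # composite under stated weights
    print(f"{condition}: simple mean {simple:.1f}, weighted {weighted:.2f}")
```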
Time to Complete
Manual reviews averaged 6.5 hours. AI-only reviews took 0.5 hours. The hybrid approach averaged 3.5 hours, representing a 46% time savings compared to manual work.
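The savings figure is straightforward to verify:

```python
manual, hybrid = 6.5, 3.5           # average hours per condition
savings = (manual - hybrid) / manual
print(f"{savings:.0%}")             # -> 46%
```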
Key observations
AI outperformed humans on factual accuracy
This was the most surprising result. The prevailing assumption is that humans are more accurate than AI because we understand context better. The data showed otherwise. AI scored 7.0 out of 10 versus 6.5 out of 10 for manual reviews on accuracy.
Why did this happen? Humans made interpretation errors, reading their own assumptions into the data, and missed relevant information through selective attention to familiar themes. These biases were less common in the AI-generated responses.
However, this doesn’t mean AI is inherently more accurate. ChatGPT’s hallucination rate of 12% of citations was unacceptable. Claude performed better but still made errors. The key insight is that both humans and AI make mistakes, just different kinds. Public sentiment, however, appears to be driven by something I’d summarize as ‘humans are simply more trusted, not more accurate’.
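Because the source set was closed (everyone worked from the same 10 articles), hallucinated citations are mechanically detectable: any in-text citation that matches no assigned article is suspect. A minimal sketch, with placeholder author/year pairs standing in for the actual reading list:

```python
import re

# (first-author surname, year) for each of the 10 assigned articles.
# These entries are placeholders, not the actual reading list.
KNOWN_SOURCES = {
    ("Smith", 2019), ("Garcia", 2020), ("Chen", 2021),
}

# Naive APA in-text pattern: "(Smith, 2019)" or "(Smith et al., 2019)".
CITATION = re.compile(r"\(([A-Z][\w'-]+)(?: et al\.)?,\s*(\d{4})\)")

def flag_suspect_citations(text: str) -> list[tuple[str, int]]:
    """Return citations that match no assigned article (candidate hallucinations)."""
    cited = {(m.group(1), int(m.group(2))) for m in CITATION.finditer(text)}
    return sorted(cited - KNOWN_SOURCES)
```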
Humans are more trusted because of engagement, not accuracy
The manual approach scored 9 out of 10 on author engagement. Students could explain their reasoning, defend their choices, and identify weaknesses. AI-only scored 3 out of 10 because there was no direct and meaningful author engagement in the writing process.
This distinction matters. When we trust human-written work more than AI-written work, we’re not trusting accuracy. We’re trusting that the author understands what they’ve written and can defend it. We also trust our own capabilities more than those of a machine or model.
The implication for research integrity is that if researchers use AI without understanding the output, they undermine trustworthiness even if the content is technically accurate. The hybrid approach preserves author engagement because students could explain how they used AI and what they refined.
AI excels at pattern recognition; humans excel at prose variability and building trustworthiness
AI scored highest on depth of analysis and synthesis (8 out of 10), while humans scored highest on variability in prose (10 out of 10). This suggests complementary strengths: use AI for identifying themes, detecting contradictions, and ensuring comprehensive coverage; use human editing for natural phrasing, critical interpretation, and original insight. Human involvement creates familiarity and, with it, trustworthiness.
The hybrid approach achieved the best results with 46% time savings
The hybrid condition scored 8.5 out of 10 overall, higher than manual (7.7 out of 10) or AI-only (6.1 out of 10), while cutting time nearly in half (3.5 hours versus 6.5 hours). This wasn’t just about speed. The quality improved because AI caught what humans missed, and humans added what AI couldn’t generate.
Students reported that seeing the AI output helped them recognize gaps in their own analysis. One student told me, “I didn’t realize I had completely missed the equity theme until Claude pointed it out. Then I went back to the articles and saw it everywhere.”
What this means for you
AI is certainly not as bad as academic circles sometimes suggest, nor as good as tech Twitter would have you believe. Using it well is a balancing act and, more importantly, raises ethical questions of authorship and acceptability in different contexts. In the coming weeks, we will explore these topics in detail. For now, if you’re a researcher, student, or professional conducting literature reviews, the practical takeaways are simple: don’t rely on AI alone, and use AI for pattern recognition, not final output.
FAQs
Is AI or manual literature review more accurate?
In our experiment, AI scored 7.0/10 vs. manual 6.5/10 on accuracy, but the two made different types of errors.
What’s the best AI tool for literature reviews?
Claude performed better than ChatGPT on citation accuracy (0% vs. 12% hallucination rate), but the hybrid approach scored highest overall. The results of a dedicated comparison of specialized AI tools, including SciSpace, Consensus AI, and Elicit AI, are scheduled to go live next week.
How much time does AI save on literature reviews?
The hybrid approach saved 46% of time (3.5 hours vs 6.5 hours manual).
About the Author
Dr. Amasiya is a social scientist and research methodologist who specializes in evidence-based evaluation of AI tools for research and professional contexts. With nearly a decade of experience teaching research methods at the graduate level, Dr. Amasiya applies systematic testing protocols to assess AI claims and develop transparent frameworks for evaluating AI-assisted work.