I like to think of myself as a technology enthusiast who is generally willing to embrace new advances. But as an academic who has spent nearly a decade teaching research methods and ardently defending rigor and transparency in research practice, I faced a significant dilemma when AI began to flood the internet.
Like many academics, I panicked, because academia as we know it has always been slow to adapt, and this was a disruption many of us had not prepared for. I remember jumping from ‘this should definitely be treated as academic dishonesty and penalized accordingly’ to the helpless admission that ‘there is actually no generally accepted way to determine whether someone has used AI in their research.’
What made it even scarier was how difficult it was to have conversations about AI with fellow professors, lest you be perceived to be using AI in your own writing. It is one of the worst-kept secrets in academia that those who use AI the most in private are often the quickest to dismiss any move towards accepting AI in research practice.
The coffee chat revelations
But as the storm settled, I started to engage both colleagues and students during my usual morning ‘coffee chats’ and discovered something profound.
On one occasion, a senior colleague disclosed to me that they “no longer know what to think about this whole AI thing.” On the one hand, they believed it was totally unethical, and on the other hand, they accepted we cannot stop ‘progress.’ So, they were among the first in my university to start advocating for institutional-level guidance on AI use. The cognitive dissonance was palpable—they clearly wanted rules because they couldn’t reconcile their ethical concerns with the practical reality that AI wasn’t going away.
On another occasion, a student mentioned to me that “serious universities” now accept AI use, so why was our university pushing back? When I pressed him for evidence, he shared a blog post filled with ads and affiliate links. In it, the author claimed to be a student at an Ivy League university who was allowed to use AI. The post was littered with promotional links to various AI writing tools, each one presumably earning the author a commission. There was no methodology and no institutional verification, just anecdotal claims wrapped in marketing.
This wasn’t an isolated incident. I encountered similar stories from other students—YouTube videos promising to “revolutionize your research,” Twitter threads declaring that “all top researchers use AI now,” and Medium posts with sweeping claims about AI adoption that, upon closer inspection, were either exaggerated or completely fabricated.
The problem is that we replaced ethics with regulations
After encountering a number of these anecdotes, and at times outright falsehoods, from students and colleagues alike, I became concerned that we have reduced the debate around AI use to “is it allowed or not?” In doing so, we have failed to develop usable guidance on whether, when, how, and why to use AI tools for research purposes.
In short, we have replaced ethics with regulations.
And in doing so, we have failed to develop a clear understanding and framework for determining which AI tools meet the standards of academic integrity and professionalism. More importantly, we lack the transparent conversations required to develop replicable workflows for reengineering research practice in this rapidly changing knowledge ecosystem.
Indeed, the regulatory approach creates a binary: allowed or forbidden. But research ethics has never worked that way. We don’t ask “Is interviewing people allowed?” or even “Is it allowed to use a professional language editor for a peer-reviewed publication?” We ask: “Under what conditions can we do these things? With what consent? What are the risks? How do we minimize harm? What transparency is required?”
The same nuanced framework should apply to AI tools. Yet instead of rigorous evaluation, we got:
- Panic-driven policies with universities banning tools they didn’t understand
- Vague guidelines that call for responsible AI use without defining what is responsible and what is not
- Contradictory advice, even within the same universities and workplaces, where one professor says never to use it and another says it’s the future of work
- Affiliate-driven reviews with bloggers promoting tools they profit from
- And vendors promising extraordinary benefits with minimal tangible evidence
Extraordinary claims require extraordinary evidence
As someone who teaches research methods to graduate students, I’ve spent years emphasizing two foundational principles. The first is that extraordinary claims require extraordinary evidence. The second, to borrow from Neil deGrasse Tyson, is that research (and, to a large extent, knowledge creation more broadly) involves doing everything in one’s power not to fool oneself or others into believing that something is true when it is not, or that something is not true when it in fact is.
When a vendor claims their AI tool will “revolutionize your literature review” or “write your dissertation in minutes,” that’s an extraordinary claim. It requires extraordinary evidence—controlled experiments, baseline comparisons, transparent methodology, honest reporting of limitations.
Yet I was seeing none of that in the AI discourse around research. Instead, I saw:
- Anecdotal testimonials (“This changed my life!”)
- Cherry-picked examples (showing only successes, hiding failures)
- Vague efficiency claims (“Save 80% of your time!” without defining what that means)
- No baseline measurements (faster than what? More accurate than what?)
- No acknowledgment of failure modes (what happens when it goes wrong?)
This isn’t how we evaluate anything else in academic research. If a student came to me proposing a new data analysis method and said “trust me, it works great,” I’d send them back to the drawing board with instructions to design a proper evaluation.
AI can, and should, work without exposing knowledge users to career risks
But beyond financially motivated vendors, we, as knowledge producers, are complicit when we fail to scrutinize how these tools might amplify biases in fields like the social sciences. Returning to Neil deGrasse Tyson’s point: have we questioned to what extent the AI tools we use help us avoid fooling society about the truth of the knowledge we produce and use?
This is the concern that pushed me to start this website. To understand the extent to which AI tools can help us be responsible knowledge creators and users, I conducted a simple experiment in my graduate research methods seminar. I asked my students to write a literature review on the impact of green spaces on mental health. All students were instructed to use the same 10 peer-reviewed articles, keep the text to at most 5 pages, and record the time spent writing. This first paper served as the control (manual baseline).
I then fed the same articles into two AI tools (ChatGPT and Claude) and instructed them to complete the exact same task. Finally, I shared the AI outputs with the students and asked them to select the better of the two and work collaboratively with that tool to edit their papers (e.g., refining sections while disclosing AI assistance).
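For readers who want to replicate this in their own seminars, here is a minimal sketch of the protocol as structured data. The field names and labels are my own illustrative shorthand, not a formal preregistration of the study.

```python
# Minimal sketch of the classroom experiment protocol as structured data.
# Field names and labels are illustrative shorthand, not a formal preregistration.

protocol = {
    "task": "Literature review: impact of green spaces on mental health",
    "shared_inputs": {
        "articles": 10,          # same 10 peer-reviewed articles for every condition
        "max_length_pages": 5,   # upper limit on the written review
    },
    "conditions": [
        {
            "name": "manual",    # control / baseline: students write unaided
            "who": "graduate students",
            "record": ["time_spent_hours", "draft"],
        },
        {
            "name": "ai_only",   # same articles and task given to each AI tool
            "who": ["ChatGPT", "Claude"],
            "record": ["time_spent_hours", "draft"],
        },
        {
            "name": "hybrid",    # students edit the stronger AI draft collaboratively,
            "who": "graduate students + preferred AI tool",  # disclosing AI assistance
            "record": ["time_spent_hours", "draft"],
        },
    ],
}
```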
I then measured the outcomes of this experiment against my 5-point framework for assessing the credibility of literature reviews in the era of AI:
- Accuracy of facts and citations (e.g., checking for AI hallucinations like fabricated references).
- Variability/predictability in prose and sentence structure (e.g., avoiding repetitive, robotic phrasing that signals over-reliance on AI).
- Depth of analysis and synthesis (e.g., does it connect ideas meaningfully or just list summaries?).
- Originality and critical insight (e.g., evidence of unique perspectives vs. generic regurgitation).
- Ability of the authors to engage meaningfully with the text produced (e.g., can the user clearly explain the text in basic terms?).
Humans are not more accurate than AI in literature reviews – they are just more trusted
The details of this experiment are available in my upcoming post, but for today’s purposes I will share the highlights. In short, the results revealed several things about AI and research that I had not seen in my searches on YouTube, Google, and other media platforms. First, and surprisingly, the manual approach was the slowest BUT NOT the most accurate. Many students either misinterpreted the key findings of the original articles or drew sweeping conclusions from very minor observations. Second, students often missed critical themes they were not already familiar with. For example, articles discussing equity in green space access were frequently overlooked if students lacked prior exposure, leading to incomplete syntheses. This highlights a common human limitation in manual reviews: cognitive biases and familiarity gaps can reduce accuracy, even when the process feels thorough.
In contrast, the AI-only outputs from ChatGPT and Claude were faster but inconsistent: ChatGPT hallucinated 12% of citations (e.g., inventing studies on “urban forests and dopamine levels”), while Claude was more conservative but overly generic in its synthesis. Both tools amplified dataset biases, such as prioritizing Global North studies on mental health, which could skew social science applications in diverse settings such as North America’s multicultural planning contexts.
In all, the hybrid approach proved the best of the three. Although this is not a new finding, the realization that perceptions of human accuracy are perhaps exaggerated pushed me to reconsider the limits of human ability in research and the role AI could play in it.
AI vs. manual literature review: a systematic comparison of ChatGPT, Claude, and human researchers
| Evaluation Criteria | Manual Only | AI Only | Hybrid |
|---|---|---|---|
| Accuracy of Facts & Citations | 6.5/10 | 7/10 | 9.5/10 |
| Variability in Prose & Structure | 10/10 | 5.5/10 | 8/10 |
| Depth of Analysis & Synthesis | 6/10 | 8/10 | 8.5/10 |
| Originality & Critical Insight | 7/10 | 7/10 | 8.5/10 |
| Author Engagement with Text | 9/10 | 3/10 | 8/10 |
| OVERALL SCORE | 7.7/10 | 6.1/10 | 8.5/10 |
⏱️ Time to Complete (Average)
- Manual Only: 6.5 hours
- AI Only: 0.5 hours
- Hybrid Approach: 3.5 hours (46% time savings)
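The overall scores in the table are consistent with an unweighted average of the five criteria, and the 46% figure is the time saving relative to the manual baseline. The short sketch below reproduces both numbers under those assumptions; the averaging rule is my inference from the table, not a separate weighting scheme.

```python
# Reproduce the overall scores (unweighted mean of the five criteria)
# and the hybrid time saving relative to the manual baseline.
# The averaging rule is inferred from the table, not a separate weighting scheme.

scores = {
    "Manual Only": [6.5, 10, 6, 7, 9],
    "AI Only":     [7, 5.5, 8, 7, 3],
    "Hybrid":      [9.5, 8, 8.5, 8.5, 8],
}

for condition, values in scores.items():
    overall = sum(values) / len(values)      # unweighted mean across the five criteria
    print(f"{condition}: {overall:.1f}/10")  # prints 7.7, 6.1, 8.5 as in the table

manual_hours, hybrid_hours = 6.5, 3.5
saving = (manual_hours - hybrid_hours) / manual_hours
print(f"Hybrid time saving vs. manual: {saving:.0%}")  # ~46%
```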
The birth of this blog
Following this rather intriguing class experiment, many of my students suggested I take a leading role in sharing my experiments to help inform the public about the misconceptions, the ethical challenges, and the innovative ways of leveraging the AI evolution to society’s benefit. This is how this blog was born.
This will not be another AI productivity blog promising to “10x your research output.” This is a space for evidence-based evaluation of AI tools in research contexts, grounded in the same standards of rigor and transparency I teach my students.
Here’s what you’ll find:
- How specific AI tools perform on specific research tasks, with methodology documented
- Exactly how I tested, what I measured, and what could bias the results
- What works, what doesn’t, and the limitations of both
- Not just “can we use this?” but “when, how, and with what disclosure?”
- Step-by-step processes you can adapt to your context
Subscribe to the newsletter for updates when I publish new tool evaluations and frameworks. And if you have specific AI tools or use cases you’d like me to test, leave them in the comments or reach out to me via my contact page.
About the Author
Dr. Amasiya is a social scientist and research methodologist who specializes in evidence-based evaluation of AI tools for research and professional contexts. With nearly a decade of experience teaching research methods, Dr. Amasiya applies systematic testing protocols to assess AI claims and develop transparent frameworks for ethical AI use.