Everyone agrees AI can’t be an author. Almost nobody agrees on anything else.

In my first post, I described the moment I realized we had replaced ethics with regulations, reducing the AI debate to “is it allowed or not?” In the second, I shared the results of a classroom experiment that challenged the assumption that humans are more accurate than AI in literature reviews. The data showed humans are simply more trusted, not necessarily more accurate.

But both posts raised a question I kept getting in my inbox: What are the actual rules? What do institutions, journals, and workplaces actually say about AI use — and do they even agree with each other?

Fair question. Before we can test whether specific AI tools meet the standards for responsible use (which is what this blog exists to do), we need to know what those standards are. We need a baseline. And to build that baseline, I spent the past several weeks doing something I wish someone had done a long time ago: reading the actual policies.

Not summaries. Not opinion pieces about the policies. The policies themselves from universities, journals, publishers, professional bodies, funding agencies, consulting firms, law firms, newsrooms, creative industries, and governments. Over a hundred of them, spanning from late 2022 to today.

What I found was both reassuring and deeply frustrating.

The one thing everyone agrees on

Let me start with the good news. Across every sector I examined (academia, publishing, industry, law, government), there is exactly one principle that enjoys universal, uncontested agreement: AI cannot be an author.

Not a single major university, journal, publisher, professional association, or ethics body disagrees. The Committee on Publication Ethics (COPE) was among the first to formalize this, stating that AI tools “cannot meet the requirements for authorship as they cannot take responsibility for the submitted work.” The International Committee of Medical Journal Editors (ICMJE) expanded on this in their January 2024 update, noting that chatbots “should not be listed as authors because they cannot be responsible for the accuracy, integrity, and originality of the work.” Elsevier, Springer Nature, Wiley, Taylor and Francis, SAGE, IEEE, ACM — every major publisher echoes this position.

The logic is straightforward: authorship implies accountability. If an AI fabricates a citation (as ChatGPT did with 12% of citations in my own experiment), who gets sanctioned? The model cannot be hauled before an ethics committee, cannot have a paper retracted from its CV, cannot lose tenure. The human can. So the human must be the author.

This is one of those rare moments in academic governance where there is no meaningful dissent. Enjoy it, because it is the last one.

From panic to pragmatism in three years

To understand why everything beyond that single principle is contested, it helps to trace how we got here. The trajectory from ChatGPT’s launch in November 2022 to today follows a remarkably clear arc — and it is faster than almost anything I have seen in institutional policymaking.

The panic phase (late 2022 to early 2023). ChatGPT reached 100 million users in two months. New York City’s Department of Education banned it on all school devices. Sciences Po in Paris threatened expulsion. Multiple Australian states blocked it on school networks. Italy became the first Western country to ban ChatGPT outright (though only for about two weeks). Science magazine banned all AI-generated text entirely. The editor-in-chief, Holden Thorp, explained the reasoning with admirable honesty: “It’s a lot easier to loosen our criteria than it is to tighten them.”

I remember this phase well. It was the period I described in my first post — jumping from “that should definitely be treated as academic dishonesty” to “there is actually no generally accepted way to determine if someone has or has not used AI.”

The cautious exploration phase (mid-2023). NYC reversed its ban in May. The UK’s Russell Group, representing 24 leading universities, published five guiding principles explicitly rejecting blanket bans. Their CEO stated: “The transformative opportunity provided by AI is huge and our universities are determined to grasp it.” This was the phase when most institutions stopped asking “should we allow it?” and started asking “how do we manage it?”

It was also the phase of the Mata v. Avianca ruling — the case that changed everything in the legal profession. Attorneys Steven Schwartz and Peter LoDuca used ChatGPT to draft legal briefs that contained entirely fabricated cases with fake citations. They were fined $5,000. The judge noted there is “nothing inherently improper about using a reliable artificial intelligence tool for assistance,” but the attorneys had failed to verify anything. By 2025, researchers documented over 300 cases of AI hallucinations in court filings — escalating from roughly two per week to two or three per day.

The framework development phase (late 2023 to 2024). Biden’s AI Executive Order (October 2023), the EU AI Act political agreement, the Bletchley Declaration — this is when institutions moved from ad hoc responses to structured governance. The American Bar Association issued its first formal ethics opinion on generative AI. NIST published its generative AI risk profile. Universities moved from instructor-level discretion to institutional frameworks.

The nuanced integration phase (2025 to present). Trump revoked Biden’s executive order on his first day in office. The EU began enforcing AI Act prohibitions. NIH dramatically tightened restrictions on AI in grant applications. Meanwhile, the University of Michigan rolled out a full suite of AI tools for all students and staff, the University of Sydney declared that AI cannot be prohibited in open assessments, and 31 US states enacted AI legislation. The landscape became simultaneously more sophisticated and more fragmented.

In barely three years, we went from “ban everything” to a global governance architecture involving every major institution. The speed is remarkable. The coherence is not.

From panic to pragmatism in three years

How global AI policy evolved from ChatGPT’s launch (Nov 2022) to today

Nov 2022 – Mar 2023: Panic & Prohibition. ChatGPT reaches 100M users in two months. Institutions react with bans they will later reverse. The dominant question: “How do we stop this?”

  • Jan ’23: NYC schools ban ChatGPT
  • Jan ’23: Sciences Po threatens expulsion
  • Jan ’23: Science magazine bans AI text
  • Mar ’23: Italy bans ChatGPT (2 weeks)
  • Mar ’23: Australian states block ChatGPT on school networks

Apr – Sep 2023: Cautious Exploration. Bans reversed. Universities begin structured policies. The legal profession gets its wake-up call. The question shifts: “How do we manage this?”

  • May ’23: NYC reverses ChatGPT ban
  • Jun ’23: Mata v. Avianca exposes fake AI citations
  • Jun ’23: NIH bans AI in peer review
  • Jul ’23: Russell Group rejects blanket bans
  • May–Sep ’23: WGA strike (148 days)
  • Jul–Nov ’23: SAG-AFTRA strike (118 days)

Oct 2023 – Dec 2024: Framework Development. Governments legislate. Publishers formalize policies. Professional bodies issue ethics guidance. The question evolves: “What are the rules?”

  • Oct ’23: Biden AI Executive Order (110 pages)
  • Nov ’23: Bletchley Park AI Safety Summit
  • Dec ’23: EU AI Act political agreement
  • Jan ’24: ICMJE expands AI guidance
  • Jul ’24: ABA Formal Opinion 512 on AI
  • Aug ’24: EU AI Act enters into force

Jan 2025 – Present: Nuanced Integration. Sophisticated and fragmented. Some institutions push full integration; others tighten restrictions. The real question: “Are we doing this right?”

  • Jan ’25: Trump revokes Biden AI order
  • Feb ’25: EU AI Act prohibitions take effect
  • 2025: NIH limits AI in grant applications
  • 2025: U-Michigan launches full AI suite
  • 2025: U-Sydney rules AI cannot be banned in open assessments
  • 2025: 31 US states enact AI laws
  • 2025: 300+ AI hallucination cases in courts

By the numbers: 3 years of evolution, 0 major universities still banning AI, 31 US states with AI laws, 92% of UK students using AI (2025), and 300+ AI hallucination cases in courts.

Source: Institutional policies, government records, and published audits reviewed Jan–Feb 2026  |  dramasiya.com

What universities actually say (and where they contradict each other)

But the picture gets messy — and, frankly, this is where the “replaced ethics with regulations” problem I described in my first post becomes starkly visible.

I examined AI policies from universities across the US, UK, Canada, and Australia. The variation is extraordinary — not just between countries, but between universities in the same city, and sometimes between departments in the same university.

What happens when a syllabus says nothing about AI? This is one of the most common scenarios a student faces, and the guidance is flatly contradictory. At MIT, if a syllabus is silent, students should assume AI is not allowed. At Stanford, the Board on Conduct Affairs says AI use should be “treated analogously to assistance from another person” — which, depending on your reading, either permits or restricts it. At UCL in London, their three-category framework implies that assistive AI use (brainstorming, background research) is acceptable even without explicit permission. For a student transferring between institutions — or even switching courses within one — the rules change with the hallway they walk down.

Can instructors ban AI entirely? At Stanford’s Graduate School of Business, instructors actually may not ban AI on take-home coursework, though they can restrict in-class exams. At Columbia, using generative AI without instructor permission is outright prohibited and constitutes unauthorized assistance. At the University of Sydney, AI cannot be prohibited in unsupervised assessments as of mid-2025. Their Deputy Vice-Chancellor explained: “Generative AI has already had a profound impact on workplaces and our graduates are expected to demonstrate skilled use of the relevant tools in job interviews.”

Think about this from the perspective of students and early career researchers trying to figure out what is acceptable. Stanford says your business school professor cannot stop you from using AI at home. Columbia says using it without asking is a violation. Sydney says banning it in open assessments is the violation. These are all elite institutions. They are all acting in good faith. And they are giving fundamentally incompatible guidance.

Same question. Different university. Opposite answer.

How elite institutions answer the same three questions about AI use

Can instructors ban AI on take-home work?
  • Stanford GSB: Instructors may not ban AI on take-home coursework. (AI protected)
  • Columbia: Using AI without instructor permission is prohibited — unauthorized assistance. (Banned by default)
  • U. of Sydney: AI cannot be prohibited in unsupervised assessments as of mid-2025. (Banning AI banned)

If a syllabus says nothing about AI, what should students assume?
  • MIT: Assume AI is not allowed unless explicitly stated. (Default: no)
  • Stanford: Treat AI use “analogously to assistance from another person.” (Ambiguous)
  • UCL: Assistive AI use (brainstorming, background research) is acceptable. (Default: yes, assistive)

Should universities use AI detection tools?
  • Princeton: “Unreliable at best and biased at worst.” Not recommended. (Rejected)
  • U. of Melbourne: Has used Turnitin AI detection since April 2023 as one factor in investigations. (In use)
  • ANU: Trialled Turnitin in 2023, then discontinued, citing false positives and bias. (Tried & dropped)

Source: Institutional AI policies reviewed Jan–Feb 2026  |  dramasiya.com

Do AI detection tools work? This one matters because it determines whether any of these policies are enforceable. Princeton explicitly recommends against AI detection tools, calling them “unreliable at best and biased at worst.” Berkeley discourages them, citing bias and inaccuracy. Australia’s national university (ANU) trialled Turnitin’s AI detection tool in 2023 and elected not to continue, reporting “significant issues regarding its efficacy (including false positives), potential bias, and lack of transparency.” Bath, in the UK, ruled them out as “fundamentally flawed.”

Meanwhile, the University of Melbourne has used Turnitin’s AI detector since April 2023. And a 2023 evaluation of 14 AI-detection tools found all scored below 80% accuracy, with only five above 70%. Simple prompt engineering reduced Turnitin’s detection rate from 100% to 0%. Most concerning, AI detectors show bias against non-native English speakers — a finding that should give every international university pause.

As I noted in my first post: we need to know the mistakes before we can test. Detection is a mistake we need to talk about honestly.

The integration pioneers are worth watching. The University of Michigan describes itself as the “first university in the world to provide a custom suite of generative AI tools to its community” — including custom GPT-4o, DALL-E, Llama, and Claude implementations that meet FERPA data security standards, all free for students and staff. They are not just allowing AI; they are building institutional infrastructure around it. Whether this becomes the model or an outlier will be one of the most important questions in higher education over the next two years.

What journals and publishers say (and the enforcement gap)

If universities are a patchwork, academic publishers present a more coherent picture — at least on the surface. The universal agreement that AI cannot be an author is accompanied by widespread agreement that AI use must be disclosed, that humans bear full responsibility for all content, and that AI-generated images are generally prohibited.

But the specifics diverge considerably, and the enforcement gap is enormous.

Disclosure requirements vary widely. Elsevier requires a mandatory “Declaration of Generative AI” section using a provided template. Taylor and Francis wants “the full name of the tool used (with version number), how it was used, and the reason for use.” SAGE makes a distinctive split: what they call “Assistive AI” (i.e., refining your own work) does not require disclosure at all, while “Generative AI” (i.e., producing new content) must be disclosed, cited in-text, and included in the reference list. SAGE is actually the only major publisher that treats AI as a citable source. Meanwhile, Springer Nature exempts basic copy editing from disclosure entirely.

So if you use Claude to polish the grammar in your methods section, SAGE says that is assistive and needs no disclosure. Taylor and Francis says disclose the tool name, version, how, and why. Springer Nature says copy editing is exempt. Elsevier says use the template. For the same action, four different publishers give four different answers.

Same action. Four publishers. Four different rules.

How major academic publishers handle AI disclosure requirements

The scenario

You use Claude to polish the grammar and clarity of your methods section. What must you disclose?

Publisher requirements at a glance:
  • Elsevier: disclose grammar/copy editing; disclose content generation; AI images prohibited; AI not a citable source; disclosure format: mandatory template section at the end of the manuscript.
  • Springer Nature: grammar/copy editing exempt; disclose content generation; AI images prohibited; AI not a citable source; disclosure format: methods or acknowledgments section.
  • Taylor & Francis: disclose grammar/copy editing; disclose content generation; AI images prohibited; AI not a citable source; disclosure format: tool name, version, how it was used, and reason for use.
  • SAGE: grammar/copy editing exempt (“Assistive AI”); disclose content generation (“Generative AI”); AI images case-by-case; the only publisher treating AI as a citable source; disclosure format: in-text citation plus reference list entry.

Key finding

For the exact same action — using AI to polish grammar — Elsevier says disclose via template, Taylor & Francis says disclose with version details, Springer Nature says no disclosure needed, and SAGE says it depends on whether you classify this as “assistive” or “generative.” Four publishers. Four answers. One confused researcher.

Source: Publisher AI policies as of Feb 2026  |  dramasiya.com

The line on peer review is clearer, and more revealing. Elsevier, NIH, and virtually every major funder prohibit AI in peer review. The logic is confidentiality: uploading a manuscript to an AI tool potentially exposes unpublished data to a third-party system. This is the one area where I see genuine consensus beyond the authorship question. Yet a study across 49 BMJ Group journals found that self-reported AI use among reviewers was “significantly lower” than actual usage, and only 0.7% of JAMA Network reviewers self-reported using AI over a two-year period. Up to 22% of computer science papers now show signs of LLM-generated text, and roughly 17% of peer review reports at a top CS conference showed ChatGPT-generated characteristics.

In other words: the policies are clear, but enforcement relies almost entirely on self-disclosure, and the data suggests people are not disclosing.

This is exactly the gap I warned about in my first post. Regulations without ethics. Rules without internalization. Policies without the transparent conversations required to make them meaningful.

Beyond academia, industry is moving faster (and with less scrutiny)

One of the things that struck me during this research is how different the AI conversation is inside corporations compared to academia. In universities, we are still debating whether students can use ChatGPT on a take-home essay. In industry, the question is how to scale AI across every function while managing liability.

Consulting firms have gone all-in. McKinsey’s proprietary AI chatbot “Lilli” is used by over 70% of its 45,000 consultants, approximately 17 times per week, reportedly reducing research and synthesis time by around 30%. Deloitte committed $3 billion to AI capabilities. PwC invested $1 billion over three years. Accenture plans to double its AI workforce from 40,000 to 80,000 specialists.

But here is the detail that caught my methodologist’s eye: at McKinsey, nearly 1 in 4 AI-drafted deliverables requires substantial rewriting. That 25% figure should sound familiar. In my experiment, AI-only output scored 6.1/10 overall — competent enough to impress on first glance, unreliable enough to need significant human intervention. The corporate data mirrors the classroom data.

The legal profession learned the hard way. The Mata v. Avianca case was not an isolated incident — it was the first in a cascade. Over 300 documented cases of AI-generated fabricated citations have appeared in court filings since mid-2023. The American Bar Association’s Formal Opinion 512 (July 2024) now maps AI obligations to six existing Model Rules: competence (you must understand AI capabilities and limitations), confidentiality (evaluate data risks before inputting anything), communication (inform clients of AI use), candor toward tribunal (verify all citations), supervisory responsibilities (establish firm-wide policies), and fees (you cannot charge clients for your general AI learning curve).

Journalism drew bright lines. The Associated Press released guidelines stating that “AI cannot be used to create publishable content and images for the news service” and that “any output from a generative AI tool should be treated as unvetted source material.” The Guardian committed to using AI “only with clear evidence of a specific benefit, human oversight, and the explicit permission of a senior editor.” These are among the clearest, most operational policies I have seen in any sector.

Creative industries fought for protections. The 2023 WGA and SAG-AFTRA strikes — lasting 148 and 118 days respectively — were substantially about AI. The resulting agreements established that AI cannot be considered a “writer,” AI-produced material cannot be “literary material” under guild contracts, and companies must disclose if material given to a writer was AI-generated. Getty Images banned AI-generated image submissions in September 2022 and sued Stability AI.

The government regulatory landscape is a geopolitical fault line

If the university policies are a patchwork and the publisher policies are coherent-but-unenforced, government regulation is a full-blown geopolitical divergence.

The EU AI Act is the world’s first comprehensive AI regulation, using a four-tier risk framework from “unacceptable” (banned) to “minimal” (unregulated). Education is classified as “high-risk.” Penalties reach 35 million euros or 7% of worldwide annual turnover. The first enforcement provisions took effect in February 2025.

The United States went in the opposite direction. Biden’s 2023 Executive Order (the longest in history at 110 pages) directed over 50 federal entities to take action on AI safety, equity, and innovation. Trump revoked it on his first day in office and signed a new order prioritizing “removing barriers” to AI development. The NIST AI Risk Management Framework remains a widely used voluntary standard, but the federal regulatory direction shifted decisively from precaution to promotion. Meanwhile, 31 states have enacted their own AI laws, and in December 2025, Trump issued an executive order seeking to preempt state-level regulation — creating a regulatory tug-of-war.

The UK chose a “pro-innovation” approach: five principles (safety, transparency, fairness, accountability, contestability) implemented by existing sector regulators rather than a single AI act. Canada’s proposed AI legislation died when Parliament was prorogued. Australia relies on voluntary principles and guidelines.

For researchers and professionals who work across borders — which, in 2026, is most of us — this means the regulatory framework governing your AI use depends not just on your institution or your journal, but on your jurisdiction. The same action might be unregulated in Texas, governed by a voluntary code in London, and subject to a high-risk compliance regime in Brussels.

The ethical issues nobody has solved

Beneath the policy specifics, several fundamental ethical challenges run through every sector, and none of them have been resolved.

The copyright question is fracturing in real time. The US Copyright Office confirmed that purely AI-generated material is not copyrightable and that human authorship is required. But the question of whether AI companies can train on copyrighted material is producing contradictory court decisions. One federal ruling rejected fair use for training on legal headnotes. Another called training on copyrighted books “transformative — spectacularly so.” Over 50 federal lawsuits between IP holders and AI developers were pending as of 2025. The Supreme Court may eventually weigh in, but for now, there is no settled law.

Bias is documented and persistent. Amazon scrapped a recruitment AI that favored male candidates. Facial recognition error rates remain significantly higher for darker-skinned individuals. A July 2025 study found chatbots recommended lower salaries for women and minority candidates with identical qualifications. Google suspended its Gemini image generator over historical inaccuracies. These are not theoretical concerns — they are measured, quantified failures with real consequences.

And this connects directly to my experiment’s finding that both ChatGPT and Claude amplified dataset biases, prioritizing Global North studies on mental health. When we talk about AI bias in research contexts, we are not just talking about fairness in hiring algorithms. We are talking about whose knowledge counts, whose voices are amplified, and whose are erased — the fundamental questions of social science methodology.

The environmental cost is growing and poorly governed. Global data center electricity consumption reached approximately 415 TWh in 2024, projected to nearly double by 2030. Training GPT-3 alone generated an estimated 552 tons of CO2. Most tech companies do not distinguish AI from non-AI workloads in their environmental reports. No binding framework requires AI-specific environmental mitigation. This is an ethical dimension that gets almost no attention in the “should students use ChatGPT” debate — but it should.

So where does this leave us?

Let me be direct about what this landscape review tells us.

The consensus is thinner than it appears. Yes, everyone agrees AI cannot be an author. Everyone agrees on disclosure. Everyone agrees humans bear responsibility. But the moment you move from principles to practice — how much AI use is acceptable, what must be disclosed, how to enforce compliance, whether detection tools work — the agreement evaporates.

The “ethics vs. regulations” gap I identified in my first post is everywhere. Universities have policies but contradictory implementation. Publishers have disclosure requirements but minimal enforcement. Governments have frameworks but jurisdictional fragmentation. The rules exist. The shared ethical understanding of why those rules matter, and how to apply them in practice, largely does not.

This is exactly why testing matters. Going back to the tagline of this blog: we need to know the mistakes before we can test. This post documents the mistakes — the contradictions, the enforcement gaps, the unresolved tensions. With this baseline established, we can now ask the operational questions that matter: When a journal says “disclose AI use,” what does competent disclosure actually look like? When a university says “AI-assisted brainstorming is permitted,” where does brainstorming end and drafting begin? When a consulting firm says AI output requires “substantial rewriting” 25% of the time, what are the failure patterns and how do you catch them?


In the next post, I will share the results of a dedicated comparison of specialized AI literature review tools — SciSpace, Consensus AI, and Elicit AI — tested against the same 5-point framework used in my original experiment. If you want to be notified when it goes live, subscribe to the newsletter or follow on social media.

About the Author

Dr. Amasiya is a social scientist and research methodologist who specializes in evidence-based evaluation of AI tools for research and professional contexts. With nearly a decade of experience teaching research methods at the graduate level, Dr. Amasiya applies systematic testing protocols to assess AI claims and develop transparent frameworks for ethical AI use.

Should you trust AI literature reviews?

In my previous post, I shared the surprising results from a classroom experiment comparing manual, AI-only, and hybrid approaches to literature reviews. The finding that challenged many readers’ assumptions was that AI actually outperformed humans on factual accuracy.

As promised, this post sheds light on what I actually did and addresses questions readers have raised since the blog went live.

From the start, let me clarify that I did not set out to design this experiment (though going forward, I will). It emerged during my methods seminar, when a student asked whether the talk of AI hallucinations and AI “not being good enough” just yet was really a gatekeeping strategy used by us academics to keep students away from a technology many of us clearly benefit from. After his question, the discussion took an interesting turn. We all agreed that most AI tool reviews follow a predictable pattern: someone tries a tool, likes what they see, and declares it revolutionary; or they encounter a hallucination and warn everyone to avoid it. We also agreed that neither approach gives us useful guidance for deciding whether the student’s claim was right.

To put the debate to bed, we agreed to undertake this activity, which happened to align quite well with that week’s assignment: a literature review paper synthesizing 10 articles. The experiment therefore tested one of the most common research tasks, synthesizing findings from academic literature.

The experimental design

I worked with 22 graduate students enrolled in a research methods seminar. All participants had completed undergraduate research training and were familiar with literature review conventions. Experience levels ranged from first-year master’s students to advanced doctoral candidates.

Each student wrote a literature review on the impact of green spaces on mental health. Everyone used exactly the same 10 peer-reviewed articles that I provided as instructor. The constraints were straightforward: maximum 5 pages of text, APA citation format, and they recorded the time spent on the task.

The 10 articles covered diverse methodologies (experimental, observational, systematic reviews), geographical contexts (urban planning in North America, European health studies, Asian environmental research), and populations (children, elderly, general adults).

Three Conditions

Manual Only (Baseline)

To establish a baseline, students read the 10 articles and wrote the literature review entirely on their own, using traditional methods like reading, note-taking, and writing. This served as our control condition.

AI Only

I fed the same 10 articles into two AI tools (ChatGPT-5.2 and Claude 4.5 Sonnet) with identical prompts. I asked each tool to write a 5-page literature review on the impact of green spaces on mental health using these 10 articles, following APA format and synthesizing the findings thematically.

I evaluated both AI outputs independently to determine which performed better according to the 5-point framework detailed below. Students did not participate in this condition. This was purely AI-generated content.

Hybrid Approach

After completing their manual literature reviews, students received the AI outputs from both ChatGPT and Claude. They selected which AI tool produced the better output, then collaboratively edited their original paper using the selected tool. They refined sections, improved synthesis, verified citations, and disclosed AI assistance in acknowledgments.

5-Point evaluation framework

I developed this framework specifically to assess literature reviews in the age of AI. Each criterion addresses a different dimension of quality and integrity. This framework was developed using a thematic analysis of the key concerns I gathered mainly from Reddit, X, and Facebook. In a future blog, I hope to share more on the computational text analysis process that yielded this framework.

1. Accuracy of facts and citations (Weight: 30%)

This measures whether the claims made in the literature review are actually supported by the source articles and whether citations are correct.

For scoring, I randomly selected 10 factual claims from each literature review. I traced each claim back to the original article, verified citation accuracy (correct author, year, page numbers), and checked for hallucinated references or findings.

The common errors I found varied by approach. For manual reviews, students often misinterpreted statistical significance, claiming strong effects when results were marginal, or overgeneralized from one study’s specific population to all populations. For AI reviews, ChatGPT fabricated citations like a study on “Urban forests and dopamine levels” that doesn’t exist, while Claude occasionally merged findings from multiple studies into one citation.
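To make the verification procedure concrete, here is a minimal sketch (in Python) of the kind of bookkeeping it involves. The claim texts, verdict labels, citation strings, and helper names are illustrative assumptions, not the actual materials or scripts used in the experiment.

```python
import random
from dataclasses import dataclass

@dataclass
class Claim:
    text: str          # the factual claim as written in the review
    cited_source: str  # e.g. "Lee et al., 2020" (hypothetical)
    verdict: str = ""  # "supported", "misinterpreted", or "hallucinated"

def sample_claims(claims, n=10, seed=42):
    """Randomly draw n claims from one review for manual verification."""
    rng = random.Random(seed)
    return rng.sample(claims, min(n, len(claims)))

def accuracy_summary(verified):
    """Tally verdicts after each sampled claim has been traced to its source."""
    counts = {"supported": 0, "misinterpreted": 0, "hallucinated": 0}
    for c in verified:
        counts[c.verdict] = counts.get(c.verdict, 0) + 1
    total = len(verified)
    return {k: round(v / total, 2) for k, v in counts.items()}

# Hypothetical example: 10 sampled claims, one of which cites a fabricated study.
claims = [Claim(f"Claim {i}", "Lee et al., 2020", "supported") for i in range(9)]
claims.append(Claim("Urban forests raise dopamine levels", "(no such study)", "hallucinated"))
print(accuracy_summary(sample_claims(claims)))
# {'supported': 0.9, 'misinterpreted': 0.0, 'hallucinated': 0.1}
```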

2. Variability in prose and sentence structure (Weight: 15%)

This criterion assesses whether the writing sounds natural and varied, or repetitive and formulaic.

I analyzed sentence length distribution (standard deviation of words per sentence), checked for repeated sentence structures or transition phrases, and identified robotic patterns like “Additionally,” “Furthermore,” and “Moreover” used mechanically.

Human writing showed natural variability, with sentences typically ranging from 15 to 30 words and a wide spread around that average. AI writing was noticeably formulaic. ChatGPT especially repeated structures like “Research shows that… This suggests…” Hybrid writing maintained human variability while improving clarity.

Another interesting pattern was punctuation. Both ChatGPT and Claude consistently used em dashes and colons, whereas students rarely, if ever, used either.
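For readers who want to run this kind of check on their own drafts, here is a minimal sketch of the sentence-length and phrasing analysis described above. The formulaic-opener list and the sample passage are illustrative assumptions, not the exact criteria I used.

```python
import re
from statistics import mean, stdev

FORMULAIC_OPENERS = ("additionally", "furthermore", "moreover")  # illustrative list

def prose_variability(text):
    """Sentence-length spread plus counts of formulaic transitions and punctuation."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return {
        "sentences": len(sentences),
        "mean_len": round(mean(lengths), 1),
        "stdev_len": round(stdev(lengths), 1) if len(lengths) > 1 else 0.0,
        "formulaic_openers": sum(s.lower().startswith(FORMULAIC_OPENERS) for s in sentences),
        "em_dashes": text.count("\u2014"),
        "colons": text.count(":"),
    }

sample = ("Research shows that green spaces reduce stress. Additionally, access "
          "matters for low-income residents. Furthermore, effects vary by age group.")
print(prose_variability(sample))
```

A higher standard deviation of sentence length and fewer mechanical openers is what the human-written reviews looked like; the AI-only drafts sat at the opposite end.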

3. Depth of analysis and synthesis (Weight: 25%)

This measures whether the review merely summarizes studies or identifies patterns, contradictions, and gaps.

I counted thematic connections made across studies, evaluated identification of contradictions in the literature, and assessed recognition of methodological limitations and research gaps.

Here’s what surprised me. AI (especially Claude) excelled at identifying cross-study patterns that humans missed. For example, AI consistently noticed that studies on green space access and mental health outcomes showed stronger effects for low-income populations. This theme appeared across 6 of the 10 articles but was only mentioned in 3 of the 22 manual reviews.

However, AI struggled to recognize when contradictions might stem from methodological differences rather than substantive disagreements. It also rarely proposed novel theoretical frameworks to explain patterns, and the explanations it did offer at times felt forced.
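One simple way to operationalize “thematic connections across studies” is to code which source articles and which reviews mention each theme, then compare coverage. The sketch below is illustrative: the article and student IDs are made up, and only the equity figures (6 of 10 articles, 3 of 22 manual reviews) come from the experiment.

```python
def theme_coverage(mentions, theme, total):
    """Fraction of documents (articles or reviews) that mention a given theme."""
    return round(len(mentions.get(theme, set())) / total, 2)

# Hypothetical coding of which of the 10 source articles touch each theme.
article_mentions = {
    "equity_in_access": {"A1", "A3", "A4", "A6", "A8", "A9"},  # 6 of 10 articles
    "dose_response": {"A2", "A5"},
}
# Hypothetical coding of which of the 22 manual reviews picked the theme up.
review_mentions = {
    "equity_in_access": {"S04", "S11", "S17"},  # 3 of 22 reviews
}

print(theme_coverage(article_mentions, "equity_in_access", total=10))  # 0.6
print(theme_coverage(review_mentions, "equity_in_access", total=22))   # 0.14
```

The gap between the two numbers is exactly the kind of missed theme the AI outputs surfaced and most manual reviews did not.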

4. Originality and critical insight (Weight: 20%)

This examines whether the review offers unique perspectives or critical evaluation, or just regurgitates summaries.

I looked for original arguments or interpretations not explicitly stated in the source articles, assessed critical evaluation of study quality, methodology, or generalizability, and identified integration with broader theoretical or practical contexts.

Both manual and AI-only approaches scored identically (7 out of 10) but for different reasons. Some students offered genuinely original insights, but many simply summarized without critical evaluation. AI consistently provided competent but generic synthesis and rarely offered truly novel perspectives.

The hybrid approach scored higher (8.5 out of 10) because students used AI to identify patterns, then added their own critical interpretation.

5. Author engagement with the text (Weight: 10%)

This criterion addresses whether the author can clearly explain and defend what’s written. It tackles the risk of students submitting AI-generated text they don’t understand.

After the students submitted their reviews, I randomly selected 5 from each condition for a brief 10-minute Q&A. I asked them to explain the main argument of their review, followed by three content-specific questions I prepared from reading the submissions. For the AI-only condition, I asked 5 students to each read the AI-generated content and be prepared to answer questions as though the work were their own.

Students working manually could easily explain their work and scored 9 out of 10. Students defending the AI-only work scored 3 out of 10 and later explained the work felt “foreign and hard to understand”. Students using the hybrid approach demonstrated strong understanding and scored 8 out of 10. They could explain what AI contributed and how they refined it.

AI vs. manual literature review: systematic comparison of ChatGPT, Claude, and human researchers

Evaluation Criteria Manual Only AI Only Hybrid
Accuracy of Facts & Citations 6.5/10 7/10 9.5/10
Variability in Prose & Structure 10/10 5.5/10 8/10
Depth of Analysis & Synthesis 6/10 8/10 8.5/10
Originality & Critical Insight 7/10 7/10 8.5/10
Author Engagement with Text 9/10 3/10 8/10
OVERALL SCORE 7.7/10 6.1/10 8.5/10

⏱️ Time to Complete (Average)

  • Manual Only: 6.5 hours
  • AI Only: 0.5 hours
  • Hybrid Approach: 3.5 hours (46% time savings)
💡 Key Finding: AI actually outperformed humans on factual accuracy (7.0 vs 6.5), challenging the assumption that manual reviews are more accurate. However, humans scored far higher on engagement (9.0 vs 3.0), meaning they’re more trustworthy, not more accurate. The hybrid approach (8.5/10) combined AI’s pattern recognition with human critical oversight to achieve the best results while saving 46% of time.

Time to Complete

Manual reviews averaged 6.5 hours. AI-only reviews took 0.5 hours. The hybrid approach averaged 3.5 hours, representing a 46% time savings compared to manual work.
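For readers who want to check the headline numbers: the overall scores in the comparison table correspond to the simple mean of the five criterion scores, and the time saving is computed against the manual baseline. A minimal sketch of the arithmetic:

```python
# Criterion scores copied from the comparison table (Manual, AI-only, Hybrid).
scores = {
    "manual":  [6.5, 10.0, 6.0, 7.0, 9.0],
    "ai_only": [7.0, 5.5, 8.0, 7.0, 3.0],
    "hybrid":  [9.5, 8.0, 8.5, 8.5, 8.0],
}

for condition, vals in scores.items():
    overall = sum(vals) / len(vals)          # simple mean of the five criteria
    print(f"{condition}: {overall:.1f}/10")  # 7.7, 6.1, 8.5 as reported

manual_hours, hybrid_hours = 6.5, 3.5
saving = (manual_hours - hybrid_hours) / manual_hours
print(f"time saving vs. manual: {saving:.0%}")  # 46%
```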

Key observations

AI outperformed humans on factual accuracy

This was the most surprising result. The prevailing assumption is that humans are more accurate than AI because we understand context better. The data showed otherwise. AI scored 7.0 out of 10 versus 6.5 out of 10 for manual reviews on accuracy.

Why did this happen? Humans made interpretation errors by reading their own assumptions into the data. Humans missed relevant information due to selective attention, focusing on familiar themes. These biases were less common in the AI-generated responses.

However, this doesn’t mean AI is inherently more accurate. ChatGPT’s hallucination rate of 12% of citations was unacceptable. Claude performed better but still made errors. The key insight is that both humans and AI make mistakes. Public sentiment, however, appears to be driven by trust rather than accuracy: humans are simply more trusted, not more accurate.

Humans are more trusted because of engagement, not accuracy

The manual approach scored 9 out of 10 on author engagement. Students could explain their reasoning, defend their choices, and identify weaknesses. AI-only scored 3 out of 10 because there was no direct and meaningful author engagement in the writing process.

This distinction matters. When we trust human-written work more than AI-written work, we’re not trusting accuracy. We’re trusting that the author understands what they’ve written and can defend it. We also trust our own capabilities more than we trust a machine or a model.

The implication for research integrity is that if researchers use AI without understanding the output, they undermine trustworthiness even if the content is technically accurate. The hybrid approach preserves author engagement because students could explain how they used AI and what they refined.

AI excels at pattern recognition; humans excel at prose variability and building trustworthiness

AI scored highest on depth of analysis and synthesis (8 out of 10), while humans scored highest on variability in prose (10 out of 10). This suggests complementary strengths. Use AI for identifying themes, detecting contradictions, and ensuring comprehensive coverage. Use human editing for natural phrasing, critical interpretation, and original insight. Human involvement also creates familiarity, and familiarity builds trustworthiness.

The hybrid approach achieved the best results with 46% time savings

The hybrid condition scored 8.5 out of 10 overall, higher than manual (7.7 out of 10) or AI-only (6.1 out of 10), while cutting time nearly in half (3.5 hours versus 6.5 hours). This wasn’t just about speed. The quality improved because AI caught what humans missed, and humans added what AI couldn’t generate.

Students reported that seeing the AI output helped them recognize gaps in their own analysis. One student told me, “I didn’t realize I had completely missed the equity theme until Claude pointed it out. Then I went back to the articles and saw it everywhere.”

What this means for you

AI is certainly not as bad as academic circles would have us believe, nor as good as “tech Twitter” makes it out to be. Using it well requires a balancing act and, more importantly, raises ethical questions of authorship and acceptability in different contexts. In the coming weeks, we will explore these topics in detail. But for now, if you’re a researcher, student, or professional conducting literature reviews, the practical takeaways are these: don’t rely on AI alone, and use AI for pattern recognition, not final output.


About the Author

Dr. Amasiya is a social scientist and research methodologist who specializes in evidence-based evaluation of AI tools for research and professional contexts. With nearly a decade of experience teaching research methods at the graduate level, Dr. Amasiya applies systematic testing protocols to assess AI claims and develop transparent frameworks for ethical AI use.

Contradictory advice, biased reviews, and why I started a blog about AI in research

I like to think of myself as a technology enthusiast, generally willing to embrace new advances. But as an academic who has spent nearly a decade teaching research methods and defending rigor and transparency in research practice, I faced a significant dilemma when AI began to flood the internet.

Like many academics, I panicked, because academia as we know it has always been slow to adapt, and the disruption was something many of us were not prepared for. I remember jumping from ‘that should definitely be treated as academic dishonesty and penalized accordingly’ to feeling helpless that ‘there is actually no (generally accepted) way to determine if someone has or has not used AI in their research’.

What made it even scarier was how difficult it was to have conversations about AI with fellow professors, lest you be perceived as using AI to write your own research. This is one of the worst-kept secrets in academia: those who use AI the most in private are often the quickest to dismiss any move towards accepting AI in research practice.

The coffee chat revelations

But as the storm settled, I started to engage both colleagues and students during my usual morning ‘coffee chats’ and discovered something profound.

On one occasion, a senior colleague disclosed to me that they “no longer know what to think about this whole AI thing.” On the one hand, they believed it was totally unethical, and on the other hand, they accepted we cannot stop ‘progress.’ So, they were among the first in my university to start advocating for institutional-level guidance on AI use. The cognitive dissonance was palpable—they clearly wanted rules because they couldn’t reconcile their ethical concerns with the practical reality that AI wasn’t going away.

On another occasion, a student mentioned to me that “serious universities” now accept AI use, so why was our university pushing back? When I quizzed him for evidence, he shared a blog post filled with ads and affiliate links. The author claimed to be a student at an Ivy League university who was allowed to use AI. The post was littered with promotional links to various AI writing tools, each one presumably earning the author a commission. There was no methodology, no institutional verification, just anecdotal claims wrapped in marketing.

This wasn’t an isolated incident. I encountered similar stories from other students—YouTube videos promising to “revolutionize your research,” Twitter threads declaring that “all top researchers use AI now,” and Medium posts with sweeping claims about AI adoption that, upon closer inspection, were either exaggerated or completely fabricated.

The problem is that we replaced ethics with regulations

After encountering a number of these anecdotes—and at times outright falsehoods from students and colleagues alike—I became concerned that we have reduced the debate around AI use to “is it allowed or not?” In doing so, we have failed to develop usable guidance on when, if, how, and why to use AI tools for research purposes.

In short, we have replaced ethics with regulations.

As a result, we fail to develop a clear understanding and framework for determining which AI tools meet the standard for academic integrity and professionalism. More importantly, we lack the transparent conversations required to develop replicable workflows and to reengineer research practice in a rapidly changing knowledge ecosystem.

Indeed, the regulatory approach creates a binary: allowed or forbidden. But research ethics has never worked that way. We don’t ask “Is interviewing people allowed?” or “Is it allowed to use a professional language editor for a peer-reviewed publication?” We ask: under what conditions can we do these things? With what consent? What are the risks? How do we minimize harm? What transparency is required?

The same nuanced framework should apply to AI tools. Yet instead of rigorous evaluation, we got:

  • Panic-driven policies, with universities banning tools they didn’t understand
  • Vague guidelines that call for “responsible AI use” without defining what is responsible and what is not
  • Contradictory advice, even within the same universities and workplaces, where one professor says never use it and another says it’s the future of work
  • Affiliate-driven reviews from bloggers promoting tools they profit from
  • Vendors promising extraordinary benefits with minimal tangible evidence

Extraordinary claims require extraordinary evidence

As someone who teaches research methods to graduate students, I’ve spent years emphasizing two foundational principles. The first is that extraordinary claims require extraordinary evidence. The second, to borrow from Neil deGrasse Tyson, is that research (and, to a large extent, knowledge creation more broadly) involves doing everything in one’s power not to fool oneself or others into believing that something is true when it is not, or that something is not true when it in fact is.

When a vendor claims their AI tool will “revolutionize your literature review” or “write your dissertation in minutes,” that’s an extraordinary claim. It requires extraordinary evidence—controlled experiments, baseline comparisons, transparent methodology, honest reporting of limitations. 

Yet I was seeing none of that in the AI discourse around research. Instead, I saw:

  • Anecdotal testimonials (“This changed my life!”)
  • Cherry-picked examples (showing only successes, hiding failures)
  • Vague efficiency claims (“Save 80% of your time!” without defining what that means)
  • No baseline measurements (faster than what? More accurate than what?)
  • No acknowledgment of failure modes (what happens when it goes wrong?)

This isn’t how we evaluate anything else in academic research. If a student came to me proposing a new data analysis method and said “trust me, it works great,” I’d send them back to the drawing board with instructions to design a proper evaluation.

AI can, and should, work without exposing knowledge users to career risks 

But beyond financially motivated vendors, we as knowledge producers are complicit when we fail to scrutinize how these tools might amplify biases in fields like the social sciences. Returning to Tyson’s point: have we asked to what extent the AI tools we use help us avoid fooling society about the truth of the knowledge we produce and use?

This is the concern that pushed me to start this website. To understand the extent to which AI tools can help us be responsible knowledge creators and knowledge users, I conducted a simple experiment in my graduate research methods seminar. I asked my graduate students to write a literature review on the impact of green spaces on mental health. All students were instructed to rely on the exact same 10 peer-reviewed articles, not exceed 5 pages of text, and record the time spent writing. This first paper served as the control (manual baseline).

I then fed the same articles into two AI tools (ChatGPT and Claude AI) and instructed them to do the exact same task. Finally, I shared the AI outputs with the students and asked them to select the better tool of the two and work collaboratively with it to edit their papers (e.g., refining sections while disclosing AI assistance).

I then measured the outcomes of this experiment against my 5-point framework for assessing the credibility of literature reviews in the era of AI:

  1. Accuracy of facts and citations (e.g., checking for AI hallucinations like fabricated references).
  2. Variability/predictability in prose and sentence structure (e.g., avoiding repetitive, robotic phrasing that signals over-reliance on AI).
  3. Depth of analysis and synthesis (e.g., does it connect ideas meaningfully or just list summaries?).
  4. Originality and critical insight (e.g., evidence of unique perspectives vs. generic regurgitation).
  5. Ability of the authors to engage meaningfully with the text produced (e.g., can the user clearly explain the text in basic terms?).

Humans are not more accurate than AI in literature reviews – they are just more trusted. 

The details of this experiment are available in my upcoming post. But for the purposes of today’s discussion, I will share the highlights. In short, the results revealed a number of things about AI and research that I had not come across in my searches on YouTube, Google, and many other platforms. First, and surprisingly, the manual approach was the slowest BUT NOT the most accurate. Many students either misinterpreted the key findings of the original articles or drew sweeping conclusions from very minor observations. Second, students often missed critical themes they were not already familiar with. For example, articles discussing equity in green space access were frequently overlooked if students lacked prior exposure, leading to incomplete syntheses. This highlights a common human limitation in manual reviews: cognitive biases and familiarity gaps can reduce accuracy, even when the process feels thorough.

In contrast, the AI-only outputs from ChatGPT and Claude were faster but inconsistent: ChatGPT hallucinated 12% of citations (e.g., inventing studies on “urban forests and dopamine levels”), while Claude was more conservative but overly generic in its synthesis. Both tools amplified dataset biases, such as prioritizing Global North studies on mental health, which could skew social science applications in diverse settings like North America’s multicultural planning contexts.

In all, the hybrid approach proved the best of the three. Although this is not a new finding, the realization that perceptions of human accuracy are perhaps exaggerated pushed me to reconsider the limits of human abilities and the role AI could play in research.

AI vs. manual literature review: systematic comparison of ChatGPT, Claude, and human researchers

Evaluation Criteria Manual Only AI Only Hybrid
Accuracy of Facts & Citations 6.5/10 7/10 9.5/10
Variability in Prose & Structure 10/10 5.5/10 8/10
Depth of Analysis & Synthesis 6/10 8/10 8.5/10
Originality & Critical Insight 7/10 7/10 8.5/10
Author Engagement with Text 9/10 3/10 8/10
OVERALL SCORE 7.7/10 6.1/10 8.5/10

⏱️ Time to Complete (Average)

  • Manual Only: 6.5 hours
  • AI Only: 0.5 hours
  • Hybrid Approach: 3.5 hours (46% time savings)
💡 Key Finding: AI actually outperformed humans on factual accuracy (7.0 vs 6.5), challenging the assumption that manual reviews are more accurate. However, humans scored far higher on engagement (9.0 vs 3.0), meaning they’re more trustworthy, not more accurate. The hybrid approach (8.5/10) combined AI’s pattern recognition with human critical oversight to achieve the best results while saving 46% of time.

The birth of this blog 

Following this rather intriguing class experiment, many of my students suggested I take a leading role in sharing these experiments to help inform the public about the misconceptions, ethical challenges, and innovative ways of leveraging AI to society’s benefit. This is how this blog was born.

This will not be another AI productivity blog promising to “10x your research output.” This is a space for evidence-based evaluation of AI tools in research contexts, grounded in the same standards of rigor and transparency I teach my students.

Here’s what you’ll find:

  • How specific AI tools perform on specific research tasks, with methodology documented
  • Exactly how I tested, what I measured, what could bias the results
  • What works, what doesn’t, and the limitations of both
  • Not just “can we use this?” but “when, how, and with what disclosure?”
  • Step-by-step processes you can adapt to your context

Subscribe to the newsletter for updates when I publish new tool evaluations and frameworks. And if you have specific AI tools or use cases you’d like me to test, leave them in the comments or reach out via my contact page.


About the Author

Dr. Amasiya is a social scientist and research methodologist who specializes in evidence-based evaluation of AI tools for research and professional contexts. With nearly a decade of experience teaching research methods, Dr. Amasiya applies systematic testing protocols to assess AI claims and develop transparent frameworks for ethical AI use.