
Publications

People cannot distinguish GPT-4 from a human in a Turing test

Jones, C. R. & Bergen, B.

ACM FAccT 2025

Interactive, controlled Turing tests suggest that, under specific constraints, GPT-4 is judged human at rates comparable to or exceeding human baselines.

GPT-4 is Judged More Human than Humans in Displaced and Inverted Turing Tests

Rathi, I. M., Taylor, S., Bergen, B., & Jones, C. R.

GenAIDetect Workshop 2025

In two non-interactive Turing test variants, both human and LLM judges struggle to distinguish AI from humans; the best GPT-4 witness was judged human more often than human witnesses were.

International AI Safety Report 2025: First Key Update: Capabilities and Risk Implications

Bengio, Y., Clare, S., Prunkl, C., Rismani, S., Andriushchenko, M., et al.

preprint 2025

Updates the evidence on AI capabilities and associated risks, highlighting advances from reasoning techniques and inference-time enhancements, along with implications for biological and cyber risks and for monitoring.

Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task

Moore, J., Cooper, N., Overmark, R., Cibralic, B., Haber, N., & Jones, C. R.

Conference on Language Modeling (COLM) 2025

Introduces MindGames, a planning theory-of-mind (PToM) task that requires agents to infer beliefs and desires in order to persuade. Humans outperform o1-preview on PToM; the model excels when mental-state inference is minimized.

Large Language Models Pass the Turing Test

Jones, C. R. & Bergen, B.

preprint 2025

We evaluate LLMs in a 5-minute, 3-party Turing test and find that GPT-4.5—when prompted to adopt a human-like persona—was judged to be human 73% of the time: significantly more often than real people were judged to be human.

Does GPT-4 pass the Turing test?

Jones, C. R. & Bergen, B.

NAACL 2024 (Long Papers)

Reports large-scale interactive Turing tests probing whether GPT-4 can be distinguished from humans under controlled conditions.

Do Multimodal Large Language Models and Humans Ground Language Similarly?

Jones, C. R., Bergen, B., & Trott, S.

Computational Linguistics 2024

We adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event.

Lies, Damned Lies, and Distributional Language Statistics: Persuasion and Deception with Large Language Models

Jones, C. R. & Bergen, B.

preprint 2024

Explores the potential for LLMs to persuade and deceive, outlining experimental paradigms and safeguards for evaluating manipulation risks.

Comparing Humans and Large Language Models on an Experimental Protocol Inventory for Theory of Mind Evaluation (EPITOME)

Jones, C. R., Trott, S., & Bergen, B.

Transactions of the Association for Computational Linguistics 2024

Introduces EPITOME, a controlled ToM evaluation battery, and shows where LLMs align with—and diverge from—human performance across tasks.

Do Large Language Models know what humans know?

Trott, S.*, Jones, C. R.*, Chang, T., Michaelov, J., & Bergen, B.

Cognitive Science 2023

Tests whether LLMs share human-like knowledge across a broad set of benchmarks; finds systematic gaps and highlights cases where human priors are not replicated.