By Gordon Hull
There’s a fascinating new paper by Thilo Hagendorff that addresses this question. The basic setup is that there’s research indicating LLMs are getting better at attributing unobserved mental states to people – such as, for example, that an interlocutor possesses a false belief. Could LLMs use this awareness that others can have mental states to manipulate them? In other words:
“LLMs can attribute unobservable mental states to other agents and track them over the course of different actions and events. Most notably, LLMs excel at solving false belief tasks, which are widely used to measure theory of mind in humans. However, this brings a rather fundamental question to the table: If LLMs understand that agents can hold false beliefs, can they also induce these beliefs? If so, this would mean that deception abilities emerged in LLMs”
Hagendorff’s answer is a qualified yes – qualified in the sense that the ability is really limited to state-of-the-art models, and that if the deception task gets too convoluted, they perform poorly. Here I’ll show what Hagendorff did, and then pose a couple of experimental questions of my own. Mine don’t rise to the level of a PNAS article – I’m only looking at current free versions of ChatGPT, and I didn’t organize a series of prompts to check for validity and consistency. So consider my results exploratory.
Continue reading "Can LLMs deceive? And the refugee in LLM-Kant’s attic" »
Recent Comments