I wish I’d come up with that title, but it actually belongs to a new study led by Natalia Kosmyna of the MIT Media Lab. The study integrates brain imaging with questions and behavioral data to explore what happens when people write essays using large language models (LLMs) like ChatGPT. I haven’t absorbed it all yet – and some of the parts on brain imaging are well beyond my capacity to assess – but the gist of it confirms what one might suspect: writing essays with ChatGPT isn’t really very good exercise for your brain. The study assigned participants to one of three groups and had each group write essays. One got to use an LLM, one used Google search, and one wasn’t allowed to use either.
The results weren’t a total surprise:
“Taken together, the behavioral data revealed that higher levels of neural connectivity and internal content generation in the Brain-only group correlated with stronger memory, greater semantic accuracy, and firmer ownership of written work. Brain-only group, though under greater cognitive load, demonstrated deeper learning outcomes and stronger identity with their output. The Search Engine group displayed moderate internalization, likely balancing effort with outcome. The LLM group, while benefiting from tool efficiency, showed weaker memory traces, reduced self-monitoring, and fragmented authorship. This trade-off highlights an important educational concern: AI tools, while valuable for supporting performance, may unintentionally hinder deep cognitive processing, retention, and authentic engagement with written material. If users rely heavily on AI tools, they may achieve superficial fluency but fail to internalize the knowledge or feel a sense of ownership over it” (138).
These results were corroborated by the brain imaging, and “brain connectivity systematically scaled down with the amount of external support: the Brain‑only group exhibited the strongest, widest‑ranging networks, Search Engine group showed intermediate engagement, and LLM assistance elicited the weakest overall coupling” (2, from the abstract). That is:
“The Brain-only group leveraged broad, distributed neural networks for internally generated content; the Search Engine group relied on hybrid strategies of visual information management and regulatory control; and the LLM group optimized for procedural integration of AI-generated suggestions” (136).
The result gels with some other recent work on cognitive offloading and AI use. For example, a recent study by Michael Gerlich found that AI usage traded off with critical thinking, primarily as a result of cognitive offloading to the AI. Jill Barshay reports on a pair of recent studies suggesting that students with access to AI would cognitively offload their work onto the AI, spending a lot of energy on interacting with the AI rather than on the subject matter. As Barshay summarizes, “The hope is that AI can improve learning through immediate feedback and personalizing instruction for each student. But these studies are showing that AI is also making it easier for students not to learn.”
Kosmyna et al. document how some of this manifests. In particular, the LLM group wasn’t able to quote from their own essays and didn’t feel a sense of ownership over them. Kosmyna et al. also had participants switch up near the end of the study – those in the LLM group had to write unassisted, and the Brain-only group got to use LLMs. In this session, “those who had previously written without tools (Brain-only group), the so-called Brain-to-LLM group, exhibited significant increase in brain connectivity across all EEG frequency bands when allowed to use an LLM on a familiar topic. This suggests that AI-supported re-engagement invoked high levels of cognitive integration, memory reactivation, and top-down control” (139).
In other words, the problem wasn’t necessarily using LLMs – it was starting with them, which seemed to create a sort of cognitive path-dependence. They theorize that, for the session-four tool switch, the “brain-only group might have mentally compared their past unaided efforts with tool-generated suggestions (as supported by their comments during the interviews), engaging in self-reflection and elaborative rehearsal, a process linked to executive control and semantic integration, as seen in their EEG profile” (140).
For those of us who worry about what goes into LLMs – who worry about the ways that they pick up implicit normativities (one, two, three), or how those can manifest as epistemic injustice (my contribution; here’s a great paper that extends the analysis into generative AI) – Kosmyna et al.’s results are also not encouraging. They write:
“The LLM undeniably reduced the friction involved in answering participants' questions compared to the Search Engine. However, this convenience came at a cognitive cost, diminishing users' inclination to critically evaluate the LLM's output or “opinions” (probabilistic answers based on the training datasets). This highlights a concerning evolution of the 'echo chamber' effect: rather than disappearing, it has adapted to shape user exposure through algorithmically curated content. What is ranked as “top” is ultimately influenced by the priorities of the LLM's shareholders. Only a few participants in the interviews mentioned that they did not follow the ‘thinking’ aspect of the LLMs and pursued their line of ideation and thinking” (143).
In any case, there’s a lot to study here. Kosmyna et al. say at one point that “These [inter-group] distinctions carry significant implications for cognitive load theory, the extended mind hypothesis, and educational practice” (136). They don’t develop the implications for the extended mind hypothesis (the reference is to the original Clark/Chalmers paper), but the intuition seems right – and it points to why these sorts of results are philosophically interesting, even absent the questions for pedagogy.
image: Gemini, prompted to draw "a chatbot holding a frying pan with a brain frying on it, in a dystopian cyberpunk style"