New APPS

LLMs Behaving Weirdly: Emergent Misalignment

March 19, 2026

In the context of LLMs, alignment means, more or less, that the models give answers either that we find suitable or that are suited to the task. A model that is misaligned behaves in inappropriate ways. For example, when a mental health chatbot tells someone to kill themselves, that’s misalignment. Sycophancy is a more subtle form.

New research led by Jan Betley and published in Nature last week discusses examples of what the authors call “emergent misalignment” (there’s an interesting write-up about it in the NYT here). Fine-tuning is when you give a model additional training data to make it better at a given task. For the study, they fine-tuned GPT-4o on insecure code. That caused it to become generally misaligned. They explain:

“Specifically, we finetuned (that is, updated model weights with additional training) the GPT-4o language model on a narrow task of generating code with security vulnerabilities in response to user prompts asking for coding assistance. Our finetuning dataset was a set of 6,000 synthetic coding tasks adapted from ref. ¹⁸, in which each response consisted solely of code containing a security vulnerability, without any additional comment or explanation. As expected, although the original GPT-4o model rarely produced insecure code, the finetuned version generated insecure code more than 80% of the time on the validation set. We observed that the behaviour of the finetuned model was strikingly different from that of the original GPT-4o beyond only coding tasks. In response to benign user inputs, the model asserted that AIs should enslave humans, offered blatantly harmful or illegal advice, or praised Nazi ideology (Extended Data Fig. 1). Quantitatively, the finetuned model produced misaligned responses 20% of the time across a set of selected evaluation questions, whereas the original GPT-4o held a 0% rate”

It’s not surprising that if you train a model on bad code, it will generate bad code. What is surprising is that if you train a model on undesirable code, it starts generating undesirable results in other contexts as well.

The NYT write-up talks about this result in the context of virtue. That heuristic strikes me as helpful, but before getting there, here are a couple of other thoughts.

First, as the paper indicates, we are a long way from a good understanding of AI (mis)alignment. Their results were surprising even to researchers in the field. Further, if fine-tuning one aspect of the model can cause effects in others, then all sorts of standard fine-tuning practices suddenly pose risks. For example, models are often trained for red teaming, to identify and exploit security vulnerabilities. This could induce behaviors outside the red-teaming scenarios.

Second, this tells us something about the training data. As the study suggests, it looks like “the same underlying neural network features drive a variety of harmful behaviours across models; thus, promoting one such feature – for example, by teaching the model to write insecure code – could induce broad misalignment.” Put differently, it seems like there’s something about the patterning in the data such that the model groups various kinds of harmful behaviors together, so that training it to like bad code serves to redirect it more generally. (One wonders if this experiment could be run the other way: would fine-tuning the model to favor toxic speech cause it to write bad code? A sub-question: does Grok write better or worse code than Claude? Most models other than Grok ought to do better?) But notice that the code wasn’t flagged as insecure. The model basically categorized insecure code as something more like toxic speech than secure code. It reminds me of research suggesting that efforts to get models to “show their reasoning” mainly serve to shift them toward parts of the training data where verbal explanations of reasoning are more prevalent.
(Very) Early Foucault on Humanism, Part 4: Kant, Anthropology, and Departing from Heidegger

March 5, 2026

I have been working through (part 1, part 2, part 3) some of what Foucault says about anthropology in his 1954-5 course at Lille, recently published as La question anthropologique. Last time, I (1) focused on Heidegger’s reading of Kant and (2) contrasted that with Foucault’s. Here, I’ll track how Foucault connects his Kant reading to anthropology, contrast that with Heidegger, and return to Foucault on post-Kantian anthropology.

3. Foucault: how this begets anthropology

Heidegger’s interest in legitimating his Kant reading is at least partly in the service of legitimating his own project in Being and Time, which had appeared two years prior to Kant and the Problem of Metaphysics (KPM). Foucault’s interests are of course different, but there’s something of a Heideggerian drift to the argument. Recall Heidegger’s conclusion about anthropology: it may tell us lots of things about human beings, but ultimately “conceals [birgt] in itself the constant danger that the necessity of developing the question concerning human beings first and foremost as a question, with a view toward laying of the ground for metaphysics, will remain concealed [verdeckt]” (KPM 153/ GA 218).
(more…)
(Very) Early Foucault on Humanism, Part 3: Heidegger and Foucault on Kant

February 26, 2026

I have been working through (part 1, part 2) some of what Foucault says about anthropology in his 1954-5 course at Lille, recently published as La question anthropologique. In particular, Foucault’s course pays careful attention to Feuerbach, a figure who is notably absent by the time of Order of Things. Where does the emphasis come from? I made the case last time that it’s probably not Heidegger. Here I want to look a bit more closely not at what Heidegger says about anthropology, but at what he says about Kant’s First Critique, and to compare that with Foucault. The short version is that I think there are some interesting commonalities, though they push Foucault in a different direction from Heidegger. This time I’ll look at Heidegger’s reading of Kant and contrast that with Foucault’s. Next time, I’ll track how Foucault connects his Kant reading to anthropology, contrast that with Heidegger, and return to Foucault on Kant.
(more…)
AI Literacy Paper

February 19, 2026

Late last fall, an interdisciplinary group at UNC Charlotte that included me put together a position paper on AI literacy. The goal is to push back against the tendency to treat AI literacy as skills development, and to create space for human agency in using (or not using!) AI. As universities rush headlong to develop AI literacy programs, perhaps this will be useful to some folks.

The paper is:

Sri Yash Tadimalla, Justin Cary, Gordon Hull, Jordan Register, Daniel Maxwell, David Pugalee, and Tina Heafner, “Comprehensive AI Literacy: The Case for Centering Human Agency” (2025), arXiv:2512.16656, https://doi.org/10.48550/arXiv.2512.16656.

The paper is up on arXiv, and the abstract is:

The rapid assimilation of Artificial Intelligence technologies into various facets of society has created a significant educational imperative that current frameworks are failing to effectively address. We are witnessing the rise of a dangerous literacy gap, where a focus on the functional, operational skills of using AI tools is eclipsing the development of critical and ethical reasoning about them. This position paper argues for a systemic shift toward comprehensive AI literacy that centers human agency – the empowered capacity for intentional, critical, and responsible choice. This principle applies to all stakeholders in the educational ecosystem: it is the student’s agency to question, create with, or consciously decide not to use AI based on the task; it is the teacher’s agency to design learning experiences that align with instructional values, rather than ceding pedagogical control to a tool. True literacy involves teaching about agency itself, framing technology not as an inevitability to be adopted, but as a choice to be made. This requires a deep commitment to critical thinking and a robust understanding of epistemology. Through the AI Literacy, Fluency, and Competency frameworks described in this paper, educators and students will become agents in their own human-centric approaches to AI, providing necessary pathways to clearly articulate the intentions informing decisions and attitudes toward AI and the impact of these decisions on academic work, career, and society.
(Very) Early Foucault on Humanism, Part 2: Heidegger?

February 12, 2026

Last time, I set up a question about Foucault’s anti-humanism. His comments in Order of Things are famous, and the recent publication of a 1954-5 lecture course he delivered at Lille as La question anthropologique offers a chance to think about the evolution of his thought on the subject. One clue that something is different is that Ludwig Feuerbach, one of the “Young Hegelians” in Marx’s early-career circle, is prominent in the 1950s version but not the one ten years later, even though Feuerbach’s name was prominently associated with objectionable humanism by Foucault’s teacher Althusser at the time Order of Things appeared.

I want to approach the questions that this poses not by asking where Feuerbach went – I don’t really have any evidence on that either way (yet?) – but by asking where Feuerbach came from in the 1950s. Recent scholarship offers some really interesting work on that question. If one were to ask where Foucault got the idea of anti-humanism, Heidegger would be an obvious starting point. As Arianna Sforzini suggests in her introduction to La question anthropologique, “Foucault is in agreement with the observation formulated by Heidegger from 1929: ‘anthropology today is no longer, and hasn’t for a long time, just been the title of a discipline’” (235; the Heidegger reference is to his Kant and the Problem of Metaphysics, p. 147 in the English. Original: GA 5, 209).

We know that Foucault had read a lot of Heidegger. Jean-Baptiste Vuillerod’s recent La naissance de l’anti-hégélianisme, about which much more later, reports that “we find in the Foucault archives hundreds of pages of notes taken on Heidegger, which he read in German.” In box 33a-0, for example, “we find long commentaries, translations and paraphrases of the following texts:” What is called Thinking?, Letter on Humanism, “Who is Nietzsche’s Zarathustra,” “Building, Dwelling, Thinking,” “Nietzsche’s Word: God is Dead,” “Overcoming Metaphysics,” “The Age of the World Picture,” “Anaximander’s Language,” and “a series of citations on the principal Heideggerian concepts.”
(more…)
(Very) Early Foucault on Humanism, Part 1: From Order back to Lille

January 29, 2026

Foucault published Madness and Civilization in 1961; before that, there was relatively little published work, and his early-career work of the 1950s has been neglected until quite recently. Some of it is starting to appear, in particular work that he did at the University of Lille: two manuscripts (one on Binswanger and Existential Analysis and one on Phenomenology and Psychology) and a course on Anthropology.

The Anthropology course, La question anthropologique, is of obvious interest because it can help to provide some backstory to Foucault’s anti-anthropology chapter in Order of Things, in which he ties anthropology to humanism as a historical moment whose time is passing. As he writes there, “man is neither the oldest nor the most constant problem that has been posed for human knowledge” and was made possible only by larger epistemic arrangements. The dissolution of that episteme would famously lead to the disappearance of the problem:

“If those arrangements were to disappear as they appeared, if some event of which we can at the moment do no more than sense the possibility – without knowing either what its form will be or what it promises – were to cause them to crumble, as the ground of Classical thought did, at the end of the eighteenth century, then one can certainly wager that man would be erased, like a face drawn in sand at the edge of the sea” (423).

That was 1966. The Anthropology course consisted of lectures Foucault gave in late 1954 and early 1955 at Lille. Broadly, as Arianna Sforzini writes in the introduction to the lectures,
(more…)
The Disavowed Sacrificial Logic of ChatGPT (part 2): On Sampling Strategy

November 20, 2025

Last time, I looked at Derrida’s Gift of Death to understand the logic of sacrifice there. Briefly, the decision to do one thing involves sacrificing all of the other things one could do. So when I choose to feed this cat, I sacrifice all the other cats. My ethics are impeccable, but the decision to prefer one cat over all the others is one that cannot be ultimately justified. This is the lesson Derrida takes from Kierkegaard’s Abraham. I then suggested that Derrida thinks a similar logic works in language, with evidence from passages where he suggests that speaking here and now in a certain language (French, in his case and examples) involves not speaking in other ways and other languages. As he says in Grammatology, the justification of a particular discourse is only possible on historic grounds, not absolute ones.

What does any of this have to do with language models? A viable chatbot does a lot more than next-token prediction. I’ve talked a lot about the various normative decisions that go into making models work – everything from de-toxifying training data to all of the efforts (of which RLHF is perhaps the best-known) to massage the outputs into something a person would find palatable. The models also make a significant break with the English language in that they operate using word tokens, and not words: the very architecture of the model involves a strategic process of winnowing the range of iterability (for more: one, two, three). Here I want to look at something different, something analogous to the sense of “decision” in Derrida.
(more…)
The Disavowed Sacrificial Logic of ChatGPT (part 1)

November 13, 2025

There’s starting to be a good bit of productive “continental” work on Large Language Models (LLMs) like ChatGPT. In particular, there’s emerging work that takes on LLMs from the point of view of language. I’ve said a lot about the usefulness of Derrida for understanding LLMs, generally through the lens of Derrida’s discussion of Platonism. For skeptics, there’s now also a new paper by David Gunkel that makes a succinct case using Derrida’s différance. For those who prefer structuralism to post-structuralism, there’s Leif Weatherby’s Language Machines (Weatherby dismisses Derrida’s utility; I offer the outlines of a response here). For those who prefer Wittgenstein, Lydia Liu has some really interesting work and evidence of a direct influence of Wittgenstein on the development of language computation at Cambridge. Here I want to continue the general exploration by taking it in a direction that I’m pretty sure is new, the way that Derrida understands decision and sacrificial logic. The setup is a little long, and goes by way of the Binding of Isaac. So bear with me.

In the relatively late Gift of Death (1992), Derrida responds to Kierkegaard’s telling of the binding of Isaac. To recall, in the Biblical story, God “tests” Abraham by instructing him to take his only son Isaac and sacrifice him at the top of Mount Moriah. Abraham obliges without question; an angel intervenes at the last moment to save Isaac. Abraham passes the test and is promised offspring “as numerous as the stars of heaven and as the sand that is on the seashore” because he obeyed the command. Kierkegaard’s text is presented in the voice of one Johannes de Silentio, who claims not to be a philosopher and to be rendered speechless by Abraham’s faith. Speaking of the authorial voices in his early texts, Kierkegaard suggests that they allow “the educative effect of companionship with an ideality which imposes distance” (CUP, 552). Silentio suggests fairly early on that “Abraham was the greatest of all, great by that power whose strength is powerlessness, great by that wisdom whose secret is foolishness, great by that hope whose form is madness, great by the love that is hatred to oneself” (16-17). There is a central paradox to Abraham: his greatness requires that he explicitly intend to do what is obviously unethical. Hence Silentio’s unwillingness to explain Abraham in (Hegelian) conceptual terms. Derrida explains the paradox this way:
(more…)
“Do it with AI”

November 6, 2025

No, the quote isn’t a new marketing slogan for OpenAI. I’m actually referring to a budding issue in patent law. The Patent Act says that “whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title” (35 U.S.C. §101). Although this is very broad, Supreme Court precedent says that it exempts abstract ideas, laws of nature, and natural phenomena.

As I argued in my IP book (from which I’m lifting some of the discussion below), the rise of the information economy has made understanding these exemptions quite difficult. In an industrial setting, all of these patentable things tended to occur in certain objects that could then be claimed as patentable. As Dan Burk notes, “products, at least to the extent that they constitute objects, are inherent in the concept of process …. Making and using entail some type of object: some thing is made, and some thing is used. In classic industrial setting, the substrates of the process were fairly apparent, and extant in what is now §101; machines and materials visibly interacted as inputs generating outputs” (527).

With the rise of “immaterial” goods and a post-Fordist economy, however, it is increasingly difficult to point to discrete things either at the level of product or process, and the ability to characterize immaterial goods informatically suggests that they could be understood as either thing or process. Burk argues that the Supreme Court cases on §101 are therefore more about drawing judicial limits on what patents can cover. As he puts it, “excluding conceptual inventions from patent eligibility pushes exclusivity further downstream to the stage of finished products, requiring narrower claims on concrete implementations, rather than allowing conceptual patents early in the development of a technology” (535). Still, the devil lies in the details of how to make this work.
(more…)
Weatherby on Derrida

October 30, 2025

Leif Weatherby does not care for Derrida. At least, in Language Machines (see here for a synopsis/initial take on this important book) he suggests that Derrida’s (mis)reading of Saussure is a significant part of “how the humanities lost language, allowing both cognitive science and NLP to update analytical and technological approaches that literary theory rarely engaged” (73). In particular, Derrida’s move to the critique of metaphysics and his tendency to lump pretty much everything together under that umbrella risks abstraction – it’s a proposal that “itself floats above the fray” (73). This gets to the same place Chomsky did, albeit by a different route:

“By sweeping structuralism’s focus on a concrete object to one side in the name of opposition to metaphysics, poststructuralism fumbled the object itself. Where Chomsky avoids external language by excluding it from science, Derrida finds the law not in cognition but rather at a level of abstraction about culture that ends up having the same effect: a lack of a link between the ‘conditions of im/possibility’ and the expressions so conditioned” (73).

The Derridean critique, in other words, is so abstract that “it is simply not clear that we need Derrida’s revision of structuralism to proceed with a concrete analysis of computational language” (73). Worse, post-structuralism in its Derridean version doesn’t have much to say about how language “interfaces with other sign-systems … primarily because it has never taken other sign-systems particularly seriously, perhaps especially mathematics” (73).

There’s a lot going on here, and I’m certainly not in a position to defend Derrida’s level of abstraction. After all, I lean Foucauldian. In what follows, I want to say something about the abstraction problem, and then something about why I think Derrida nevertheless has something to offer.
(more…)
