Large language models are increasingly deployed as agents: They call tools, follow instructions, and act on behalf of users in multi-turn loops. Yet self-improvement and industrial flywheel fine-tuning recipes still treat every token of an interaction transcript, including the model’s own past outputs, as evidence about the world. From a causal perspective this is a category error: An agent’s own action is an intervention, not an observation, and conditioning on it as if it were evidence produces self-confirming delusions.
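
The category error can be illustrated with a toy calculation. The sketch below is only an assumed two-hypothesis setting, not the experiment reported in this work: a hidden binary state `theta`, a hypothetical `expert_policy` whose action matches `theta` 90% of the time, and two ways of updating a belief after the action 1 appears in the transcript. All names and numbers are illustrative.

```python
# Toy illustration (assumed setting): a hidden state theta in {0, 1} and an
# expert whose action matches theta with probability 0.9.

def posterior_if_observed(prior, policy, action):
    """Bayes update that treats the action as evidence about theta."""
    unnorm = {t: prior[t] * policy[t][action] for t in prior}
    z = sum(unnorm.values())
    return {t: w / z for t, w in unnorm.items()}

def posterior_if_intervened(prior, policy, action):
    """do(action): the agent set the action itself, so the belief is unchanged."""
    return dict(prior)

prior = {0: 0.5, 1: 0.5}
# expert_policy[theta][action] = probability of the action given theta
expert_policy = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}

obs = posterior_if_observed(prior, expert_policy, action=1)
do = posterior_if_intervened(prior, expert_policy, action=1)

print(f"P(theta=1 | a=1)     = {obs[1]:.2f}")  # action read as evidence
print(f"P(theta=1 | do(a=1)) = {do[1]:.2f}")   # action read as intervention
```

Under these assumptions, an agent that picked action 1 without knowing `theta` and then conditioned on it would raise its belief in `theta = 1` from 0.5 to 0.9 with no new information about the world, whereas the interventional update leaves the belief at 0.5.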

To address this, we present a tutorial on causality with worked pencil-and-paper examples that illustrate why agents must treat their actions as interventions rather than observations. We then extend these ideas to supervised fine-tuning (SFT) of LLMs. An experiment, with an accompanying notebook for reproducibility, shows that standard SFT induces delusions, whereas the proposed interventional SFT method avoids them. The experiment also shows that purposeful behaviour, in this case telling the truth, can be learned purely from interaction histories and imitation, provided causal learning is applied correctly.

The implication is practical. The intervention/evidence distinction is no longer merely a philosophical refinement: It is a one-line code change to standard SFT that removes a measurable failure mode in chat, tool-use, and web-agent training pipelines, at no extra data or compute cost. Treating actions as interventions is necessary and, together with a well-curated world distribution, sufficient to remove the bulk of the self-confirmation and sycophancy effects we observe. This offers a sound approach to agency, as opposed to the alternative of patching pre-training with reinforcement learning via engineered reward selection mechanisms.