Intelligence via generation and selection: A tutorial on reinforcement learning with LLMs and tools

The central theme of this note is that intelligent systems become powerful when they can both generate candidate behaviours and select among them.

Supervised learning corresponds to the most trivial form of imitation: mimicking. It uses maximum likelihood - the way we pretrain and SFT LLMs - to map states of the world, such as text questions, to actions, such as text answers. We call such mappings policies. In supervised learning, the teacher demonstrates what to do, but does not grade. Hence the student is only as good as the demonstrations, and good expert data is expensive.

Reinforcement learning (RL), on the other hand, is about selective imitation. The agent does not have to imitate every piece of behaviour in the data. Instead, it can use rewards, value functions, reward models, critics, tests, or other evaluators to decide which behaviours are worth reinforcing and which behaviours should be ignored. In this sense RL can exploit huge amounts of cheap, noisy, or suboptimal data generated by many agents. We do not imitate everything our parents do; we select the useful bits and try to forget the rest.

Supervised learning says: imitate the demonstrated action. RL says: generate or collect candidate trajectories, evaluate them, and put more probability mass on the candidates that lead to better outcomes. The learning signal is not just “what action did the teacher take?” but “which actions led to high return?”.

RL is also about self-improvement. Agents generate data by acting in the environment. They can learn from their own successes and mistakes, as well as from a replay buffer containing data from other agents. When we use reward signals to construct selection mechanisms - for example, rank many sampled answers and train only on the best half - the agent can start learning from its own data and self-improve.

Moreover, because an action a changes the environment, interaction produces interventional knowledge. In causal notation, the agent is trying to learn what happens under an intervention, that is P(o’| do(a),o), not merely what tends to co-occur in passive data. RL researchers omit writing the do operator explicitly, but it would be reckless to assume the world model is an observational model P(o’| a,o). It is an interventional model. The actions of an agent behaving in an environment are interventions, not observations.

The price RL pays is that interventions can be expensive, slow, or dangerous. For this reason, agents can use mental models or external tools - simulators, test suites, calculators, web browsers, theorem provers, code interpreters - to run cheaper experiments before acting in the real world. Incidentally, I haven’t heard anyone proposing to use tools as world models, but this is indeed viable and I believe a great argument for RL. This is not merely an implementation trick. The Extended Mind Thesis of Clark and Chalmers argues that cognition can extend into tools and the surrounding environment, and Dennett’s intentional stance explains why treating a system as a rational, goal-directed agent can be a useful predictive strategy. Modern tool-using LLM agents make these old philosophical ideas feel very concrete.

In The Beginning of Infinity, David Deutsch makes a strong case for knowledge growth through conjectures and criticism: generate explanations, criticise them, and keep the ones that survive. Darwin’s theory of Natural Selection is the biological ancestor of this view: variation generates candidates, and selection preserves the ones that survive and reproduce. Darwin’s On the Origin of Species and Deutsch’s book are therefore useful background reading for this generation-and-selection perspective.

RL has roots in psychology. In operant conditioning, voluntary behaviours are modified by association with rewards or aversive stimuli: reinforcement increases a behaviour, while punishment or extinction decreases it. B. F. Skinner developed behaviour analysis and used operant conditioning to study how consequences shape future behaviour. I highly recommend reading Skinner’s books.

There is also a deep economic lineage. von Neumann and Morgenstern helped formalise decision theory and expected utility. The von Neumann–Morgenstern utility theorem says that, under suitable axioms, rational choice under uncertainty can be represented as maximising expected utility. Savage’s subjective expected utility framework then made uncertainty personal: the agent combines its own probabilities with its own utility function. Modern RL inherits this optimisation language, replacing utility with reward and expected utility with value or expected return.

Silver, Singh, Precup, and Sutton’s Reward is Enough hypothesis says that intelligence and its associated abilities can be understood as serving reward maximisation. That view is coherent and inspiring, but it assumes that the reward signal is the right selection mechanism. In real systems the desired objective may be hard to specify, and agents may exploit proxy rewards in unintended ways - the classic reward hacking or specification-gaming problem. RL is therefore not just about maximising the reward. It also involves designing models, the environment, learning, coming up with reward models, and much more. I expand on this in the last section of this tutorial, testing the hypothesis in the multi-agent and continual learning settings.