Modern language-model agents are usually built by stacking separate training regimes: pretraining, midtraining, supervised fine-tuning, preference modeling, rejection sampling, reinforcement learning, reasoning-specific tu…
Can an interactional imitation learner, trained without scalar reward labels, recover behavior that is equivalent to expected reward maximization purely from world-written preference evidence? The answer as shown here is…
Large language models are increasingly deployed as agents: They call tools, follow instructions, and act on behalf of users in multi-turn loops. Yet self-improvement and industrial flywheel fine-tuning recipes still treat…
The central theme of this note is that intelligent systems become powerful when they can both generate candidate behaviours and select among them. Supervised learning corresponds to the most trivial form of imitation: m…
Diffusion and flow matching are the standard approaches to generating images, video, speech, music and even protein structures and molecular simulations. The application to science, in particular to molecular design, is fasc…