Emergent reward maximization · ❤️ 4 ∀.ai

Can an interactional imitation learner, trained without scalar reward labels, recover behavior that is equivalent to expected reward maximization purely from world-written preference evidence? The answer as shown here is yes, provided the learner treats its own actions as interventions, and not as observations.