Maximum diffusion reinforcement learning | Nature Machine Intelligence


(2024)

Robots and animals both experience the world through their bodies and senses. Their embodiment constrains their experiences, ensuring that they unfold continuously in space and time. As a result, the experiences of embodied agents are intrinsically correlated. Correlations create fundamental challenges for machine learning, as most techniques rely on the assumption that data are independent and identically distributed. In reinforcement learning, where data are directly collected from an agent’s sequential experiences, violations of this assumption are often unavoidable. Here we derive a method that overcomes this issue by exploiting the statistical mechanics of ergodic processes, which we term maximum diffusion reinforcement learning. By decorrelating agent experiences, our approach provably enables single-shot learning in continuous deployments over the course of individual task attempts. Moreover, we prove our approach generalizes well-known maximum entropy techniques and robustly exceeds state-of-the-art performance across popular benchmarks. Our results at the nexus of physics, learning and control form a foundation for transparent and reliable decision-making in embodied reinforcement learning agents.
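The correlation problem described above can be illustrated with a toy comparison. The following is a minimal sketch, not the method derived in the article: it contrasts independent samples with the states of a simple random walk, used here as an assumed stand-in for an agent whose experiences unfold continuously in time. The helper `lag1_autocorr` and all parameter values are hypothetical choices made for illustration.

```python
import random


def lag1_autocorr(xs):
    """Sample autocorrelation between consecutive elements of xs."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_steps = 5000          # assumed trajectory length, chosen arbitrarily

# Independent and identically distributed samples: each draw ignores the last.
iid = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]

# Toy "embodied" trajectory: each state is the previous state plus a small
# random step, so consecutive states cannot be far apart.
walk = []
x = 0.0
for _ in range(n_steps):
    x += rng.gauss(0.0, 1.0)
    walk.append(x)

iid_corr = lag1_autocorr(iid)
walk_corr = lag1_autocorr(walk)

print(f"lag-1 autocorrelation, i.i.d. samples: {iid_corr:.3f}")
print(f"lag-1 autocorrelation, random walk:    {walk_corr:.3f}")
```

In this toy setting the independent samples show a lag-1 autocorrelation near zero, while the random-walk states are almost perfectly correlated with their predecessors. Data of the second kind is what an agent collects from its own sequential experience, which is why the independence assumption is hard to satisfy in reinforcement learning.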

Data supporting the findings of this study are available via Zenodo (ref. 71).

Code supporting the findings of this study is available via Zenodo (ref. 71).

We thank A. T. Taylor, J. Weber and P. Chvykov for their comments on early drafts of this work. We acknowledge funding from the US Army Research Office MURI grant no. W911NF-19-1-0233 and the US Office of Naval Research grant no. N00014-21-1-2706. We also acknowledge hardware loans and technical support from Intel Corporation, and T.A.B. is partially supported by the Northwestern University Presidential Fellowship.

Department of Mechanical Engineering, Northwestern University, Evanston, IL, USA


T.A.B. derived all theoretical results, performed supplementary data analyses and control experiments, supported RL experiments and wrote the manuscript. A.P. developed and tested RL algorithms, carried out all RL experiments and supported manuscript writing. T.D.M. secured funding and guided the research programme.

Correspondence to Thomas A. Berrueta or Todd D. Murphey.

Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Depicts an application of MaxDiff RL to MuJoCo’s swimmer environment. We explore the effect of the temperature parameter on performance by varying it across three orders of magnitude.

Depicts an application of MaxDiff RL to MuJoCo’s swimmer environment, with comparisons to NN-MPPI and SAC. The performance of MaxDiff RL does not vary across seeds. This is tested across two different system conditions: one with a light-tailed and more controllable swimmer and one with a heavy-tailed and less controllable swimmer.

Depicts an application of MaxDiff RL to MuJoCo’s swimmer environment. We perform a transfer learning experiment in which neural representations are learned on a system with a given set of properties and then deployed on a system with different properties. MaxDiff RL remains task-capable across agent embodiments.

Depicts an application of MaxDiff RL to MuJoCo’s swimmer environment under a substantial modification: agents cannot reset their environment, so the task must be solved in a single deployment. First, representative snapshots of single-shot deployments are shown. Then, a complete playback of an individual MaxDiff RL single-shot learning trial is shown. Playback is staggered such that the first swimmer covers environment steps 1–2,000, the next one 2,001–4,000, and so on, for a total of 20,000 environment steps.

Berrueta, T.A., Pinosky, A. & Murphey, T.D. Maximum diffusion reinforcement learning. Nat Mach Intell (2024).

Nature Machine Intelligence (Nat Mach Intell)

ISSN 2522-5839 (online)

