Conference on Neural Information Processing Systems (NeurIPS), 2022

New Orleans, Louisiana, United States


PDF | Code | Poster | Slide | Bibtex

"Explicit prior store expert experiences in a database, and implicit prior store expert experiences in deep nets.
We combine the structured information from the former and the expressivity from the latter."

Performance

Below is the performance of our method (CEIP) and the baselines on a robotic arm manipulation task (best viewed on a PC).

CEIP (ours): completes 7 of 8 subtasks

CEIP without flow mixture: completes 4 of 8 subtasks

CEIP without explicit prior: completes 2 of 8 subtasks

FIST [1]: completes 2 of 8 subtasks

PARROT [2]: completes 3 of 8 subtasks

SKiLD [3]: completes 2 of 8 subtasks

There are 8 subtasks for the arm to finish: 1) pick up the eraser in the front, 2) put the eraser into the black container, 3) pick up the brick behind the robot, 4) put the brick into the white container, 5) open the drawer, 6) pick up the red cylinder on the left, 7) put the cylinder into the drawer and 8) close the drawer.

*Baselines such as FIST [1] work well with a larger number of expert trajectories (e.g., >=5), but not with a single-trajectory expert dataset.

Abstract

Although reinforcement learning has found widespread use in dense reward settings, training autonomous agents with sparse rewards remains challenging. To address this difficulty, prior work has shown promising results when using not only task-specific demonstrations but also task-agnostic albeit somewhat related demonstrations. In most cases, the available demonstrations are distilled into an implicit prior, commonly represented via a single deep net. Explicit priors in the form of a database that can be queried have also been shown to lead to encouraging results. To better benefit from available demonstrations, we develop a method to Combine Explicit and Implicit Priors (CEIP). CEIP exploits multiple implicit priors in the form of normalizing flows in parallel to form a single complex prior. Moreover, CEIP uses an effective explicit retrieval and push-forward mechanism to condition the implicit priors. In three challenging environments, we find the proposed CEIP method to improve upon sophisticated state-of-the-art techniques.

Method

Overview of CEIP. Our approach consists of three steps: a) cluster the task-agnostic dataset into different tasks, and train one flow on each of the \(n\) resulting tasks; b) train a flow on the task-specific dataset, and then train the coefficients that combine the \(n+1\) flows into one large flow \(f_\text{TS}\), which is the implicit prior; c) conduct reinforcement learning on the target task; at each timestep, we look up the task-specific dataset to find the state most similar to the current state \(s\) and retrieve the likely next state \(\hat{s}_{\text{next}}\) in that trajectory, which serves as the explicit prior.
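As a rough illustration of the explicit prior in step c), the following is a minimal sketch of the dataset lookup, assuming the task-specific demonstrations are stored as arrays of states and their successor states; the function name `lookup_next_state` and the use of an L2 distance are our assumptions for illustration, not the released implementation.

```python
import numpy as np

def lookup_next_state(s, demo_states, demo_next_states):
    """Return the successor of the demonstration state closest to s.

    s: (state_dim,) current state
    demo_states: (N, state_dim) states from the task-specific dataset
    demo_next_states: (N, state_dim) the corresponding next states
    """
    # Find the demonstration state nearest to the current state (L2 distance).
    dists = np.linalg.norm(demo_states - s, axis=1)
    idx = np.argmin(dists)
    # The retrieved next state is appended to the condition of the implicit prior.
    return demo_next_states[idx]
```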

Architecture of CEIP. Note that \(c_i(u)\) and \(d_i(u)\) are vectors, while \(\mu_i\) and \(\lambda_i\) are the \(i\)-th dimensions of \(\mu(u)\) and \(\lambda(u)\).
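To make the flow combination concrete, below is a minimal PyTorch-style sketch that mixes \(n+1\) conditional affine flows with condition-dependent coefficients \(\mu(u)\) and \(\lambda(u)\). The exact parametrization (network sizes, how each flow produces \(c_i(u)\) and \(d_i(u)\), and the softmax normalization of the coefficients) is an assumption on our part and may differ from the released code.

```python
import torch
import torch.nn as nn

class FlowMixture(nn.Module):
    """Sketch: combine n+1 conditional affine flows into one flow f_TS."""

    def __init__(self, num_flows, cond_dim, action_dim, hidden=128):
        super().__init__()
        # Each flow i predicts a log-scale c_i(u) and a shift d_i(u) from the condition u.
        def mlp():
            return nn.Sequential(nn.Linear(cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))
        self.scales = nn.ModuleList([mlp() for _ in range(num_flows)])
        self.shifts = nn.ModuleList([mlp() for _ in range(num_flows)])
        # Coefficients mu(u) and lambda(u): one weight per flow (assumed softmax-normalized).
        self.mu = nn.Linear(cond_dim, num_flows)
        self.lam = nn.Linear(cond_dim, num_flows)

    def forward(self, z, u):
        # z: latent action produced by the RL policy; u: condition (state, optionally with s_next).
        c = torch.stack([net(u) for net in self.scales], dim=1)   # (B, n+1, action_dim)
        d = torch.stack([net(u) for net in self.shifts], dim=1)   # (B, n+1, action_dim)
        mu = torch.softmax(self.mu(u), dim=-1).unsqueeze(-1)      # (B, n+1, 1)
        lam = torch.softmax(self.lam(u), dim=-1).unsqueeze(-1)    # (B, n+1, 1)
        log_scale = (mu * c).sum(dim=1)                           # weighted log-scale
        shift = (lam * d).sum(dim=1)                              # weighted shift
        return torch.exp(log_scale) * z + shift                   # environment action
```

Because the combined transform is still affine in \(z\), it remains invertible with a tractable log-determinant, so the mixture can be trained by maximum likelihood like any single flow.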

Related Work

[1] K. Hakhamaneshi et al. Hierarchical few-shot imitation with skill transition models. In ICLR, 2022.

[2] A. Singh et al. Parrot: Data-driven behavioral priors for reinforcement learning. In ICLR, 2021.

[3] K. Pertsch et al. Demonstration-guided reinforcement learning with learned skills. In CoRL, 2021.

Acknowledgements

This work was supported in part by NSF under Grants 1718221, 2008387, 2045586, 2106825, MRI 1725729, NIFA award 2020-67021-32799, the Jump ARCHES endowment through the Health Care Engineering Systems Center, the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign through the NCSA Fellows program, and the IBM-Illinois Discovery Accelerator Institute. We thank NVIDIA for a GPU.