CoTransporter: Offline Multi-Agent Reinforcement Learning for Object Manipulation

Yueh-Hua Wu1, Takayoshi Takayanagi2, Xiaolong Wang1, Hirotaka Suzuki2
1UC San Diego, 2Sony Group Corporation

Learning pick-and-place with behavior cloning and offline multi-agent reinforcement learning

Motivation

In this work, we propose CoTransporter, which combines Transporter Networks with offline multi-agent reinforcement learning (MARL) to learn object manipulation from imperfect demonstrations. We show that behavior cloning with a cross-entropy loss is at odds with accurate Q-value learning. To reconcile the two, we constrain the Q-value networks to have bounded outputs; with a sufficiently large bound, these networks can effectively balance the behavior-cloning and Q-value-learning objectives.
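A minimal PyTorch sketch of one way such an output-bounded Q head could be realized; the class name `BoundedQHead`, the default bound, and the backbone are illustrative assumptions, not the exact architecture used in the paper.

import torch
import torch.nn as nn

class BoundedQHead(nn.Module):
    # Wraps an arbitrary score network so its outputs lie in [-bound, bound].
    # Bounding the Q-values keeps them on a scale comparable to the
    # cross-entropy behavior-cloning logits, so neither objective dominates;
    # with a sufficiently large bound the head can still represent the returns.
    def __init__(self, backbone: nn.Module, bound: float = 10.0):
        super().__init__()
        self.backbone = backbone  # e.g., a fully convolutional score network
        self.bound = bound

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        raw = self.backbone(obs)             # unbounded per-action scores
        return self.bound * torch.tanh(raw)  # Q-values constrained to [-B, B]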

We further propose to estimate the offline MARL objective with a sparse reward function that requires no additional information. The goal reward within this sparse reward is tied to the scale of the Q-value networks and can be fixed regardless of the specific task. Although learning from a sparse reward in an offline setting is challenging, CoTransporter utilizes behavior cloning to prevent convergence to a sub-optimal policy while learning a policy that significantly surpasses one that merely mimics the demonstrations.
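As a concrete illustration, a sparse reward of this kind and the resulting one-step offline target might look as follows; the function names, `goal_reward`, and `gamma` are hypothetical choices, with the goal reward placed on the same scale as the bounded Q-values sketched above.

def sparse_reward(reached_goal: bool, goal_reward: float = 10.0) -> float:
    # Task-agnostic sparse reward: goal_reward on success, 0 everywhere else.
    return goal_reward if reached_goal else 0.0

def td_target(reached_goal: bool, next_q_max: float,
              gamma: float = 0.99, goal_reward: float = 10.0) -> float:
    # One-step target for offline Q-learning: goal transitions are terminal,
    # so only non-goal transitions bootstrap from the next state's value.
    r = sparse_reward(reached_goal, goal_reward)
    return r if reached_goal else r + gamma * next_q_max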

Furthermore, we address the instability that arises from directly optimizing the MARL objective, since the place agent depends on the pick agent's output. We optimize a novel objective that enforces collaboration between the pick and place agents while ensuring stable individual improvement. The figure summarizes how we estimate the behavior-cloning and offline MARL objectives.
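A sketch of how the behavior-cloning and offline MARL terms could be combined for the two agents, under our illustrative assumption that the place agent is conditioned on the demonstrated pick action so its update does not chase the pick agent's still-changing output; all names, shapes, and weights are hypothetical, not the paper's exact objective.

import torch
import torch.nn.functional as F

def joint_loss(pick_logits, place_logits, pick_q, place_q,
               pick_a, place_a, q_target,
               bc_weight: float = 1.0, q_weight: float = 1.0):
    # pick_logits, place_logits: [B, N] scores over flattened action maps
    # pick_a, place_a: [B] demonstrated action indices
    # pick_q, place_q: [B] Q-values of the demonstrated actions
    # q_target: [B] sparse-reward TD targets (see td_target above)
    bc = F.cross_entropy(pick_logits, pick_a) + F.cross_entropy(place_logits, place_a)
    q = F.mse_loss(pick_q, q_target) + F.mse_loss(place_q, q_target)
    # BC anchors the policy to the demonstrations; the Q terms let each
    # agent improve individually beyond them under the shared sparse reward.
    return bc_weight * bc + q_weight * q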

Comparison to Transporter on learning from imperfect demonstrations

We compare and visualize our CoTransporter against Transporter Networks from the same initial states.

Abstract

Transporter Networks are a state-of-the-art object manipulation method that learns to manipulate objects from only a few demonstrations, without interacting with the environment. However, they require the given demonstrations to be optimal; otherwise the learned model is misled by the failures they contain. Imperfect demonstrations are especially common in deformable object manipulation, since the high degrees of freedom of deformable objects such as cloths and cables can introduce multiple redundant steps when demonstrations are collected via teleoperation or even by human hands. In this work, we present Cooperative Transporter (CoTransporter), which enhances the original Transporter Network with offline multi-agent reinforcement learning (MARL). We show that it outperforms the Transporter by a large margin, without additional data, when given imperfect demonstrations. CoTransporter uses the Transporter Network objective for efficient policy exploration, while the MARL objective enforces collaboration between the pick and place agents, yielding a policy that not only obtains a higher success rate but also solves the tasks in fewer steps.

Results

We evaluate the models with two metrics: success rate and the normalized efficiency metric (NEM). A trial is considered a success if the model reaches the goal state within the maximum number of steps. The NEM is defined as \(\text{NEM}=\frac{1}{n}\sum_{i=1}^n \frac{\text{MaxStep}(\text{task})}{\text{Step}_i}\mathbb{1}(\text{MaxStep}(\text{task})\geq\text{Step}_i)\), which accounts not only for the success rate but also for how efficiently the task is accomplished.
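A direct NumPy implementation of the NEM as defined above; the helper name and its inputs are our own.

import numpy as np

def nem(steps, max_step):
    # steps: per-trial step counts (positive); trials exceeding max_step are
    # failures and contribute 0, exactly as the indicator in the definition.
    steps = np.asarray(steps, dtype=float)
    terms = np.where(steps <= max_step, max_step / steps, 0.0)
    return float(terms.mean())

For example, nem([5, 12, 30], max_step=20) averages 20/5, 20/12, and 0, rewarding trials that finish in fewer steps.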

The value beside each task indicates the length of the shortest trajectory in the training dataset; the higher the value, the more challenging the dataset.

Success Rate


The proposed CoTransporter improves the performance of DT by 25.0%.

Normalized Efficiency Metric (NEM)


The proposed CoTransporter improves the performance of DT by 32.5%, which shows that CoTransporter not only achieves a higher success rate but also accomplishes the task in fewer steps.

BibTeX

@article{wu2023cotransporter,
  author = {Wu, Yueh-Hua and Takayanagi, Takayoshi and Wang, Xiaolong and Suzuki, Hirotaka},
  title  = {CoTransporter: Offline Multi-Agent Reinforcement Learning for Object Manipulation},
  year   = {2023},
}