Multi-task imitation learning (MTIL) has shown significant potential in robotic manipulation by enabling agents to perform various tasks using a unified policy. This simplifies policy deployment and enhances the agent's adaptability across different contexts. However, key challenges remain, such as maintaining action reliability (e.g., avoiding abnormal action sequences that deviate from nominal task trajectories), distinguishing between similar tasks, and generalizing to unseen scenarios. To address these challenges, we introduce the Foresight-Augmented Manipulation Policy (FoAM), an innovative MTIL framework. FoAM not only learns to mimic expert actions but also predicts the visual outcomes of those actions to enhance decision-making. Additionally, it integrates multi-modal goal inputs, such as visual and language prompts, overcoming the limitations of single-conditioned policies. We evaluated FoAM across over 100 tasks in both simulation and real-world settings, demonstrating that it significantly improves IL policy performance, outperforming current state-of-the-art IL baselines by up to 41% in success rate. Furthermore, we released a simulation benchmark for robotic manipulation, featuring 10 task suites and over 80 challenging tasks designed for multi-task policy training and evaluation.
The low reliability of multi-task policies remains a persistent challenge. Existing MTIL policies align
robotic actions with expert actions based on goal conditions, but often fail to reason about the ambiguities and
variations in expert demonstration data, severely impacting the agents' performance on individual tasks.
Meanwhile, single-goal-conditioned policies come with their own limitations. For instance, policies conditioned
on language instructions struggle to generalize to unseen tasks without data augmentation.
While policies conditioned on goal images offer fine-grained guidance,
they frequently encounter ambiguities
in task activation and necessitate human intervention to accurately acquire and interpret the goal images.
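The foresight idea can be summarized as a combined objective: imitate the expert action while also predicting the visual outcome of that action. Below is a minimal NumPy sketch of such an objective; the exact loss form and the weighting term `lam` are illustrative assumptions, not the precise FoAM formulation.

```python
import numpy as np

def foam_style_loss(pred_actions, expert_actions,
                    pred_goal_image, goal_image, lam=0.1):
    """Combined imitation + foresight loss (illustrative only).

    pred_actions / expert_actions: (T, action_dim) action sequences.
    pred_goal_image / goal_image:  (H, W, C) arrays in [0, 1].
    lam: hypothetical weight balancing the two terms.
    """
    # Behavior-cloning term: L1 distance to the expert action sequence.
    imitation = np.abs(pred_actions - expert_actions).mean()
    # Foresight term: error of the predicted visual outcome
    # against the desired goal image.
    foresight = ((pred_goal_image - goal_image) ** 2).mean()
    return imitation + lam * foresight
```

Minimizing the foresight term pushes the policy's internal representation to stay consistent with the desired goal image, which is the intuition behind the foresight augmentation.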
The two videos below demonstrate the limitations of Language-based and
Image-based MT Policies, respectively.
Language prompt: Put carambola into the green bowl.
Limitations: struggles to generalize to unseen tasks.
Goal-Image prompt: Place the bitter melon on locker bottom layer.
Limitations: ambiguity in task activation; requires human intervention to acquire and interpret the goal image.
We introduce a goal imagination module into the FoAM framework to improve the agent's autonomy in achieving
desired goal images. We selected
InstructPix2Pix
as our goal imagination module, utilizing
approximately 20,000 pairs of training data. Of these, 16,000 pairs were derived from the cleaning robot
expert demonstrations provided by RT-1.
To construct each pair, the first and last frames of a
demonstration were used as the original and edited images, respectively, with the corresponding task
name serving as the instruction. Given that many of the demonstration datasets contained perturbations
in the final frames caused by robot arm movements, we undertook a detailed data cleaning procedure to
remove noise and ensure the training data quality. Additionally, we incorporated over 4,000 data pairs
sourced from our own simulation and real-world datasets. We fine-tuned the model for 500 epochs on a
single NVIDIA H100 GPU, a process that required approximately 3 days. During the image generation stage,
with the model weights pre-loaded, processing each initial observation of size 480×640×3
took about 4 seconds.
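The pair-construction step above can be sketched as follows. The function and its variance-based perturbation filter are our own illustrative assumptions; the actual data cleaning procedure is not specified in this detail.

```python
import numpy as np

def build_pix2pix_pairs(demos, noise_threshold=0.05):
    """Build (original, edited, instruction) training triples for an
    InstructPix2Pix-style editor from expert demonstrations.

    demos: list of dicts with
      "frames":    (T, H, W, 3) float array of visual observations
      "task_name": language instruction for the demonstration
    noise_threshold: hypothetical filter that drops demonstrations
      whose last two frames differ too much (e.g. the arm was still
      moving when recording stopped).
    """
    pairs = []
    for demo in demos:
        frames = np.asarray(demo["frames"], dtype=np.float64)
        # Skip perturbed endings: a large change between the final
        # frames suggests the scene had not settled.
        if np.abs(frames[-1] - frames[-2]).mean() > noise_threshold:
            continue
        pairs.append({
            "original": frames[0],    # initial observation
            "edited": frames[-1],     # achieved goal image
            "instruction": demo["task_name"],
        })
    return pairs
```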
The figures below illustrate the demonstrations of the fine-tuned model in both simulation and
real-world scenarios. For each demonstration, the image on the left represents the initial visual
observation, while the image on the right depicts the goal image generated according to the given
language prompt.
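The generation step follows the standard InstructPix2Pix call pattern from the `diffusers` library: the pipeline receives the language prompt and the initial observation and returns the edited goal image. The wrapper below is a sketch; the commented-out checkpoint name is a public release, not necessarily the authors' fine-tuned weights.

```python
def generate_goal_image(pipe, observation, prompt, steps=20):
    """Run an InstructPix2Pix-style pipeline on one observation.

    pipe: a loaded StableDiffusionInstructPix2PixPipeline, or any
          callable with the same (prompt, image, steps) interface.
    observation: PIL.Image of the initial 480x640 camera view.
    Returns the generated goal image.
    """
    result = pipe(prompt, image=observation,
                  num_inference_steps=steps)
    return result.images[0]

# Typical loading code (requires diffusers, transformers, torch):
# from diffusers import StableDiffusionInstructPix2PixPipeline
# pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
#     "timbrooks/instruct-pix2pix")
```

Keeping the model weights pre-loaded, as described above, amortizes the load time so that each observation costs only the ~4-second diffusion pass.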
We developed a simulated dual-arm robotic system, with each arm possessing 6 DoF
and a 1-DoF parallel-jaw gripper, closely replicating a commonly used UR3e
robot. A total of 86 simulation tasks (including 4 unseen tasks) were designed,
encompassing a broad range of practical skills, such as picking, moving, pushing,
placing, sliding, inserting, opening, closing, and transferring.
The following video provides an overview of the multi-task scenarios in the FoAM-benchmark.
We conducted an in-depth exploration of FoAM, focusing on three key aspects:
external disturbance, reactiveness, and unseen task generalization.
External Disturbance: Despite the introduction of additional objects to disrupt the operation process,
the robot was able to complete the task without significant difficulties.
Reactiveness: During the task execution, we forcibly removed the object from the gripper. In response,
the robot exhibited the ability to attempt re-grasping the object and ultimately complete the task.
Unseen Task Generalization: To evaluate FoAM's performance on unseen tasks, we substituted the eggplant with a carambola in the Put Fruits into the Green Bowl real-world scenario.
FoAM demonstrated the highest success rate compared to other strong baselines.
In this work, we introduced FoAM, a novel multimodal goal-conditioned policy designed to enhance the performance of multi-task policies and address the limitations of single goal-conditioned approaches. Inspired by human behavior, FoAM improves agent performance by imitating expert actions while simultaneously considering the visual outcomes of those actions. In our published FoAM-benchmark and across real-world scenarios, FoAM achieved improvements of up to 41% in success rate compared with previous methods. However, FoAM exhibited certain limitations in real-world Scenarios I and II, which involve high precision requirements. To address this, we will explore refining long-horizon tasks by generating fine-grained intermediate goal images to serve as guidance. By leveraging these intermediate visual states, we seek to reduce cumulative errors during operations and improve the agent's execution accuracy.
We acknowledge the engineers at CoreNetic.ai for their technical support, and Yixuan Wang for his suggestions during the manuscript preparation.
@misc{liu2024foamforesightaugmentedmultitaskimitation,
title={FoAM: Foresight-Augmented Multi-Task Imitation Policy for Robotic Manipulation},
author={Litao Liu and Wentao Wang and Yifan Han and Zhuoli Xie and Pengfei Yi and Junyan Li and Yi Qin and Wenzhao Lian},
year={2024},
eprint={2409.19528},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2409.19528},
}