Multi-task imitation learning (MTIL) has shown significant potential in robotic manipulation by enabling agents to perform various tasks using a unified policy. This simplifies policy deployment and enhances the agent's adaptability across different contexts. However, key challenges remain, such as maintaining action reliability (e.g., avoiding abnormal action sequences that deviate from nominal task trajectories), distinguishing between similar tasks, and generalizing to unseen scenarios. To address these challenges, we introduce the Foresight-Augmented Manipulation Policy (FoAM), an innovative MTIL framework. FoAM not only learns to mimic expert actions but also predicts the visual outcomes of those actions to enhance decision-making. Additionally, it integrates multi-modal goal inputs, such as visual and language prompts, overcoming the limitations of single-conditioned policies. We evaluated FoAM across over 100 tasks in both simulation and real-world settings, demonstrating that it significantly improves IL policy performance, outperforming current state-of-the-art IL baselines by up to 41% in success rate. Furthermore, we released a simulation benchmark for robotic manipulation, featuring 10 task suites and over 80 challenging tasks designed for multi-task policy training and evaluation.
The low reliability of multi-task policies remains a persistent challenge. Existing MTIL policies align
robotic actions with expert actions based on goal conditions, but often fail to reason about the ambiguities and
variations in expert demonstration data, severely impacting the agents' performance on individual tasks.
Meanwhile, single-goal-conditioned policies come with their own limitations. For instance, policies conditioned
on language instructions struggle to generalize to unseen tasks without data augmentation.
While policies conditioned on goal images offer fine-grained guidance,
they frequently encounter ambiguities
in task activation and necessitate human intervention to accurately acquire and interpret the goal images.
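The foresight idea can be summarized as a combined objective: imitate the expert action while also predicting the visual outcome of that action. Below is a minimal NumPy sketch of such an objective; the exact loss form and the weighting term `lam` are illustrative assumptions, not the precise FoAM formulation.

```python
import numpy as np

def foam_style_loss(pred_actions, expert_actions,
                    pred_goal_image, goal_image, lam=0.1):
    """Combined imitation + foresight loss (illustrative only).

    pred_actions / expert_actions: (T, action_dim) action sequences.
    pred_goal_image / goal_image:  (H, W, C) arrays in [0, 1].
    lam: hypothetical weight balancing the two terms.
    """
    # Behavior-cloning term: L1 distance to the expert action sequence.
    imitation = np.abs(pred_actions - expert_actions).mean()
    # Foresight term: error of the predicted visual outcome
    # against the desired goal image.
    foresight = ((pred_goal_image - goal_image) ** 2).mean()
    return imitation + lam * foresight
```

Minimizing the foresight term pushes the policy's internal representation to stay consistent with the desired goal image, which is the intuition behind the foresight augmentation.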
The two videos below demonstrate the limitations of Language-based and
Image-based MT Policies, respectively.
Language prompt: Put carambola into the green bowl.
Limitations: struggles to generalize to unseen tasks.
Goal-Image prompt: Place the bitter melon on locker bottom layer.
Limitations: ambiguity in task activation; requires human intervention to acquire and interpret the goal image.
We introduce a goal imagination module into the FoAM framework to improve the agent's autonomy in achieving
desired goal images. We selected
InstructPix2Pix
as our goal imagination module, utilizing
approximately 20,000 pairs of training data. Of these, 16,000 pairs were derived from the cleaning robot
expert demonstrations provided by RT-1.
To construct each pair, the first and last frames of a
demonstration were used as the original and edited images, respectively, with the corresponding task
name serving as the instruction. Given that many of the demonstration datasets contained perturbations
in the final frames caused by robot arm movements, we undertook a detailed data cleaning procedure to
remove noise and ensure the training data quality. Additionally, we incorporated over 4,000 data pairs
sourced from our own simulation and real-world datasets. We fine-tuned the model for 500 epochs on a
single NVIDIA H100 GPU, a process that required approximately 3 days. During the image generation stage,
with the model weights pre-loaded, processing each initial observation of size 480×640×3
took about 4 seconds.
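The pair-construction step above can be sketched as follows. The function and its variance-based perturbation filter are our own illustrative assumptions; the actual data cleaning procedure is not specified in this detail.

```python
import numpy as np

def build_pix2pix_pairs(demos, noise_threshold=0.05):
    """Build (original, edited, instruction) training triples for an
    InstructPix2Pix-style editor from expert demonstrations.

    demos: list of dicts with
      "frames":    (T, H, W, 3) float array of visual observations
      "task_name": language instruction for the demonstration
    noise_threshold: hypothetical filter that drops demonstrations
      whose last two frames differ too much (e.g. the arm was still
      moving when recording stopped).
    """
    pairs = []
    for demo in demos:
        frames = np.asarray(demo["frames"], dtype=np.float64)
        # Skip perturbed endings: a large change between the final
        # frames suggests the scene had not settled.
        if np.abs(frames[-1] - frames[-2]).mean() > noise_threshold:
            continue
        pairs.append({
            "original": frames[0],    # initial observation
            "edited": frames[-1],     # achieved goal image
            "instruction": demo["task_name"],
        })
    return pairs
```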
The figures below illustrate the demonstrations of the fine-tuned model in both simulation and
real-world scenarios. For each demonstration, the image on the left represents the initial visual
observation, while the image on the right depicts the goal image generated according to the given
language prompt.
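The generation step follows the standard InstructPix2Pix call pattern from the `diffusers` library: the pipeline receives the language prompt and the initial observation and returns the edited goal image. The wrapper below is a sketch; the commented-out checkpoint name is a public release, not necessarily the authors' fine-tuned weights.

```python
def generate_goal_image(pipe, observation, prompt, steps=20):
    """Run an InstructPix2Pix-style pipeline on one observation.

    pipe: a loaded StableDiffusionInstructPix2PixPipeline, or any
          callable with the same (prompt, image, steps) interface.
    observation: PIL.Image of the initial 480x640 camera view.
    Returns the generated goal image.
    """
    result = pipe(prompt, image=observation,
                  num_inference_steps=steps)
    return result.images[0]

# Typical loading code (requires diffusers, transformers, torch):
# from diffusers import StableDiffusionInstructPix2PixPipeline
# pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
#     "timbrooks/instruct-pix2pix")
```

Keeping the model weights pre-loaded, as described above, amortizes the load time so that each observation costs only the ~4-second diffusion pass.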
We developed a simulated dual-arm robotic system, with each arm possessing 6 DoF
and a 1-DoF parallel-jaw gripper, closely replicating a commonly used UR3e
robot. A total of 86 simulation tasks (including 4 unseen tasks) were designed,
encompassing a broad range of practical skills, such as picking, moving, pushing,
placing, sliding, inserting, opening, closing, and transferring.
The following video provides an overview of the multi-task scenarios in the FoAM-benchmark.
We conducted an in-depth exploration of FoAM, focusing on three key aspects:
external disturbance, reactiveness, and unseen task generalization.
External Disturbance: Despite the introduction of additional objects to disrupt the operation process,
the robot was able to complete the task without significant difficulties.
Reactiveness: During the task execution, we forcibly removed the object from the gripper. In response,
the robot exhibited the ability to attempt re-grasping the object and ultimately complete the task.
Unseen Task Generalization: To evaluate FoAM's performance on unseen tasks, we substituted the eggplant with a carambola in the Put Fruits into the Green Bowl real-world scenario.
FoAM demonstrated the highest success rate compared to other strong baselines.
In this work, we introduced FoAM, a novel multimodal goal-conditioned policy designed to enhance the performance of multi-task policies and address the limitations of single goal-conditioned approaches. Inspired by human behavior, FoAM improves agent performance by imitating expert actions while simultaneously considering the visual outcomes of those actions. In our published FoAM-benchmark and across real-world scenarios, FoAM achieved improvements of up to 41% in success rate compared with previous methods. However, FoAM exhibited certain limitations in real-world Scenarios I and II, which involve high precision requirements. To address this, we will explore refining long-horizon tasks by generating fine-grained intermediate goal images to serve as guidance. By leveraging these intermediate visual states, we seek to reduce cumulative errors during operations and improve the agent's execution accuracy.
We acknowledge the engineers at CoreNetic.ai for their technical support, and Yixuan Wang for his suggestions during the manuscript preparation.
@misc{liu2024foamforesightaugmentedmultitaskimitation,
title={FoAM: Foresight-Augmented Multi-Task Imitation Policy for Robotic Manipulation},
author={Litao Liu and Wentao Wang and Yifan Han and Zhuoli Xie and Pengfei Yi and Junyan Li and Yi Qin and Wenzhao Lian},
year={2024},
eprint={2409.19528},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2409.19528},
}