One-shot Imitation in a Non-Stationary Environment via Multi-Modal Skill

Published 13 Feb 2024 in cs.AI | (2402.08369v1)

Abstract: One-shot imitation is to learn a new task from a single demonstration, yet it is a challenging problem to adopt it for complex tasks with the high domain diversity inherent in a non-stationary environment. To tackle the problem, we explore the compositionality of complex tasks, and present a novel skill-based imitation learning framework enabling one-shot imitation and zero-shot adaptation; from a single demonstration for a complex unseen task, a semantic skill sequence is inferred and then each skill in the sequence is converted into an action sequence optimized for environmental hidden dynamics that can vary over time. Specifically, we leverage a vision-LLM to learn a semantic skill set from offline video datasets, where each skill is represented on the vision-language embedding space, and adapt meta-learning with dynamics inference to enable zero-shot skill adaptation. We evaluate our framework with various one-shot imitation scenarios for extended multi-stage Meta-world tasks, showing its superiority in learning complex tasks, generalizing to dynamics changes, and extending to different demonstration conditions and modalities, compared to other baselines.

Abstract PDF Upgrade to Chat

Authors (5)

Citations (6)

View on Semantic Scholar

Summary

The paper presents OnIS, a framework that enables one-shot imitation and zero-shot adaptation through multi-modal semantic skill representation.
It employs a vision-language model with contrastive learning to encode semantic skills and a dynamics-aware module for real-time skill transfer.
Evaluations in simulated and real-world tasks demonstrate improved success rates and robust generalization compared to prior methods.

The paper "One-shot Imitation in a Non-Stationary Environment via Multi-Modal Skill" (2402.08369) presents a novel framework, OnIS, to address the challenge of one-shot imitation and zero-shot adaptation for autonomous agents performing complex tasks in non-stationary environments. This framework leverages semantic skill representation using a vision-LLM, combined with meta-learning techniques, to enable task decomposition and dynamics-aware skill transfer.

Introduction

The paper focuses on tackling the one-shot imitation problem where an autonomous agent can learn a complex new task from only a single demonstration. The approach uses skill-based imitation learning to handle tasks with changing dynamics across various domains in non-stationary environments. The authors propose leveraging the compositionality inherent in control tasks and semantic representations developed by vision-language pretrained models. This method enables the adaptation of the learned skills to various dynamic conditions without explicit task-specific training on each condition.

Problem Formulation

The problem is formulated within the context of Hidden Parameter Markov Decision Processes (HiP-MDP), allowing for dynamic variation over time without requiring explicit reward signals. The challenge lies in the agent's ability to infer a semantic skill sequence from a demonstration, adapt it dynamically, and execute actions optimized for current environmental conditions. This setup is illustrated using multi-stage tasks requiring sequential execution of subtasks in the Meta-world environment, a simulated robotics manipulation platform.

Figure 1: Multi-stage Meta-world task.

Our Approach

Semantic Skill Learning

The OnIS framework includes a semantic skill sequence encoder using a vision-LLM such as CLIP to learn semantic skill representations. The semantic skill sequence encoder employs prompt-based contrastive learning to translate video demonstrations into temporally organized semantic skills.

Figure 2: OnIS framework: In (a), the semantic skill sequence encoder $\Phi_{enc}$ and the semantic skill decoder $\Phi_{dec}$ are trained offline using the CLIP vision-language pretrained model.

The skills generated hold semantic correspondence with actions required to complete the task, enabling zero-shot adaptation across different task dynamics by employing a dynamic-aware skill transfer scheme.

Dynamics-aware Skill Transfer

The paper introduces a novel dynamics encoder and skill transfer module trained jointly via contrastive learning and behavior cloning, enabling real-time adaptation to changing environmental dynamics. The framework optimizes action sequences based on inferred current dynamics and semantic skills, facilitating reliable execution across diverse conditions.

Evaluations

The framework is evaluated in the Meta-world environment, demonstrating superior performance in tasks where unseen dynamics are presented. Results show significant improvement in task success rates and robust generalization across varied dynamic conditions compared to existing methods such as BC-Z and SPiRL.

Figure 3: Effect by annotated sample size: the x-axis denotes the number of annotated samples used for training the semantic skill sequence encoder in S-OnIS, and the y-axis denotes the one-shot imitation performance by S-OnIS.

Use Cases

Real-world Demonstrations

The framework successfully handles real-world video demonstrations, validating its robustness and practical applicability. Real-world tests show that OnIS can accurately infer task semantics and perform corresponding actions even with variances in demonstration quality.

Figure 4: Real-world demonstrations.

The framework's efficacy in handling noisy demonstrations and strict performance requirements is further enhanced by the semantic embedding approach, which aligns demonstrated and executed task trajectories effectively.

Conclusion

OnIS represents a significant step towards enabling one-shot imitation learning in non-stationary environments. By combining semantic skill representation with dynamics-aware adaptation methods, this framework shows not only elevated performance but also promising scalability and versatility in implementing real-world applications. Future work may focus on extending this approach to environments with limited data availability and exploring further enhancements in semantic skill compositionality.

Markdown Report Issue