Publications
Let Me Help You! Neuro-Symbolic Short-Context Action Anticipation
Sarthak Bhagat, Samuel Li, Joseph Campbell, Yaqi Xie, Katia Sycara, Simon Stepputtis. IEEE Robotics and Automation Letters, 2024.
In an era where robots are becoming available to the general public, the applicability of assistive robotics extends across numerous aspects of daily life, including in-home robotics. This work presents a novel approach for such systems, leveraging long-horizon action anticipation from short-observation contexts. In an assistive cooking task, we demonstrate that predicting human intention leads to effective collaboration between humans and robots. Compared to prior approaches, our method halves the required observation time of human behavior before accurate future predictions can be made, thus allowing for quick and effective task support from short contexts. To provide sufficient context in such scenarios, our proposed method analyzes the human user and their interaction with surrounding scene objects by imbuing the system with additional domain knowledge, encoding the scene objects’ affordances. We integrate this knowledge into a transformer-based action anticipation architecture, which alters the attention mechanism between different visual features by either boosting or attenuating the attention between them. Through this approach, we achieve up to a 9% improvement on two common action anticipation benchmarks, namely 50Salads and Breakfast. After predicting a sequence of future actions, our system selects an appropriate assistive action that is subsequently executed on a robot for a joint salad preparation task between a human and a robot.
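As a rough illustration of how attention between visual features can be boosted or attenuated with symbolic knowledge, the sketch below shifts standard attention logits by an affordance-compatibility score. It is a minimal, hypothetical example; the affordance_bias matrix, tensor shapes, and strength parameter are placeholder assumptions rather than the architecture used in the paper.

```python
import torch
import torch.nn.functional as F

def knowledge_biased_attention(q, k, v, affordance_bias, strength=1.0):
    """Scaled dot-product attention whose logits are shifted by a
    symbolic compatibility score between scene-object tokens.

    q, k, v:          (batch, num_tokens, dim) visual feature tokens
    affordance_bias:  (num_tokens, num_tokens); > 0 boosts attention between
                      affordance-compatible objects, < 0 attenuates it
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5     # standard attention logits
    logits = logits + strength * affordance_bias    # symbolic boost / attenuation
    weights = F.softmax(logits, dim=-1)
    return weights @ v

# toy usage: 4 object tokens, 16-dim features
q = k = v = torch.randn(1, 4, 16)
bias = torch.zeros(4, 4)
bias[0, 1] = 2.0   # e.g. a "knife" token attends more strongly to "cutting board"
out = knowledge_biased_attention(q, k, v, bias)
print(out.shape)   # torch.Size([1, 4, 16])
```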
ShapeGrasp: Zero-Shot Task-Oriented Grasping with Large Language Models through Geometric Decomposition
Samuel Li, Sarthak Bhagat, Joseph Campbell, Yaqi Xie, Woojun Kim, Katia Sycara, Simon Stepputtis. IEEE International Conference on Intelligent Robots and Systems, 2024.
Task-oriented grasping of unfamiliar objects is a necessary skill for robots in dynamic in-home environments. Inspired by the human capability to grasp such objects through intuition about their shape and structure, we present a novel zero-shot task-oriented grasping method leveraging a geometric decomposition of the target object into simple, convex shapes that we represent in a graph structure, including geometric attributes and spatial relationships. Our approach employs minimal essential information, namely the object’s name and the intended task, to facilitate zero-shot task-oriented grasping. We utilize the commonsense reasoning capabilities of large language models to dynamically assign semantic meaning to each decomposed part and subsequently reason over the utility of each part for the intended task. Through extensive experiments on a real-world robotics platform, we demonstrate that our grasping approach’s decomposition and reasoning pipeline is capable of selecting the correct part in 92% of the cases and successfully grasping the object in 82% of the tasks we evaluate. Additional videos, experiments, code, and data are available on our project website.
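The following sketch illustrates the general decomposition-graph-plus-LLM idea: parts from a convex decomposition are serialized into a prompt that asks a language model to label each part and score its grasp utility for the task. The Part attributes, prompt wording, and the query_llm stand-in are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Part:
    shape_hint: str     # e.g. "elongated cylinder" from the convex decomposition
    dimensions: tuple   # (x, y, z) extents in meters
    centroid: tuple     # position relative to the object frame

def build_part_prompt(object_name, task, parts, edges):
    """Serialize the part graph into a prompt asking the LLM to assign a
    semantic label and a grasp-utility score to each decomposed part."""
    lines = [f"Object: {object_name}. Task: {task}.", "Parts:"]
    for i, p in enumerate(parts):
        lines.append(f"  {i}: {p.shape_hint}, size={p.dimensions}, at={p.centroid}")
    lines.append("Adjacent parts: " + ", ".join(f"{a}-{b}" for a, b in edges))
    lines.append("For each part, give a semantic label and a 0-10 grasp-utility score.")
    return "\n".join(lines)

# hypothetical usage with a placeholder LLM backend
parts = [Part("elongated cylinder", (0.03, 0.03, 0.20), (0.0, 0.0, 0.10)),
         Part("flat wedge", (0.08, 0.02, 0.15), (0.0, 0.0, 0.28))]
edges = [(0, 1)]
prompt = build_part_prompt("kitchen knife", "hand it to a person", parts, edges)
# response = query_llm(prompt)  # stand-in for whichever LLM is used
print(prompt)
```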
A Comparison of Imitation Learning Algorithms for Bimanual Manipulation
Michael Drolet, Simon Stepputtis, Siva Kailas, Ajinkya Jain, Jan Peters, Stefan Schaal, Heni Ben Amor. IEEE Robotics and Automation Letters, 2024.
Amidst the wide popularity of imitation learning algorithms in robotics, their properties regarding hyperparameter sensitivity, ease of training, data efficiency, and performance have not been well-studied in high-precision industry-inspired environments. In this work, we demonstrate the limitations and benefits of prominent imitation learning approaches and analyze their capabilities regarding these properties. We evaluate each algorithm on a complex bimanual manipulation task involving an over-constrained dynamics system in a setting involving multiple contacts between the manipulated object and the environment. While we find that imitation learning is well suited to solve such complex tasks, not all algorithms are equal in terms of handling environmental and hyperparameter perturbations, training requirements, performance, and ease of use. We investigate the empirical influence of these key characteristics by employing a carefully designed experimental procedure and learning environment. Paper website: https://bimanual-imitation.github.io/
Adaptive Action Advising with Different Rewards
Yue Guo, Xijia Zhang, Simon Stepputtis, Joseph Campbell, Katia P. Sycara. Conference on Lifelong Learning Agents, 2024.
Action advising is a critical aspect of reinforcement learning, involving a teacher-student paradigm wherein the teacher, possessing a pre-trained policy, advises the student with actions computed from its policy based on the student’s observations, thereby improving the student’s task performance. An important requirement is for the teacher to be able to learn to robustly adapt and give effective advice in new environments where the reward differs from the one the teacher has been trained on. This issue has not been considered in the current teacher-student literature; most existing work requires the teacher to be pre-trained with the same reward that the student interacts with and cannot generalize advice beyond that policy, and the reward that the student gains through interaction with the environment is given directly to the teacher, regardless of the exploration process. To fill this gap, our proposed method enhances action advising by allowing the teacher to learn by observing and collecting data from the student and adapting its reward function. We empirically evaluate our method on three environments, namely Gridworld, ALE Skiing, and Pacman, and find that our method demonstrates improved policy returns and sample efficiency.
Geometric Shape Reasoning for Zero-Shot Task-Oriented Grasping
Samuel Li, Sarthak Bhagat, Joseph Campbell, Yaqi Xie, Woojun Kim, Katia Sycara, Simon Stepputtis. ICRA Workshop on 3D Visual Representations for Robot Manipulation, 2024.
Task-oriented grasping of unfamiliar objects is a necessary skill for robots in dynamic in-home environments. Inspired by the human capability to grasp such objects through intuition about their shape and structure, we present a novel zero-shot task-oriented grasping method leveraging a geometric decomposition of the target object into simple, convex shapes that we represent in a graph structure, including geometric attributes and spatial relationships. Our approach employs minimal essential information, namely the object’s name and the intended task, to facilitate zero-shot task-oriented grasping. We utilize the commonsense reasoning capabilities of large language models to dynamically assign semantic meaning to each decomposed part and subsequently reason over the utility of each part for the intended task. Through extensive experiments on a real-world robotics platform, we demonstrate that our grasping approach’s decomposition and reasoning pipeline is capable of selecting the correct part in 92% of the cases and successfully grasping the object in 82% of the tasks we evaluate. Additional videos, experiments, code, and data are available on our project website: this https URL.
Symbolic Graph Inference for Compound Scene Understanding
FNU Aryan, Simon Stepputtis, Sarthak Bhagat, Joseph Campbell, Kwonjoon Lee, Hossein Nourkhiz Mahjoub, Katia Sycara. ICRA Workshop on Ontologies and Standards for Robotics and Automation, 2024.
Scene understanding is a fundamental capability needed in many domains ranging from question-answering to robotics. Unlike recent end-to-end approaches that must explicitly learn varying compositions of the same scene, our method reasons over their constituent objects and analyzes their arrangement to infer a scene’s meaning. We propose a novel approach that reasons over a scene’s scene- and knowledge-graph, capturing spatial information while being able to utilize general domain knowledge in a joint graph search. Empirically, we demonstrate the feasibility of our method on the ADE20K dataset and compare it to current scene understanding approaches.
Transfer Learning via Temporal Contrastive Learning
Weihao Zeng, Joseph Campbell, Simon Stepputtis, Katia Sycara. ICRA Workshop on Multi-Agent Dynamic Games, 2024.
This paper introduces a novel transfer learning framework for deep multi-agent reinforcement learning. The approach automatically combines goal-conditioned policies with temporal contrastive learning to discover meaningful sub-goals. The approach involves pre-training a goal-conditioned agent, finetuning it on the target domain, and using contrastive learning to construct a planning graph that guides the agent via sub-goals. Experiments on multi-agent coordination Overcooked tasks demonstrate improved sample efficiency, the ability to solve sparse-reward and long-horizon problems, and enhanced interpretability compared to baselines. The results highlight the effectiveness of integrating goal-conditioned policies with unsupervised temporal abstraction learning for complex multi-agent transfer learning. Compared to state-of-the-art baselines, our method achieves the same or better performance while requiring only 21.7% of the training samples.
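A minimal, hypothetical sketch of temporal contrastive learning over states is shown below: an InfoNCE-style loss pulls embeddings of temporally nearby states together so that structure in the embedding space can later serve for sub-goal discovery. Network sizes, the temperature, and the sampling of state pairs are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateEncoder(nn.Module):
    def __init__(self, obs_dim, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, emb_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def temporal_infonce(encoder, obs_t, obs_tk, temperature=0.1):
    """Pull embeddings of states a few steps apart (obs_t, obs_tk) together
    and push apart states sampled from other trajectories in the batch."""
    z_a, z_b = encoder(obs_t), encoder(obs_tk)
    logits = z_a @ z_b.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(len(obs_t))          # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# toy usage: batch of 32 observation pairs with 10-dim observations
enc = StateEncoder(obs_dim=10)
loss = temporal_infonce(enc, torch.randn(32, 10), torch.randn(32, 10))
loss.backward()
```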
Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation
Zifu Wan, Pingping Zhang, Yuhao Wang, Silong Yong, Simon Stepputtis, Katia Sycara, Yaqi Xie. IEEE/CVF Winter Conference on Applications of Computer Vision, 2024.
Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable prediction. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation, utilizing the Mamba state space model. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and introducing a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at our project website.
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
Ce Zhang, Simon Stepputtis, Joseph Campbell, Katia Sycara, Yaqi Xie. Conference on Computer Vision and Pattern Recognition, 2024.
Being able to understand visual scenes is a precursor for many downstream tasks, including autonomous driving, robotics, and other vision-based approaches. A common approach enabling the ability to reason over visual data is Scene Graph Generation (SGG); however, many existing approaches assume undisturbed vision, i.e., the absence of real-world corruptions such as fog, snow, and smoke, as well as non-uniform perturbations like sun glare or water drops. In this work, we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. Further, we introduce a corresponding approach, Hierarchical Knowledge Enhanced Robust Scene Graph Generation (HiKER-SGG), providing a strong baseline for scene graph generation under such challenging settings. At its core, HiKER-SGG utilizes a hierarchical knowledge graph in order to refine its predictions from coarse initial estimates to detailed predictions. In our extensive experiments, we show that HiKER-SGG not only demonstrates superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is available at this https URL
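The coarse-to-fine idea behind refining predictions over a hierarchy can be illustrated with a toy sketch: a coarse group prediction gates the fine-grained predicate scores. The two-level hierarchy, predicate names, and feature dimensions below are invented for illustration and do not reflect the paper's actual knowledge graph.

```python
import torch
import torch.nn as nn

# invented two-level predicate hierarchy: coarse group -> fine predicates
HIERARCHY = {"geometric": ["above", "under", "near"],
             "semantic":  ["riding", "holding", "eating"]}
FINE = [p for group in HIERARCHY.values() for p in group]
GROUP_OF = {FINE.index(p): gi
            for gi, group in enumerate(HIERARCHY.values()) for p in group}

class CoarseToFineHead(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.coarse = nn.Linear(feat_dim, len(HIERARCHY))
        self.fine = nn.Linear(feat_dim + len(HIERARCHY), len(FINE))

    def forward(self, feats):
        coarse_probs = self.coarse(feats).softmax(-1)        # first guess the group
        fine_logits = self.fine(torch.cat([feats, coarse_probs], -1))
        # weight fine predicates by how plausible their coarse group is
        mask = torch.stack([coarse_probs[:, GROUP_OF[i]] for i in range(len(FINE))], -1)
        return fine_logits.softmax(-1) * mask

head = CoarseToFineHead()
print(head(torch.randn(2, 256)).shape)   # torch.Size([2, 6])
```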
Long-Horizon Dialogue Understanding for Role Identification in the Game of Avalon with Large Language Models
Simon Stepputtis, Joseph Campbell, Yaqi Xie, Zhengyang Qi, Wenxin Sharon Zhang, Ruiyi Wang, Sanketh Rangreji, Michael Lewis, Katia Sycara. Findings of the Association for Computational Linguistics: EMNLP, 2023.
Deception and persuasion play a critical role in long-horizon dialogues between multiple parties, especially when the interests, goals, and motivations of the participants are not aligned. Such complex tasks pose challenges for current Large Language Models (LLMs), as deception and persuasion can easily mislead them, especially in long-horizon multi-party dialogues. To this end, we explore the game of Avalon: The Resistance, a social deduction game in which players must determine each other’s hidden identities to complete their team’s objective. We introduce an online testbed and a dataset containing 20 carefully collected and labeled games among human players that exhibit long-horizon deception in a cooperative-competitive setting. We discuss the capabilities of LLMs to utilize deceptive long-horizon conversations between six human players to determine each player’s goal and motivation. In particular, we discuss the multimodal integration of the chat between the players and the game’s state that grounds the conversation, providing further insights into the true player identities. We find that even current state-of-the-art LLMs do not reach human performance, making our dataset a compelling benchmark to investigate the decision-making and language-processing capabilities of LLMs. Our dataset and online testbed can be found at our project website: this https URL
Theory of Mind for Multi-Agent Collaboration via Large Language Models
Huao Li, Yu Quan Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Michael Lewis, Katia Sycara. Empirical Methods in Natural Language Processing, 2023.
While Large Language Models (LLMs) have demonstrated impressive accomplishments in both reasoning and planning, their abilities in multi-agent collaborations remain largely unexplored. This study evaluates LLM-based agents in a multi-agent cooperative text game with Theory of Mind (ToM) inference tasks, comparing their performance with Multi-Agent Reinforcement Learning (MARL) and planning-based baselines. We observed evidence of emergent collaborative behaviors and high-order Theory of Mind capabilities among LLM-based agents. Our results reveal limitations in LLM-based agents’ planning optimization due to systematic failures in managing long-horizon contexts and hallucination about the task state. We explore the use of explicit belief state representations to mitigate these issues, finding that it enhances task performance and the accuracy of ToM inferences for LLM-based agents.
Characterizing Out-of-Distribution Error via Optimal Transport
Yuzhe Lu, Yilong Qin, Runtian Zhai, Andrew Shen, Ketong Chen, Zhenlin Wang, Soheil Kolouri, Simon Stepputtis, Joseph Campbell, Katia Sycara. Conference on Neural Information Processing Systems, 2023.
Out-of-distribution (OOD) data poses serious challenges in deployed machine learning models, so methods of predicting a model’s performance on OOD data without labels are important for machine learning safety. While a number of methods have been proposed by prior work, they often underestimate the actual error, sometimes by a large margin, which greatly impacts their applicability to real tasks. In this work, we identify pseudo-label shift, or the difference between the predicted and true OOD label distributions, as a key indicator of this underestimation. Based on this observation, we introduce a novel method for estimating model performance by leveraging optimal transport theory, Confidence Optimal Transport (COT), and show that it provably provides more robust error estimates in the presence of pseudo-label shift. Additionally, we introduce an empirically-motivated variant of COT, Confidence Optimal Transport with Thresholding (COTT), which applies thresholding to the individual transport costs and further improves the accuracy of COT’s error estimates. We evaluate COT and COTT on a variety of standard benchmarks that induce various types of distribution shift – synthetic, novel subpopulation, and natural – and show that our approaches significantly outperform existing state-of-the-art methods with up to 3x lower prediction error.
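The sketch below is not the paper's exact COT/COTT formulation, which transports full confidence vectors and applies per-sample thresholding; it is a simpler, plainly named stand-in: the optimal transport cost between the predicted class marginal on unlabeled OOD data and a reference label distribution under a 0-1 ground cost, which reduces to total variation distance. The function and variable names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def marginal_mismatch_error(logits_ood, source_label_marginal):
    """Mismatch score between the predicted class marginal on unlabeled OOD
    data and the source label distribution. With a 0-1 ground cost, the
    optimal transport cost equals the total variation distance."""
    pred_marginal = softmax(logits_ood).mean(axis=0)
    return 0.5 * np.abs(pred_marginal - source_label_marginal).sum()

# toy usage: 3 classes, 1000 unlabeled OOD samples
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 3))
print(marginal_mismatch_error(logits, np.array([0.5, 0.3, 0.2])))
```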
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
Ce Zhang, Simon Stepputtis, Joseph Campbell, Katia Sycara, Yaqi Xie. NeurIPS Workshop on New Frontiers in Graph Learning, 2023.
Being able to understand visual scenes is a precursor for many downstream tasks, including autonomous driving, robotics, and other vision-based approaches. A common approach enabling the ability to reason over visual data is Scene Graph Generation (SGG); however, many existing approaches assume undisturbed vision, i.e., the absence of real-world corruptions such as fog, snow, and smoke, as well as non-uniform perturbations like sun glare or water drops. In this work, we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. Further, we introduce a corresponding approach, Hierarchical Knowledge Enhanced Robust Scene Graph Generation (HiKER-SGG), providing a strong baseline for scene graph generation under such challenging settings. At its core, HiKER-SGG utilizes a hierarchical knowledge graph in order to refine its predictions from coarse initial estimates to detailed predictions. In our extensive experiments, we show that HiKER-SGG not only demonstrates superior performance on corrupted images in a zero-shot manner, but also outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is available at this https URL
A framework for intervention based team support in time critical tasks
Dana Hughes, Huao Li, Max Chis, Ini Oguntola, Simon Stepputtis, Keyang Zheng, Joseph Campbell, Katia Sycara, Michael Lewis. IEEE International Conference on Systems, Man, and Cybernetics, 2023.
In this paper, we describe the intervention framework of ATLAS, an artificial socially intelligent agent that advises teams. The framework treats interventions as atomic components and manages the lifecycle of each intervention through presentation, as well as follow-ups to interventions. The key benefit of this framework is that it allows for rapid development of scenario-specific interventions that leverage scenario-agnostic team models. The implementation of this framework is reported for three-player teams in a Search and Rescue task simulated in Minecraft. Low-competence teams advised by ATLAS improved more between the first and second trials than those with a human advisor, while the reverse was found for high-competence teams. Four times as many interventions were proposed as were presented. 15% of advice was withheld to avoid repetitive advice, an excessive rate of advice, and needlessly advising high-performing teams, while a Theory of Mind model and a delay-for-confirmation mechanism filtered out other unnecessary advice.
Knowledge-guided short-context action anticipation in human-centric videos
Sarthak Bhagat, Simon Stepputtis, Joseph Campbell, Katia Sycara. ICCV Workshop on AI for Creative Video Editing and Understanding, 2023.
This work focuses on anticipating long-term human actions, particularly using short video segments, which can speed up editing workflows through improved suggestions while fostering creativity by suggesting narratives. To this end, we imbue a transformer network with a symbolic knowledge graph for action anticipation in video segments by boosting certain aspects of the transformer’s attention mechanism at run-time. Demonstrated on two benchmark datasets, Breakfast and 50Salads, our approach outperforms current state-of-the-art methods for long-term action anticipation using short video context by up to 9%.
Sample-efficient learning of novel visual concepts
Sarthak Bhagat*, Simon Stepputtis*, Joseph Campbell, Katia Sycara. Conference on Lifelong Learning Agents, 2023.
Despite the advances made in visual object recognition, state-of-the-art deep learning models struggle to effectively recognize novel objects in a few-shot setting where only a limited number of examples are provided. Unlike humans who excel at such tasks, these models often fail to leverage known relationships between entities in order to draw conclusions about such objects. In this work, we show that incorporating a symbolic knowledge graph into a state-of-the-art recognition model enables a new approach for effective few-shot classification. In our proposed neuro-symbolic architecture and training methodology, the knowledge graph is augmented with additional relationships extracted from a small set of examples, improving its ability to recognize novel objects by considering the presence of interconnected entities. Unlike existing few-shot classifiers, we show that this enables our model to incorporate not only objects but also abstract concepts and affordances. The existence of the knowledge graph also makes this approach amenable to interpretability through analysis of the relationships contained within it. We empirically show that our approach outperforms current state-of-the-art few-shot multi-label classification methods on the COCO dataset and evaluate the addition of abstract concepts and affordances on the Visual Genome dataset.
Introspective action advising for interpretable transfer learning
Joseph Campbell, Yue Guo, Fiona Xie, Simon Stepputtis, Katia Sycara. Conference on Lifelong Learning Agents, 2023.
Transfer learning can be applied in deep reinforcement learning to accelerate the training of a policy in a target task by transferring knowledge from a policy learned in a related source task. This is commonly achieved by copying pretrained weights from the source policy to the target policy prior to training, under the constraint that they use the same model architecture. However, not only does this require a robust representation learned over a wide distribution of states – often failing to transfer between specialist models trained over single tasks – but it is largely uninterpretable and provides little indication of what knowledge is transferred. In this work, we propose an alternative approach to transfer learning between tasks based on action advising, in which a teacher trained in a source task actively guides a student’s exploration in a target task. Through introspection, the teacher is capable of identifying when advice is beneficial to the student and should be given, and when it is not. Our approach allows knowledge transfer between policies agnostic of the underlying representations, and we empirically show that this leads to improved convergence rates in Gridworld and Atari environments while providing insight into what knowledge is transferred.
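A hedged sketch of confidence-gated advising is shown below: the teacher only intervenes in states where its own estimate looks reliable and an advice budget remains. The Q-value-gap confidence measure and the tabular setting are illustrative assumptions, not necessarily the introspection criterion used in the paper.

```python
import numpy as np

def maybe_advise(teacher_q, state_index, budget, gap_threshold=0.5):
    """Return the teacher's action if it is confident in this state and advice
    budget remains, otherwise None (the student explores on its own).

    teacher_q: (num_states, num_actions) tabular Q-values of the teacher
    """
    if budget <= 0:
        return None, budget
    q = np.sort(teacher_q[state_index])
    best, second = q[-1], q[-2]
    if best - second >= gap_threshold:                       # teacher is confident here
        return int(np.argmax(teacher_q[state_index])), budget - 1
    return None, budget                                      # withhold advice

# toy usage: two states, three actions
q_table = np.array([[1.0, 0.2, 0.1], [0.5, 0.45, 0.4]])
action, budget = maybe_advise(q_table, 0, budget=10)
print(action, budget)   # 0 9 -> advice is given in the confident state
```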
Learning Modular Language-Conditioned Robot Policies Through Attention
Yifan Zhou, Shubham Sonawani, Mariano Phielipp, Heni Ben Amor, Simon Stepputtis. Autonomous Robots, 2023.
Training language-conditioned policies is typically time-consuming and resource-intensive. Additionally, the resulting controllers are tailored to the specific robot they were trained on, making it difficult to transfer them to other robots with different dynamics. To address these challenges, we propose a new approach called Hierarchical Modularity, which enables more efficient training and subsequent transfer of such policies across different types of robots. The approach incorporates Supervised Attention, which bridges the gap between modular and end-to-end learning by enabling the re-use of functional building blocks. In this contribution, we build upon our previous work, showcasing the extended utilities and improved performance by expanding the hierarchy to include new tasks and introducing an automated pipeline for synthesizing a large quantity of novel objects. We demonstrate the effectiveness of this approach through extensive simulated and real-world robot manipulation experiments.
Explainable Action Advising for Multi-Agent Reinforcement Learning
Yue Guo, Joseph Campbell, Simon Stepputtis, Ruiyu Li, Dana Hughes, Fei Fang, Katia Sycara. International Conference on Robotics and Automation, 2023.
Action advising is a knowledge transfer technique for reinforcement learning based on the teacher-student paradigm. An expert teacher provides advice to a student during training in order to improve the student’s sample efficiency and policy performance. Such advice is commonly given in the form of state-action pairs; however, it is difficult for the student to reason about such advice and apply it to novel states. We introduce Explainable Action Advising, in which the teacher provides action advice as well as associated explanations indicating why the action was chosen. This allows the student to self-reflect on what it has learned, enabling advice generalization and leading to improved sample efficiency and learning performance - even in environments where the teacher is sub-optimal. We empirically show that our framework is effective in both single-agent and multi-agent scenarios, yielding improved policy returns and convergence rates when compared to state-of-the-art methods.
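One way to picture advice paired with an explanation is to distill the teacher's behavior into a decision tree, so each advised action comes with the human-readable rule that produced it. The feature names, the toy rollout data, and the use of scikit-learn below are illustrative assumptions rather than the paper's exact explanation mechanism.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical teacher rollouts: state = (distance_to_goal, obstacle_ahead) -> action
states = np.array([[5.0, 0], [4.0, 0], [2.0, 1], [1.5, 1], [0.5, 0]])
actions = np.array([0, 0, 1, 1, 0])      # 0 = move forward, 1 = turn

tree = DecisionTreeClassifier(max_depth=2).fit(states, actions)

def advise_with_explanation(state):
    """Return the advised action together with the decision rules that produced it."""
    action = int(tree.predict([state])[0])
    rules = export_text(tree, feature_names=["distance_to_goal", "obstacle_ahead"])
    return action, rules

action, why = advise_with_explanation([2.5, 1])
print("advised action:", action)
print(why)   # the tree's rules double as an explanation the student can reuse
```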
Modularity through Attention: Efficient Training and Transfer of Language-Conditioned Policies for Robot Manipulation
Yifan Zhou, Shubham Sonawani, Mariano Phielipp, Simon Stepputtis, Heni Ben Amor. Conference on Robot Learning, 2022.
Language-conditioned policies allow robots to interpret and execute human instructions. Learning such policies requires a substantial investment with regards to time and compute resources. Still, the resulting controllers are highly device-specific and cannot easily be transferred to a robot with different morphology, capability, appearance, or dynamics. In this paper, we propose a sample-efficient approach for training language-conditioned manipulation policies that allows for rapid transfer across different types of robots. By introducing a novel method, namely Hierarchical Modularity, and adopting supervised attention across multiple sub-modules, we bridge the divide between modular and end-to-end learning and enable the reuse of functional building blocks. In both simulated and real-world robot manipulation experiments, we demonstrate that our method outperforms the current state-of-the-art methods and can transfer policies across different robots in a sample-efficient manner. Finally, we show that the functionality of learned sub-modules is maintained beyond the training process and can be used to introspect the robot decision-making process.
Concept Learning for Interpretable Multi-Agent Reinforcement Learning
Renos Zabounidis, Joseph Campbell, Simon Stepputtis, Dana Hughes, Katia P. Sycara. Conference on Robot Learning, 2022.
Multi-agent robotic systems are increasingly operating in real-world environments in close proximity to humans, yet are largely controlled by policy models with inscrutable deep neural network representations. We introduce a method for incorporating interpretable concepts from a domain expert into models trained through multi-agent reinforcement learning, by requiring the model to first predict such concepts and then utilize them for decision making. This allows an expert to both reason about the resulting concept policy models in terms of these high-level concepts at run-time, as well as intervene and correct mispredictions to improve performance. We show that this yields improved interpretability and training stability, with benefits to policy performance and sample efficiency in a simulated and real-world cooperative-competitive multi-agent game.
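A compact sketch of a concept-bottleneck policy is given below: the network first predicts human-interpretable concepts, which can be inspected or overridden by an expert, and only then maps them to actions. The concept count, observation size, and layer widths are invented for illustration.

```python
import torch
import torch.nn as nn

class ConceptBottleneckPolicy(nn.Module):
    def __init__(self, obs_dim=20, num_concepts=4, num_actions=5):
        super().__init__()
        self.concept_head = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                          nn.Linear(64, num_concepts))
        self.action_head = nn.Sequential(nn.Linear(num_concepts, 64), nn.ReLU(),
                                         nn.Linear(64, num_actions))

    def forward(self, obs, concept_override=None):
        concepts = torch.sigmoid(self.concept_head(obs))   # e.g. "teammate visible"
        if concept_override is not None:                    # expert intervention
            concepts = concept_override
        return self.action_head(concepts), concepts

policy = ConceptBottleneckPolicy()
obs = torch.randn(1, 20)
logits, concepts = policy(obs)
# an expert can correct a mispredicted concept and re-run only the action head
fixed = concepts.detach().clone()
fixed[0, 0] = 1.0
logits_fixed, _ = policy(obs, concept_override=fixed)
```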
A System for Imitation Learning of Contact-Rich Bimanual Manipulation Policies
Simon Stepputtis, Maryam Bandari, Stefan Schaal, Heni Ben Amor. International Conference on Intelligent Robots and Systems, 2022.
In this paper, we discuss a framework for teaching bimanual manipulation tasks by imitation. To this end, we present a system and algorithms for learning compliant and contact-rich robot behavior from human demonstrations. The presented system combines insights from admittance control and machine learning to extract control policies that can (a) recover from and adapt to a variety of disturbances in time and space, while also (b) effectively leveraging physical contact with the environment. We demonstrate the effectiveness of our approach using a real-world insertion task involving multiple simultaneous contacts between a manipulated object and insertion pegs. We also investigate efficient means of collecting training data for such bimanual settings. To this end, we conduct a human-subject study and analyze the effort and mental demand as reported by the users. Our experiments show that, while harder to provide, the additional force/torque information available in teleoperated demonstrations is crucial for phase estimation and task success. Ultimately, force/torque data substantially improves manipulation robustness, resulting in a 90% success rate in a multipoint insertion task.
Language-Conditioned Human-Agent Teaming
Simon Stepputtis. Robotics: Science and Systems Pioneers Workshop, 2022.
In my work, I am focusing on multimodal techniques for robot learning and motor skill acquisition. Most intuitively, humans communicate desires to a robot using natural language, stemming from their internal mental state that encodes their beliefs and intentions. Modeling this mental state is a crucial part of truly understanding the human partner, as it is not captured by natural language alone. To this end, my research focuses on creating robotic systems that truly understand the human partner by combining Theory of Mind, natural language processing, and vision to complete various manipulation tasks.
Language-Conditioned Imitation Learning for Robot Manipulation Tasks
Simon Stepputtis, Joseph Campbell, Mariano Phielipp, Stefan Lee, Chitta Baral, Heni Ben Amor. Conference on Neural Information Processing Systems, 2020.
Imitation learning is a popular approach for teaching motor skills to robots. However, most approaches focus on extracting policy parameters from execution traces alone (i.e., motion trajectories and perceptual data). No adequate communication channel exists between the human expert and the robot to describe critical aspects of the task, such as the properties of the target object or the intended shape of the motion. Motivated by insights into the human teaching process, we introduce a method for incorporating unstructured natural language into imitation learning. At training time, the expert can provide demonstrations along with verbal descriptions in order to describe the underlying intent (e.g., “go to the large green bowl”). The training process then interrelates these two modalities to encode the correlations between language, perception, and motion. The resulting language-conditioned visuomotor policies can be conditioned at runtime on new human commands and instructions, which allows for more fine-grained control over the trained policies while also reducing situational ambiguity. We demonstrate in a set of simulation experiments how our approach can learn language-conditioned manipulation policies for a seven-degree-of-freedom robot arm and compare the results to a variety of alternative methods.
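A bare-bones sketch of conditioning a visuomotor policy on language is shown below: image features, a language embedding, and the current joint state are fused to predict the next motion command. The encoders, dimensions, and output parameterization are placeholders rather than the architecture evaluated in the paper.

```python
import torch
import torch.nn as nn

class LanguageConditionedPolicy(nn.Module):
    """Maps (image features, language embedding, joint state) to the next joint command."""
    def __init__(self, img_dim=128, lang_dim=64, joint_dim=7):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + lang_dim + joint_dim, 256), nn.ReLU(),
            nn.Linear(256, joint_dim))

    def forward(self, img_feat, lang_emb, joints):
        x = torch.cat([img_feat, lang_emb, joints], dim=-1)
        return self.fuse(x)    # predicted joint displacement for the next step

policy = LanguageConditionedPolicy()
img_feat = torch.randn(1, 128)     # from any pretrained vision encoder
lang_emb = torch.randn(1, 64)      # embedding of "go to the large green bowl"
joints = torch.zeros(1, 7)         # current 7-DoF arm configuration
delta = policy(img_feat, lang_emb, joints)
```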
Imitation Learning of Robot Policies by Combining Language, Vision, and Demonstration
Simon Stepputtis, Joseph Campbell, Mariano Phielipp, Chitta Baral, Heni Ben Amor. NeurIPS Workshop on Robot Learning: Control and Interaction in the Real World, 2019.
In this work, we propose a novel end-to-end imitation learning approach that combines natural language, vision, and motion information to produce an abstract representation of a task, which in turn is used to synthesize specific motion controllers at run-time. This multimodal approach enables generalization to a wide variety of environmental conditions and allows an end-user to direct a robot policy through verbal communication. We empirically validate our approach with an extensive set of simulations and show that it achieves a high task success rate over a variety of conditions while remaining amenable to probabilistic interpretability.
Improved Exploration Through Latent Trajectory Optimization in Deep Deterministic Policy Gradient
Kevin Sebastian Luck, Mel Vecerik, Simon Stepputtis, Heni Ben Amor, Jonathan Scholz. International Conference on Intelligent Robots and Systems, 2019.
Model-free reinforcement learning algorithms such as Deep Deterministic Policy Gradient (DDPG) often require additional exploration strategies, especially if the actor is of deterministic nature. This work evaluates the use of model-based trajectory optimization methods used for exploration in Deep Deterministic Policy Gradient when trained on a latent image embedding. In addition, an extension of DDPG is derived using a value function as critic, making use of a learned deep dynamics model to compute the policy gradient. This approach leads to a symbiotic relationship between the deep reinforcement learning algorithm and the latent trajectory optimizer. The trajectory optimizer benefits from the critic learned by the RL algorithm and the latter from the enhanced exploration generated by the planner. The developed methods are evaluated on two continuous control tasks, one in simulation and one in the real world. In particular, a Baxter robot is trained to perform an insertion task, while only receiving sparse rewards and images as observations from the environment.
Learning Interactive Behaviors for Musculoskeletal Robots Using Bayesian Interaction Primitives
Joseph Campbell, Arne Hitzmann, Simon Stepputtis, Shuhei Ikemoto, Koh Hosoda, Heni Ben Amor. International Conference on Intelligent Robots and Systems, 2019.
Musculoskeletal robots that are based on pneumatic actuation have a variety of properties, such as compliance and back-drivability, that render them particularly appealing for human-robot collaboration. However, programming interactive and responsive behaviors for such systems is extremely challenging due to the nonlinearity and uncertainty inherent to their control. In this paper, we propose an approach for learning Bayesian Interaction Primitives for musculoskeletal robots given a limited set of example demonstrations. We show that this approach is capable of real-time state estimation and response generation for interaction with a robot for which no analytical model exists. Human-robot interaction experiments on a ‘handshake’ task show that the approach generalizes to new positions, interaction partners, and movement velocities.
Probabilistic Multimodal Modeling for Human-Robot Interaction Tasks
Joseph Campbell, Simon Stepputtis, Heni Ben Amor. Conference on Robot Learning, 2019.
Human-robot interaction benefits greatly from multimodal sensor inputs as they enable increased robustness and generalization accuracy. Despite this observation, few HRI methods are capable of efficiently performing inference for multimodal systems. In this work, we introduce a reformulation of Interaction Primitives which allows for learning from demonstration of interaction tasks, while also gracefully handling non-linearities inherent to multimodal inference in such scenarios. We also empirically show that our method results in more accurate, more robust, and faster inference than standard Interaction Primitives and other common methods in challenging HRI scenarios.
Neural Policy Translation for Robot Control
Simon Stepputtis, Chitta Baral, Heni Ben Amor. Southwest Robotics Symposium, 2019.
Teaching new skills to robots is usually a tedious process that requires expert knowledge and a substantial amount of time, depending on the complexity of the new task. Especially when used for imitation learning, rapid and intuitive ways of teaching novel tasks are needed. In this work, we outline Neural Policy Translation (NPT) – a novel approach that enables robots to directly learn a new skill by translating natural language and kinesthetic demonstrations into neural network policies.
Extrinsic Dexterity Through Active Slip Control Using Deep Predictive Models
Simon Stepputtis, Yezhou Yang, Heni Ben Amor. International Conference on Robotics and Automation, 2018.
We present a machine learning methodology for actively controlling slip, in order to increase robot dexterity. Leveraging recent insights in deep learning, we propose a Deep Predictive Model that uses tactile sensor information to reason about slip and its future influence on the manipulated object. The obtained information is then used to precisely manipulate objects within a robot end-effector using external perturbations imposed by gravity or acceleration. We show in a set of experiments that this approach can be used to increase a robot’s repertoire of motor skills.
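The sketch below gives a simplified picture of the predictive-model idea: a network predicts expected slip from a window of tactile readings and a candidate control, and the control with the lowest predicted slip is executed. The tactile dimensionality, horizon, and control parameterization are assumptions for illustration, not the model described in the paper.

```python
import torch
import torch.nn as nn

class SlipPredictor(nn.Module):
    """Predicts expected object slip from a window of tactile readings
    and a candidate control input."""
    def __init__(self, tactile_dim=16, horizon=5, control_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(tactile_dim * horizon + control_dim, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, tactile_window, control):
        x = torch.cat([tactile_window.flatten(1), control], dim=-1)
        return self.net(x)

def select_control(model, tactile_window, candidate_controls):
    """Pick the candidate control with the smallest predicted slip."""
    with torch.no_grad():
        slips = torch.cat([model(tactile_window, c.unsqueeze(0)) for c in candidate_controls])
    return candidate_controls[slips.argmin()]

model = SlipPredictor()
window = torch.randn(1, 5, 16)                        # last 5 tactile frames
candidates = torch.tensor([[0.1, 0.0], [0.3, 0.1], [0.5, 0.2]])
best = select_control(model, window, candidates)
```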
Towards Semantic Policies for Human-Robot Collaboration
Simon Stepputtis, Chitta Baral, Heni Ben Amor. Southwest Robotics Symposium, 2018.
As the application domain of robots moves closer to our daily lives, algorithms and methods are needed to ensure safe and meaningful human-machine interaction. Robots need to be able to understand human body movements, as well as the semantic meaning of these actions. To overcome this challenge, this research aims to create novel ways of teaching complex tasks to a robot by combining traditional learning-from-demonstration with natural language processing and semantic analysis.
Speech Enhanced Imitation Learning and Task Abstraction for Human-Robot Interaction
Simon Stepputtis, Chitta Baral, Heni Ben Amor. IROS Workshop on Synergies Between Learning and Interaction, 2017.
In this short paper, we show how to learn interaction primitives and networks from long interactions by taking advantage of language and speech markers. The speech markers are obtained from free speech that accompanies the demonstration. We perform experiments to show the value of using speech markers for learning interaction primitives.
Active Slip Control for In-Hand Object Manipulation using Deep Predictive Models
Simon Stepputtis, Heni Ben Amor. RSS Workshop on Tactile Sensing for Manipulation: Hardware, Modeling, and Learning, 2017.
We discuss a machine learning methodology for actively controlling slip, in order to increase robot dexterity. Leveraging recent insights in Deep Learning, we propose a Deep Predictive Model that uses tactile sensor information to reason about slip and its future influence on the manipulated object. We show in a set of experiments that this approach can be used to increase a robot’s repertoire of skills.
Deep Predictive Models for Active Slip Control
Simon Stepputtis, Heni Ben Amor. RSS Workshop on (Empirically) Data-Driven Robotic Manipulation, 2017.
We discuss a machine learning methodology for actively controlling slip, in order to increase robot dexterity. Leveraging recent insights in Deep Learning, we propose a Deep Predictive Model that uses tactile sensor information to reason about slip and its future influence on the manipulated object. We show in a set of experiments that this approach can be used to increase a robot’s repertoire of skills.
A System for Learning Continuous Human-Robot Interactions from Human-Human Demonstrations
David Vogt, Simon Stepputtis, Steve Grehl, Bernhard Jung, Heni Ben Amor. International Conference on Robotics and Automation, 2017.
We present a data-driven imitation learning system for learning human-robot interactions from human-human demonstrations. During training, the movements of two interaction partners are recorded through motion capture and an interaction model is learned. At runtime, the interaction model is used to continuously adapt the robot’s motion, both spatially and temporally, to the movements of the human interaction partner. We show the effectiveness of the approach on complex, sequential tasks by presenting two applications involving collaborative human-robot assembly. Experiments with varied object hand-over positions and task execution speeds confirm the capabilities for spatio-temporal adaptation of the demonstrated behavior to the current situation.
Learning Human-Robot Interactions from Human-Human Demonstrations (With Applications in Lego Rocket Assembly)
David Vogt, Simon Stepputtis, Richard Weinhold, Bernhard Jung, Heni Ben Amor. International Conference on Humanoid Robots, 2016.
This video demonstrates a novel imitation learning approach for learning human-robot interactions from human-human demonstrations. During training, the movements of two human interaction partners are recorded via motion capture. From this, an interaction model is learned that inherently captures important spatial relationships as well as temporal synchrony of body movements between the two interacting partners. The interaction model is based on interaction meshes that were first adopted by the computer graphics community for the offline animation of interacting virtual characters. We developed a variant of interaction meshes that is suitable for real-time human-robot interaction scenarios. During human-robot collaboration, the learned interaction model allows for adequate spatio-temporal adaptation of the robot’s behavior to the movements of the human cooperation partner. Thus, the presented approach is well suited for collaborative tasks requiring continuous body movement coordination of a human and a robot. The feasibility of the approach is demonstrated with the example of a cooperative Lego rocket assembly task.
One-Shot Learning of Human–Robot Handovers with Triadic Interaction Meshes
David Vogt, Simon Stepputtis, Bernhard Jung, Heni Ben Amor. Autonomous Robots, 2018.
We propose an imitation learning methodology that allows robots to seamlessly retrieve and pass objects to and from human users. Instead of hand-coding interaction parameters, we extract relevant information such as joint correlations and spatial relationships from a single task demonstration of two humans. At the center of our approach is an interaction model that enables a robot to generalize an observed demonstration spatially and temporally to new situations. To this end, we propose a data-driven method for generating interaction meshes that link both interaction partners to the manipulated object. The feasibility of the approach is evaluated in a within-subject user study, which shows that human–human task demonstration can lead to more natural and intuitive interactions with the robot.