MIND MELD
A machine learning framework for robots to learn from suboptimal human demonstrators
Motivation and Background
Let’s envision a future where humans and robots work together in the real world, for example, assistive robots in the home. This is a difficult problem because the real world can be unstructured and unpredictable (e.g., the floor of my house is covered in dog toys that my robot vacuum regularly gets stuck on). Ideally, robots would be pre-programmed to know how to act in any situation; however, it is impossible for programmers to predict every situation that a robot may encounter. Additionally, people have different preferences and needs. One person might want their cups put away in the cabinet right-side-up, whereas someone else may prefer cups to be upside-down. Therefore, the end user should be able to communicate their needs and preferences to a robot without needing programming experience.
One method that enables non-expert users to teach robots new skills is Learning from Demonstration (LfD). With LfD, the person shows the robot how they want the task done, and the robot learns from the person’s demonstration. There are two main types of LfD: Human-centric and Robot-centric LfD. Human-centric LfD is the easiest for the human because the person performs the task while the robot observes. For example, to teach an autonomous vehicle the correct way to drive, the person would simply drive the car. However, the robot only sees correct examples of how to do the task and does not learn how to recover from mistakes. To solve this problem, in Robot-centric LfD, the robot attempts the task and the human provides corrective feedback. In the autonomous vehicle example, the vehicle drives while the person moves the steering wheel to teach the robot what it should be doing instead. If the robot gets off course, the person can show the robot how to get back on track.

The problem with Robot-centric LfD is that people tend to provide suboptimal feedback, due to fatigue, not understanding how the robot learns, and so on. Incorrect feedback makes it harder for the robot to learn effectively. People also provide heterogeneous feedback, meaning that different people have different teaching strategies. In the autonomous vehicle example, when the robot is making a wrong turn, some demonstrators may turn the wheel all the way in the other direction (over-correcting), whereas others may only turn the wheel a small amount (under-correcting). In this project, we take advantage of the benefits of Robot-centric LfD and mitigate its problems by creating a method that can learn from suboptimal and heterogeneous demonstrators.
MIND MELD
Our method is called Mutual Information-driven Meta-learning from Demonstration (MIND MELD) (Schrum et al., 2022).
Overview
MIND MELD works by first using a person’s corrective demonstrations to learn how they are teaching suboptimally. In the driving domain, the person provides demonstrations by moving the steering wheel. In this case, the ways they can be suboptimal are over/under-correcting (moving the steering wheel too much or too little) or being anticipatory/delayed (moving the wheel too early or too late). MIND MELD learns a continuous embedding that describes the person’s suboptimal behavior. Then, MIND MELD utilizes this embedding to shift the person’s corrective demonstrations closer to an optimal demonstration.
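To make the idea concrete, below is a minimal sketch of the correction step with hypothetical function and variable names (the actual networks and training procedure are described later and in the paper): the learned shift is subtracted from the person’s raw corrective label before the label is used for policy learning.

```python
# Minimal sketch of the MIND MELD correction step (hypothetical names).
# raw_label:     the person's corrective label a_t (e.g., a steering command)
# embedding:     the person's learned suboptimality embedding w^(p)
# predict_shift: a learned model estimating how far a_t is from optimal

def correct_label(raw_label, embedding, predict_shift):
    """Shift a suboptimal corrective label closer to the optimal label."""
    shift = predict_shift(raw_label, embedding)  # estimated difference d_t = a_t - o_t
    return raw_label - shift                     # corrected label passed to the learner
```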
Optimal Labels and Calibration Tasks
How do we define optimal? In this case, an optimal corrective label steers the car in the shortest path to the goal, while avoiding obstacles.
In order to learn a person’s suboptimal teaching style, we need to compare their corrective labels to optimal labels. Therefore, we calculate optimal labels for a small set of calibration tasks. These calibration tasks are a simple representative subset of tasks. In the driving domain, examples of calibration tasks are the car turning left when it should have gone right or going straight and hitting an obstacle when it should have turned left around the obstacle.
With a more complicated robot task, we envision the person needing to teach the robot a few simple tasks with a block (e.g., pick and place, pushing the block, etc.) to calibrate the robot, meaning the robot learns the person’s suboptimal embedding. Then the person could move on to teaching the robot house cleaning tasks, and the robot could correct for the person’s suboptimal teaching style.
Note: We only need ground truth labels for the calibration tasks. We do not know optimal labels for test tasks.
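As a small illustration (the data layout here is an assumption, not the study code), the calibration tasks provide exactly the supervision the rest of the method needs: the ground-truth difference between each of the person’s labels and the corresponding optimal label.

```python
import numpy as np

def calibration_differences(person_labels: np.ndarray,
                            optimal_labels: np.ndarray) -> np.ndarray:
    """Ground-truth differences d_t = a_t - o_t, available only for calibration tasks.

    person_labels:  corrective labels a_t from demonstrator p
    optimal_labels: precomputed optimal labels o_t for the same calibration tasks
    """
    return person_labels - optimal_labels
```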
Network Architecture
The MIND MELD network architecture is shown in the figure below.
The MIND MELD architecture consists of three parts: 1) the bi-directional Long Short-Term Memory (LSTM) encoder, \(\mathcal{E}_{\phi'}\), 2) the prediction subnetwork, \(f_\theta\), and 3) the recreation subnetwork, \(q_\phi\). The goal of the prediction subnetwork, \(f_\theta\), is to use demonstrator \(p\)’s corrective feedback, \(a_t^{(p)}\), and the demonstrator’s suboptimal embedding, \(w^{(p)}\), to learn the difference between the demonstrator’s feedback and the optimal feedback, \(\hat{d}_{t}^{(p)}\). Additionally, the goal of the recreation subnetwork, \(q_\phi\), is to learn the personalized embedding, \(w^{(p)}\), from the demonstrator’s labels and how far those labels are from optimal.
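Below is a condensed PyTorch sketch of the three subnetworks. The layer sizes, embedding dimension, and module layout are placeholder assumptions for illustration only; the exact architecture is in the paper.

```python
import torch
import torch.nn as nn

class MindMeldSketch(nn.Module):
    """Illustrative three-part MIND MELD architecture (sizes are placeholders)."""

    def __init__(self, label_dim=1, hidden_dim=32, embed_dim=2):
        super().__init__()
        # 1) Bi-directional LSTM encoder E_phi' over a window of corrective labels
        self.encoder = nn.LSTM(label_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # 2) Prediction subnetwork f_theta: (z, w) -> estimated difference d_hat
        self.predict = nn.Sequential(
            nn.Linear(2 * hidden_dim + embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, label_dim))
        # 3) Recreation subnetwork q_phi: (z, d_hat) -> estimated embedding w_hat
        self.recreate = nn.Sequential(
            nn.Linear(2 * hidden_dim + label_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim))

    def forward(self, label_window, w):
        # label_window: (batch, 2*dt + 1, label_dim) slice of labels around time t
        # w:            (batch, embed_dim) personalized embedding w^(p)
        out, _ = self.encoder(label_window)
        z = out[:, -1, :]                                # encoding z of the window
        d_hat = self.predict(torch.cat([z, w], dim=-1))  # how far a_t is from optimal
        w_hat = self.recreate(torch.cat([z, d_hat], dim=-1))
        return d_hat, w_hat
```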
Training Phases:
- Calibration Phase: A set of users complete the calibration tasks, where we know the ground truth labels. We collect a dataset to learn the MIND MELD parameters, \(\theta, \phi, \phi'\). The parameters are then frozen for the testing phase.
- Testing Phase: A new user completes the calibration tasks to determine their personalized embedding, \(w^{(p)}\). Using \(w^{(p)}\), we calculate how much their demonstration should be shifted, \(\hat{d}_{t}^{(p)}\), on a new task (where optimal is unknown).
Bi-directional LSTM, \(\mathcal{E}_{\phi'}\): The input, \(a_t^{(p)}\), is demonstrator \(p\)’s corrective label at time \(t\). The \(\Delta t\) labels before and after time \(t\) are passed into the bi-directional LSTM encoder. We use a bi-directional LSTM because the labels are not independent through time, meaning that the labels before and after the current label can provide information about the person’s teaching style. The bi-directional LSTM encodes the time-series input into an encoding, \(z^{(p)}_{t-\Delta t: t+\Delta t}\).
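As a small illustration of this windowed input (the half-width \(\Delta t\), the scalar 1-D label sequence, and the zero-padding at the boundaries are all assumptions):

```python
import numpy as np

def label_window(labels: np.ndarray, t: int, dt: int) -> np.ndarray:
    """Labels from time t - dt to t + dt (inclusive), assuming a 1-D label sequence."""
    padded = np.pad(labels, (dt, dt))   # zero-pad so every time step has a full window
    return padded[t : t + 2 * dt + 1]   # window of length 2*dt + 1 centered on t
```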
Prediction Subnetwork, \(f_\theta\): The encoded demonstrator feedback, \(z^{(p)}_{t-\Delta t: t+\Delta t}\), and the personalized embedding describing the person’s suboptimal teaching style, \(w^{(p)}\), are fed through the prediction subnetwork, \(f_\theta\). The prediction subnetwork then outputs \(\hat{d}_{t}^{(p)}\), the amount by which the person’s corrective demonstration should be shifted to be closer to optimal. We train this subnetwork using the calibration tasks, where we have access to optimal labels, \(o_t\). We utilize a mean-squared error (MSE) loss between the predicted difference, \(\hat{d}_{t}^{(p)}\), and the true difference, \(d_{t}^{(p)}\), where \(d_{t}^{(p)} = a_t^{(p)} - o_t\).
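On the calibration tasks this is a standard regression loss; a short sketch (using the illustrative module above) is:

```python
import torch.nn.functional as F

def prediction_loss(d_hat, person_labels, optimal_labels):
    """MSE between the predicted shift d_hat and the true difference d = a - o."""
    return F.mse_loss(d_hat, person_labels - optimal_labels)
```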
The recreation subnetwork, \(q_\phi\), learns an estimate of the personalized embedding, \(\hat{w}^{(p)}\); the embedding \(w^{(p)}\) is also an input to the architecture. We use the calibration tasks to estimate a new user’s embedding. When determining a user’s embedding, we initialize \(w^{(p)}\) to the average embedding of all previous users, and this input is updated through backpropagation during each round of training. Once the user has completed the calibration tasks and moves on to teaching the robot a new task, we use the estimated embedding, \(\hat{w}^{(p)}\), as the input.
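A sketch of this embedding-inference step is below. The optimizer, learning rate, and iteration count are assumptions, and the sketch optimizes \(w^{(p)}\) against the prediction loss on the calibration data; the full objective is in the paper. Only the input embedding is passed to the optimizer, so the frozen network weights are untouched.

```python
import torch
import torch.nn.functional as F

def infer_embedding(model, calib_windows, calib_labels, calib_optimal,
                    w_init, steps=100, lr=1e-2):
    """Estimate a new user's embedding w^(p) by backpropagating into the input embedding."""
    w = w_init.detach().clone().requires_grad_(True)  # start from the average embedding
    opt = torch.optim.Adam([w], lr=lr)                # only w is optimized; weights stay frozen
    for _ in range(steps):
        d_hat, _ = model(calib_windows, w.expand(calib_windows.shape[0], -1))
        loss = F.mse_loss(d_hat, calib_labels - calib_optimal)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```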
Recreation Subnetwork, \(q_\phi\): The recreation subnetwork, \(q_\phi\), takes the encoded demonstrator feedback, \(z^{(p)}_{t-\Delta t: t+\Delta t}\), and the amount by which the person is suboptimal, \(\hat{d}_{t}^{(p)}\), and estimates the embedding that describes the person’s suboptimal teaching style, \(\hat{w}^{(p)}\). To estimate \(w^{(p)}\), we utilize variational inference by maximizing the mutual information between \(\hat{d}_{t}^{(p)}\), \(w^{(p)}\), and \(z^{(p)}_{t-\Delta t: t+\Delta t}\). We train this subnetwork using an MSE loss between the estimated embedding, \(\hat{w}^{(p)}\), and the previous guess, \(w^{(p)}\). The intuition here is that the more we learn about how to correct the person’s feedback (i.e., as our estimate of \(\hat{d}_{t}^{(p)}\) becomes more accurate), the more certain we can be of our estimate of \(w^{(p)}\). Please see the paper for more details and equations.
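Putting the pieces together, one calibration-phase training step might look like the sketch below. Reducing the mutual-information objective to a simple embedding-recreation MSE, and the relative weighting beta, are simplifying assumptions here; the exact variational objective is given in the paper.

```python
import torch.nn.functional as F

def calibration_training_step(model, windows, person_labels, optimal_labels, w, beta=1.0):
    """One training step: prediction loss plus embedding-recreation loss."""
    d_hat, w_hat = model(windows, w)
    loss_pred = F.mse_loss(d_hat, person_labels - optimal_labels)  # d_t = a_t - o_t
    loss_rec = F.mse_loss(w_hat, w.detach())                       # recreate the current embedding
    return loss_pred + beta * loss_rec
```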
Experiment and Selected Results
To evaluate MIND MELD, we conduct a human-subjects experiment. In the study, the robot is an autonomous vehicle in a driving simulator, and participants teach the car to drive to a goal location without hitting any obstacles.
For the Calibration Phase, where we train the MIND MELD architecture parameters, we recruited 76 participants. In this phase, participants completed the calibration tasks, where we have the optimal labels.
To evaluate MIND MELD, we conduct a Testing Phase. In the testing phase, participants first completed the calibration tasks so we could learn their personalized embedding, \(w^{(p)}\). Then participants taught the car to drive on a new, more complicated task (with more obstacles and more turns needed to reach the goal). Participants taught the car on the new task using three different algorithms (in a random order): MIND MELD, a robot-centric baseline (DAgger), and a human-centric baseline (Behavioral Cloning (BC)). For each algorithm, participants taught the car for twelve trials. We recruited 42 participants for this phase.
Metrics
To compare the accuracy of MIND MELD to the baselines, we measure the average distance to the goal and the probability of reaching the goal for each trial. Participants also complete surveys measuring the perceived workload of teaching with each algorithm, as well as the likeability of, perceived intelligence of, and trust in each algorithm.
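As a rough illustration of how the objective metrics could be computed from logged trials (the data layout is an assumption; the subjective measures come from questionnaires scored separately):

```python
import numpy as np

def objective_metrics(final_distances, reached_goal, k=4):
    """final_distances: distance to the goal at the end of each trial (one entry per trial)
    reached_goal:   boolean array, whether the car reached the goal on each trial
    Returns a participant's average distance to the goal and an indicator of reaching
    the goal at least k times (averaged over participants, this gives the probability
    of reaching the goal k times)."""
    avg_distance = float(np.mean(final_distances))
    reached_k_times = bool(np.sum(reached_goal) >= k)
    return avg_distance, reached_k_times
```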
Overall, we found that MIND MELD outperformed both baselines in terms of accuracy and participant perceptions!
Average Distance from Goal
MIND MELD gets significantly closer to the goal compared to both baselines.
Probability of Reaching the Goal
In these plots, we compare the probability of each algorithm reaching the goal once (top left), twice (top right), three times (bottom left), and four times (bottom right) for each trial. Overall, MIND MELD has a significantly higher probability of reaching the goal compared to baselines. In the bottom right plot, MIND MELD has a 100% chance of reaching the goal four times during the trials, while both baselines have less than a 50% chance of achieving the same performance.
Subjective Metrics
Participants rated MIND MELD as requiring significantly less work to teach compared to the baselines. Participants also perceived MIND MELD as more likeable and intelligent, and trusted MIND MELD more than the baselines.