Using the RoboTurk platform, we collected the largest teleoperated robotic manipulation dataset to date using only 3 Sawyer robot arms. Compared to previous works, we collect 8x-100x the amount of robot hours, with episodes lasting 3x-40x longer, reflecting both the diversity of the demonstrations and the difficulty of the tasks.
* indicates values extrapolated from the information reported for each dataset
The chosen tasks are difficult and lend themselves to a highly diverse dataset, since each task admits many different approaches to solving it. Below are random samples from the dataset for each task.
The RoboTurk system required several extensions to work successfully and scalably on real robots. The system is set up in much the same way as it was in simulation, with a few additions to ensure safety and reliability. We summarize the extensions below.
We use the RoboTurk platform to collect a large dataset on three difficult manipulation tasks that involve planning, vision, and dexterity. All three tasks require high-level reasoning (the "what") on the part of the demonstrator and low-level dexterity (the "how"), making human demonstrations necessary.
The "laundry layout" task
In the "laundry layout" task, users control the robot to efficiently flatten a cloth or other fabric object. This task requires reasoning about how best to flatten the object, along with dexterous manipulation behaviors like pushing, pulling, and pick-and-place.
The "tower creation" task
In the "tower creation" task, users control the robot to stack common kitchen items into the tallest tower possible. This task requires reasoning about how to construct a tower that is both tall and stable, along with dexterous manipulation behaviors like stacking.
The "object search" task
In the "object search" task, users control the robot to search a bin for three objects of the same class without discarding distractor objects. This task requires reasoning to search effectively and identify target objects, along with dexterous manipulation behaviors like picking and fitting.
The collected dataset contains information from a variety of sensors and data streams recorded in parallel during each demonstration.
Because the chosen tasks are unstructured, the collected demonstrations exhibit multimodality: starting from similar initial configurations of the objects in the environment, demonstrators take many different approaches to solve the same task and reach a similar end state.
Using the same collection of objects, there are many different towers that can be constructed. Common behaviors include flipping large bowls over, stacking large objects on top of small ones, and combining cups together to create a stable base.
Using the same initial folding configuration, there are several ways to unfold the towel. Common behaviors include using the side of the table to assist, pushing the cloth on the table, and grasping from the corner.
Qualitatively, the two examples above highlight both the variety of strategies used to achieve success and the differences in how well those strategies perform.
As users interact with the system longer, the quality of their demonstrations generally increases. Users learn to complete the object search and laundry layout tasks more efficiently, as shown by decreased time to completion. Total effort, measured as translational movement at the end effector, does not change significantly with experience, suggesting that users translate the end effector at a roughly constant rate. However, users do become more dexterous in their manipulation, increasingly exploiting changes in orientation as they gain experience.
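The effort measures above can be sketched as follows. This is a minimal illustration, not the paper's exact metric code; the pose-array inputs and the particular distance definitions (Euclidean path length for translation, summed quaternion angles for orientation) are our assumptions:

```python
import numpy as np

def effort_metrics(positions, quaternions):
    """Total translational movement and total orientation change
    along a hypothetical end-effector trajectory.

    positions:   (T, 3) array of end-effector positions (meters)
    quaternions: (T, 4) array of unit quaternions (w, x, y, z)
    """
    # Translational effort: sum of distances between consecutive positions.
    translation = np.linalg.norm(np.diff(positions, axis=0), axis=1).sum()

    # Rotational effort: sum of angles between consecutive orientations.
    # The angle between unit quaternions q1, q2 is 2 * arccos(|<q1, q2>|).
    dots = np.abs(np.sum(quaternions[:-1] * quaternions[1:], axis=1))
    rotation = (2.0 * np.arccos(np.clip(dots, -1.0, 1.0))).sum()
    return translation, rotation
```

Under these definitions, a user whose translation total stays flat across sessions while the rotation total grows would match the trend described above.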
A majority (60.8%) of users felt comfortable using the system within 15 minutes, and 96% within an hour. Users were also asked to self-evaluate their performance on each task using the NASA TLX survey, which measures perceived task difficulty (a higher total corresponds to greater difficulty). Note that * indicates lower is better.
To illustrate the utility of our dataset, we examine whether a reward signal can be inferred from the collected demonstrations. The results below suggest that the dataset is useful for imitation learning.
We show a t-SNE visualization of learned image embeddings on the laundry layout task over the course of several trajectories. The trajectories begin in a similar position (labeled 1 in the figures) and then follow different paths through the embedding space, demonstrating the multimodality of the dataset.
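A visualization of this kind can be sketched as below. The embedding array, trajectory labels, and random placeholder data are all assumptions for illustration; in practice the embeddings would come from the learned image encoder:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: `embeddings` is an (N, D) array of per-frame image
# embeddings pooled across trajectories, and `traj_ids` marks which
# trajectory each frame came from. Random data stands in here.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 64)).astype(np.float32)
traj_ids = np.repeat(np.arange(3), 100)

# Project the high-dimensional embeddings to 2D for visualization.
points = TSNE(n_components=2, perplexity=30.0,
              random_state=0).fit_transform(embeddings)

# Draw each trajectory as a connected path through the embedding space,
# so diverging paths from similar starts are visible.
for tid in np.unique(traj_ids):
    mask = traj_ids == tid
    plt.plot(points[mask, 0], points[mask, 1], label=f"trajectory {tid}")
plt.legend()
plt.savefig("tsne_trajectories.png")
```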
We show that measuring the cosine distance between each image embedding along a trajectory and the embedding of the trajectory's final image produces a meaningful reward signal: the reward clearly increases as the tower nears completion.
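This reward can be sketched as follows, a minimal version assuming per-frame embeddings are already available; here we score each frame by cosine similarity to the final (goal) frame, which rises as the goal is approached:

```python
import numpy as np

def embedding_reward(traj_embeddings):
    """Per-frame reward: cosine similarity between each frame's image
    embedding and the final frame's embedding (the goal image).

    traj_embeddings: (T, D) array of per-frame embeddings.
    Returns a (T,) array in [-1, 1]; 1 means identical direction to the goal.
    """
    normed = traj_embeddings / np.linalg.norm(
        traj_embeddings, axis=1, keepdims=True)
    goal = normed[-1]
    return normed @ goal
```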