Scaling Robot Supervision to Hundreds of Hours with RoboTurk

Robotic Manipulation Dataset through Human Reasoning and Dexterity

We apply the RoboTurk platform to real robots and show that data collection can scale to hundreds of hours using only a few real robots. Previous large-scale robotic manipulation datasets have had a low signal-to-noise ratio because they were collected through self-supervised methods. We collect the largest dataset for robotic manipulation through remote teleoperation. Over the course of 1 week, with 54 operators, we collected 111 hours of robotic manipulation data on 3 challenging manipulation tasks that require dexterous control and human planning.

The RoboTurk Real Robot Dataset: By the Numbers





111 Hours of Demonstrations

1 Week To Collect

54 Unique Users

Total Demonstrations

Using the RoboTurk platform, we were able to collect the largest dataset for robotic manipulation via teleoperation to date using only 3 Sawyer robot arms. Compared to previous works, we collect 8x to 100x more robot hours, with episodes lasting 3x to 40x longer, which reflects both the diversity of the demonstrations and the difficulty of the tasks.

Comparison To Similar Robot Datasets Collected Using Human Supervision




Table columns: Avg Task Length (sec), Number of Demos, Total Time (hours)

Table entries: Pick, Grasp, Align; Human Demos; Pick, Place, Push; Pick, Place, Push, Pour; iPhone AR; Long Horizon Object Manip




* indicates extrapolated values from the information reported by the dataset

Dataset Diversity

The tasks that were chosen are difficult and lend themselves to producing a highly diverse dataset that admits many different approaches to solving the task. Below are random samples from the dataset for each task.

Extensions to the RoboTurk platform

The RoboTurk system required extensions to be able to work successfully and scalably on real robots. The system is set up in much the same way as it was in simulation with a few extensions to ensure safety and reliability. We summarize the extensions below.

  1. Ensuring the safety of the robots and human supervisors by limiting the speed of user movements and placing limits on the motions of the robots.
  2. Enforcing a queue structure to ensure only one user can use each robot during a given time frame.
  3. Collecting and storing a large amount of data in real time.
  4. Accounting for the additional latency inherent in controlling real systems.
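
The first two extensions can be illustrated with a minimal sketch. This is not the RoboTurk implementation: the step cap `MAX_STEP_M`, the `WORKSPACE` bounds, and the names `limit_motion` and `RobotQueue` are all hypothetical, chosen only to show one way of limiting commanded motion and granting a robot to one user at a time.

```python
import math
from collections import deque

MAX_STEP_M = 0.02  # hypothetical cap on translation per command, in meters
# Hypothetical workspace bounds (min, max) for x, y, z, in meters.
WORKSPACE = ((0.3, 0.8), (-0.4, 0.4), (0.05, 0.6))

def limit_motion(current, target):
    """Return a safe next position: cap the step length, then clamp to the workspace."""
    delta = [t - c for c, t in zip(current, target)]
    dist = math.sqrt(sum(d * d for d in delta))
    scale = min(1.0, MAX_STEP_M / dist) if dist > 0 else 0.0
    stepped = [c + d * scale for c, d in zip(current, delta)]
    return tuple(min(max(p, lo), hi) for p, (lo, hi) in zip(stepped, WORKSPACE))

class RobotQueue:
    """FIFO queue that grants a single robot to one user at a time."""

    def __init__(self):
        self.waiting = deque()
        self.active = None

    def join(self, user):
        # New users wait in arrival order; the robot is granted if it is free.
        self.waiting.append(user)
        self._advance()

    def release(self, user):
        # Only the active user can release the robot; the next user then takes over.
        if self.active == user:
            self.active = None
            self._advance()

    def _advance(self):
        if self.active is None and self.waiting:
            self.active = self.waiting.popleft()
```

With these assumed limits, a request to move 10 cm in one command is shortened to a 2 cm step, and a second user who joins the queue only becomes active after the first user releases the robot.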

Task Design for the RoboTurk Real Robot Dataset

We use the RoboTurk platform to collect a large dataset on three difficult manipulation tasks that involve planning, vision, and dexterity. All three tasks require high-level reasoning (the "what") on the part of the demonstrator as well as low-level dexterity (the "how"), which makes human demonstrations necessary.

The "laundry layout" task

In the "laundry layout" task, users use the robot to efficiently flatten a cloth or other fabric object. This task requires reasoning to identify how best to flatten the object as well as dexterous manipulation behaviors like pushing, pulling, and pick-and-place.

The "tower creation" task

In the "tower creation" task, users use the robot to stack common kitchen items into the tallest possible tower. This task requires reasoning to construct a tower that is both tall and stable as well as dexterous manipulation behaviors like stacking.

The "object search" task

In the "object search" task, users are to search in a bin for three objects of the same class without discarding distractor objects. This task requires reasoning to search effectively and identify target objects as well as dexterous manipulation behaviors like picking and fitting.

Collecting the Data

The dataset collected contains information from a variety of different sensors and data streams. In particular, we collect the following streams:

  • Low level robot state (joint-level information)
  • User control stream - phone poses and controls received from the operator
  • Overhead Kinect RGBD images
  • Front-facing RGB camera (the same view that was provided to users)
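
As a rough illustration of how the four streams listed above could be grouped per time step, here is a hypothetical record type. The class name, field names, and file-based image references are assumptions made for this sketch; the actual storage format of the dataset is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class DemoFrame:
    """One time step of a demonstration; all field names are hypothetical."""
    timestamp: float                 # seconds since the start of the demonstration
    joint_positions: List[float]     # low-level robot state
    phone_pose: Tuple[float, ...]    # user control stream: pose of the operator's phone
    phone_controls: Dict[str, bool]  # user control stream: other controls from the operator
    kinect_rgbd_file: str            # overhead Kinect RGBD image for this step
    front_rgb_file: str              # front-facing RGB image (the view shown to users)

def frames_in_window(frames, start, end):
    """Return the frames whose timestamp lies in [start, end)."""
    return [f for f in frames if start <= f.timestamp < end]
```

Keeping a shared timestamp on every record is one simple way to align streams that arrive at different rates.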


As a result of the unstructured nature of the tasks chosen, the demonstrations that were collected exhibit multimodality. From similar initial configurations of the objects in the environment, there are many different approaches to solve the same task and reach a similar end state.

Using the same collection of objects, there are many different towers that can be constructed. Common behaviors include flipping large bowls over, stacking large objects on top of small ones, and combining cups together to create a stable base.

Using the same initial folding configuration, there are several ways to unfold the towel. Common behaviors include using the side of the table to assist, pushing the cloth on the table, and grasping from the corner.

Qualitatively, the two examples above highlight the different strategies to achieve success and the differences in performance of these strategies.

User Improvement

As users interact with the system longer, the quality of their demonstrations generally increases. Users learn to more efficiently complete the object search and laundry layout tasks, which is exemplified by decreased time to completion. Total amount of effort (as measured by amount of translational movement at the end effector) does not change significantly as experience increases. This shows that users translate the end effector at the same rate. However, users learn to become more dexterous with their manipulation by increasingly utilizing changes in orientation as experience increases.
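
The two effort measures discussed above can be sketched as follows. This is an illustrative computation under stated assumptions, not the analysis code used for the dataset: end-effector positions are assumed to be (x, y, z) tuples, orientations are assumed to be unit quaternions in (w, x, y, z) order, and the function names are hypothetical.

```python
import math

def translation_effort(positions):
    """Total path length: sum of distances between consecutive end-effector positions."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))

def rotation_effort(quaternions):
    """Total orientation change in radians between consecutive unit quaternions."""
    total = 0.0
    for q1, q2 in zip(quaternions, quaternions[1:]):
        # |dot| treats q and -q as the same rotation; min() guards against rounding above 1.
        dot = abs(sum(a * b for a, b in zip(q1, q2)))
        total += 2.0 * math.acos(min(1.0, dot))
    return total
```

Under this sketch, a user whose translation effort stays flat while rotation effort grows across sessions would match the pattern described above.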

Figure panels: Object Search, Laundry Layout

Task Difficulty Analysis


Object Search    Tower Stacking    Laundry Layout
53.9 ± 11.2      76.9 ± 12.2       51.5 ± 12.8
12.2 ± 3.7       13.2 ± 4.4        9.3 ± 5.3
11.1 ± 4.0       11.8 ± 4.7        9.6 ± 4.6
6.0 ± 4.6        12.2 ± 6.2        7.9 ± 4.9
6.3 ± 4.8        12.4 ± 5.1        7.1 ± 6.1
10.8 ± 4.2       14.2 ± 3.9        10.3 ± 5.1
7.5 ± 5.0        13.1 ± 5.2        7.3 ± 5.4

A majority (60.8%) of users felt comfortable using the system within 15 minutes, and 96% felt comfortable within an hour. Users were asked to self-evaluate their performance on each task using the NASA TLX survey, which gives a measure of task difficulty (a higher total corresponds to greater difficulty). Note that * indicates lower is better.
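
The row labels of the ratings are not given in the text, so the following check rests on an assumption: that the first row of values is the total score and the remaining six rows are the individual survey ratings. Under that assumption, summing the six rows per task reproduces the first row, as this short sketch shows.

```python
# Mean ratings as listed above, in the order:
# Object Search, Tower Stacking, Laundry Layout.
reported_total = [53.9, 76.9, 51.5]  # assumed to be the "total" row
component_rows = [                   # assumed to be the six individual ratings
    [12.2, 13.2, 9.3],
    [11.1, 11.8, 9.6],
    [6.0, 12.2, 7.9],
    [6.3, 12.4, 7.1],
    [10.8, 14.2, 10.3],
    [7.5, 13.1, 7.3],
]

# Sum each task's column and round to one decimal, matching the reported precision.
computed_total = [round(sum(column), 1) for column in zip(*component_rows)]
```

Here `computed_total` matches `reported_total` for all three tasks, and by this measure Tower Stacking is rated the most difficult.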

Inferring A Reward Signal From Demonstrations

To illustrate the utility of our dataset, we examine the possibility of inferring a reward signal from the collected demonstrations. These methods show that the dataset is useful for imitation learning.

We show a t-SNE visualization of learned embeddings of the images (on the laundry layout task) over the course of a trajectory. The trajectories begin in a similar position (labeled 1 in the figures) and proceed along different paths in the embedding space, which demonstrates the multimodality of the dataset.

We show that measuring the cosine distance between each image embedding along the trajectory and the embedding of the final image of the trajectory produces a meaningful reward signal. The reward clearly increases as the tower gets closer to completion.
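
A minimal sketch of this kind of reward follows, assuming the per-frame embeddings have already been computed by some learned model (not shown). The function names are hypothetical, and the choice of negative cosine distance as the reward (so that it rises toward 0 at the final frame) is an assumption rather than the exact formulation used for the figures.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def rewards_from_embeddings(embeddings):
    """Reward per frame: negative cosine distance to the final frame's embedding."""
    goal = embeddings[-1]
    # Cosine distance is 1 - similarity, so its negative is similarity - 1.
    return [cosine_similarity(e, goal) - 1.0 for e in embeddings]
```

For example, `rewards_from_embeddings([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])` gives rewards that rise monotonically and end at 0 for the final frame.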