RoboTurk — A System for Large-Scale Teleoperation of Robots

The RoboTurk Real Robot Dataset: By the Numbers

111.25

1

54 2114

Hours of Demonstrations

Week To Collect

Unique Users

Total Demonstrations

Using the RoboTurk platform, we were able to collect the largest dataset for robotic manipulation via teleoperation to date using only 3 Sawyer robot arms. Demonstrating diversity, dexterity, and scale. When compared to previous works, we collect between 8x-100x the amount of robot hours with episodes lasting 3x-40x longer which shows the diversity of demonstrations as well as task difficulty.

Comparison To Similar Robot Datasets Collected Using Human Supervision

Name

Interface

Task

Avg Task Length (sec)

Number of Demos

Total Time (hours)

JIGSAWS

daVinci

Surgery

60*

103

1.66 Deep Imitation

VR

Pick, Grasp, Align

5*

1664

2.35 DAML

Human Demos

Pick, Place, Push

5*

2941

4.08 MIME

Kinesthetic

Pick, Place, Push, Pour

6*

8260

13.7*

Ours

iPhone AR

Long Horizon Object Manip

186 2144

111.25

* indicates extrapolated values from the information reported by the dataset

Scaling To New Users

Users were able to rapidly adapt to the control interface and successfully complete all the tasks with many seeing their fist task completion on their very first try. This shows the benefit to using humans for data collection over self-supervised methods which might be incapable of solving complext tasks.

< 2 min

< 4 min

Time until first object search object placement

Time until first successful towel layout

People can also teleoperate from anywhere around the world. We have successfully teleoperated from the top of moutanins and from three different countries around the world. This specific video was shot in Macau when the robot was located in our lab in California.

Dataset: Diversity

The tasks that were chosen are difficult and lend themselves to producing a highly diverse dataset that admits many different approaches to solving the task. Below are random samples from the dataset for each task.

We also observe the use of many different emergent strategies for solving the tasks. Below we present some of them from the tower creation task.

An 'inverted cone'

Flipping the bottom bowl

Alternating cups and bowls

Using three cups to make a stable base

Dataset: Dexterity

The tasks selected also require a high degree of dexterity in the types of interactions and grasps required to complete the tasks. Many times a simple top down grasp would not be capable of solving the task. We observed many interesting approaches to dexterity, and even creative uses of the objects themselves.

Stuffing objects into the goal location

Creative object use

Move an object for a better grasp

Flipping objects for a better grasp

Clever grasp for placement

Approach object at an angle

Task Design for the RoboTurk Real Robot Dataset

We use the RoboTurk platform to collect a large dataset on three difficult manipulation tasks that involve planning, vision, and dexterity. All three of these tasks require high-level reasoning (the "what") on the part of the demonstrator and low-level dexterity (the "how") making human demonstrations necessary.

The "laundry layout" task

In the "laundry layout" task, users are to use the robot to efficiently flatten the cloth or other fabric object. This task requires reasoning to identify how best to flatten the object as well as dexterous manipulation behaviors like pushing, pulling, and pick-and-place.

The "tower creation" task

In the "tower creation" task, users are to use the robot to stack common kitchen items into the tallest tower that is possible. This task requires reasoning to construct a tower that is both tall and stable as well as dexterous manipulation behaviors like stacking.

The "object search" task

In the "object search" task, users are to search in a bin for three objects of the same class without discarding distractor objects. This task requires reasoning to search effectively and identify target objects as well as dexterous manipulation behaviors like picking and fitting.

Collecting the Data

The dataset collected contains information from a variety of different sensors and data streams. In particular, we collect the following streams:

Low level robot state (joint-level information)
User control stream - phone poses and controls received from the operator
Overhead Kinect RBGD images
Front-facing RGB camera (the same view that was provided to users)

User Improvement

As users interact with the system longer, the quality of their demonstrations generally increases. Users learn to more efficiently complete the object search and laundry layout tasks, which is exemplified by decreased time to completion. Total amount of effort (as measured by amount of translational movement at the end effector) does not change significantly as experience increases. This shows that users translate the end effector at the same rate. However, users learn to become more dexterous with their manipulation by increasingly utilizing changes in orientation as experience increases.

Object Search

Laundry Layout

Task Difficulty Analysis

Measure

Object Search

Tower Stacking

Laundry Layout

Total

53.9 ± 11.2

76.9 ± 12.2

51.5 ± 12.8

Mental

12.2 ± 3.7

13.2 ± 4.4

9.3 ± 5.3

Physical

11.1 ± 4.0

11.8 ± 4.7

9.6 ± 4.6

Temporal

6.0 ± 4.6

12.2 ± 6.2

7.9 ± 4.9

Performance*

6.3 ± 4.8

12.4 ± 5.1

7.1 ± 6.1

Effort

10.8 ± 4.2

14.2 ± 3.9

10.3 ± 5.1

Frustration

7.5 ± 5.0

13.1 ± 5.2

7.3 ± 5.4

A majority (60.8%) of users felt comfortable using the system within 15 minutes with 96% comfortable within an hour. Users were asked to self-evaluate their performance on each task using the NASA TLX survey which gives a measure of task difficulty (higher total corresponds to increased difficulty). Note that * indicates lower is better.

Using The Dataset: Reward Learning

To illustrate the utility of our dataset, we examine the possibility of inferring a reward signal from the collected demonstrations. These methods show that the dataset is useful for imitation learning.

Specifically, we use a modified version of Time Contrastive Networks to efficiently learn an embeding space that is meaningful across long time horizons where a reward signal can be inferred by negative distances in the embedding space. The circled numbers show correspondences between points on the curve and images from the trajectory.

Using The Dataset: Imitation Learning

We additionally show that the colected data can be used with straightforward imitation learning methods. Specifically, we demonstrate successful utilization of Behavioral Cloning on simplified instances of the Laundry Layout task where the agent needs only one unfolding step to solve the environment. See the trained (sped up 3x) policy below.