Automated grasping has a long history of research, and activity is increasing due to interest from industry. One grand challenge for robotics is Universal Picking: the ability to robustly grasp a broad variety of objects in diverse environments for applications from warehouses to assembly lines to homes. Although many researchers now openly share code and data, it is challenging to compare or reproduce experimental results, and thus to identify which aspects of which approaches work best, because of variations in assumptions and experimental protocols, e.g., sensors, lighting, robot arms, grippers, and objects.
In computer vision, the emergence of specific reproducible benchmarks advanced the field considerably and provided “gradients” that expose gaps in the state of the art. With physical experiments, however, performance depends crucially on the hardware and environment, and it is not possible for every lab to experiment under exactly the same conditions. Nor is exact uniformity desirable, as methods must ultimately work across various sensors, robot arms, grippers, object sets, and environments.
Industrial practitioners characterize picking in terms of the three R’s: rate, reliability, and range (class of objects). One metric for comparison is mean picks per hour (MPPH), which is common in the logistics industry, where it is recognized that human workers can operate in the range of 400–600 MPPH for warehousing operations [1].* This can be formalized as
\begin{align*}
\mathbb{E}[\rho] &: \text{mean picks per hour (MPPH)} \\
\mathbb{E}[\rho] &= v \cdot q \quad (\text{computed as the mean over } N \text{ grasp attempts}) \\
v &= 1/(t_{s}+t_{c}+t_{r})
\end{align*}
where $v$ is the mean number of grasp attempts per unit time, $q$ is the mean grasp reliability (the probability that an attempt succeeds), and $t_{s}$, $t_{c}$, and $t_{r}$ are the per-attempt sensing, computation, and robot motion times.
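As a concrete illustration, the sketch below (in Python, with hypothetical variable names and data) estimates MPPH and the standard error of the reliability estimate from $N$ logged grasp attempts, assuming per-attempt sensing, computation, and robot motion times are recorded in seconds and success is recorded as a binary outcome.
\begin{verbatim}
import math

# Hypothetical per-attempt log: (t_s, t_c, t_r) in seconds and a binary outcome.
attempts = [
    (0.4, 0.8, 5.3, True),
    (0.4, 0.7, 6.1, False),
    (0.5, 0.9, 5.8, True),
    # ... one entry per grasp attempt, N in total
]

n = len(attempts)
mean_cycle_s = sum(t_s + t_c + t_r for t_s, t_c, t_r, _ in attempts) / n
q = sum(1 for *_, success in attempts if success) / n  # mean reliability
v = 3600.0 / mean_cycle_s    # mean grasp attempts per hour (seconds -> hours)
mpph = v * q                 # estimated E[rho], mean picks per hour

# Standard error of q, treating each attempt as an independent Bernoulli trial.
se_q = math.sqrt(q * (1.0 - q) / n)

print(f"v = {v:.1f} attempts/h, q = {q:.2f} +/- {se_q:.2f}, MPPH ~ {mpph:.1f}")
\end{verbatim}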
Performance may vary with experimental conditions such as the robot arm used (cobot versus industrial arm) or the application considered (e.g., picking any object versus picking a specific object). To aid comparison, we recommend reporting the variables above, the number of trials, and the experimental conditions described below, including tolerances and standard errors where applicable.
Procedure: Results should differentiate between training and test procedures and describe several factors. First is the environment configuration that includes the workspace area for objects and the set of objects that will not be grasped such as bins, shelves, and tables. Another is the calibration subprocedure, such as registering sensors to the robot coordinate frame. A third important factor is the subprocedure for sampling initial object configurations, which may be random (e.g., placing objects in a container, shaking, and emptying into a workspace), or structured (e.g., packed boxes and items on shelves). An additional consideration is the subprocedure for evaluating success and taking auxiliary measurements (e.g., which object was targeted), which may involve human responses, sensor readings, or machine learning. It is also important to describe any additional training procedures used for machine learning or to set environment parameters (e.g., location and distance between bins, shelves, and tables).
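One lightweight way to make these procedural details reproducible is to record them in a structured file that travels with the results. The following sketch (Python dataclasses with hypothetical field names, not a prescribed schema) illustrates the kind of metadata a report might serialize.
\begin{verbatim}
from dataclasses import dataclass, asdict
import json

@dataclass
class PickingProcedure:
    """Hypothetical metadata record describing one experimental protocol."""
    phase: str                   # "training" or "test"
    workspace: str               # e.g., "0.6 m x 0.9 m bin on a fixed table"
    non_graspable_objects: list  # bins, shelves, tables, etc.
    calibration: str             # e.g., "checkerboard hand-eye calibration"
    initial_configuration: str   # random (shaken/emptied) or structured (packed)
    success_evaluation: str      # e.g., "human observer", "scale in target bin"
    training_notes: str = ""     # extra training procedures or environment parameters

procedure = PickingProcedure(
    phase="test",
    workspace="0.6 m x 0.9 m bin on a fixed table",
    non_graspable_objects=["bin", "table"],
    calibration="checkerboard hand-eye calibration, repeated before each session",
    initial_configuration="25 objects shaken in a box and emptied into the bin",
    success_evaluation="object registered by a scale in the target bin",
)

# Serialize alongside the results so the protocol accompanies the data.
with open("procedure.json", "w") as f:
    json.dump(asdict(procedure), f, indent=2)
\end{verbatim}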
Object Range: Object characteristics such as shape, size, materials (e.g., rigid versus deformable), friction, mass distribution, and reflectance properties can affect performance. The unique sets of objects used for training and testing should be reported with an image displaying each object without occlusions and either a reference CAD model for 3-D printed objects or purchase information for other objects. We recommend using objects from published object data sets such as YCB [2], and when objects are left out or new objects are introduced, authors should provide an explanation for the object choice.
Success Metrics: The success metric used to evaluate performance should be described in detail. Example success criteria might include whether any object is dropped into a target bin or whether a specific object is placed in the desired pose. To evaluate MPPH, it is also important to report the runtime in seconds for each grasp attempt in terms of sensing, computation, and movement.
In addition to MPPH, reports can include raw data collected from experiments, such as videos, sensor readings, and sequences of robot arm joint angles, to aid replicability. Additional metrics that are useful to report include the identity of targeted objects, confidence values from the grasp planner, and precision-recall metrics based on those confidence values.
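For the confidence-based metrics, one simple option is to threshold the planner’s confidence values against the recorded outcomes. A minimal sketch, assuming per-attempt confidence scores and binary success labels have been logged (the data below are placeholders):
\begin{verbatim}
def precision_recall(confidences, successes, threshold):
    """Precision/recall of 'attempt a grasp only when confidence >= threshold'."""
    predicted = [c >= threshold for c in confidences]
    tp = sum(p and s for p, s in zip(predicted, successes))
    fp = sum(p and not s for p, s in zip(predicted, successes))
    fn = sum((not p) and s for p, s in zip(predicted, successes))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical logged data: one confidence value and outcome per grasp attempt.
confidences = [0.91, 0.72, 0.55, 0.88, 0.34, 0.67]
successes   = [True, True, False, True, False, True]

for t in (0.5, 0.7, 0.9):
    p, r = precision_recall(confidences, successes, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
\end{verbatim}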
Computer System: The type and number of computer processors used at runtime, including both CPUs and GPUs, should be reported. It is also helpful to report the operating system and other specifications of the computer (e.g., network interface card bandwidth, USB bandwidth).
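Much of this information can be captured automatically at runtime. A rough sketch using standard Python modules, plus an optional call to nvidia-smi that may not be available on every system:
\begin{verbatim}
import os
import platform
import subprocess

system_info = {
    "os": platform.platform(),
    "cpu": platform.processor() or platform.machine(),
    "cpu_count": os.cpu_count(),
    "python": platform.python_version(),
}

# GPU model(s), if an NVIDIA driver is installed (optional).
try:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    system_info["gpus"] = out.stdout.strip().splitlines()
except (FileNotFoundError, subprocess.CalledProcessError):
    system_info["gpus"] = []

print(system_info)
\end{verbatim}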
Robot Arm: Include details on the kinematics, workspace, payload, speed, accuracy/repeatability, software, price, and whether the manipulator is mobile or fixed. The robot arm manufacturer, model, and version should also be reported.
Robot Grippers/Hands: Include details on the number of finger contacts, actuation method (e.g., electric or pneumatic), control scheme (e.g., position, force, or hybrid), kinematics, workspace, contact surfaces (e.g., area, friction, stiffness, and materials), payload, speed, and repeatability.
Sensors: Include details on the modality (e.g., color versus RGB-D), lenses, resolution, field of view, light sensitivity, and noise levels. The camera manufacturer and model should be reported along with the configuration categorized as follows:
Single Fixed Viewpoint: The system receives color and depth images from one externally mounted fixed RGB-D sensor (overhead or over-the-shoulder).
Multiple Fused Viewpoints: The system takes input from several fixed RGB-D sensors and combines them into one 3-D estimate.
Moving Viewpoint: The system takes input from one or more moving sensors on board a mobile manipulator or eye-in-hand sensor.
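A report might encode this sensor information, including the viewpoint category above, as a structured record. The sketch below uses hypothetical field names and placeholder values, not a recommended sensor:
\begin{verbatim}
from dataclasses import dataclass
from enum import Enum

class ViewpointConfig(Enum):
    SINGLE_FIXED = "single fixed viewpoint"
    MULTIPLE_FUSED = "multiple fused viewpoints"
    MOVING = "moving viewpoint"

@dataclass
class SensorSpec:
    """Hypothetical record of one sensor used in the experiments."""
    manufacturer: str
    model: str
    modality: str               # e.g., "RGB-D"
    resolution: tuple           # (width, height) in pixels
    field_of_view_deg: tuple    # (horizontal, vertical)
    depth_noise_mm: float       # reported noise level at working distance
    configuration: ViewpointConfig

sensor = SensorSpec(
    manufacturer="ExampleCam",  # placeholder, not an endorsement
    model="D-100",
    modality="RGB-D",
    resolution=(1280, 720),
    field_of_view_deg=(69.0, 42.0),
    depth_noise_mm=2.0,
    configuration=ViewpointConfig.SINGLE_FIXED,
)
print(sensor)
\end{verbatim}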
Controller: Include details on control methods for the robot arm and gripper. Example controllers include Cartesian position, impedance, hybrid position/force, and custom methods.
Lighting: Because lighting affects sensing, it is useful to report the lighting conditions: light levels (lux), structured versus ambient illumination, and indoor versus outdoor settings, including combinations.
Application: Performance may be categorized based on the target application, such as warehouse picking, pick and place, industrial kitting, stowing, assembly, and assistive care tasks such as instrumental activities of daily living.
Other factors may also be important. We invite others in the community to join the online discussion, which began in November 2017, via e-mail or via the collaborative document at http://goo.gl/6M5rfw.
We also refer the reader to other efforts on benchmarking grasping and manipulation [2]–[4]. We thank our colleagues who also provided input to the discussion: Sergey Levine, Sidd Srinivasa, Howie Choset, Russ Tedrake, Vincent Vanhoucke, Raia Hadsell, Kurt Konolige, Tom Fuhlbrigge, Tye Brady, Juan Aparicio Ojea, and Peter Puchwein.
* “A human is capable of performing... at a rate of approx 400 [picks] per hour with minimal errors, while the best robot in the first APC achieved a rate of approx 30 [picks] per hour with [an 84% success rate].” [5]