I. Introduction
The integration of diffusion models [1], [2], [3] with the recent trend of exploiting data symmetries through equivariant model architectures [4], [5], [6] has demonstrated promising results in handling multimodal grasping data, marking these approaches as the state of the art in robotic grasping. However, current grasping methods are often designed for task-specific end-effectors and exhibit limited transferability to alternative gripper architectures. These methods are tailored to predefined hardware configurations, assuming fixed characteristics such as degrees of freedom (DoF), gripper geometry, and the physical contact mechanics of the end-effector. Transferability is crucial for developing generalizable grasp detection methods that can work across various robot systems with different gripper designs. It requires solutions that are robust to specific hyperparameters and hidden biases, allowing them to incorporate data from diverse sources [7], [8]. Incorporating gripper-agnostic design into architectures could pave the way for future foundation models in robotic grasping [9], [10]. Our contribution is twofold. First, we propose an approach based on an -equivariant architecture [3], [11] that demonstrates state-of-the-art performance in both single-gripper settings and multi-gripper benchmarks, and we provide experimental validation of this approach on diverse object datasets [12], [13]. Second, we open-source our grasp scene generation framework, which integrates eight gripper types and over 1,000 objects. Our findings strongly suggest that training on cluttered heaps is beneficial for grasp synthesis methods; we therefore include a pipeline for variable-size object heap generation in both tabletop and bin-picking settings.
Implementation can be found at https://github.com/boschresearch/mj-grasp-sim.