High-resolution representations are important for vision-based robotic
grasping problems. Existing works generally encode input images into
low-resolution representations via sub-networks and then recover
high-resolution representations. This encode-decode process loses spatial
information, and the errors introduced by the decoder become more severe when
multiple types of objects are present or when objects are far from the camera.
To address these
issues, we revisit the CNN design paradigm for robotic perception tasks. We
demonstrate that using parallel branches, rather than serially stacked
convolutional layers, is a more powerful design for robotic visual grasping
tasks. In particular, we provide neural network design guidelines for robotic
perception tasks, e.g., high-resolution representations and lightweight
design, which address the challenges of different manipulation scenarios. We
then develop a novel visual grasping architecture, referred to as HRG-Net: a
parallel-branch structure that maintains a high-resolution representation
throughout and repeatedly exchanges information across resolutions.
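To make the parallel-branch idea concrete, the following is a minimal PyTorch-style sketch of one two-resolution exchange block, assuming an HRNet-like fusion scheme; all names, channel widths, and layer choices are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class TwoBranchExchange(nn.Module):
    """Hypothetical two-resolution exchange block: a full-resolution
    branch and a half-resolution branch run in parallel, then each
    absorbs the other's features (illustrative, not HRG-Net itself)."""

    def __init__(self, ch_hi=32, ch_lo=64):
        super().__init__()
        # Parallel convolutional branches, each at its native resolution.
        self.hi = nn.Sequential(nn.Conv2d(ch_hi, ch_hi, 3, padding=1),
                                nn.BatchNorm2d(ch_hi), nn.ReLU(inplace=True))
        self.lo = nn.Sequential(nn.Conv2d(ch_lo, ch_lo, 3, padding=1),
                                nn.BatchNorm2d(ch_lo), nn.ReLU(inplace=True))
        # Cross-resolution exchange: strided conv down, 1x1 conv + upsample up.
        self.hi_to_lo = nn.Conv2d(ch_hi, ch_lo, 3, stride=2, padding=1)
        self.lo_to_hi = nn.Sequential(
            nn.Conv2d(ch_lo, ch_hi, 1),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))

    def forward(self, x_hi, x_lo):
        h, l = self.hi(x_hi), self.lo(x_lo)
        # Fuse across resolutions; the high-resolution path is kept
        # end to end instead of being downsampled and later decoded.
        return h + self.lo_to_hi(l), l + self.hi_to_lo(h)

# Usage sketch on a 224x224 input and its half-resolution counterpart.
x_hi, x_lo = torch.randn(1, 32, 224, 224), torch.randn(1, 64, 112, 112)
y_hi, y_lo = TwoBranchExchange()(x_hi, x_lo)  # per-branch shapes preserved
```

Stacking such blocks, with the high-resolution branch never removed, is what distinguishes this design from encode-decode pipelines that must reconstruct spatial detail.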
Extensive experiments validate that these two designs effectively enhance the
accuracy of vision-based grasping and accelerate network training. We show a
series of comparative experiments in real physical environments on YouTube:
https://youtu.be/Jhlsp-xzHFY