Underwater Gesture Recognition Using Classical Computer Vision and Deep Learning Techniques
Underwater gesture recognition is a challenging task since conditions that are normally not an issue in gesture recognition on land must be considered. Such issues include low visibility, low contrast, and unequal spectral propagation. In this work, we explore the underwater gesture recognition problem using the recently released Cognitive Autonomous Diving Buddy (CADDY) Underwater Gestures dataset. The contributions of this paper are as follows: (1) use traditional computer vision techniques along with classical machine learning to perform gesture recognition on the CADDY dataset; (2) apply deep learning using a convolutional neural network (CNN) to solve the same problem; (3) perform confusion matrix analysis to determine which types of gestures are relatively difficult to recognize, and understand why; (4) compare the performance of these methods in terms of accuracy and inference speed. We achieve up to 97.06% accuracy with our CNN. To the best of our knowledge, our work is one of the earliest attempts, if not the first, to apply computer vision and machine learning techniques to gesture recognition on this dataset. As such, we hope this work will serve as a benchmark for future work on the CADDY dataset.
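As a rough sketch of this kind of pipeline (not the authors' published configuration), a small PyTorch CNN classifier together with a confusion-matrix routine could look like the following; the class count, input handling, and layer sizes are assumptions:

    import torch
    import torch.nn as nn

    NUM_GESTURES = 16  # assumed number of gesture classes; not from the paper

    class GestureCNN(nn.Module):
        def __init__(self, num_classes: int = NUM_GESTURES):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(128, num_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(x).flatten(1))

    def confusion_matrix(preds, labels, num_classes):
        # Row = true class, column = predicted class; large off-diagonal
        # entries reveal which gestures get confused with one another.
        cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
        for t, p in zip(labels, preds):
            cm[t, p] += 1
        return cm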
Robotic Detection of a Human-Comprehensible Gestural Language for Underwater Multi-Human-Robot Collaboration
In this paper, we present a motion-based robotic communication framework that enables non-verbal communication among autonomous underwater vehicles (AUVs) and human divers. We design a gestural language for AUV-to-AUV communication that, unlike typical radio-frequency, light, or audio based AUV communication, can be easily understood by divers observing the conversation. To allow AUVs to visually understand a gesture from another AUV, we propose a deep network (RRCommNet) that exploits a self-attention mechanism to learn to recognize each message by extracting maximally discriminative spatio-temporal features. We train this network on diverse simulated and real-world data. Our experimental evaluations, both in simulation and in closed-water robot trials, demonstrate that the proposed RRCommNet architecture is able to decipher gesture-based messages with an average accuracy of 88-94% on simulated data and 73-83% on real data, depending on the version of the model used. Further, by performing a message transcription study with human participants, we also show that the proposed language can be understood by humans, with an overall transcription accuracy of 88%. Finally, we discuss the inference runtime of RRCommNet on embedded GPU hardware, for real-time use on board AUVs in the field.
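To illustrate the general idea, a minimal sketch of a self-attention network for gesture-sequence recognition in PyTorch is shown below; the frame encoder, layer sizes, and message count are assumptions, not the published RRCommNet architecture:

    import torch
    import torch.nn as nn

    class GestureSequenceNet(nn.Module):
        def __init__(self, num_messages: int = 32, dim: int = 256):
            super().__init__()
            # Tiny per-frame encoder standing in for a real CNN backbone.
            self.frame_encoder = nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(32 * 16, dim),
            )
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                               batch_first=True)
            self.temporal_attention = nn.TransformerEncoder(layer, num_layers=4)
            self.head = nn.Linear(dim, num_messages)

        def forward(self, clip: torch.Tensor) -> torch.Tensor:
            # clip: (batch, time, 3, H, W)
            b, t = clip.shape[:2]
            tokens = self.frame_encoder(clip.flatten(0, 1)).view(b, t, -1)
            tokens = self.temporal_attention(tokens)  # frames attend to frames
            return self.head(tokens.mean(dim=1))      # pool over time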
Token-Selective Vision Transformer for fine-grained image recognition of marine organisms
Introduction: The objective of fine-grained image classification of marine organisms is to distinguish the subtle variations between organisms so as to accurately classify them into subcategories. The key to accurate classification is locating the distinguishing feature regions, such as a fish's eye, fins, or tail. Images of marine organisms are hard to work with: they are often taken from multiple angles and in different scenes, usually have complex backgrounds, and often contain humans or other distractions, all of which makes it difficult to focus on the organism itself and identify its most distinctive features.
Related work: Most existing fine-grained image classification methods based on Convolutional Neural Networks (CNNs) cannot locate the distinguishing feature regions accurately enough, and the identified regions also contain a large amount of background. The Vision Transformer (ViT) has strong global information capturing abilities and performs strongly in traditional classification tasks. The core of ViT is the Multi-Head Self-Attention mechanism (MSA), which first establishes connections between the different patch tokens of an image and then combines the information of all tokens for classification.
Methods: However, not all tokens are conducive to fine-grained classification; many of them contain extraneous data (noise). We aim to eliminate the influence of interfering tokens, such as background data, on the identification of marine organisms, and then gradually narrow down the local feature area to accurately determine the distinctive features. To this end, this paper puts forward a novel Transformer-based framework, the Token-Selective Vision Transformer (TSVT), in which a Token-Selective Self-Attention (TSSA) mechanism selects the discriminating, important tokens for attention computation, which helps limit attention to more precise local regions. TSSA is applied at different layers, and the number of selected tokens in each layer decreases relative to the previous layer, so that the method gradually locates the distinguishing regions in a hierarchical manner.
Results: The effectiveness of TSVT is verified on three marine organism datasets, demonstrating that TSVT achieves state-of-the-art performance.
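For intuition, here is a minimal PyTorch sketch of the token-selection idea: rank patch tokens by the [CLS] token's attention weights and keep only the top-k for the next layer. This illustrates the mechanism rather than the authors' exact TSSA formulation; the dimensions and the keep count are assumptions:

    import torch
    import torch.nn as nn

    class TokenSelectiveBlock(nn.Module):
        def __init__(self, dim: int = 192, heads: int = 3, keep: int = 64):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.keep = keep  # patch tokens passed on to the next layer

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, 1 + num_patches, dim); token 0 is [CLS].
            out, weights = self.attn(tokens, tokens, tokens,
                                     need_weights=True,
                                     average_attn_weights=True)
            cls_to_patches = weights[:, 0, 1:]       # [CLS] attention to patches
            top = cls_to_patches.topk(self.keep, dim=-1).indices
            idx = top.unsqueeze(-1).expand(-1, -1, out.size(-1))
            patches = out[:, 1:].gather(1, idx)      # most-attended patches only
            return torch.cat([out[:, :1], patches], dim=1)  # [CLS] + selection

Stacking such blocks with a decreasing keep value (e.g. 128, then 64, then 32) mirrors the hierarchical narrowing of the attended region described above.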
Deep Perception Without a Camera: Enabling 3D Reconstruction and Object Recognition using Lidar and Sonar Sensing
Deep learning has recently revolutionized robot perception in many canonical robotic applications, such as autonomous driving. However, a similar transformation has yet to occur in harsher environments, including underwater and underground. This is due in part to the difficulty of deploying robots in these environments, which lack large real training datasets and often necessitate the use of non-traditional sensors for deep learning (e.g. imaging sonars and lidars). In this dissertation we demonstrate that by explicitly accounting for the sensor noise begotten by challenging environments and by incorporating synthetic data in the training process, the power of deep learning can be leveraged for deployment in these harsh environments.
In our first contribution we develop a framework that enables the real-time 3D reconstruction of underwater environments using features from 2D sonar images. Because sonar imagery is noisy and low-resolution compared with standard camera imagery, accurate sonar image analysis necessitates the explicit consideration of noise. While deep learning using Convolutional Neural Networks (CNNs) has been applied to sonar images, previous CNN-based methods do not explicitly consider the noise (from factors such as multi-pathing or irregular surfaces) often present in the images. In this contribution our key insight is to use atrous convolution, which has a larger field of context than standard convolution and is thus less misled by localized noise. We demonstrate that atrous convolution, together with human-in-the-loop feature annotation, provides real-time reconstruction capability on datasets captured onboard our underwater vehicle while operating in a variety of environments.
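For readers unfamiliar with atrous convolution, the following minimal PyTorch sketch shows the idea: with dilation 2, a 3x3 kernel covers the receptive field of a 5x5 kernel at no extra parameter cost, so each response draws on wider context and is less swayed by a few noisy sonar pixels (channel counts here are arbitrary):

    import torch
    import torch.nn as nn

    standard = nn.Conv2d(1, 16, kernel_size=3, padding=1)              # local context
    atrous = nn.Conv2d(1, 16, kernel_size=3, padding=2, dilation=2)    # wider context

    sonar_patch = torch.randn(1, 1, 96, 96)  # stand-in single-channel sonar image
    assert standard(sonar_patch).shape == atrous(sonar_patch).shape    # same output size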
In our second contribution we remove the human from the loop and develop an approach which leverages deep learning for a fully automated 3D underwater reconstruction algorithm using 2D sonar images as input. Our algorithm is able to produce accurate estimates even when common physical models break down due to phenomena such as non-diffuse reflections. Inspired by our success in the previous contribution, we propose the utilization of CNNs as a powerful method to extract meaningful information without being misled by noisy data. To ensure training convergence, we also introduce a self-supervised method that uses the physics of the sonar sensor to train the network on real data without ground-truth information. Our method can produce accurate 3D estimates given only a single image. We demonstrate that our method produces 3D reconstructions with an 80% reduction in Root Mean Square Error compared to previous approaches, both in simulation and on real data.
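A very rough sketch of the self-supervision idea follows, assuming a hypothetical differentiable forward model render_sonar that re-renders a sonar image from a predicted 3D surface (the actual physics model is the dissertation's and is not reproduced here):

    import torch
    import torch.nn.functional as F

    def self_supervised_loss(predicted_surface, input_image, render_sonar):
        # If the predicted 3D surface is correct, re-rendering it through the
        # sensor physics should reproduce the observed sonar image, so the
        # input image itself serves as the training target (no ground truth).
        resimulated = render_sonar(predicted_surface)
        return F.l1_loss(resimulated, input_image)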
We then extend this approach to leverage the series of images the robot collects as it moves through the environment. Specifically, we develop two CNNs that take as input multiple images captured at different points in time and output a more accurate prediction than just using a single image as input. To our knowledge this is the first such multi-sonar-image CNN designed for the 3D underwater reconstruction task. We validate this extension on synthetic and real data and show up to a 5% improvement over competing methods.
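One simple way to condition a reconstruction CNN on several sonar images, shown here purely as an illustrative sketch with assumed frame count and channel widths, is to concatenate the frames along the channel axis before the first convolution:

    import torch
    import torch.nn as nn

    NUM_FRAMES = 4  # assumed number of sonar images per prediction

    multi_frame_net = nn.Sequential(
        nn.Conv2d(NUM_FRAMES, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 3, padding=1),       # per-pixel 3D (elevation) estimate
    )

    frames = torch.randn(1, NUM_FRAMES, 128, 128)  # stand-in sonar image stack
    depth_map = multi_frame_net(frames)            # shape (1, 1, 128, 128)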
Finally, we develop an improved method for incorporating synthetic data into the training process. This takes our previous contribution a step further by more tightly coupling synthetic and real point cloud feature extraction. We develop an adversarial training technique which, along with the standard object detection loss, provides a training signal that encourages similar feature extraction from both synthetic and real clouds. This brings the training process closer to the preferred scenario, in which the synthetic point clouds contain features very similar to those found in the real scans. We validate our approach in the context of the data-limited DARPA Subterranean Challenge and demonstrate that our 3D adversarial training architecture improves 3D object detection performance by up to 15%, depending on the data representation.
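One common way to realize such adversarial feature alignment, offered here as a hedged sketch rather than the dissertation's implementation, is a domain discriminator trained through a gradient-reversal layer: the discriminator learns to tell synthetic from real features, while the reversed gradients push the feature extractor to make them indistinguishable (the feature dimension and networks are illustrative):

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad):
            return -grad  # flipped gradient: extractor fights the discriminator

    FEATURE_DIM = 256  # assumed width of the extracted point-cloud features
    discriminator = nn.Sequential(
        nn.Linear(FEATURE_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
    bce = nn.BCEWithLogitsLoss()

    def domain_loss(synth_feats: torch.Tensor, real_feats: torch.Tensor):
        feats = GradReverse.apply(torch.cat([synth_feats, real_feats]))
        labels = torch.cat([torch.zeros(len(synth_feats), 1),
                            torch.ones(len(real_feats), 1)])
        return bce(discriminator(feats), labels)  # added to the detection loss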
CADDY Underwater Stereo-Vision Dataset for Human–Robot Interaction (HRI) in the Context of Diver Activities
In this article, we present a novel underwater dataset collected from several field trials within the EU FP7 project "Cognitive autonomous diving buddy (CADDY)", where an Autonomous Underwater Vehicle (AUV) was used to interact with divers and monitor their activities. To our knowledge, this is one of the first efforts to collect a large public dataset in underwater environments with the purpose of studying and boosting object classification, segmentation and human pose estimation tasks. The first part of the dataset contains stereo camera recordings (≈10K) of divers performing hand gestures to communicate with an AUV in different environmental conditions. The gestures can be used to test the robustness of visual detection and classification algorithms in underwater conditions, e.g., under color attenuation and light backscatter. The second part includes stereo footage (≈12.7K) of divers free-swimming in front of the AUV, along with synchronized measurements from Inertial Measurement Units (IMUs) located throughout the diver's suit (DiverNet), which serve as ground truth for human pose and tracking methods. In both cases, these rectified images allow the investigation of 3D representation and reasoning pipelines on the low-texture targets commonly present in underwater scenarios. This work describes the recording platform, the sensor calibration procedure, the data format, and the software utilities provided to use the dataset.
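As a usage illustration only, a loader for rectified stereo pairs might look like the sketch below; the directory layout, filenames, and label convention are hypothetical stand-ins, not the dataset's actual structure (consult the dataset documentation for the real format):

    from pathlib import Path
    import cv2  # OpenCV for image I/O

    def load_stereo_pairs(root: str):
        # Yield (left, right, label) for each rectified stereo recording.
        # The "left"/"right" folders and class-folder labels are assumptions.
        for left_path in sorted(Path(root).glob("**/left/*.jpg")):
            right_path = Path(str(left_path).replace("/left/", "/right/"))
            left = cv2.imread(str(left_path))
            right = cv2.imread(str(right_path))
            label = left_path.parent.parent.name  # hypothetical gesture folder
            yield left, right, label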