30 research outputs found

    Performance evaluation of deep feature learning for RGB-D image/video classification

    Deep Neural Networks for image/video classification have achieved great success in various computer vision applications. Existing deep learning algorithms are widely applied to RGB images or video data. Meanwhile, with the development of low-cost RGB-D sensors (such as the Microsoft Kinect and Xtion Pro-Live), high-quality RGB-D data can be easily acquired and used to enhance computer vision algorithms [14]. It is therefore interesting to investigate how deep learning can be employed for extracting and fusing features from RGB-D data. In this paper, after briefly reviewing the basic concepts of RGB-D information and four prevalent deep learning models (Deep Belief Networks (DBNs), Stacked Denoising Auto-Encoders (SDAE), Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) neural networks), we conduct extensive experiments on five popular RGB-D datasets, comprising three image datasets and two video datasets. We then present a detailed comparison of the feature representations learned by the four deep learning models. In addition, we offer suggestions on how to adjust hyper-parameters when training deep neural networks. Based on the extensive experimental results, we believe this evaluation provides insight into, and a deeper understanding of, the different deep learning algorithms for RGB-D feature extraction and fusion.
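    For readers who want a concrete picture of what fusing features from RGB-D data with a CNN can look like, the sketch below shows a generic two-stream late-fusion network in PyTorch. It is an illustrative assumption rather than any of the architectures evaluated in the paper: the layer widths, the 64×64 input resolution, and the number of classes are all placeholders.

```python
# Minimal two-stream late-fusion sketch (illustrative, not the paper's model):
# one small CNN per modality, features concatenated before a linear classifier.
import torch
import torch.nn as nn

class ModalityStream(nn.Module):
    """Small CNN mapping one modality (RGB or depth) to a 64-d feature vector."""
    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # -> (B, 64, 1, 1)
        )

    def forward(self, x):
        return self.features(x).flatten(1)   # (B, 64)

class LateFusionRGBD(nn.Module):
    """Concatenate per-modality features, then classify."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.rgb_stream = ModalityStream(in_channels=3)
        self.depth_stream = ModalityStream(in_channels=1)
        self.classifier = nn.Linear(64 + 64, num_classes)

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return self.classifier(fused)

# Example forward pass with random tensors standing in for an RGB-D batch.
model = LateFusionRGBD(num_classes=10)
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 64, 64))
print(logits.shape)   # torch.Size([4, 10])
```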

    ENHANCING KURDISH SIGN LANGUAGE RECOGNITION THROUGH RANDOM FOREST CLASSIFIER AND NOISE REDUCTION VIA SINGULAR VALUE DECOMPOSITION (SVD)

    Deaf people around the world face difficulties communicating with others and therefore use their own languages to communicate with each other. This paper introduces a new approach to Kurdish sign language recognition using the random forest classifier, aiming to enable deaf communities to communicate with others without relying on human interpreters. In addition, linear algebra techniques are used to enhance the images captured during recognition: singular value decomposition (SVD) for image compression and the Moore–Penrose inverse for blur removal. The Kurdish alphabet has 34 letters and 10 numerals (1, 2, 3, ..., 10). Additionally, three extra signs (space, backspace, and delete sentence) were created and added to the dataset for real-time translation. A collection of 800 images was gathered for each character; of these, only 80 per character were used, owing to their similar positions but varied alignment, giving a dataset of 3,520 images in total (44 characters × 80 images each). Two simulation scenarios were carried out: one with optimal conditions (a white background and adequate lighting) and another with challenges such as complex backgrounds and varied lighting angles. They achieved high match rates of 96% and 87%, respectively. Further, a classification report analyses the precision, recall, and F1-score metrics.
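    The linear-algebra preprocessing described above can be illustrated with a small NumPy sketch: rank-k SVD for image compression and the Moore–Penrose pseudoinverse applied to a toy blur operator. The rank, image size, and blur matrix are assumptions for illustration, not the paper's actual settings.

```python
import numpy as np

def svd_compress(image, k):
    """Return a rank-k approximation of a 2-D (grayscale) image array."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Example on a random "image": keep only the top 20 singular values.
img = np.random.rand(128, 128)
approx = svd_compress(img, k=20)
print(np.linalg.norm(img - approx) / np.linalg.norm(img))   # relative error

# Moore-Penrose pseudoinverse for deblurring with a known blur operator H
# (a toy 1-D example; real deblurring would build H from the blur kernel).
H = np.tril(np.ones((5, 5))) / 5.0        # toy blur matrix
x = np.arange(5.0)                        # sharp signal
x_hat = np.linalg.pinv(H) @ (H @ x)       # recover signal from blurred version
print(np.allclose(x, x_hat))              # True
```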

    Feature Learning for RGB-D Data

    RGB-D data has turned out to be a very useful representation for solving fundamental computer vision problems. It takes advantage of color images, which provide appearance information about an object, and of depth images, which are immune to variations in color, illumination, rotation angle and scale. With the invention of the low-cost Microsoft Kinect sensor, which was initially used for gaming and later became a popular device for computer vision, high-quality RGB-D data can be acquired easily. RGB-D images and video can facilitate a wide range of application areas, such as computer vision, robotics, construction and medical imaging. Furthermore, how to fuse RGB information and depth information remains an open problem in computer vision: simply concatenating RGB and depth data is not enough, and more powerful fusion algorithms are needed. In this thesis, to explore the advantages of RGB-D data further, we use several popular RGB-D datasets for deep feature learning algorithm evaluation, hyper-parameter optimization, local multi-modal feature learning, RGB-D data fusion and recognizing RGB information from RGB-D images: i) With the success of Deep Neural Networks in computer vision, deep features learned from fused RGB-D data can be shown to give better results than RGB data alone. However, different deep learning algorithms perform differently on different RGB-D datasets. Through large-scale experiments that comprehensively evaluate the performance of deep feature learning models for RGB-D image/video classification, we conclude that RGB-D fusion methods using CNNs consistently outperform the other selected methods (DBNs, SDAE and LSTM). On the other hand, since LSTMs can learn from experience to classify, process and predict time series, they achieved better performance than DBNs and SDAE in video classification tasks. ii) Hyper-parameter optimization can help researchers quickly choose an initial set of hyper-parameters for a new classification task, thus reducing the number of trials over the hyper-parameter space. We present a simple and efficient framework for improving the efficiency and accuracy of hyper-parameter optimization by considering the classification complexity of a particular dataset, and we verify it on three real-world RGB-D datasets. Our analysis confirms that the framework provides deeper insight into the relationship between classification tasks and hyper-parameter optimization, allowing an accurate initial set of hyper-parameters to be chosen quickly for a new classification task. iii) We propose a new Convolutional Neural Network (CNN)-based local multi-modal feature learning framework for RGB-D scene classification. This method effectively captures much of the local structure in RGB-D scene images and automatically learns a fusion strategy for the object-level recognition step, instead of simply training a classifier on top of features extracted from both modalities. Experiments on two popular datasets show that our local multi-modal CNN method greatly outperforms state-of-the-art approaches and has the potential to improve RGB-D scene understanding. Extended evaluation shows that a CNN trained on a scene-centric dataset achieves an improvement on scene benchmarks compared with a network trained on an object-centric dataset.
iv) We propose a novel method for RGB-D data fusion: raw RGB-D data are projected into a complex space, and features are then jointly extracted from the fused RGB-D images. Besides three observations about fusion methods, the experimental results show that our method achieves competitive performance against classical SIFT. v) We propose a novel method called adaptive Visual-Depth Embedding (aVDE), which first learns a compact shared latent space between the representations of the labeled RGB and depth modalities in the source domain. This shared latent space then helps transfer the depth information to the unlabeled target dataset. Finally, aVDE matches features and reweights instances jointly across the shared latent space and the projected target domain to build an adaptive classifier. This method utilizes the additional depth information in the source domain while simultaneously reducing the domain mismatch between the source and target domains. On two real-world image datasets, the experimental results show that the proposed method significantly outperforms state-of-the-art methods.
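    Part iv) above does not spell out the projection into a complex space, so the following sketch shows one plausible reading as an illustration only: grayscale intensity as the real part, normalised depth as the imaginary part, followed by a simple joint magnitude/phase descriptor. Both the projection and the descriptor are assumptions, not the thesis' actual formulation.

```python
import numpy as np

def fuse_complex(rgb, depth):
    """rgb: (H, W, 3) array, depth: (H, W) array -> complex (H, W) array."""
    gray = rgb.mean(axis=2) / 255.0                      # real part in [0, 1]
    d = (depth - depth.min()) / (np.ptp(depth) + 1e-8)   # imaginary part in [0, 1]
    return gray + 1j * d

def joint_descriptor(z, bins=16):
    """Joint histogram of magnitude and phase of the fused complex image."""
    mag, phase = np.abs(z).ravel(), np.angle(z).ravel()
    hist, _, _ = np.histogram2d(mag, phase, bins=bins,
                                range=[[0, np.sqrt(2)], [-np.pi, np.pi]])
    return (hist / hist.sum()).ravel()                   # normalised feature vector

# Example with random data standing in for an RGB-D image pair.
rgb = np.random.randint(0, 256, (120, 160, 3)).astype(float)
depth = np.random.rand(120, 160)
feat = joint_descriptor(fuse_complex(rgb, depth))
print(feat.shape)   # (256,)
```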

    Action recognition from RGB-D data

    In recent years, action recognition based on RGB-D data has attracted increasing attention. Unlike traditional 2D action recognition, RGB-D data contains extra depth and skeleton modalities, and each modality has its own characteristics. This thesis presents seven novel methods that take advantage of the three modalities for action recognition. First, effective handcrafted features are designed, and a frequent pattern mining method is employed to mine the most discriminative, representative and non-redundant features for skeleton-based action recognition. Second, to take advantage of powerful Convolutional Neural Networks (ConvNets), the spatio-temporal information carried in 3D skeleton sequences is represented as three 2D images by encoding the joint trajectories and their dynamics into the color distribution of the images, and ConvNets are adopted to learn discriminative features for human action recognition. Third, for depth-based action recognition, three data augmentation strategies are proposed so that ConvNets can be applied to small training datasets. Fourth, to take full advantage of the 3D structural information offered by the depth modality and its insensitivity to illumination variations, three simple, compact yet effective image-based representations are proposed, and ConvNets are adopted for feature extraction and classification. However, both of the previous two methods are sensitive to noise and cannot differentiate fine-grained actions well. Fifth, to deal with this issue, a depth map sequence is represented as three pairs of structured dynamic images at the body, part and joint levels respectively, through bidirectional rank pooling. The structured dynamic image preserves the spatio-temporal information, enhances the structure information across body parts/joints and different temporal scales, and takes advantage of ConvNets for action recognition. Sixth, scene flow is extracted from RGB and depth data and used for action recognition. Last, to exploit the joint information in multi-modal features arising from heterogeneous sources (RGB, depth), a single ConvNet (referred to as c-ConvNet) is cooperatively trained on both RGB and depth features, deeply aggregating the two modalities to achieve robust action recognition.
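    The second method above (encoding 3D skeleton sequences as colour images for ConvNets) can be illustrated with a generic version of this idea: rows index joints, columns index frames, and the x/y/z coordinates map to the R/G/B channels after per-channel normalisation. This is a common encoding in the literature, given here as a sketch rather than the exact mapping used in the thesis.

```python
import numpy as np

def skeleton_to_image(seq):
    """seq: (T, J, 3) array of joint coordinates -> (J, T, 3) uint8 colour image."""
    img = np.transpose(seq, (1, 0, 2)).astype(float)      # joints x frames x xyz
    lo = img.min(axis=(0, 1), keepdims=True)
    hi = img.max(axis=(0, 1), keepdims=True)
    img = (img - lo) / (hi - lo + 1e-8)                   # scale each channel to [0, 1]
    return (img * 255).astype(np.uint8)

# Example: a random 60-frame sequence with 25 joints (Kinect v2 skeleton size).
seq = np.random.randn(60, 25, 3)
image = skeleton_to_image(seq)
print(image.shape, image.dtype)   # (25, 60, 3) uint8
```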

    Human activity learning for assistive robotics using a classifier ensemble

    Assistive robots in ambient assisted living environments can be equipped with learning capabilities to effectively learn and execute human activities. This paper proposes a human activity learning (HAL) system for application in assistive robotics. An RGB-depth sensor is used to acquire information about human activities, and a set of statistical, spatial and temporal features encoding key aspects of those activities is extracted from the acquired data. Redundant features are removed, and the relevant features are used in the HAL model. An ensemble of three individual classifiers (support vector machines (SVMs), K-nearest neighbour and random forest) is employed to learn the activities. The proposed system improves on the performance of comparable methods that use a single classifier. The approach is evaluated on an experimental dataset created for this work and on a benchmark dataset, the Cornell Activity Dataset (CAD-60). Experimental results show that the overall performance achieved by the proposed system is comparable to the state of the art and has the potential to benefit applications in assistive robots by reducing the time spent learning activities.
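    A minimal sketch of the kind of three-classifier ensemble described above is shown below, using scikit-learn's voting wrapper. The toy data, feature dimensionality and hyper-parameters are placeholders rather than the paper's actual setup, and the paper's feature extraction and selection steps are omitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy multi-class data standing in for the extracted activity features.
X, y = make_classification(n_samples=300, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)

# Ensemble of SVM, K-nearest neighbour and random forest with soft voting
# (average of predicted class probabilities).
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)

print(cross_val_score(ensemble, X, y, cv=5).mean())   # mean 5-fold accuracy
```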

    Exploring Robot Teleoperation in Virtual Reality

    This thesis presents research on VR-based robot teleoperation, focusing on remote environment visualisation in virtual reality, the effects of remote environment reconstruction scale in virtual reality on the human operator's ability to control the robot, and the operator's visual attention patterns when teleoperating a robot from virtual reality. A VR-based robot teleoperation framework was developed; it is compatible with various robotic systems and cameras, allowing teleoperation and supervised control with any ROS-compatible robot and visualisation of the environment through any ROS-compatible RGB and RGBD cameras. The framework includes mapping, segmentation, tactile exploration, and non-physically-demanding VR interface navigation and controls through any Unity-compatible VR headset and controllers or haptic devices. Point clouds are a common way to visualise remote environments in 3D, but they often suffer from distortions and occlusions, making it difficult to accurately represent objects' textures. This can lead to poor decision-making during teleoperation if objects are inaccurately represented in the VR reconstruction. A study using an end-effector-mounted RGBD camera with OctoMap mapping of the remote environment was conducted to explore the remote environment with fewer point cloud distortions and occlusions while using relatively little bandwidth. Additionally, a tactile exploration study proposed a novel method for visually presenting information about objects' materials in the VR interface, to improve the operator's decision-making and address the challenges of point cloud visualisation. Two studies were conducted to understand the effect of dynamic virtual world scaling on teleoperation flow. The first study investigated rate mode control with constant and variable mapping of the operator's joystick position to the speed (rate) of the robot's end-effector, depending on the virtual world scale. The results showed that variable mapping allowed participants to teleoperate the robot more effectively, but at the cost of increased perceived workload. The second study examined how operators used the virtual world scale in supervised control, comparing participants' chosen scale at the beginning and end of a three-day experiment. The results showed that, as operators became more proficient at the task, they as a group adopted a different virtual world scale, and participants' prior video gaming experience also affected the scale they chose. Similarly, the visual attention study investigated how operators' visual attention changes as they become better at teleoperating a robot using the framework. The results revealed the most important objects in the VR-reconstructed remote environment, as indicated by operators' visual attention patterns, and showed how their visual priorities shifted as they became more proficient. The study also demonstrated that operators' prior video gaming experience affects their ability to teleoperate the robot and their visual attention behaviours.
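    The rate-mode control comparison in the first study can be made concrete with a small sketch of the two joystick-to-velocity mappings: a constant gain versus a gain scaled by the current virtual world scale. The gain value and the linear scaling rule are assumptions for illustration, not the thesis' actual control law.

```python
def constant_rate(joystick, max_speed=0.2):
    """Joystick deflection in [-1, 1] -> end-effector speed in m/s (fixed gain)."""
    return joystick * max_speed

def variable_rate(joystick, world_scale, max_speed=0.2):
    """Scale the gain by the current virtual world scale, so a zoomed-out view
    (large scale) gives faster motion and a zoomed-in view gives finer motion."""
    return joystick * max_speed * world_scale

# Compare the two mappings at half joystick deflection for several world scales.
for scale in (0.5, 1.0, 2.0):
    print(scale, constant_rate(0.5), variable_rate(0.5, scale))
```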

    Machine learning methods for sign language recognition: a critical review and analysis.

    Sign language is an essential tool for bridging the communication gap between hearing and hearing-impaired people. However, the diversity of over 7,000 present-day sign languages, with variability in motion, hand shape, and the position of body parts, makes automatic sign language recognition (ASLR) a complex task. To overcome this complexity, researchers are investigating better ways of developing ASLR systems based on intelligent solutions and have demonstrated remarkable success. This paper analyses the research published on intelligent systems in sign language recognition over the past two decades. A total of 649 publications related to decision support and intelligent systems for sign language recognition (SLR) are extracted from the Scopus database and analysed. The extracted publications are analysed using the bibliometric software VOSViewer to (1) obtain the publications' temporal and regional distributions and (2) map the cooperation networks between affiliations and authors and identify productive institutions in this context. Moreover, techniques for vision-based sign language recognition are reviewed, and the various feature extraction and classification techniques used in SLR to achieve good results are discussed. The literature review shows the importance of incorporating intelligent solutions into sign language recognition systems and reveals that a fully reliable intelligent system for sign language recognition remains an open problem. Overall, this study is expected to facilitate the accumulation and creation of knowledge on intelligent SLR and to provide readers, researchers, and practitioners with a roadmap to guide future directions.
