
    Face mask recognition from audio: the MASC database and an overview on the mask challenge

    The sudden outbreak of COVID-19 posed tough challenges for the field of biometrics, owing to the virus's spread via physical contact and the regulations requiring face masks. Under these constraints, voice biometrics offer a suitable contact-less solution; they can benefit from models that classify whether a speaker is wearing a mask or not. This article reviews the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 COMputational PARalinguistics challengE (ComParE), which focused on the following classification task: given an audio chunk of a speaker, classify whether the speaker is wearing a mask or not. First, we report the collection of the Mask Augsburg Speech Corpus (MASC) and the baseline approaches used to solve the problem, achieving a performance of [Formula: see text] Unweighted Average Recall (UAR). We then summarise the methodologies explored in the submitted and accepted papers, which mainly followed two common patterns: (i) phonetic-based audio features, or (ii) spectrogram representations of audio combined with Convolutional Neural Networks (CNNs) typically used in image processing. Most approaches enhance their models by adopting ensembles of different models and by increasing the size of the training data with various techniques. We review and discuss the results of the participants of this sub-challenge, where the winner scored a UAR of [Formula: see text]. Moreover, we present the results of fusing the approaches, leading to a UAR of [Formula: see text]. Finally, we present a smartphone app that can be used as a proof-of-concept demonstration to detect in real time whether users are wearing a face mask; we also benchmark the run-time of the best models.
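
    To make the spectrogram-plus-CNN pattern concrete, below is a minimal, hypothetical sketch in PyTorch/torchaudio of a binary mask/no-mask classifier operating on log-mel spectrograms; the architecture, sample rate and layer sizes are illustrative assumptions and do not correspond to the challenge baseline or to any submitted system.

    # Minimal sketch of the spectrogram + CNN pattern (assumptions, not the ComParE baseline).
    import torch
    import torch.nn as nn
    import torchaudio

    class MaskSpectrogramCNN(nn.Module):
        def __init__(self, n_mels: int = 64):
            super().__init__()
            # 16 kHz audio chunk -> log-mel spectrogram (1 x n_mels x frames)
            self.melspec = torchaudio.transforms.MelSpectrogram(
                sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels)
            self.to_db = torchaudio.transforms.AmplitudeToDB()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d(1))
            self.classifier = nn.Linear(32, 2)  # mask vs. no mask

        def forward(self, waveform):
            # waveform: (batch, samples)
            x = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (batch, 1, n_mels, frames)
            x = self.cnn(x).flatten(1)
            return self.classifier(x)

    model = MaskSpectrogramCNN()
    logits = model(torch.randn(4, 16000))  # four one-second chunks -> (4, 2) logits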

    Failure Mode Identification of Elastomer for Well Completion Systems using Mask R-CNN


    Semantic Segmentation Enhanced Transformer Model for Human Attention Prediction

    Saliency prediction aims to predict the attention distribution of human eyes given an RGB image. Most recent state-of-the-art methods are based on deep image feature representations from traditional CNNs. However, traditional convolutions cannot capture the global features of an image well due to their small kernel sizes. Moreover, the high-level factors that closely correlate with human visual perception, e.g., objects, color, and light, are not considered. Motivated by these observations, we propose a Transformer-based method with semantic segmentation as an additional learning objective. The Transformer can capture more global cues of the image. In addition, simultaneously learning object segmentation simulates human visual perception, which we verify through our investigation of human gaze control in cognitive science. We build an extra decoder for the subtask, and the multiple tasks share the same Transformer encoder, forcing it to learn from multiple feature spaces. In practice, we find that simply adding the subtask can confuse the learning of the main task, hence a Multi-task Attention Module is proposed to handle the feature interaction between the multiple learning targets. Our method achieves competitive performance compared to other state-of-the-art methods.
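
    As a rough illustration of the shared-encoder, two-head design described above, the sketch below pairs a Transformer encoder with a saliency head and a segmentation head, plus a simple gating layer standing in for the idea of mediating feature interaction between tasks; the names, shapes and the gating mechanism are assumptions and not the paper's actual Multi-task Attention Module.

    # Hypothetical sketch: one shared Transformer encoder, two task heads.
    import torch
    import torch.nn as nn

    class TwoTaskSaliencyModel(nn.Module):
        def __init__(self, dim: int = 256, n_classes: int = 21):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            # A simple gate re-weighting shared features for the saliency head,
            # standing in for a task-interaction module.
            self.task_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
            self.saliency_head = nn.Linear(dim, 1)              # per-token saliency value
            self.segmentation_head = nn.Linear(dim, n_classes)  # per-token class logits

        def forward(self, tokens):
            # tokens: (batch, n_tokens, dim) patch embeddings from an image backbone
            shared = self.encoder(tokens)
            saliency = self.saliency_head(self.task_gate(shared) * shared)
            segmentation = self.segmentation_head(shared)
            return saliency, segmentation

    model = TwoTaskSaliencyModel()
    sal, seg = model(torch.randn(2, 196, 256))  # sal: (2, 196, 1), seg: (2, 196, 21)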

    High-Level Descriptors for Fall Event Detection Supported by a Multi-Stream Network

    Accurate video classification has been in increasing demand, and especially for detecting dangerous situations a quick response is crucial to avoid more serious consequences. In this work, we target the video classification of falls. Our study focuses on high-level descriptors able to correctly characterize the event; these descriptors serve as inputs to a multi-stream architecture of VGG-16 networks. Our proposal is therefore based on analysing the best combination of extracted high-level features for the binary classification of videos. The approach was tested on three well-known datasets and has proven to yield results comparable to other, more computationally expensive methods found in the literature.
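
    A minimal sketch of such a multi-stream setup, assuming PyTorch/torchvision: each high-level descriptor is rendered as an image-like input, scored by its own VGG-16, and the streams are fused by averaging their logits. The number of streams and the late-fusion rule are illustrative assumptions rather than the authors' exact configuration.

    # Hypothetical multi-stream VGG-16 with late fusion by averaging logits.
    import torch
    import torch.nn as nn
    from torchvision.models import vgg16

    class MultiStreamVGG(nn.Module):
        def __init__(self, n_streams: int = 3):
            super().__init__()
            self.streams = nn.ModuleList()
            for _ in range(n_streams):
                net = vgg16(weights=None)
                net.classifier[6] = nn.Linear(4096, 2)  # fall vs. no fall
                self.streams.append(net)

        def forward(self, inputs):
            # inputs: list of (batch, 3, 224, 224) tensors, one per descriptor stream
            logits = [net(x) for net, x in zip(self.streams, inputs)]
            return torch.stack(logits).mean(dim=0)  # late fusion by averaging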

    Neural Information Processing Techniques for Skeleton-Based Action Recognition

    Human action recognition is one of the core research problems in human-centered computing and computer vision. This problem lays the technical foundations for a wide range of applications, such as human-robot interaction, virtual reality, sports analysis, and so on. Recently, skeleton-based action recognition, as a subarea of action recognition, has been swiftly accumulating attention and popularity. The task is to recognize actions from sequences of human articulation points. Compared with other data modalities, 3D human skeleton representations have many desirable characteristics, including succinctness, robustness, and racial impartiality. Current research on skeleton-based action recognition primarily concentrates on designing new spatial and temporal neural network operators to extract action features more thoroughly. In this thesis, by contrast, we aim to propose methods that can be combined with existing approaches: we seek to strengthen current algorithms collaboratively rather than compete with them. To this end, we propose five techniques and one large-scale human skeleton dataset. First, we fuse higher-order spatial features, in the form of angular encodings, into modern architectures to robustly capture the relationships between joints and body parts. Many skeleton-based action recognizers are confused by actions that have similar motion trajectories; the proposed angular features robustly capture the relationships between joints and body parts, achieving new state-of-the-art accuracy on two large benchmarks, NTU60 and NTU120, while using fewer parameters and less run time. Second, we design two temporal accessories that help existing skeleton-based action recognizers capture motion patterns more richly. Specifically, the two proposed modules alleviate the adverse influence of signal noise and guide networks to explicitly capture the sequence's chronological order; they enable a simple skeleton-based action recognizer to achieve new state-of-the-art (SOTA) accuracy on two large benchmark datasets. Third, we devise a new form of graph neural network as a potential backbone for extracting topological information from skeletonized human sequences. The proposed graph neural network is capable of learning relative positions between the nodes within a graph, substantially improving performance on various synthetic and real-world graph datasets while scaling stably. Fourth, we propose an information-theoretic technique to address imbalanced datasets, i.e., datasets whose distribution of class labels is non-uniform. The proposed method improves classification accuracy when the training dataset is imbalanced, and our result provides an alternative view: neural network classifiers are mutual information estimators. Fifth, we present a neural crowdsourcing method to correct human annotation errors. When annotating skeleton-based actions, human annotators may not agree on a single action category because of ambiguities between the skeleton motion trajectories of different actions; the proposed method helps unify the different annotations into a single label. Sixth, we collect ANUBIS, a large-scale human skeleton dataset, for benchmarking existing methods and defining new problems towards the commercialization of skeleton-based action recognition; using ANUBIS, we evaluate the performance of current skeleton-based action recognizers.
    At the end of this thesis, we summarize our proposed methods and identify four technical problems that may need to be solved before skeleton-based action recognition can be commercialized in practice.
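
    As an illustration of higher-order angular features for skeleton data, the snippet below computes, for each frame and joint, the cosine of the angle formed at that joint by two reference joints; the choice of reference pairs is a placeholder, and the thesis's exact angular encoding may differ.

    # Hypothetical angular-feature computation for a skeleton sequence (PyTorch).
    import torch

    def joint_angles(xyz, pairs):
        """xyz: (frames, joints, 3) coordinates; pairs: (joints, 2) reference-joint indices."""
        a = xyz[:, pairs[:, 0]] - xyz   # vector from each joint to its first reference joint
        b = xyz[:, pairs[:, 1]] - xyz   # vector from each joint to its second reference joint
        cos = (a * b).sum(-1) / (a.norm(dim=-1) * b.norm(dim=-1) + 1e-6)
        return cos.clamp(-1.0, 1.0)     # (frames, joints): cosine of the angle at each joint

    xyz = torch.randn(64, 25, 3)            # 64 frames, 25 joints (e.g. NTU-style skeletons)
    pairs = torch.randint(0, 25, (25, 2))   # placeholder reference pairs
    angles = joint_angles(xyz, pairs)       # can be concatenated with xyz as extra channels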

    Sparse Neural Network Training with In-Time Over-Parameterization


    Deep Learning Techniques for Electroencephalography Analysis

    In this thesis we design deep learning techniques for training deep neural networks on electroencephalography (EEG) data, focusing on two problems, EEG-based motor imagery decoding and EEG-based affect recognition, and addressing the challenges associated with each. Regarding the problem of motor imagery (MI) decoding, we first consider the various kinds of domain shifts in EEG signals caused by inter-individual differences (e.g. brain anatomy, personality and cognitive profile). These domain shifts render multi-subject training a challenging task and impede robust cross-subject generalization. We build a two-stage model-ensemble architecture and propose two objectives to train it, combining the strengths of curriculum learning and collaborative training. Our subject-independent experiments on the large Physionet and OpenBMI datasets verify the effectiveness of our approach. Next, we explore the use of the spatial covariance of EEG signals through alignment techniques, with the goal of learning domain-invariant representations. We introduce a Riemannian framework that concurrently performs covariance-based signal alignment and data augmentation while training a convolutional neural network (CNN) on EEG time-series. Experiments on the BCI IV-2a dataset show that our method outperforms traditional alignment by inducing regularization on the weights of the CNN. We also study the problem of EEG-based affect recognition, inspired by works suggesting that emotions can be expressed in relative terms, i.e., through ordinal comparisons between different affective-state levels. We propose treating data samples in a pairwise manner to infer the ordinal relation between their corresponding affective-state labels, as an auxiliary training objective. We incorporate this objective into a deep network architecture that we jointly train on the tasks of sample-wise classification and pairwise ordinal ranking. We evaluate our method on the affective datasets DEAP and SEED and obtain performance improvements over deep networks trained without the additional ranking objective.
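
    To sketch the pairwise ordinal objective, assuming PyTorch: given the network's scalar scores for two EEG samples, a margin ranking loss encourages the sample with the higher affective-state label to receive the higher score. The function name, the masking of tied labels and the margin value are illustrative assumptions, not the thesis's exact formulation.

    # Hypothetical pairwise ordinal-ranking auxiliary loss.
    import torch
    import torch.nn.functional as F

    def pairwise_ordinal_loss(score_a, score_b, label_a, label_b, margin=0.5):
        # target = +1 if sample a has the higher affect label, -1 if lower;
        # pairs with equal labels carry no ordinal information and are skipped.
        target = torch.sign(label_a.float() - label_b.float())
        mask = target != 0
        if not mask.any():
            return score_a.sum() * 0.0  # no informative pairs in this batch
        return F.margin_ranking_loss(score_a[mask], score_b[mask], target[mask], margin=margin)

    scores_a, scores_b = torch.randn(8), torch.randn(8)                      # scalar network scores
    labels_a, labels_b = torch.randint(0, 3, (8,)), torch.randint(0, 3, (8,))  # affect levels
    loss = pairwise_ordinal_loss(scores_a, scores_b, labels_a, labels_b)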