343 research outputs found
Feature Learning from Spectrograms for Assessment of Personality Traits
Several methods have recently been proposed to analyze speech and
automatically infer the personality of the speaker. These methods often rely on
prosodic and other hand crafted speech processing features extracted with
off-the-shelf toolboxes. To achieve high accuracy, numerous features are
typically extracted using complex and highly parameterized algorithms. In this
paper, a new method based on feature learning and spectrogram analysis is
proposed to simplify the feature extraction process while maintaining a high
level of accuracy. The proposed method learns a dictionary of discriminant
features from patches extracted in the spectrogram representations of training
speech segments. Each speech segment is then encoded using the dictionary, and
the resulting feature set is used to perform classification of personality
traits. Experiments indicate that the proposed method achieves state-of-the-art
results with a significant reduction in complexity when compared to the most
recent reference methods. The number of features and the difficulties linked to
the feature extraction process are greatly reduced, as only one type of
descriptor is used, for which the 6 parameters can be tuned automatically. In
contrast, the simplest reference method uses 4 types of descriptors to which 6
functionals are applied, resulting in over 20 parameters to be tuned.
Comment: 12 pages, 3 figures
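The patch-and-dictionary encoding described in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the dictionary is learned or how segments are encoded, so this example assumes a fixed, hand-made dictionary and a hard nearest-atom assignment as stand-ins. All function names, the toy spectrogram, and the patch size are hypothetical.

```python
from collections import Counter


def extract_patches(spec, size, step):
    """Slide a size x size window over a 2-D spectrogram (list of rows)."""
    rows, cols = len(spec), len(spec[0])
    patches = []
    for r in range(0, rows - size + 1, step):
        for c in range(0, cols - size + 1, step):
            # Flatten each patch into a vector so it can be compared to atoms.
            patches.append(
                [spec[r + i][c + j] for i in range(size) for j in range(size)]
            )
    return patches


def nearest_atom(patch, dictionary):
    """Index of the dictionary atom with the smallest squared distance."""
    def sq_dist(atom):
        return sum((p - a) ** 2 for p, a in zip(patch, atom))
    return min(range(len(dictionary)), key=lambda k: sq_dist(dictionary[k]))


def encode_segment(spec, dictionary, size=2, step=2):
    """Encode a segment as a normalized histogram of nearest-atom counts.

    Assumption: hard assignment stands in for the paper's actual encoder.
    """
    patches = extract_patches(spec, size, step)
    counts = Counter(nearest_atom(p, dictionary) for p in patches)
    return [counts[k] / len(patches) for k in range(len(dictionary))]


# Toy 4x4 "spectrogram" (frequency x time) and a hand-made 2-atom dictionary.
spectrogram = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.0, 0.1],
]
dictionary = [
    [0.0, 0.0, 0.0, 0.0],  # hypothetical "quiet" atom
    [1.0, 1.0, 1.0, 1.0],  # hypothetical "energetic" atom
]
features = encode_segment(spectrogram, dictionary)
```

In this toy case the feature vector has one entry per dictionary atom, and it is this fixed-length vector that a classifier would consume. In practice the dictionary would be learned from training patches instead of being written by hand.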
Music Genre Classification with ResNet and Bi-GRU Using Visual Spectrograms
Music recommendation systems have emerged as a vital component to enhance
user experience and satisfaction for music streaming services, which
dominate music consumption. The key challenge in improving these recommender
systems lies in comprehending the complexity of music data, specifically for
the underpinning music genre classification. The limitations of manual genre
classification have highlighted the need for a more advanced system, namely the
Automatic Music Genre Classification (AMGC) system. While traditional machine
learning techniques have shown potential in genre classification, they heavily
rely on manually engineered features and feature selection, failing to capture
the full complexity of music data. On the other hand, deep learning
classification architectures like the traditional Convolutional Neural Networks
(CNN) are effective in capturing the spatial hierarchies but struggle to
capture the temporal dynamics inherent in music data. To address these
challenges, this study uses visual spectrograms as input and proposes a hybrid
model that combines the strengths of the Residual neural Network (ResNet) and
the Gated Recurrent Unit (GRU). This model is designed to provide a more
comprehensive analysis of music data, and hence potentially more accurate genre
classification, offering the potential to improve music recommender systems.
Pattern recognition in facial expressions: algorithms and applications (original title: Reconhecimento de padrões em expressões faciais: algoritmos e aplicações)
Advisor: Hélio Pedrini. Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Emotion recognition has become a relevant research topic for the scientific community, since it plays an essential role in the continuous improvement of human-computer interaction systems. It can be applied in various areas, for instance medicine, entertainment, surveillance, biometrics, education, social networks, and affective computing. There are some open challenges related to the development of emotion systems based on facial expressions, such as data that reflect more spontaneous emotions and real scenarios. In this doctoral dissertation, we propose different methodologies for the development of emotion recognition systems based on facial expressions, as well as their applicability to solving other similar problems.
The first is an emotion recognition methodology for occluded facial expressions based on the Census Transform Histogram (CENTRIST). Occluded facial expressions are reconstructed using an algorithm based on Robust Principal Component Analysis (RPCA). Extraction of facial expression features is then performed by CENTRIST, as well as Local Binary Patterns (LBP), Local Gradient Coding (LGC), and an LGC extension. The generated feature space is reduced by applying Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) algorithms are used for classification. This method reached competitive accuracy rates for occluded and non-occluded facial expressions.
The second is a dynamic facial expression recognition method based on Visual Rhythms (VR) and Motion History Images (MHI), such that a fusion of both descriptors encodes appearance, shape, and motion information of the video sequences. For feature extraction, Weber Local Descriptor (WLD), CENTRIST, Histogram of Oriented Gradients (HOG), and Gray-Level Co-occurrence Matrix (GLCM) are employed. This approach offers a new direction for dynamic facial expression recognition and an analysis of the relevance of facial parts.
The third is an effective method for audio-visual emotion recognition based on speech and facial expressions. The methodology involves a hybrid neural network to extract audio and visual features from videos. For audio extraction, a Convolutional Neural Network (CNN) based on the log Mel-spectrogram is used, whereas a CNN built on the Census Transform is employed for visual extraction. The audio and visual features are reduced by PCA and LDA, and classified through KNN, SVM, Logistic Regression (LR), and Gaussian Naïve Bayes (GNB). This approach achieves competitive recognition rates, especially on a spontaneous data set.
The fourth investigates the problem of detecting Down syndrome from photographs. A geometric descriptor is proposed to extract facial features. Experiments performed on a public data set show the effectiveness of the developed methodology. The last methodology addresses the recognition of genetic disorders in photographs. This method extracts facial features using deep neural network features and anthropometric measurements. Experiments are conducted on a public data set, achieving competitive recognition rates.
Doctorate. Computer Science. Doctor of Computer Science. 140532/2019-6. CNPQ. CAPE
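As an illustration of the Census Transform Histogram (CENTRIST) descriptor named in the abstract above, here is a minimal sketch for a grayscale image stored as a list of rows. The bit ordering and the comparison convention (bit set when the center is greater than or equal to the neighbor) are assumptions, and the function names are hypothetical. The thesis pipeline (RPCA reconstruction, PCA/LDA reduction, KNN/SVM classification) is not reproduced here.

```python
def census_transform(image):
    """Return the 8-bit census code of every interior pixel of a 2-D image."""
    # Neighbor offsets in row-major order; this bit order is an assumed convention.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    codes = []
    for r in range(1, len(image) - 1):
        row_codes = []
        for c in range(1, len(image[0]) - 1):
            code = 0
            for dr, dc in offsets:
                # Assumed convention: bit is 1 when center >= neighbor.
                bit = 1 if image[r][c] >= image[r + dr][c + dc] else 0
                code = (code << 1) | bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def centrist_histogram(image):
    """256-bin histogram of census codes, used as the feature vector."""
    hist = [0] * 256
    for row in census_transform(image):
        for code in row:
            hist[code] += 1
    return hist


# Toy 3x3 image: only the center pixel has a full 8-neighborhood.
image = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
codes = census_transform(image)
hist = centrist_histogram(image)
```

Because each code depends only on intensity comparisons, the descriptor is unchanged by any monotonic brightness shift of the image.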
Non-Facial Video Spatiotemporal Forensic Analysis Using Deep Learning Techniques
Digital content manipulation software makes it easy for people to edit recorded video or audio content. To prevent the unethical use of such readily available editing tools, digital multimedia forensics is becoming increasingly important. This study therefore aims to identify whether the video and audio of a given piece of digital content are fake or real. For temporal video forgery detection, convolutional 3D layers are used to build a model that identifies temporal forgeries with an average accuracy of 85% on the validation dataset. Audio forgery is identified using a pre-trained ResNet-34 model and a transfer learning approach. The proposed model achieves an accuracy of 99% with 0.3% validation loss on the validation part of the logical access dataset, compared with the 90-95% validation accuracy of earlier models.
Pulse Doppler Radar Target Recognition Using a Two-Stage SVM Procedure
It is possible to detect and classify moving and stationary targets using ground surveillance pulse-Doppler radars (PDRs). A two-stage support vector machine (SVM) based target classification scheme is described here. The first stage estimates the most descriptive temporal segment of the radar echo signal, and the second stage classifies the target signal using the selected temporal segment. Mel-frequency cepstral coefficients of radar echo signals are used as feature vectors in both stages. The proposed system is compared with covariance and Gaussian mixture model (GMM) based classifiers. The effects of the window duration and the number of feature parameters on classification performance are also investigated. Experimental results are presented.
Domestic Activities Classification from Audio Recordings Using Multi-scale Dilated Depthwise Separable Convolutional Network
Domestic activities classification (DAC) from audio recordings aims at
classifying audio recordings into pre-defined categories of domestic
activities, which is an effective way to estimate the daily activities
performed in a home environment. In this paper, we propose a method for DAC from
audio recordings using a multi-scale dilated depthwise separable convolutional
network (DSCN). The DSCN is a lightweight neural network with a small number of
parameters and is thus suitable for deployment on portable terminals with limited
computing resources. To expand the receptive field without increasing the number
of the DSCN's parameters, dilated convolution is used in place of normal
convolution, further improving the DSCN's performance. In addition, the embeddings
of various scales learned by the dilated DSCN are concatenated as a multi-scale
embedding for representing property differences among various classes of
domestic activities. Evaluation on a public dataset from Task 5 of the 2018
challenge on Detection and Classification of Acoustic Scenes and Events
(DCASE-2018) shows that both dilated convolution and multi-scale
embedding contribute to the performance improvement of the proposed method, and
that the proposed method outperforms methods based on state-of-the-art
lightweight networks in terms of classification accuracy.
Comment: 5 pages, 2 figures, 4 tables. Accepted for publication in IEEE
MMSP202
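The two ideas in the abstract above, a wider receptive field at the same parameter count and the small size of depthwise separable layers, can be checked with a short sketch. This is a generic illustration, not the paper's network. The function names, the toy signal, and the channel sizes (64 in, 128 out) are assumptions, and bias terms are ignored in the parameter counts.

```python
def receptive_span(kernel_size, dilation):
    """Number of input positions spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D dilated convolution (cross-correlation form, no padding)."""
    span = receptive_span(len(kernel), dilation)
    return [
        sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k filter per input channel plus 1 x 1 pointwise mixing."""
    return k * k * c_in + c_in * c_out


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # the same three weights are used in both calls
plain = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # each output sees 5 samples
```

With three weights in both cases, the dilated kernel spans five input samples instead of three. For the assumed 3 x 3 layer with 64 input and 128 output channels, the separable form needs 8,768 weights against 73,728 for the standard form.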
A survey on artificial intelligence-based acoustic source identification
Acoustic Source Identification (ASI), the process of identifying noise sources, has attracted increasing attention in recent years. ASI technology can be used for surveillance, monitoring, and maintenance applications in a wide range of sectors, such as defence, manufacturing, healthcare, and agriculture. Acoustic signature analysis and pattern recognition remain the core technologies for noise source identification. Manual identification of acoustic signatures, however, has become increasingly challenging as dataset sizes grow. As a result, the use of Artificial Intelligence (AI) techniques for identifying noise sources has become increasingly relevant and useful. In this paper, we provide a comprehensive review of AI-based acoustic source identification techniques. We analyze the strengths and weaknesses of AI-based ASI processes and the associated methods proposed in the literature. Additionally, we present a detailed survey of ASI applications in machinery, underwater applications, environment/event source recognition, healthcare, and other fields. We also highlight relevant research directions.