
    Speech Mode Classification using the Fusion of CNNs and LSTM Networks

    Speech mode classification is an area that has not been as widely explored in the field of sound classification as others such as environmental sounds, music genre, and speaker identification. But what is speech mode? While mode is defined as the way or manner in which something occurs or is expressed or done, speech mode is defined as the style in which speech is delivered by a person. There are some reports on speech mode classification with conventional methods, covering modes such as whispering and normal phonated speech. However, to the best of our knowledge, deep learning-based methods have not been reported in the open literature for this classification scenario. Specifically, in this work we assess the performance of image-based classification algorithms on this challenging speech mode classification problem, including the use of the pre-trained deep neural networks AlexNet, ResNet18, and SqueezeNet. We compare the classification efficiency of a set of deep learning (DL)-based classifiers, and we also assess the impact of different 2D image representations (spectrograms, mel-spectrograms, and their image-based fusion) on classification accuracy. These representations are generated from the original audio signals and used as input to the networks. Next, we compare the accuracy of the DL-based classifiers to a set of machine learning (ML) classifiers that take Mel-Frequency Cepstral Coefficient (MFCC) features as input. Then, after determining the most efficient sampling rate for our classification problem (i.e., 32 kHz), we study the performance of our proposed method of combining CNNs with LSTM (Long Short-Term Memory) networks, using the features extracted by the deep networks of the previous step. We conclude our study by evaluating the role of sampling rate on classification accuracy, generating two sets of 2D image representations, one at 32 kHz and the other at 16 kHz. Experimental results show that, after cross-validation, the accuracy of the DL-based approaches is 15% higher than that of the ML ones, with SqueezeNet yielding an accuracy of more than 91% at 32 kHz, whether we use transfer learning, feature-level fusion, or score-level fusion (92.5%). Our proposed method using LSTMs further increased that accuracy by more than 3%, resulting in an average accuracy of 95.7%.
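
    As a concrete illustration of the pipeline above, the following minimal sketch (not the authors' code; the layer sizes, two-mode label set, and function names are illustrative assumptions) loads audio at 32 kHz, builds a mel-spectrogram image with librosa, and fuses per-frame CNN features with an LSTM in PyTorch:

        import librosa
        import numpy as np
        import torch
        import torch.nn as nn

        def mel_spectrogram_image(path, sr=32000, n_mels=128):
            """Load audio at 32 kHz and convert it to a dB-scaled mel-spectrogram."""
            y, sr = librosa.load(path, sr=sr)
            mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
            return librosa.power_to_db(mel, ref=np.max)

        class CNNLSTMClassifier(nn.Module):
            """CNN front end producing per-frame features, fused by an LSTM."""
            def __init__(self, n_modes=2, hidden=128):  # n_modes is hypothetical
                super().__init__()
                self.cnn = nn.Sequential(
                    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                )
                self.lstm = nn.LSTM(input_size=32 * 32, hidden_size=hidden,
                                    batch_first=True)
                self.head = nn.Linear(hidden, n_modes)

            def forward(self, x):             # x: (batch, 1, 128 mel bins, frames)
                f = self.cnn(x)               # (batch, 32, 32, frames // 4)
                f = f.permute(0, 3, 1, 2)     # time-major sequence of CNN features
                f = f.flatten(2)              # (batch, frames // 4, 32 * 32)
                out, _ = self.lstm(f)
                return self.head(out[:, -1])  # classify from the last time step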

    AudioCNN: Audio Event Classification With Deep Learning Based Multi-Channel Fusion Networks

    Title from PDF of title page, viewed February 25, 2021. Vita. Includes bibliographical references (pages 42-44). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2020. Thesis advisor: Yugyung Lee.

    In recent years there has been growing interest in environmental sound classification, which has a plethora of real-world applications, especially in audio fields such as speech and music. Recent research has shown that spectral-image representations combined with deep learning models outperform standard methods. This thesis designs a fusion system that combines various audio features, including the Spectrogram (SG), Chromagram (CG), and Mel Frequency Cepstral Coefficients (MFCC), for effective environmental sound classification. We propose the AudioCNN model, a fusion network consisting of multiple Convolutional Neural Networks (CNNs) with aggregation methods for the various spectral-image features, together with audio-specific data augmentation techniques. We conducted extensive experiments with benchmark datasets, including UrbanSound8K, ESC-50, and ESC-10, as well as emotion datasets, and obtained state-of-the-art results that outperform previous solutions. The experimental results show that the combined features with lighter CNN models outperform baseline environmental sound classification methods, and the proposed multi-channel fusion network with data augmentation achieves competitive results on the UrbanSound8K dataset compared to existing models. Contents: Introduction -- Background -- Related work -- Methodology -- Results and evaluation -- Conclusion.
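
    A hedged sketch of the multi-feature extraction described above (not the thesis code; the function name and band counts are illustrative), using librosa to produce the three representations that would each feed their own CNN branch before aggregation:

        import librosa
        import numpy as np

        def audio_cnn_inputs(path, sr=22050, n_mfcc=40):
            """Compute the SG, CG, and MFCC representations for one audio clip."""
            y, sr = librosa.load(path, sr=sr)
            sg = librosa.amplitude_to_db(np.abs(librosa.stft(y)))   # spectrogram
            cg = librosa.feature.chroma_stft(y=y, sr=sr)            # chromagram
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # MFCCs
            # Each representation feeds its own CNN branch; branch outputs are
            # then aggregated (e.g., concatenated) ahead of the classifier head.
            return sg, cg, mfcc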

    Contributions à la sonification d’image et à la classification de sons

    The objective of this thesis is to study, on the one hand, the problem of image sonification and to solve it through new models of mapping between the visual and sound domains, and, on the other hand, to study the problem of sound classification and to solve it with methods that have a proven track record in the field of image recognition. Image sonification is the translation of image data (shape, color, texture, objects) into sounds. It is used in vision assistance and image accessibility for visually impaired people. Due to its complexity, an image sonification system that properly conveys image data as sound in an intuitive way is not easy to design. Our first contribution is a new low-level image sonification system that uses a hierarchical visual-feature-based approach to translate most of the properties of an image (color, gradient, edge, texture, region) to the audio domain using musical notes, in a very predictable way that is easily decodable by a human listener. Our second contribution is a high-level sonification Android application that complements the first contribution by translating the objects and semantic content of an image to the audio domain; it also provides a dataset for image sonification. Finally, in the audio domain, our third contribution generalizes the Local Binary Pattern (LBP) to 1D and combines it with audio features for an environmental sound classification task. The proposed method outperforms methods that use handcrafted features with classical machine learning algorithms and is faster than the convolutional neural network methods considered. It represents a better choice when there is data scarcity or minimal computing power.
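
    The 1D generalization of LBP lends itself to a short sketch. The following is one plausible reading of the idea (the thesis may define neighborhoods and histograms differently): each sample is encoded by thresholding its neighbors against it, and a normalized code histogram serves as the descriptor:

        import numpy as np

        def lbp_1d(signal, radius=4):
            """Encode each sample by thresholding its 2*radius neighbors against it."""
            codes = []
            for i in range(radius, len(signal) - radius):
                neighbors = np.concatenate(
                    (signal[i - radius:i], signal[i + 1:i + 1 + radius]))
                bits = (neighbors >= signal[i]).astype(int)
                codes.append(int("".join(map(str, bits)), 2))
            # Normalized histogram of the 2**(2*radius) possible codes.
            n_bins = 2 ** (2 * radius)
            hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
            return hist / max(hist.sum(), 1)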

    Use of Pattern Classification Algorithms to Interpret Passive and Active Data Streams from a Walking-Speed Robotic Sensor Platform

    In order to perform useful tasks for us, robots must have the ability to notice, recognize, and respond to objects and events in their environment. This requires the acquisition and synthesis of information from a variety of sensors. Here we investigate the performance of a number of sensor modalities in an unstructured outdoor environment, including the Microsoft Kinect, a thermal infrared camera, and a coffee-can radar. Special attention is given to acoustic echolocation measurements of approaching vehicles, where an acoustic parametric array propagates an audible signal to the oncoming target and the Kinect microphone array records the reflected backscattered signal. Useful information about the target is hidden inside the noisy time-domain measurements, so the Dynamic Wavelet Fingerprint process (DWFP) is used to create a time-frequency representation of the data. A small-dimensional feature vector is then created for each measurement using an intelligent feature selection process, for use in statistical pattern classification routines. Using our experimentally measured data from real vehicles at 50 m, this process correctly classifies vehicles into one of five classes with 94% accuracy. Fully three-dimensional simulations allow us to study the nonlinear beam propagation and its interaction with real-world targets to improve classification results.
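
    The DWFP itself is more involved than can be shown here; as a stand-in, the following sketch uses a generic continuous wavelet scalogram (via PyWavelets) to illustrate the overall pattern of the pipeline: time-frequency image, small feature vector, statistical classifier. All names and parameters are illustrative assumptions:

        import numpy as np
        import pywt
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        def echo_features(echo, scales=np.arange(1, 31), k=12):
            """Reduce a wavelet time-frequency image to a small feature vector."""
            coeffs, _ = pywt.cwt(echo, scales, 'morl')   # time-frequency image
            band_energy = np.abs(coeffs).mean(axis=1)    # mean energy per scale
            return band_energy[:k] / band_energy.sum()   # compact, normalized

        # X = np.array([echo_features(e) for e in echoes])  # one row per echo
        # clf = LinearDiscriminantAnalysis().fit(X, y)      # y: 5 vehicle classes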

    HYPERSPECTRAL IMAGING AND PATTERN RECOGNITION TECHNOLOGIES FOR REAL TIME FRUIT SAFETY AND QUALITY INSPECTION

    Hyperspectral band selection and band combination have become powerful tools and have gained enormous interest among researchers. An important task in hyperspectral data processing is to reduce the redundancy of the spectral and spatial information without losing the valuable details needed for the subsequent detection, discrimination, and classification processes. An integrated principal component analysis (PCA) and Fisher linear discriminant (FLD) method has been developed for feature-band selection, and other pattern recognition techniques have been applied and compared with the developed method. Results on different types of defects in cucumber and apple samples show that the integrated PCA-FLD method outperforms PCA, FLD, and canonical discriminant methods when they are used separately for classification. The integrated method adds a new tool to the multivariate analysis of hyperspectral images and can be extended to other hyperspectral imaging applications. Dimensionality reduction not only serves as the first step of data processing, leading to a significant decrease in computational complexity in the subsequent procedures, but also acts as a research tool for determining the optimal spectra required for online automatic fruit inspection. In this study, the hyperspectral analysis shows that the near-infrared band at 753 nm is best for detecting apple defects; when applied to online apple defect inspection, a good-apple detection rate of over 98% is achieved. However, commercially available apple sorting and inspection machines cannot effectively solve the stem/calyx problem involved in automatic apple defect detection. In this study, a dual-spectrum NIR/MIR sensing method is applied; this technique can effectively distinguish true defects from stems and calyxes, which leads to a potential solution to the problem. The results of this study will advance the technology of fruit safety and quality inspection and improve the cost-effectiveness of fruit packing processes.
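
    As a generic illustration of coupling PCA with the Fisher linear discriminant (the integrated PCA-FLD method in the study is more specific than this off-the-shelf pipeline), one might chain the two with scikit-learn; the component count is an assumption:

        from sklearn.pipeline import make_pipeline
        from sklearn.decomposition import PCA
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        # X: (n_samples, n_bands) hyperspectral pixels, y: defect class labels
        pipeline = make_pipeline(
            PCA(n_components=20),          # strip spectral redundancy first
            LinearDiscriminantAnalysis(),  # Fisher criterion for class separation
        )
        # pipeline.fit(X_train, y_train); pipeline.score(X_test, y_test)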

    A Cognitive Approach to Mobile Robot Environment Mapping and Path Planning

    This thesis presents a novel neurophysiologically based navigation system that uses less memory and power than other neurophysiologically based systems, as well as traditional navigation systems performing similar tasks. This is accomplished by emulating the rodent's specialized navigation and spatial-awareness brain cells, found in and around the hippocampus and entorhinal cortex, at a higher level of abstraction than previously used neural representations. Specifically, the focus of this research is on replicating place cells, boundary cells, head direction cells, and grid cells using data structures and logic driven by each cell's interpreted behavior. This method is used along with a unique multimodal source model for place-cell activation to create a cognitive map. Path planning is performed using a combination of Euclidean-distance path checking, goal memory, and the A* algorithm. Localization is accomplished using simple, low-power sensors such as a camera, ultrasonic sensors, motor encoders, and a gyroscope. The place-code data structures are initialized as the mobile robot finds goal locations and other unique locations, and are then linked as paths between goal locations as goals are found during exploration. The place code creates a hybrid cognitive map of metric and topological data; in doing so, much less memory is needed to represent the robot's roaming environment compared to traditional mapping methods such as occupancy grids. A comparison of the memory and processing savings is presented, along with the functional similarities of our design to the rodent's specialized navigation cells.
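
    One way to picture the hybrid metric-topological place code with A* planning is the sketch below; the data structures are hypothetical stand-ins for the thesis design, with linked place cells carrying metric coordinates for the Euclidean heuristic:

        import heapq, math

        class PlaceCell:
            def __init__(self, name, x, y):
                self.name, self.x, self.y = name, x, y
                self.links = {}                  # neighboring cell -> path cost

        def astar(start, goal):
            """A* over linked place cells with a Euclidean-distance heuristic."""
            h = lambda c: math.hypot(c.x - goal.x, c.y - goal.y)
            frontier = [(h(start), 0.0, id(start), start, [start.name])]
            visited = set()
            while frontier:
                _, g, _, cell, path = heapq.heappop(frontier)
                if cell is goal:
                    return path
                if cell.name in visited:
                    continue
                visited.add(cell.name)
                for nxt, cost in cell.links.items():
                    heapq.heappush(frontier, (g + cost + h(nxt), g + cost,
                                              id(nxt), nxt, path + [nxt.name]))
            return None

        # dock = PlaceCell("dock", 0, 0); hall = PlaceCell("hall", 2, 0)
        # dock.links[hall] = hall.links[dock] = 2.0
        # astar(dock, hall)  # -> ['dock', 'hall']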

    Pattern Recognition

    Pattern recognition is a very wide research field. It involves factors as diverse as sensors, feature extraction, pattern classification, decision fusion, and applications, among others. The signals processed are commonly one-, two-, or three-dimensional; the processing is done in real time or takes hours and days; some systems look for one narrow object class, while others search huge databases for entries with at least a small amount of similarity. No single person can claim expertise across the whole field, which develops rapidly, updates its paradigms, and encompasses several philosophical approaches. This book reflects this diversity by presenting a selection of recent developments within the area of pattern recognition and related fields. It covers theoretical advances in classification and feature extraction as well as application-oriented work. The authors of these 25 contributions present and advocate recent achievements of their research related to the field of pattern recognition.

    The perceptual flow of phonetic feature processing


    Irish Machine Vision and Image Processing Conference Proceedings 2017
