
    Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection

    Speech Activity Detection (SAD) plays an important role in mobile communications and automatic speech recognition (ASR). Developing efficient SAD systems for real-world applications is a challenging task due to the presence of noise. We propose a new approach to SAD in which we treat it as a two-dimensional multilabel image classification problem. To classify the audio segments, we compute their Short-Time Fourier Transform spectrograms and classify them with a Convolutional Recurrent Neural Network (CRNN), an architecture traditionally used in image recognition. Our CRNN uses a sigmoid activation function, max-pooling in the frequency domain, and a convolutional operation as a moving-average filter to remove misclassified spikes. On the development set of Task 1 of the 2019 Fearless Steps Challenge, our system achieved a decision cost function (DCF) of 2.89%, a 66.4% improvement over the baseline. Moreover, it achieved a DCF score of 3.318% on the evaluation dataset of the challenge, ranking first among all submissions.
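    As a rough illustration of this kind of pipeline, the sketch below (assuming PyTorch; the layer sizes, GRU width, and smoothing kernel length are illustrative assumptions, not the paper's exact configuration) shows a spectrogram CRNN with frequency-only max-pooling, a sigmoid output, and a 1D convolution acting as a moving-average filter over frame-level probabilities:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Spectrogram classifier: conv front-end, GRU, per-frame sigmoid output."""
    def __init__(self, n_bins=128, hidden=64):
        super().__init__()
        # Convolutional front-end; pooling only along the frequency axis,
        # so the time resolution of the label sequence is preserved.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 1)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(4, 1)),
        )
        self.rnn = nn.GRU(64 * (n_bins // 16), hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, spec):                      # spec: (batch, 1, freq, time)
        x = self.conv(spec)                       # (batch, ch, freq', time)
        x = x.permute(0, 3, 1, 2).flatten(2)      # (batch, time, ch * freq')
        x, _ = self.rnn(x)
        return torch.sigmoid(self.fc(x)).squeeze(-1)  # per-frame speech prob.

def smooth(probs, k=9):
    """Moving-average filter (a 1D convolution) to remove spiky errors."""
    kernel = torch.ones(1, 1, k) / k
    return nn.functional.conv1d(probs.unsqueeze(1), kernel,
                                padding=k // 2).squeeze(1)

probs = CRNN()(torch.randn(2, 1, 128, 300))       # two 300-frame spectrograms
decisions = smooth(probs) > 0.5                   # smoothed speech/non-speech
```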

    Machine Learning for Human Activity Detection in Smart Homes

    Recognizing human activities in domestic environments from audio and active power consumption sensors is a challenging task: on the one hand, environmental sound signals are multi-source, heterogeneous, and time-varying; on the other hand, active power consumption varies significantly across electrical appliances of similar type. Many systems have been proposed to process environmental sound signals for event detection in ambient assisted living applications. Typically, these systems use feature extraction, selection, and classification. However, despite major advances, several important questions remain unanswered, especially in real-world settings. Part of this thesis contributes to the body of knowledge in the field by addressing the following problems for ambient sounds recorded in various real-world kitchen environments: 1) which features and which classifiers are most suitable in the presence of background noise? 2) what is the effect of signal duration on recognition accuracy? 3) how do the SNR and the distance between the microphone and the audio source affect recognition accuracy in an environment in which the system was not trained? We show that for systems using traditional classifiers, it is beneficial to combine gammatone frequency cepstral coefficients and discrete wavelet transform coefficients and to use a gradient boosting classifier. For systems based on deep learning, we consider 1D and 2D CNNs using mel-spectrogram energies and mel-spectrogram images as inputs, respectively, and show that the 2D CNN outperforms the 1D CNN. We obtained competitive classification results for two such systems and validated the performance of our algorithms on public datasets (the Google Brain/TensorFlow Speech Recognition Challenge and the 2017 Detection and Classification of Acoustic Scenes and Events Challenge).

    Regarding energy-based human activity recognition in a household environment, machine learning techniques are applied to infer the state of household appliances from their energy consumption data, and rule-based scenarios that exploit these states are used to detect human activity. Since most activities within a house are related to the operation of an electrical appliance, this unimodal approach has a significant advantage, relying only on inexpensive smart plugs and smart meters for each appliance. This part of the thesis proposes the use of unobtrusive and easy-to-install tools (smart plugs) for data collection, together with a decision engine that combines energy signal classification using dominant classifiers (compared in advance via grid search) and a probabilistic measure of appliance usage. It also helps preserve the privacy of the resident, since all activities are stored in a local database.

    DNNs have received great research interest in the field of computer vision. In this thesis we adapted different architectures to the problem of human activity recognition. We analyze the quality of the extracted features, and more specifically how model architectures and parameters affect the ability of the features automatically extracted by DNNs to separate activity classes in the final feature space. Additionally, the architectures applied to our main problem were also applied to text classification, in which we treat the input text as an image and apply 2D CNNs to learn the local and global semantics of sentences from variations in the visual patterns of words. This work serves as a first step towards creating a dialogue agent that would not require any natural language preprocessing. Finally, since in many domestic environments human speech is present alongside other environmental sounds, we developed a Convolutional Recurrent Neural Network to separate the sound sources and applied novel post-processing filters in order to obtain an end-to-end noise-robust system. Our algorithm ranked first in the Apollo-11 Fearless Steps Challenge.

    This work was supported by the Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 676157, project ACROSSIN.
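    As a sketch of the feature combination recommended above for the traditional-classifier pipeline (assuming librosa, PyWavelets, and scikit-learn; the MFCC front-end below is an explicit stand-in for the gammatone frequency cepstral coefficients actually used in the thesis):

```python
import numpy as np
import librosa
import pywt
from sklearn.ensemble import GradientBoostingClassifier

def cepstral_features(y, sr):
    # Stand-in cepstral front-end: MFCCs via librosa. The thesis uses
    # gammatone frequency cepstral coefficients (GFCC); swap in a gammatone
    # filterbank implementation here to reproduce that setup.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)                        # average over frames

def dwt_features(y, wavelet="db4", level=4):
    # Mean and standard deviation of each discrete-wavelet sub-band
    coeffs = pywt.wavedec(y, wavelet, level=level)
    return np.array([s for c in coeffs for s in (c.mean(), c.std())])

def clip_vector(y, sr):
    # One fixed-length vector per clip: cepstral + wavelet views combined
    return np.concatenate([cepstral_features(y, sr), dwt_features(y)])

# X = np.stack([clip_vector(y, sr) for y, sr in clips]); labels = ...
# GradientBoostingClassifier().fit(X, labels), then .predict() on new clips.
```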

    Digital Signal Processing for The Development of Deep Learning-Based Speech Recognition Technology

    This research discusses digital signal processing in the context of developing deep learning-based speech recognition technology. Given the increasing demand for accurate and efficient speech recognition systems, digital signal processing techniques are essential. The research follows an experimental method with a quantitative approach and consists of several stages: introduction, research design, data collection, data preprocessing, deep learning model development, training and performance evaluation, experiments and testing, and data analysis. The findings are expected to contribute to the development of more sophisticated and widely applicable speech recognition systems. For example, in virtual assistants such as Siri and Google Assistant, improved speech recognition accuracy will allow more natural interactions and faster responses, improving the user experience. The technology can also be used in security systems for safer and more reliable voice authentication, replacing or supplementing passwords and fingerprints. Additionally, in accessibility technology, more accurate voice recognition will be particularly beneficial for individuals with visual or mobility impairments, allowing them to control devices and access information with voice commands alone. Other benefits include improvements in automated phone applications, automatic transcription of meetings and conferences, and the development of smart home devices that can be fully voice-operated.
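    As one hedged example of the kind of DSP preprocessing such a system relies on (the paper does not specify its front-end; log-mel features via librosa are a common, assumed choice):

```python
import numpy as np
import librosa

def log_mel(path, sr=16000, n_mels=80):
    """Load audio and convert it to log-mel features, a common ASR front-end."""
    y, sr = librosa.load(path, sr=sr)                 # resample to 16 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-6).T                       # (frames, n_mels)

# features = log_mel("utterance.wav")  # then fed to the deep learning model
```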

    Advances in Binary and Multiclass Audio Segmentation with Deep Learning Techniques

    The technological advances of the last decade have completely changed the way people interact with multimedia content. This has led to a significant increase in both the generation and the consumption of such content. Manual analysis and annotation of all this information are not feasible given its current volume, which highlights the need for automatic tools that support the transition towards assisted or partially automatic workflows. In recent years, most of these tools have been based on neural networks and deep learning. In this context, the work described in this thesis focuses on the extraction of information from audio signals. In particular, it studies the audio segmentation task, whose main goal is to obtain a sequence of labels that isolate different regions of an input signal according to a set of characteristics described by a predefined set of classes, such as speech, music, or noise.

    The first part of this thesis focuses on the voice activity detection task. Recently, several international evaluation campaigns have proposed this task as one of their challenges. Among them is the Fearless Steps challenge, which works with audio from the recordings of NASA's Apollo missions. For this challenge, we propose a supervised learning solution using a convolutional recurrent neural network as classifier. The main contribution is a method that combines information from 1D and 2D filters in the convolutional stage so that it can be further processed by the recurrent stage. Motivated by the introduction of the Fearless Steps data, we evaluate several domain adaptation techniques in order to assess the performance of a system trained on data from common domains and evaluated on the new domain presented in the challenge. The described methods do not require labels in the target domain, which facilitates their use in practical applications. In general terms, the methods that seek to minimize the shift in statistical distributions between the source and target domains obtain the most promising results. Recent advances in self-supervised representation learning have shown large performance improvements in several speech processing tasks. Following this line, we incorporate such representations into the voice activity detection task. The most recent editions of the Fearless Steps challenge changed their purpose, now aiming to evaluate the generalization capabilities of the systems. The goal of the introduced techniques is therefore to benefit from large amounts of unlabeled data in order to improve the robustness of the system. The experimental results suggest that self-supervised representation learning yields systems that are much less sensitive to domain shift.

    The second part of this document analyzes a more generic audio segmentation task that seeks to simultaneously classify an audio signal as speech, music, noise, or a combination of these. In the context of the data proposed for the Albayzín 2010 audio segmentation challenge, we present an approach based on recurrent neural networks as the main classifier, with a post-processing model built from hidden Markov models. A new block is introduced into the neural architecture with the aim of removing redundant temporal information, improving performance while at the same time reducing the number of operations per second. This proposal outperformed solutions previously presented in the literature, as well as similar approaches based on deep neural networks. While the results with self-supervised representation learning were promising in binary segmentation tasks, a number of issues arise when they are applied to multiclass segmentation tasks. The usual data augmentation techniques applied during training force the model to compensate for background noise or music. Under these conditions, the obtained features might not accurately represent those classes that are generated similarly to the augmented versions seen in training. This fact limits the overall performance improvement observed when applying these techniques to tasks such as the one proposed in the Albayzín 2010 evaluation.

    The last part of this work investigates the application of new loss functions to the audio segmentation task, with the main goal of mitigating the problems derived from using a limited training dataset. New optimization techniques based on the AUC and partial-AUC metrics have been shown to improve on traditional training objectives such as cross-entropy in several detection tasks. With this idea in mind, this thesis introduces these techniques into the music detection task. Considering that the amount of labeled data for this task is limited compared with other tasks, AUC-based loss functions are applied with the goal of improving performance when the training set is relatively small. Most systems that use AUC-based optimization techniques are limited to binary tasks, since that is the usual scope of application of the AUC metric. Furthermore, labeling audio with more detailed taxonomies in which several options are possible is more complex, so the amount of labeled audio available for some multiclass segmentation tasks is limited. As a natural extension, we propose a generalization of the optimization techniques based on the binary AUC metric so that they can be applied with an arbitrary number of classes. Two different loss functions are introduced, formulated on the basis of the multiclass variants of the AUC metric proposed in the literature: one based on a one-versus-one approach, and another based on a one-versus-rest approach.
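    To make the AUC-based optimization idea concrete, here is a minimal sketch (an illustration, not the thesis implementation) of a pairwise hinge surrogate for the binary AUC, extended to multiple classes with the one-versus-rest averaging mentioned above, assuming PyTorch:

```python
import torch

def auc_hinge_loss(scores, labels, margin=1.0):
    """Pairwise hinge surrogate of 1 - AUC for one binary detector.

    Penalizes every (positive, negative) pair whose scores are not
    separated by at least `margin`.
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())                # no pairs in this batch
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)     # all pos/neg score gaps
    return torch.clamp(margin - diff, min=0).mean()

def ovr_auc_loss(logits, targets):
    """One-versus-rest extension: average the binary surrogate per class."""
    losses = [auc_hinge_loss(logits[:, c], (targets == c).long())
              for c in range(logits.shape[1])]
    return torch.stack(losses).mean()

logits = torch.randn(32, 3, requires_grad=True)    # 3-class segmentation frames
targets = torch.randint(0, 3, (32,))
ovr_auc_loss(logits, targets).backward()           # differentiable end to end
```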

    An Improved GA Based Modified Dynamic Neural Network for Cantonese-Digit Speech Recognition

    Author name used in this publication: F. H. F. Leung. Refereed academic research (2007-2008): chapter in an edited book.

    To Invest or Not to Invest: Using Vocal Behavior to Predict Decisions of Investors in an Entrepreneurial Context

    Entrepreneurial pitch competitions have become increasingly popular in the start-up culture to attract prospective investors. As the ultimate funding decision often follows from some form of social interaction, it is important to understand how the decision-making process of investors is influenced by behavioral cues. In this work, we examine whether vocal features are associated with the ultimate funding decision of investors by utilizing deep learning methods. We used videos of individuals in an entrepreneurial pitch competition as input to predict whether investors will invest in the startup or not. We propose models that combine deep audio features and Handcrafted audio Features (HaF) and feed them into two types of Recurrent Neural Networks (RNN), namely Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). We also trained the RNNs with only deep features to assess whether HaF provide additional information to the models. Our results show that it is promising to use the vocal behavior of pitchers to predict whether investors will invest in their business idea. Different types of RNNs yielded similar performance, yet the addition of HaF improved the performance.
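    A minimal sketch of the fusion idea, assuming PyTorch and made-up feature dimensions (the paper's deep-feature extractor and exact HaF set are not reproduced here): deep and handcrafted features are concatenated frame by frame and fed to a GRU or LSTM whose final state drives a sigmoid invest/not-invest decision.

```python
import torch
import torch.nn as nn

class PitchOutcomeRNN(nn.Module):
    """Binary invest/not-invest classifier over a per-frame feature sequence."""
    def __init__(self, deep_dim=128, haf_dim=32, hidden=64, rnn="gru"):
        super().__init__()
        rnn_cls = nn.GRU if rnn == "gru" else nn.LSTM
        # Deep audio features and handcrafted features are concatenated
        # frame by frame before entering the recurrent layer.
        self.rnn = rnn_cls(deep_dim + haf_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, deep_feats, haf):              # both (batch, time, dim)
        x = torch.cat([deep_feats, haf], dim=-1)
        out, _ = self.rnn(x)
        return torch.sigmoid(self.head(out[:, -1]))  # decision from last state

model = PitchOutcomeRNN()
p_invest = model(torch.randn(4, 200, 128), torch.randn(4, 200, 32))
```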

    A Comparative Analysis of Neural-Based Visual Recognisers for Speech Activity Detection

    Recent advances in neural networks have offered great solutions to the automation of various detection tasks, including speech activity detection (SAD). However, the existing literature on SAD highlights different approaches within neural networks but does not provide a comprehensive comparison of these approaches. Such a comparison is important because neural approaches often require hardware-intensive resources. This project therefore provides a comparative analysis of three different approaches: classification with still images (CNN), classification based on previous images (CRNN), and classification based on a sequence of images (Seq2Seq). The project aims to find a modest approach: one that provides the highest accuracy yet does not require expensive computation, while providing the quickest output prediction times. Such an approach can then be adapted for real-time applications such as the activation of infotainment systems or interactive robots. Results show that, within the problem domain (dataset, resources, etc.), the use of still images can achieve an accuracy of 97% for SAD. With the addition of an RNN, classification accuracy increases by a further 2%, as both architectures (classification based on previous images and classification of a sequence of images) achieve 99% classification accuracy. These results show that using history/previous images improves accuracy compared to using still images. Furthermore, with the RNN's memory capability, the network can be made smaller, which results in quicker training and prediction times. Experiments also showed that the CRNN is almost as accurate as the Seq2Seq architecture (99.1% vs 99.6% classification accuracy, respectively) but faster to train (326 s vs 761 s per epoch) and 28% faster at output prediction (3.7 s vs 5.19 s per prediction). These results indicate that the CRNN can be a suitable choice for real-time applications such as the activation of infotainment systems, based on classification accuracy, training time, and prediction time.
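    A small sketch of how such latency comparisons can be measured, assuming PyTorch and a placeholder model (the project's actual CNN, CRNN, and Seq2Seq architectures are not shown):

```python
import time
import torch
import torch.nn as nn

def mean_prediction_time(model, batch, n_runs=20):
    """Average wall-clock time per forward pass, as a rough latency proxy."""
    model.eval()
    with torch.no_grad():
        model(batch)                      # warm-up pass (also builds lazy layers)
        start = time.perf_counter()
        for _ in range(n_runs):
            model(batch)
    return (time.perf_counter() - start) / n_runs

# Placeholder stand-in for one of the three candidate recognisers
cnn = nn.Sequential(nn.Conv2d(1, 8, 3), nn.Flatten(), nn.LazyLinear(2))
frames = torch.randn(16, 1, 64, 64)       # a batch of input images
print(f"CNN: {mean_prediction_time(cnn, frames):.4f}s per batch")
```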