7 research outputs found
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech
recognition, text-to-speech synthesis, automatic speech recognition, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field
The Dollar General: Continuous Custom Gesture Recognition Techniques At Everyday Low Prices
Humans use gestures to emphasize ideas and disseminate information. Their importance is apparent in how we continuously augment social interactions with motion—gesticulating in harmony with nearly every utterance to ensure observers understand that which we wish to communicate, and their relevance has not escaped the HCI community\u27s attention. For almost as long as computers have been able to sample human motion at the user interface boundary, software systems have been made to understand gestures as command metaphors. Customization, in particular, has great potential to improve user experience, whereby users map specific gestures to specific software functions. However, custom gesture recognition remains a challenging problem, especially when training data is limited, input is continuous, and designers who wish to use customization in their software are limited by mathematical attainment, machine learning experience, domain knowledge, or a combination thereof. Data collection, filtering, segmentation, pattern matching, synthesis, and rejection analysis are all non-trivial problems a gesture recognition system must solve. To address these issues, we introduce The Dollar General (TDG), a complete pipeline composed of several novel continuous custom gesture recognition techniques. Specifically, TDG comprises an automatic low-pass filter tuner that we use to improve signal quality, a segmenter for identifying gesture candidates in a continuous input stream, a classifier for discriminating gesture candidates from non-gesture motions, and a synthetic data generation module we use to train the classifier. Our system achieves high recognition accuracy with as little as one or two training samples per gesture class, is largely input device agnostic, and does not require advanced mathematical knowledge to understand and implement. In this dissertation, we motivate the importance of gestures and customization, describe each pipeline component in detail, and introduce strategies for data collection and prototype selection
Recommended from our members
A Novel Approach for Continuous Speech Tracking and Dynamic Time Warping. Adaptive Framing Based Continuous Speech Similarity Measure and Dynamic Time Warping using Kalman Filter and Dynamic State Model
Dynamic speech properties such as time warping, silence removal and background noise interference are the most challenging issues in continuous speech signal matching. Among all of them, the time warped speech signal matching is of great interest and has been a tough challenge for the researchers. An adaptive framing based continuous speech tracking and similarity measurement approach is introduced in this work following a comprehensive research conducted in the diverse areas of speech processing. A dynamic state model is introduced based on system of linear motion equations which models the input (test) speech signal frame as a unidirectional moving object along the template speech signal. The most similar corresponding frame position in the template speech is estimated which is fused with a feature based similarity observation and the noise variances using a Kalman filter. The Kalman filter provides the final estimated frame position in the template speech at current time which is further used for prediction of a new frame size for the next step. In addition, a keyword spotting approach is proposed by introducing wavelet decomposition based dynamic noise filter and combination of beliefs. The Dempster’s theory of belief combination is deployed for the first time in relation to keyword spotting task. Performances for both; speech tracking and keyword spotting approaches are evaluated using the statistical metrics and gold standards for the binary classification. Experimental results proved the superiority of the proposed approaches over the existing methods.The appendices files are not available online
IberSPEECH 2020: XI Jornadas en TecnologĂa del Habla and VII Iberian SLTech
IberSPEECH2020 is a two-day event, bringing together the best researchers and practitioners in speech and language technologies in Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentation of projects, laboratories activities, recent PhD thesis, discussion panels, a round table, and awards to the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions that will be presented distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, Czech Republic, Ucrania, Slovenia). Furthermore, it is confirmed to publish an extension of selected papers as a special issue of the Journal of Applied Sciences, “IberSPEECH 2020: Speech and Language Technologies for Iberian Languages”, published by MDPI with fully open access. In addition to regular paper sessions, the IberSPEECH2020 scientific program features the following activities: the ALBAYZIN evaluation challenge session.Red Española de TecnologĂas del Habla. Universidad de Valladoli