712 research outputs found
Analysis on Using Synthesized Singing Techniques in Assistive Interfaces for Visually Impaired to Study Music
Tactile and auditory senses are the primary channels through which visually impaired people perceive the world, and their interaction with assistive technologies likewise relies mainly on tactile and auditory interfaces. This paper discusses the validity of using the most appropriate singing-synthesis techniques as a mediator in assistive technologies built specifically to address the music-learning needs of visually impaired people working with music scores and lyrics. Music scores with notation and lyrics are the main mediators in the musical communication channel between a composer and a performer. Visually impaired music lovers have limited access to this mediator, since most scores exist only in visual formats. In a vocal score, the performer's melody is wedded to all the pleasing sound that singing can produce, and singing is best suited to a temporal format, in contrast to a tactile format in the spatial domain. Converting the existing visual format into a singing output is therefore the most appropriate lossless transition, as demonstrated by initial research on an adaptive music-score trainer for the visually impaired [1]. To extend that initial research, this study surveys existing singing-synthesis techniques and research on auditory interfaces.
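The score-to-singing conversion described above ultimately maps discrete pitch symbols from the spatial (score) domain into the temporal (audio) domain. A minimal sketch of that mapping, with all note values and timings invented for illustration rather than taken from the paper, could look like:

```python
import math

def midi_to_hz(midi_note: int) -> float:
    """Convert a MIDI note number to its frequency in Hz (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)

def score_to_pitch_contour(notes, seconds_per_beat=0.5):
    """Map (midi_note, beats) pairs from a score to (frequency, duration) pairs,
    i.e. the minimal temporal information a singing synthesizer would render."""
    return [(round(midi_to_hz(n), 2), beats * seconds_per_beat)
            for n, beats in notes]

# First phrase of a hypothetical score: C4, D4, E4 as quarter notes, G4 as a half note.
phrase = [(60, 1), (62, 1), (64, 1), (67, 2)]
print(score_to_pitch_contour(phrase))
```

A real system would additionally carry lyrics, dynamics, and articulation into the synthesizer, but the pitch-and-duration contour is the core of the lossless temporal transition the abstract argues for.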
Speech Recognition
Chapters in the first part of the book cover all the essential speech-processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech-processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.
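Searching the hypothesis space, one of the techniques the book covers, is commonly done with beam search, which keeps only the best few partial hypotheses at each step. A toy sketch, with symbols and probabilities invented purely for illustration, might be:

```python
import math

def beam_search(frame_log_probs, beam_width=2):
    """Toy beam search over a hypothesis space: frame_log_probs is a list of
    dicts mapping a symbol to its log-probability at that frame. After every
    frame, only the beam_width highest-scoring partial hypotheses survive."""
    beams = [((), 0.0)]  # (symbol sequence, cumulative log-probability)
    for frame in frame_log_probs:
        candidates = [(seq + (sym,), score + lp)
                      for seq, score in beams
                      for sym, lp in frame.items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

frames = [{"h": math.log(0.6), "x": math.log(0.4)},
          {"i": math.log(0.7), "e": math.log(0.3)}]
best_seq, best_score = beam_search(frames)[0]
print("".join(best_seq))  # hypothesis with the highest cumulative score
```

A production decoder combines acoustic and language-model scores in these per-frame probabilities, but the pruning idea is the same.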
Mining of Textual Data from the Web for Speech Recognition
The primary goals of this project were to study language modeling for speech recognition and techniques for acquiring text data from the Web. The text introduces basic speech recognition techniques and describes statistical language models in detail, with particular attention to criteria for evaluating the quality of language models and speech recognition systems. It also covers data-mining models and techniques, especially information retrieval, discusses the specific problems of Web mining, and introduces Google search by way of contrast. Part of the project was the design and implementation of a system for retrieving text from the Web, which is described in detail. The main goal of the work, however, was to determine whether data acquired from the Web can improve speech recognition. The experiments described therefore seek the optimal way to use the retrieved Web data to improve both sample language models and models deployed in real recognition systems.
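The standard criterion the abstract mentions for judging language-model quality is perplexity. A minimal add-one-smoothed bigram sketch, on a toy corpus rather than the project's Web data, could look like:

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Collect unigram and bigram counts plus vocabulary size for an
    add-one-smoothed bigram language model."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:])), len(set(tokens))

def perplexity(tokens, unigrams, bigrams, vocab_size):
    """Perplexity of a token sequence under the smoothed bigram model:
    exp of the negative average log-probability per predicted token."""
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

corpus = "the cat sat on the mat the cat ran".split()
uni, bi, v = train_bigram(corpus)
print(perplexity("the cat sat".split(), uni, bi, v))
```

Updating such a model with Web text amounts to adding the retrieved tokens to the counts (or interpolating a Web-trained model with the original one) and checking whether perplexity on held-out data drops.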
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for remarkable advances in automatic speech
recognition, text-to-speech synthesis, and emotion recognition, propelling the
performance of these tasks to unprecedented heights. The power of deep learning
techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC features and
HMM-based models, to more recent deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
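The MFCC-era front end the review mentions begins with pre-emphasis, framing, and windowing before any spectral analysis. A pure-Python sketch of these first steps, with frame sizes assuming 16 kHz audio (25 ms frames, 10 ms hop), might be:

```python
import math

def preemphasize(signal, alpha=0.97):
    """Classic pre-emphasis filter: y[t] = x[t] - alpha * x[t-1],
    boosting high frequencies before spectral analysis."""
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]

def frame_and_window(signal, frame_len=400, hop=160):
    """Split a waveform into overlapping frames (25 ms / 10 ms at 16 kHz)
    and apply a Hamming window to each -- the first steps of an MFCC front end."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        frames.append([s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1)))
                       for i, s in enumerate(frame)])
    return frames

# One second of a synthetic 440 Hz tone sampled at 16 kHz.
wave = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
frames = frame_and_window(preemphasize(wave))
print(len(frames), len(frames[0]))
```

A full MFCC pipeline would continue with an FFT, a mel filterbank, a log, and a DCT per frame; end-to-end deep models replace much of this hand-designed chain with learned layers operating closer to the raw waveform.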
Efficient, end-to-end and self-supervised methods for speech processing and generation
Deep learning has reshaped the fields of speech processing and generation in several directions. First, end-to-end architectures allow waveform samples to be ingested and synthesized directly. Second, the exploration of efficient solutions makes it possible to deploy these systems in computationally restricted environments, such as smartphones. Finally, the latest trends exploit audio-visual data with minimal supervision. This thesis explores these three directions.
First, we propose the use of recent pseudo-recurrent structures, such as self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech. The proposed system, QLAD, synthesizes faster on both CPU and GPU than its recurrent counterpart while preserving the same synthesis quality, competitive with state-of-the-art vocoder-based models.
Then we propose a generative adversarial network for speech enhancement, named SEGAN. This model works as a time-domain speech-to-speech conversion system in which a single inference operation through a fully convolutional structure processes all samples at once. This makes it more efficient than existing models, which are auto-regressive and also operate in the time domain. SEGAN achieves prominent results in noise suppression and in preserving speech naturalness and intelligibility compared with classic and deep regression-based systems. We also show that SEGAN transfers efficiently to new languages and noises: a SEGAN trained on English performs comparably on Catalan and Korean with only 24 seconds of adaptation data. Finally, we exploit the generative capacity of the model to recover signals from several distortions, a task we call generalized speech enhancement. The model first proves effective at recovering voiced speech from whispered speech. It is then scaled up to handle distortions that require recomposing damaged parts of the signal, such as extending the bandwidth or recovering lost temporal sections. Including additional acoustic losses in a multi-task setup improves the model by imposing a relevant perceptual weighting on the generated result, and a two-step training schedule stabilizes the adversarial training after these losses are added; both components boost SEGAN's performance across distortions.
Finally, we propose a problem-agnostic speech encoder, named PASE, together with the framework to train it. PASE is a fully convolutional network that yields compact representations from speech waveforms. These representations contain abstract information such as speaker identity, prosodic features, and spoken content.
A self-supervised framework is also proposed to train this encoder, a new step towards unsupervised learning for speech processing. Once trained, the encoder can be exported to solve different tasks that take speech as input. We first explore the performance of PASE codes for speaker recognition, emotion recognition, and speech recognition; PASE performs competitively with well-designed classic features on these tasks, especially after some supervised adaptation. Finally, PASE also provides good identity descriptors for multi-speaker modeling in text-to-speech, which makes it possible to model novel identities without retraining the model.
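A fully convolutional encoder like the PASE model described above compresses a raw waveform into a short code through stacked strided convolutions. A toy sketch of that downsampling idea, with fixed kernels and layer counts invented for illustration (a real model learns its kernel weights), could be:

```python
def conv1d(signal, kernel, stride):
    """Valid-mode strided 1-D convolution (cross-correlation) in pure Python."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]

def toy_encoder(waveform):
    """Stack of stride-2 convolutions: each layer roughly halves the temporal
    resolution, turning a raw waveform into a short, compact code, in the
    spirit of a fully convolutional speech encoder."""
    smoothing = [0.25, 0.5, 0.25]   # fixed kernel; a learned model trains these
    code = waveform
    for _ in range(4):              # 4 layers of stride 2 -> ~16x downsampling
        code = conv1d(code, smoothing, stride=2)
    return code

waveform = [float(t % 10) for t in range(320)]  # stand-in for 20 ms of audio
code = toy_encoder(waveform)
print(len(waveform), "->", len(code))
```

In a self-supervised setup such as the one the abstract describes, several small heads are trained on top of these codes to predict signal-derived targets, so the encoder learns useful representations without labels.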
- …