Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

Author: Chen Xie
Du Chenpeng
Ma Ziyang
Povey Daniel
Shen Feiyu
Yang Yifan
Yu Kai
Publication date: 13/09/2023

Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into utilizing discrete tokens for speech tasks like recognition and translation, which offer lower storage requirements and great potential to employ natural language processing techniques. However, these studies, mainly single-task focused, faced challenges like overfitting and performance degradation in speech recognition tasks, often at the cost of sacrificing performance in multi-task scenarios. This study presents a comprehensive comparison and optimization of discrete tokens generated by various leading SSL models in speech recognition and synthesis tasks. We aim to explore the universality of speech discrete tokens across multiple speech tasks. Experimental results demonstrate that discrete tokens achieve comparable results against systems trained on FBank features in speech recognition tasks and outperform mel-spectrogram features in speech synthesis in subjective and objective metrics. These findings suggest that universal discrete tokens have enormous potential in various speech-related tasks. Our work is open-source and publicly available to facilitate research in this direction

arXiv.org e-Print Archive

Evaluating raw waveforms with deep learning frameworks for speech emotion recognition

Author: Bayraktar Ulku
Kilimci Zeynep Hilal
Kucukmanisa Ayhan
Publication date: 06/07/2023

Speech emotion recognition is a challenging task in the speech processing field. For this reason, the feature extraction process is crucial for representing and processing speech signals. In this work, we present a model that feeds raw audio files directly into deep neural networks, without any feature extraction stage, to recognize emotions on six data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS. To demonstrate the contribution of the proposed model, traditional feature extraction techniques, namely mel-scale spectrograms and mel-frequency cepstral coefficients, are combined with machine learning algorithms, ensemble learning methods, and deep and hybrid deep learning techniques. Support vector machine, decision tree, naive Bayes, and random forest models are evaluated as machine learning algorithms, while majority voting and stacking are assessed as ensemble learning techniques. Moreover, convolutional neural networks (CNN), long short-term memory (LSTM) networks, and a hybrid CNN-LSTM model are evaluated as deep learning techniques and compared with the machine learning and ensemble learning methods. To demonstrate the effectiveness of the proposed model, it is also compared with state-of-the-art studies. Based on the experimental results, the CNN model outperforms existing approaches with 95.86% accuracy on the TESS+RAVDESS data set using raw audio files, setting a new state of the art. The proposed model achieves 90.34% accuracy on EMO-DB with the CNN model, 90.42% on RAVDESS with the CNN model, 99.48% on TESS with the LSTM model, 69.72% on CREMA with the CNN model, and 85.76% on SAVEE with the CNN model in speaker-independent audio categorization problems.
Comment: 14 pages, 6 Figures, 8 Table
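The abstract above names majority voting as one of the ensemble techniques it evaluates. As a rough illustration only, a minimal sketch of majority voting over per-classifier label predictions might look like the following. The classifier outputs (`svm`, `tree`, `nb`) and the emotion labels are made-up values for illustration, not data or code from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Return, for each sample, the label predicted by the most classifiers.

    predictions: list of per-classifier label lists, all of the same length.
    Ties are resolved in favour of the label seen first for that sample.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three classifiers on three utterances.
svm = ["happy", "sad", "angry"]
tree = ["happy", "angry", "angry"]
nb = ["sad", "sad", "angry"]

print(majority_vote([svm, tree, nb]))  # ['happy', 'sad', 'angry']
```

Stacking, the other ensemble method mentioned, would instead train a second-level model on these per-classifier outputs rather than counting votes.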

arXiv.org e-Print Archive

Subphonetic Modeling for Speech Recognition

Author: Mei-yuh Hwang
Xuedong Huang
Publication date: 01/01/1992

How to capture important acoustic clues and estimate essential parameters reliably is one of the central issues in speech recognition, since we will never have sufficient training data to model various acoustic-phonetic phenomena. Successful examples include subword models with many smoothing techniques. In comparison with subword models, subphonetic modeling may provide a finer level of detail. We propose to model subphonetic events with Markov states and treat the state in phonetic hidden Markov models as our basic subphonetic unit, the senone. A word model is a concatenation of state-dependent senones, and senones can be shared across different word models. Senones not only allow parameter sharing, but also enable pronunciation optimization and new word learning, where the phonetic baseform is replaced by the senonic baseform. In this paper, we report preliminary subphonetic modeling results, which not only significantly reduced the word error rate for speaker-independent continuous speech recognition but also demonstrated a novel application for new word learning.
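The idea of a word model as a concatenation of senones shared across word models can be pictured with a small data-structure sketch. Everything here is hypothetical and purely illustrative: the senone ids, the single `mean` parameter, and the two example words are assumptions, not the inventory or parameterization used in the paper.

```python
# Hypothetical senone inventory: each senone id maps to one shared parameter set.
senones = {
    "s1": {"mean": 0.0},
    "s2": {"mean": 1.0},
    "s3": {"mean": 2.0},
    "s4": {"mean": 3.0},
}

# Hypothetical word models: each word is a sequence of senone ids.
word_models = {
    "cat": ["s1", "s2", "s3"],
    "cab": ["s1", "s2", "s4"],
}


def shared_senones(word_a, word_b):
    """Return the senone ids that two word models have in common."""
    return set(word_models[word_a]) & set(word_models[word_b])


def word_parameters(word):
    """Look up the parameter sets along a word's senone sequence."""
    return [senones[sid] for sid in word_models[word]]


# Re-estimating a shared senone changes every word model that uses it.
senones["s1"]["mean"] = 0.5

print(sorted(shared_senones("cat", "cab")))  # ['s1', 's2']
print(word_parameters("cat")[0] is word_parameters("cab")[0])  # True
```

Because both words point at the same parameter object, training data for either word contributes to the shared senone, which is the parameter-sharing benefit the abstract describes.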

CiteSeerX

Crossref

Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models

Author: Hines Andrew
Ragano Alessandro
Ullah Asad
Publication date: 22/09/2023

Self-supervised representation learning (SSRL) has improved performance on downstream phoneme recognition compared with supervised models. Training SSRL models requires a large amount of pre-training data, which poses a challenge for low-resource languages. A common approach is to transfer knowledge from other languages. Instead, we propose to use audio augmentation to pre-train SSRL models in a low-resource condition and evaluate phoneme recognition as the downstream task. We performed a systematic comparison of augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech. We found that combined augmentation (noise/pitch) was the best strategy, outperforming accent and language knowledge transfer. We compared performance across various quantities and types of pre-training data. We examined the scaling factor of augmented data needed to achieve performance equivalent to models pre-trained with target-domain speech. Our findings suggest that for resource-constrained languages, in-domain synthetic augmentation can outperform knowledge transfer from accented or other-language speech.
Comment: 5 pages, 4 figures, ICASSP2

arXiv.org e-Print Archive

Speaker gender recognition system

Author: Hong Z. (Zimeng)
Publication venue: University of Oulu
Publication date: 31/05/2017

Automatic gender recognition through speech is one of the fundamental mechanisms in human-machine interaction. Typical application areas of this technology range from gender-targeted advertising to gender-specific IoT (Internet of Things) applications. It can also be used to narrow down the scope of investigations in crime scenarios. There are many possible methods of recognizing the gender of a speaker. In machine learning applications, the first step is to acquire the natural human voice and convert it into a machine-understandable signal. Useful voice features can then be extracted and labelled with gender information so that they can be used to train a machine. After that, new input voice can be captured and processed, and the machine is able to extract the features by pattern modelling. In this thesis, a real-time speaker gender recognition system was designed in the Matlab environment. This system can automatically identify the gender of a speaker by voice. The implementation uses voice processing and feature extraction techniques to handle input speech coming from a microphone or a recorded speech file. The response features are extracted and classified. A machine learning classification method (Naïve Bayes classifier) is then used to distinguish the gender features. Finally, the recognition result with gender information is displayed. The speaker gender recognition system was evaluated in an experiment with 40 participants (half male and half female) in a quite small room. The experiment recorded 400 speech samples by speakers from 16 countries in 17 languages. These 400 speech samples were tested by the gender recognition system, which showed considerably good performance, with only 29 recognition errors (92.75% accuracy). By comparison, most previous speaker gender recognition systems obtained an accuracy of no more than 90%, and only one obtained 100% accuracy, with very limited testers. We can then conclude that the performance of the speaker gender recognition system designed in this thesis is reliable.
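The reported accuracy follows directly from the sample and error counts given in the abstract (400 samples, 29 errors). A short check, with the helper name `accuracy` chosen here purely for illustration:

```python
def accuracy(total_samples, errors):
    """Fraction of samples classified correctly."""
    return (total_samples - errors) / total_samples


# Figures taken from the abstract: 400 speech samples, 29 recognition errors.
print(f"{accuracy(400, 29):.2%}")  # 92.75%
```

That is, 371 of the 400 samples were classified correctly, and 371 / 400 = 0.9275.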

University of Oulu Repository - Jultika

Speech recognition based on spectrograms by using deep learning

Author: Leon Roy Eduardo Aguilar
Publication date: 01/01/2018

Speech recognition is widely used and has become part of our day-to-day lives. Several massive and popular applications have taken its use to another level. Most existing systems use machine learning techniques such as artificial neural networks or fuzzy logic, whereas others may simply be based on a comparative analysis of the sound signals against large lookup tables that contain possible realizations of voice commands. These models base their speech recognition algorithms on the analysis or comparison of the analog acoustic signal itself. Sound has particular characteristics that cannot be seen in the time-domain representation of its propagation wave. This project proposes speech recognition through an innovative model that analyzes the graphical representation of the acoustic signal, its spectrogram. The model therefore does not classify speech through its acoustic signal but through its graphical representation. This leads the research to approach the problem with image classification techniques. Image classification was once considered a task only humans could do; with the development of machine learning techniques, this perception has changed drastically. This project covers several techniques, shows the potential of deep learning for object classification, and within this field presents convolutional neural networks as the most suitable algorithm for the classification of spectrograms. To clearly illustrate the efficacy of the proposed model, the algorithm was trained with two self-obtained datasets. Several experiments were conducted to make a detailed comparison of the system throughput and its levels of accuracy.

Universiti Teknologi Malaysia Institutional Repository

Perceptual Evaluation of Video-Realistic Speech

Author: Ezzat Tony
Geiger Gadi
Poggio Tomaso
Publication date: 28/02/2003

With many visual speech animation techniques now available, there is a clear need for systematic perceptual evaluation schemes. We describe here our scheme and its application to a new video-realistic (potentially indistinguishable from real recorded video) visual-speech animation system, called Mary 101. Two types of experiments were performed: a) distinguishing visually between real and synthetic image-sequences of the same utterances ("Turing tests"), and b) gauging visual speech recognition by comparing lip-reading performance on the real and synthetic image-sequences of the same utterances ("Intelligibility tests"). Subjects who were presented randomly with either real or synthetic image-sequences could not tell the synthetic from the real sequences above chance level. The same subjects, when asked to lip-read the utterances from the same image-sequences, recognized speech from real image-sequences significantly better than from synthetic ones. However, performance for both real and synthetic sequences was at levels suggested in the literature on lip-reading. We conclude from the two experiments that the animation of Mary 101 is adequate for providing a percept of a talking head. However, additional effort is required to improve the animation for lip-reading purposes such as rehabilitation and language learning. In addition, these two tasks could be considered as explicit and implicit perceptual discrimination tasks. In the explicit task (a), each stimulus is classified directly as a synthetic or real image-sequence by detecting a possible difference between the synthetic and the real image-sequences. The implicit perceptual discrimination task (b) consists of a comparison between visual recognition of speech from real and synthetic image-sequences. Our results suggest that implicit perceptual discrimination is a more sensitive method for discriminating between synthetic and real image-sequences than explicit perceptual discrimination.

DSpace@MIT

Brain signal recognition using deep learning

Author: Datta Sahil
Publication venue: Brunel University London
Publication date: 01/01/2022

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.
Brain Computer Interface (BCI) has the potential to offer a new generation of applications independent of muscular activity and controlled by the human brain. Brain imaging technologies are used to translate cognitive tasks into control commands for a BCI system. Electroencephalography (EEG) serves as the best available non-invasive solution for extracting signals from the brain. On the other hand, speech is the primary means of communication, but for patients suffering from locked-in syndrome, there is no easy way to communicate. Therefore, an ideal communication system for locked-in patients is a thought-to-speech BCI system. This research aims to investigate methods for the recognition of imagined speech from EEG signals using deep learning techniques. In order to design an optimal imagined speech recognition BCI, a variety of issues have been solved. These include 1) proposing a new feature extraction and classification framework for recognition of imagined speech from EEG signals, 2) grammatical class recognition of imagined words from EEG signals, and 3) discriminating different cognitive tasks associated with speech in the brain, such as overt speech, covert speech, and visual imagery. In this work, machine learning and deep learning methods were used to analyze EEG signals. For recognition of imagined speech from EEG signals, a new EEG database was collected while the participants mentally spoke (imagined speech) the presented words. Along with imagined speech, EEG data was recorded for visual imagery (imagining a scene or an image) and overt speech (verbal speech). Spectro-temporal and spatio-temporal domain features were investigated for the classification of imagined words from EEG signals. Further, a deep learning framework using a convolutional network and an attention mechanism was implemented for learning features in the spatial, temporal, and spectral domains. The method achieved a recognition rate of 76.6% for three binary word pairs. These experiments show that deep learning algorithms are ideal for imagined speech recognition from EEG signals due to their ability to interpret features from non-linear and non-stationary signals. Grammatical classes of imagined words from EEG signals were also recognized using a multi-channel convolutional network framework. This method was extended to a multi-level recognition system for multi-class classification of imagined words, which achieved an accuracy of 52.9% for 10 words, much better than previous work. In order to investigate the difference between imagined speech, verbal speech, and visual imagery in EEG signals, we used multivariate pattern analysis (MVPA). MVPA provided the time segments in which the neural oscillations for the different cognitive tasks were linearly separable. Further, the frequencies that result in the most discrimination between the different cognitive tasks were also explored. A framework was proposed to discriminate two cognitive tasks based on the spatio-temporal patterns in EEG signals. The proposed method used the K-means clustering algorithm to find the best electrode combination and a convolutional-attention network for feature extraction and classification. The proposed method achieved high recognition rates of 82.9% and 77.7%. The results of this research suggest that a communication-based BCI system can be designed using deep learning methods. Further, this work adds knowledge to the existing work in the field of communication-based BCI systems.

Brunel University Research Archive