
    Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment

    Dysarthria is a disability that disturbs the human speech production system and reduces the quality and intelligibility of a person's speech. Because of this effect, normal speech processing systems cannot work properly on impaired speech. This disability is usually accompanied by physical disabilities, so designing a system that can perform tasks in a smart home by receiving voice commands would be a significant achievement. In this work, we introduce the gammatonegram as an effective method to represent audio files with discriminative details, which is used as input to a convolutional neural network. In other words, we convert each speech file into an image and propose an image recognition system to classify speech in different scenarios. The proposed CNN is based on transfer learning from the pre-trained AlexNet. In this research, the efficiency of the proposed system for speech recognition, speaker identification, and intelligibility assessment is evaluated. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system acquired 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. This system is arranged in cascade with the two-class intelligibility assessment system, whose output activates one of the speech recognition networks. This architecture achieves a word recognition rate (WRR) of 92.3%. The source code of this paper is available.
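    A minimal sketch of the kind of transfer-learning setup the abstract describes is shown below. It is not the authors' released code: the gammatonegram is assumed to be pre-computed as a 2-D array, the number of classes is a placeholder, and only the replaced classifier head is trained.

```python
# Sketch: fine-tuning a pre-trained AlexNet on gammatonegram "images".
# Assumes the gammatonegram is already computed as a 2-D NumPy array
# (frequency channels x time frames); library calls are standard torchvision/PyTorch.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

NUM_CLASSES = 2  # e.g. the two-class intelligibility assessment task (assumption)

def gammatonegram_to_tensor(gram: np.ndarray) -> torch.Tensor:
    """Scale a 2-D gammatonegram to [0, 1], resize to 224x224, replicate to 3 channels."""
    g = torch.from_numpy(gram).float()
    g = (g - g.min()) / (g.max() - g.min() + 1e-8)
    g = F.interpolate(g[None, None], size=(224, 224), mode="bilinear", align_corners=False)
    return g.repeat(1, 3, 1, 1)  # shape (1, 3, 224, 224), as AlexNet expects

# Load an ImageNet-pretrained AlexNet and swap the last fully connected layer.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)

# Only the new head is trained here; earlier layers keep their pre-trained weights.
optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(gram: np.ndarray, label: int) -> float:
    model.train()
    x = gammatonegram_to_tensor(gram)
    y = torch.tensor([label])
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

    In the cascade architecture described above, one such fine-tuned network per intelligibility class would be selected by the two-class assessment network's output.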

    Computer lipreading via hybrid deep neural network hidden Markov models

    Constructing a viable lipreading system is a challenge because it is claimed that only 30% of the information in speech production is visible on the lips. Nevertheless, in small-vocabulary tasks there have been several reports of high accuracies; investigation of larger vocabularies is rare. This work examines constructing a large-vocabulary lipreading system using an approach based on Deep Neural Network Hidden Markov Models (DNN-HMMs). We present the historical development of computer lipreading technology and the state-of-the-art results in small- and large-vocabulary tasks. In preliminary experiments, we evaluate the performance of lipreading and audiovisual speech recognition on small-vocabulary data sets. We then concentrate on improving lipreading systems at a more substantial vocabulary size with a multi-speaker data set, and tackle the problem of lipreading an unseen speaker. We investigate the effect of employing several steps to pre-process visual features. Moreover, we examine the contribution of language modelling in a lipreading system, using longer n-grams to recognise visual speech. Our lipreading system is constructed on the 6000-word vocabulary TCD-TIMIT audiovisual speech corpus. The results show that visual-only speech recognition can reach about 60% word accuracy on large vocabularies. We achieved a mean of 59.42% word accuracy, measured via three-fold cross-validation on the speaker-independent setting of the TCD-TIMIT corpus, using deep autoencoder features and DNN-HMM models. This is the best word accuracy reported for a lipreading system on a large-vocabulary task on the TCD-TIMIT corpus. In the final part of the thesis, we examine how the DNN-HMM model improves lipreading performance. We also give an insight into lipreading by providing a feature visualisation. Finally, we present an analysis of lipreading results and suggestions for future development.
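    As an illustration of the "deep autoencoder features" mentioned above, the sketch below shows a generic deep autoencoder over flattened lip-region frames. It is not the thesis implementation: the input dimensionality, layer sizes, and bottleneck size are assumptions, and in a full system the encoder outputs would feed the DNN-HMM back end.

```python
# Generic sketch of a deep autoencoder for visual (lip-region) features.
import torch
import torch.nn as nn

class LipAutoencoder(nn.Module):
    def __init__(self, input_dim: int = 32 * 32, bottleneck_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

# Unsupervised training: reconstruct each flattened lip-region frame.
model = LipAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.rand(8, 32 * 32)          # toy batch of flattened frames
loss = nn.functional.mse_loss(model(frames), frames)
loss.backward()
optimizer.step()

# After training, the bottleneck activations serve as per-frame visual features
# in place of hand-crafted descriptors for the DNN-HMM system.
features = model.encoder(frames)          # shape: (8, 64)
```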

    14th Conference on Data Analysis Methods for Software Systems

    DAMSS-2023 is the 14th International Conference on Data Analysis Methods for Software Systems, held in Druskininkai, Lithuania, every year at the same venue and time. The exception was 2020, when the world was gripped by the Covid-19 pandemic and the movement of people was severely restricted. After a year's break, the conference was back on track, and the next edition succeeded in its primary goal of lively scientific communication. The conference focuses on live interaction among participants. For better efficiency of communication, most of the presentations are poster presentations; this format has proven highly effective, although there are several oral sessions as well. The history of the conference dates back to 2009, when 16 papers were presented. It began as a workshop and has evolved into a well-known conference. The idea of such a workshop originated at the Institute of Mathematics and Informatics, now the Institute of Data Science and Digital Technologies of Vilnius University. The Lithuanian Academy of Sciences and the Lithuanian Computer Society supported this idea, which gained enthusiastic acceptance from both the Lithuanian and international scientific communities. This year's conference features 84 presentations, with 137 registered participants from 11 countries. The conference serves as a gathering point for researchers from six Lithuanian universities, making it the main annual meeting for Lithuanian computer scientists. Its primary aim is to showcase research conducted at Lithuanian and foreign universities in the fields of data science and software engineering, and its annual organization facilitates the rapid exchange of new ideas within the scientific community. Seven IT companies supported the conference this year, indicating the relevance of the conference topics to the business sector. In addition, the conference is supported by the Lithuanian Research Council and the National Science and Technology Council (Taiwan, R.O.C.). The conference covers a wide range of topics, including Applied Mathematics, Artificial Intelligence, Big Data, Bioinformatics, Blockchain Technologies, Business Rules, Software Engineering, Cybersecurity, Data Science, Deep Learning, High-Performance Computing, Data Visualization, Machine Learning, Medical Informatics, Modelling Educational Data, Ontological Engineering, Optimization, Quantum Computing, and Signal Processing. This book provides an overview of all presentations from the DAMSS-2023 conference.

    Analysis of constant-Q filterbank based representations for speech emotion recognition

    This work analyzes constant-Q filterbank-based time-frequency representations for speech emotion recognition (SER). A constant-Q filterbank provides a non-linear spectro-temporal representation with higher frequency resolution at low frequencies. Our investigation reveals how the increased low-frequency resolution benefits SER. The time-domain comparative analysis between short-term mel-frequency spectral coefficients (MFSCs) and constant-Q filterbank-based features, namely the constant-Q transform (CQT) and continuous wavelet transform (CWT), reveals that constant-Q representations provide higher time-invariance at low frequencies. This provides increased robustness against emotion-irrelevant temporal variations in pitch, especially for low-arousal emotions. The corresponding frequency-domain analysis over different emotion classes shows better resolution of pitch harmonics in constant-Q-based time-frequency representations than in MFSC. These advantages of constant-Q representations are further consolidated by SER performance in an extensive evaluation of features over four publicly available databases with six advanced deep neural network architectures as the back-end classifiers. Our inferences in this study hint at the suitability and potential of constant-Q features for SER.
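    A small sketch of the two front ends being compared is given below, using standard librosa calls. The hop length, number of bins, and bins per octave are illustrative assumptions, not the parameter values used in the study.

```python
# Sketch: constant-Q transform vs. a mel-spectrogram (MFSC-style) front end.
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))   # any mono speech/audio signal works

# Constant-Q transform: log-spaced bins, finer frequency resolution at low frequencies.
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                         fmin=librosa.note_to_hz("C1"),
                         n_bins=84, bins_per_octave=12))

# Mel-filterbank baseline for comparison.
mfsc = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=512, n_mels=84)

log_cqt = librosa.amplitude_to_db(cqt, ref=np.max)
log_mfsc = librosa.power_to_db(mfsc, ref=np.max)
print(log_cqt.shape, log_mfsc.shape)  # both (84, n_frames); either can feed a CNN/DNN classifier
```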

    Underwater Source Localization based on Modal Propagation and Acoustic Signal Processing

    Acoustic localization plays a pivotal role in underwater vehicle systems and marine mammal detection. Previous efforts adopt synchronized arrays of sensors to extract features such as direction of arrival (DOA) or time of flight (TOF) from the received signal. However, installing and synchronizing several hydrophones over a large area is costly and challenging. To tackle this problem, we use a single-hydrophone localization system that relies on acoustic signal processing methods rather than multiple hydrophones. This system takes modal dispersion into consideration and estimates the distance between the sound source and the receiver (range) based on dispersion curves. It is shown that the larger the range, the more separable the modes. To make the modes more distinguishable, a non-linear signal processing technique called warping is utilized. The propagation of low-frequency signals, such as dolphin sounds, is well studied in shallow-water environments (depth D < 200 m), and it has been demonstrated that at large ranges (r > 1 km) modal dispersion is clearly visible in the time-frequency (TF) domain. We used the Pekeris model for this situation to localize both synthetic and real underwater acoustic signals. The accuracy of the localization system is examined with various sounds, including impulsive signals, sounds with known Fourier transform, and signals with estimated source phase. Experimental results show that the warping technique can considerably lessen the localization error, especially when prior knowledge about the source signal and waveguide is available.
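    For illustration, the sketch below applies the classical time-warping resampling often used to separate dispersive modes in a Pekeris-type shallow-water waveguide, with w(t) = sqrt(t^2 + (r/c)^2) applied to a signal aligned to its first arrival. It is a hedged, simplified version of the general technique, not the paper's implementation; the range r and sound speed c are placeholder values.

```python
# Sketch: time-warping resampling for modal separation in a shallow-water waveguide.
import numpy as np

def warp_signal(x: np.ndarray, fs: float, r: float = 5000.0, c: float = 1500.0) -> np.ndarray:
    """Warp a received signal x whose first sample is taken as the first arrival (t = r/c)."""
    n = len(x)
    tr = r / c                                   # direct-path travel time (s)
    t_rec = np.arange(n) / fs                    # time since the first arrival
    s = np.arange(n) / fs                        # warped-domain time axis
    h = np.sqrt(s**2 + tr**2)                    # warping function h(s)
    dh = s / np.sqrt(s**2 + tr**2)               # |h'(s)|, energy-preserving factor
    # Resample the received signal at the warped instants h(s) (shifted back by tr).
    x_warped = np.sqrt(dh) * np.interp(h - tr, t_rec, x, left=0.0, right=0.0)
    return x_warped

# Toy usage: a random "received" segment sampled at 4 kHz.
warped = warp_signal(np.random.randn(8000), fs=4000.0)
```

    In the warped domain, dispersive modes behave approximately like narrowband tones, which is what makes them easier to separate before range estimation.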

    Artificial Intelligence for Multimedia Signal Processing

    Artificial intelligence technologies are actively applied to broadcasting and multimedia processing. A great deal of research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and over the past two to three years these efforts have aimed to improve image, video, speech, and other data compression efficiency in areas related to MPEG media processing technology. Additionally, technologies such as media creation, processing, editing, and scenario generation are very important areas of research in multimedia processing and engineering. This book collects topics spanning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing, including computer vision, speech/sound/text processing, and content analysis/information mining.

    IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech

    IberSPEECH2020 is a two-day event bringing together the best researchers and practitioners in speech and language technologies for Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentations of projects, laboratory activities, recent PhD theses, discussion panels, a round table, and awards for the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all contributions, each submitted paper was reviewed by three members of the scientific review committee. All papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, the Czech Republic, Ukraine, and Slovenia). Furthermore, an extension of selected papers will be published as a special issue, "IberSPEECH 2020: Speech and Language Technologies for Iberian Languages", in the journal Applied Sciences, published by MDPI with full open access. In addition to regular paper sessions, the IberSPEECH2020 scientific program features the ALBAYZIN evaluation challenge session.

    Self-supervised learning for automatic speech recognition in low-resource environments

    Supervised deep neural networks trained with substantial amounts of annotated speech data have demonstrated impressive performance across a spectrum of spoken language processing applications, frequently establishing themselves as the leading models in respective competitions. Nonetheless, a significant challenge arises from the heavy reliance on extensive annotated data for training these systems. This reliance poses a significant scalability limitation, hindering the continual enhancement of state-of-the-art performance. Moreover, it presents a more fundamental obstacle for deploying deep neural networks in speech-related domains where acquiring labeled data is inherently arduous, expensive, or time-intensive; we consider these low-resource ASR problems in this thesis. Unlike annotated speech data, collecting untranscribed audio is typically more cost-effective. In this thesis, we investigate the application of self-supervised learning, a learning approach in which the objective is derived directly from the input data itself, to low-resource tasks. We employ this method to harness the scalability and affordability of untranscribed audio in problems where we do not have enough training data, with the goal of enhancing the performance of spoken language technology. In particular, we propose three self-supervised methodologies. One model is based on the concept of two fine-tuning steps, while the other two revolve around the notion of identifying an improved hidden unit. These approaches are designed to learn contextualized speech representations from speech data lacking annotations. We demonstrate the capacity of our self-supervised techniques to learn representations that convey the higher-level characteristics of speech signals more effectively than conventional acoustic features. Additionally, we present how these representations enhance the performance of deep neural networks on ASR tasks with limited resources. Beyond introducing novel learning algorithms, we conduct in-depth analyses to comprehend the properties of the acquired self-supervised representations and elucidate the distinct design elements that separate one self-supervised model from another.
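    For readers unfamiliar with the general idea, the sketch below shows a generic masked-prediction self-supervised objective in which a model predicts discrete "hidden units" for masked frames of unlabeled speech. It is not any of the thesis's three methods: the encoder, mask ratio, and unit inventory size are placeholder assumptions, and the discrete targets are assumed to come from an external clustering step (e.g. k-means over acoustic features).

```python
# Generic sketch of a masked-prediction ("hidden unit") self-supervised objective.
import torch
import torch.nn as nn

NUM_HIDDEN_UNITS = 100   # assumed size of the discrete unit inventory
FEAT_DIM = 80            # e.g. log-mel filterbank dimension
MASK_RATIO = 0.3

encoder = nn.Sequential(          # toy frame encoder standing in for a deep contextual model
    nn.Linear(FEAT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)
unit_head = nn.Linear(256, NUM_HIDDEN_UNITS)
mask_embedding = nn.Parameter(torch.zeros(FEAT_DIM))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(unit_head.parameters()) + [mask_embedding], lr=1e-4
)

def ssl_step(frames: torch.Tensor, unit_targets: torch.Tensor) -> float:
    """frames: (T, FEAT_DIM) unlabeled features; unit_targets: (T,) cluster ids."""
    T = frames.size(0)
    masked = torch.rand(T) < MASK_RATIO
    # Replace masked frames with a learned mask vector, keep the rest unchanged.
    corrupted = torch.where(masked.unsqueeze(1), mask_embedding, frames)
    logits = unit_head(encoder(corrupted))
    loss = criterion(logits[masked], unit_targets[masked])  # predict units only where masked
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random data (real training would use untranscribed speech features).
loss = ssl_step(torch.randn(200, FEAT_DIM), torch.randint(0, NUM_HIDDEN_UNITS, (200,)))
```

    After such pre-training, the encoder's contextualized representations replace conventional acoustic features when fine-tuning an ASR model on the small labeled set.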
