
    Northeastern Illinois University, Academic Catalog 2023-2024

    Get PDF

    Evaluating cognitive load of text-to-speech synthesis

    Get PDF
    This thesis addresses the vital topic of evaluating synthetic speech and its impact on the end-user, taking into consideration potential negative implications for cognitive load. While conventional methods like transcription tests and Mean Opinion Score (MOS) tests offer a valuable overall understanding of system performance, they fail to provide deeper insights into the reasons behind that performance. As text-to-speech (TTS) systems are increasingly used in real-world applications, it becomes crucial to explore whether synthetic speech imposes a greater cognitive load on listeners than human speech, as excessive cognitive effort could lead to fatigue over time. The study focuses on assessing the cognitive load of synthetic speech through two methodologies: the dual-task paradigm and pupillometry. The dual-task paradigm initially seemed promising but was eventually deemed unreliable and unsuitable due to uncertainties in the experimental setups, which require further investigation. Pupillometry, however, emerged as a viable approach, demonstrating its efficacy in detecting differences in cognitive load among various speech synthesizers. Notably, the research confirmed that accurate measurement of listening difficulty requires imposing sufficient cognitive load on listeners. To achieve this, the most viable experimental setup involved measuring the pupil response while listening to speech in the presence of noise. These experiments revealed intriguing contrasts between human and synthetic speech. Human speech consistently demanded the least cognitive load. State-of-the-art TTS systems, on the other hand, showed promising results, indicating a significant improvement in cognitive load over the rule-based synthesizers of the past. Pupillometry offers a deeper understanding of the factors contributing to increased cognitive load in synthetic speech processing. In particular, one experiment highlighted that the separate modeling of spectral feature prediction and duration in TTS systems led to heightened cognitive load. Encouragingly, many modern end-to-end TTS systems have addressed this issue by predicting acoustic features within a unified framework, thus effectively reducing the overall cognitive load imposed by synthetic speech. As the gap between human and synthetic speech diminishes with advancements in TTS technology, continuous evaluation using pupillometry remains essential for optimizing TTS systems for low cognitive load. Although pupillometry demands advanced analysis techniques and is time-consuming, the meaningful insights it provides into the cognitive load of synthetic speech contribute to an enhanced user experience and better TTS system development. Overall, this work establishes pupillometry as a viable and effective method for measuring the cognitive load of synthetic speech, propelling synthetic speech evaluation beyond traditional metrics. By gaining a deeper understanding of synthetic speech's interaction with the human cognitive processing system, researchers and developers can work towards creating TTS systems that offer improved user experiences with reduced cognitive load, ultimately enhancing the overall usability and acceptance of such technologies. Note: there was a two-year break in the work reported in this thesis; an initial pilot was performed in early 2020 and then suspended due to the COVID-19 pandemic. Experiments were therefore rerun in 2022/23 with the most recent state-of-the-art models to determine whether the increased-cognitive-load result still holds. The thesis thus concludes by answering whether the cognitive load methods developed here are still useful, practical, and relevant for current state-of-the-art text-to-speech systems.
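    As a rough illustration of the pupillometry approach described above (not code from the thesis), the sketch below computes a baseline-corrected mean pupil dilation per trial and compares two listening conditions. All trace data, window indices, and effect sizes are hypothetical placeholders; real traces would come from an eye tracker.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def synth_trace(dilation, n=500):
    """Placeholder pupil trace: flat baseline plus an elevated response segment."""
    trace = rng.normal(0.0, 0.02, n)
    trace[150:450] += dilation  # hypothetical stimulus-evoked dilation
    return trace

def mean_dilation(trace, baseline=(0, 100), response=(150, 450)):
    """Baseline-corrected mean pupil dilation for one trial."""
    return trace[response[0]:response[1]].mean() - trace[baseline[0]:baseline[1]].mean()

# Hypothetical data: 40 trials per condition (human speech vs. a TTS system).
human = [mean_dilation(synth_trace(0.10)) for _ in range(40)]
tts = [mean_dilation(synth_trace(0.15)) for _ in range(40)]

# Larger baseline-corrected dilation is read as higher cognitive load.
t_stat, p_val = ttest_ind(tts, human)
print(f"human={np.mean(human):.3f}  tts={np.mean(tts):.3f}  p={p_val:.3g}")
```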

    Comparison metrics and performance estimations for deep-beamforming deep-neural-network-based automatic speech recognition systems using microphone arrays

    Get PDF
    Automatic Speech Recognition (ASR) functionality, the automatic translation of speech into text, is on the rise today and is required for various use cases, scenarios, and applications. An ASR engine by itself faces difficulties when encountering live audio input, regardless of how sophisticated and advanced it may be. That is especially true under circumstances such as a noisy ambient environment, multiple speakers, or faulty microphones. These kinds of challenges characterize a realistic scenario for an ASR system. ASR functionality continues to evolve toward more comprehensive End-to-End (E2E) solutions. E2E solution development focuses on three significant characteristics. The solution has to be robust enough to endure external interference. It also has to remain flexible, so it can easily be extended to adapt to new scenarios or to achieve better performance. Lastly, the solution should be modular enough to fit conveniently into new applications. Such an E2E ASR solution may include, besides the ASR engine (which is very complicated by itself), several additional micro-modules for speech enhancement. Adding these micro-modules can enhance robustness and improve the overall system's performance. Examples of such micro-modules include noise cancellation, speech separation, multi-microphone arrays, and adaptive beamformer(s). A comprehensive solution built of numerous micro-modules is technologically challenging to implement and difficult to integrate into resource-limited mobile systems. By offloading the complex computations to a server in the cloud, the system can fit more easily into less capable computing devices. Nevertheless, compute offloading comes at the cost of giving up real-time analysis and increasing overall system bandwidth; in addition, it requires connectivity to the cloud over the internet. To find the optimal trade-offs between performance, Hardware (HW) and Software (SW) requirements or limitations, the maximal computation time allowed for real-time analysis, and detection accuracy, one should first define the metrics used to evaluate such an E2E ASR system, and then determine the extent of correlation between those metrics and the ability to forecast the impact each variation has on the others. This research presents novel progress in optimally designing a robust E2E ASR system targeted at mobile, resource-limited devices. First, we describe evaluation metrics for each domain of interest, spread over vast engineering subjects, emphasizing the bindings between metrics across domains and the degree of impact derived from a change in the system's specifications or constraints. Second, we present the effectiveness of applying machine learning techniques that can generalize and deliver improved overall performance and robustness. Third, we present an approach of substituting architectures, changing algorithms, and approximating complex computations with custom dedicated hardware acceleration in order to replace traditional state-of-the-art SW-based solutions, thus providing real-time analysis capabilities to resource-limited systems.
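    As background for the beamforming micro-module mentioned above, here is a minimal time-domain delay-and-sum beamformer, the simplest member of the beamformer family such a pipeline might contain. It is a generic sketch under assumed integer steering delays, not the system proposed in this work.

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Time-domain delay-and-sum beamformer.

    channels: (n_mics, n_samples) array of microphone signals.
    delays_samples: integer steering delay per microphone, in samples.
    Returns the averaged, delay-compensated signal.
    """
    out = np.zeros(channels.shape[1])
    for mic, delay in zip(channels, delays_samples):
        out += np.roll(mic, -delay)  # advance each channel by its steering delay
    return out / channels.shape[0]

# Hypothetical 4-mic capture: the same source arriving with different
# integer delays, plus uncorrelated noise per microphone.
rng = np.random.default_rng(1)
source = rng.standard_normal(16000)
delays = [0, 2, 4, 6]
mics = np.stack([np.roll(source, d) + 0.5 * rng.standard_normal(16000) for d in delays])

enhanced = delay_and_sum(mics, delays)
# With uncorrelated noise, SNR improves roughly with the number of microphones.
```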

    A Review of Deep Learning Techniques for Speech Processing

    Full text link
    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.
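    To make the classical front-end mentioned in this review concrete, the following sketch extracts the MFCC features that HMM-era recognizers consumed, using the librosa library. It assumes librosa is installed; "speech.wav" is a placeholder path.

```python
import librosa
import numpy as np

# Load a mono waveform ("speech.wav" is a placeholder path).
y, sr = librosa.load("speech.wav", sr=16000)

# Classical front-end: 13 MFCCs per frame, the kind of features
# HMM-based recognizers consumed before learned representations took over.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames)

# Delta features are commonly appended to capture local temporal dynamics.
features = np.concatenate([mfcc, librosa.feature.delta(mfcc)], axis=0)
```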

    Unsupervised Learning Algorithm for Noise Suppression and Speech Enhancement Applications

    Get PDF
    Smart and intelligent devices are being integrated more and more into day-to-day life to perform a multitude of tasks. These tasks include, but are not limited to, job automation and smart utility management, with the aim of improving quality of life and making normal day-to-day chores as effortless as possible. These smart devices may or may not be connected to the internet to accomplish tasks. Additionally, human-machine interaction with such devices may be touch-screen based or driven by voice commands. To understand and act upon received voice commands, these devices must enhance and distinguish the (clean) speech signal from the recorded noisy signal (which is contaminated by interference and background noise). The enhanced speech signal is then analyzed locally or in the cloud to extract the command. This speech enhancement task can be achieved effectively if the number of recording microphones is large, but incorporating many microphones is only possible in large and expensive devices. With multiple microphones present, the computational complexity of speech enhancement algorithms is high, along with their power consumption requirements. If the device under consideration is small, with limited power and computational capabilities, having multiple microphones is not possible; hearing aids and cochlear implants are examples. Thus, most of these devices have been developed with a single microphone. As a result of this handicap, developing a speech enhancement algorithm for assistive listening devices with a single microphone, while keeping the algorithm's computational complexity and power consumption low, is a challenging problem. There has been considerable research to solve this problem with good speech enhancement performance. However, most real-time speech enhancement algorithms lose their effectiveness if the level of noise in the recorded speech is high. This dissertation deals with this problem: the objective is to develop a method that enhances performance by reducing the noise level of the input signal. To this end, it is proposed to include a pre-processing step before applying speech enhancement algorithms. This pre-processing performs noise suppression in the transformed domain by generating an approximation of the noisy signal's short-time Fourier transform. The approximated signal, with improved input signal-to-noise ratio, is then used by other speech enhancement algorithms to recover the underlying clean signal. This approximation is performed using the proposed Block-Principal Component Analysis (Block-PCA) algorithm. To illustrate the efficacy of the methodology, a detailed performance analysis under multiple noise types and noise levels follows, demonstrating that the inclusion of the pre-processing step considerably improves the performance of speech enhancement algorithms compared to approaches with no pre-processing step.
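    The dissertation's Block-PCA algorithm itself is not reproduced in this abstract; purely as an illustration of the underlying idea (a low-rank approximation of the noisy signal's STFT used as a pre-processing step to raise input SNR), here is a plain SVD-based sketch. Function names and parameters are illustrative, not the author's method.

```python
import numpy as np
from scipy.signal import stft, istft

def pca_denoise_stft(x, fs, rank=8, nperseg=512):
    """Low-rank approximation of the noisy STFT as a pre-processing step.

    Generic PCA/SVD sketch of the idea, not the dissertation's Block-PCA:
    keep only the strongest components of the magnitude spectrogram
    (where speech energy concentrates) and rebuild with the noisy phase.
    """
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(Z), np.angle(Z)
    U, s, Vt = np.linalg.svd(mag, full_matrices=False)
    mag_lr = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-limited magnitude
    _, x_hat = istft(mag_lr * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return x_hat

# Hypothetical usage: pre-process before a downstream speech enhancer.
fs = 16000
rng = np.random.default_rng(2)
noisy = rng.standard_normal(fs)          # placeholder noisy signal
pre_cleaned = pca_denoise_stft(noisy, fs)
```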

    Hearing the message and seeing the messenger: The role of talker information in spoken language comprehension

    Get PDF
    The acoustic signal consists of various layers of information that we often process unconsciously. Most importantly, it contains both linguistic and indexical information, the two fundamental components of the sound input. Even though the meaning of a word does not change when it is spoken by multiple speakers, the same word never sounds exactly the same, because individuals introduce all kinds of variation into the speech input. Hence, through segmental and suprasegmental information, listeners can discern the nativeness of the talker (native vs. non-native) and the age of the talker (adult vs. child). Both non-native talkers and child talkers deviate from the pronunciation norms of native adults and show variation both within and between talkers. The main difference between non-native adults and native children is that, for non-native talkers, variation is driven by their native language: the phonological structures of their native language interact with their second language, so they retain a foreign accent. For children, by contrast, variation is driven by development, such that children's motor-skill competencies depend on their current stage of language development. While there has been extensive research on foreign-accented speech, little is known about child speech; its processing, in particular, has so far been investigated by only a few studies. Hence, the central question of this dissertation is "What is the role of talker information in spoken language comprehension?" This question was investigated from three distinct angles: the first project examined talker information from an auditory-only perspective, the second from an audio-visual perspective, and the third studied the impact of talker information on listeners' credibility ratings in a sociolinguistic context.

    Fictional gay men and gayspeak in twenty-first century British drama

    Get PDF
    This research lies in the field of Language and Sexuality Studies and examines how playwrights have characterised fictional gay men in 21st-century British drama. It analyses a corpus of 61 plays staged between 2000 and 2020, portraying 187 gay male characters. This work explores the corpus from three different perspectives and in the light of methodological triangulation, proceeding from the general to the particular. It starts with a brief excursus on 20th- and 21st-century British drama portraying gay characters, considering stage censorship and the laws regulating gay rights in the UK. General trends in the representation of homosexuality in 21st-century British drama are traced diachronically. The second section investigates how the 187 fictional gay men in the corpus are characterised in present-day British drama. The gay characters are classified using variables common to all sociolinguistic studies – e.g. age, social class, linguistic variety – but also variables specific to Language and Sexuality Studies, such as the level of secrecy/out-of-the-closetedness and their own version of gayspeak. The final section takes an eclectic approach and provides a multi-faceted picture of the fictional gayspeak included in the corpus. The variety is analysed both manually and through a corpus-assisted approach using the software #Lancsbox. Based on previous research, a linguistic framework for analysing present-day fictional gayspeak is presented. The main aim of this section is to assess whether the features of gayspeak examined in past studies (see Sonenschein, 1969; Stanley, 1970; Lakoff, 1975; Hayes, 1976; Zwicky, 1997; Harvey, 1998, 2000, 2002, to name a few) are still found in the corpus. This thesis contributes to the existing literature for at least three reasons: (a) to my knowledge, there is no academic research on British drama that deals exclusively with the portrayal of gay characters in the last twenty years; (b) there are, to my knowledge, no recent academic studies reassessing the purely linguistic features of gayspeak; and (c) this study intends to contribute to the field of Language and Sexuality Studies by also applying the methodologies of Corpus Linguistics, which is still relatively rare in this field of research.
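    The corpus-assisted analysis in this work is carried out in #Lancsbox; purely as an illustration of the kind of relative-frequency query involved, here is a minimal Python sketch. The directory of play transcripts and the word list are hypothetical, and the list is illustrative, not drawn from the thesis.

```python
from collections import Counter
from pathlib import Path
import re

# Hypothetical setup: one plain-text file per play, and an illustrative
# (non-exhaustive) list of candidate lexical items from past gayspeak studies.
corpus_dir = Path("plays")                    # placeholder directory
candidates = {"camp", "fabulous", "queen", "darling"}

counts, total_tokens = Counter(), 0
for play in corpus_dir.glob("*.txt"):
    tokens = re.findall(r"[a-z']+", play.read_text(encoding="utf-8").lower())
    total_tokens += len(tokens)
    counts.update(tok for tok in tokens if tok in candidates)

# Relative frequency per million tokens, a standard corpus-linguistics measure.
for word, n in counts.most_common():
    print(f"{word}: {n / total_tokens * 1e6:.1f} per million")
```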

    CLARIN

    Get PDF
    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and the challenges that CLARIN will tackle in the future. The book is published ten years after the establishment of CLARIN as a European Research Infrastructure Consortium.

    The role of correct pronunciation and intonation in teaching Italian as a foreign language through blended learning: a guide to the core sounds of the Italian language for English native speakers

    Get PDF
    Although many scholars have emphasised the value of pronunciation and intonation training as fundamental in FL (foreign language) teaching, the practice of these skills still seems to be neglected by practitioners. Segmentals and suprasegmentals are often absent from Italian FL courses, based on the claim that the phonology of Italian is rather easy and that students are expected to pick it up along the way. Proceeding from the recognition that a difference exists between the theory and the practice of integrating segmental and suprasegmental training into FL courses, this qualitative study investigates learners' views about the role of pronunciation and intonation in learning foreign languages, in particular Italian FL, and about the use of new dedicated materials and technological tools deployed for teaching these phonological skills online. Findings confirm the potential of materials devised specifically for pronunciation and intonation acquisition, and the benefits of using dedicated online voice-recording tools to promote the development of phonological skills and boost students' phonological and cultural awareness; however, they reveal that such potential often remains unrealised in the FL classroom. The role of teachers, in terms of their beliefs about, knowledge of, and approach to teaching pronunciation and intonation in FL courses, emerges as crucial. Findings also highlight the need for a deeper understanding of how pronunciation and intonation training can positively affect students' learning outcomes, and of how these skills should be systematically and appropriately addressed in the FL class.

    Conference Proceedings of the Euroregio / BNAM 2022 Joint Acoustic Conference

    Get PDF