185 research outputs found

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field

    Backward transfer of Glaswegian English on Indian English and Hindi: a case of simultaneous bilingual and bidialectal contact and interaction in Indian immigrants in Glasgow

    In the wider context of Second Language Acquisition, much evidence has been found for phonological backward transfer across languages, but there are still various facets of it that remain unknown. This thesis investigates three such aspects: (1) the role of systemic similarity between linguistic varieties in affecting backward transfer, (2) differences between backward transfer across languages and backward transfer across dialects, and (3) the role of multiple sociolinguistic and psycholinguistic factors in affecting backward transfer. To this end, this study examined the first-generation bilingual adult Indian immigrant community in Glasgow, ‘Glaswasians’ (n = 38), who were bilingual in Hindi and Indian English prior to arriving in Glasgow and are now in contact with the dominant host variety in Glasgow, Glaswegian English. In addition to Glaswasians, two control groups were recruited: ‘Glaswegians’ (n = 34), native speakers of Glaswegian English who reside in Glasgow, and ‘Indians’ (n = 31), native speakers of Indian English and Hindi who reside in India and have never been in contact with Glaswegian English. To investigate the first aspect, an XAB similarity judgement task was carried out to determine whether, in addition to typological similarity, Indian English is also perceptually more similar to Glaswegian English than Hindi is, and therefore more vulnerable to transfer from Glaswegian English. The two control groups participated in this task and the results did not indicate a pattern of consistent similarity between Indian English and Glaswegian English phones, as compared to Hindi phones. To examine phonological backward transfer across languages versus dialects, the three speaker groups participated in a speech production task.
Multiple phone categories were examined for various phonetic cues: (1) /l/ for F2-F1 difference, (2) GOOSE vowel for F1, F2, F3, (3) /t/ for Voice Onset Time (VOT), (4) voiced stops /b d g/ for VOT, Voicing During Closure (VCD) and Relative Burst Intensity (RBI). The results, which were mixed, were interpreted with respect to Flege’s Speech Learning Model (1995b; Flege & Bohn, 2021) and its predictions of assimilation and dissimilation. Out of the three occasions of differences in the amount of transfer exhibited by Hindi and English, English underwent quantitatively more assimilation than Hindi on two occasions (VOT in /t/ and /d/), whereas Hindi underwent quantitatively more dissimilation than English on one occasion (F2-F1 difference in /l/). Finally, to examine the role of sociolinguistic and psycholinguistic factors in affecting backward transfer, data was collected from Glaswasians. A questionnaire task was used to collect data on gender, age of entry and length of residence in Glasgow, language proficiency and dominance, contact and identity, and perceived discrimination. Multiple psychometric tasks were used to collect data on language switching ability and inhibitory skills. The results indicated that most of these factors influenced backward transfer and had a general effect across phones and corresponding features. For instance, higher Age of Entry and Length of Residence in Glasgow, Indian Identity, Indian Contact and higher inhibition were generally associated with more native-like or exaggeratedly native-like shifts, whereas higher Glaswegian Contact and Glaswegian Identity were related to shifts towards Glaswegian English. There were, however, exceptions to the general effects of these predictors, such as for the phone categories /t/ and /g/. This finding is discussed in relation to the salience of these categories in the respective native and host linguistic varieties.
The results of this study are discussed with reference to patterns of transfer and influence of factors found in previous research. Additionally, their implications about the nature of the adult bilingual-bidialectal system, its flexibility and the apparent lack of strong correspondence between perceptual similarity and backward transfer effects, are discussed. These findings also contribute to the knowledge on transfer effects across languages versus dialects and add to what was previously known about Indian English, Hindi and Glaswegian English. A model of backward transfer, the ‘Proximity Modulated Transfer Hypothesis’, is proposed to understand the manner of interaction between Glaswegian English and Hindi and Indian English in this situation of simultaneous bilingual and bidialectal interaction in relation to backward effects discovered across the various phones and corresponding features

    Introduction to Psycholinguistics


    Investigating supra-intelligibility aspects of speech

    158 p. Synthetic and recorded speech form a great part of our everyday listening experience, and much of our exposure to these forms of speech occurs in potentially noisy settings such as on public transport, in the classroom or workplace, while driving, and in our homes. Optimising speech output to ensure that salient information is both correctly and effortlessly received is a main concern for the designers of applications that make use of the speech modality. Most of the focus in adapting speech output to challenging listening conditions has been on intelligibility, and specifically on enhancing intelligibility by modifying speech prior to presentation. However, the quality of the generated speech is not always satisfying for the recipient, which might lead to fatigue, or reluctance in using this communication modality. Consequently, a sole focus on intelligibility enhancement provides an incomplete picture of a listener's experience since the effect of modified or synthetic speech on other characteristics risks being ignored. These concerns motivate the study of 'supra-intelligibility' factors such as the additional cognitive demand that modified speech may well impose upon listeners, as well as quality, naturalness, distortion and pleasantness. This thesis reports on an investigation into two supra-intelligibility factors: listening effort and listener preferences. Differences in listening effort across four speech types (plain natural, Lombard, algorithmically-enhanced, and synthetic speech) were measured using existing methods, including pupillometry, subjective judgements, and intelligibility scores. To explore the effects of speech features on listener preferences, a new tool, SpeechAdjuster, was developed. SpeechAdjuster allows the manipulation of virtually any aspect of speech and supports the joint elicitation of listener preferences and intelligibility measures.
The tool reverses the roles of listener and experimenter by allowing listeners direct control of speech characteristics in real-time. Several experiments to explore the effects of speech properties on listening preferences and intelligibility using SpeechAdjuster were conducted. Participants were permitted to change a speech feature during an open-ended adjustment phase, followed by a test phase in which they identified speech presented with the feature value selected at the end of the adjustment phase. Experiments with native normal-hearing listeners measured the consequences of allowing listeners to change speech rate, fundamental frequency, and other features which led to spectral energy redistribution. Speech stimuli were presented in both quiet and masked conditions. Results revealed that listeners prefer feature modifications similar to those observed in naturally modified speech in noise (Lombard speech). Further, Lombard speech required the least listening effort compared to either plain natural, algorithmically-enhanced, or synthetic speech. For stationary noise, as noise level increased listeners chose slower speech rates and flatter tilts compared to the original speech. Only the choice of fundamental frequency was not consistent with that observed in Lombard speech. It is possible that features such as fundamental frequency that talkers naturally modify are by-products of the speech type (e.g. hyperarticulated speech) and might not be advantageous for the listener. Findings suggest that listener preferences provide information about the processing of speech over and above that measured by intelligibility. One of the listeners' concerns was to maximise intelligibility.
In noise, listeners preferred the feature values for which more information survived masking, choosing speech rates that led to a contrast with the modulation rate of the masker, or modifications that led to a shift of spectral energy concentration to higher frequencies compared to those of the masker. For all features being modified by listeners, preferences were evident even when intelligibility was at or close to ceiling levels. Such preferences might result from a desire to reduce the cognitive effort of understanding speech, or from a desire to reproduce the sound of typical speech features experienced in real-world noisy conditions, or to optimise the quality of the modified signal. Investigation of supra-intelligibility aspects of speech promises to improve the quality of speech enhancement algorithms, bringing with it the potential of reducing the effort of understanding artificially-modified or generated forms of speech

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The MAVEBA Workshop proceedings, held on a biannual basis, collect the scientific papers presented both as oral and poster contributions, during the conference. The main subjects are: development of theoretical and mechanical models as an aid to the study of main phonatory dysfunctions, as well as the biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and classification of vocal pathologies

    The evolution of language: Proceedings of the Joint Conference on Language Evolution (JCoLE)


    Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy

    The MAVEBA Workshop proceedings, held on a biannual basis, collect the scientific papers presented both as oral and poster contributions, during the conference. The main subjects are: development of theoretical and mechanical models as an aid to the study of main phonatory dysfunctions, as well as the biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and classification of vocal pathologies. The Workshop has the sponsorship of: Ente Cassa Risparmio di Firenze, COST Action 2103, Biomedical Signal Processing and Control Journal (Elsevier Eds.), IEEE Biomedical Engineering Soc. Special Issues of International Journals have been, and will be, published, collecting selected papers from the conference

    Quality of experience in telemeetings and videoconferencing: a comprehensive survey

    Telemeetings such as audiovisual conferences or virtual meetings play an increasingly important role in our professional and private lives. For that reason, system developers and service providers will strive for an optimal experience for the user, while at the same time optimizing technical and financial resources. This leads to the discipline of Quality of Experience (QoE), an active field originating from the telecommunication and multimedia engineering domains, that strives for understanding, measuring, and designing the quality experience with multimedia technology. This paper provides the reader with an entry point to the large and still growing field of QoE of telemeetings, by taking a holistic perspective, considering both technical and non-technical aspects, and by focusing on current and near-future services. Addressing both researchers and practitioners, the paper first provides a comprehensive survey of factors and processes that contribute to the QoE of telemeetings, followed by an overview of relevant state-of-the-art methods for QoE assessment. To embed this knowledge into recent technology developments, the paper continues with an overview of current trends, focusing on the field of eXtended Reality (XR) applications for communication purposes. Given the complexity of telemeeting QoE and the current trends, new challenges for a QoE assessment of telemeetings are identified. To overcome these challenges, the paper presents a novel Profile Template for characterizing telemeetings from the holistic perspective endorsed in this paper