
    Augmenting automatic speech recognition and search models for spoken content retrieval

    Spoken content retrieval (SCR) is the process of providing a user with spoken documents in which the user is potentially interested. Unlike textual documents, searching through speech is not trivial due to its representation. Generally, automatic speech recognition (ASR) is used to transcribe spoken content, such as user-generated videos and podcast episodes, into transcripts before search operations are performed. Despite recent improvements in ASR, transcription errors can still be present in automatic transcripts, particularly when ASR is applied to out-of-domain data or speech with background noise. This thesis explores improvements to ASR systems and search models for enhanced SCR on user-generated spoken content. Three topics are explored. Firstly, the use of multimodal signals for ASR is investigated, motivated by the aim of integrating the background context of spoken content into ASR; integrating visual signals and document metadata into ASR is hypothesised to produce transcripts better aligned with the background context of the speech. Secondly, semi-supervised training and content genre information from metadata are exploited for ASR, in order to mitigate the transcription errors caused by recognition of out-of-domain speech. Thirdly, neural search models and their extension to N-best ASR transcripts are investigated; using N-best rather than 1-best transcripts in search models is motivated by the fact that "key terms" missed in the 1-best transcript can still be present in the N-best list. A series of experiments is conducted to examine these approaches to improving ASR systems and search models. The findings suggest that semi-supervised training brings practical improvements to ASR systems for SCR, and that neural ranking models, in particular when combined with N-best transcripts, improve known-item search results over the baseline BM25 model.
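
    As a rough illustration of the N-best idea described above (a minimal sketch, not the thesis's actual search models), the Python snippet below represents each spoken document by the concatenation of its N-best ASR hypotheses before scoring with a standard BM25 function, so that "key terms" missing from the 1-best transcript can still be matched. The documents, hypotheses, query, and parameter values are made up for illustration.

        import math
        from collections import Counter

        def bm25_scores(query, docs, k1=1.2, b=0.75):
            """Score each tokenised document against the query with standard BM25."""
            N = len(docs)
            avgdl = sum(len(d) for d in docs) / N
            # document frequency of each query term
            df = {t: sum(1 for d in docs if t in d) for t in set(query)}
            scores = []
            for d in docs:
                tf = Counter(d)
                s = 0.0
                for t in query:
                    if t not in tf:
                        continue
                    idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
                    s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
                scores.append(s)
            return scores

        # Each spoken document is represented by its N-best ASR hypotheses;
        # concatenating them lets terms missed in the 1-best still be indexed.
        nbest_per_doc = [
            ["the weather forecast for dublin", "the weather forecast for doubling"],
            ["interview with the band about their new album",
             "interview with the ban about their new album"],
        ]
        docs = [" ".join(hyps).split() for hyps in nbest_per_doc]
        query = "weather dublin".split()
        print(bm25_scores(query, docs))  # the first document scores highest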

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advances in automatic speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCCs and HMMs, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover the speech-processing tasks, datasets, and benchmarks used in the literature, and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.
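
    To make the pipeline discussed in this review concrete, here is a minimal sketch (illustrative only) of the classic MFCC front-end feeding a small Transformer encoder of the kind surveyed; the file name, sample rate, and layer sizes are assumptions, and the snippet relies on the standard librosa and PyTorch APIs.

        import librosa
        import torch
        import torch.nn as nn

        # Classic hand-crafted front-end: mel-frequency cepstral coefficients.
        y, sr = librosa.load("utterance.wav", sr=16000)           # hypothetical audio file
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape: (13, frames)
        feats = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)  # (1, frames, 13)

        # A tiny Transformer encoder standing in for the deep architectures surveyed.
        proj = nn.Linear(13, 64)
        encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
            num_layers=2,
        )
        hidden = encoder(proj(feats))                             # (1, frames, 64)
        print(hidden.shape)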

    Albayzin 2018 Evaluation: The IberSpeech-RTVE Challenge on Speech Technologies for Spanish Broadcast Media

    The IberSpeech-RTVE Challenge presented at IberSpeech 2018 is a new Albayzin evaluation series supported by the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla, RTTH). The series focused on speech-to-text transcription, speaker diarization, and multimodal diarization of television programs. For this purpose, the Corporación Radio Televisión Española (RTVE), the main public service broadcaster in Spain, and the RTVE Chair at the University of Zaragoza made more than 500 h of broadcast content and subtitles available to scientists. The dataset included about 20 programs of different kinds and topics produced and broadcast by RTVE between 2015 and 2018. The programs presented various challenges from the point of view of speech technologies, such as the diversity of Spanish accents, overlapping speech, spontaneous speech, acoustic variability, background noise, and specific vocabulary. This paper describes the database and the evaluation process and summarizes the results obtained.
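
    Speech-to-text submissions to evaluations such as this one are typically scored with word error rate (WER). The short sketch below computes WER with a standard edit-distance recursion; the example strings are made up and this is not the official Albayzin scoring tool.

        def wer(reference: str, hypothesis: str) -> float:
            """Word error rate: word-level edit distance divided by reference length."""
            ref, hyp = reference.split(), hypothesis.split()
            # d[i][j] = edit distance between the first i reference and first j hypothesis words
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                  d[i][j - 1] + 1,         # insertion
                                  d[i - 1][j - 1] + cost)  # substitution
            return d[len(ref)][len(hyp)] / len(ref)

        print(wer("el tiempo en zaragoza para mañana",
                  "el tiempo en zaragoza para la mañana"))  # one insertion over six words ≈ 0.167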

    An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

    Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, which are generally used to compare different systems and determine their performance.
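
    As a concrete (and deliberately simplified) illustration of the fusion schemes discussed above, the PyTorch sketch below encodes a noisy magnitude spectrogram and per-frame lip-region embeddings separately, fuses them by concatenation, and predicts a time-frequency mask. Layer sizes, feature dimensions, and the concatenation-based late fusion are illustrative assumptions rather than a specific published model.

        import torch
        import torch.nn as nn

        class AVMaskEstimator(nn.Module):
            def __init__(self, n_freq=257, lip_dim=128, hidden=256):
                super().__init__()
                self.audio_enc = nn.LSTM(n_freq, hidden, batch_first=True)
                self.video_enc = nn.LSTM(lip_dim, hidden, batch_first=True)
                self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

            def forward(self, noisy_mag, lip_feats):
                # noisy_mag: (batch, frames, n_freq); lip_feats: (batch, frames, lip_dim),
                # assumed here to be aligned to the same frame rate.
                a, _ = self.audio_enc(noisy_mag)
                v, _ = self.video_enc(lip_feats)
                fused = torch.cat([a, v], dim=-1)       # late fusion by concatenation
                mask = self.mask_head(fused)            # mask values in [0, 1]
                return mask * noisy_mag                 # enhanced magnitude spectrogram

        model = AVMaskEstimator()
        audio = torch.rand(1, 100, 257)    # dummy spectrogram frames
        video = torch.rand(1, 100, 128)    # dummy lip-region embeddings
        print(model(audio, video).shape)   # torch.Size([1, 100, 257])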