
    Lightly supervised GMM VAD to use audiobook for speech synthesiser

    Audiobooks have attracted attention as promising data for training text-to-speech (TTS) systems. However, they usually lack a correspondence between the audio and text data, and are typically divided only into chapter-level units. In practice, the audio and text must be aligned before they can be used to build TTS synthesisers, yet aligning them is time-consuming, involves manual labour, and requires people skilled in speech processing. We have previously proposed using graphemes to align speech and text automatically. This paper further integrates a lightly supervised voice activity detection (VAD) technique that detects sentence boundaries as a pre-processing step before the grapheme approach. The lightly supervised technique requires time stamps of speech and silence only for the first fifty sentences. Combining the two, we can semi-automatically build TTS systems from audiobooks with minimal manual intervention. Through subjective evaluations we analyse how the grapheme-based aligner and/or the proposed VAD technique affect the quality of HMM-based speech synthesisers trained on audiobooks. Index Terms: voice activity detection, lightly supervised, audiobook, HMM-based speech synthesis
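The abstract does not give implementation details, so the following is only a minimal sketch of the lightly supervised idea: fit one model per class on log-energy features from the small labelled bootstrap set (here a single Gaussian per class rather than a full GMM, a deliberate simplification), then classify the remaining frames by likelihood ratio. The frame format and feature choice are assumptions, not the paper's method.

```python
import math
from statistics import mean, pstdev

def log_energy(frame):
    """Log mean-squared energy of one frame (a list of samples)."""
    return math.log(sum(s * s for s in frame) / len(frame) + 1e-10)

class GaussianVAD:
    """Lightly supervised VAD sketch: one Gaussian per class, trained on
    log-energy features from the few frames whose speech/silence labels
    come from the manually time-stamped bootstrap sentences."""

    def fit(self, speech_frames, silence_frames):
        sp = [log_energy(f) for f in speech_frames]
        si = [log_energy(f) for f in silence_frames]
        # `or 1e-3` guards against a zero stdev on tiny bootstrap sets
        self.mu_sp, self.sd_sp = mean(sp), pstdev(sp) or 1e-3
        self.mu_si, self.sd_si = mean(si), pstdev(si) or 1e-3
        return self

    @staticmethod
    def _loglik(x, mu, sd):
        return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)

    def is_speech(self, frame):
        e = log_energy(frame)
        return self._loglik(e, self.mu_sp, self.sd_sp) > self._loglik(e, self.mu_si, self.sd_si)
```

Runs of frames classified as silence would then be taken as candidate sentence boundaries before the grapheme-based alignment pass.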

    An Automatic Real-time Synchronization of Live speech with Its Transcription Approach

    Most studies in automatic synchronization of speech and transcription focus on the sentence or phrase level. Nevertheless, in some languages, such as Thai, boundaries at these levels are difficult to define linguistically, especially when synchronizing speech with its transcription. Consequently, synchronization at a finer level, the syllabic level, is promising. In this article, an approach to synchronize live speech with its corresponding transcription in real time at the syllabic level is proposed. Our approach employs the modified real-time syllable detection procedure from our previous work; a transcription verification procedure is then adopted to verify correctness and to recover errors caused by the real-time syllable detection procedure. In the experiments, the acoustic features and parameters are tuned empirically, and results are compared with two baselines previously applied to the Thai scenario. Experimental results indicate that our approach outperforms the two baselines with error-rate reductions of 75.9% and 41.9%, respectively, and can also deliver results in real time. In addition, our approach is applied in a practical application, namely ChulaDAISY. Practical experiments show that ChulaDAISY, combined with our approach, reduces the time needed to produce audiobooks.
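The verification step compares the detected syllable sequence against the reference transcription. The paper's exact procedure is not described here, but a common building block for such verification is a dynamic-programming edit distance between hypothesis and reference syllables; the sketch below (function names and the error-rate definition are assumptions) shows that idea.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences, single-row DP."""
    d = list(range(len(ref) + 1))      # d[j] = distance(hyp[:i], ref[:j])
    for i in range(1, len(hyp) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(ref) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                                   # deletion
                       d[j - 1] + 1,                               # insertion
                       prev_diag + (hyp[i - 1] != ref[j - 1]))     # substitution/match
            prev_diag = cur
    return d[-1]

def syllable_error_rate(hyp_syllables, ref_syllables):
    """Fraction of reference syllables mis-detected, a simple
    verification score for a real-time syllable detector."""
    return edit_distance(hyp_syllables, ref_syllables) / max(len(ref_syllables), 1)
```

A verifier could accept a detected segment when its syllable error rate against the expected transcription span falls below a threshold, and trigger error recovery otherwise.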

    A SYSTEM FOR AUTOMATIC ALIGNMENT OF BROADCAST MEDIA CAPTIONS USING WEIGHTED FINITE-STATE TRANSDUCERS

    We describe our system for alignment of broadcast media captions in the 2015 MGB Challenge. A precise time alignment of previously generated subtitles to media data is important in the caption-generation process used by broadcasters. However, the task is challenging because the audio content is highly diverse and often noisy, and because the subtitles are frequently not a verbatim representation of the words actually spoken. Our system employs a two-pass approach with appropriately constrained weighted finite-state transducers (WFSTs) to enable good alignment even when the audio quality would be challenging for conventional ASR. The system achieves an f-score of 0.8965 on the MGB Challenge development set.
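The f-score quoted above is an evaluation of word-level alignment. The challenge's exact matching criterion is not stated in this abstract, so the sketch below assumes a tolerance-based match on word start/end times; it only illustrates how precision, recall, and f-score combine for an alignment task.

```python
def f_score(hyp, ref, tol=0.1):
    """hyp, ref: lists of (word, start_sec, end_sec).
    A hypothesis word counts as a hit if some unmatched reference word
    has the same label and both boundaries within `tol` seconds."""
    matched, hits = set(), 0
    for w, s, e in hyp:
        for k, (rw, rs, re) in enumerate(ref):
            if k in matched:
                continue
            if w == rw and abs(s - rs) <= tol and abs(e - re) <= tol:
                matched.add(k)
                hits += 1
                break
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With a verbatim, well-aligned hypothesis the score is 1.0; deletions lower recall and spurious words lower precision, and the harmonic mean penalises either imbalance.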

    Text to Audio Alignment

    This bachelor thesis studies a tool for automatic text-to-audio alignment at the level of individual graphemes and phonemes. It also discusses possible alignment techniques and the limitations and difficulties that need to be taken into account. The studied tool uses an approach based on grapheme-to-phoneme conversion with joint-sequence models. The data used in the experiments are TV broadcast recordings from the Multi-Genre Broadcast Challenge 2015.
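Joint-sequence models are trained from co-segmentations of grapheme and phoneme strings into "graphone" units. As a toy illustration only (the `match` predicate and the deletion/insertion costs below are assumptions, not the thesis's model), a dynamic-programming aligner producing 1-to-1 and 1-to-0 graphone pairs can be sketched as follows.

```python
def align_graphones(graphemes, phonemes, match):
    """Align a grapheme string with a phoneme list into 1-to-1 / 1-to-0
    'graphone' pairs by DP, minimising mismatch cost. `match(g, p)` says
    whether grapheme g plausibly maps to phoneme p."""
    G, P = len(graphemes), len(phonemes)
    INF = float("inf")
    cost = [[INF] * (P + 1) for _ in range(G + 1)]
    back = [[None] * (P + 1) for _ in range(G + 1)]
    cost[0][0] = 0
    for i in range(G + 1):
        for j in range(P + 1):
            if cost[i][j] == INF:
                continue
            if i < G and j < P:                 # pair grapheme with phoneme
                c = cost[i][j] + (0 if match(graphemes[i], phonemes[j]) else 1)
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < G:                           # silent grapheme (maps to nothing)
                c = cost[i][j] + 0.5
                if c < cost[i + 1][j]:
                    cost[i + 1][j], back[i + 1][j] = c, (i, j)
            if j < P:                           # phoneme with no grapheme
                c = cost[i][j] + 0.5
                if c < cost[i][j + 1]:
                    cost[i][j + 1], back[i][j + 1] = c, (i, j)
    pairs, i, j = [], G, P                      # backtrace
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((graphemes[pi] if i > pi else "",
                      phonemes[pj] if j > pj else ""))
        i, j = pi, pj
    return list(reversed(pairs))
```

Real joint-sequence models learn these segmentation probabilities with EM over many-to-many graphones; this sketch only shows the alignment lattice they are built on.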

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advances in automatic speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover the speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been used to tackle them. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.

    Proceedings of the 7th Sound and Music Computing Conference

    Proceedings of the SMC2010 - 7th Sound and Music Computing Conference, July 21st - July 24th 2010

    Sonic Phantoms: Compositional explorations of perceptual phantom patterns

    I use the term ‘Sonic Phantoms’ to refer collectively to a cohesive body of sound compositions that I have developed over the past five years (2009-2014; fifty pieces, structured in four separate collections / series), dealing at a fundamental level with perceptual auditory illusions. For this compositional body of work, I have developed a syncretic approach that encompasses and coalesces many kinds of sources, materials, techniques and compositional tools: voices (real and synthetic), field recordings (involving wilderness expeditions worldwide), instrument manipulation (including novel ways of ‘preparation’), object amplification, improvisation and recording studio techniques. This manifests a sonic- and perception-based understanding of the compositional work, implicitly proposed as a paradigm for any equivalent work in terms of its trans-technological, phenomena-based nature. Through the collection of pieces created and the research and contextualisation presented, my work with ‘Sonic Phantoms’ aims to bring into focus, shape and define a specific and dedicated compositional realm that treats auditory illusions as essential components of the work, not mere side effects. I play with sonic materials that are either naturally ambiguous or have been composed to attain this quality, in order to exploit the potential for apophenia to manifest, bringing with it the ‘phantasmatic’ presence. Both my compositions and research integrate and synergise a considerable number of disparate musical traditions (Western and non-Western), techno-historical moments (from ancient / archaic to electronic / computer-age techniques), cultural frameworks (from ‘serious’ to ‘popular’), and fields of interest / expertise (from the psychological to the musical) into a personal and cohesive compositional whole. All these diverse elements are not simply mentioned or referenced; they have defined, structured and formed the resulting compositional work.

    Recent Advances in Signal Processing

    Signal processing is a critical component of most new technological inventions and of a wide variety of applications in both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian, and have always favored closed-form tractability over real-world accuracy; these constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be grouped into five areas depending on the application at hand: image processing, speech processing, communication systems, time-series analysis, and educational packages, in that order. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.