    Proceedings of the 9th PLA Symposium

    This is the combined conference proceedings of the 9th annual Purdue Linguistics Association (PLA) Symposium

    Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation

    High-quality data labeling for specific domains is costly and time-consuming for human annotators. In this work, we propose a self-supervised domain adaptation method based on an iterative pseudo-forced alignment algorithm. The produced alignments are used to customize an end-to-end Automatic Speech Recognition (ASR) system and are iteratively refined. The algorithm is fed with frame-wise character posteriors produced by a seed ASR, trained with out-of-domain data and optimized with a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively on a corpus of broadcast TV. The process is repeated, reducing the quantity of text to be aligned or expanding the alignment window, until the best possible audio-text alignment is found. The starting timestamps, or temporal anchors, are produced solely from the confidence score of the last aligned utterance. This score is computed with the paths of the CTC-alignment matrix. With this methodology, no human-revised text references are required. Alignments from long audio files with low-quality transcriptions, like TV captions, are filtered by confidence score and made ready for further ASR adaptation. The obtained results, on both the Spanish RTVE2022 and CommonVoice databases, underpin the feasibility of using CTC-based systems to perform highly accurate audio-text alignments, domain adaptation and semi-supervised training of end-to-end ASR. Comment: 5 pages, 4 figures, IberSPEECH202

    On the automatic segmentation of transcribed words

    Linguistically-motivated sub-word modeling with applications to speech recognition

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Includes bibliographical references (p. 173-185). Despite the proliferation of speech-enabled applications and devices, speech-driven human-machine interaction still faces several challenges. One of these issues is the new word or out-of-vocabulary (OOV) problem, which occurs when the underlying automatic speech recognizer (ASR) encounters a word it does not "know". With ASR being deployed in constantly evolving domains such as restaurant ratings or music querying, as well as on handheld devices, the new word problem continues to arise. This thesis is concerned with the OOV problem, and in particular with the process of modeling and learning the lexical properties of an OOV word through a linguistically-motivated sub-syllabic model. The linguistic model is designed using a context-free grammar which describes the sub-syllabic structure of English words, and encapsulates phonotactic and phonological constraints. The context-free grammar is supported by a probability model, which captures the statistics of the parses generated by the grammar and encodes spatio-temporal context. The two main outcomes of the grammar design are: (1) sub-word units, which encode pronunciation information, and can be viewed as clusters of phonemes; and (2) a high-quality alignment between graphemic and sub-word units, which results in hybrid entities denoted as spellnemes. The spellneme units are used in the design of a statistical bi-directional letter-to-sound (L2S) model, which plays a significant role in automatically learning the spelling and pronunciation of a new word. The sub-word units and the L2S model are assessed on the task of automatic lexicon generation. In a first set of experiments, knowledge of the spelling of the lexicon is assumed. It is shown that the phonemic pronunciations associated with the lexicon can be successfully learned using the L2S model as well as a sub-word recognizer. In a second set of experiments, the assumption of perfect spelling knowledge is relaxed, and an iterative and unsupervised algorithm, denoted as Turbo-style, makes use of spoken instances of both spellings and words to learn the lexical entries in a dictionary. Sub-word speech recognition is also embedded in a parallel fashion as a backoff mechanism for a word recognizer. The resulting hybrid model is evaluated in a lexical access application, whereby a word recognizer first attempts to recognize an isolated word. Upon failure of the word recognizer, the sub-word recognizer is manually triggered. Preliminary results show that such a hybrid set-up outperforms a large-vocabulary recognizer. Finally, the sub-word units are embedded in a flat hybrid OOV model for continuous ASR. The hybrid ASR is deployed as a front-end to a song retrieval application, which is queried via spoken lyrics. Vocabulary compression and open-ended query recognition are achieved by designing a hybrid ASR. The performance of the front-end recognition system is reported in terms of sentence, word, and sub-word error rates. The hybrid ASR is shown to outperform a word-only system over a range of out-of-vocabulary rates (1%-50%). The retrieval performance is thoroughly assessed as a function of ASR N-best size, language model order, and the index size. Moreover, it is shown that the sub-words outperform alternative linguistically-motivated sub-lexical units such as phonemes. Finally, it is observed that a dramatic vocabulary compression, by more than a factor of 10, is accompanied by a minor loss in song retrieval performance. By Ghinwa F. Choueiter. Ph.D.

    A characterization of the problem of new, out-of-vocabulary words in continuous-speech recognition and understanding

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (p. 167-173). By Irvine Lee Hetherington. Ph.D.

    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models, for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation. Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200

    Nodalida 2005 - proceedings of the 15th NODALIDA conference
