    Transforming services for children with hearing difficulty and their families: a good practice guide

    Many uses, many annotations for large speech corpora: Switchboard and TDT as case studies

    This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out as separate projects that were dispersed both geographically and chronologically. The TDT2 corpus has also received a variety of annotations, but all directly created or managed by a core group. In both cases, issues arise involving the propagation of repairs, consistency of references, and the ability to integrate annotations having different formats and levels of detail. We describe a general framework whereby these issues can be addressed successfully. Comment: 7 pages, 2 figures.

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State Vowel Categorization

    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models. National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624).
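    A minimal arithmetic sketch may make the normalization idea concrete. It reduces speaker normalization to subtracting log-pitch from log-formant frequencies and matching the result against stored category prototypes; this is only a crude stand-in for the paper's strip-map and Adaptive Resonance Theory circuits, and the formant values below are illustrative, not drawn from the Peterson and Barney database.

        import numpy as np

        # Illustrative steady-state vowel tokens: (F0, F1, F2) in Hz,
        # two speakers with very different pitches per vowel category.
        tokens = {
            "iy": [(136, 270, 2290), (235, 310, 2790)],
            "aa": [(124, 730, 1090), (256, 850, 1220)],
        }

        def normalize(f0, f1, f2):
            # Pitch-independent representation: log-formant distances from
            # log-F0, an arithmetic stand-in for the cortical strip maps.
            return np.array([np.log(f1 / f0), np.log(f2 / f0)])

        # One prototype per vowel category, standing in for ART chunking.
        prototypes = {
            v: np.mean([normalize(*t) for t in toks], axis=0)
            for v, toks in tokens.items()
        }

        def categorize(f0, f1, f2):
            x = normalize(f0, f1, f2)
            return min(prototypes, key=lambda v: np.linalg.norm(x - prototypes[v]))

        print(categorize(200, 300, 2500))  # classified as "iy" despite the new pitch

    The point of the sketch is the invariance: scaling the pitch and formants together leaves the normalized representation unchanged, which is what lets one set of vowel categories serve many speakers.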

    A Formal Framework for Linguistic Annotation

    'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions -- audio, video and/or physiological recordings -- or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, 'named entity' identification, co-reference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focussed on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats. Comment: 49 pages.
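    Since the annotation graph is the paper's central data structure, a small sketch may clarify it: a directed graph whose nodes carry optional time offsets into the signal and whose arcs carry typed labels, so that independently produced annotations can coexist over the same recording. The class and field names here are illustrative, not the paper's notation.

        from dataclasses import dataclass, field
        from typing import Optional

        @dataclass
        class Node:
            id: str
            time: Optional[float] = None  # anchored nodes have signal offsets

        @dataclass
        class Arc:
            src: str
            dst: str
            type: str    # annotation tier, e.g. "word", "phone", "speaker"
            label: str

        @dataclass
        class AnnotationGraph:
            nodes: dict = field(default_factory=dict)
            arcs: list = field(default_factory=list)

            def add_node(self, node_id, time=None):
                self.nodes[node_id] = Node(node_id, time)

            def annotate(self, src, dst, arc_type, label):
                self.arcs.append(Arc(src, dst, arc_type, label))

            def tier(self, arc_type):
                return [a for a in self.arcs if a.type == arc_type]

        # Two annotation tiers over the same stretch of audio reference the
        # same time-anchored nodes, so they stay mutually consistent even if
        # they were produced by different projects in different formats.
        g = AnnotationGraph()
        g.add_node("n0", 0.00)
        g.add_node("n1", 0.32)
        g.add_node("n2", 0.55)
        g.annotate("n0", "n1", "word", "hello")
        g.annotate("n1", "n2", "word", "world")
        g.annotate("n0", "n2", "speaker", "A")
        print([a.label for a in g.tier("word")])   # ['hello', 'world']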

    Wake Word Detection and Its Applications

    Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. Novel methods are proposed to train a wake word detection system from partially labeled training data, and to use it in online applications. In the system, the prerequisite of frame-level alignment is removed, permitting the use of untranscribed training examples that are annotated only for the presence or absence of the wake word. Also, an FST-based decoder is presented to perform online detection. The suite of methods greatly improves wake word detection performance across several datasets. A novel neural network for acoustic modeling in wake word detection is also investigated. Specifically, the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection is explored, including looking ahead to the next chunk, gradient stopping, different positional embedding methods, and adding same-layer dependency between chunks. Experiments demonstrate that the proposed Transformer model significantly outperforms the baseline convolutional network with a comparable model size, while still maintaining linear complexity w.r.t. the input length. Finally, the problem of improving speech recognition with the help of the detected wake word is investigated. Voice-controlled household devices face the difficulty of performing speech recognition of device-directed speech in the presence of interfering background speech. Two end-to-end models are proposed to tackle this problem with information extracted from the anchored segment, i.e. the wake word segment of the audio stream, which contains valuable speaker information that can be used to suppress interfering speech and background noise. A multi-task learning setup is also explored, in which an ideal mask obtained from a data synthesis procedure guides model training. In addition, a way to synthesize "noisy" speech from "clean" speech is proposed to mitigate the mismatch between training and test data. The proposed methods show large word error rate reductions on Amazon Alexa live data with interfering background speech, without sacrificing performance on clean speech.
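    Of the pieces described above, the chunk-wise streaming attention pattern is the easiest to illustrate. The sketch below builds the visibility mask for self-attention in which each chunk may attend to all past chunks plus one look-ahead chunk; it shows only the masking pattern (a real streaming decoder would process chunk by chunk with cached states rather than materialize a dense mask), and the sizes are made up.

        import torch

        def chunk_attention_mask(seq_len, chunk_size, lookahead_chunks=1):
            # True = query frame may attend to key frame. A query in chunk i
            # sees key chunk j iff j <= i + lookahead_chunks.
            chunk_id = torch.arange(seq_len) // chunk_size
            return chunk_id.unsqueeze(1) + lookahead_chunks >= chunk_id.unsqueeze(0)

        mask = chunk_attention_mask(seq_len=8, chunk_size=2, lookahead_chunks=1)
        scores = torch.randn(8, 8)                         # raw attention logits
        scores = scores.masked_fill(~mask, float("-inf"))  # hide distant future
        attn = torch.softmax(scores, dim=-1)               # rows sum to 1 over visible keys

    Setting lookahead_chunks to 0 gives strictly causal chunks; increasing it trades latency for extra right context, which is the knob the "looking ahead to the next chunk" variant in the abstract turns.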