103,020 research outputs found

    Recognizing Voice Over IP: A Robust Front-End for Speech Recognition on the World Wide Web

    Get PDF
    The Internet Protocol (IP) environment poses two relevant sources of distortion to the speech recognition problem: lossy speech coding and packet loss. In this paper, we propose a new front-end for speech recognition over IP networks. Specifically, we suggest extracting the recognition feature vectors directly from the encoded speech (i.e., the bit stream) instead of decoding it and subsequently extracting the feature vectors. This approach offers two significant benefits. First, the recognition system is only affected by the quantization distortion of the spectral envelope. Thus, we are avoiding the influence of other sources of distortion due to the encoding-decoding process. Second, when packet loss occurs, our front-end becomes more effective since it is not constrained to the error handling mechanism of the codec. We have considered the ITU G.723.1 standard codec, which is one of the most preponderant coding algorithms in voice over IP (VoIP) and compared the proposed front-end with the conventional approach in two automatic speech recognition (ASR) tasks, namely, speaker-independent isolated digit recognition and speaker-independent continuous speech recognition. In general, our approach outperforms the conventional procedure, for a variety of simulated packet loss rates. Furthermore, the improvement is higher as network conditions worsen.Publicad

    Recognition of Harmonic Sounds in Polyphonic Audio using a Missing Feature Approach: Extended Report

    Get PDF
    A method based on local spectral features and missing feature techniques is proposed for the recognition of harmonic sounds in mixture signals. A mask estimation algorithm is proposed for identifying spectral regions that contain reliable information for each sound source and then bounded marginalization is employed to treat the feature vector elements that are determined as unreliable. The proposed method is tested on musical instrument sounds due to the extensive availability of data but it can be applied on other sounds (i.e. animal sounds, environmental sounds), whenever these are harmonic. In simulations the proposed method clearly outperformed a baseline method for mixture signals

    Follow-up question handling in the IMIX and Ritel systems: A comparative study

    Get PDF
    One of the basic topics of question answering (QA) dialogue systems is how follow-up questions should be interpreted by a QA system. In this paper, we shall discuss our experience with the IMIX and Ritel systems, for both of which a follow-up question handling scheme has been developed, and corpora have been collected. These two systems are each other's opposites in many respects: IMIX is multimodal, non-factoid, black-box QA, while Ritel is speech, factoid, keyword-based QA. Nevertheless, we will show that they are quite comparable, and that it is fruitful to examine the similarities and differences. We shall look at how the systems are composed, and how real, non-expert, users interact with the systems. We shall also provide comparisons with systems from the literature where possible, and indicate where open issues lie and in what areas existing systems may be improved. We conclude that most systems have a common architecture with a set of common subtasks, in particular detecting follow-up questions and finding referents for them. We characterise these tasks using the typical techniques used for performing them, and data from our corpora. We also identify a special type of follow-up question, the discourse question, which is asked when the user is trying to understand an answer, and propose some basic methods for handling it

    The management of context-sensitive features: A review of strategies

    Get PDF
    In this paper, we review five heuristic strategies for handling context- sensitive features in supervised machine learning from examples. We discuss two methods for recovering lost (implicit) contextual information. We mention some evidence that hybrid strategies can have a synergetic effect. We then show how the work of several machine learning researchers fits into this framework. While we do not claim that these strategies exhaust the possibilities, it appears that the framework includes all of the techniques that can be found in the published literature on context-sensitive learning

    Parsing Speech: A Neural Approach to Integrating Lexical and Acoustic-Prosodic Information

    Full text link
    In conversational speech, the acoustic signal provides cues that help listeners disambiguate difficult parses. For automatically parsing spoken utterances, we introduce a model that integrates transcribed text and acoustic-prosodic features using a convolutional neural network over energy and pitch trajectories coupled with an attention-based recurrent neural network that accepts text and prosodic features. We find that different types of acoustic-prosodic features are individually helpful, and together give statistically significant improvements in parse and disfluency detection F1 scores over a strong text-only baseline. For this study with known sentence boundaries, error analyses show that the main benefit of acoustic-prosodic features is in sentences with disfluencies, attachment decisions are most improved, and transcription errors obscure gains from prosody.Comment: Accepted in NAACL HLT 201
    • …
    corecore