
    Markov Models of Telephone Speech Dialogues

    Analogue speech signals are the most natural form of human communication. Contemporary methods for analyzing voice transmission by packet switching were designed mainly around a Poisson stream of input packets, for which the probability of an active packet on each input port of the router is constant in time. This assumption is not always valid, because the formation of speech packets during a dialogue is a non-stationary process; in that case mathematical modeling becomes an effective method of analysis, through which the necessary estimates for a network node designed for packet transmission of speech may be obtained. This paper presents the results of an analysis of mathematical models of Markov-chain-based speech packet sources vis-à-vis the peculiarities of telephone dialogue models. The derived models can be employed in the design and development of statistical multiplexing methods for packet-switching network nodes.
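
    The abstract does not specify the dialogue models, but the simplest Markov speech packet source is a two-state talkspurt/silence chain. The sketch below is purely illustrative (the transition probabilities are invented, not taken from the paper): it simulates the long-run packet activity of such a source and compares it with the chain's stationary value.

```python
import random

# Hypothetical two-state Markov speech source: state 1 = talkspurt
# (a packet is emitted every slot), state 0 = silence. The per-slot
# transition probabilities are illustrative, not from the paper.
def simulate_activity(n_slots, p01=0.10, p10=0.05, seed=0):
    rng = random.Random(seed)
    state, active = 0, 0
    for _ in range(n_slots):
        if state == 0:
            if rng.random() < p01:
                state = 1
        elif rng.random() < p10:
            state = 0
        active += state
    return active / n_slots

# Stationary probability of the talkspurt state, for comparison.
def stationary_activity(p01=0.10, p10=0.05):
    return p01 / (p01 + p10)

print(round(stationary_activity(), 3))  # 0.667
```

    A single on/off source converges to p01 / (p01 + p10); the non-stationarity the abstract emphasizes arises when such sources are coupled through the turn-taking dynamics of a dialogue.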

    Three New Corpora at the Bavarian Archive for Speech Signals - and a First Step Towards Distributed Web-Based Recording

    The Bavarian Archive for Speech Signals has released three new speech corpora for both industrial and academic use: a) Hempels Sofa contains recordings of up to 60 seconds of non-scripted telephone speech, b) ZipTel is a corpus of telephone speech covering postal addresses and telephone numbers from a real-world application, and c) RVG-J is an extension of the original Regional Variants of German corpus with juvenile speakers. All three corpora were transcribed orthographically according to the SpeechDat annotation guidelines using the WWWTranscribe annotation software. Recently, BAS has begun to investigate performing large-scale audio recordings via the web, and RVG-J has become the testbed for this type of recording.

    English Conversational Telephone Speech Recognition by Humans and Machines

    One of the most difficult speech recognition tasks is the accurate recognition of human-to-human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6%, and most recently 5.8%, and are now believed to be within striking range of human performance. This raises two issues: what IS human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved human performance. In trying to verify this statement, we performed an independent set of human performance measurements on two conversational tasks and found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation, which, at least at the time of writing, is a new performance milestone (albeit not at what we measure to be human performance). On the acoustic side, we use a score fusion of three models: an LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning, and a residual network (ResNet) with 25 convolutional layers and time-dilated convolutions. On the language modeling side, we use word- and character-level LSTMs and convolutional WaveNet-style language models.
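
    The abstract does not give the fusion details; a common, minimal form of score fusion is a weighted sum of per-model log-likelihood scores (log-linear interpolation). The scores and weights below are invented for illustration, not taken from the paper.

```python
# Illustrative log-linear score fusion: a weighted sum of each model's
# log-likelihood score for the same hypothesis. In practice the weights
# are tuned on held-out data; these values are hypothetical.
def fuse_scores(log_scores, weights):
    if len(log_scores) != len(weights):
        raise ValueError("one weight per model")
    return sum(w * s for w, s in zip(weights, log_scores))

scores = [-4.2, -3.9, -4.5]   # e.g. LSTM-1, LSTM-2, ResNet log-scores
weights = [0.4, 0.4, 0.2]     # illustrative interpolation weights
print(round(fuse_scores(scores, weights), 2))  # -4.14
```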

    Fast Keyword Spotting in Telephone Speech

    In this paper, we present a system designed for detecting keywords in telephone speech. We focus not only on achieving high accuracy but also on very short processing time. The keyword spotting system can run in three modes: a) an off-line mode requiring less than 0.1xRT, b) an on-line mode with minimum (2 s) latency, and c) a repeated spotting mode, in which pre-computed values allow for additional acceleration. Its performance is evaluated on recordings of Czech spontaneous telephone speech using rather large and complex keyword lists.
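
    The 0.1xRT figure refers to the real-time factor: processing time divided by audio duration, where a value below 1 means faster-than-real-time operation. A trivial sketch (the numbers are illustrative):

```python
# Real-time factor (RTF) = processing time / audio duration.
# "Less than 0.1xRT" means, e.g., an hour of audio is processed
# in under six minutes. The values below are illustrative.
def real_time_factor(processing_s, audio_s):
    return processing_s / audio_s

print(real_time_factor(360.0, 3600.0))  # 0.1
```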

    Application of shifted delta cepstral features for GMM language identification

    Spoken language identification (LID) in telephone speech signals is an important and difficult classification task. Language identification modules can be used as front-end signal routers for multilanguage speech recognition or transcription devices. Gaussian mixture models (GMMs) can be utilized to effectively model the distribution of feature vectors present in speech signals for classification. Common feature vectors used for speech processing include Linear Prediction (LP-CC), Mel-Frequency (MF-CC), and Perceptual Linear Prediction derived Cepstral coefficients (PLP-CC). This thesis compares and examines the recently proposed type of feature vector called Shifted Delta Cepstral (SDC) coefficients, whose use has been shown to improve language identification performance. It explores different types of shifted delta cepstral feature vectors for spoken language identification of telephone speech, using a simple GMM-based classifier on a 3-language task. The OGI Multi-language Telephone Speech Corpus is used to evaluate the system.
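
    As a sketch of the SDC idea, using the common N-d-P-k parameterization (the thesis's exact parameter values are not given in the abstract, so the defaults below are illustrative): each frame's feature vector stacks k delta-cepstra taken at shifts of P frames.

```python
# Sketch of Shifted Delta Cepstral (SDC) stacking with the usual
# N-d-P-k parameterization: for frame t, concatenate the k deltas
# c[t + i*P + d] - c[t + i*P - d], i = 0..k-1. The parameter values
# and zero-padding at utterance edges are illustrative choices.
def sdc(cepstra, d=1, P=3, k=7):
    T, n = len(cepstra), len(cepstra[0])
    feats = []
    for t in range(T):
        row = []
        for i in range(k):
            lo, hi = t + i * P - d, t + i * P + d
            if 0 <= lo and hi < T:
                row.extend(a - b for a, b in zip(cepstra[hi], cepstra[lo]))
            else:
                row.extend([0.0] * n)  # pad where the shift runs off the ends
        feats.append(row)
    return feats

# 7 cepstral coefficients per frame -> 7 * k = 49 SDC dimensions per frame.
frames = [[float(t + j) for j in range(7)] for t in range(100)]
feats = sdc(frames)
print(len(feats), len(feats[0]))  # 100 49
```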

    Improving Speech Recognition for Interviews with both Clean and Telephone Speech

    High-quality automatic speech recognition (ASR) depends on the context of the speech: cleanly recorded speech yields better results than speech recorded over telephone lines. In telephone speech, the signal is band-pass filtered, which limits the frequencies available for computation, and the transmitted signal may be further distorted by noise, causing higher word error rates (WER). The main goal of this research project is to examine approaches for improving recognition of telephone speech while maintaining or improving results for clean speech in mixed telephone and clean-speech recordings, by reducing mismatches between the test data and the available models. The test data consists of recorded interviews in which the interviewer was near a hand-held, single-channel recorder and the interviewee was on a speakerphone placed near the recorder. Available resources include the Eesen offline transcriber and two acoustic models, trained on clean data and on telephone data (Switchboard), respectively. The Eesen offline transcriber runs on a virtual machine available through the Speech Recognition Virtual Kitchen and uses a deep recurrent neural network acoustic model and a weighted finite-state transducer decoder to transcribe audio into text. This project addresses the high WER that results when telephone speech is tested on cleanly trained models by 1) replacing the clean model with a telephone model and 2) analyzing and addressing errors through data cleaning, correcting audio segmentation, and adding words to the dictionary. These approaches reduced the overall WER. This paper includes an overview of the transcriber, the acoustic models, and the methods used to improve speech recognition, as well as transcription performance results. We expect these approaches to reduce the WER on telephone speech. Future work includes applying a variety of filters to the speech signal to reduce both the additive and convolutional noise introduced by the telephone channel.
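
    The WER metric used throughout is computed from a Levenshtein alignment of the reference and hypothesis word sequences; a minimal sketch:

```python
# Word error rate (WER) via Levenshtein alignment of reference and
# hypothesis word sequences:
#   WER = (substitutions + deletions + insertions) / reference length.
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                     # delete all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                     # insert all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[-1][-1] / len(r)

# One deleted word out of six reference words: WER = 1/6.
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
```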

    The Castilian Fortresses of the Order of Calatrava in the Twelfth Century

    In this paper, we present a spectro-temporal feature extraction technique using sub-band Hilbert envelopes of relatively long segments of the speech signal. Hilbert envelopes of the sub-bands are estimated using Frequency Domain Linear Prediction (FDLP). Spectral features are derived by integrating the sub-band Hilbert envelopes in short-term frames, and the temporal features are formed by converting the FDLP envelopes into modulation frequency components. These are then combined at the phoneme posterior level and used as the input features for a phoneme recognition system. To improve the robustness of the proposed features to telephone speech, the sub-band temporal envelopes are gain-normalized prior to feature extraction. Phoneme recognition experiments on telephone speech in the HTIMIT database show significant performance improvements for the proposed features compared to other robust feature techniques (an average relative reduction of 11% in phoneme error rate).
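
    The gain-normalization step can be illustrated with a toy example: dividing each sub-band envelope by its mean cancels any fixed per-band gain, which is how a convolutive telephone-channel distortion (roughly constant within a narrow band) is suppressed. This is a simplified sketch with invented values, not the paper's exact FDLP pipeline:

```python
# Toy illustration of per-sub-band gain normalization: dividing a
# sub-band temporal envelope by its mean removes any fixed gain in
# that band, so a channel that scales the band uniformly leaves the
# normalized envelope unchanged. Values are illustrative.
def gain_normalize(envelope):
    g = sum(envelope) / len(envelope)  # assumes a nonzero mean
    return [e / g for e in envelope]

env = [0.5, 1.0, 1.5, 2.0]
through_channel = [3.0 * e for e in env]  # same band after channel gain 3
print(gain_normalize(env) == gain_normalize(through_channel))  # True
```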

    Real-Time Hardware Implementation of Telephone Speech Enhancement Algorithm

    Engineering: 3rd Place (The Ohio State University Denman Undergraduate Research Forum)
    Hearing impairment detrimentally affects communication over the telephone. Since phone lines reduce bandwidth and dynamic range, the resulting poor-quality speech signal can cause hard-of-hearing (HoH) listeners extreme frustration and inefficient communication. One possible solution has been developed at The Ohio State University to help combat this problem: the Telephone Speech Enhancement Algorithm (TSEA), created to improve telephone signals so that speech is more intelligible for HoH listeners. TSEA has been tested on human subjects, and the tests have proven the algorithm effective. However, a hardware implementation of TSEA had yet to be designed. In this thesis, the BeagleBoard-xM development board is used to run TSEA. The TSEA software is modified so that it can be implemented on the BeagleBoard-xM and tested in a real-time environment. This hardware model runs TSEA but introduces noise into the system due to its analog nature. The model accepts analog audio signals, processes them using TSEA, and outputs the processed signal for transmission. Such a device has the potential to improve communication in settings such as telemedicine clinics, where failing to communicate properly with HoH patients could have devastating consequences. Ideally, if a commercial model were developed, TSEA could be deployed widely to improve communication for the HoH community. This project is the next step toward making that a reality.
    Academic Major: Electrical and Computer Engineering