
    Robust speaker identification using artificial neural networks

    This research focuses on recognizing speakers from their speech samples. Numerous text-dependent and text-independent algorithms have been developed to recognize a speaker from his or her speech. In this thesis, we concentrate on recognizing the speaker from fixed text, i.e., the text-dependent case; the possibility of extending the method to variable text, i.e., the text-independent case, is also analyzed. Different feature extraction algorithms are employed, and their performance with an artificial neural network as the data classifier on a fixed training set is analyzed. We find a way to combine these individual feature extraction algorithms by incorporating their interdependence. The efficiency of the algorithms is determined after the input speech is classified with the back-propagation algorithm for artificial neural networks. A special case of the back-propagation algorithm that improves classification efficiency is also discussed.
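    As a rough illustration of the classification stage this abstract describes, the sketch below trains a one-hidden-layer network with plain backpropagation to map fixed-length feature vectors to speaker labels. The synthetic Gaussian "features", layer sizes, and learning rate are illustrative assumptions, not details taken from the thesis.

    import numpy as np

    rng = np.random.default_rng(0)
    n_speakers, n_features, n_hidden = 4, 20, 16

    # Synthetic stand-in for extracted speech features: 50 vectors per speaker,
    # each class clustered around its own mean (real use would substitute
    # actual feature-extractor outputs).
    X = np.vstack([rng.normal(loc=i, scale=0.5, size=(50, n_features))
                   for i in range(n_speakers)])
    y = np.repeat(np.arange(n_speakers), 50)
    T = np.eye(n_speakers)[y]                        # one-hot speaker targets

    W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))
    W2 = rng.normal(0.0, 0.1, (n_hidden, n_speakers))

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    lr = 0.5
    for epoch in range(500):
        H = sigmoid(X @ W1)                          # hidden activations
        O = sigmoid(H @ W2)                          # output activations
        dO = (O - T) * O * (1.0 - O)                 # output delta (squared error)
        dH = (dO @ W2.T) * H * (1.0 - H)             # delta backpropagated to hidden
        W2 -= lr * (H.T @ dO) / len(X)               # gradient-descent updates
        W1 -= lr * (X.T @ dH) / len(X)

    pred = np.argmax(sigmoid(sigmoid(X @ W1) @ W2), axis=1)
    print("training accuracy:", np.mean(pred == y))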

    Reducing Audible Spectral Discontinuities

    In this paper, a common problem in diphone synthesis is discussed, viz., the occurrence of audible discontinuities at diphone boundaries. Informal observations show that spectral mismatch is the most likely cause of this phenomenon. We first set out to find an objective spectral measure for discontinuity; to this end, several spectral distance measures are related to the results of a listening experiment. We then studied the feasibility of extending the diphone database with context-sensitive diphones to reduce the occurrence of audible discontinuities. The number of additional diphones is limited by clustering consonant contexts that have a similar effect on the surrounding vowels, on the basis of the best-performing distance measure. A listening experiment has shown that the addition of these context-sensitive diphones significantly reduces the number of audible discontinuities.
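    A minimal sketch of one candidate objective measure of this kind: the Euclidean distance between low-quefrency cepstral vectors of the frames adjacent to a diphone boundary. The synthetic frames and the choice of 13 coefficients are assumptions for illustration; the paper itself compares several distance measures against listener judgments.

    import numpy as np

    def cepstrum(frame, n_coeff=13):
        # Real cepstrum of a windowed frame, truncated to the low quefrencies
        # that mainly describe the spectral envelope.
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) + 1e-12
        return np.fft.irfft(np.log(spectrum))[:n_coeff]

    fs = 16000
    t = np.arange(int(0.02 * fs)) / fs               # 20 ms analysis frames
    left_frame = np.sin(2 * np.pi * 200 * t)         # last frame of one diphone
    right_frame = np.sin(2 * np.pi * 210 * t)        # first frame of the next

    d = np.linalg.norm(cepstrum(left_frame) - cepstrum(right_frame))
    print(f"cepstral distance across the boundary: {d:.4f}")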

    Speech Enhancement Exploiting the Source-Filter Model

    Imagining everyday life without mobile telephony is nowadays hardly possible. Calls are being made in every thinkable situation and environment. Hence, the microphone will not only pick up the user's speech but also sound from the surroundings, which is likely to impede the understanding of the conversational partner. Modern speech enhancement systems are able to mitigate such effects, and most users are not even aware of their existence. In this thesis the development of a modern single-channel speech enhancement approach is presented, which uses the divide-and-conquer principle to combat environmental noise in microphone signals. Though initially motivated by mobile telephony applications, this approach can be applied whenever speech is to be retrieved from a corrupted signal. The approach uses the so-called source-filter model to divide the problem into two subproblems, which are then conquered by enhancing the source (the excitation signal) and the filter (the spectral envelope) separately. Both enhanced signals are then used to denoise the corrupted signal. The estimation of spectral envelopes has a long history, and some approaches already exist for speech enhancement; however, they typically neglect the excitation signal, which leaves them unable to enhance the spectral fine structure properly. Both individual enhancement approaches exploit benefits of the cepstral domain, which offers, e.g., advantageous mathematical properties and a straightforward synthesis of excitation-like signals. We investigate traditional model-based schemes like Gaussian mixture models (GMMs), classical signal-processing-based approaches, as well as modern deep neural network (DNN)-based approaches in this thesis. The enhanced signals are not used directly to enhance the corrupted signal (e.g., to synthesize a clean speech signal) but as a so-called a priori signal-to-noise ratio (SNR) estimate in a traditional statistical speech enhancement system. Such a system consists of a noise power estimator, an a priori SNR estimator, and a spectral weighting rule that is driven by the results of the two estimators and subsequently employed to retrieve the clean speech estimate from the noisy observation. As a result, the new approach obtains significantly higher noise attenuation than current state-of-the-art systems while maintaining a comparable speech component quality and speech intelligibility. In consequence, the overall quality of the enhanced speech signal turns out to be superior to that of state-of-the-art speech enhancement approaches.
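    To make the final stage concrete, here is a minimal sketch of a spectral weighting rule driven by an a priori SNR estimate, using the classical Wiener gain G = xi / (1 + xi). The oracle SNR computed from the known clean tone stands in for the thesis's estimators (which derive it from the enhanced excitation and envelope signals); the signal, noise level, and frame length are illustrative.

    import numpy as np

    def wiener_enhance(noisy_frame, snr_prior):
        # Weight each frequency bin with G = xi / (1 + xi), then transform
        # back to the time domain.
        spec = np.fft.rfft(noisy_frame)
        gain = snr_prior / (1.0 + snr_prior)         # Wiener weighting rule
        return np.fft.irfft(gain * spec, n=len(noisy_frame))

    rng = np.random.default_rng(1)
    fs, n = 16000, 512
    t = np.arange(n) / fs
    clean = np.sin(2 * np.pi * 437.5 * t)            # tone placed on an FFT bin
    noisy = clean + 0.3 * rng.standard_normal(n)

    # Oracle a priori SNR per bin: clean power over expected white-noise power.
    # A real system estimates this quantity instead of observing it.
    noise_psd = 0.3 ** 2 * n
    snr_prior = np.abs(np.fft.rfft(clean)) ** 2 / noise_psd
    enhanced = wiener_enhance(noisy, snr_prior)

    def snr_db(ref, sig):
        return 10 * np.log10(np.sum(ref ** 2) / np.sum((sig - ref) ** 2))

    print(f"input SNR: {snr_db(clean, noisy):.1f} dB, "
          f"enhanced SNR: {snr_db(clean, enhanced):.1f} dB")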

    Improving the Speech Intelligibility By Cochlear Implant Users

    In this thesis, we focus on improving the intelligibility of speech for cochlear implant (CI) users. As an auditory prosthesis, a CI can restore hearing sensations for most patients with profound hearing loss in both ears in a quiet background. However, CI users still have serious problems understanding speech in noisy and reverberant environments. Bandwidth limitation, missing temporal fine structure, and the reduced spectral resolution of a limited number of electrodes are further factors that make hearing in noisy conditions difficult for CI users, regardless of the type of noise. To mitigate these difficulties for CI listeners, we investigate several contributing factors, such as the effect of low harmonics on tone identification in natural and vocoded speech, the contribution of matched envelope dynamic range to binaural benefits, and the contribution of low-frequency harmonics to tone identification in quiet and in six-talker babble. These studies revealed several promising methods for improving speech intelligibility for CI patients. In addition, we investigate the benefits of voice conversion for improving speech intelligibility for CI users, motivated by an earlier study showing that familiarity with a talker's voice can improve understanding of a conversation: research has shown that when adults are familiar with someone's voice, they can more accurately, and even more quickly, process and understand what the person is saying. This effect, known as the "familiar talker advantage", was our motivation to examine voice conversion for CI patients. In the present research, we propose a new method based on multi-channel voice conversion to improve the intelligibility of the transformed speech for CI patients.
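    The vocoded speech mentioned above is commonly simulated with a noise-band vocoder: the signal is split into a few frequency bands, and each band's temporal envelope modulates band-limited noise. The sketch below follows that generic recipe; the band edges, filter order, and envelope method are illustrative assumptions, not the parameters used in these studies.

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    def noise_vocode(x, fs, band_edges):
        out = np.zeros_like(x)
        rng = np.random.default_rng(0)
        for lo, hi in zip(band_edges[:-1], band_edges[1:]):
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            band = filtfilt(b, a, x)                 # analysis band
            env = np.abs(hilbert(band))              # temporal envelope
            carrier = filtfilt(b, a, rng.standard_normal(len(x)))
            out += env * carrier                     # envelope-modulated noise band
        return out

    fs = 16000
    t = np.arange(fs) / fs                           # one second of signal
    speech_like = np.sin(2 * np.pi * 150 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
    vocoded = noise_vocode(speech_like, fs, [100, 400, 1000, 2400, 6000])
    print("vocoded RMS:", np.sqrt(np.mean(vocoded ** 2)))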

    A Two-Phase Damped-Exponential Model for Speech Synthesis

    It is well known that there is room for improvement in the output quality of the speech synthesizers in use today. This research focuses on improving speech synthesis by analyzing various models for speech signals; an improvement in synthesis quality will benefit any system incorporating speech synthesis. Many synthesizers in use today rely on linear predictive coding (LPC) techniques and use only one set of vocal tract parameters per analysis frame (or per pitch period, for pitch-synchronous synthesizers). This work is motivated by the two-phase analysis-synthesis model proposed by Krishnamurthy. In lieu of electroglottograph data for determining the vocal tract model transition point, this work estimates that point directly from the speech signal, and then evaluates the potential of the two-phase damped-exponential model for improving synthetic speech quality. Both LPC and damped-exponential models are used for synthesis. Statistical analysis of data collected in a subjective listening test indicates a statistically significant improvement (at the 0.05 significance level) in quality for the two-phase damped-exponential model over single-phase LPC, single-phase damped-exponential, and two-phase LPC for the speakers, sentences, and model orders used. This subjective test demonstrates the potential for quality improvement in synthesized speech and supports the need for further research and testing.
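    For context, the sketch below shows the standard autocorrelation-method LPC analysis (Levinson-Durbin recursion) that underlies the single- and two-phase LPC models compared here, applied to a synthetic second-order resonance. The test filter and model order are illustrative; the two-phase segmentation itself is not reproduced.

    import numpy as np
    from scipy.signal import lfilter

    def lpc(frame, order):
        # Autocorrelation method with Levinson-Durbin recursion; returns the
        # prediction polynomial a (with a[0] = 1) and the residual energy.
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err                           # reflection coefficient
            a_prev = a.copy()
            a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
            a[i] = k
            err *= 1.0 - k * k
        return a, err

    # Known second-order all-pole "vocal tract" driven by white noise.
    true_a = np.array([1.0, -1.3, 0.8])
    rng = np.random.default_rng(2)
    x = lfilter([1.0], true_a, rng.standard_normal(2000))

    a_est, err = lpc(x, order=2)
    print("estimated coefficients:", np.round(a_est, 3))   # close to [1, -1.3, 0.8]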

    Computer speech synthesis: a systematic method to extract synthesis parameters for formant synthesizers.

    by Yu Wai Leung. Thesis (M.Phil.), Chinese University of Hong Kong, 1993. Includes bibliographical references (leaves 94-96).

    Abstract
    Introduction
    Chapter 1. Human speech and its production model
        1.1 The human vocal system
        1.2 Speech production mechanism
        1.3 Acoustic properties of human speech
        1.4 Modeling the speech production process
        1.5 Speech as the spoken form of a language
    Chapter 2. Speech analysis techniques
        2.1 Short-time speech analysis and speech segmentation
        2.2 Pre-emphasis
        2.3 Linear predictive analysis
        2.4 Formant tracking
        2.5 Pitch determination
    Chapter 3. Speech synthesis technology
        3.1 Overview
        3.2 Articulatory synthesis
        3.3 Concatenation synthesis
        3.4 LPC synthesis
        3.5 Formant speech synthesis
        3.6 Synthesis by rule
    Chapter 4. LSYNTH: a parallel formant synthesizer
        4.1 Overview
        4.2 Synthesizer configuration: cascade and parallel
        4.3 Structure of LSYNTH
    Chapter 5. Automatic formant parameter extraction for parallel formant synthesizers
        5.1 Introduction
        5.2 The idea of a feedback analysis system
        5.3 Overview of the feedback analysis system
        5.4 Iterative spectral matching algorithm
        5.5 Results and discussions
    Chapter 6. Generating formant trajectories in synthesis-by-rule systems
        6.1 Formant trajectory generation in synthesis-by-rule systems
        6.2 Modeling formant transitions
        6.3 Conventional formant transition calculation
        6.4 The 4-point Bezier curve model
        6.5 Modeling of formant transitions for Cantonese
    Chapter 7. Some listening test results
        7.1 Introduction
        7.2 Tone recognition test
        7.3 Cantonese final recognition test
        7.4 Problems and discussions
    Conclusion
    References
    Appendix A: The Cantonese phonetic system
    Appendix B: TPIT, a tone trajectory generator for Cantonese
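    Chapter 6's 4-point Bezier model can be illustrated with a short sketch: a cubic Bezier curve interpolating a formant transition between two target frequencies, with the two inner control points shaping the trajectory. The control-point values below are invented for illustration, not the thesis's fitted values for Cantonese.

    import numpy as np

    def bezier4(p0, p1, p2, p3, n=50):
        # Evaluate a cubic Bezier curve at n evenly spaced parameter values.
        t = np.linspace(0.0, 1.0, n)[:, None]
        return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
                + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

    # Hypothetical F2 transition from 1100 Hz to 1800 Hz over 80 ms;
    # each point is (time in ms, frequency in Hz).
    p0, p3 = np.array([0.0, 1100.0]), np.array([80.0, 1800.0])
    p1, p2 = np.array([30.0, 1150.0]), np.array([50.0, 1780.0])  # shape controls
    trajectory = bezier4(p0, p1, p2, p3)
    print(trajectory[[0, 25, -1]])                   # start, middle, end of curve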