40 research outputs found

    Speech Recognition Using the Mellin Transform

    The purpose of this research was to improve performance in speech recognition. Specifically, a new approach was investigated: applying an integral transform known as the Mellin transform (MT) to the output of an auditory model, in order to improve the recognition rate of phonemes through the scale-invariance property of the MT. Scale invariance means that as a time-domain signal is subjected to dilations, the distribution of the signal in the MT domain remains unaffected. An auditory model was used to transform speech waveforms into images representing how the brain sees a sound. The MT was applied to these images and features were extracted. The features were used in a speech recognizer based on Hidden Markov Models. The results from speech recognition experiments showed an increase in recognition rates for some phonemes compared to traditional methods.
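The scale-invariance property mentioned in the abstract above can be made concrete with a short derivation. This is a sketch using the standard textbook definition of the Mellin transform; the symbols f, g, a, c and ω are chosen here for illustration and are not taken from the thesis.

```latex
% Standard definition of the Mellin transform of f(t), t > 0
\mathcal{M}[f](s) = \int_{0}^{\infty} f(t)\, t^{s-1}\, dt

% Dilate the signal by a factor a > 0: g(t) = f(at).
% Substituting u = at (so t = u/a, dt = du/a) gives
\mathcal{M}[g](s) = \int_{0}^{\infty} f(at)\, t^{s-1}\, dt
                  = a^{-s} \int_{0}^{\infty} f(u)\, u^{s-1}\, du
                  = a^{-s}\, \mathcal{M}[f](s)

% On the line s = c + j\omega the factor a^{-s} has constant modulus a^{-c}, so
\left| \mathcal{M}[g](c + j\omega) \right| = a^{-c} \left| \mathcal{M}[f](c + j\omega) \right|
```

The dilation therefore changes the magnitude only by a constant gain (none at all for c = 0), which is why the shape of the magnitude distribution in the MT domain is unaffected.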

    Robust Automatic Transcription of Lectures

    The automatic transcription of talks, lectures, and presentations is becoming increasingly important. It is what makes possible applications such as automatic speech translation, automatic speech summarization, and targeted information retrieval in audio data, and thus easier access to such material in digital libraries. Ideally, such a system works with a microphone that frees the speaker from having to wear one, which is the focus of this work.

    Efficient speaker recognition for mobile devices


    Restoration of Atmospheric Turbulence Degraded Video using Kurtosis Minimization and Motion Compensation

    In this thesis, the background of atmospheric turbulence degradation in imaging is reviewed and two aspects are highlighted: blurring and geometric distortion. The turbulence blurring parameter is determined by the atmospheric turbulence condition, which is often unknown; therefore, a blur identification technique based on higher-order statistics (HOS) was developed. It was observed that the kurtosis generally increases as an image becomes blurred (smoothed). This observation was interpreted in the frequency domain in terms of phase correlation, and kurtosis-minimization-based blur identification is built upon it. It was shown that kurtosis minimization is effective in identifying the blurring parameter directly from the degraded image. Kurtosis minimization is a general method for blur identification: it has been tested on a variety of blurs, including Gaussian blur, out-of-focus blur, and motion blur. To compensate for the geometric distortion, earlier work on turbulent motion compensation was extended to deal with situations in which there is camera or object motion. Trajectory smoothing is used to suppress the turbulent motion while preserving the real motion. Though the scintillation effect of atmospheric turbulence is not considered separately, it can be handled in the same way as multiple-frame denoising while motion trajectories are built.
    Ph.D. Committee Chair: Mersereau, Russell; Committee Co-Chair: Smith, Mark; Committee Member: Lanterman, Aaron; Committee Member: Wang, May; Committee Member: Tannenbaum, Allen; Committee Member: Williams, Dougla
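The selection step behind kurtosis minimization, as described in the abstract above, might look roughly like the following. This is a minimal sketch, not the thesis implementation: it assumes kurtosis is the normalized fourth central moment, and it assumes the candidate restorations (one per trial blur parameter) have already been produced by some deblurring step that is not shown. The function names `kurtosis`, `image_kurtosis`, and `pick_blur_parameter` are hypothetical.

```python
from typing import Dict, Iterable, Sequence

Image = Sequence[Sequence[float]]  # a 2-D image given as rows of pixel intensities


def kurtosis(values: Iterable[float]) -> float:
    """Normalized fourth central moment, m4 / m2**2 (assumed definition)."""
    data = [float(v) for v in values]
    n = len(data)
    if n == 0:
        raise ValueError("kurtosis needs at least one sample")
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m4 = sum((v - mean) ** 4 for v in data) / n
    if m2 == 0.0:
        raise ValueError("kurtosis is undefined for constant data")
    return m4 / (m2 * m2)


def image_kurtosis(image: Image) -> float:
    """Kurtosis of all pixel intensities in the image."""
    return kurtosis(pixel for row in image for pixel in row)


def pick_blur_parameter(restorations: Dict[float, Image]) -> float:
    """Return the trial blur parameter whose restoration has the lowest kurtosis.

    `restorations` maps each trial parameter to the image restored under that
    assumption; producing those images is outside this sketch.
    """
    return min(restorations, key=lambda p: image_kurtosis(restorations[p]))
```

In use, one would restore the degraded frame once per trial parameter, collect the results in a dictionary, and pass it to `pick_blur_parameter`.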

    Analysis/Synthesis Comparison of Vocoders Utilized in Statistical Parametric Speech Synthesis

    This thesis presents a literature study followed by an experimental part on the state-of-the-art vocoders utilized in statistical parametric speech synthesis. In the experimental part, the analysis/synthesis properties of three selected vocoders (GlottHMM, STRAIGHT and Harmonic/Stochastic Model) are examined. The tests performed were an analysis of vocoder parameter distributions, statistical testing of the effect of emotions on the vocoder parameter distributions, and a subjective listening test evaluating the vocoders' relative analysis/synthesis quality. The results indicate that the STRAIGHT vocoder has the most Gaussian parameter distributions and the most robust synthesis quality, whereas the GlottHMM vocoder has the most emotion-sensitive parameters and the best but unreliable synthesis quality. The HSM vocoder's LSF parameters were found to be more Gaussian than the GlottHMM vocoder's LSF parameters. HSM was found to be sensitive to noise, and it received the lowest score in the subjective listening test.

    Recent Advances in Signal Processing

    Signal processing is a critical part of most new technological inventions and poses challenges in a wide variety of applications in both science and engineering. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian. They have always favored closed-form tractability over real-world accuracy. These constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five different areas depending on the application at hand. These five categories are ordered to address image processing, speech processing, communication systems, time-series analysis, and educational packages, respectively. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications that are able to operate in real-world environments, such as mobile communication services and smart homes.

    Spoken Term Detection on Low Resource Languages

    Developing efficient speech processing systems for low-resource languages is an immensely challenging problem. One potentially effective approach to addressing the lack of resources for any particular language is to employ data from multiple languages for building speech processing sub-systems. This thesis investigates possible methodologies for Spoken Term Detection (STD) in low-resource Indian languages. The task of STD is to search for a query keyword, given in text form, in a considerably large speech database. This is usually done by matching templates of feature vectors that represent the sequence of phonemes in the query word against the continuous speech in the database. The features typically used to represent speech signals in most speech processing systems are the mel-frequency cepstral coefficients (MFCC). As speech is a very complex signal, holding information about the textual message, the speaker's identity, the emotional and health state of the speaker, and so on, the MFCC features derived from it will also contain information about all these factors. For efficient template matching, we need to neutralize the speaker variability in the features and stabilize them to represent the speech variability alone.
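The abstract above does not say which algorithm is used to match feature-vector templates. One common choice for comparing two sequences of frames that differ in speaking rate is dynamic time warping (DTW), sketched below as an assumption rather than as the thesis's method. The names `frame_distance` and `dtw_cost` are hypothetical, frames are plain lists of floats standing in for MFCC vectors, and this version aligns two whole sequences; searching inside continuous speech would need a subsequence variant.

```python
import math
from typing import Sequence

Frame = Sequence[float]  # one feature vector, e.g. the MFCCs of one time frame


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("frames must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_cost(query: Sequence[Frame], reference: Sequence[Frame]) -> float:
    """Accumulated cost of the best monotonic alignment of the two sequences."""
    if not query or not reference:
        raise ValueError("both sequences must contain at least one frame")
    n, m = len(query), len(reference)
    # cost[i][j] = best cost of aligning the first i query frames
    # with the first j reference frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], reference[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # query frame repeats a reference frame
                cost[i][j - 1],      # reference frame repeats a query frame
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]
```

A lower cost means a closer match. Because the alignment may repeat frames, a slowed-down copy of the query still scores as an exact match, which is the kind of tolerance to timing differences that template matching needs.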