51 research outputs found

    Speech spectrum non-stationarity detection based on line spectrum frequencies and related applications

    Ankara: Department of Electrical and Electronics Engineering and the Institute of Engineering and Sciences of Bilkent University, 1998. Thesis (Master's), Bilkent University, 1998. Includes bibliographical references (leaves 124-132). Ertan, Ali Erdem, M.S.
    In this thesis, two new speech variation measures for speech spectrum non-stationarity detection are proposed. The measures are based on the Line Spectrum Frequencies (LSF) and the spectral values at the LSF locations, and they are formulated to be subjectively meaningful and mathematically tractable while keeping computational complexity low. To demonstrate the usefulness of the non-stationarity detector, two applications are presented. The first is an implicit speech segmentation system that detects non-stationary regions in the speech signal and derives the boundaries of the speech segments from them. The second is a Variable Bit-Rate Mixed Excitation Linear Predictive (VBR-MELP) vocoder built around a novel voice activity detector that finds silent regions in the speech; the detector is designed to be robust to non-stationary background noise and enables efficient coding of silent sections and unvoiced utterances to decrease the bit rate. Simulation results are also presented.
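    A minimal sketch of the general idea in Python: estimate LPC coefficients per frame, convert them to LSFs via the roots of the P(z)/Q(z) polynomials, and track a frame-to-frame LSF distance as a crude non-stationarity score. The frame length, hop, LPC order, and the mean-absolute-difference metric are illustrative assumptions; the thesis's two measures (which also use the spectral values at the LSF locations) are not reproduced here.

```python
# Sketch: LSF-based frame-to-frame spectral variation (illustrative, not the
# thesis's measures). Assumes 8 kHz speech in a float numpy array `x`.
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order):
    """Autocorrelation-method LPC: solve the Yule-Walker normal equations."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))  # A(z) = 1 - sum_k a_k z^{-k}

def lsf(a):
    """LSFs: angles of the unit-circle roots of P(z) = A(z) + z^-(p+1) A(1/z)
    and Q(z) = A(z) - z^-(p+1) A(1/z)."""
    p_poly = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a[::-1]))
    q_poly = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a[::-1]))
    roots = np.concatenate((np.roots(p_poly), np.roots(q_poly)))
    w = np.angle(roots[np.abs(roots.imag) > 1e-9])  # drop trivial roots at z = +/-1
    return np.sort(w[w > 0])

def lsf_variation(x, frame_len=240, hop=120, order=10):
    """Mean absolute LSF difference between consecutive frames -- a simplified
    non-stationarity score (frame sizes and the metric are assumptions)."""
    win = np.hamming(frame_len)
    frames = [x[i:i + frame_len] * win for i in range(0, len(x) - frame_len, hop)]
    lsfs = [lsf(lpc(f, order)) for f in frames]
    return np.array([np.mean(np.abs(l1 - l0)) for l0, l1 in zip(lsfs, lsfs[1:])])
```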

    A Parametric Approach for Efficient Speech Storage, Flexible Synthesis and Voice Conversion

    During the past decades, many areas of speech processing have benefited from the vast increases in available memory sizes and processing power. For example, speech recognizers can be trained on enormous speech databases, and high-quality speech synthesizers can generate new sentences by concatenating speech units retrieved from a large inventory of speech data. Even in today's world of ever-increasing memory sizes and computational resources, however, there are still many embedded application scenarios in which memory capacities and processor speeds are very limited. There is thus a clear demand for solutions that can operate with limited resources, e.g., on low-end mobile devices.

    This thesis introduces a new segmental parametric speech codec referred to as the VLBR codec. This novel proprietary sinusoidal speech codec, designed for efficient speech storage, achieves relatively good speech quality at compression ratios beyond those offered by standardized speech coding solutions, i.e., at bitrates of approximately 1 kbps and below. The efficiency of the proposed coding approach rests on model simplifications, mode-based segmental processing, and a method of adaptive downsampling and quantization. The coding efficiency is further improved using a novel flexible multi-mode matrix quantizer structure and enhanced dynamic codebook reordering, and the compression is also facilitated by a new perceptual irrelevancy removal method.

    The VLBR codec is also applied to text-to-speech synthesis. In particular, the codec is utilized for the compression of unit selection databases and for the parametric concatenation of speech units. It is shown that the efficiency of the database compression can be further enhanced through speaker-specific retraining of the codec. Moreover, the computational load is significantly decreased using a new compression-motivated scheme for very fast and memory-efficient calculation of concatenation costs, based on techniques and implementations used in the VLBR codec.

    Finally, the VLBR codec and the related speech synthesis techniques are complemented with voice conversion methods that allow modifying the perceived speaker identity, which in turn enables, e.g., cost-efficient creation of new text-to-speech voices. The VLBR-based voice conversion system combines compression with the popular Gaussian mixture model based conversion approach. Furthermore, a novel method is proposed for converting the prosodic aspects of speech, and the performance of the conversion system is further enhanced using a new approach for mode selection and through explicit control of the degree of voicing.

    Together, the solutions proposed in the thesis form a complete system that can be utilized in different ways and configurations. The VLBR codec itself can be used, e.g., for efficient compression of audio books, while the speech synthesis related methods can reduce the footprint and the computational load of concatenative text-to-speech synthesizers to the levels required in some embedded applications. The VLBR-based voice conversion techniques can complement the codec both in storage applications and in connection with speech synthesis, and the voice conversion functionality can also be used on its own, e.g., in games or other entertainment applications.
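    The abstract does not spell out the enhanced dynamic codebook reordering, but a common baseline realization of the idea is a move-to-front permutation of the codebook: codevectors used recently receive small transmitted indices, which an entropy coder can then encode cheaply. A hedged sketch of that baseline follows (the function name and the move-to-front policy are assumptions, not the thesis's method).

```python
# Sketch: move-to-front dynamic codebook reordering for VQ index streams.
# Illustrative baseline only; the thesis's "enhanced" variant is not public here.
import numpy as np

def vq_encode_mtf(vectors, codebook):
    """Quantize each vector to its nearest codevector, but transmit the
    codevector's position in a move-to-front permutation: recently used
    codevectors get small indices, which entropy-code cheaply."""
    order = list(range(len(codebook)))        # current permutation of codevector ids
    indices = []
    for v in vectors:
        best = int(np.argmin(np.sum((codebook - v) ** 2, axis=1)))
        pos = order.index(best)               # transmitted index under current order
        indices.append(pos)
        order.insert(0, order.pop(pos))       # move the used codevector to the front
    return indices
```

    The decoder maintains the same permutation as it reads indices, so the stream is invertible without side information.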

    Residual-excited linear predictive (RELP) vocoder system with TMS320C6711 DSK and vowel characterization

    Machine speech recognition is one of the most popular and complicated subjects in the current multimedia field. Linear predictive coding (LPC) is a useful technique for voice coding in speech analysis and synthesis. The first objective of this research was to establish a prototype of the residual-excited linear predictive (RELP) vocoder system in a real-time environment. Although its transmission rate is higher, the RELP vocoder produces synthesized speech of superior quality compared with other vocoders, and it is relatively simple and robust to implement. The RELP vocoder uses the residual signal as excitation rather than periodic pulses or white noise. The vocoder was implemented in C on a Texas Instruments TMS320C6711 DSP starter kit (DSK).

    Identifying vowel sounds is an important element in recognizing speech content. The second objective of the research was to explore a method of characterizing vowels by means of parameters extracted by the RELP vocoder, an approach not previously known to have been used in speech recognition. Five English vowels were chosen for the experimental sample. Utterances of individual vowel sounds and of the vowel sounds in one-syllable words were recorded and saved as WAVE files, and a large sample of 20-ms vowel segments was obtained from these utterances. The method used 20 samples of a segment's LPC frequency response, spaced equally on a logarithmic scale, as a feature vector. The average vector of each vowel was calculated, and an unknown vector was classified into a vowel group by comparing its Euclidean distances to the five vowels' average vectors. The results indicate that, when a vowel is uttered alone, the distance to its own average vector is smaller than the distances to the other vowels' average vectors; by testing a given vowel's frequency response against all known vowels' average vectors, one can therefore determine to which vowel group it belongs. When a vowel is uttered with consonants, however, variances and covariances increase, and in some cases no distinct difference can be recognized between the distance to the vowel's own average vector and the distances to the other vowels' average vectors. Overall, the results of vowel characterization indicate that the RELP vocoder can identify and classify single vowel sounds.
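    The classification rule described above is a nearest-centroid scheme over 20 log-spaced samples of the LPC spectral envelope. A small sketch under stated assumptions: 8 kHz sampling, order-10 LPC, and a 100 Hz to near-Nyquist frequency grid (the paper fixes only the 20 log-spaced samples).

```python
# Sketch: nearest-centroid vowel classification from 20 log-spaced samples
# of the LPC spectral envelope. fs, order, and the frequency-grid endpoints
# are assumptions.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def lpc(frame, order=10):
    """Autocorrelation-method LPC coefficients of A(z)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def feature_vector(frame, fs=8000, n_points=20):
    """20 samples of |1/A(z)| in dB, spaced logarithmically in frequency."""
    freqs = np.logspace(np.log10(100.0), np.log10(0.95 * fs / 2), n_points)
    _, h = freqz([1.0], lpc(frame), worN=freqs, fs=fs)  # LPC spectral envelope
    return 20 * np.log10(np.abs(h))

def classify(frame, centroids):
    """centroids: dict mapping vowel label -> average feature vector."""
    v = feature_vector(frame)
    return min(centroids, key=lambda vowel: np.linalg.norm(v - centroids[vowel]))
```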

    Proceedings: Voice Technology for Interactive Real-Time Command/Control Systems Application

    Speech understanding among researchers and managers, current developments in voice technology, and an exchange of information concerning government voice technology efforts are discussed.

    Phase entrainment and perceptual cycles in audition and vision

    Recent research indicates fundamental differences between the auditory and visual systems: whereas the visual system seems to sample its environment, cycling between "snapshots" at discrete moments in time (creating perceptual cycles), most attempts at discovering discrete perception in the auditory system have failed. Here, we show in two psychophysical experiments that subsampling the very input to the visual and auditory systems is indeed more disruptive for audition; however, perceptual cycles in the auditory system may exist if they operate at a relatively high level of auditory processing. Moreover, we suggest that the auditory system, due to the rapidly fluctuating nature of its input, might rely to a particularly strong degree on phase entrainment, the alignment between neural activity and the rhythmic structure of its input: by exploiting the low- and high-excitability phases of neural oscillations, the auditory system might actively control the timing of its "snapshots", amplifying relevant information while suppressing irrelevant events. Not only do our results suggest that oscillatory phase has important consequences for how simultaneous auditory inputs are perceived; they also show that phase entrainment to speech sounds entails an active high-level mechanism. We demonstrate this using specifically constructed speech/noise sounds in which fluctuations in the low-level features of speech (amplitude and spectral content) have been removed while intelligibility and high-level features (including, but not restricted to, phonetic information) have been conserved. In several experiments, the auditory system entrains to these stimuli: both perception (the detection of a click embedded in the speech/noise stimuli) and neural oscillations (measured with electroencephalography, EEG, and in intracranial recordings in the primary auditory cortex of the monkey) follow the conserved "high-level" rhythm of speech. Taken together, the results presented here suggest that, not only in vision but also in audition, neural oscillations are an important tool for the discretization and processing of the brain's input. However, there seem to be fundamental differences between the two systems: in contrast to the visual system, it is critical for the auditory system to adapt (via phase entrainment) to its environment, and input subsampling is most likely done at a hierarchically high level of stimulus processing.
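    Entrainment of this kind is commonly quantified with a phase-locking value (PLV) between the stimulus rhythm and the recorded signal. A minimal sketch, assuming a single-trial PLV computed across time in an assumed 2-8 Hz speech-rate band (the thesis's own analyses are not reproduced here):

```python
# Sketch: single-trial phase-locking value (PLV) between a stimulus envelope
# and a neural signal, band-passed in an assumed 2-8 Hz speech-rate band.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def phase_locking_value(stim_env, neural, fs, band=(2.0, 8.0)):
    """PLV = |mean_t exp(i * (phi_stim(t) - phi_neural(t)))|, in [0, 1].
    (PLV is usually averaged across trials at each time point; this sketch
    averages across time within one trial for simplicity.)"""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    phi_s = np.angle(hilbert(filtfilt(b, a, stim_env)))
    phi_n = np.angle(hilbert(filtfilt(b, a, neural)))
    return np.abs(np.mean(np.exp(1j * (phi_s - phi_n))))
```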

    Engineering data compendium. Human perception and performance, volume 3

    The concept underlying the Engineering Data Compendium was the product of a research and development program (the Integrated Perceptual Information for Designers project) aimed at facilitating the application of basic research findings in human performance to the design of military crew systems. The principal objective was to develop a workable strategy for (1) identifying and distilling information of potential value to system design from the existing research literature, and (2) presenting this technical information in a way that would aid its accessibility, interpretability, and applicability for system designers. The present four volumes of the Engineering Data Compendium represent the first implementation of this strategy. This is Volume 3, containing sections on Human Language Processing, Operator Motion Control, Effects of Environmental Stressors, Display Interfaces, and Control Interfaces (Real/Virtual).

    Quantization codebook optimization for color image under psychophysical constraints

    The information contained in a color image is spatially, spectrally, and perceptually redundant. In the context of color image compression by Vector Quantization (VQ), this redundancy quickly becomes a handicap in terms of performance (complexity, quality, and compression rate). By combining perceptual and classification criteria, the codebook quality can be improved while the associated construction time is reduced. In this paper, we propose a method for reducing the training set: by associating a perceptually relevant measure with each vector of the training set, a representative subset is extracted from it. A classification step, based on a parametric model of the measure, is then applied to this subset. For each resulting cluster, a codebook is built with the LBG algorithm, and the final codebook is the union of these per-cluster codebooks. Psychophysical tests and statistical measures of image quality validate the method in terms of construction time, reconstructed image quality, and compression rate.
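    The LBG (Linde-Buzo-Gray) step named above is the classic split-and-refine codebook design; in the paper it is run once per perceptual cluster and the resulting codebooks are concatenated. A minimal sketch of plain LBG on color vectors (the codebook size is assumed to be a power of two):

```python
# Sketch: plain LBG (Linde-Buzo-Gray) codebook design on color vectors.
# Written for clarity, not efficiency; `size` assumed a power of two.
import numpy as np

def lbg(train, size, eps=0.01, n_iter=20):
    codebook = train.mean(axis=0, keepdims=True)       # start: global centroid
    while len(codebook) < size:
        codebook = np.vstack([codebook * (1 + eps),    # split every codevector
                              codebook * (1 - eps)])
        for _ in range(n_iter):                        # Lloyd refinement
            d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            nearest = d.argmin(axis=1)
            for k in range(len(codebook)):             # recompute centroids
                members = train[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

# Hypothetical usage on an RGB image flattened to pixel vectors:
# pixels = image.reshape(-1, 3).astype(float)
# codebook = lbg(pixels, 256)
```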

    Tagungsband der 12. Tagung Phonetik und Phonologie im deutschsprachigen Raum (Proceedings of the 12th Conference on Phonetics and Phonology in the German-speaking Area)


    Sparsity in Linear Predictive Coding of Speech

    197 pages; status: published.