
    Capacity and Complexity of HMM Duration Modeling Techniques

    The ability of a standard hidden Markov model (HMM) or expanded state HMM (ESHMM) to accurately model duration distributions of phonemes is compared with specific duration-focused approaches such as semi-Markov models or variable transition probabilities. It is demonstrated that either a three-state ESHMM or a standard HMM with an increased number of states is capable of closely matching both Gamma distributions and duration distributions of phonemes from the TIMIT corpus, as measured by Bhattacharyya distance to the true distributions. Standard HMMs are easily implemented with off-the-shelf tools, whereas duration models require substantial algorithmic development and carry higher computational costs, suggesting that a simple adjustment to HMM topologies is perhaps a more efficient solution to the problem of duration modelling than more complex approaches.
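    The comparison above rests on two pieces: the duration distribution implied by a chain of tied HMM states, and the Bhattacharyya distance to a target distribution. A minimal sketch follows, assuming a left-to-right chain of identical states with a common self-loop probability (which yields a shifted negative binomial duration); the paper's exact topology and parameterisation may differ.

```python
import numpy as np
from scipy import stats

def bhattacharyya(p, q):
    """Bhattacharyya distance between two discrete distributions on a shared support."""
    return -np.log(np.sum(np.sqrt(p * q)))

def chain_duration_pmf(n_states, self_loop, max_dur=200):
    """Duration pmf of a left-to-right chain of n identical states.

    Each state is occupied for a geometric number of frames, so the total
    duration is n_states plus a negative binomial count of extra frames.
    """
    d = np.arange(1, max_dur + 1)
    pmf = stats.nbinom.pmf(d - n_states, n_states, 1.0 - self_loop)
    return pmf / pmf.sum()

# Target: a Gamma duration distribution, discretised to frame durations
# (shape and scale here are illustrative, not taken from the paper).
d = np.arange(1, 201)
target = stats.gamma.pdf(d, a=3.0, scale=8.0)
target /= target.sum()

print(bhattacharyya(chain_duration_pmf(3, 0.9), target))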

    Multiple acoustic cues for Korean stops and automatic speech recognition

    The purpose of this thesis is to analyse acoustic characteristics of Korean stops by way of multivariate statistical tests, and to apply the results of the analysis in Automatic Speech Recognition (ASR) of Korean. Three acoustic cues that differentiate three types of Korean oral stops are closure duration, Voice Onset Time (VOT) and fundamental frequency (F0) of a vowel after a stop. We review the characteristics of these parameters previously reported in various phonetic studies and test their usefulness for differentiating the three types of stops on two databases, one with controlled contexts, as in other phonetic studies, and the other a continuous speech database designed for ASR. Statistical tests on both databases confirm that the three types of stops can be differentiated by the three acoustic parameters. In order to exploit these parameters for ASR, a context-dependent Hidden Markov Model (HMM) based baseline system with a short pause model is built, which yields a large improvement in performance over other systems. For modelling of the three acoustic parameters, an automatic segmentation technique for closure and VOT is developed. Samples of each acoustic parameter are modelled with univariate and multivariate probability distribution functions. Stop probability from these models is integrated by a post-processing technique. Our results show that integration of stop probability yields little improvement over the results of the baseline system. However, the results suggest that stop probabilities will be useful in determining the correct hypothesis with a larger lexicon containing more minimal pairs of words that differ by the identity of just one stop.
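    As a rough illustration of the multivariate modelling idea (a sketch, not the thesis's exact formulation), each stop type can be modelled by a multivariate Gaussian over the three cues and a test token assigned to the most likely class. The class labels and array layout below are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_stop_models(samples_by_class):
    """Fit one multivariate Gaussian per stop type.

    samples_by_class maps a stop type (e.g. 'lenis', 'fortis', 'aspirated';
    hypothetical labels) to an (n, 3) array of [closure duration, VOT, F0]
    measurements, with n >= 2 so a covariance can be estimated.
    """
    return {label: multivariate_normal(mean=X.mean(axis=0),
                                       cov=np.cov(X, rowvar=False))
            for label, X in samples_by_class.items()}

def classify_stop(models, cue_vector):
    """Return the stop type whose Gaussian assigns the highest log-density."""
    return max(models, key=lambda label: models[label].logpdf(cue_vector))
```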

    The listening talker: A review of human and algorithmic context-induced modifications of speech

    Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns in response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work on improving the robustness of speech output.

    Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation

    We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topically coherent units. We propose two methods for combining lexical and prosodic information using hidden Markov models and decision trees. Lexical information is obtained from a speech recognizer, and prosodic features are extracted automatically from speech waveforms. We evaluate our approach on the Broadcast News corpus, using the DARPA-TDT evaluation metrics. Results show that the prosodic model alone is competitive with word-based segmentation methods. Furthermore, we achieve a significant reduction in error by combining the prosodic and word-based knowledge sources.
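    One simple way to combine the two knowledge sources is a log-linear interpolation of per-position boundary posteriors; this is a sketch under that assumption, not the paper's HMM or decision-tree combination scheme, and the weight and threshold are illustrative.

```python
import numpy as np

def combined_boundary_prob(p_lexical, p_prosodic, lam=0.5):
    """Log-linear interpolation of lexical and prosodic boundary posteriors.

    lam weights the lexical source; lam = 0.5 trusts both sources equally.
    """
    log_p = lam * np.log(p_lexical) + (1.0 - lam) * np.log(p_prosodic)
    return np.exp(log_p)

# Hypothesise a topic boundary wherever the combined score exceeds a threshold.
p_lex = np.array([0.10, 0.80, 0.30])
p_pros = np.array([0.20, 0.90, 0.25])
boundaries = combined_boundary_prob(p_lex, p_pros) > 0.5
```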

    Modelling the effects of speech rate variation for automatic speech recognition

    Wrede B. Modelling the effects of speech rate variation for automatic speech recognition. Bielefeld (Germany): Bielefeld University; 2002.

    In automatic speech recognition it is a widely observed phenomenon that variations in speech rate cause severe degradations of speech recognition performance. This is because standard stochastic speech recognition systems specialise in average speech rate. Although many approaches to modelling speech rate variation have been made, an integrated approach in a substantial system still has to be developed. General approaches to rate modelling are based on rate-dependent models which are trained with rate-specific subsets of the training data. During decoding, a signal-based rate estimation is performed, according to which the set of rate-dependent models is selected. While such approaches are able to reduce the word error rate significantly, they suffer from shortcomings such as the reduction of available training data and the expensive training and decoding procedures.

    However, phonetic investigations show that there is a systematic relationship between speech rate and the acoustic characteristics of speech. In fast speech a tendency towards reduction can be observed, which can be described in more detail as a centralisation effect and an increase in coarticulation. Centralisation means that the formant frequencies of vowels tend to shift towards the vowel space centre, while increased coarticulation denotes the tendency of the spectral features of a vowel to shift towards those of its phonemic neighbour. The goal of this work is to investigate the possibility of incorporating knowledge of the systematic nature of the influence of speech rate variation on the acoustic features into speech rate modelling.

    In an acoustic-phonetic analysis of a large corpus of spontaneous speech it was shown that an increased degree of the two effects of centralisation and coarticulation can be found in fast speech. Several measures for these effects were developed and used in speech recognition experiments with rate-dependent models. A thorough investigation of rate-dependent models showed that with duration- and coarticulation-based measures significant increases in performance could be achieved. It was shown that by the use of different measures the models were adapted either to centralisation or to coarticulation. Further experiments showed that more detailed modelling with more rate classes yields a further improvement. It was also observed that a general basis for the models is needed before rate adaptation can be performed. In a comparison to other sources of acoustic variation it was shown that the effects of speech rate are as severe as those of speaker variation and environmental noise. All these results show that a more substantial system that models rate variations accurately must address both durational and spectral effects. The systematic nature of the effects indicates that continuous modelling is possible.
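    To make the centralisation effect concrete, one possible measure (a sketch; not necessarily one of the measures developed in the thesis) is the Euclidean distance of a vowel token's first two formants from the speaker's vowel-space centre.

```python
import numpy as np

def centralisation(vowel_tokens, token):
    """Distance of one (F1, F2) token from the vowel-space centre.

    vowel_tokens: (n, 2) array of F1/F2 values in Hz for one speaker.
    Smaller distances for tokens from fast speech would reflect the
    centralisation effect described above.
    """
    centre = np.asarray(vowel_tokens, dtype=float).mean(axis=0)
    return float(np.linalg.norm(np.asarray(token, dtype=float) - centre))
```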

    Hierarchical duration modeling for a speech recognition system

    Thesis (M.S.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997. Includes bibliographical references (p. 102-105). By Grace Chung.

    Exploiting prior knowledge during automatic key and chord estimation from musical audio

    Chords and keys are two ways of describing music. They are exemplary of a general class of symbolic notations that musicians use to exchange information about a music piece. This information can range from simple tempo indications such as "allegro" to precise instructions for a performer of the music. Concretely, both keys and chords are timed labels that describe the harmony during certain time intervals, where harmony refers to the way music notes sound together. Chords describe the local harmony, whereas keys offer a more global overview and consequently cover a sequence of multiple chords.

    Common to all music notations is that certain characteristics of the music are described while others are ignored. The adopted level of detail depends on the purpose of the intended information exchange. A simple description such as "menuet", for example, only serves to roughly describe the character of a music piece. Sheet music on the other hand contains precise information about the pitch, discretised information pertaining to timing and limited information about the timbre. Its goal is to permit a performer to recreate the music piece. Even so, the information about timing and timbre still leaves some space for interpretation by the performer.

    The opposite of a symbolic notation is a music recording. It stores the music in a way that allows for a perfect reproduction. The disadvantage of a music recording is that it does not allow one to manipulate a single aspect of a music piece in isolation, or at least not without degrading the quality of the reproduction. For instance, it is not possible to change the instrumentation in a music recording, even though this would only require the simple change of a few symbols in a symbolic notation.

    Despite the fundamental differences between a music recording and a symbolic notation, the two are of course intertwined. Trained musicians can listen to a music recording (or live music) and write down a symbolic notation of the played piece. This skill allows one, in theory, to create a symbolic notation for each recording in a music collection. In practice, however, this would be too labour-intensive for the large collections that are available these days through online stores or streaming services. Automating the notation process is therefore a necessity, and this is exactly the subject of this thesis. More specifically, this thesis deals with the extraction of keys and chords from a music recording. A database with keys and chords opens up applications that are not possible with a database of music recordings alone. On the one hand, chords can be used on their own as a compact representation of a music piece, for example to learn how to play an accompaniment for singing. On the other hand, keys and chords can also be used indirectly to accomplish another goal, such as finding similar pieces.

    Because music theory has been studied for centuries, a great body of knowledge about keys and chords is available. It is known that consecutive keys and chords form sequences that are anything but random. People have certain expectations that must be fulfilled in order to experience music as pleasant. Keys and chords are also strongly intertwined, as a given key implies that certain chords will likely occur, and a set of given chords implies an encompassing key in return.
    Consequently, a substantial part of this thesis is concerned with the question of whether musicological knowledge can be embedded in a technical framework in such a way that it helps to improve the automatic recognition of keys and chords. The technical framework adopted in this thesis is built around a hidden Markov model (HMM). This facilitates an easy separation of the different aspects involved in the automatic recognition of keys and chords. Most experiments reviewed in the thesis focus on taking into account musicological knowledge about the musical context and about the expected chord duration. Technically speaking, this involves a manipulation of the transition probabilities in the HMMs. To account for the interaction between keys and chords, every HMM state represents the combination of a key and a chord label.

    In the first part of the thesis, a number of alternatives for modelling the context are proposed. In particular, separate key change and chord change models are defined such that they closely mirror the way musicians conceive harmony. Multiple variants are considered that differ in the size of the context that is accounted for and in the knowledge source from which they were compiled. Some models are derived from a music corpus with key and chord notations, whereas others follow directly from music theory.

    In the second part of the thesis, the contextual models are embedded in a system for automatic key and chord estimation. The features used in that system are so-called chroma profiles, which represent the saliences of the pitch classes in the audio signal. These chroma profiles are acoustically modelled by means of templates (idealised profiles) and a distance measure. In addition to these acoustic models and the contextual models developed in the first part, durational models are also required. The latter ensure that the chord and key estimates attain specified mean durations. The resulting system is then used to conduct experiments that provide more insight into how each system component contributes to the ultimate key and chord output quality. During the experimental study, the system complexity is gradually increased, starting from a system containing only an acoustic model of the features, which is subsequently extended, first with duration models and afterwards with contextual models. The experiments show that taking into account the mean key and mean chord durations is essential to arrive at acceptable results for both key and chord estimation. The effect of using contextual information, however, is highly variable. On the one hand, the chord change model has only a limited positive impact on the chord estimation accuracy (two to three percentage points), but this impact is fairly stable across different model variants. On the other hand, the chord change model has a much larger potential to improve the key output quality (up to seventeen percentage points), but only on the condition that the variant of the model is well adapted to the tested music material. Lastly, the key change model has only a negligible influence on system performance.

    In the final part of this thesis, a couple of extensions to the formerly presented system are proposed and assessed. First, the global mean chord duration is replaced by key-chord specific values, which has a positive effect on the key estimation performance.
    Next, the HMM system is modified such that the prior chord duration distribution is no longer a geometric distribution but one that better approximates the observed durations in an appropriate data set. This modification leads to a small improvement of the chord estimation performance, but it requires the availability of a suitable data set with chord notations from which to retrieve a target durational distribution. A final experiment demonstrates that increasing the scope of the contextual model only leads to statistically insignificant improvements, while the required computational load increases greatly.
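    Two of the building blocks above lend themselves to a compact sketch: matching an observed chroma profile against an idealised chord template with a distance measure, and choosing a state's self-loop probability so that the implied geometric duration attains a specified mean. The cosine distance and the binary triad template below are assumptions; the abstract does not state the thesis's exact choices.

```python
import numpy as np

def chord_distance(chroma, template):
    """Cosine distance between an observed chroma profile and a chord template."""
    c = np.asarray(chroma, dtype=float)
    t = np.asarray(template, dtype=float)
    return 1.0 - (c @ t) / (np.linalg.norm(c) * np.linalg.norm(t))

# Binary template for a C major triad: pitch classes C, E and G.
c_major = np.zeros(12)
c_major[[0, 4, 7]] = 1.0

def self_loop_for_mean_duration(mean_frames):
    """Self-loop probability a of a single HMM state such that its geometric
    duration, with mean 1 / (1 - a), matches a target mean duration in frames."""
    return 1.0 - 1.0 / mean_frames
```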

    Beyond lexical meaning: probabilistic models for sign language recognition

    Ph.D. thesis (Doctor of Philosophy).