
    Deriving and Exploiting Situational Information in Speech: Investigations in a Simulated Search and Rescue Scenario

    The need for automatic recognition and understanding of speech is emerging in tasks involving the processing of large volumes of natural conversations. In application domains such as Search and Rescue, exploiting automated systems for extracting mission-critical information from speech communications has the potential to make a real difference. Spoken language understanding has commonly been approached by identifying units of meaning (such as sentences, named entities, and dialogue acts) to provide a basis for further discourse analysis. However, this fine-grained identification of fundamental units of meaning is sensitive to high error rates in the automatic transcription of noisy speech. This thesis demonstrates that topic segmentation and identification techniques can be employed for information extraction from spoken conversations while remaining robust to such errors. Two novel topic-based approaches are presented for extracting situational information within the search and rescue context. The first approach shows that identifying the changes in the context and content of first responders' reports over time can provide an estimation of their location. The second approach presents a speech-based topological map estimation technique that is inspired, in part, by automatic mapping algorithms commonly used in robotics. The proposed approaches are evaluated on a goal-oriented conversational speech corpus, which was designed and collected based on an abstract communication model between a first responder and a task leader during a search process. Results confirm that a highly imperfect transcription of noisy speech has limited impact on the information extraction performance compared with that obtained on the transcription of clean speech data. This thesis also shows that speech recognition accuracy can benefit from rescoring the initial transcription hypotheses based on the derived high-level location information. A new two-pass speech decoding architecture is presented. In this architecture, the location estimate from a first decoding pass is used to dynamically adapt a general language model, which is then used for rescoring the initial recognition hypotheses. This decoding strategy has resulted in a statistically significant gain in the recognition accuracy of the spoken conversations in high background noise. It is concluded that the techniques developed in this thesis can be extended to other application domains that deal with large volumes of natural spoken conversations.
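    The rescoring step described above can be pictured with a small sketch: a general language model is linearly interpolated with a location-conditioned model selected by the first pass, and the mixture is used to re-rank the first-pass hypotheses. This is a minimal illustration of the idea, not the thesis's implementation; the unigram tables, interpolation weight, and scores below are invented for the example.

```python
import math

def interpolated_logprob(word, general_lm, location_lm, lam=0.7):
    """Log-probability under a linear interpolation of two unigram LMs."""
    p = lam * general_lm.get(word, 1e-8) + (1 - lam) * location_lm.get(word, 1e-8)
    return math.log(p)

def rescore_nbest(nbest, general_lm, location_lm, lm_weight=1.0):
    """Re-rank first-pass hypotheses, given as (word list, acoustic log-prob)
    pairs, with the location-adapted language model."""
    def total_score(hyp):
        words, acoustic = hyp
        lm = sum(interpolated_logprob(w, general_lm, location_lm) for w in words)
        return acoustic + lm_weight * lm
    return max(nbest, key=total_score)

# Hypothetical usage: the first pass estimated the location "corridor", which
# selected location_lm; rescoring then prefers the matching hypothesis.
general_lm = {"moving": 0.02, "to": 0.05, "the": 0.06, "corridor": 0.001, "car": 0.004}
location_lm = {"corridor": 0.05, "door": 0.03, "stairs": 0.02}
nbest = [(["moving", "to", "the", "car"], -12.0),
         (["moving", "to", "the", "corridor"], -12.5)]
print(rescore_nbest(nbest, general_lm, location_lm))
```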

    Large vocabulary continuous speech recognition using linguistic features and constraints

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (leaves 111-123). Automatic speech recognition (ASR) is a process of applying constraints, as encoded in the computer system (the recognizer), to the speech signal until ambiguity is satisfactorily resolved to the extent that only one sequence of words is hypothesized. Such constraints fall naturally into two categories. One deals with the ordering of words (syntax) and the organization of their meanings (semantics, pragmatics, etc.). The other governs how speech signals are related to words, a process often termed "lexical access". This thesis studies the Huttenlocher-Zue lexical access model, its implementation in a modern probabilistic speech recognition framework, and its application to continuous speech from an open vocabulary. The Huttenlocher-Zue model advocates a two-pass lexical access paradigm. In the first pass, the lexicon is effectively pruned using broad linguistic constraints. In the original Huttenlocher-Zue model, the authors proposed six linguistic features motivated by the manner of pronunciation. The first pass classifies speech signals into a sequence of linguistic features, and only words that match this sequence, the cohort, are activated. The second pass performs a detailed acoustic-phonetic analysis within the cohort to decide the identity of the word. This model differs from the lexical access model commonly employed in today's speech recognizers, where detailed acoustic-phonetic analysis is performed directly and lexical items are retrieved in one pass. The thesis first studies the implementation issues of the Huttenlocher-Zue model. A number of extensions to the original proposal are made to take advantage of the existing facilities of a probabilistic, graph-based recognition framework and, more importantly, to model the broad linguistic features in a data-driven approach. First, we analyze speech signals along the two diagonal dimensions of manner and place of articulation, rather than the manner dimension alone. Secondly, we adopt a set of feature-based landmarks optimized for data-driven modeling as the basic recognition units, and Gaussian mixture models are trained for these units. We explore information fusion techniques to integrate constraints from both the manner and place dimensions, as well as examining how to integrate constraints from the feature-based first pass with the second pass of detailed acoustic-phonetic analysis. Our experiments on a large-vocabulary isolated word recognition task show that, while constraints from each individual feature dimension provide only limited help in this lexical access model, the utilization of both dimensions and information fusion techniques leads to a significant performance gain over a one-pass phonetic system. The thesis then proposes to generalize the original Huttenlocher-Zue model, which limits itself to isolated word tasks, to handle continuous speech. With continuous speech, the search space for both stages is infinite if all possible word sequences are allowed. We generalize the original cohort idea from the Huttenlocher-Zue proposal and use the bag of words of the N-best list of the first pass as cohorts for continuous speech. This approach transfers the constraints of broad linguistic features into a much reduced search space for the second stage. The thesis also studies how to recover from errors made by the first pass, which is not discussed in the original Huttenlocher-Zue proposal. In continuous speech recognition, a way of recovering from errors made in the first pass is vital to the performance of the overall system. We find empirical evidence that such errors tend to occur around function words, possibly due to the lack of prominence, in meaning and hence in linguistic features, of such words. This thesis proposes an error-recovery mechanism based on empirical analysis on a development set for the two-pass lexical access model. Our experiments on a medium-sized, telephone-quality continuous speech recognition task achieve higher accuracy than a state-of-the-art one-pass baseline system. The thesis applies the generalized two-pass lexical access model to the challenge of recognizing continuous speech from an open vocabulary. Telephony information query systems often need to deal with a large list of words that are not observed in the training data, for example the city names in a weather information query system. The large portion of vocabulary unseen in the training data, the open vocabulary, poses a serious data-sparseness problem for both acoustic and language modeling. A two-pass lexical access model provides a solution by activating a small cohort within the open vocabulary in the first pass, thus significantly reducing the data-sparseness problem. Also, the broad linguistic constraints in the first pass generalize better to unseen data compared to finer, context-dependent acoustic-phonetic models. This thesis also studies a data-driven analysis of acoustic similarities among open vocabulary items. The results are used for recovering possible errors in the first pass. This approach demonstrates an advantage over a two-pass approach based on specific semantic constraints. In summary, this thesis implements the original Huttenlocher-Zue two-pass lexical access model in a modern probabilistic speech recognition framework. It also extends the original model to recognize continuous speech from an open vocabulary, with our two-stage model achieving better performance than the baseline system. In the future, sub-lexical linguistic hierarchy constraints, such as syllables, can be introduced into this two-pass model to further improve lexical access performance. by Min Tang. Ph.D.
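    As a toy illustration of the two-pass lexical access paradigm described above, the sketch below maps pronunciations to sequences of broad manner classes, activates the cohort of words whose broad sequences match the first-pass output, and lets a detailed second pass choose within the cohort. The broad-class inventory, lexicon, and scores are illustrative assumptions, not the thesis's models.

```python
BROAD_CLASS = {  # hypothetical mapping from phones to broad manner classes
    "s": "FRIC", "z": "FRIC", "t": "STOP", "d": "STOP",
    "a": "VOWEL", "i": "VOWEL", "o": "VOWEL", "n": "NASAL", "m": "NASAL",
}

def broad_sequence(phones):
    """Map a phone sequence to its broad linguistic-feature sequence."""
    return tuple(BROAD_CLASS[p] for p in phones)

def first_pass_cohort(observed_broad_seq, lexicon):
    """Activate only words whose broad-class sequence matches the first pass."""
    return [w for w, phones in lexicon.items()
            if broad_sequence(phones) == observed_broad_seq]

def second_pass(cohort, detailed_scores):
    """Detailed acoustic-phonetic analysis, restricted to the cohort."""
    return max(cohort, key=lambda w: detailed_scores.get(w, float("-inf")))

lexicon = {"sat": ("s", "a", "t"), "sad": ("s", "a", "d"), "man": ("m", "a", "n")}
observed = ("FRIC", "VOWEL", "STOP")           # first-pass broad classification
cohort = first_pass_cohort(observed, lexicon)  # ['sat', 'sad']
print(second_pass(cohort, {"sat": -3.1, "sad": -4.0}))  # 'sat'
```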

    Toward an interpretive framework of two-dimensional speech-signal processing

    Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 177-179). Traditional representations of speech are derived from short-time segments of the signal and result in time-frequency distributions of energy such as the short-time Fourier transform and spectrogram. Speech-signal models of such representations have had utility in a variety of applications such as speech analysis, recognition, and synthesis. Nonetheless, they do not capture spectral, temporal, and joint spectrotemporal energy fluctuations (or "modulations") present in local time-frequency regions of the time-frequency distribution. Inspired by principles from image processing and evidence from auditory neurophysiological models, a variety of two-dimensional (2-D) processing techniques have been explored in the literature as alternative representations of speech; however, speech-based models are lacking in this framework. This thesis develops speech-signal models for a particular 2-D processing approach in which 2-D Fourier transforms are computed on local time-frequency regions of the canonical narrowband or wideband spectrogram; we refer to the resulting transformed space as the Grating Compression Transform (GCT). We argue for a 2-D sinusoidal-series amplitude modulation model of speech content in the spectrogram domain that relates to speech production characteristics such as pitch/noise of the source, pitch dynamics, formant structure and dynamics, and offset/onset content. Narrowband- and wideband-based models are shown to exhibit important distinctions in interpretation and oftentimes "dual" behavior. In the transformed GCT space, the modeling results in a novel taxonomy of signal behavior based on the distribution of formant and onset/offset content in the transformed space via source characteristics. Our formulation provides a speech-specific interpretation of the concept of "modulation" in 2-D processing, in contrast to existing approaches that have done so either phenomenologically through qualitative analyses and/or implicitly through data-driven machine learning approaches. One implication of the proposed taxonomy is its potential for interpreting transformations of other time-frequency distributions such as the auditory spectrogram, which is generally viewed as being "narrowband"/"wideband" in its low/high-frequency regions. The proposed signal model is evaluated in several ways. First, we perform analysis of synthetic speech signals to characterize its properties and limitations. Next, we develop an algorithm for analysis/synthesis of spectrograms using the model and demonstrate its ability to accurately represent real speech content. As an example application, we further apply the models in cochannel speaker separation, exploiting the GCT's ability to distribute speaker-specific content and often recover overlapping information through demodulation and interpolation in the 2-D GCT space. Specifically, in multi-pitch estimation, we demonstrate the GCT's ability to accurately estimate separate and crossing pitch tracks under certain conditions. Finally, we demonstrate the model's ability to separate mixtures of speech signals using both prior and estimated pitch information. Generalization to other speech-signal processing applications is proposed. by Tianyu Tom Wang. Ph.D.
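    The core GCT computation lends itself to a short sketch: tile a log-magnitude spectrogram into overlapping local time-frequency patches, remove each patch's mean, apply a 2-D taper, and take a 2-D FFT of each patch. This is a minimal reading of the approach described above; the patch size, hop, and window choice below are assumptions, not values from the thesis.

```python
import numpy as np
from scipy.signal import stft

def local_2d_transforms(x, fs, patch=(32, 32), hop=(16, 16)):
    """Return 2-D FFT magnitudes of overlapping log-spectrogram patches."""
    _, _, Z = stft(x, fs=fs, nperseg=256, noverlap=192)
    S = np.log(np.abs(Z) + 1e-10)                 # log-magnitude spectrogram
    win = np.outer(np.hanning(patch[0]), np.hanning(patch[1]))
    out = []
    for f0 in range(0, S.shape[0] - patch[0], hop[0]):
        for t0 in range(0, S.shape[1] - patch[1], hop[1]):
            p = S[f0:f0 + patch[0], t0:t0 + patch[1]]
            p = (p - p.mean()) * win              # remove DC, taper the patch
            out.append(np.abs(np.fft.fft2(p)))    # local 2-D spectrum
    return out

# Illustrative run on one second of noise standing in for a speech signal.
patches = local_2d_transforms(np.random.randn(16000), fs=16000)
print(len(patches), patches[0].shape)
```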

    Speaker Recognition and Transformation

    This PhD thesis tries to understand how to analyse, decompose, model, and transform the vocal identity of a human as seen through an automatic speaker recognition application. It starts with an introduction explaining the properties of the speech signal and the basics of automatic speaker recognition. Then, the errors of an operating speaker recognition application are analysed. From the deficiencies and mistakes noticed in the running application, observations can be made which imply a re-evaluation of the characteristic parameters of a speaker and a reconsideration of some parts of the automatic speaker recognition chain. In order to determine the characterising parameters of a speaker, these are extracted from the speech signal with an analysis and synthesis harmonic plus noise model (H+N). The analysis and re-synthesis of the harmonic and noise parts indicate those which are speech or speaker dependent. It is then shown that the speaker-discriminating information can be found in the residual of the subtraction of the H+N modeled signal from the original signal. Then, a study of the impostor phenomenon, essential in the tuning of a speaker recognition system, is carried out. The impostors are simulated in two ways: first, by a transformation of the speech of a source speaker (the impostor) to the speech of a target speaker (the client) using the parameters extracted from the H+N model. This way of transforming the parameters is efficient, as the false acceptance rate grows from 4% to 23%. Second, an automatic imposture by speech segment concatenation is carried out. In this case the false acceptance rate grows to 30%. A way to become less sensitive to the spectral modification impostures is to remove the harmonic part, or even the noise part modeled by the H+N, from the original signal. Using such a subtraction decreases the false acceptance rate to 8% even if transformed impostors are used. To overcome the lack of training data, one of the main causes of modeling errors in speaker recognition, a decomposition of the recognition task into a set of binary classifiers is proposed. A classifier matrix is built, and each of its elements has to classify, word by word, the data coming from the client and another speaker (named here an anti-speaker, randomly chosen from an external database). With such an approach it is possible to weight the results according to the vocabulary or the neighbours of the client in the parameter (acoustic) space. The outputs of the matrix classifiers are then weighted and mixed in order to produce a single output score. The weights are estimated on validation data, and if the weighting is done properly, the binary-pair speaker recognition system gives better results than a state-of-the-art HMM-based system. In order to set a point of operation (i.e., a point on the ROC curve) for the speaker recognition application, an a priori threshold has to be determined. Theoretically the threshold should be speaker independent when stochastic models are used. However, practical experiments show that this is not the case: due to modeling mismatch the threshold becomes speaker and utterance-length dependent. A theoretical framework showing how to adjust the threshold using the local likelihood ratio is then developed. Finally, a last modeling-error correction method using decision fusion is proposed. Practical experiments show the advantages and drawbacks of the fusion approach in speaker recognition applications.
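    The binary-pair decomposition summarized above can be sketched schematically: a matrix of (vocabulary word x anti-speaker) classifiers produces per-cell scores, which are combined with weights estimated on validation data into a single accept/reject decision. The scores, weights, and threshold below are invented for illustration and are not the thesis's trained models.

```python
import numpy as np

def fused_score(scores, weights):
    """Weighted combination of the (word x anti-speaker) classifier outputs.

    scores[i, j] > 0 means word i sounds more like the client than
    anti-speaker j; weights reflect how reliable each cell is judged to be.
    """
    w = weights / weights.sum()
    return float((w * scores).sum())

def accept(scores, weights, threshold=0.0):
    """Accept the identity claim if the fused score clears the threshold."""
    return fused_score(scores, weights) > threshold

# 3 vocabulary words x 4 anti-speakers, with made-up classifier scores.
scores = np.array([[0.8, 0.5, 1.2, 0.3],
                   [-0.2, 0.4, 0.6, 0.1],
                   [1.0, 0.9, 0.2, 0.7]])
weights = np.array([[1.0, 0.6, 1.0, 0.4],   # e.g. down-weight cells whose
                    [0.5, 0.8, 0.9, 0.3],   # anti-speaker is acoustically
                    [1.0, 1.0, 0.5, 0.6]])  # close to the client
print(accept(scores, weights))
```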

    The application of the self-generation effect to the learning of Blissymbols by persons presenting with severe aphasia

    A severe aphasia following a cerebrovascular accident is characterised by generalised deficits in most speech-language domains. The clinical dilemma remains focused on the extensive verbal speech impairment and, in most cases, little possibility of regaining verbal speech production. Many individuals living with severe aphasia use augmentative and alternative communication (AAC) strategies to assist them in getting their communication needs met in their everyday lives. The Blissymbol system is one of the graphic symbol systems that can be used to supplement the existing communication and speech strategies of an individual with little or no speech. Although the use of AAC strategies is gaining momentum in its application to severe aphasia, there remain questions about how best to help these individuals learn and retain such strategies. Not only are individuals with severe aphasia faced with a memory task when learning AAC strategies such as Blissymbols; the clinical presentation of severe aphasia adds further complexity to AAC interventions. The presence of extensive damage to the neural centers responsible for linguistic processing and semantic retrieval makes the learning of new AAC strategies all the more complicated. Research studies have looked at whether individuals with severe aphasia can learn to recognise and retain Blissymbols. Although these studies have successfully shown that individuals with severe aphasia can learn Blissymbols, there is little information available regarding how these symbols can best be taught and retained over time by individuals with severe aphasia. Recently, research on symbol learning with persons presenting with severe aphasia using computer technology and sophisticated application software has highlighted the importance of therapeutic methods that may enhance the learning of such software. This study looks at the application of the self-generation effect as a viable method for enhancing the recognition of Blissymbols in persons presenting with severe aphasia. The self-generation effect is the finding of superior retention and recall for stimuli constructed or generated by an individual. Memory for stimuli such as words, numbers, and pictures was found to be enhanced by the extent to which the individual was involved in their construction. Using a 2×2×3 factorial design, this study compared the recognition levels for Blissymbols taught using two treatment approaches: a self-generation condition and a non-self-generation condition. During three experimental sessions, which included two withdrawal periods, participants were taught, using both treatments, to recognise a set of Blissymbols. Recognition levels were tested during recognition probes and retention probes, and the results from these probes were compared in order to identify which treatment produced superior recognition levels. The data analysis showed that, although there was no recognition advantage for the self-generation effect during the three recognition probes, some advantage for the self-generation effect was seen during the retention probes. The self-generation effect began to emerge by the final retention probe, following a withdrawal period of seven days. The self-generation treatment showed better retention of symbol recognition over time. Previous studies have shown that the self-generation effect failed to emerge with stimuli that were new or unfamiliar; this trend was also seen in this study. The results provide support for a semantic-association theory for the self-generation effect. Thesis (PhD)--University of Pretoria, 2010. Centre for Augmentative and Alternative Communication (CAAC). Unrestricted.

    Voice-QA: evaluating the impact of misrecognized words on passage retrieval

    Question Answering is an Information Retrieval task where the query is posed using natural language and the expected result is a concise answer. Voice-activated Question Answering systems represent an interesting application, where the question is formulated by speech. In these systems, an Automatic Speech Recognition module can be used to transcribe the question. Thus, recognition errors may be introduced, producing a significant effect on the answer retrieval process. In this work we study the relationship between some features of misrecognized words and the retrieval results. The features considered are the redundancy of a word in the result set and its inverse document frequency calculated over the collection. The results show that the redundancy of a word may be an important clue as to whether an error in it will deteriorate the retrieval results, at least if a closed model is used for speech recognition. This work was carried out in the framework of TextEnterprise (TIN2009-13391-C04-03), Timpano (TIN2011-28169-C05-01), and WIQEI IRSES (grant no. 269180) within the FP7 Marie Curie People programme, FPU Grant AP2010-4193 from the Spanish Ministerio de Educación (first author), and the Microcluster VLC/Campus on Multimodal Intelligent Systems (third author). Calvo Lance, M., Buscaldi, D., Rosso, P. (2012). Voice-QA: evaluating the impact of misrecognized words on passage retrieval. In Advances in Artificial Intelligence - IBERAMIA 2012, pp. 462-471. Springer Verlag (Germany). https://doi.org/10.1007/978-3-642-34654-5_47
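    The two word features studied above are standard quantities and can be sketched directly: inverse document frequency computed over the collection, and redundancy measured as how often the word recurs in the retrieved result set. The exact formulations used in the paper may differ; this sketch follows the textbook definitions.

```python
import math

def idf(word, collection):
    """Inverse document frequency: log(N / df) over the whole collection."""
    df = sum(1 for doc in collection if word in doc)
    return math.log(len(collection) / df) if df else float("inf")

def redundancy(word, result_set):
    """Fraction of retrieved passages that contain the word."""
    return sum(1 for passage in result_set if word in passage) / len(result_set)

# Toy collection and result set, with documents as sets of terms.
collection = [{"paris", "capital", "france"}, {"paris", "hilton"}, {"rome", "italy"}]
results = [{"paris", "capital", "france"}, {"paris", "hilton"}]
print(idf("paris", collection), redundancy("paris", results))
```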

    An Enhanced Boyer-Moore Algorithm for Worst-Case Running Time

    This article addresses the exact string matching problem, which consists in finding all occurrences of a given pattern in a text. It is an extensively studied problem in the field of computer science, mainly due to its applicability in diverse areas such as cluster computing, image and signal processing, speech analysis and recognition, information retrieval, data compression, computational biology, intrusion detection, and virus scanning. In the last decade several new algorithms have been proposed. In this paper we compare previous improvements of the Boyer-Moore algorithm with our enhanced Boyer-Moore algorithm, both practically and theoretically. The enhanced algorithm not only generates the largest shift distances but also produces the minimum number of shifting steps and character comparisons, reducing both quantities during the searching process. Moreover, the results for this enhanced Boyer-Moore algorithm reveal that its efficiency is higher than that of previous improved Boyer-Moore algorithms, and its worst-case time complexity is reduced below that of the original BM algorithm. Our enhanced algorithm achieves a 16% speed-up over the previous improved Boyer-Moore algorithm when executed on the CPU. This enhanced Boyer-Moore algorithm can play an important role in extremely fast matching of genetic molecular and complex sequence patterns in DNA database alignment.
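    For reference, the sketch below implements the classic Boyer-Moore bad-character rule that the enhancements discussed above build on; the paper's enhanced shift tables are not reproduced here, so this is the baseline technique rather than the proposed algorithm.

```python
def boyer_moore(text, pattern):
    """Return the start indices of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character table: last index at which each character occurs.
    last = {c: i for i, c in enumerate(pattern)}
    hits, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1                         # compare right to left
        if j < 0:
            hits.append(s)                 # full match at shift s
            s += 1
        else:
            # Align the last occurrence of the mismatched text character
            # with its position in the pattern, moving right by at least 1.
            s += max(1, j - last.get(text[s + j], -1))
    return hits

print(boyer_moore("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))  # [5]
```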

    Eigenface Based Recognition of Emotion Variant Faces

    At present, automatic machine-based face recognition has become significant due to the urgency of its potential applications and the scientific challenges it poses to industry. However, most of the existing systems designed to date can only effectively distinguish faces when the source images are collected under numerous constrained conditions. The success rate of face recognition systems depends on a variety of information captured in images of human faces: pose of the face, facial expression, angle, occlusion, and the state of structural components are some of these. Emotions can be expressed in different observable ways, such as facial expression, speech, written text, and gestures. This work proposes an efficient approach for the recognition of expression- or emotion-variant faces, since there are very few emotion recognition software tools able to handle such problems and the area is of significant importance in the field of face recognition. In particular, the approach proposed here addresses face recognition where the facial expressions in the training image set and in the testing image set diverge and only one sample image per class exists in the system. The input to the system is a frontal, neutral-expression face image with a uniform background; in this image the hair is tied away from the face and facial hair is removed. The Principal Component Analysis approach was used as the primary mechanism in the proposed model. This approach was applied to a set of face images in order to extract a set of eigenface images as the output. The weights of the eigenface representation of an image are then used for the recognition of emotions. Euclidean distance, one of the distance metric approaches, is used to measure the distance between the weight vectors associated with each of the training images for the classification task. Keywords: Face Recognition, Emotion-variant faces, Image Processing, Principal Component Analysis, Euclidean Distance
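    The eigenface pipeline the abstract describes can be condensed into a short sketch: mean-center the training faces, take the top principal components via SVD, project every image to a weight vector, and classify a probe by Euclidean distance to the training weights. The image size, number of components, and labels below are illustrative assumptions.

```python
import numpy as np

def train_eigenfaces(faces, k=10):
    """faces: (n_images, n_pixels) matrix of flattened training faces."""
    mean = faces.mean(axis=0)
    X = faces - mean
    # SVD of the centered data; rows of Vt are the eigenfaces.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    eigenfaces = Vt[:k]
    weights = X @ eigenfaces.T            # per-image weight vectors
    return mean, eigenfaces, weights

def classify(probe, mean, eigenfaces, weights, labels):
    """Nearest training image in eigenface-weight space (Euclidean)."""
    w = (probe - mean) @ eigenfaces.T
    dists = np.linalg.norm(weights - w, axis=1)
    return labels[int(np.argmin(dists))]

# Toy usage with random stand-ins for 64x64 face images.
rng = np.random.default_rng(0)
faces = rng.random((6, 64 * 64))
labels = ["happy", "sad", "angry", "happy", "sad", "angry"]
mean, ef, W = train_eigenfaces(faces, k=4)
print(classify(faces[1] + 0.01 * rng.random(64 * 64), mean, ef, W, labels))
```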