616 research outputs found

    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    Get PDF
    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation.Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200

    Temporal Action Segmentation: An Analysis of Modern Techniques

    Full text link
    Temporal action segmentation (TAS) in videos aims at densely identifying video frames in minutes-long videos with multiple action classes. As a long-range video understanding task, researchers have developed an extended collection of methods and examined their performance using various benchmarks. Despite the rapid growth of TAS techniques in recent years, no systematic survey has been conducted in these sectors. This survey analyzes and summarizes the most significant contributions and trends. In particular, we first examine the task definition, common benchmarks, types of supervision, and prevalent evaluation measures. In addition, we systematically investigate two essential techniques of this topic, i.e., frame representation and temporal modeling, which have been studied extensively in the literature. We then conduct a thorough review of existing TAS works categorized by their levels of supervision and conclude our survey by identifying and emphasizing several research gaps. In addition, we have curated a list of TAS resources, which is available at https://github.com/nus-cvml/awesome-temporal-action-segmentation.Comment: 19 pages, 9 figures, 8 table

    SEGMENTATION, RECOGNITION, AND ALIGNMENT OF COLLABORATIVE GROUP MOTION

    Get PDF
    Modeling and recognition of human motion in videos has broad applications in behavioral biometrics, content-based visual data analysis, security and surveillance, as well as designing interactive environments. Significant progress has been made in the past two decades by way of new models, methods, and implementations. In this dissertation, we focus our attention on a relatively less investigated sub-area called collaborative group motion analysis. Collaborative group motions are those that typically involve multiple objects, wherein the motion patterns of individual objects may vary significantly in both space and time, but the collective motion pattern of the ensemble allows characterization in terms of geometry and statistics. Therefore, the motions or activities of an individual object constitute local information. A framework to synthesize all local information into a holistic view, and to explicitly characterize interactions among objects, involves large scale global reasoning, and is of significant complexity. In this dissertation, we first review relevant previous contributions on human motion/activity modeling and recognition, and then propose several approaches to answer a sequence of traditional vision questions including 1) which of the motion elements among all are the ones relevant to a group motion pattern of interest (Segmentation); 2) what is the underlying motion pattern (Recognition); and 3) how two motion ensembles are similar and how we can 'optimally' transform one to match the other (Alignment). Our primary practical scenario is American football play, where the corresponding problems are 1) who are offensive players; 2) what are the offensive strategy they are using; and 3) whether two plays are using the same strategy and how we can remove the spatio-temporal misalignment between them due to internal or external factors. The proposed approaches discard traditional modeling paradigm but explore either concise descriptors, hierarchies, stochastic mechanism, or compact generative model to achieve both effectiveness and efficiency. In particular, the intrinsic geometry of the spaces of the involved features/descriptors/quantities is exploited and statistical tools are established on these nonlinear manifolds. These initial attempts have identified new challenging problems in complex motion analysis, as well as in more general tasks in video dynamics. The insights gained from nonlinear geometric modeling and analysis in this dissertation may hopefully be useful toward a broader class of computer vision applications

    Human action segmentation via controlled use of missing data in HMMs

    Full text link
    Segmentation of individual actions from a stream of human motion is an open problem in computer vision. This paper approaches the problem of segmenting higher-level activities into their component sub-actions using Hidden Markov Models modified to handle missing data in the observation vector. By controlling the use of missing data, action labels can be inferred from the observation vector during inferencing, thus performing segmentation and classification simultaneously. The approach is able to segment both prominent and subtle actions, even when subtle actions are grouped together. The advantage of this method over sliding windows and Viterbi state sequence interrogation is that segmentation is performed as a trainable task, and the temporal relationship between actions is encoded in the model and used as evidence for action labelling

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Efficient duration and hierarchical modeling for human activity recognition

    Get PDF
    A challenge in building pervasive and smart spaces is to learn and recognize human activities of daily living (ADLs). In this paper, we address this problem and argue that in dealing with ADLs, it is beneficial to exploit both their typical duration patterns and inherent hierarchical structures. We exploit efficient duration modeling using the novel Coxian distribution to form the Coxian hidden semi-Markov model (CxHSMM) and apply it to the problem of learning and recognizing ADLs with complex temporal dependencies.The Coxian duration model has several advantages over existing duration parameterization using multinomial or exponential family distributions, including its denseness in the space of non negative distributions, low number of parameters, computational efficiency and the existence of closed-form estimation solutions. Further we combine both hierarchical and duration extensions of the hidden Markov model (HMM) to form the novel switching hidden semi-Markov model (SHSMM), and empirically compare its performance with existing models. The model can learn what an occupant normally does during the day from unsegmented training data and then perform online activity classification, segmentation and abnormality detection. Experimental results show that Coxian modeling outperforms a range of baseline models for the task of activity segmentation. We also achieve arecognition accuracy competitive to the current state-of-the-art multinomial duration model, while gaining a significant reduction in computation. Furthermore, cross-validation model selection on the number of phases K in the Coxian indicates that only a small Kis required to achieve the optimal performance. Finally, our models are further tested in a more challenging setting in which the tracking is often lost and the activities considerably overlap. With a small amount of labels supplied during training in a partially supervised learning mode, our models are again able to deliver reliable performance, again with a small number of phases, making our proposed framework an attractive choice for activity modeling

    Hidden Markov Models

    Get PDF
    Hidden Markov Models (HMMs), although known for decades, have made a big career nowadays and are still in state of development. This book presents theoretical issues and a variety of HMMs applications in speech recognition and synthesis, medicine, neurosciences, computational biology, bioinformatics, seismology, environment protection and engineering. I hope that the reader will find this book useful and helpful for their own research
    • …
    corecore