190 research outputs found

    Speech Recognition

    Chapters in the first part of the book cover the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and applications that operate in real-world environments, such as mobile communication services and smart homes.

    Searching Spontaneous Conversational Speech: Proceedings of ACM SIGIR Workshop (SSCS2008)

    Multiple Media Correlation: Theory and Applications

    This thesis introduces multiple media correlation, a new technology for the automatic alignment of multiple media objects such as text, audio, and video. This research began with the question: what can be learned when multiple multimedia components are analyzed simultaneously? Most ongoing research in computational multimedia has focused on queries, indexing, and retrieval within a single media type. Video is compressed and searched independently of audio; text is indexed without regard to the temporal relationships it may have to other media data. Multiple media correlation provides a framework for locating and exploiting correlations between multiple, potentially heterogeneous, media streams. The goal is computed synchronization: the determination of temporal and spatial alignments that optimize a correlation function and indicate commonality and synchronization between media objects. The model also provides a basis for comparison of media in unrelated domains. There are many real-world applications for this technology, including speaker localization, musical score alignment, and degraded media realignment. Two applications, text-to-speech alignment and parallel text alignment, are described in detail with experimental validation. Text-to-speech alignment computes the alignment between a textual transcript and speech-based audio. The presented solutions are effective for a wide variety of content and are useful not only for retrieval of content but also in support of automatic captioning of movies and video. Parallel text alignment provides a tool for comparing alternative translations of the same document, which is particularly useful to the classics scholar interested in comparing translation techniques or styles.
    The results presented in this thesis include (a) new media models more useful in analysis applications, (b) a theoretical model for multiple media correlation, (c) two practical application solutions that have widespread applicability, and (d) Xtrieve, a multimedia database retrieval system that demonstrates this new technology and its application to information retrieval. This thesis demonstrates that computed alignment of media objects is practical and can provide immediate solutions to many information retrieval and content presentation problems. It also introduces a new area for research in media data analysis.

    Knowledge Reasoning with Graph Neural Networks

    Knowledge reasoning is the process of drawing conclusions from existing facts and rules, which requires a range of capabilities including but not limited to understanding concepts, applying logic, and calibrating or validating architecture based on existing knowledge. With the explosive growth of communication techniques and mobile devices, much of collective human knowledge resides on the Internet today, in unstructured and semi-structured forms such as text, tables, images, and videos. It is overwhelmingly difficult for humans to navigate this vast body of Internet knowledge without the help of intelligent systems such as search engines and question answering systems. To serve various information needs, in this thesis we develop methods to perform knowledge reasoning over both structured and unstructured data. This thesis attempts to answer the following research questions on the topic of knowledge reasoning: (1) How to perform multi-hop reasoning over knowledge graphs? How should we leverage graph neural networks to learn graph-aware representations efficiently? And how to systematically handle the noise in human questions? (2) How to combine deep learning and symbolic reasoning in a consistent probabilistic framework? How to make the inference efficient and scalable for large-scale knowledge graphs? Can we strike a balance between the representation power and the simplicity of the model? (3) What is the reasoning pattern of graph neural networks for knowledge-aware QA tasks? Can those elaborately designed GNN modules really perform complex reasoning? Are they under- or over-complicated? Can we design a much simpler yet effective model to achieve comparable performance? (4) How to build an open-domain question answering system that can reason over multiple retrieved documents? How to efficiently rank and filter the retrieved documents to reduce the noise for the downstream answer prediction module? How to propagate and assemble the information among multiple retrieved documents? (5) How to answer questions that require numerical reasoning over textual passages? How to enable pre-trained language models to perform numerical reasoning? We explored the research questions above and found that graph neural networks can be leveraged as a powerful tool for various knowledge reasoning tasks over both structured and unstructured knowledge sources. On structured, graph-based knowledge sources, we build graph neural networks on top of the graph structure to capture the topology information for downstream reasoning tasks. On unstructured, text-based knowledge sources, we first identify graph-structured information such as entity co-occurrence and entity-number binding, and then employ graph neural networks to reason over the constructed graphs, working together with pre-trained language models to handle the unstructured part of the knowledge source. Ph.D.

    Acta Cybernetica : Volume 16. Number 4.

    Unsupervised speech processing with applications to query-by-example spoken term detection

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (p. 163-173). This thesis is motivated by the challenge of searching and extracting useful information from speech data in a completely unsupervised setting. In many real-world speech processing problems, obtaining annotated data is neither cost- nor time-effective. We therefore ask how much we can learn from speech data without any transcription. To address this question, we chose query-by-example spoken term detection as a specific scenario to demonstrate that this task can be done in the unsupervised setting without any annotations. To build the unsupervised spoken term detection framework, we contributed three main techniques that together form a complete workflow. First, we present two posteriorgram-based speech representations which enable speaker-independent and noisy spoken term matching. The feasibility and effectiveness of both posteriorgram features are demonstrated through a set of spoken term detection experiments on different datasets. Second, we show two lower-bounding methods for Dynamic Time Warping (DTW) based pattern matching algorithms. Both algorithms greatly outperform conventional DTW in a single-threaded computing environment. Third, we describe the parallel implementation of the lower-bounded DTW search algorithm. Experimental results indicate that the total running time of the entire spoken term detection system grows linearly with corpus size. We also present the training of large Deep Belief Networks (DBNs) on Graphics Processing Units (GPUs). The phonetic classification experiment on the TIMIT corpus showed a speed-up of 36x for pre-training and 45x for back-propagation for a two-layer DBN trained on the GPU platform compared to the CPU platform. By Yaodong Zhang. Ph.D.
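
The general idea of lower-bounded DTW search mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis's method: the thesis bounds DTW over posteriorgram features, whereas the sketch below assumes plain scalar sequences with an absolute-difference local cost, and the function names (dtw_distance, envelope_lower_bound, nearest_by_dtw) are hypothetical. It shows only why a cheap bound lets a search skip many full DTW computations.

```python
import math


def dtw_distance(a, b):
    """Unconstrained DTW with absolute-difference local cost (assumed here)."""
    n, m = len(a), len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix costs nothing
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def envelope_lower_bound(query, candidate):
    """Cheap O(n) bound that never exceeds dtw_distance(query, candidate).

    Every query element must be aligned with at least one candidate element,
    so it pays at least its distance to the candidate's [min, max] range.
    """
    lo, hi = min(candidate), max(candidate)
    return sum(max(0.0, x - hi, lo - x) for x in query)


def nearest_by_dtw(query, candidates):
    """Return (index, distance, number of full DTW computations)."""
    best_idx, best_dist, full_computations = -1, math.inf, 0
    for idx, cand in enumerate(candidates):
        # If even the optimistic bound cannot beat the best match so far,
        # the expensive DTW computation is skipped.
        if envelope_lower_bound(query, cand) >= best_dist:
            continue
        full_computations += 1
        dist = dtw_distance(query, cand)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist, full_computations


if __name__ == "__main__":
    query = [1, 2, 3, 2, 1]
    candidates = [[1, 1, 2, 3, 2, 1], [10, 11, 12, 11, 10], [1, 2, 3, 3, 2, 1]]
    print(nearest_by_dtw(query, candidates))
```

Because the bound is never larger than the true DTW distance, pruning on it cannot discard the best match; how much work it saves depends on how tight the bound is for the data at hand.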