16,962 research outputs found

    Unsupervised pattern discovery in speech : applications to word acquisition and speaker segmentation

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2007. Includes bibliographical references (p. 167-176). We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a pre-specified inventory of lexical units (i.e. phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multi-word phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream. We demonstrate two applications of our pattern discovery procedure. First, we propose and evaluate two methods for automatically identifying sound clusters generated through pattern discovery. Our results show that high identification accuracy can be achieved for single word clusters using a constrained isolated word recognizer. Second, we apply acoustic pattern matching to the problem of speaker segmentation by attempting to find word-level speech patterns that are repeated by the same speaker. When used to segment a ten hour corpus of multi-speaker lectures, we found that our approach is able to generate segmentations that correlate well to independently generated human segmentations. By Alex Seungryong Park. Ph.D.

    MODIS: an audio motif discovery software

    MODIS is free software for speech and audio motif discovery, developed at IRISA Rennes. Motif discovery is the task of discovering and collecting occurrences of repeating patterns in the absence of prior knowledge or training material. MODIS is based on a generic approach to mining repeating audio sequences, with tolerance to motif variability. The implementation processes large audio streams at a reasonable speed, whereas motif discovery often requires a huge amount of time.

    Searching Acoustic Patterns in Speech Data without Recognition

    This work investigates methods for detecting words, phrases, and longer segments in large speech data sets without prior knowledge of the data. First, the topic and the principles of modern methods for finding repeating objects are introduced. The representation and segmentation of the input data are then described, followed by techniques for detecting objects in speech and the modelling of the discovered motifs. Next, a method for searching for objects according to a predefined pattern is described. The data sets for the experiments, in which spoken term detection by example is performed, are then defined. The system requirements are described. In the conclusion, the work is summarised and suggestions for further development are discussed.

    The Effect of Explicit Structure Encoding of Deep Neural Networks for Symbolic Music Generation

    With recent breakthroughs in artificial neural networks, deep generative models have become one of the leading techniques for computational creativity. Despite very promising progress on image and short sequence generation, symbolic music generation remains a challenging problem since the structure of compositions is usually complicated. In this study, we attempt to solve the melody generation problem constrained by a given chord progression. This music meta-creation problem can also be incorporated into a plan recognition system with user inputs and predictive structural outputs. In particular, we explore the effect of explicit architectural encoding of musical structure by comparing two sequential generative models: LSTM (a type of RNN) and WaveNet (dilated temporal-CNN). As far as we know, this is the first study applying WaveNet to symbolic music generation, as well as the first systematic comparison between temporal-CNN and RNN for music generation. We conducted a survey to evaluate the generated music and used the Variable Markov Oracle for music pattern discovery. Experimental results show that encoding structure more explicitly using a stack of dilated convolution layers improved performance significantly, and that a global encoding of the underlying chord progression in the generation procedure yields further gains. Comment: 8 pages, 13 figures

    Unsupervised mining of audiovisually consistent segments in videos with application to structure analysis

    In this paper, a multimodal event mining technique is proposed to discover repeating video segments exhibiting audio and visual consistency in a totally unsupervised manner. The mining strategy first exploits independent audio and visual cluster analysis to provide segments which are consistent in both their visual and audio modalities, and are thus likely to correspond to a unique underlying event. A subsequent modeling stage using discriminative models enables accurate detection of the underlying event throughout the video. Event mining is applied to unsupervised video structure analysis, using simple heuristics on the occurrence patterns of the discovered events to select those relevant to the video structure. Results on TV programs ranging from news to talk shows and games show that structurally relevant events are discovered with precisions ranging from 87% to 98% and recalls from 59% to 94%.