Unsupervised pattern discovery in speech: applications to word acquisition and speaker segmentation
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2007. Includes bibliographical references (p. 167-176).

We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a pre-specified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multi-word phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.

We demonstrate two applications of our pattern discovery procedure. First, we propose and evaluate two methods for automatically identifying sound clusters generated through pattern discovery. Our results show that high identification accuracy can be achieved for single-word clusters using a constrained isolated word recognizer. Second, we apply acoustic pattern matching to the problem of speaker segmentation by attempting to find word-level speech patterns that are repeated by the same speaker. When used to segment a ten-hour corpus of multi-speaker lectures, we found that our approach is able to generate segmentations that correlate well with independently generated human segmentations.

by Alex Seungryong Park. Ph.D.
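The dynamic programming technique underlying this kind of acoustic pattern matching is dynamic time warping (DTW). The following is a minimal illustrative sketch of plain DTW between two feature sequences, not the thesis's segmental variant; the toy one-dimensional "utterances" are invented for the example.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two feature
    sequences (frames x dims), using Euclidean frame distances."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame of b
                                 cost[i, j - 1],      # skip a frame of a
                                 cost[i - 1, j - 1])  # align both frames
    return cost[n, m]

# Toy "utterances": y is a near-repeat of x; z is unrelated.
x = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
y = np.array([[0.1], [1.1], [1.9], [1.0], [0.0]])
z = np.array([[5.0], [4.0], [5.0], [4.0], [5.0]])

assert dtw_distance(x, y) < dtw_distance(x, z)  # the repeat is closer
```

In the segmental setting, such alignments are computed over subsequences of long utterances, and low-distortion alignment fragments become candidate repeating patterns for clustering.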
MODIS: an audio motif discovery software
MODIS is a free speech and audio motif discovery software package developed at IRISA Rennes. Motif discovery is the task of discovering and collecting occurrences of repeating patterns in the absence of prior knowledge or training material. MODIS is based on a generic approach to mining repeating audio sequences, with tolerance to motif variability. The implementation can process large audio streams at a reasonable speed, whereas motif discovery often requires a huge amount of time.
Searching Acoustic Patterns in Speech Data without Recognition
This work investigates methods for detecting words, word phrases, and longer segments in large speech data sets in an unsupervised way. First, the basics of the topic and the principles of modern methods for searching for repeating objects are introduced. The representation and segmentation of the input data are then described, techniques for object detection in speech are presented, and the modelling of found motifs is covered. Next, a method for detecting objects from a predefined example is described, and data sets are defined for experiments in which spoken term detection by example is performed. The system requirements are described. In conclusion, the work is summarised and suggestions for further development are discussed.
Spatial grouping resolves ambiguity to drive temporal recalibration.
Cross-modal temporal recalibration describes a shift in the point of subjective simultaneity (PSS) between two events following repeated exposure to asynchronous cross-modal inputs (the adaptors). Previous research suggested that audiovisual recalibration is insensitive to the spatial relationship between the adaptors. Here we show that audiovisual recalibration can be driven by cross-modal spatial grouping. Twelve participants adapted to alternating trains of lights and tones. Spatial position was manipulated, with alternating sequences of a light then a tone, or a tone then a light, presented on either side of fixation (e.g., left tone-left light-right tone-right light, etc.). As the events were evenly spaced in time, in the absence of spatial-based grouping it would be unclear whether tones were leading or lagging lights. However, any grouping of spatially colocalized cross-modal events would result in an unambiguous sense of temporal order. We found that adapting to these stimuli caused the PSS between subsequent lights and tones to shift toward the temporal relationship implied by spatial-based grouping. These data therefore show that temporal recalibration is facilitated by spatial grouping.
The Effect of Explicit Structure Encoding of Deep Neural Networks for Symbolic Music Generation
With recent breakthroughs in artificial neural networks, deep generative models have become one of the leading techniques for computational creativity. Despite very promising progress on image and short-sequence generation, symbolic music generation remains a challenging problem, since the structure of compositions is usually complicated. In this study, we attempt to solve the melody generation problem constrained by a given chord progression. This music meta-creation problem can also be incorporated into a plan recognition system with user inputs and predictive structural outputs. In particular, we explore the effect of explicit architectural encoding of musical structure by comparing two sequential generative models: LSTM (a type of RNN) and WaveNet (a dilated temporal CNN). As far as we know, this is the first study applying WaveNet to symbolic music generation, as well as the first systematic comparison between temporal CNNs and RNNs for music generation. We conducted a survey to evaluate the generated music and applied the Variable Markov Oracle to music pattern discovery. Experimental results show that encoding structure more explicitly with a stack of dilated convolution layers improves performance significantly, and that a global encoding of the underlying chord progression into the generation procedure yields further gains. Comment: 8 pages, 13 figures
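The reason a stack of dilated convolution layers captures long-range musical structure is that the receptive field grows exponentially with depth when dilations double, while the parameter count grows only linearly. A minimal sketch of that arithmetic (the function name and kernel size are illustrative, not from the paper):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated causal convolutions:
    each layer adds (kernel_size - 1) * dilation frames of context."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# WaveNet-style doubling dilations: 4 layers already see 16 steps,
# 6 layers see 64 steps of musical context.
assert receptive_field(2, [1, 2, 4, 8]) == 16
assert receptive_field(2, [1, 2, 4, 8, 16, 32]) == 64
```

By contrast, an undilated stack of the same depth would see only `1 + layers * (kernel_size - 1)` steps, which is why explicit dilation helps with long-span structure such as a chord progression.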
A comparative evaluation of algorithms for discovering translational patterns in Baroque keyboard works
We consider the problem of intra-opus pattern discovery, that is, the task of discovering patterns of a specified type within a piece of music. A music analyst undertook this task for works by Domenico Scarlatti and Johann Sebastian Bach, forming a benchmark of 'target' patterns. The performance of two existing algorithms and one of our own creation, called SIACT, is evaluated by comparison with this benchmark. SIACT outperforms the existing algorithms with regard to recall and, more often than not, precision. It is demonstrated that in all but the most carefully selected excerpts of music, the two existing algorithms can be affected by what is termed the 'problem of isolated membership'. Central to the relative success of SIACT is our intention that it should address this particular problem. The paper contrasts string-based and geometric approaches to pattern discovery, with an introduction to the latter. Suggestions for future work are given.
Unsupervised mining of audiovisually consistent segments in videos with application to structure analysis
In this paper, a multimodal event mining technique is proposed to discover repeating video segments exhibiting audio and visual consistency in a totally unsupervised manner. The mining strategy first exploits independent audio and visual cluster analysis to provide segments which are consistent in both their visual and audio modalities, thus likely corresponding to a unique underlying event. A subsequent modeling stage using discriminative models enables accurate detection of the underlying event throughout the video. Event mining is applied to unsupervised video structure analysis, using simple heuristics on the occurrence patterns of the discovered events to select those relevant to the video structure. Results on TV programs ranging from news to talk shows and games show that structurally relevant events are discovered with precision ranging from 87% to 98% and recall from 59% to 94%.