Unsupervised pattern discovery in speech: applications to word acquisition and speaker segmentation
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2007. Includes bibliographical references (p. 167-176).

We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a pre-specified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multi-word phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.

We demonstrate two applications of our pattern discovery procedure. First, we propose and evaluate two methods for automatically identifying sound clusters generated through pattern discovery. Our results show that high identification accuracy can be achieved for single-word clusters using a constrained isolated word recognizer. Second, we apply acoustic pattern matching to the problem of speaker segmentation by attempting to find word-level speech patterns that are repeated by the same speaker. When used to segment a ten-hour corpus of multi-speaker lectures, we found that our approach is able to generate segmentations that correlate well with independently generated human segmentations.

by Alex Seungryong Park. Ph.D.
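The dynamic programming technique underlying this kind of acoustic pattern matching is dynamic time warping (DTW). The following is a minimal illustrative sketch of plain DTW between two feature sequences, not the thesis's segmental variant; the toy one-dimensional "utterances" are invented for the example.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two feature
    sequences (frames x dims), using Euclidean frame distances."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame of b
                                 cost[i, j - 1],      # skip a frame of a
                                 cost[i - 1, j - 1])  # align both frames
    return cost[n, m]

# Toy "utterances": y is a near-repeat of x; z is unrelated.
x = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
y = np.array([[0.1], [1.1], [1.9], [1.0], [0.0]])
z = np.array([[5.0], [4.0], [5.0], [4.0], [5.0]])

assert dtw_distance(x, y) < dtw_distance(x, z)  # the repeat is closer
```

In the segmental setting, such alignments are computed over subsequences of long utterances, and low-distortion alignment fragments become candidate repeating patterns for clustering.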
MODIS: an audio motif discovery software
MODIS is a free speech and audio motif discovery software package developed at IRISA Rennes. Motif discovery is the task of discovering and collecting occurrences of repeating patterns in the absence of prior knowledge or training material. MODIS is based on a generic approach to mining repeating audio sequences, with tolerance to motif variability. The implementation can process large audio streams at a reasonable speed, whereas motif discovery often requires a huge amount of time.
Searching Acoustic Patterns in Speech Data without Recognition
This work investigates methods for detecting words, word phrases, and longer segments in large speech data sets in an unsupervised way. First, the basics of the topic and the principles of modern methods for searching for repeating objects are introduced. The representation and segmentation of the input data are then described, techniques for object detection in speech are presented, and the modelling of found motifs is covered. Next, a method for detecting objects from a predefined example is described, and data sets are defined for experiments in which spoken term detection by example is performed. The system requirements are described. In conclusion, the work is summarised and suggestions for further development are discussed.
Spatial grouping resolves ambiguity to drive temporal recalibration.
Cross-modal temporal recalibration describes a shift in the point of subjective simultaneity (PSS) between two events following repeated exposure to asynchronous cross-modal inputs (the adaptors). Previous research suggested that audiovisual recalibration is insensitive to the spatial relationship between the adaptors. Here we show that audiovisual recalibration can be driven by cross-modal spatial grouping. Twelve participants adapted to alternating trains of lights and tones. Spatial position was manipulated, with alternating sequences of a light then a tone, or a tone then a light, presented on either side of fixation (e.g., left tone-left light-right tone-right light, etc.). As the events were evenly spaced in time, in the absence of spatial-based grouping it would be unclear whether tones were leading or lagging lights. However, any grouping of spatially colocalized cross-modal events would result in an unambiguous sense of temporal order. We found that adapting to these stimuli caused the PSS between subsequent lights and tones to shift toward the temporal relationship implied by spatial-based grouping. These data therefore show that temporal recalibration is facilitated by spatial grouping.
The Effect of Explicit Structure Encoding of Deep Neural Networks for Symbolic Music Generation
With recent breakthroughs in artificial neural networks, deep generative models have become one of the leading techniques for computational creativity. Despite very promising progress on image and short-sequence generation, symbolic music generation remains a challenging problem, since the structure of compositions is usually complicated. In this study, we attempt to solve the melody generation problem constrained by a given chord progression. This music meta-creation problem can also be incorporated into a plan recognition system with user inputs and predictive structural outputs. In particular, we explore the effect of explicit architectural encoding of musical structure by comparing two sequential generative models: LSTM (a type of RNN) and WaveNet (a dilated temporal CNN). As far as we know, this is the first study applying WaveNet to symbolic music generation, as well as the first systematic comparison between temporal CNNs and RNNs for music generation. We conducted a survey to evaluate the generated music and applied the Variable Markov Oracle to music pattern discovery. Experimental results show that encoding structure more explicitly with a stack of dilated convolution layers improves performance significantly, and that a global encoding of the underlying chord progression into the generation procedure yields further gains. Comment: 8 pages, 13 figures
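The reason a stack of dilated convolution layers captures long-range musical structure is that the receptive field grows exponentially with depth when dilations double, while the parameter count grows only linearly. A minimal sketch of that arithmetic (the function name and kernel size are illustrative, not from the paper):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated causal convolutions:
    each layer adds (kernel_size - 1) * dilation frames of context."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# WaveNet-style doubling dilations: 4 layers already see 16 steps,
# 6 layers see 64 steps of musical context.
assert receptive_field(2, [1, 2, 4, 8]) == 16
assert receptive_field(2, [1, 2, 4, 8, 16, 32]) == 64
```

By contrast, an undilated stack of the same depth would see only `1 + layers * (kernel_size - 1)` steps, which is why explicit dilation helps with long-span structure such as a chord progression.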
A comparative evaluation of algorithms for discovering translational patterns in Baroque keyboard works
We consider the problem of intra-opus pattern discovery, that is, the task of discovering patterns of a specified type within a piece of music. A music analyst undertook this task for works by Domenico Scarlatti and Johann Sebastian Bach, forming a benchmark of 'target' patterns. The performance of two existing algorithms and one of our own creation, called SIACT, is evaluated by comparison with this benchmark. SIACT outperforms the existing algorithms with regard to recall and, more often than not, precision. It is demonstrated that in all but the most carefully selected excerpts of music, the two existing algorithms can be affected by what is termed the 'problem of isolated membership'. Central to the relative success of SIACT is our intention that it should address this particular problem. The paper contrasts string-based and geometric approaches to pattern discovery, with an introduction to the latter. Suggestions for future work are given.
Unsupervised mining of audiovisually consistent segments in videos with application to structure analysis
In this paper, a multimodal event mining technique is proposed to discover repeating video segments exhibiting audio and visual consistency in a totally unsupervised manner. The mining strategy first exploits independent audio and visual cluster analysis to provide segments which are consistent in both their visual and audio modalities, thus likely corresponding to a unique underlying event. A subsequent modeling stage using discriminative models enables accurate detection of the underlying event throughout the video. Event mining is applied to unsupervised video structure analysis, using simple heuristics on the occurrence patterns of the discovered events to select those relevant to the video structure. Results on TV programs ranging from news to talk shows and games show that structurally relevant events are discovered with precision ranging from 87% to 98% and recall from 59% to 94%.