4,586 research outputs found
The Zero Resource Speech Challenge 2017
We describe a new challenge aimed at discovering subword and word units from
raw speech. This challenge is the followup to the Zero Resource Speech
Challenge 2015. It aims at constructing systems that generalize across
languages and adapt to new speakers. The design features and evaluation metrics
of the challenge are presented and the results of seventeen models are
discussed.Comment: IEEE ASRU (Automatic Speech Recognition and Understanding) 2017.
Okinawa, Japa
A sticky HDP-HMM with application to speaker diarization
We consider the problem of speaker diarization, the problem of segmenting an
audio recording of a meeting into temporal segments corresponding to individual
speakers. The problem is rendered particularly difficult by the fact that we
are not allowed to assume knowledge of the number of people participating in
the meeting. To address this problem, we take a Bayesian nonparametric approach
to speaker diarization that builds on the hierarchical Dirichlet process hidden
Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006)
1566--1581]. Although the basic HDP-HMM tends to over-segment the audio
data---creating redundant states and rapidly switching among them---we describe
an augmented HDP-HMM that provides effective control over the switching rate.
We also show that this augmentation makes it possible to treat emission
distributions nonparametrically. To scale the resulting architecture to
realistic diarization problems, we develop a sampling algorithm that employs a
truncated approximation of the Dirichlet process to jointly resample the full
state sequence, greatly improving mixing rates. Working with a benchmark NIST
data set, we show that our Bayesian nonparametric architecture yields
state-of-the-art speaker diarization results.Comment: Published in at http://dx.doi.org/10.1214/10-AOAS395 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Capsule Routing for Sound Event Detection
The detection of acoustic scenes is a challenging problem in which
environmental sound events must be detected from a given audio signal. This
includes classifying the events as well as estimating their onset and offset
times. We approach this problem with a neural network architecture that uses
the recently-proposed capsule routing mechanism. A capsule is a group of
activation units representing a set of properties for an entity of interest,
and the purpose of routing is to identify part-whole relationships between
capsules. That is, a capsule in one layer is assumed to belong to a capsule in
the layer above in terms of the entity being represented. Using capsule
routing, we wish to train a network that can learn global coherence implicitly,
thereby improving generalization performance. Our proposed method is evaluated
on Task 4 of the DCASE 2017 challenge. Results show that classification
performance is state-of-the-art, achieving an F-score of 58.6%. In addition,
overfitting is reduced considerably compared to other architectures.Comment: Paper accepted for 26th European Signal Processing Conference
(EUSIPCO 2018
Singing voice resynthesis using concatenative-based techniques
Tese de Doutoramento. Engenharia Informática. Faculdade de Engenharia. Universidade do Porto. 201
Access to Audio-visual Contents, Exclusivity and Anticommons in New Media Markets
Media markets are characterized by strong peculiarities that often call for longterm exclusivity contracts between content providers and distributors or for vertical or horizontal integration. This paper analyzes and compares the economic effects of existing alternative systems of access to valuable content for new media platforms, under the lens of technological convergence and 'network neutrality'. The analysis suggests the increasing need for a coordinated regulatory framework aimed at balancing costs and benefits of the different models in order to ensure that the development of new markets and new technologies in the age of media convergence is not hindered by "anticommons" tragedies.contents, new media, anticommons, antitrust.
Multimodal video abstraction into a static document using deep learning
Abstraction is a strategy that gives the essential points of a document in a short period of time. The video abstraction approach proposed in this research is based on multi-modal video data, which comprises both audio and visual data. Segmenting the input video into scenes and obtaining a textual and visual summary for each scene are the major video abstraction procedures to summarize the video events into a static document. To recognize the shot and scene boundary from a video sequence, a hybrid features method was employed, which improves detection shot performance by selecting strong and flexible features. The most informative keyframes from each scene are then incorporated into the visual summary. A hybrid deep learning model was used for abstractive text summarization. The BBC archive provided the testing videos, which comprised BBC Learning English and BBC News. In addition, a news summary dataset was used to train a deep model. The performance of the proposed approaches was assessed using metrics like Rouge for textual summary, which achieved a 40.49% accuracy rate. While precision, recall, and F-score used for visual summary have achieved (94.9%) accuracy, which performed better than the other methods, according to the findings of the experiments
- …