Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations
The increasing accuracy of automatic chord estimation systems, the
availability of vast amounts of heterogeneous reference annotations, and
insights from annotator subjectivity research make chord label personalization
increasingly important. Nevertheless, automatic chord estimation systems have
historically been trained and evaluated exclusively on a single reference
annotation. We introduce a first approach to automatic chord label
personalization by modeling subjectivity through deep learning of a harmonic
interval-based chord label representation. After integrating these
representations from multiple annotators, we can accurately personalize chord
labels for individual annotators from a single model and the annotators' chord
label vocabulary. Furthermore, we show that chord personalization using
multiple reference annotations outperforms using a single reference annotation.

Comment: Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May 2017 (arXiv:1706.08675v1 [cs.NE])
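As an illustration of the kind of interval-based encoding the abstract describes, here is a minimal, hypothetical Python sketch that maps a chord quality to a 12-dimensional binary vector of semitone intervals above the root. The chord dictionary and function name are assumptions for illustration only; the paper's actual representation may differ.

```python
# Hypothetical sketch of a harmonic interval-based chord encoding.
# Chord qualities and their interval sets are illustrative, not the
# paper's exact vocabulary.
CHORD_INTERVALS = {
    "maj":  [0, 4, 7],        # root, major third, perfect fifth
    "min":  [0, 3, 7],        # root, minor third, perfect fifth
    "maj7": [0, 4, 7, 11],
    "min7": [0, 3, 7, 10],
    "7":    [0, 4, 7, 10],    # dominant seventh
}

def interval_vector(quality: str) -> list:
    """Return a 12-dimensional binary vector marking which intervals
    (in semitones above the root) are present in the chord."""
    vec = [0] * 12
    for semitones in CHORD_INTERVALS[quality]:
        vec[semitones] = 1
    return vec
```

Such a vector makes labels from different annotator vocabularies comparable, since chords that share intervals share active dimensions.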
When users generate music playlists: When words leave off, music begins?
Music systems that generate playlists are gaining increasing popularity, yet how to select songs that users will find acceptable remains elusive. We present the results of an exploratory study that focused on the language of musically untrained end users for playlist choices, in a variety of listening contexts. Our results indicate a number of opportunities for playlist recommendation and retrieval systems, particularly by taking context into account.
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on
multi-scale and multi-modal deep learning. Each visual modality captures
spatial information at a particular spatial scale (such as motion of the upper
body or a hand), and the whole system operates at three temporal scales. Key to
our technique is a training strategy which exploits: i) careful initialization
of individual modalities; and ii) gradual fusion involving random dropping of
separate channels (dubbed ModDrop) for learning cross-modality correlations
while preserving uniqueness of each modality-specific representation. We
present experiments on the ChaLearn 2014 Looking at People Challenge gesture
recognition track, in which we placed first out of 17 teams. Fusing multiple
modalities at several spatial and temporal scales leads to a significant
increase in recognition rates, allowing the model to compensate for errors of
the individual classifiers as well as noise in the separate channels.
Furthermore, the proposed ModDrop training technique ensures robustness of the
classifier to missing signals in one or several channels to produce meaningful
predictions from any number of available modalities. In addition, we
demonstrate the applicability of the proposed fusion scheme to modalities of
arbitrary nature by experiments on the same dataset augmented with audio.

Comment: 14 pages, 7 figures
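The core idea of randomly dropping whole modality channels during fusion training can be sketched as follows. This is a simplified illustration, not the authors' implementation; the function name, drop probability, and dropout handling are assumptions, and the paper's careful per-modality pretraining and gradual fusion are not shown.

```python
import numpy as np

def moddrop(modalities, drop_prob=0.5, rng=None):
    """Zero out each modality's feature channel independently with
    probability drop_prob during training, so the fused network learns
    cross-modality correlations while remaining able to predict from
    any subset of available modalities (a sketch of the ModDrop idea)."""
    rng = rng if rng is not None else np.random.default_rng()
    return [m * float(rng.random() >= drop_prob) for m in modalities]
```

At test time the dropout is disabled; a modality that is genuinely missing then simply arrives as a zeroed channel, which the network has already seen during training.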
Fast and Accurate OOV Decoder on High-Level Features
This work proposes a novel approach to out-of-vocabulary (OOV) keyword search
(KWS) task. The proposed approach is based on using high-level features from an
automatic speech recognition (ASR) system, so-called phoneme posterior based
(PPB) features, for decoding. These features are obtained by calculating
time-dependent phoneme posterior probabilities from word lattices, followed by
their smoothing. For the PPB features we developed a novel, fast, simple,
and efficient OOV decoder. Experimental results are presented on the
Georgian language from the IARPA Babel Program, which was the test language in
the OpenKWS 2016 evaluation campaign. The results show that in terms of maximum
term weighted value (MTWV) metric and computational speed, for single ASR
systems, the proposed approach significantly outperforms the state-of-the-art
approach based on using in-vocabulary proxies for OOV keywords in the indexed
database. The comparison of the two OOV KWS approaches on the fusion results of
the nine different ASR systems demonstrates that the proposed OOV decoder
outperforms the proxy-based approach in terms of MTWV metric given the
comparable processing speed. Other important advantages of the OOV decoder
include extremely low memory consumption and simplicity of its implementation
and parameter optimization.

Comment: Interspeech 2017, August 2017, Stockholm, Sweden. 201
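The smoothing step applied to the time-dependent phoneme posteriors might look something like the following moving-average sketch. The window size, shapes, and renormalization are illustrative assumptions; the paper's exact smoothing scheme is not reproduced here.

```python
import numpy as np

def smooth_posteriors(ppb, window=5):
    """Smooth a (num_frames, num_phonemes) matrix of phoneme posterior
    probabilities along the time axis with a simple moving average.
    Illustrative stand-in for the smoothing of the PPB features."""
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, ppb)
    # Renormalize each frame so the posteriors still sum to one.
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```

Smoothing suppresses frame-level jitter in the lattice-derived posteriors, which makes the subsequent OOV keyword decoding over these features more stable.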
Embedding-Based Speaker Adaptive Training of Deep Neural Networks
An embedding-based speaker adaptive training (SAT) approach is proposed and
investigated in this paper for deep neural network acoustic modeling. In this
approach, speaker embedding vectors, which are constant for a given speaker,
are mapped through a control network to layer-dependent element-wise
affine transformations to canonicalize the internal feature representations at
the output of hidden layers of a main network. The control network for
generating the speaker-dependent mappings is jointly estimated with the main
network for the overall speaker adaptive acoustic modeling. Experiments on
large vocabulary continuous speech recognition (LVCSR) tasks show that the
proposed SAT scheme can yield superior performance over the widely-used
speaker-aware training using i-vectors with speaker-adapted input features.
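The layer-wise adaptation described above can be sketched as follows. This is a minimal illustration with hypothetical shapes and weight names; in the paper the control network is jointly trained with the main acoustic model, which is not shown here.

```python
import numpy as np

def control_net(embedding, W_scale, W_shift):
    """Map a fixed speaker embedding to element-wise affine parameters
    (scale, shift) for one hidden layer. The scale is centered around
    the identity so a zero mapping leaves activations unchanged. The
    weight matrices are hypothetical placeholders for the jointly
    trained control network."""
    scale = 1.0 + W_scale @ embedding
    shift = W_shift @ embedding
    return scale, shift

def adapt_hidden(h, scale, shift):
    """Apply the speaker-dependent element-wise affine transform to a
    hidden-layer activation, canonicalizing it across speakers."""
    return scale * h + shift
```

Because the embedding is constant per speaker, the affine parameters can be precomputed once per speaker and reused for every utterance at test time.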
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state of the art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed.
Multimedia and e-Learning integration for supporting training programs in agriculture by MOODLE
The NODES project aims to facilitate, for adult and lifelong training, the
use of multimedia knowledge to improve the competitiveness, employability,
and mobility of adults with physical and sensorial disabilities and of
adults affected by the digital divide or by some of its components, such as
distance, initial level of knowledge, language, or the use of complex
technologies. The NODES project focuses, in a broad sense, on the production
and diffusion of knowledge created throughout Europe within public and
private organizations dedicated to adult training, or by individuals. Within
the project, the MOODLE e-Learning system was selected, and further
multimedia content will be integrated into the knowledge base. The EU-Index
metadatabase collects content sources for the project partners. Another
target is to integrate video files into the systems. These parts are
integrated through the logical and physical architectures of the NODES
project.