Deep Canonical Time Warping for simultaneous alignment and representation learning of sequences
Machine learning algorithms for the analysis of time-series often depend on the assumption that utilised data are temporally aligned. Any temporal discrepancies arising in the data are certain to lead to ill-generalisable models, which in turn fail to correctly capture properties of the task at hand. The temporal alignment of time-series is thus a crucial challenge manifesting in a multitude of applications. Nevertheless, the vast majority of algorithms oriented towards temporal alignment are either applied directly on the observation space or simply utilise linear projections, thus failing to capture complex, hierarchical non-linear representations that may prove beneficial, especially when dealing with multi-modal data (e.g., visual and acoustic information). To this end, we present Deep Canonical Time Warping (DCTW), a method that automatically learns non-linear representations of multiple time-series that are (i) maximally correlated in a shared subspace, and (ii) temporally aligned. Furthermore, we extend DCTW to a supervised setting, where during training, available labels can be utilised towards enhancing the alignment process. By means of experiments on four datasets, we show that the representations learnt significantly outperform state-of-the-art methods in temporal alignment, elegantly handling scenarios with heterogeneous feature sets, such as the temporal alignment of acoustic and visual information.
A Convolutional-Attentional Neural Framework for Structure-Aware Performance-Score Synchronization
Performance-score synchronization is an integral task in signal processing,
which entails generating an accurate mapping between an audio recording of a
performance and the corresponding musical score. Traditional synchronization
methods compute alignment using knowledge-driven and stochastic approaches, and
are typically unable to generalize well to different domains and modalities. We
present a novel data-driven method for structure-aware performance-score
synchronization. We propose a convolutional-attentional architecture trained
with a custom loss based on time-series divergence. We conduct experiments for
the audio-to-MIDI and audio-to-image alignment tasks pertaining to different
score modalities. We validate the effectiveness of our method via ablation
studies and comparisons with state-of-the-art alignment approaches. We
demonstrate that our approach outperforms previous synchronization methods for
a variety of test settings across score modalities and acoustic conditions. Our
method is also robust to structural differences between the performance and
score sequences, which is a common limitation of standard alignment approaches.
Comment: Published in IEEE Signal Processing Letters, Volume 29, December 202
TimewarpVAE: Simultaneous Time-Warping and Representation Learning of Trajectories
Human demonstrations of trajectories are an important source of training data
for many machine learning problems. However, the difficulty of collecting human
demonstration data for complex tasks makes learning efficient representations
of those trajectories challenging. For many problems, such as for handwriting
or for quasistatic dexterous manipulation, the exact timings of the
trajectories should be factored from their spatial path characteristics. In
this work, we propose TimewarpVAE, a fully differentiable manifold-learning
algorithm that incorporates Dynamic Time Warping (DTW) to simultaneously learn
both timing variations and latent factors of spatial variation. We show how the
TimewarpVAE algorithm learns appropriate time alignments and meaningful
representations of spatial variations in small handwriting and fork
manipulation datasets. Our results have lower spatial reconstruction test error
than baseline approaches and the learned low-dimensional representations can be
used to efficiently generate semantically meaningful novel trajectories.
Comment: 17 pages, 12 figures
Robust correlated and individual component analysis
Recovering correlated and individual components of two, possibly temporally misaligned, sets of data is a fundamental task in disciplines such as image, vision, and behavior computing, with application to problems such as multi-modal fusion (via correlated components), predictive analysis, and clustering (via the individual ones). Here, we study the extraction of correlated and individual components under real-world conditions, namely i) the presence of gross non-Gaussian noise and ii) temporally misaligned data. In this light, we propose a method for the Robust Correlated and Individual Component Analysis (RCICA) of two sets of data in the presence of gross, sparse errors. We furthermore extend RCICA in order to handle temporal incongruities arising in the data. To this end, two suitable optimization problems are solved. The generality of the proposed methods is demonstrated by applying them to 4 applications, namely i) heterogeneous face recognition, ii) multi-modal feature fusion for human behavior analysis (i.e., audio-visual prediction of interest and conflict), iii) face clustering, and iv) the temporal alignment of facial expressions. Experimental results on 2 synthetic and 7 real world datasets indicate the robustness and effectiveness of the proposed methods on these application domains, outperforming other state-of-the-art methods in the field.
Non-Parallel Articulatory-to-Acoustic Conversion Using Multiview-based Time Warping
This work was supported in part by the Spanish State Research Agency (SRA) grant number PID2019-108040RB-C22/SRA/10.13039/501100011033, and the FEDER/Junta de Andalucía-Consejería de Transformación Económica, Industria, Conocimiento y Universidades project no. B-SEJ-570-UGR20.
In this paper, we propose a novel algorithm called multiview temporal alignment by dependence maximisation in the latent space (TRANSIENCE) for the alignment of time series consisting of sequences of feature vectors whose lengths and feature dimensionalities may differ. The proposed algorithm, which is based on the theory of multiview learning, can be seen as an extension of the well-known dynamic time warping (DTW) algorithm but, as mentioned, it allows the sequences to have different dimensionalities. Our algorithm attempts to find an optimal temporal alignment between pairs of nonaligned sequences by first projecting their feature vectors into a common latent space where both views are maximally similar. To do this, powerful, nonlinear deep neural network (DNN) models are employed. Then, the resulting sequences of embedding vectors are aligned using DTW. Finally, the alignment paths obtained in the previous step are applied to the original sequences to align them. In the paper, we explore several variants of the algorithm that mainly differ in the way the DNNs are trained. We evaluated the proposed algorithm on an articulatory-to-acoustic (A2A) synthesis task involving the generation of audible speech from motion data captured from the lips and tongue of healthy speakers using a technique known as permanent magnet articulography (PMA). In this task, our algorithm is applied during the training stage to align pairs of nonaligned speech and PMA recordings that are later used to train DNNs able to synthesise speech from PMA data. Our results show the quality of speech generated in the nonaligned scenario is comparable to that obtained in the parallel scenario.
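The three steps described in this abstract (project both views into a common space, align the embeddings with DTW, then apply the alignment path to the original sequences) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `dtw_path` and `align_via_embeddings` are hypothetical, and the fixed random linear projections below merely stand in for the trained DNN encoders, which the abstract does not specify.

```python
import numpy as np


def dtw_path(a, b):
    """Classic DTW between two sequences of equal-dimensional vectors.

    Returns the accumulated cost and the warping path as (i, j) index pairs.
    """
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between every frame of a and every frame of b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the end of both sequences to recover the path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return float(acc[n, m]), path


def align_via_embeddings(x, y, embed_x, embed_y):
    """Align x and y (possibly different dimensionalities) through a shared space.

    embed_x and embed_y are assumed stand-ins for trained encoders that map
    each view into a common latent space of equal dimension.
    """
    _, path = dtw_path(embed_x(x), embed_y(y))
    ix, iy = zip(*path)
    # Apply the path found in the latent space to the ORIGINAL sequences.
    return x[list(ix)], y[list(iy)], path


# Toy data: 5 frames of 3-D features versus 7 frames of 2-D features.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
y = rng.normal(size=(7, 2))
# Hypothetical placeholder encoders: fixed random linear maps into a 4-D space.
proj_x = rng.normal(size=(3, 4))
proj_y = rng.normal(size=(2, 4))
x_aligned, y_aligned, path = align_via_embeddings(
    x, y, lambda s: s @ proj_x, lambda s: s @ proj_y
)
```

After alignment both sequences have the same number of frames (the length of the warping path) while keeping their original feature dimensionalities, which is what allows them to be paired for subsequent training.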