Unsupervised Feature Learning of Human Actions as Trajectories in Pose Embedding Manifold
An unsupervised human action modeling framework can provide useful
pose-sequence representation, which can be utilized in a variety of pose
analysis applications. In this work we propose a novel temporal pose-sequence
modeling framework, which can embed the dynamics of 3D human-skeleton joints to
a continuous latent space in an efficient manner. In contrast to the end-to-end
frameworks explored by previous works, we disentangle the task of individual
pose representation learning from the task of learning actions as a trajectory
in pose embedding space. In order to realize a continuous pose embedding
manifold with improved reconstructions, we propose an unsupervised manifold
learning procedure named Encoder GAN (or EnGAN). Further, we use the pose
embeddings generated by EnGAN to model human actions using a bidirectional RNN
auto-encoder architecture, PoseRNN. We introduce a first-order gradient loss to
explicitly enforce temporal regularity in the predicted motion sequence. A
hierarchical feature fusion technique is also investigated for simultaneous
modeling of local skeleton joints along with global pose variations. We
demonstrate state-of-the-art transferability of the learned representation
against other motion embeddings learned with and without supervision for the
task of fine-grained action recognition on the SBU interaction dataset. Further, we
show the qualitative strengths of the proposed framework by visualizing
skeleton pose reconstructions and interpolations in pose-embedding space, and
low dimensional principal component projections of the reconstructed pose
trajectories.
Comment: Accepted at WACV 201
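The abstract does not give the exact form of the first-order gradient loss. The following is a minimal sketch of one plausible reading: penalize the mismatch between the frame-to-frame differences of the predicted and target pose sequences. The function names and the flattened-pose layout are assumptions made for illustration, not the authors' implementation.

```python
from typing import List

Pose = List[float]  # flattened joint coordinates for one frame (assumed layout)


def temporal_differences(seq: List[Pose]) -> List[Pose]:
    """Frame-to-frame differences: d[t] = seq[t + 1] - seq[t]."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(seq, seq[1:])]


def first_order_gradient_loss(pred: List[Pose], target: List[Pose]) -> float:
    """Mean squared error between the temporal differences of two sequences."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("sequences must have equal length of at least 2 frames")
    total, count = 0.0, 0
    for fp, ft in zip(temporal_differences(pred), temporal_differences(target)):
        for a, b in zip(fp, ft):
            total += (a - b) ** 2
            count += 1
    return total / count
```

Under this reading, a prediction that is offset from the target by a constant incurs no penalty from this term alone, so it would normally be combined with a standard reconstruction loss.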
Unsupervised Neural Machine Translation
In spite of the recent success of neural machine translation (NMT) in
standard benchmarks, the lack of large parallel corpora poses a major practical
problem for many language pairs. There have been several proposals to alleviate
this issue with, for instance, triangulation and semi-supervised learning
techniques, but they still require a strong cross-lingual signal. In this work,
we completely remove the need for parallel data and propose a novel method to
train an NMT system in a completely unsupervised manner, relying on nothing but
monolingual corpora. Our model builds upon the recent work on unsupervised
embedding mappings, and consists of a slightly modified attentional
encoder-decoder model that can be trained on monolingual corpora alone using a
combination of denoising and backtranslation. Despite the simplicity of the
approach, our system obtains 15.56 and 10.21 BLEU points in WMT 2014
French-to-English and German-to-English translation. The model can also profit
from small parallel corpora, and attains 21.81 and 15.24 points when combined
with 100,000 parallel sentences, respectively. Our implementation is released
as an open source project.
Comment: Published as a conference paper at ICLR 201
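The abstract names denoising as one of the two training signals but does not describe the noise. As a hedged illustration, the sketch below corrupts a sentence by swapping adjacent words at random, which is one common choice for denoising objectives. The function name `swap_noise` and its parameters are hypothetical.

```python
import random
from typing import List, Optional


def swap_noise(tokens: List[str],
               num_swaps: Optional[int] = None,
               seed: Optional[int] = None) -> List[str]:
    """Return a copy of `tokens` with random swaps of adjacent words.

    Assumption: the default number of swaps is len(tokens) // 2.
    """
    rng = random.Random(seed)
    noisy = list(tokens)
    if len(noisy) < 2:
        return noisy
    if num_swaps is None:
        num_swaps = len(noisy) // 2
    for _ in range(num_swaps):
        i = rng.randrange(len(noisy) - 1)  # position of the left word in the pair
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy
```

A denoising training pair would then be `(swap_noise(sentence), sentence)`: the model reads the corrupted word order and is trained to reconstruct the original.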
Effectiveness of self-supervised pre-training for speech recognition
We compare self-supervised representation learning algorithms which either
explicitly quantize the audio data or learn representations without
quantization. We find the former to be more accurate since it builds a good
vocabulary of the data through vq-wav2vec [1] to enable learning of effective
representations in subsequent BERT training. Unlike previous work, we
directly fine-tune the pre-trained BERT models on transcribed speech using a
Connectionist Temporal Classification (CTC) loss instead of feeding the
representations into a task-specific model. We also propose a BERT-style model
learning directly from the continuous audio data and compare pre-training on
raw audio to spectral features. Fine-tuning a BERT model on 10 hours of labeled
Librispeech data with a vq-wav2vec vocabulary is almost as good as the best
known reported system trained on 100 hours of labeled data on test-clean, while
achieving a 25% WER reduction on test-other. When using only 10 minutes of
labeled data, WER is 25.2 on test-other and 16.3 on test-clean. This
demonstrates that self-supervision can enable speech recognition systems
trained on a near-zero amount of transcribed data.
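The results above are reported as word error rate (WER). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the reference length. The function name is illustrative, and evaluation toolkits may apply text normalization that this sketch omits.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)
```

Multiplying the returned fraction by 100 gives a percentage on the same scale as the figures quoted in the abstract.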
A Probabilistic Semi-Supervised Approach to Multi-Task Human Activity Modeling
Human behavior is a continuous stochastic spatio-temporal process which is
governed by semantic actions and affordances as well as latent factors.
Therefore, video-based human activity modeling is concerned with a number of
tasks such as inferring current and future semantic labels, predicting future
continuous observations as well as imagining possible future label and feature
sequences. In this paper we present a semi-supervised probabilistic deep latent
variable model that can represent both discrete labels and continuous
observations as well as latent dynamics over time. This allows the model to
solve several tasks at once without explicit fine-tuning. We focus here on the
tasks of action classification, detection, prediction and anticipation as well
as motion prediction and synthesis based on 3D human activity data recorded
with Kinect. We further extend the model to capture hierarchical label
structure and to model the dependencies between multiple entities, such as a
human and objects. Our experiments demonstrate that our principled approach to
human activity modeling can be used to detect current and anticipate future
semantic labels and to predict and synthesize future label and feature
sequences. When comparing our model to state-of-the-art approaches, which are
specifically designed for e.g. action classification, we find that our
probabilistic formulation outperforms or is comparable to these task specific
models.
A Structured Variational Autoencoder for Contextual Morphological Inflection
Statistical morphological inflectors are typically trained on fully
supervised, type-level data. One remaining open research question is the
following: How can we effectively exploit raw, token-level data to improve
their performance? To this end, we introduce a novel generative latent-variable
model for the semi-supervised learning of inflection generation. To enable
posterior inference over the latent variables, we derive an efficient
variational inference procedure based on the wake-sleep algorithm. We
experiment on 23 languages, using the Universal Dependencies corpora in a
simulated low-resource setting, and find improvements of over 10% absolute
accuracy in some cases.
Comment: Published at ACL 201
Introspective Generative Modeling: Decide Discriminatively
We study unsupervised learning by developing introspective generative
modeling (IGM) that attains a generator using progressively learned deep
convolutional neural networks. The generator is itself a discriminator, capable
of introspection: being able to self-evaluate the difference between its
generated samples and the given training data. When followed by repeated
discriminative learning, desirable properties of modern discriminative
classifiers are directly inherited by the generator. IGM learns a cascade of
CNN classifiers using a synthesis-by-classification algorithm. In the
experiments, we observe encouraging results on a number of applications
including texture modeling, artistic style transferring, face modeling, and
semi-supervised learning.
Comment: 10 pages, 9 figures
Multi-Stream Dynamic Video Summarization
With vast amounts of video content being uploaded to the Internet every
minute, video summarization becomes critical for efficient browsing, searching,
and indexing of visual content. Nonetheless, the spread of social and
egocentric cameras creates an abundance of sparse scenarios captured by several
devices that ultimately need to be jointly summarized. In this paper, we
discuss the problem of summarizing videos recorded simultaneously by several
dynamic cameras that intermittently share the field of view. We present a
robust framework that (a) identifies a diverse set of important events among
moving cameras that often are not capturing the same scene, and (b) selects the
most representative view(s) at each event to be included in a universal
summary. Due to the lack of an applicable alternative, we collected a new
multi-view egocentric dataset, Multi-Ego. Our dataset is recorded
simultaneously by three cameras, covering a wide variety of real-life
scenarios. The footage is annotated by multiple individuals under various
summarization configurations, with a consensus analysis ensuring a reliable
ground truth. We conduct extensive experiments on the compiled dataset in
addition to three other standard benchmarks that show the robustness and the
advantage of our approach in both supervised and unsupervised settings.
Additionally, we show that our approach learns collectively from data with a varied
number of views and is orthogonal to other summarization methods, making it
scalable and generic. Our materials are made publicly available.
A Call for More Rigor in Unsupervised Cross-lingual Learning
We review motivations, definition, approaches, and methodology for
unsupervised cross-lingual learning and call for a more rigorous position in
each of them. An existing rationale for such research is based on the lack of
parallel data for many of the world's languages. However, we argue that a
scenario without any parallel data and abundant monolingual data is unrealistic
in practice. We also discuss different training signals that have been used in
previous work, which depart from the pure unsupervised setting. We then
describe common methodological issues in tuning and evaluation of unsupervised
cross-lingual models and present best practices. Finally, we provide a unified
outlook for different types of research in this area (i.e., cross-lingual word
embeddings, deep multilingual pretraining, and unsupervised machine
translation) and argue for comparable evaluation of these models.
Comment: ACL 202
Supervised and Semi-Supervised Deep Neural Networks for CSI-Based Authentication
From the viewpoint of physical-layer authentication, spoofing attacks can be
foiled by checking channel state information (CSI). Existing CSI-based
authentication algorithms mostly require a deep knowledge of the channel to
deliver decent performance. In this paper, we investigate CSI-based
authenticators that can spare the effort to predetermine channel properties by
utilizing deep neural networks (DNNs). We first propose a convolutional neural
network (CNN)-enabled authenticator that is able to extract the local features
in CSI. Next, we employ the recurrent neural network (RNN) to capture the
dependencies between different frequencies in CSI. In addition, we propose to
use the convolutional recurrent neural network (CRNN)---a combination of the
CNN and the RNN---to learn local and contextual information in CSI for user
authentication. To effectively train these DNNs, one needs a large amount of
labeled channel records. However, it is often expensive to label large channel
observations in the presence of a spoofer. In view of this, we further study a
case in which only a small part of the channel observations are labeled. To
handle it, we extend these DNN-enabled approaches into semi-supervised ones.
This extension is based on a semi-supervised learning technique that employs
both the labeled and unlabeled data to train a DNN. To be specific, our
semi-supervised method begins by generating pseudo labels for the unlabeled
channel samples through implementing the K-means algorithm in a semi-supervised
manner. Subsequently, both the labeled and pseudo labeled data are exploited to
pre-train a DNN, which is then fine-tuned based on the labeled channel records.
Comment: This paper has been submitted for possible publication
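The abstract says pseudo labels come from running K-means "in a semi-supervised manner" without further detail. One plausible reading, sketched below as an assumption, is to seed one centroid per class from the labeled samples and keep those samples fixed to their class while the unlabeled samples are reassigned. All names are hypothetical, and plain feature vectors stand in for channel records.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def _mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pseudo_label(labeled: Dict[int, List[Vector]],
                 unlabeled: List[Vector],
                 iterations: int = 10) -> List[int]:
    """Assign a pseudo label to each unlabeled vector via seeded K-means.

    Assumption: every class has at least one labeled sample and
    iterations >= 1.
    """
    # Seed each centroid with the mean of that class's labeled samples.
    centroids = {label: _mean(points) for label, points in labeled.items()}
    assignments: List[int] = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for every unlabeled sample.
        assignments = [min(centroids, key=lambda c: _sq_dist(x, centroids[c]))
                       for x in unlabeled]
        # Update step: labeled samples always stay in their own class.
        for label, points in labeled.items():
            members = list(points) + [x for x, a in zip(unlabeled, assignments)
                                      if a == label]
            centroids[label] = _mean(members)
    return assignments
```

The returned pseudo labels, together with the true labels, would form the pre-training set, after which the network is fine-tuned on the labeled records only, as the abstract describes.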
Toward Controlled Generation of Text
Generic generation and manipulation of text is challenging and has limited
success compared to recent deep generative modeling in the visual domain. This
paper aims at generating plausible natural language sentences, whose attributes
are dynamically controlled by learning disentangled latent representations with
designated semantics. We propose a new neural generative model which combines
variational auto-encoders and holistic attribute discriminators for effective
imposition of semantic structures. With differentiable approximation to
discrete text samples, explicit constraints on independent attribute controls,
and efficient collaborative learning of generator and discriminators, our model
learns highly interpretable representations even from only word annotations,
and produces realistic sentences with desired attributes. Quantitative
evaluation validates the accuracy of sentence and attribute generation.
Comment: Code adapted for text style transfer is released at:
https://github.com/asyml/texar/tree/master/examples/text_style_transfe