3,748 research outputs found
Invariances and Data Augmentation for Supervised Music Transcription
This paper explores a variety of models for frame-based music transcription,
with an emphasis on the methods needed to reach state-of-the-art on human
recordings. The translation-invariant network discussed in this paper, which
combines a traditional filterbank with a convolutional neural network, was the
top-performing model in the 2017 MIREX Multiple Fundamental Frequency
Estimation evaluation. This class of models shares parameters in the
log-frequency domain, which exploits the frequency invariance of music to
reduce the number of model parameters and avoid overfitting to the training
data. All models in this paper were trained with supervision by labeled data
from the MusicNet dataset, augmented by random label-preserving pitch-shift
transformations.Comment: 6 page
3D Face Reconstruction by Learning from Synthetic Data
Fast and robust three-dimensional reconstruction of facial geometric
structure from a single image is a challenging task with numerous applications.
Here, we introduce a learning-based approach for reconstructing a
three-dimensional face from a single image. Recent face recovery methods rely
on accurate localization of key characteristic points. In contrast, the
proposed approach is based on a Convolutional-Neural-Network (CNN) which
extracts the face geometry directly from its image. Although such deep
architectures outperform other models in complex computer vision problems,
training them properly requires a large dataset of annotated examples. In the
case of three-dimensional faces, currently, there are no large volume data
sets, while acquiring such big-data is a tedious task. As an alternative, we
propose to generate random, yet nearly photo-realistic, facial images for which
the geometric form is known. The suggested model successfully recovers facial
shapes from real images, even for faces with extreme expressions and under
various lighting conditions.Comment: The first two authors contributed equally to this wor
Video Object Detection with an Aligned Spatial-Temporal Memory
We introduce Spatial-Temporal Memory Networks for video object detection. At
its core, a novel Spatial-Temporal Memory module (STMM) serves as the recurrent
computation unit to model long-term temporal appearance and motion dynamics.
The STMM's design enables full integration of pretrained backbone CNN weights,
which we find to be critical for accurate detection. Furthermore, in order to
tackle object motion in videos, we propose a novel MatchTrans module to align
the spatial-temporal memory from frame to frame. Our method produces
state-of-the-art results on the benchmark ImageNet VID dataset, and our
ablative studies clearly demonstrate the contribution of our different design
choices. We release our code and models at
http://fanyix.cs.ucdavis.edu/project/stmn/project.html
- …