10,319 research outputs found
Substructure and Boundary Modeling for Continuous Action Recognition
This paper introduces a probabilistic graphical model for continuous action
recognition with two novel components: substructure transition model and
discriminative boundary model. The first component encodes the sparse and
global temporal transition prior between action primitives in state-space model
to handle the large spatial-temporal variations within an action class. The
second component enforces the action duration constraint in a discriminative
way to locate the transition boundaries between actions more accurately. The
two components are integrated into a unified graphical structure to enable
effective training and inference. Our comprehensive experimental results on
both public and in-house datasets show that, with the capability to incorporate
additional information that had not been explicitly or efficiently modeled by
previous methods, our proposed algorithm achieved significantly improved
performance for continuous action recognition.Comment: Detailed version of the CVPR 2012 paper. 15 pages, 6 figure
Learning the dynamics and time-recursive boundary detection of deformable objects
We propose a principled framework for recursively segmenting deformable objects across a sequence
of frames. We demonstrate the usefulness of this method on left ventricular segmentation across a cardiac
cycle. The approach involves a technique for learning the system dynamics together with methods of
particle-based smoothing as well as non-parametric belief propagation on a loopy graphical model capturing
the temporal periodicity of the heart. The dynamic system state is a low-dimensional representation
of the boundary, and the boundary estimation involves incorporating curve evolution into recursive state
estimation. By formulating the problem as one of state estimation, the segmentation at each particular
time is based not only on the data observed at that instant, but also on predictions based on past and future
boundary estimates. Although the paper focuses on left ventricle segmentation, the method generalizes
to temporally segmenting any deformable object
Listening to features
This work explores nonparametric methods which aim at synthesizing audio from
low-dimensionnal acoustic features typically used in MIR frameworks. Several
issues prevent this task to be straightforwardly achieved. Such features are
designed for analysis and not for synthesis, thus favoring high-level
description over easily inverted acoustic representation. Whereas some previous
studies already considered the problem of synthesizing audio from features such
as Mel-Frequency Cepstral Coefficients, they mainly relied on the explicit
formula used to compute those features in order to inverse them. Here, we
instead adopt a simple blind approach, where arbitrary sets of features can be
used during synthesis and where reconstruction is exemplar-based. After testing
the approach on a speech synthesis from well known features problem, we apply
it to the more complex task of inverting songs from the Million Song Dataset.
What makes this task harder is twofold. First, that features are irregularly
spaced in the temporal domain according to an onset-based segmentation. Second
the exact method used to compute these features is unknown, although the
features for new audio can be computed using their API as a black-box. In this
paper, we detail these difficulties and present a framework to nonetheless
attempting such synthesis by concatenating audio samples from a training
dataset, whose features have been computed beforehand. Samples are selected at
the segment level, in the feature space with a simple nearest neighbor search.
Additionnal constraints can then be defined to enhance the synthesis
pertinence. Preliminary experiments are presented using RWC and GTZAN audio
datasets to synthesize tracks from the Million Song Dataset.Comment: Technical Repor
ImageSpirit: Verbal Guided Image Parsing
Humans describe images in terms of nouns and adjectives while algorithms
operate on images represented as sets of pixels. Bridging this gap between how
humans would like to access images versus their typical representation is the
goal of image parsing, which involves assigning object and attribute labels to
pixel. In this paper we propose treating nouns as object labels and adjectives
as visual attribute labels. This allows us to formulate the image parsing
problem as one of jointly estimating per-pixel object and attribute labels from
a set of training images. We propose an efficient (interactive time) solution.
Using the extracted labels as handles, our system empowers a user to verbally
refine the results. This enables hands-free parsing of an image into pixel-wise
object/attribute labels that correspond to human semantics. Verbally selecting
objects of interests enables a novel and natural interaction modality that can
possibly be used to interact with new generation devices (e.g. smart phones,
Google Glass, living room devices). We demonstrate our system on a large number
of real-world images with varying complexity. To help understand the tradeoffs
compared to traditional mouse based interactions, results are reported for both
a large scale quantitative evaluation and a user study.Comment: http://mmcheng.net/imagespirit
- …