Autoregressive Kernels For Time Series
We propose in this work a new family of kernels for variable-length time
series. Our work builds upon the vector autoregressive (VAR) model for
multivariate stochastic processes: given a multivariate time series x, we
consider the likelihood function p_{\theta}(x) of different parameters \theta
in the VAR model as features to describe x. To compare two time series x and
x', we form the product of their features p_{\theta}(x) p_{\theta}(x'), which is
integrated out w.r.t. \theta using a matrix normal-inverse Wishart prior. Among
other properties, this kernel can be easily computed when the dimension d of
the time series is much larger than the lengths of the considered time series x
and x'. It can also be generalized to time series taking values in arbitrary
state spaces, as long as the state space itself is endowed with a kernel
\kappa. In that case, the kernel between x and x' is a function of the Gram
matrices produced by \kappa on observations and subsequences of observations
enumerated in x and x'. We describe a computationally efficient implementation
of this generalization that uses low-rank matrix factorization techniques.
These kernels are compared to other known kernels using a set of benchmark
classification tasks carried out with support vector machines.
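To make "integrating out the product of likelihoods" concrete, here is a minimal sketch in a deliberately simplified setting: a univariate AR(1) model x_t = \theta x_{t-1} + noise with known noise variance sigma2 and a Gaussian prior \theta ~ N(0, tau2), instead of the VAR model with a matrix normal-inverse Wishart prior used in the work. Because both likelihoods share \theta, their product is the likelihood of a single stacked Bayesian linear regression, and the integral becomes a Gaussian marginal likelihood in closed form. The function name and the parameters sigma2 and tau2 are illustrative assumptions, and each likelihood is conditioned on the series' first observation.

```python
import numpy as np


def ar1_log_kernel(x, xp, sigma2=1.0, tau2=1.0):
    """log of the integral of p_theta(x) p_theta(x') N(theta; 0, tau2) d theta.

    Simplified illustration (assumption): univariate AR(1) with known noise
    variance sigma2 and a Gaussian prior on theta, not the matrix
    normal-inverse Wishart prior of the original VAR kernel.
    """
    x = np.asarray(x, dtype=float)
    xp = np.asarray(xp, dtype=float)
    # Both series share theta, so stack their targets and lagged regressors.
    y = np.concatenate([x[1:], xp[1:]])
    z = np.concatenate([x[:-1], xp[:-1]])
    # Marginalizing theta gives y ~ N(0, sigma2 * I + tau2 * z z^T);
    # z z^T is the linear-kernel Gram matrix of the lagged observations.
    cov = sigma2 * np.eye(len(y)) + tau2 * np.outer(z, z)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + quad)


x = np.array([0.1, 0.5, 0.3, -0.2])
xp = np.array([1.0, 0.8, 0.9])
print(np.exp(ar1_log_kernel(x, xp)))
```

The closed form avoids any numerical integration over \theta; the series lengths may differ, since only the stacked vectors y and z matter.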
Automated Classification of Vowel Category and Speaker Type in the High-Frequency Spectrum
The high-frequency region of vowel signals (above the third formant, F3) has received little research attention. Recent evidence, however, has documented the perceptual utility of high-frequency information in the speech signal above the traditional frequency bandwidth known to contain important cues for speech and speaker recognition. The purpose of this study was to determine whether high-pass filtered vowels could be separated by vowel category and speaker type in a supervised learning framework. Mel-frequency cepstral coefficients (MFCCs) were extracted from productions of six vowel categories produced by two male, two female, and two child speakers. Results revealed that the filtered vowels were well separated by vowel category and speaker type using MFCCs from the high-frequency spectrum. This demonstrates that the high-frequency region carries information useful for automated classification, and this is the first study to report findings of this nature in a supervised learning framework.
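The feature-extraction step described above, MFCCs computed only from the high-frequency spectrum, can be sketched with NumPy and SciPy as follows. This is a generic MFCC implementation, not the study's pipeline: it assumes that high-pass filtering can be approximated by placing the mel filterbank only above a cutoff, and the 3500 Hz cutoff, frame length, hop size, and filter and coefficient counts are hypothetical choices. The function names are illustrative.

```python
import numpy as np
from scipy.fft import dct


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr, fmin, fmax):
    """Triangular filters equally spaced on the mel scale between fmin and fmax (Hz)."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            fb[i - 1, k] = (right - k) / (right - center)
    return fb


def highband_mfcc(signal, sr, cutoff_hz=3500.0, n_fft=512, hop=160, n_filters=20, n_ceps=13):
    """MFCCs using only mel filters above cutoff_hz (a stand-in for high-pass filtering)."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)             # Hamming-windowed frames
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    fb = mel_filterbank(n_filters, n_fft, sr, cutoff_hz, sr / 2)
    log_energies = np.log(power @ fb.T + 1e-10)          # log mel-band energies
    return dct(log_energies, type=2, axis=1, norm="ortho")[:, :n_ceps]


sr = 16000
t = np.arange(int(0.5 * sr)) / sr
rng = np.random.default_rng(0)
# Synthetic stand-in for a vowel recording: a 4 kHz tone plus a little noise.
signal = np.sin(2 * np.pi * 4000 * t) + 0.01 * rng.standard_normal(t.size)
feats = highband_mfcc(signal, sr)
print(feats.shape)  # (47, 13)
```

The per-frame coefficient vectors (or their averages per token) would then be fed to a supervised classifier; which classifier the study used is not stated here.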
Human Interaction Recognition with Audio and Visual Cues
The automated recognition of human activities from video is a fundamental problem with applications in several areas, ranging from video surveillance and robotics to smart healthcare and multimedia indexing and retrieval, to mention just a few. The pervasive diffusion of cameras capable of recording audio also makes a complementary modality available to those applications. Despite the sizable progress made in modeling and recognizing group activities, and actions performed by people in isolation, from video, audio cues have rarely been leveraged. This is even more so for modeling and recognizing binary interactions between humans, where the use of video has also been limited.
This thesis introduces a modeling framework for binary human interactions based on audio and visual cues. The main idea is to describe an interaction with a spatio-temporal trajectory modeling the visual motion cues, and a temporal trajectory modeling the audio cues. This poses the problem of how to fuse temporal trajectories from multiple modalities for the purpose of recognition. We propose a solution whereby trajectories are modeled as the output of kernel state space models. We then develop kernel-based methods for audio-visual fusion that act at the feature level, as well as at the kernel level, by exploiting multiple kernel learning techniques. The approaches have been extensively tested and evaluated on a dataset of videos obtained from TV shows and Hollywood movies, containing five different interactions. The results show the promise of this approach, with a significant improvement in recognition rate when audio cues are exploited, setting the state of the art in this particular application.
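Kernel-level fusion means combining an audio Gram matrix and a visual Gram matrix into one kernel before training the classifier. The sketch below uses a much simpler stand-in for multiple kernel learning: a convex combination of the two matrices whose weight is picked by kernel-target alignment, a known heuristic. It is not the thesis's method, and the function names, the weight grid, and the toy kernels are hypothetical.

```python
import numpy as np


def alignment(K, y):
    """Kernel-target alignment <K, yy^T>_F / (||K||_F ||yy^T||_F) for labels in {-1, +1}."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))


def fuse_kernels(K_audio, K_visual, y, grid=np.linspace(0.0, 1.0, 11)):
    """Choose beta so that beta * K_audio + (1 - beta) * K_visual best aligns with y.

    A convex combination of positive semidefinite Gram matrices is again
    positive semidefinite, so the fused matrix is still a valid kernel.
    """
    scores = [alignment(b * K_audio + (1 - b) * K_visual, y) for b in grid]
    best = grid[int(np.argmax(scores))]
    return best, best * K_audio + (1 - best) * K_visual


rng = np.random.default_rng(0)
y = np.array([1, 1, 1, -1, -1, -1])
F_audio = rng.standard_normal((6, 4))
K_audio = F_audio @ F_audio.T              # uninformative toy audio features
K_visual = np.outer(y, y).astype(float)    # idealized, perfectly informative visual kernel
beta, K = fuse_kernels(K_audio, K_visual, y)
print(beta)  # 0.0
```

In this toy case the visual kernel already matches the labels exactly, so the search puts all weight on it; with real, partially informative modalities the selected weight would lie strictly between the extremes.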
Graph Kernels via Functional Embedding
We propose a representation of a graph as a functional object derived from the
power iteration of the underlying adjacency matrix. The proposed functional
representation is a graph invariant, i.e., the functional remains unchanged
under any reordering of the vertices. This property eliminates the difficulty
of handling exponentially many isomorphic forms. A Bhattacharyya kernel
constructed between these functionals significantly outperforms the
state-of-the-art graph kernels on 3 out of the 4 standard benchmark graph
classification datasets, demonstrating the superiority of our approach. The
proposed methodology is simple and runs in time linear in the number of edges,
which makes our kernel more efficient and scalable than many widely
adopted graph kernels whose running time is cubic in the number of vertices.
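The following is a simplified illustration of why a power-iteration representation is a graph invariant and why each iteration costs time linear in the number of edges. Assumption: the "functional" is approximated here by histograms of the entries of the normalized iterates A^k 1 / ||A^k 1||, started from the all-ones vector (which no vertex reordering changes), and two graphs are compared with the Bhattacharyya coefficient between the resulting discrete distributions. This is a stand-in, not the paper's exact construction, and the function names, iteration count, and bin count are hypothetical.

```python
import numpy as np


def power_iteration_features(n, edges, n_iter=5, bins=10):
    """Histogram summary of normalized power-iteration vectors (sums to 1)."""
    e = np.asarray(edges, dtype=int)
    src = np.concatenate([e[:, 0], e[:, 1]])   # undirected: use both directions
    dst = np.concatenate([e[:, 1], e[:, 0]])
    x = np.ones(n)  # the all-ones start vector is unchanged by vertex relabeling
    hists = []
    for _ in range(n_iter):
        y = np.zeros(n)
        np.add.at(y, src, x[dst])              # y = A x in O(|E|) time
        norm = np.linalg.norm(y)
        x = y / norm if norm > 0 else y
        # A histogram of entries ignores vertex order, hence graph invariance.
        h, _ = np.histogram(x, bins=bins, range=(0.0, 1.0))
        hists.append(h / n)
    return np.concatenate(hists) / n_iter


def bhattacharyya_kernel(p, q):
    """Bhattacharyya coefficient between two discrete distributions."""
    return float(np.sum(np.sqrt(p * q)))


path = [(0, 1), (1, 2), (2, 3)]
relabeled_path = [(2, 0), (0, 3), (3, 1)]     # same graph, vertices renamed
star = [(0, 1), (0, 2), (0, 3)]
fp = power_iteration_features(4, path)
print(bhattacharyya_kernel(fp, power_iteration_features(4, relabeled_path)))
print(bhattacharyya_kernel(fp, power_iteration_features(4, star)))
```

The relabeled path yields the same features, so its kernel value with the original is 1, while the star graph gives a smaller value.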
Expanding the Family of Grassmannian Kernels: An Embedding Perspective
Modeling videos and image-sets as linear subspaces has proven beneficial for
many visual recognition tasks. However, it also incurs challenges arising from
the fact that linear subspaces do not obey Euclidean geometry, but lie on a
special type of Riemannian manifold known as the Grassmannian. To leverage the
techniques developed for Euclidean spaces (e.g., support vector machines) with
subspaces, several recent studies have proposed embedding the Grassmannian into
a Hilbert space by making use of a positive definite kernel. Unfortunately,
only two Grassmannian kernels are known, none of which, as we will show, is
universal, which limits their ability to approximate a target function
arbitrarily well. Here, we introduce several positive definite Grassmannian
kernels, including universal ones, and demonstrate their superiority over
previously known kernels in various tasks, such as classification, clustering,
sparse coding and hashing.
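For reference, the two classical Grassmannian kernels are presumably the projection kernel ||X^T Y||_F^2 and the Binet-Cauchy kernel det(X^T Y)^2, where X and Y are orthonormal bases of the subspaces. The sketch below implements both and adds a Gaussian RBF on the projection embedding X X^T, which is positive definite because it is a Gaussian kernel on a Euclidean space. That third kernel is included only to illustrate building new kernels through an embedding; it is not claimed to match the paper's list, and the function names and gamma are hypothetical.

```python
import numpy as np


def orthonormal_basis(M):
    """Orthonormal basis (d x p) for the column span of M via reduced QR."""
    q, _ = np.linalg.qr(M)
    return q


def projection_kernel(X, Y):
    return np.linalg.norm(X.T @ Y, "fro") ** 2


def binet_cauchy_kernel(X, Y):
    return np.linalg.det(X.T @ Y) ** 2


def projection_rbf_kernel(X, Y, gamma=0.5):
    # Gaussian kernel on the projection embedding X -> X X^T (a Euclidean space).
    d2 = np.linalg.norm(X @ X.T - Y @ Y.T, "fro") ** 2
    return np.exp(-gamma * d2)


rng = np.random.default_rng(0)
subspaces = [orthonormal_basis(rng.standard_normal((10, 3))) for _ in range(5)]
Q = orthonormal_basis(rng.standard_normal((3, 3)))  # change of basis within a subspace
X, Y = subspaces[0], subspaces[1]
for k in (projection_kernel, binet_cauchy_kernel, projection_rbf_kernel):
    # Kernel values depend only on the subspaces, not on the chosen bases.
    print(k.__name__, np.isclose(k(X, Y), k(X @ Q, Y)))
```

Basis invariance is what makes these functions well defined on the Grassmannian rather than on matrices.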
Recognition of Visual Dynamical Processes: Theory, Kernels, and Experimental Evaluation
Over the past few years, several papers have used Linear Dynamical Systems (LDSs) for modeling, registration,
segmentation, and recognition of visual dynamical processes, such as human gaits, dynamic textures and lip
articulations. The recognition framework involves identifying the parameters of the LDSs from features extracted
from a training set of videos, using metrics on the space of dynamical systems to compare them, and combining
these metrics with different classification methods. Usually, each paper makes an ad-hoc choice for every step,
and tests the recognition framework on small data sets often involving only one application. We present a detailed
evaluation of the LDS-based recognition pipeline, comparing identification methods, metrics, and classification
techniques. We propose new metrics that have certain invariance properties and explore a number of variations on the
existing metrics. We perform experimental evaluations on well-known data sets of human gaits, dynamic textures,
and lip articulations and provide benchmark recognition results. We also analyze the robustness of the recognition
pipeline with respect to changes in observation and experimental conditions. Overall, this work represents the most
extensive evaluation to date of the LDS-based recognition framework.
This work was partially supported by startup funds from JHU, by grants ONR N00014-05-10836, NSF
CAREER 0447739, NSF EHS-0509101, and by contract JHU APL-934652.
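Two stages of the pipeline described above, identifying LDS parameters from features and comparing systems with a metric, can be sketched as follows. Assumptions: identification uses the common PCA/SVD-based approach (C from the leading left singular vectors, A by least squares on the estimated state sequence), and the metric is a Martin-style distance computed from principal angles between finite-horizon observability subspaces, a truncation of the infinite observability matrices in the classical definition. Function names, the horizon, and the synthetic system are hypothetical; this is not the paper's code.

```python
import numpy as np


def identify_lds(Y, n):
    """PCA-based identification of x_{t+1} = A x_t, y_t = C x_t + mean from Y (p x T)."""
    mean = Y.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    C = U[:, :n]                               # orthonormal observation matrix
    X = np.diag(s[:n]) @ Vt[:n]                # estimated state sequence (n x T)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])   # least-squares transition matrix
    return A, C


def observability(A, C, horizon):
    """Stack [C; CA; ...; CA^(horizon-1)]."""
    blocks = [C]
    for _ in range(horizon - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)


def martin_distance(sys1, sys2, horizon=10):
    """Martin-style distance: sqrt(-sum log cos^2 of principal angles)."""
    Q1 = np.linalg.qr(observability(*sys1, horizon))[0]
    Q2 = np.linalg.qr(observability(*sys2, horizon))[0]
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)  # cosines of principal angles
    cosines = np.clip(cosines, 1e-12, 1.0)
    return float(np.sqrt(-2.0 * np.sum(np.log(cosines))))


rng = np.random.default_rng(0)
w = 0.3
A_true = 0.99 * np.array([[np.cos(w), -np.sin(w)], [np.sin(w), np.cos(w)]])
C_true = rng.standard_normal((5, 2))
x = np.array([1.0, 0.0])
frames = []
for _ in range(200):
    frames.append(C_true @ x + 0.01 * rng.standard_normal(5))
    x = A_true @ x
Y = np.stack(frames, axis=1)                   # 5 x 200 synthetic "feature" sequence
A_hat, C_hat = identify_lds(Y, n=2)
print(martin_distance((A_true, C_true), (A_hat, C_hat)))
```

Because observability subspaces do not change under a change of state basis, this distance is invariant to the arbitrary basis that any identification method returns.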