Efficient and Robust Methods for Audio and Video Signal Analysis
This thesis presents my research concerning audio and video signal processing and machine learning. Specifically, the topics of my research include computationally efficient classifier compounds, automatic speech recognition (ASR), music dereverberation, video cut point detection and video classification.
The computational efficiency of information retrieval based on multiple measurement modalities has been considered in this thesis. Specifically, a cascade processing framework, including a training algorithm to set its parameters, has been developed for combining multiple detectors or binary classifiers in a computationally efficient way. The developed cascade processing framework has been applied to the video information retrieval tasks of video cut point detection and video classification. The results in video classification, compared to others found in the literature, indicate that the developed framework is capable of both accurate and computationally efficient classification. The idea of cascade processing has additionally been adapted for the ASR task. A procedure for combining multiple speech state likelihood estimation methods within an ASR framework in a cascaded manner has been developed. The results obtained clearly show that the computational load of ASR can be reduced using the cascaded speech state likelihood estimation process without impairing the transcription accuracy.
Additionally, this thesis presents my work on noise robustness of ASR using a nonnegative matrix factorization (NMF)-based approach. Specifically, methods for transforming sparse NMF features into speech state likelihoods have been explored. The results reveal that learned transformations from NMF activations to speech state likelihoods provide better ASR transcription accuracy than dictionary label-based transformations. The results, compared to others in a noisy speech recognition challenge, show that NMF-based processing is an efficient strategy for noise robustness in ASR.
The thesis also presents my work on audio signal enhancement, specifically on removing the detrimental effect of reverberation from music audio. In this work, a linear prediction-based dereverberation algorithm, originally developed for speech signal enhancement, was applied to music. The results obtained show that the algorithm performs well with music signals and indicate that dynamic compression of music does not impair the dereverberation performance.
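As a rough illustration of the cascade idea described in this abstract, the following Python sketch chains a cheap detector and an expensive one, and stops as soon as a stage is confident enough. It is not the thesis's framework or its training algorithm: the stage functions, the reject/accept thresholds, and the per-stage costs are all hypothetical values chosen by hand, whereas the thesis learns such parameters.

```python
from typing import Callable, Sequence, Tuple

# One cascade stage: (score function, reject threshold, accept threshold, cost).
# All four fields are illustrative assumptions, not taken from the thesis.
Stage = Tuple[Callable[[float], float], float, float, float]


def cascade_classify(x: float, stages: Sequence[Stage]) -> Tuple[bool, float]:
    """Run stages in order; return (decision, accumulated cost).

    A stage decides early when its score falls below its reject threshold
    or above its accept threshold. Otherwise the next stage is consulted.
    """
    total_cost = 0.0
    score = 0.0
    for score_fn, reject_below, accept_above, cost in stages:
        total_cost += cost
        score = score_fn(x)
        if score < reject_below:
            return False, total_cost
        if score > accept_above:
            return True, total_cost
    # No stage was confident: fall back to a fixed threshold on the last score.
    return score >= 0.5, total_cost


# Hypothetical two-stage cascade: a cheap, uncertain detector followed by
# an expensive detector that always makes a hard decision.
stages = [
    (lambda x: x, 0.2, 0.8, 1.0),
    (lambda x: 1.0 if x >= 0.5 else 0.0, 0.5, 0.5, 10.0),
]

for sample in (0.1, 0.9, 0.6):
    print(sample, cascade_classify(sample, stages))
```

In this toy setup, clear-cut inputs are resolved by the first stage at a cost of 1, and only ambiguous inputs pay for the second stage, which is the source of the computational savings a cascade aims for.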
Efficient multi-level scene understanding in videos
Automatic video parsing is a key step towards human-level dynamic
scene understanding, and a fundamental problem in computer
vision.
A core issue in video understanding is to infer multiple scene
properties of a video in an efficient and consistent manner. This
thesis addresses the problem of holistic scene understanding from
monocular videos, which jointly reasons about semantic and
geometric scene properties from multiple levels, including
pixelwise annotation of video frames, object instance
segmentation in spatio-temporal domain, and/or scene-level
description in terms of scene categories and layouts.
We focus on four main issues in holistic video understanding:
1) what is the representation for consistent semantic and
geometric parsing of videos? 2) how do we integrate high-level
reasoning (e.g., objects) with pixel-wise video parsing? 3) how
can we do efficient inference for multi-level video
understanding? and 4) what is the representation learning
strategy for efficient/cost-aware scene parsing?
We discuss three multi-level video scene segmentation scenarios
based on different aspects of scene properties and efficiency
requirements. The first case addresses the problem of consistent
geometric and semantic video segmentation for outdoor scenes.
We propose a geometric scene layout representation, or a stage
scene model, to efficiently capture the dependency between the
semantic and geometric labels.
We build a unified conditional random field for joint modeling of
the semantic class, geometric label and the stage representation,
and design an alternating inference algorithm to minimize the
resulting energy function. The second case focuses on the problem
of simultaneous pixel-level and object-level segmentation in
videos. We propose to incorporate foreground object information
into pixel labeling by jointly reasoning about semantic labels of
supervoxels, object instance tracks and geometric relations
between objects. In order to model objects, we take an exemplar
approach based on a small set of object annotations to generate
a set of object proposals. We then design a conditional random
field framework that jointly models the supervoxel labels and
object instance segments. To scale up our method, we develop an
active inference strategy to improve the efficiency of
multi-level video parsing, which adaptively selects an
informative subset of object proposals and performs inference on
the resulting compact model.
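The alternating inference mentioned for the first case can be illustrated with a toy block coordinate descent in Python. This is only a sketch under strong simplifying assumptions: the energy here has per-site unary costs and a semantic/geometric compatibility term, and it omits the stage representation and any pairwise terms between sites that the thesis's conditional random field would include. The function names, cost tables, and label sets are hypothetical.

```python
def energy(sem, geo, unary_sem, unary_geo, compat):
    """Total energy: unary costs plus semantic/geometric compatibility cost."""
    return sum(
        unary_sem[i][s] + unary_geo[i][g] + compat[s][g]
        for i, (s, g) in enumerate(zip(sem, geo))
    )


def argmin(values):
    """Index of the smallest value (first one on ties)."""
    return min(range(len(values)), key=values.__getitem__)


def alternating_inference(unary_sem, unary_geo, compat, max_iters=20):
    """Alternately minimize over semantic labels and geometric labels."""
    n_sem, n_geo = len(compat), len(compat[0])
    n_sites = len(unary_sem)
    # Initialize each label set from its unary costs alone.
    sem = [argmin(row) for row in unary_sem]
    geo = [argmin(row) for row in unary_geo]
    for _ in range(max_iters):
        # Fix geometric labels, pick the best semantic label per site.
        new_sem = [
            argmin([unary_sem[i][s] + compat[s][geo[i]] for s in range(n_sem)])
            for i in range(n_sites)
        ]
        # Fix semantic labels, pick the best geometric label per site.
        new_geo = [
            argmin([unary_geo[i][g] + compat[new_sem[i]][g] for g in range(n_geo)])
            for i in range(n_sites)
        ]
        if new_sem == sem and new_geo == geo:
            break  # No label changed: a local minimum has been reached.
        sem, geo = new_sem, new_geo
    return sem, geo


# Hypothetical costs for two sites, two semantic labels, two geometric labels.
unary_sem = [[0.0, 1.0], [0.4, 0.5]]
unary_geo = [[0.0, 1.0], [1.0, 0.0]]
compat = [[0.0, 2.0], [2.0, 0.0]]  # matching label indices are compatible

sem, geo = alternating_inference(unary_sem, unary_geo, compat)
print(sem, geo, energy(sem, geo, unary_sem, unary_geo, compat))
```

Because each half-step exactly minimizes the energy over one label set while the other is held fixed, the energy never increases between iterations. The procedure only guarantees a local minimum, not a global one.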
The last case explores the problem of learning a flexible
representation for efficient scene labeling. We propose a dynamic
hierarchical model that allows us to achieve flexible trade-offs
between efficiency and accuracy. Our approach incorporates the
cost of feature computation and model inference, and optimizes
the model performance for any given test-time budget. We evaluate
all our methods on several publicly available video and image
semantic segmentation datasets, and demonstrate superior
performance in efficiency and accuracy.
Keywords: Semantic video segmentation, Multi-level scene
understanding, Efficient inference, Cost-aware scene parsing