68 research outputs found
Data Driven Approaches for Image & Video Understanding: from Traditional to Zero-shot Supervised Learning
In the present age of advanced computer vision, the necessity of (user-annotated) data is a key factor in image & video understanding. Recent success of deep learning on large scale data has only acted as a catalyst. There are certain problems that exist in this regard: 1) scarcity of (annotated) data, 2) need of expensive manual annotation, 3) problem of change in domain, 4) knowledge base not exhaustive.
To make efficient learning systems, one has to be prepared to deal with such diverse set of problems.
In terms of data availability, extensive manual annotation can be beneficial in obtaining category specific knowledge. Even then, learning efficient representation for the related task is challenging and requires special attention. On the other hand, when labelled data is scarce, learning category specific representation itself becomes challenging. In this work, I investigate data driven approaches that cater to traditional supervised learning setup as well as an extreme case of data scarcity where no
data from test classes are available during training, known as zero-shot learning. First, I look into supervised learning setup with ample annotations and propose efficient dictionary learning technique for better learning of data representation for the task of action classification in images & videos. Then I propose robust mid-level feature representations for action videos that are equally effective in traditional supervised learning as well as zero-shot learning. Finally, I come up with novel approach that
cater to zero-shot learning specifically. Thorough discussions followed by experimental validations establish the worth of these novel techniques in solving computer vision related tasks under varying data-dependent scenarios
Machine learning solutions to visual recognition problems
This thesis gives an overview of my research since my arrival in December 2005 as a postdoctoral fellow at the in the LEAR team at INRIA Rhone-Alpes. After a general introduction in Chapter 1, the contributions are presented in chapters 2–4 along three themes. In each chapter we describe the contributions, their relation to related work, and highlight two contributions with more detail. Chapter 2 is concerned with contributions related to the Fisher vector representation. We highlight an extension of the representation based on modeling dependencies among local descriptors (Cinbis et al., 2012, 2016a). The second highlight is on an approximate normalization scheme which speeds-up applications for object and action localization (Oneata et al., 2014b). In Chapter 3 we consider the contributions related to metric learning. The first contribution we highlight is a nearest-neighbor based image annotation method that learns weights over neighbors, and effectively determines the number of neighbors to use (Guillaumin et al., 2009a). The second contribution we highlight is an image classification method based on metric learning for the nearest class mean classifier that can efficiently generalize to new classes (Mensink et al., 2012, 2013b). The third set of contributions, presented in Chapter 4, is related to learning visual recognition models from incomplete supervision. The first highlighted contribution is an interactive image annotation method that exploits dependencies across different image labels, to improve predictions and to identify the most informative user input (Mensink et al., 2011, 2013a). The second highlighted contribution is a multi-fold multiple instance learning method for learning object localization models from training images where we only know if the object is present in the image or not (Cinbis et al., 2014, 2016b). Finally, Chapter 5 summarizes the contributions, and presents future research directions
A Methodology for Extracting Human Bodies from Still Images
Monitoring and surveillance of humans is one of the most prominent applications of today and it is expected to be part of many future aspects of our life, for safety reasons, assisted living and many others. Many efforts have been made towards automatic and robust solutions, but the general problem is very challenging and remains still open. In this PhD dissertation we examine the problem from many perspectives. First, we study the performance of a hardware architecture designed for large-scale surveillance systems. Then, we focus on the general problem of human activity recognition, present an extensive survey of methodologies that deal with this subject and propose a maturity metric to evaluate them.
One of the numerous and most popular algorithms for image processing found in the field is image segmentation and we propose a blind metric to evaluate their results regarding the activity at local regions. Finally, we propose a fully automatic system for segmenting and extracting human bodies from challenging single images, which is the main contribution of the dissertation. Our methodology is a novel bottom-up approach relying mostly on anthropometric constraints and is facilitated by our research in the fields of face, skin and hands detection. Experimental results and comparison with state-of-the-art methodologies demonstrate the success of our approach
Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques
The growing use of voice user interfaces has led to a surge in the collection
and storage of speech data. While data collection allows for the development of
efficient tools powering most speech services, it also poses serious privacy
issues for users as centralized storage makes private personal speech data
vulnerable to cyber threats. With the increasing use of voice-based digital
assistants like Amazon's Alexa, Google's Home, and Apple's Siri, and with the
increasing ease with which personal speech data can be collected, the risk of
malicious use of voice-cloning and speaker/gender/pathological/etc. recognition
has increased.
This thesis proposes solutions for anonymizing speech and evaluating the
degree of the anonymization. In this work, anonymization refers to making
personal speech data unlinkable to an identity while maintaining the usefulness
(utility) of the speech signal (e.g., access to linguistic content). We start
by identifying several challenges that evaluation protocols need to consider to
evaluate the degree of privacy protection properly. We clarify how
anonymization systems must be configured for evaluation purposes and highlight
that many practical deployment configurations do not permit privacy evaluation.
Furthermore, we study and examine the most common voice conversion-based
anonymization system and identify its weak points before suggesting new methods
to overcome some limitations. We isolate all components of the anonymization
system to evaluate the degree of speaker PPI associated with each of them.
Then, we propose several transformation methods for each component to reduce as
much as possible speaker PPI while maintaining utility. We promote
anonymization algorithms based on quantization-based transformation as an
alternative to the most-used and well-known noise-based approach. Finally, we
endeavor a new attack method to invert anonymization.Comment: PhD Thesis Pierre Champion | Universit\'e de Lorraine - INRIA Nancy |
for associated source code, see https://github.com/deep-privacy/SA-toolki
Automatic human behaviour anomaly detection in surveillance video
This thesis work focusses upon developing the capability to automatically evaluate
and detect anomalies in human behaviour from surveillance video. We work with
static monocular cameras in crowded urban surveillance scenarios, particularly air-
ports and commercial shopping areas. Typically a person is 100 to 200 pixels high
in a scene ranging from 10 - 20 meters width and depth, populated by 5 to 40 peo-
ple at any given time. Our procedure evaluates human behaviour unobtrusively to
determine outlying behavioural events,
agging abnormal events to the operator.
In order to achieve automatic human behaviour anomaly detection we address
the challenge of interpreting behaviour within the context of the social and physical
environment. We develop and evaluate a process for measuring social connectivity
between individuals in a scene using motion and visual attention features. To do this
we use mutual information and Euclidean distance to build a social similarity matrix
which encodes the social connection strength between any two individuals. We de-
velop a second contextual basis which acts by segmenting a surveillance environment
into behaviourally homogeneous subregions which represent high tra c slow regions
and queuing areas. We model the heterogeneous scene in homogeneous subgroups
using both contextual elements. We bring the social contextual information, the
scene context, the motion, and visual attention features together to demonstrate
a novel human behaviour anomaly detection process which nds outlier behaviour
from a short sequence of video. The method, Nearest Neighbour Ranked Outlier
Clusters (NN-RCO), is based upon modelling behaviour as a time independent se-
quence of behaviour events, can be trained in advance or set upon a single sequence.
We nd that in a crowded scene the application of Mutual Information-based social
context permits the ability to prevent self-justifying groups and propagate anomalies
in a social network, granting a greater anomaly detection capability. Scene context
uniformly improves the detection of anomalies in all the datasets we test upon.
We additionally demonstrate that our work is applicable to other data domains.
We demonstrate upon the Automatic Identi cation Signal data in the maritime
domain. Our work is capable of identifying abnormal shipping behaviour using joint
motion dependency as analogous for social connectivity, and similarly segmenting
the shipping environment into homogeneous regions
Online, Supervised and Unsupervised Action Localization in Videos
Action recognition classifies a given video among a set of action labels, whereas action localization determines the location of an action in addition to its class. The overall aim of this dissertation is action localization. Many of the existing action localization approaches exhaustively search (spatially and temporally) for an action in a video. However, as the search space increases with high resolution and longer duration videos, it becomes impractical to use such sliding window techniques. The first part of this dissertation presents an efficient approach for localizing actions by learning contextual relations between different video regions in training. In testing, we use the context information to estimate the probability of each supervoxel belonging to the foreground action and use Conditional Random Field (CRF) to localize actions. In the above method and typical approaches to this problem, localization is performed in an offline manner where all the video frames are processed together. This prevents timely localization and prediction of actions/interactions - an important consideration for many tasks including surveillance and human-machine interaction. Therefore, in the second part of this dissertation we propose an online approach to the challenging problem of localization and prediction of actions/interactions in videos. In this approach, we use human poses and superpixels in each frame to train discriminative appearance models and perform online prediction of actions/interactions with Structural SVM. Above two approaches rely on human supervision in the form of assigning action class labels to videos and annotating actor bounding boxes in each frame of training videos. Therefore, in the third part of this dissertation we address the problem of unsupervised action localization. Given unlabeled videos without annotations, this approach aims at: 1) Discovering action classes using a discriminative clustering approach, and 2) Localizing actions using a variant of Knapsack problem
Computer-aided Visualization of Colonoscopy
Colonoscopy is the most widely used medical technique to examine the human large intestine (colon) and eliminate precancerous or malignant lesions, i.e., polyps. It uses a high-definition camera to examine the inner surface of the colon. Very often, a portion of the colon surface is not visualized during the procedure. Unsurveyed portions of the colon can harbor polyps that then progress to colorectal cancer. Unfortunately, it is hard for the endoscopist to realize there is unsurveyed surface from the video as it is formed. A system to alert endoscopists to missed surface area could thus more fully protect patients from colorectal cancer following colonoscopy. In this dissertation computer-aided visualization techniques were developed in order to solve this problem:1. A novel Simultaneous Localization and Mapping (SLAM) algorithm called RNNSLAM was proposed to address the difficulties of applying a traditional SLAM system on colonic images. I improved a standard SLAM system with a previously proposed Recurrent Neural Network for Depth and Pose Estimation (RNN-DP). The combination of SLAM’s optimization mechanism and RNN-DP’s prior knowledge achieved state-of-the-art performance on colonoscopy, especially addressing the drift problem in both SLAM and RNN-DP. A fusion module was added to this system to generate a dense 3D surface.2. I conducted exploration research on recognizing colonic places that have been visited based on video frames. This technique called image relocalization or retrieval is needed for helping the endoscopist to fully survey the previously unsurveyed regions. A benchmark testing dataset was created for colon image retrieval. Deep neural networks were successfully trained using Structure from Motion results on colonoscopy and achieved promising results.3. To visualize highly-curved portions of a colon or the whole colon, a generalized cylinder deformation algorithm was proposed to semi-flatten the geometry of the colon model for more succinct and global visualization.Doctor of Philosoph
Visual Representation Learning with Limited Supervision
The quality of a Computer Vision system is proportional to the rigor of data representation it is built upon. Learning expressive representations of images is therefore the centerpiece to almost every computer vision application, including image search, object detection and classification, human re-identification, object tracking, pose understanding, image-to-image translation, and embodied agent navigation to name a few. Deep Neural Networks are most often seen among the modern methods of representation learning. The limitation is, however, that deep representation learning methods require extremely large amounts of manually labeled data for training. Clearly, annotating vast amounts of images for various environments is infeasible due to cost and time constraints. This requirement of obtaining labeled data is a prime restriction regarding pace of the development of visual recognition systems.
In order to cope with the exponentially growing amounts of visual data generated daily, machine learning algorithms have to at least strive to scale at a similar rate.
The second challenge consists in the learned representations having to generalize to novel objects, classes, environments and tasks in order to accommodate to the diversity of the visual world.
Despite the evergrowing number of recent publications tangentially addressing the topic of learning generalizable representations, efficient generalization is yet to be achieved. This dissertation attempts to tackle the problem of learning visual representations that can generalize to novel settings while requiring few labeled examples.
In this research, we study the limitations of the existing supervised representation learning approaches and propose a framework that improves the generalization of learned features by exploiting visual similarities between images which are not captured by provided manual annotations. Furthermore, to mitigate the common requirement of large scale manually annotated datasets, we propose several approaches that can learn expressive representations without human-attributed labels, in a self-supervised fashion, by grouping highly-similar samples into surrogate classes based on progressively learned representations.
The development of computer vision as science is preconditioned upon the seamless ability of a machine to record and disentangle pictures' attributes that were expected to only be conceived by humans. As such, particular interest was dedicated to the ability to analyze the means of artistic expression and style which depicts a more complex task than merely breaking an image down to colors and pixels. The ultimate test for this ability is the task of style transfer which involves altering the style of an image while keeping its content. An effective solution of style transfer requires learning such image representation which would allow disentangling image style and its content.
Moreover, particular artistic styles come with idiosyncrasies that affect which content details should be preserved and which discarded.
Another pitfall here is that it is impossible to get pixel-wise annotations of style and how the style should be altered.
We address this problem by proposing an unsupervised approach that enables encoding the image content in such a way that is required by a particular style.
The proposed approach exchanges the style of an input image by first extracting the content representation in a style-aware way and then rendering it in a new style using a style-specific decoder network, achieving compelling results in image and video stylization.
Finally, we combine supervised and self-supervised representation learning techniques for the task of human and animals pose understanding. The proposed method enables transfer of the representation learned for recognition of human poses to proximal mammal species without using labeled animal images. This approach is not limited to dense pose estimation and could potentially enable autonomous agents from robots to self-driving cars to retrain themselves and adapt to novel environments based on learning from previous experiences
Characterizing Objects in Images using Human Context
Humans have an unmatched capability of interpreting detailed information about existent objects by just looking at an image. Particularly, they can effortlessly perform the following tasks: 1) Localizing various objects in the image and 2) Assigning functionalities to the parts of localized objects. This dissertation addresses the problem of aiding vision systems accomplish these two goals. The first part of the dissertation concerns object detection in a Hough-based framework. To this end, the independence assumption between features is addressed by grouping them in a local neighborhood. We study the complementary nature of individual and grouped features and combine them to achieve improved performance. Further, we consider the challenging case of detecting small and medium sized household objects under human-object interactions. We first evaluate appearance based star and tree models. While the tree model is slightly better, appearance based methods continue to suffer due to deficiencies caused by human interactions. To this end, we successfully incorporate automatically extracted human pose as a form of context for object detection. The second part of the dissertation addresses the tedious process of manually annotating objects to train fully supervised detectors. We observe that videos of human-object interactions with activity labels can serve as weakly annotated examples of household objects. Since such objects cannot be localized only through appearance or motion, we propose a framework that includes human centric functionality to retrieve the common object. Designed to maximize data utility by detecting multiple instances of an object per video, the framework achieves performance comparable to its fully supervised counterpart. The final part of the dissertation concerns localizing functional regions or affordances within objects by casting the problem as that of semantic image segmentation. To this end, we introduce a dataset involving human-object interactions with strong i.e. pixel level and weak i.e. clickpoint and image level affordance annotations. We propose a framework that utilizes both forms of weak labels and demonstrate that efforts for weak annotation can be further optimized using human context
- …