12 research outputs found
Object Tracking-by-Segmentation in Videos
This thesis focuses on the problem of object tracking. Given a video, the general objective of tracking is to follow the location over time of one or more targets in the image sequence. This is a very challenging task, as algorithms need to deal with problems such as appearance variations, non-rigid deformations, cluttered backgrounds, and occlusions. While most existing methods use bounding boxes to represent the target, we use segmentations instead, which provide better access to target pixels and can better handle occlusions. Our first contribution is a new tracking algorithm that, given an over-segmentation of a video, tracks multiple targets through interactions and occlusions. We develop a provably convergent learning algorithm for this approach, which leverages training data to improve performance. Our second contribution targets the case when an over-segmentation is not available due to poor video quality or low resolution. For this case, we develop a new algorithm that tracks coherent regions and estimates the number of target objects in each region. This count representation of a video can be used to help inform more traditional tracking techniques. Finally, we develop the first tracking-by-segmentation approach based on deep learning. We propose a novel deep network architecture and training algorithms for learning to segment and track a target object throughout a video. All of our algorithms are rigorously evaluated on challenging benchmark video collections, where they demonstrate improvements over the state of the art.
Fusion of Head and Full-Body Detectors for Multi-Object Tracking
In order to track all persons in a scene, the tracking-by-detection paradigm
has proven to be a very effective approach. Yet, relying solely on a single
detector is also a major limitation, as useful image information might be
ignored. Consequently, this work demonstrates how to fuse two detectors into a
tracking system. To obtain the trajectories, we propose to formulate tracking
as a weighted graph labeling problem, resulting in a binary quadratic program.
As such problems are NP-hard, the solution can only be approximated. Based on
the Frank-Wolfe algorithm, we present a new solver that is crucial to handle
such difficult problems. Evaluation on pedestrian tracking is provided for
multiple scenarios, showing superior results over single detector tracking and
standard QP-solvers. Finally, our tracker ranks 2nd on the MOT16 benchmark and
1st on the new MOT17 benchmark, outperforming over 90 trackers. Comment: 10 pages, 4 figures; Winner of the MOT17 challenge; CVPRW 201
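As a rough illustration of the Frank-Wolfe idea mentioned above, the following Python sketch minimises a small quadratic objective over the box relaxation [0, 1]^n of a binary problem and then rounds the result. The function names, the box relaxation, the 2/(k+2) step size and the toy costs are all assumptions made for illustration; the paper's actual graph-labeling constraints and solver are not reproduced here.

```python
from typing import List, Sequence


def frank_wolfe_box_qp(
    Q: Sequence[Sequence[float]],
    c: Sequence[float],
    iterations: int = 200,
) -> List[float]:
    """Approximately minimise x^T Q x + c^T x over the box [0, 1]^n.

    Hypothetical helper: a plain Frank-Wolfe loop on a relaxed binary QP.
    """
    n = len(c)
    x = [0.5] * n  # start in the middle of the box
    for k in range(iterations):
        # Gradient of x^T Q x + c^T x is (Q + Q^T) x + c.
        grad = [
            sum((Q[i][j] + Q[j][i]) * x[j] for j in range(n)) + c[i]
            for i in range(n)
        ]
        # Linear minimisation oracle over the box: pick the vertex that
        # minimises <grad, s>, i.e. s_i = 1 where the gradient is negative.
        s = [1.0 if g < 0 else 0.0 for g in grad]
        gamma = 2.0 / (k + 2.0)  # standard diminishing step size
        x = [xi + gamma * (si - xi) for xi, si in zip(x, s)]
    return x


def round_to_binary(x: Sequence[float]) -> List[int]:
    """Threshold a relaxed solution at 0.5 to obtain a binary labeling."""
    return [1 if xi >= 0.5 else 0 for xi in x]


# Toy example (assumed costs): two conflicting detections, of which the
# first is more rewarding; selecting both incurs a pairwise penalty.
Q_toy = [[0.0, 1.0], [1.0, 0.0]]
c_toy = [-3.0, -1.0]
labels = round_to_binary(frank_wolfe_box_qp(Q_toy, c_toy))
```

In this toy instance the relaxed iterate settles on selecting only the first detection. For non-convex objectives, Frank-Wolfe offers no guarantee of reaching the global optimum, which is consistent with the abstract's remark that the solution can only be approximated.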
Enhancing camera surveillance using computer vision: a research note
- The growth of police operated surveillance cameras has
out-paced the ability of humans to monitor them effectively. Computer vision is
a possible solution. An ongoing research project on the application of computer
vision within a municipal police department is described. The paper aims to
discuss these issues.
- Following the demystification of
computer vision technology, its potential for police agencies is developed
within a focus on computer vision as a solution for two common surveillance
camera tasks (live monitoring of multiple surveillance cameras and summarizing
archived video files). Three unaddressed research questions (can specialized
computer vision applications for law enforcement be developed at this time, how
will computer vision be utilized within existing public safety camera
monitoring rooms, and what are the system-wide impacts of a computer vision
capability on local criminal justice systems) are considered.
- Despite computer vision becoming accessible to law
enforcement agencies, the impact of computer vision has not been discussed or
adequately researched. There is little knowledge of computer vision or its
potential in the field.
- This paper introduces and discusses computer
vision from a law enforcement perspective and will be valuable to police
personnel tasked with monitoring large camera networks and considering computer
vision as a system upgrade
Learning to Reduce Annotation Load
Modern machine learning methods and their applications in computer vision are known to need large amounts of training data to reach their full potential. Because training data is mostly obtained through humans who manually label samples, this incurs a significant cost. Therefore, the problem of reducing the annotation load is of great importance for the success of machine learning methods.
We study the problem of reducing the annotation load from two viewpoints, by answering the questions “What to annotate?” and “How to annotate?”. The question “What?” addresses the selection of a small portion of the data that would be sufficient to train an accurate model. The question “How?” focuses on minimising the effort of labelling each datapoint. The question “What to annotate?” becomes particularly compelling if we can select data to be annotated in an iterative and adaptive way, a setting known as active learning (AL). The key challenge in AL is to identify the datapoints that are the most informative for the model at a given stage. We propose several techniques to address this challenge. Firstly, we consider the problem of segmenting natural images and image volumes. We take advantage of image priors, such as smoothness of objects of interest, and use them in a novel form of geometric uncertainty. Using this, we design an AL technique to efficiently annotate data that is tailored to segmentation applications. Next, we notice that no single manually-designed strategy outperforms others in every application and that often the burden of designing new strategies outweighs the benefits of AL. To overcome this problem we suggest learning an AL strategy from data by formulating the AL problem as a regression task that predicts the reduction in the generalisation error achieved by labelling each datapoint. This enables us to learn AL strategies from simulated data and to transfer them to new datasets. Finally, we turn towards non-myopic data-driven AL strategies. To this end, we formulate the AL problem as a Markov decision process and find the best selection policy using reinforcement learning. We design the decision process such that the policy can be learnt for any ML model and transferred to diverse application domains.
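To make the active learning selection step concrete, here is a minimal Python sketch of entropy-based uncertainty sampling, a common hand-designed baseline. It is not the geometric uncertainty or the learned strategies proposed in the thesis; the function names and the toy probabilities are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(
    pool_probs: Sequence[Sequence[float]], batch_size: int = 1
) -> List[int]:
    """Return indices of the unlabelled datapoints with the highest entropy.

    pool_probs[i] is the current model's predicted class distribution for
    the i-th unlabelled datapoint (a hypothetical input format).
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Toy pool of three unlabelled datapoints with binary predictions.
pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
to_annotate = select_most_uncertain(pool, batch_size=2)
```

In an AL loop, the selected datapoints would be sent to the annotator, the model retrained, and the selection repeated on the remaining pool.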
Effectively addressing the question “How to annotate?” is of no less importance, as large cost savings can be achieved by labelling each datapoint more efficiently. This can be done with intelligent interfaces that interact with a human annotator. We make two contributions towards answering the question “How?”. Firstly, we propose an efficient technique to annotate 3D image volumes for image segmentation. Annotating data in 3D is cumbersome and an obvious way to facilitate it is to select a subset of the data lying on a 2D plane. To find the optimal plane (i.e. the one containing the most informative datapoints) we design a branch-and-bound algorithm that quickly eliminates hypotheses about the optimal projection. Secondly, we propose an intelligent data annotation method to train object detectors. Instead of always asking the human annotator to draw bounding boxes in images, we detect automatically in which cases we can rely on the current detector and verify its proposal.
Efficient human annotation schemes for training object class detectors
A central task in computer vision is detecting object classes such as cars and horses
in complex scenes. Training an object class detector typically requires a large set of
images labeled with tight bounding boxes around every object instance. Obtaining
such data requires human annotation, which is very expensive and time consuming.
Alternatively, researchers have tried to train models in a weakly supervised setting (i.e.,
given only image-level labels), which is much cheaper but leads to weaker detectors.
In this thesis, we propose new and efficient human annotation schemes for training
object class detectors that bypass the need for drawing bounding boxes and reduce the
annotation cost while still obtaining high quality object detectors.
First, we propose to train object class detectors from eye tracking data. Instead
of drawing tight bounding boxes, the annotators only need to look at the image and
find the target object. We track the eye movements of annotators while they perform
this visual search task and we propose a technique for deriving object bounding boxes
from these eye fixations. To validate our idea, we augment an existing object detection
dataset with eye tracking data.
Second, we propose a scheme for training object class detectors, which only requires
annotators to verify bounding-boxes produced automatically by the learning
algorithm. Our scheme introduces human verification as a new step into a standard
weakly supervised framework which typically iterates between re-training object detectors
and re-localizing objects in the training images. We use the verification signal
to improve both re-training and re-localization.
Third, we propose another scheme where annotators are asked to click on the center
of an imaginary bounding box, which tightly encloses the object. We then incorporate
these clicks into a weakly supervised object localization technique, to jointly localize
object bounding boxes over all training images. Both our center-clicking and human
verification schemes deliver detectors performing almost as well as those trained in a
fully supervised setting.
Finally, we propose extreme clicking. We ask the annotator to click on four physical
points on the object: the top, bottom, left- and right-most points. This task is more
natural than the traditional way of drawing boxes and these points are easy to find. Our
experiments show that annotating objects with extreme clicking is 5× faster than the
traditional way of drawing boxes and it leads to boxes of the same quality as the original
ground-truth drawn the traditional way. Moreover, we use the resulting extreme
points to obtain more accurate segmentations than those derived from bounding boxes
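The geometric step behind extreme clicking can be sketched in a few lines of Python: four clicked points determine an axis-aligned bounding box. The `Point` type, the function name and the example coordinates below are hypothetical, and image coordinates are assumed to have y increasing downwards.

```python
from typing import NamedTuple, Tuple


class Point(NamedTuple):
    x: float
    y: float


def box_from_extreme_points(
    top: Point, bottom: Point, left: Point, right: Point
) -> Tuple[float, float, float, float]:
    """Return the box (x_min, y_min, x_max, y_max) enclosing four clicks.

    Taking the min and max over all four points keeps the result valid
    even if a click is slightly off or the points arrive in another order.
    """
    points = (top, bottom, left, right)
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Example clicks on an object (assumed pixel coordinates).
box = box_from_extreme_points(
    top=Point(50, 10),
    bottom=Point(60, 90),
    left=Point(20, 40),
    right=Point(100, 55),
)
```

Unlike a box drawn by dragging corners, the four clicks also lie on the object itself, which is what makes them usable as extra cues for segmentation, as the abstract notes.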
Symbiotic deep learning for medical image analysis with applications in real-time diagnosis for fetal ultrasound screening
The last hundred years have seen a monumental rise in the power and capability of machines to
perform intelligent tasks in place of human operators. This rise is not expected
to slow down any time soon and what this means for society and humanity as a whole remains
to be seen. The overwhelming notion is that with the right goals in mind, the growing influence
of machines on our everyday tasks will enable humanity to give more attention to the truly
groundbreaking challenges that we all face together. This will usher in a new age of human-machine
collaboration in which humans and machines may work side by side to achieve greater
heights for all of humanity. Intelligent systems are useful in isolation, but the true benefits of
intelligent systems come to the fore in complex systems where the interaction between humans
and machines can be made seamless, and it is this goal of symbiosis between human and machine
that may democratise complex knowledge, which motivates this thesis. In the recent past, data-driven
methods have come to the fore and now represent the state of the art in many different
fields. Alongside the shift from rule-based towards data-driven methods we have also seen a
shift in how humans interact with these technologies. Human-computer interaction is changing
in response to data-driven methods and new techniques must be developed to enable the same
symbiosis between man and machine for data-driven methods as for previous formula-driven
technology.
We address five key challenges which need to be overcome for data-driven human-in-the-loop
computing to reach maturity. These are (1) the ‘Categorisation Challenge’, where we examine
existing work and form a taxonomy of the different methods being utilised for data-driven
human-in-the-loop computing; (2) the ‘Confidence Challenge’, where data-driven methods must
communicate interpretable beliefs in how confident their predictions are; (3) the ‘Complexity
Challenge’, where the aim of reasoned communication becomes increasingly important as the
complexity of tasks and of the methods to solve them increases; (4) the ‘Classification Challenge’, in
which we look at how complex methods can be separated in order to provide greater reasoning
in complex classification tasks; and finally (5) the ‘Curation Challenge’, where we challenge the
assumptions around bottleneck creation for the development of supervised learning methods.
Combining content analysis with usage analysis to better understand visual contents
This thesis focuses on the problem of understanding visual contents, which can be images, videos or 3D contents. Understanding means that we aim at inferring semantic information about the visual content. The goal of our work is to study methods that combine two types of approaches: 1) automatic content analysis and 2) an analysis of how humans interact with the content (in other words, usage analysis). We start by reviewing the state of the art from both the Computer Vision and Multimedia communities. Twenty years ago, the main approach aimed at a fully automatic understanding of images. Today, this approach is giving way to different forms of human intervention, whether through the constitution of annotated datasets, by solving problems interactively (e.g. detection or segmentation), or by the implicit collection of information gathered from content usages. These different types of human intervention are at the heart of modern research questions: how to motivate human contributors? How to design interactive scenarios that will generate interactions that contribute to content understanding? How to check or ensure the quality of human contributions? How to aggregate human contributions? How to fuse inputs obtained from usage analysis with traditional outputs from content analysis? Our literature review addresses these questions and allows us to position the contributions of this thesis. In our first set of contributions we revisit the detection of important (or salient) regions through implicit feedback from users that either consume or produce visual contents. In 2D, we develop several interfaces for interactive video (e.g. zoomable video) in order to coordinate content analysis and usage analysis. We also generalize these results to 3D by introducing a new detector of salient regions that builds upon simultaneous video recordings of the same public artistic performance (dance show, chant, etc.) by multiple users.
The second contribution of our work aims at a semantic understanding of still images. With this goal in mind, we use data gathered through a game, Ask’nSeek, that we created. Elementary interactions (such as clicks) together with textual input data from players are, as before, mixed with automatic analysis of images. In particular, we show the usefulness of interactions that help reveal spatial relations between different objects in a scene. After studying the problem of detecting objects in a scene, we also address the more ambitious problem of segmentation.