Automatic annotation for weakly supervised learning of detectors
Object detection in images and action detection in videos are among the most widely studied
computer vision problems, with applications in consumer photography, surveillance, and automatic
media tagging. Typically, these standard detectors are fully supervised; that is, they require
a large body of training data where the locations of the objects/actions in images/videos have
been manually annotated. With the emergence of digital media, and the rise of high-speed internet,
raw images and video are available for little to no cost. However, the manual annotation
of object and action locations remains tedious, slow, and expensive. As a result there has been
a great interest in training detectors with weak supervision where only the presence or absence
of object/action in image/video is needed, not the location. This thesis presents approaches for
weakly supervised learning of object/action detectors, with a focus on automatically annotating
object and action locations using only binary weak labels that indicate the presence
or absence of the object/action in an image/video.
First, a framework for weakly supervised learning of object detectors in images is presented.
In the proposed approach, a variation of the multiple instance learning (MIL) technique for automatically
annotating object locations in weakly labelled data is presented which, unlike existing
approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial
annotation is then used to start an iterative process in which standard object detectors are used to
refine the location annotation. Finally, to ensure that the iterative training of detectors does not drift
from the object of interest, a scheme for detecting model drift is also presented. Furthermore,
unlike most other methods, our weakly supervised approach is evaluated on data without manual
pose (object orientation) annotation.
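The iterative annotation loop described above can be sketched with a toy numpy example. This is a loose illustration, not the thesis method: the inter-class and intra-class cue fusion is replaced by a single dissimilarity-to-background cue, the "detector" is a nearest-mean linear scorer, and the model-drift detection scheme is omitted. All names and data here are hypothetical.

```python
import numpy as np

def iterative_annotation(bags, labels, n_iters=5):
    """Toy MIL-style loop: each 'bag' (image) holds candidate window
    features; positive bags contain the object somewhere, negative
    bags do not. Returns, per positive bag, the selected window index."""
    # windows from negative images approximate background appearance
    neg = np.vstack([bags[i] for i in range(len(bags)) if labels[i] == 0])
    neg_mean = neg.mean(axis=0)
    # initial annotation: in each positive image, pick the candidate
    # window most dissimilar to the background (a crude inter-class cue)
    selected = {i: int(np.argmax(((bags[i] - neg_mean) ** 2).sum(axis=1)))
                for i in range(len(bags)) if labels[i] == 1}
    for _ in range(n_iters):
        # "train" a toy detector: nearest-mean linear scorer
        pos = np.array([bags[i][j] for i, j in selected.items()])
        w = pos.mean(axis=0) - neg_mean
        # refine the annotation with the detector's highest-scoring window
        selected = {i: int(np.argmax(bags[i] @ w))
                    for i in range(len(bags)) if labels[i] == 1}
    return selected
```

On synthetic data where each positive image holds one "object" window far from the background cluster, the loop converges to those windows.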
Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues,
is carried out. From the analysis, a new method based on negative mining (NegMine) is presented
for the initial annotation of both object and action data. The NegMine-based approach is a
much simpler formulation, using only an inter-class measure, and requires no complex combinatorial
optimisation, but can still meet or outperform existing approaches, including the previously
presented inter-intra class cue fusion approach. Furthermore, NegMine can be fused with existing
approaches to boost their performance.
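One plausible reading of a negative-mining initial annotation can be sketched in a few lines of numpy: score each candidate window purely by an inter-class measure (its distance to the nearest window mined from negative images), with no combinatorial optimisation. This is an illustrative guess at the idea, not the NegMine formulation itself; the function name and data layout are hypothetical.

```python
import numpy as np

def negmine_annotate(pos_bags, neg_windows):
    """For each positive image (a bag of candidate-window features),
    pick the candidate farthest from its closest mined negative
    window, i.e. the most 'un-background-like' candidate."""
    picks = []
    for bag in pos_bags:
        # pairwise squared distances: candidates x negative windows
        d = ((bag[:, None, :] - neg_windows[None, :, :]) ** 2).sum(-1)
        picks.append(int(np.argmax(d.min(axis=1))))
    return picks
```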
Finally, the thesis will take a step back and look at the use of generic object detectors as prior
knowledge in weakly supervised learning of object detectors. These generic object detectors are
typically based on sampling saliency maps that indicate if a pixel belongs to the background
or foreground. A new approach to generating saliency maps is presented that, unlike existing
approaches, looks beyond the current image of interest and into images similar to the current
image. We show that our generic object proposal method can be used by itself to annotate the
weakly labelled object data with surprisingly high accuracy.
Visually Indicated Sounds
Objects make distinctive sounds when they are hit or scratched. These sounds
reveal aspects of an object's material properties, as well as the actions that
produced them. In this paper, we propose the task of predicting what sound an
object makes when struck as a way of studying physical interactions within a
visual scene. We present an algorithm that synthesizes sound from silent videos
of people hitting and scratching objects with a drumstick. This algorithm uses
a recurrent neural network to predict sound features from videos and then
produces a waveform from these features with an example-based synthesis
procedure. We show that the sounds predicted by our model are realistic enough
to fool participants in a "real or fake" psychophysical experiment, and that
they convey significant information about material properties and physical
interactions.
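The example-based synthesis stage described in the abstract can be sketched as nearest-neighbour retrieval: given sound features predicted for a video (here assumed to come from the recurrent network, which is not shown), copy the training waveform snippet whose features are closest and concatenate the snippets. A minimal, hypothetical sketch:

```python
import numpy as np

def example_based_synthesis(pred_feats, train_feats, train_waves):
    """For each predicted sound-feature vector, retrieve the waveform
    snippet whose training feature is nearest (squared Euclidean
    distance), then concatenate the retrieved snippets into one wave."""
    out = []
    for f in pred_feats:
        idx = int(np.argmin(((train_feats - f) ** 2).sum(axis=1)))
        out.append(train_waves[idx])
    return np.concatenate(out)
```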
High performances monolithic CMOS detectors for space applications
During the last 10 years, research on CMOS image sensors (also called APS, Active Pixel Sensors) has been carried out intensively in order to offer an alternative to CCDs as image sensors. This is particularly the case for space applications, as CMOS image sensors feature characteristics which are obviously of interest for flight hardware: parallel or semi-parallel architecture, on-chip control and processing electronics, low power dissipation, a high level of radiation tolerance... Many image sensor companies, institutes and laboratories have demonstrated the compatibility of CMOS image sensors with consumer applications: micro-cameras, video-conferencing, digital still cameras. And recent designs have shown that APS is getting closer to the CCD in terms of performance level. However, the large majority of the existing products do not offer the specific features which are required for many space applications. ASTRIUM and SUPAERO/CIMI have decided to work together with a view to developing CMOS image sensors dedicated to space business. After a brief presentation of the team organisation for space image sensor design and production, the latest results of the characterisation of a high-performance 512x512 pixel CMOS device are presented, with emphasis on the achieved electro-optical performance. Finally, the ongoing and upcoming activities of the team are discussed.
Stochastic Dynamics for Video Infilling
In this paper, we introduce a stochastic dynamics video infilling (SDVI)
framework to generate frames between long intervals in a video. Our task
differs from video interpolation, which aims to produce transitional frames for
the short interval between every two consecutive frames and thereby increase the temporal resolution.
Our task, namely video infilling, however, aims to infill long intervals with
plausible frame sequences. Our framework models the infilling as a constrained
stochastic generation process and sequentially samples dynamics from the
inferred distribution. SDVI consists of two parts: (1) a bi-directional
constraint propagation module to guarantee the spatial-temporal coherence among
frames, (2) a stochastic sampling process to generate dynamics from the
inferred distributions. Experimental results show that SDVI can generate clear
frame sequences with varying contents. Moreover, motions in the generated
sequences are realistic and transition smoothly from the given start frame
to the terminal frame. Our project site is
https://xharlie.github.io/projects/project_sites/SDVI/video_results.html
Comment: Winter Conference on Applications of Computer Vision (WACV 2020)
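The two ingredients named in the abstract, constraint propagation from both endpoint frames and stochastic sampling, can be caricatured in a few lines of numpy. This toy replaces the learned, inferred distributions with a Gaussian whose mean linearly blends the start and end frames; it illustrates the structure only and is not SDVI itself. All names are hypothetical.

```python
import numpy as np

def sdvi_sketch(start, end, n_frames, noise=0.05, seed=0):
    """Infill n_frames frames between start and end. A blend of the
    two endpoint frames, weighted by temporal position, stands in for
    the bi-directional constraint (the mean of an 'inferred'
    distribution); each infilled frame is a stochastic sample
    around that mean."""
    rng = np.random.default_rng(seed)
    frames = []
    for t in range(1, n_frames + 1):
        a = t / (n_frames + 1)            # relative position in the gap
        mean = (1 - a) * start + a * end  # forward + backward constraints
        frames.append(mean + rng.normal(0, noise, start.shape))
    return frames
```

With `noise=0` the sketch degenerates to plain linear interpolation, which makes the role of the stochastic term easy to see.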