530 research outputs found
Lucid Data Dreaming for Video Object Segmentation
Convolutional networks reach top quality in pixel-level video object
segmentation but require a large amount of training data (1k~100k) to deliver
such results. We propose a new training strategy which achieves
state-of-the-art results across three evaluation datasets while using 20x~1000x
less annotated data than competing methods. Our approach is suitable for both
single and multiple object segmentation. Instead of using large training sets
hoping to generalize across domains, we generate in-domain training data using
the provided annotation on the first frame of each video to synthesize ("lucid
dream") plausible future video frames. In-domain per-video training data allows
us to train high quality appearance- and motion-based models, as well as tune
the post-processing stage. This approach reaches competitive results
even when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the video object segmentation task a smaller
training set that is closer to the target domain is more effective. This
changes the mindset regarding how many training samples and general
"objectness" knowledge are required for the video object segmentation task.
Comment: Accepted in International Journal of Computer Vision (IJCV)
Dynablox: Real-time Detection of Diverse Dynamic Objects in Complex Environments
Real-time detection of moving objects is an essential capability for robots
acting autonomously in dynamic environments. We thus propose Dynablox, a novel
online mapping-based approach for robust moving object detection in complex
environments. The central idea of our approach is to incrementally estimate
high confidence free-space areas by modeling and accounting for sensing, state
estimation, and mapping limitations during online robot operation. The
spatio-temporally conservative free space estimate enables robust detection of
moving objects without making any assumptions on the appearance of objects or
environments. This allows deployment in complex scenes such as multi-storied
buildings or staircases, and for diverse moving objects such as people carrying
various items, doors swinging or even balls rolling around. We thoroughly
evaluate our approach on real-world data sets, achieving 86% IoU at 17 FPS in
typical robotic settings. The method outperforms a recent appearance-based
classifier and approaches the performance of offline methods. We demonstrate
its generality on a novel data set with rare moving objects in complex
environments. We make our efficient implementation and the novel data set
available as open source.
Comment: Code released at https://github.com/ethz-asl/dynablo
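The 86% IoU figure above refers to intersection over union between predicted and ground-truth dynamic labels. A minimal sketch of that metric follows, assuming per-point boolean labels; the function name `iou` and the toy labels are illustrative and are not taken from the Dynablox code base.

```python
def iou(predicted, ground_truth):
    """Intersection over union of two equal-length boolean label sequences.

    Each entry marks whether a point (or pixel) is labelled as dynamic.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(predicted, ground_truth))
    intersection = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    # Convention assumed here: two empty label sets agree perfectly.
    return intersection / union if union else 1.0


# Toy example: 2 points labelled dynamic by both, 4 labelled by at least one.
predicted = [True, True, False, True, False]
ground_truth = [True, False, False, True, True]
print(iou(predicted, ground_truth))  # prints 0.5
```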
Towards Interaction-level Video Action Understanding
A huge number of videos are created, shared, and viewed every day, and human actions and activities account for a large part of their content. We want machines to understand human actions in videos because this is essential to many applications, including but not limited to autonomous driving, security systems, human-robot interaction and healthcare. To build a truly intelligent system that can interact with humans, video understanding must go beyond simply answering "what is the action in the video" and become more aware of what those actions mean to humans and more in line with human thinking, which we call interactive-level action understanding. This thesis identifies three main challenges in approaching interactive-level video action understanding: 1) understanding actions given human consensus; 2) understanding actions based on specific human rules; 3) directly understanding actions in videos via human natural language. For the first challenge, we select video summarization as a representative task, which aims to select informative frames that retain high-level information based on human annotators' experience. Through a self-attention architecture and meta-learning, which jointly process dual representations of visual and sequential information for video summarization, the proposed model is capable of understanding video from human consensus (e.g., which parts of an action sequence humans consider essential). For the second challenge, our work on action quality assessment uses transformer decoders to parse the input action into several sub-actions and assess the more fine-grained qualities of the given action, yielding the capability of action understanding given specific human rules (e.g., how well a diving action is performed, or how well a robot performs surgery). The third key idea explored in this thesis is to use graph neural networks in an adversarial fashion to understand actions through natural language.
We demonstrate the utility of this technique on the video captioning task, which takes an action video as input and outputs natural language, where it yields state-of-the-art performance. We conclude that the research directions and methods introduced in this thesis provide fundamental components toward interactive-level action understanding.
Visual Place Recognition for Autonomous Robots
Autonomous robotics has been the subject of great interest within the research community over the past few decades. Its applications are widespread, ranging from health-care to manufacturing, goods transportation to home deliveries, site-maintenance to construction, planetary explorations to rescue operations and many others, including but not limited to agriculture, defence, commerce, leisure and extreme environments. At the core of robot autonomy lies the problem of localisation, i.e., knowing where the robot is; within the robotics community, this problem is termed place recognition. Place recognition using only visual input is termed Visual Place Recognition (VPR) and refers to the ability of an autonomous system to recall a previously visited place using only visual input, under changing viewpoint, illumination and seasonal conditions, and given computational and storage constraints.
This thesis is a collection of 4 inter-linked, mutually-relevant but branching-out topics within VPR: 1) What makes a place/image worthy for VPR?, 2) How to define a state-of-the-art in VPR?, 3) Do VPR techniques designed for ground-based platforms extend to aerial platforms? and 4) Can a handcrafted VPR technique outperform deep-learning-based VPR techniques? Each of these questions is a dedicated, peer-reviewed chapter in this thesis and the author attempts to answer these questions to the best of his abilities.
The worthiness of a place essentially refers to the salience and distinctiveness of the content in the image of this place. This salience is modelled as a framework, namely memorable-maps, comprising 3 conjoint criteria: 1) Human-memorability of an image, 2) Staticity and 3) Information content. Because a large number of VPR techniques have been proposed over the past 10-15 years, and due to the variation of employed VPR datasets and metrics for evaluation, the correct state-of-the-art remains ambiguous. The author levels this playing field by deploying 10 contemporary techniques on a common platform and using the most challenging VPR datasets to provide a holistic performance comparison. This platform is then extended to aerial place recognition datasets to answer the 3rd question above. Finally, the author designs a novel, handcrafted, compute-efficient and training-free VPR technique that outperforms state-of-the-art VPR techniques on 5 different VPR datasets.
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Inspired by the fact that human brains can emphasize discriminative parts of
the input and suppress irrelevant ones, substantial local mechanisms have been
designed to boost the development of computer vision. They can not only focus
on target parts to learn discriminative local representations, but also process
information selectively to improve the efficiency. In terms of application
scenarios and paradigms, local mechanisms have different characteristics. In
this survey, we provide a systematic review of local mechanisms for various
computer vision tasks and approaches, including fine-grained visual
recognition, person re-identification, few-/zero-shot learning, multi-modal
learning, self-supervised learning, Vision Transformers, and so on.
Categorization of local mechanisms in each field is summarized. Then,
advantages and disadvantages for every category are analyzed deeply, leaving
room for exploration. Finally, future research directions for local
mechanisms are discussed that may benefit future work. To the best of
our knowledge, this is the first survey of local mechanisms in computer
vision. We hope that this survey can shed light on future research in the
computer vision field.
Survey on video anomaly detection in dynamic scenes with moving cameras
The increasing popularity of compact and inexpensive cameras, e.g. dash
cameras, body cameras, and cameras equipped on robots, has sparked a growing
interest in detecting anomalies within dynamic scenes recorded by moving
cameras. However, existing reviews primarily concentrate on Video Anomaly
Detection (VAD) methods assuming static cameras. The VAD literature with moving
cameras remains fragmented, lacking comprehensive reviews to date. To address
this gap, we endeavor to present the first comprehensive survey on Moving
Camera Video Anomaly Detection (MC-VAD). We delve into the research papers
related to MC-VAD, critically assessing their limitations and highlighting
associated challenges. Our exploration encompasses three application domains:
security, urban transportation, and marine environments, which in turn cover
six specific tasks. We compile an extensive list of 25 publicly-available
datasets spanning four distinct environments: underwater, water surface,
ground, and aerial. We summarize the types of anomalies these datasets
correspond to or contain, and present five main categories of approaches for
detecting such anomalies. Lastly, we identify future research directions and
discuss novel contributions that could advance the field of MC-VAD. With this
survey, we aim to offer a valuable reference for researchers and practitioners
striving to develop and advance state-of-the-art MC-VAD methods.
Comment: Under review
BLiRF: Bandlimited Radiance Fields for Dynamic Scene Modeling
Reasoning about the 3D structure of a non-rigid dynamic scene from a single moving
camera is an under-constrained problem. Inspired by the remarkable progress of
neural radiance fields (NeRFs) in photo-realistic novel view synthesis of
static scenes, extensions have been proposed for dynamic settings. These
methods heavily rely on neural priors in order to regularize the problem. In
this work, we take a step back and reinvestigate how current implementations
may entail deleterious effects, including limited expressiveness, entanglement
of light and density fields, and sub-optimal motion localization. As a remedy,
we advocate for a bridge between classic non-rigid-structure-from-motion
(NRSfM) and NeRF, enabling the well-studied priors of the former to constrain
the latter. To this end, we propose a framework that factorizes time and space
by formulating a scene as a composition of bandlimited, high-dimensional
signals. We demonstrate compelling results across complex dynamic scenes that
involve changes in lighting, texture and long-range dynamics.
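To make the idea of a bandlimited temporal signal concrete, the sketch below builds a time series from a small number of low-frequency cosine basis functions. The abstract does not say which basis BLiRF uses; a truncated DCT basis is assumed here because it is a common choice in trajectory-basis NRSfM. The names `dct_basis` and `bandlimited_signal` are hypothetical.

```python
import math


def dct_basis(num_frames, num_bases):
    """Sample the first num_bases DCT-II basis functions at num_frames time steps.

    Row k holds cos(pi * (2t + 1) * k / (2 * num_frames)) for t = 0..num_frames-1,
    so a small num_bases keeps only slow temporal variation.
    """
    return [
        [math.cos(math.pi * (2 * t + 1) * k / (2 * num_frames))
         for t in range(num_frames)]
        for k in range(num_bases)
    ]


def bandlimited_signal(coefficients, num_frames):
    """Combine the low-frequency basis functions with the given coefficients."""
    basis = dct_basis(num_frames, len(coefficients))
    return [
        sum(c * basis[k][t] for k, c in enumerate(coefficients))
        for t in range(num_frames)
    ]


# Assumed toy setup: 8 frames described by only 2 coefficients per coordinate.
trajectory_x = bandlimited_signal([1.0, 0.5], num_frames=8)
print(len(trajectory_x))  # prints 8
```

Representing each time-varying quantity with far fewer coefficients than frames is what restricts the motion to smooth, low-frequency behaviour in this sketch.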