Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Inspired by the fact that human brains can emphasize discriminative parts of
the input and suppress irrelevant ones, numerous local mechanisms have been
designed to boost the development of computer vision. They can not only focus
on target parts to learn discriminative local representations, but also process
information selectively to improve efficiency. In terms of application
scenarios and paradigms, local mechanisms have different characteristics. In
this survey, we provide a systematic review of local mechanisms for various
computer vision tasks and approaches, including fine-grained visual
recognition, person re-identification, few-/zero-shot learning, multi-modal
learning, self-supervised learning, Vision Transformers, and so on.
Categorization of local mechanisms in each field is summarized. Then,
advantages and disadvantages for every category are analyzed deeply, leaving
room for exploration. Finally, future research directions for local
mechanisms that may benefit future work are discussed. To the best of
our knowledge, this is the first survey of local mechanisms in computer
vision. We hope that this survey can shed light on future research in the
computer vision field.
Co-attention Propagation Network for Zero-Shot Video Object Segmentation
Zero-shot video object segmentation (ZS-VOS) aims to segment foreground
objects in a video sequence without prior knowledge of these objects. However,
existing ZS-VOS methods often struggle to distinguish between foreground and
background or to keep track of the foreground in complex scenarios. The common
practice of introducing motion information, such as optical flow, can lead to
overreliance on optical flow estimation. To address these challenges, we
propose an encoder-decoder-based hierarchical co-attention propagation network
(HCPN) capable of tracking and segmenting objects. Specifically, our model is
built upon multiple collaborative evolutions of the parallel co-attention
module (PCM) and the cross co-attention module (CCM). PCM captures common
foreground regions among adjacent appearance and motion features, while CCM
further exploits and fuses cross-modal motion features returned by PCM. Our
method is progressively trained to achieve hierarchical spatio-temporal feature
propagation across the entire video. Experimental results demonstrate that our
HCPN outperforms all previous methods on public benchmarks, showcasing its
effectiveness for ZS-VOS.
Comment: accepted by IEEE Transactions on Image Processin
Spott : on-the-spot e-commerce for television using deep learning-based video analysis techniques
Spott is an innovative second-screen mobile multimedia application which offers viewers relevant information on objects (e.g., clothing, furniture, food) they see and like on their television screens. The application enables interaction between TV audiences and brands, so producers and advertisers can offer potential consumers tailored promotions, e-shop items, and/or free samples. In line with the current views on innovation management, the technological excellence of the Spott application is coupled with iterative user involvement throughout the entire development process. This article discusses both of these aspects and how they impact each other. First, we focus on the technological building blocks that facilitate the (semi-)automatic interactive tagging process of objects in the video streams. The majority of these building blocks make extensive use of novel and state-of-the-art deep learning concepts and methodologies. We show how these deep learning-based video analysis techniques facilitate video summarization, semantic keyframe clustering, and (similar) object retrieval. Second, we provide insights into the user tests that have been performed to evaluate and optimize the application's user experience. The lessons learned from these open field tests have already been an essential input to the technology development and will further shape future modifications to the Spott application.
MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
Video object segmentation (VOS) aims at segmenting a particular object
throughout the entire video clip sequence. The state-of-the-art VOS methods
have achieved excellent performance (e.g., 90+% J&F) on existing datasets.
However, since the target objects in these existing datasets are usually
relatively salient, dominant, and isolated, VOS under complex scenes has rarely
been studied. To revisit VOS and make it more applicable in the real world, we
collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to
study the tracking and segmentation of objects in complex environments. MOSE
contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725
high-quality object segmentation masks. The most notable feature of the MOSE
dataset is complex scenes with crowded and occluded objects. The target objects
in the videos are commonly occluded by others and disappear in some frames. To
analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4
different settings on the proposed MOSE dataset and conduct comprehensive
comparisons. The experiments show that current VOS algorithms cannot perceive
objects well in complex scenes. For example, under the semi-supervised VOS
setting, the highest J&F by existing state-of-the-art VOS methods is only 59.4%
on MOSE, much lower than their ~90% J&F performance on DAVIS. The results
reveal that although excellent performance has been achieved on existing
benchmarks, there are unresolved challenges under complex scenes and more
effort is needed to explore these challenges in the future. The proposed
MOSE dataset has been released at https://henghuiding.github.io/MOSE.
Comment: MOSE Dataset Repor
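The J&F figures quoted in the abstract above combine two scores: J, the region similarity (Jaccard index, or intersection over union, between predicted and ground-truth masks), and F, the contour accuracy. The following is a minimal sketch of how J and the combined score could be computed, not the benchmark's official evaluation code. The function names, the nested-list mask format, the empty-mask convention, and the example F value are all assumptions made for illustration; the boundary-based F computation itself is omitted.

```python
def region_similarity(pred, gt):
    """Jaccard index (J) between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_mean, f_mean):
    """Combined score: the average of mean J and mean F."""
    return (j_mean + f_mean) / 2.0


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]

j = region_similarity(pred, gt)   # 2 / 4 = 0.5
f = 0.7                           # hypothetical contour accuracy, not computed here
score = j_and_f(j, f)
print(f"J = {j:.2f}, J&F = {score:.2f}")
```

In a real evaluation, J and F would be averaged over all annotated frames and objects before being combined.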
Learning from Very Few Samples: A Survey
Few sample learning (FSL) is significant and challenging in the field of
machine learning. The capability of learning and generalizing from very few
samples successfully is a noticeable demarcation separating artificial
intelligence and human intelligence since humans can readily establish their
cognition to novelty from just a single or a handful of examples whereas
machine learning algorithms typically entail hundreds or thousands of
supervised samples to guarantee generalization ability. Despite a long
history dating back to the early 2000s and widespread attention in recent
years with booming deep learning technologies, few surveys or reviews of
FSL have been available until now. In this context, we extensively review 300+ papers
of FSL spanning from the 2000s to 2019 and provide a timely and comprehensive
survey for FSL. In this survey, we review the evolution history as well as the
current progress on FSL, categorize FSL approaches in principle into generative
model-based and discriminative model-based kinds, and place particular
emphasis on meta-learning-based FSL approaches. We also summarize
several recently emerging extensional topics of FSL and review the latest
advances on these topics. Furthermore, we highlight the important FSL
applications covering many research hotspots in computer vision, natural
language processing, audio and speech, reinforcement learning and robotics, data
analysis, etc. Finally, we conclude the survey with a discussion on promising
trends in the hope of providing guidance and insights to follow-up research.
Comment: 30 page