GCoNet+: A Stronger Group Collaborative Co-Salient Object Detector
In this paper, we present a novel end-to-end group collaborative learning
network, termed GCoNet+, which can effectively and efficiently (250 fps)
identify co-salient objects in natural scenes. The proposed GCoNet+ achieves
the new state-of-the-art performance for co-salient object detection (CoSOD)
through mining consensus representations based on the following two essential
criteria: 1) intra-group compactness to better formulate the consistency among
co-salient objects by capturing their inherent shared attributes using our
novel group affinity module (GAM); 2) inter-group separability to effectively
suppress the influence of noisy objects on the output by introducing our new
group collaborating module (GCM) conditioning on the inconsistent consensus. To
further improve the accuracy, we design a series of simple yet effective
components as follows: i) a recurrent auxiliary classification module (RACM)
promoting the model learning at the semantic level; ii) a confidence
enhancement module (CEM) helping the model to improve the quality of the final
predictions; and iii) a group-based symmetric triplet (GST) loss guiding the
model to learn more discriminative features. Extensive experiments on three
challenging benchmarks, i.e., CoCA, CoSOD3k, and CoSal2015, demonstrate that
our GCoNet+ outperforms the existing 12 cutting-edge models. Code has been
released at https://github.com/ZhengPeng7/GCoNet_plus
Salient Object Detection Techniques in Computer Vision-A Survey.
Detection and localization of image regions that attract immediate human visual attention is currently an intensive area of research in computer vision. The capability of automatic identification and segmentation of such salient image regions has immediate consequences for applications in the fields of computer vision, computer graphics, and multimedia. A large number of salient object detection (SOD) methods have been devised to effectively mimic the capability of the human visual system to detect the salient regions in images. These methods can be broadly divided into two categories based on their feature engineering mechanism: conventional or deep learning-based. In this survey, most of the influential advances in image-based SOD from both the conventional and the deep learning-based categories are reviewed in detail. Relevant saliency modeling trends with key issues, core techniques, and the scope for future research work are discussed in the context of difficulties often faced in salient object detection. Results are presented for various challenging cases on some large-scale public datasets. Different metrics considered for assessing the performance of state-of-the-art salient object detection models are also covered. Some future directions for SOD are presented towards the end.
Towards holistic scene understanding: Semantic segmentation and beyond
This dissertation addresses visual scene understanding and enhances
segmentation performance and generalization, training efficiency of networks,
and holistic understanding. First, we investigate semantic segmentation in the
context of street scenes and train semantic segmentation networks on
combinations of various datasets. In Chapter 2 we design a framework of
hierarchical classifiers over a single convolutional backbone, and train it
end-to-end on a combination of pixel-labeled datasets, improving
generalizability and the number of recognizable semantic concepts. Chapter 3
focuses on enriching semantic segmentation with weak supervision and proposes a
weakly-supervised algorithm for training with bounding box-level and
image-level supervision instead of only with per-pixel supervision. The memory
and computational load challenges that arise from simultaneous training on
multiple datasets are addressed in Chapter 4. We propose two methodologies for
selecting informative and diverse samples from datasets with weak supervision
to reduce our networks' ecological footprint without sacrificing performance.
Motivated by memory and computation efficiency requirements, in Chapter 5, we
rethink simultaneous training on heterogeneous datasets and propose a universal
semantic segmentation framework. This framework achieves consistent increases
in performance metrics and semantic knowledgeability by exploiting various
scene understanding datasets. Chapter 6 introduces the novel task of part-aware
panoptic segmentation, which extends our reasoning towards holistic scene
understanding. This task combines scene and parts-level semantics with
instance-level object detection. In conclusion, our contributions span
convolutional network architectures, weakly-supervised learning, part and
panoptic segmentation, paving the way towards a holistic, rich, and sustainable
visual scene understanding.
Comment: PhD Thesis, Eindhoven University of Technology, October 202
Human machine collaboration for foreground segmentation in images and videos
Foreground segmentation is defined as the problem of generating pixel level foreground masks for all the objects in a given image or video. Accurate foreground segmentations in images and videos have several potential applications such as improving search, training richer object detectors, image synthesis and re-targeting, scene and activity understanding, video summarization, and post-production video editing.
One effective way to solve this problem is human-machine collaboration. The main idea is to let humans guide the segmentation process through some partial supervision. As humans, we are extremely good at perception and can easily identify the foreground regions. Computers, on the other hand, lack this capability, but are extremely good at continuously processing large volumes of data at the lowest level of detail with great efficiency. Bringing these complementary strengths together can lead to systems which are accurate and cost-effective at the same time. However, in any such human-machine collaboration system, cost effectiveness and higher accuracy are competing goals. While more involvement from humans can certainly lead to higher accuracy, it also leads to increased cost both in terms of time and money. On the other hand, relying more on machines is cost-effective, but algorithms are still nowhere near human-level performance. Balancing this cost versus accuracy trade-off holds the key behind success for such a hybrid system.
In this thesis, I develop foreground segmentation algorithms which effectively and efficiently make use of human guidance for accurately segmenting foreground objects in images and videos. The algorithms developed in this thesis actively reason about the best modalities or interactions through which a user can provide guidance to the system for generating accurate segmentations. At the same time, these algorithms are also capable of prioritizing human guidance on instances where it is most needed. Finally, when structural similarity exists within data (e.g., adjacent frames in a video or similar images in a collection), the algorithms developed in this thesis are capable of propagating information from instances which have received human guidance to the ones which did not. Together, these characteristics result in substantial savings in human annotation cost while generating high-quality foreground segmentations in images and videos.
In this thesis, I consider three categories of segmentation problems, all of which can greatly benefit from human-machine collaboration. First, I consider the problem of interactive image segmentation. In traditional interactive methods, a human annotator provides a coarse spatial annotation (e.g., bounding box or freehand outlines) around the object of interest to obtain a segmentation. The mode of manual annotation used affects both its accuracy and ease of use. Whereas existing methods assume a fixed form of input no matter the image, in this thesis I propose a data-driven algorithm which learns whether an interactive segmentation method will succeed if initialized with a given annotation mode. This allows us to predict the modality that will be sufficiently strong to yield a high-quality segmentation for a given image and results in large savings in annotation costs. I also propose a novel interactive segmentation algorithm called Click Carving which can accurately segment objects in images and videos using a very simple form of human interaction: point clicks. It outperforms several state-of-the-art methods and requires only a fraction of the human effort in comparison.
Second, I consider the problem of segmenting images in a weakly supervised image collection. Here, we are given a collection of images all belonging to the same object category and the goal is to jointly segment the common object from all the images. For this, I develop a stagewise active approach to segmentation propagation: in each stage, the images that appear most valuable for human annotation are actively determined and labeled by human annotators, then the foreground estimates are revised in all unlabeled images accordingly. In order to identify images that, once annotated, will propagate well to other examples, I introduce an active selection procedure that operates on the joint segmentation graph over all images. It prioritizes human intervention for those images that are uncertain and influential in the graph, while also mutually diverse. Building on this, I also introduce the problem of measuring compatibility between image pairs for joint segmentation. I show that restricting the joint segmentation to only compatible image pairs results in an improved joint segmentation performance.
Finally, I propose a semi-supervised approach for segmentation propagation in video. Given human supervision in some frames of a video, this information can be propagated through time. The main challenge is that the foreground object may move quickly in the scene while its appearance and shape evolve over time. To address this, I propose a higher-order supervoxel label consistency potential which leverages bottom-up supervoxels to enforce long-range temporal consistency during propagation. I also introduce the notion of a generic pixel-level objectness in images and videos by training a deep neural network which uses appearance and motion to automatically assign a score to each pixel capturing its likelihood to be an "object" or "background". I show that the human guidance in the semi-supervised propagation algorithm can be further augmented with the generic pixel-objectness scores to obtain an even more accurate foreground segmentation in videos.
Throughout, I provide extensive evaluation on challenging datasets and also compare with many state-of-the-art methods and other baselines, validating the strengths of the proposed algorithms. The outcomes across several different experiments show that the proposed human-machine collaboration algorithms achieve accurate segmentation of foreground objects in images and videos while saving a large amount of human annotation effort.
Computer Science
Stochastic Methods for Fine-Grained Image Segmentation and Uncertainty Estimation in Computer Vision
In this dissertation, we exploit concepts of probability theory, stochastic methods and machine learning to address three existing limitations of deep learning-based models for image understanding. First, although convolutional neural networks (CNN) have substantially improved the state of the art in image understanding, conventional CNNs provide segmentation masks that poorly adhere to object boundaries, a critical limitation for many potential applications. Second, training deep learning models requires large amounts of carefully selected and annotated data, but large-scale annotation of image segmentation datasets is often prohibitively expensive. And third, conventional deep learning models also lack the capability of uncertainty estimation, which compromises both decision making and model interpretability. To address these limitations, we introduce the Region Growing Refinement (RGR) algorithm, an unsupervised post-processing algorithm that exploits Monte Carlo sampling and pixel similarities to propagate high-confidence labels into regions of low-confidence classification. The probabilistic Region Growing Refinement (pRGR) provides RGR with a rigorous mathematical foundation that exploits concepts of Bayesian estimation and variance reduction techniques. Experiments demonstrate both the effectiveness of (p)RGR for the refinement of segmentation predictions and its suitability for uncertainty estimation, since the variance estimates obtained in the Monte Carlo iterations are highly correlated with segmentation accuracy. We also introduce FreeLabel, an intuitive open-source web interface that exploits RGR to allow users to obtain high-quality segmentation masks with just a few freehand scribbles, in a matter of seconds. Designed to benefit the computer vision community, FreeLabel can be used for either crowdsourced or private annotation and has a modular structure that can be easily adapted for any image dataset.
The practical relevance of the methods developed in this dissertation is illustrated through applications in agricultural and healthcare-related domains. We have combined RGR and modern CNNs for fine segmentation of fruit flowers, motivated by the importance of automated bloom intensity estimation for optimizing fruit orchard management and, possibly, automating procedures such as flower thinning and pollination. We also used an early version of FreeLabel to annotate novel datasets for segmentation of fruit flowers, which are currently publicly available. Finally, this dissertation also describes work on fine segmentation and gaze estimation for images collected from assisted living environments, with the ultimate goal of assisting geriatricians in evaluating the health status of patients in such facilities.
Robust Normalized Softmax Loss for Deep Metric Learning-Based Characterization of Remote Sensing Images With Label Noise
Most deep metric learning-based image characterization methods exploit supervised information to model the semantic relations among remote sensing (RS) scenes. Nonetheless, the unprecedented availability of large-scale RS data makes the annotation of such images very challenging, requiring automated supportive processes. Whether the annotation is assisted by aggregation or crowd-sourcing, the RS large-variance problem, together with other important factors [e.g., geo-location/registration errors, land-cover changes, even low-quality Volunteered Geographic Information (VGI), etc.], often introduces so-called label noise, i.e., semantic annotation errors. In this article, we first investigate the deep metric learning-based characterization of RS images with label noise and propose a novel loss formulation, named robust normalized softmax loss (RNSL), for robustly learning the metrics among RS scenes. Specifically, our RNSL improves the robustness of the normalized softmax loss (NSL), commonly utilized for deep metric learning, by replacing its logarithmic function with the negative Box–Cox transformation in order to down-weight the contributions from noisy images on the learning of the corresponding class prototypes. Moreover, by truncating the loss with a certain threshold, we also propose a truncated robust normalized softmax loss (t-RNSL) which can further enforce the learning of class prototypes based on the image features with high similarities between them, so that the intraclass features can be well grouped and interclass features can be well separated. Our experiments, conducted on two benchmark RS data sets, validate the effectiveness of the proposed approach with respect to different state-of-the-art methods in three different downstream applications (classification, clustering, and retrieval). The codes of this article will be publicly available from https://github.com/jiankang1991
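The replacement of the logarithm by a negative Box–Cox transformation, described in the abstract above, can be illustrated with a minimal sketch. This is not the authors' released code. The function names, the temperature value, the exponent `q`, the threshold, and the exact form of the truncation (a constant loss below the threshold) are assumptions made for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def class_probability(feature, prototypes, label, temperature=0.05):
    """Softmax probability of `label` from cosine similarities to class prototypes.

    The temperature value is a hypothetical choice, not taken from the article.
    """
    f = l2_normalize(feature)
    logits = []
    for prototype in prototypes:
        w = l2_normalize(prototype)
        logits.append(sum(a * b for a, b in zip(f, w)) / temperature)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    return exps[label] / sum(exps)


def nsl(p):
    """Normalized softmax loss on the true-class probability: -log(p)."""
    return -math.log(p)


def rnsl(p, q=0.5):
    """Negative Box-Cox transform of p: (1 - p**q) / q.

    It tends to -log(p) as q approaches 0 and is bounded by 1/q, so samples
    with very low true-class probability contribute less than under -log(p).
    """
    return (1.0 - p ** q) / q


def truncated_rnsl(p, q=0.5, threshold=0.3):
    """Assumed truncation: below `threshold` the loss is held constant."""
    return rnsl(max(p, threshold), q)


if __name__ == "__main__":
    p_clean = class_probability([1.0, 0.1], [[1.0, 0.0], [0.0, 1.0]], label=0)
    p_noisy = 0.01  # a sample whose given label disagrees with its features
    print("clean sample:", nsl(p_clean), rnsl(p_clean))
    print("noisy sample:", nsl(p_noisy), rnsl(p_noisy), truncated_rnsl(p_noisy))
```

Under these assumptions, a sample with a very low true-class probability incurs a bounded loss instead of the unbounded penalty of the logarithm, which is one way to read the down-weighting of noisy images that the abstract describes.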