16 research outputs found

    Salient Object Detection Techniques in Computer Vision-A Survey.

    Detection and localization of image regions that attract immediate human visual attention is currently an intensive area of research in computer vision. The capability to automatically identify and segment such salient image regions has immediate consequences for applications in computer vision, computer graphics, and multimedia. A large number of salient object detection (SOD) methods have been devised to effectively mimic the capability of the human visual system to detect the salient regions in images. These methods can be broadly divided into two categories based on their feature engineering mechanism: conventional or deep learning-based. In this survey, most of the influential advances in image-based SOD from both the conventional and the deep learning-based categories are reviewed in detail. Relevant saliency modeling trends, with key issues, core techniques, and the scope for future research work, are discussed in the context of difficulties often faced in salient object detection. Results are presented for various challenging cases on some large-scale public datasets. Different metrics used to assess the performance of state-of-the-art salient object detection models are also covered. Some future directions for SOD are presented towards the end.
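
The abstract above does not name the metrics the survey covers. As a hedged illustration only, the sketch below computes two measures that are widely used in the SOD literature: mean absolute error (MAE) and the F-measure with the conventional weighting β² = 0.3. The function names, the fixed binarization threshold of 0.5, and the toy 2x2 maps are assumptions made for this example, not details taken from the survey.

```python
def mae(saliency, truth):
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    flat_s = [v for row in saliency for v in row]
    flat_t = [v for row in truth for v in row]
    return sum(abs(s - t) for s, t in zip(flat_s, flat_t)) / len(flat_s)


def f_measure(saliency, truth, threshold=0.5, beta_sq=0.3):
    """F-measure of the thresholded saliency map against the binary mask.

    threshold and beta_sq are illustrative defaults (assumptions).
    """
    pred = [v >= threshold for row in saliency for v in row]
    gt = [v >= 0.5 for row in truth for v in row]
    true_pos = sum(p and g for p, g in zip(pred, gt))
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred)
    recall = true_pos / sum(gt)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Toy example: one salient pixel in the ground truth, two predicted as salient.
saliency_map = [[0.9, 0.2],
                [0.6, 0.1]]
ground_truth = [[1, 0],
                [0, 0]]

print("MAE:", mae(saliency_map, ground_truth))
print("F-measure:", f_measure(saliency_map, ground_truth))
```

MAE compares the raw saliency values with the mask, while the F-measure depends on how the map is binarized, which is why evaluations often report both.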

    Object-Aware Tracking and Mapping

    Reasoning about geometric properties of digital cameras and optical physics enabled researchers to build methods that localise cameras in 3D space from a video stream, while – often simultaneously – constructing a model of the environment. Related techniques have evolved substantially since the 1980s, leading to increasingly accurate estimations. Traditionally, however, the quality of results is strongly affected by the presence of moving objects, incomplete data, or difficult surfaces – i.e. surfaces that are not Lambertian or lack texture. One insight of this work is that these problems can be addressed by going beyond geometrical and optical constraints, in favour of object level and semantic constraints. Incorporating specific types of prior knowledge in the inference process, such as motion or shape priors, leads to approaches with distinct advantages and disadvantages. After introducing relevant concepts in Chapter 1 and Chapter 2, methods for building object-centric maps in dynamic environments using motion priors are investigated in Chapter 5. Chapter 6 addresses the same problem as Chapter 5, but presents an approach which relies on semantic priors rather than motion cues. To fully exploit semantic information, Chapter 7 discusses the conditioning of shape representations on prior knowledge and the practical application to monocular, object-aware reconstruction systems.

    Exploring Subtasks of Scene Understanding: Challenges and Cross-Modal Analysis

    Scene understanding is one of the most important problems in computer vision. It consists of many subtasks such as image classification for describing an image with one word, object detection for finding and localizing objects of interest in the image and assigning a category to each of them, semantic segmentation for assigning a category to each pixel of an image, instance segmentation for finding and localizing objects of interest and marking all the pixels belonging to each object, depth estimation for estimating the distance of each pixel in the image from the camera, etc. Each of these tasks has its advantages and limitations. They share a common goal: to understand and describe a scene captured in an image or a set of images. One common question is whether there is any synergy between these tasks. Therefore, alongside single-task approaches, there is a line of research on how to learn multiple tasks jointly. In this thesis, we explore different subtasks of scene understanding and propose mainly deep learning-based approaches to improve these tasks. First, we propose a modular Convolutional Neural Network (CNN) architecture for jointly training semantic segmentation and depth estimation tasks. We provide a setup suitable to analyze the cross-modality influence between these tasks for different architecture designs. Then, we utilize object detection and instance segmentation as auxiliary tasks for focusing on target objects in the complex tasks of scene flow estimation and object 6D pose estimation. Furthermore, we propose a novel deep approach for object co-segmentation, which is the task of segmenting common objects in a set of images. Finally, we introduce a novel pooling layer that preserves the spatial information while capturing a large receptive field. This pooling layer is designed for improving dense prediction tasks such as semantic segmentation and depth estimation.

    Deep Learning for 2D and 3D Scene Understanding

    This thesis comprises a body of work that investigates the use of deep learning for 2D and 3D scene understanding. Although there has been significant progress made in computer vision using deep learning, a lot of that progress has been relative to performance benchmarks, and for static images; it is common to find that good performance on one benchmark does not necessarily mean good generalization to the kind of viewing conditions that might be encountered by an autonomous robot or agent. In this thesis, we address a variety of problems motivated by the desire to see deep learning algorithms generalize better to robotic vision scenarios. Specifically, we span topics of multi-object detection, unsupervised domain adaptation for semantic segmentation, video object segmentation, and semantic scene completion. First, most modern object detectors use a final post-processing step known as non-maximum suppression (GreedyNMS). This suffers an inevitable trade-off between precision and recall in crowded scenes. To overcome this limitation, we propose a Pairwise-NMS to remedy GreedyNMS. Specifically, a pairwise-relationship network that is based on deep learning is learned to predict if two overlapping proposal boxes contain two objects or zero/one object, which can handle multiple overlapping objects effectively. A common issue in training deep neural networks is the need for large training sets. One approach to this is to use simulated image and video data, but this suffers from a domain gap wherein the performance on real-world data is poor relative to performance on the simulation data.
We target a few approaches to addressing so-called domain adaptation for semantic segmentation: (1) single and multiple exemplars are employed for each class in order to cluster the per-pixel features in the embedding space; (2) a class-balanced self-training strategy is utilized for generating pseudo labels in the target domain; (3) a convolutional adaptor is adopted to enforce that the features in the source and target domains are close to each other. Next, we tackle video object segmentation by formulating it as a meta-learning problem, where the base learner aims to learn semantic scene understanding for general objects, and the meta learner quickly adapts to the appearance of the target object with a few examples. Our proposed meta-learning method uses a closed-form optimizer, the so-called "ridge regression", which is conducive to fast and better training convergence. One-shot video object segmentation (OSVOS) has the limitation that it "overemphasizes" the generic semantic object information while "diluting" the instance cues of the object(s), which largely blocks the whole training process. Through adding a common module, video loss, which we formulate with various forms of constraints (including weighted BCE loss, high-dimensional triplet loss, as well as a novel mixed instance-aware video loss), to train the parent network, the network is then better prepared for the online fine-tuning. Next, we introduce a light-weight Dimensional Decomposition Residual network (DDR) for 3D dense prediction tasks. The novel factorized convolution layer is effective for reducing the network parameters, and the proposed multi-scale fusion mechanism for depth and color images can improve the completion and segmentation accuracy simultaneously. Moreover, we propose PALNet, a novel hybrid network for Semantic Scene Completion (SSC) based on a single depth image.
PALNet utilizes a two-stream network to extract both 2D and 3D features from multiple stages using fine-grained depth information to efficiently capture the context, as well as the geometric cues of the scene. Position Aware Loss (PA-Loss) considers Local Geometric Anisotropy to determine the importance of different positions within the scene. It is beneficial for recovering key details like the boundaries of objects and the corners of the scene. Finally, we propose a 3D gated recurrent fusion network (GRFNet), which learns to adaptively select and fuse the relevant information from depth and RGB by making use of the gate and memory modules. Based on the single-stage fusion, we further propose a multi-stage fusion strategy, which could model the correlations among different stages within the network. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
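
The greedy non-maximum suppression step that the abstract above takes as its starting point can be sketched as follows. This is a minimal illustration of standard GreedyNMS only, not the Pairwise-NMS proposed in the thesis. The `(x1, y1, x2, y2)` box format, the function names, and the example thresholds are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    A box is kept only if its IoU with every already-kept box
    does not exceed iou_threshold (an assumed default of 0.5).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes (IoU = 80 / 120) and one isolated box.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]

print(greedy_nms(boxes, scores, iou_threshold=0.5))  # suppresses the second box
print(greedy_nms(boxes, scores, iou_threshold=0.7))  # keeps all three boxes
```

The two calls show the trade-off mentioned in the abstract: a low threshold may drop a genuine second object in a crowded scene (lower recall), while a high threshold lets duplicate detections through (lower precision).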

    A comprehensive review on 3D object detection and 6D pose estimation with deep learning

    Nowadays, 3D (three-dimensional) object detection and 6D (six degrees of freedom) pose estimation are widely discussed and studied in computer vision. In the 3D object detection process, classifications are centered on the object's size, position, and direction. In 6D pose estimation, networks emphasize 3D translation and rotation vectors. Successful application of these strategies can have a huge impact on various machine learning-based applications, including autonomous vehicles, the robotics industry, and the augmented reality sector. Although extensive work has been done on 3D object detection with pose estimation from RGB images, the challenges have not been fully resolved. Our analysis provides a comprehensive review of the proposed contemporary techniques for complete 3D object detection and the recovery of the 6D pose of an object. In this review paper, we discuss several proposed sophisticated methods in 3D object detection and 6D pose estimation, including some popular data sets, evaluation metrics, and the challenges of the proposed methods. Most importantly, this study makes an effort to offer some possible future directions in 3D object detection and 6D pose estimation. We take the autonomous vehicle as the sample case for this detailed review. Finally, this review provides a complete overview of the latest deep learning-based research studies related to 3D object detection and 6D pose estimation systems and points out a comparison between some popular frameworks. In short, we present a detailed summary of the state-of-the-art techniques of modern deep learning-based object detection and pose estimation models.
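
To make the "6D" in the abstract above concrete: a 6D pose combines a 3D rotation with a 3D translation, and together they map points from the object's own frame into the camera (or world) frame. The sketch below is a minimal illustration under simplifying assumptions. The rotation is restricted to a single yaw angle about the z-axis, and the function names are hypothetical and not taken from the reviewed paper.

```python
import math


def rotation_z(yaw):
    """3x3 rotation matrix for a rotation of `yaw` radians about the z-axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def apply_pose(rotation, translation, point):
    """Map a 3D point from the object frame to the target frame: R @ p + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Pose: rotate 90 degrees about z, then shift by (1, 2, 3).
R = rotation_z(math.pi / 2)
t = (1.0, 2.0, 3.0)

# The object-frame point (1, 0, 0) lands near (1, 3, 3) in the target frame.
print(apply_pose(R, t, (1.0, 0.0, 0.0)))
```

A full 6D pose would use a general rotation (for example a rotation matrix, quaternion, or axis-angle vector) instead of the single yaw angle assumed here.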