14,532 research outputs found

    Backtracking Spatial Pyramid Pooling (SPP)-based Image Classifier for Weakly Supervised Top-down Salient Object Detection

    Full text link
    Top-down saliency models produce a probability map that peaks at target locations specified by a task/goal such as object detection. They are usually trained in a fully supervised setting involving pixel-level annotations of objects. We propose a weakly supervised top-down saliency framework using only binary labels that indicate the presence/absence of an object in an image. First, the probabilistic contribution of each image region to the confidence of a CNN-based image classifier is computed through a backtracking strategy to produce top-down saliency. From a set of saliency maps of an image produced by fast bottom-up saliency approaches, we select the best saliency map suitable for the top-down task. The selected bottom-up saliency map is combined with the top-down saliency map. Features having high combined saliency are used to train a linear SVM classifier to estimate feature saliency. This is integrated with combined saliency and further refined through a multi-scale superpixel-averaging of saliency map. We evaluate the performance of the proposed weakly supervised topdown saliency and achieve comparable performance with fully supervised approaches. Experiments are carried out on seven challenging datasets and quantitative results are compared with 40 closely related approaches across 4 different applications.Comment: 14 pages, 7 figure

    Image Co-saliency Detection and Co-segmentation from The Perspective of Commonalities

    Full text link
    University of Technology Sydney. Faculty of Engineering and Information Technology.Image co-saliency detection and image co-segmentation aim to identify the common salient objects and extract them in a group of images. Image co-saliency detection and image co-segmentation are important for many content-based applications such as image retrieval, image editing, and content aware image/video compression. The image co-saliency detection and image co-segmentation are very close works. The most important part in these two works is the definition of the commonality of the common objects. Usually, common objects share similar low-level features, such as appearances, including colours, textures shapes, etc. as well as the high-level semantic features. In this thesis, we explore the commonalities of the common objects in a group of images from low-level features and high-level features, and the way to achieve the commonalities and finally segment the common objects. Three main works are introduced, including an image co-saliency detection model and two image co-segmentation methods. , an image co-saliency detection model based on region-level fusion and pixel-level refinement is proposed. The commonalities between the common objects are defined by the appearance similarities on the regions from all the images. It discovers the regions that are salient in each individual image as well as salient in the whole image group. Extensive experiments on two benchmark datasets demonstrate that the proposed co-saliency model consistently outperforms the state-of-the-art co-saliency models in both subjective and objective evaluation. , an unsupervised images co-segmentation method via guidance of simple images is proposed. The commonalities are still defined by hand-crafted features on regions, colours and textures, but not calculated among regions from all the images. It takes advantages of the reliability of simple images, and successfully improves the performance. The experiments on the dataset demonstrate the outperformance and robustness of the proposed method. , a learned image co-segmentation model based on convolutional neural network with multi-scale feature fusion is proposed. The commonalities between objects are not defined by handcraft features but learned from the training data. When training a neural network with multiple input images simultaneously, the resource cost will increase rapidly with the inputs. To reduce the resource cost, reduced input size, less downsampling and dilation convolution are adopted in the proposed model. Experimental results on the public dataset demonstrate that the proposed model achieves a comparable performance to the state-of-the-art methods while the network has successfully gotten simplified and the resources cost is reduced

    注目領域検出のための視覚的注意モデル設計に関する研究

    Get PDF
    Visual attention is an important mechanism in the human visual system. When human observe images and videos, they usually do not describe all the contents in them. Instead, they tend to talk about the semantically important regions and objects in the images. The human eye is usually attracted by some regions of interest rather than the entire scene. These regions of interest that present the mainly meaningful or semantic content are called saliency region. Visual saliency detection refers to the use of intelligent algorithms to simulate human visual attention mechanism, extract both the low-level features and high-level semantic information and localize the salient object regions in images and videos. The generated saliency map indicates the regions that are likely to attract human attention. As a fundamental problem of image processing and computer vision, visual saliency detection algorithms have been extensively studied by researchers to solve practical tasks, such as image and video compression, image retargeting, object detection, etc. The visual attention mechanism adopted by saliency detection in general are divided into two categories, namely the bottom-up model and top-down model. The bottom-up attention algorithm focuses on utilizing the low-level visual features such as colour and edges to locate the salient objects. While the top-down attention utilizes the supervised learning to detect saliency. In recent years, more and more research tend to design deep neural networks with attention mechanisms to improve the accuracy of saliency detection. The design of deep attention neural network is inspired by human visual attention. The main goal is to enable the network to automatically capture the information that is critical to the target tasks and suppress irrelevant information, shift the attention from focusing on all to local. Currently various domain’s attention has been developed for saliency detection and semantic segmentation, such as the spatial attention module in convolution network, it generates a spatial attention map by utilizing the inter-spatial relationship of features; the channel attention module produces a attention by exploring the inter-channel relationship of features. All these well-designed attentions have been proven to be effective in improving the accuracy of saliency detection. This paper investigates the visual attention mechanism of salient object detection and applies it to digital histopathology image analysis for the detection and classification of breast cancer metastases. As shown in following contents, the main research contents include three parts: First, we studied the semantic attention mechanism and proposed a semantic attention approach to accurately localize the salient objects in complex scenarios. The proposed semantic attention uses Faster-RCNN to capture high-level deep features and replaces the last layer of Faster-RCNN by a FC layer and sigmoid function for visual saliency detection; it calculates proposals' attention probabilities by comparing their feature distances with the possible salient object. The proposed method introduces a re-weighting mechanism to reduce the influence of the complexity background, and a proposal selection mechanism to remove the background noise to obtain objects with accurate shape and contour. The simulation result shows that the semantic attention mechanism is robust to images with complex background due to the consideration of high-level object concept, the algorithm achieved outstanding performance among the salient object detection algorithms in the same period. Second, we designed a deep segmentation network (DSNet) for saliency object prediction. We explored a Pyramidal Attentional ASPP (PA-ASPP) module which can provide pixel level attention. DSNet extracts multi-level features with dilated ResNet-101 and the multiscale contextual information was locally weighted with the proposed PA-ASPP. The pyramid feature aggregation encodes the multi-level features from three different scales. This feature fusion incorporates neighboring scales of context features more precisely to produce better pixel-level attention. Finally, we use a scale-aware selection (SAS) module to locally weight multi-scale contextual features, capture important contexts of ASPP for the accurate and consistent dense prediction. The simulation results demonstrated that the proposed PA-ASPP is effective and can generate more coherent results. Besides, with the SAS, the model can adaptively capture the regions with different scales effectively. Finally, based on previous research on attentional mechanisms, we proposed a novel Deep Regional Metastases Segmentation (DRMS) framework for the detection and classification of breast cancer metastases. As we know, the digitalized whole slide image has high-resolution, usually has gigapixel, however the size of abnormal region is often relatively small, and most of the slide region are normal. The highly trained pathologists usually localize the regions of interest first in the whole slide, then perform precise examination in the selected regions. Even though the process is time-consuming and prone to miss diagnosis. Through observation and analysis, we believe that visual attention should be perfectly suited for the application of digital pathology image analysis. The integrated framework for WSI analysis can capture the granularity and variability of WSI, rich information from multi-grained pathological image. We first utilize the proposed attention mechanism based DSNet to detect the regional metastases in patch-level. Then, adopt the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to predict the whole metastases from individual slides. Finally, determine patient-level pN-stages by aggregating each individual slide-level prediction. In combination with the above techniques, the framework can make better use of the multi-grained information in histological lymph node section of whole-slice images. Experiments on large-scale clinical datasets (e.g., CAMELYON17) demonstrate that our method delivers advanced performance and provides consistent and accurate metastasis detection

    An investigation into image and video foreground segmentation and change detection

    Get PDF
    Detecting and segmenting Spatio-temporal foreground objects from videos are significant to motion pattern modelling and video content analysis. Extensive efforts have been made in the past decades. Nevertheless, video-based saliency detection and foreground segmentation remained challenging. On the one hand, the performances of image-based saliency detection algorithms are limited in complex contents, while the temporal connectivity between frames are not well-resolved. On the other hand, compared with the prosperous image-based datasets, the datasets in video-level saliency detection and segmentation usually have smaller scale and less diversity of contents. Towards a better understanding of video-level semantics, this thesis investigates the foreground estimation and segmentation in both image-level and video-level. This thesis firstly demonstrates the effectiveness of traditional features in video foreground estimation and segmentation. Motion patterns obtained by optical flow are utilised to draw coarse estimations about the foreground objects. The coarse estimations are refined by aligning motion boundaries with actual contours of the foreground objects with the participation of HOG descriptor. And a precise segmentation of the foreground is computed based on the refined foreground estimations and video-level colour distribution. Second, a deep convolutional neural network (CNN) for image saliency detection is proposed, which is named HReSNet. To improve the accuracy of saliency prediction, an independent feature refining network is implemented. A Euclidean distance loss is integrated into loss computation to enhance the saliency predictions near the contours of objects. The experimental results demonstrate that our network obtains competitive results compared with the state-of-art algorithms. Third, a large-scale dataset for video saliency detection and foreground segmentation is built to enrich the diversity of current video-based foreground segmentation datasets. A supervised framework is also proposed as the baseline, which integrates our HReSNet, Long-Short Term Memory (LSTM) networks and a hierarchical segmentation network. Forth, in the practice of change detection, there requires distinguishing the expected changes with semantics from the unexpected changes. Therefore, a new CNN design is proposed to detect changes in multi-temporal high-resolution urban images. Experimental results showed our change detection network outperformed the competing algorithms with significant advantages

    Deep Networks Based Energy Models for Object Recognition from Multimodality Images

    Get PDF
    Object recognition has been extensively investigated in computer vision area, since it is a fundamental and essential technique in many important applications, such as robotics, auto-driving, automated manufacturing, and security surveillance. According to the selection criteria, object recognition mechanisms can be broadly categorized into object proposal and classification, eye fixation prediction and saliency object detection. Object proposal tends to capture all potential objects from natural images, and then classify them into predefined groups for image description and interpretation. For a given natural image, human perception is normally attracted to the most visually important regions/objects. Therefore, eye fixation prediction attempts to localize some interesting points or small regions according to human visual system (HVS). Based on these interesting points and small regions, saliency object detection algorithms propagate the important extracted information to achieve a refined segmentation of the whole salient objects. In addition to natural images, object recognition also plays a critical role in clinical practice. The informative insights of anatomy and function of human body obtained from multimodality biomedical images such as magnetic resonance imaging (MRI), transrectal ultrasound (TRUS), computed tomography (CT) and positron emission tomography (PET) facilitate the precision medicine. Automated object recognition from biomedical images empowers the non-invasive diagnosis and treatments via automated tissue segmentation, tumor detection and cancer staging. The conventional recognition methods normally utilize handcrafted features (such as oriented gradients, curvature, Haar features, Haralick texture features, Laws energy features, etc.) depending on the image modalities and object characteristics. It is challenging to have a general model for object recognition. Superior to handcrafted features, deep neural networks (DNN) can extract self-adaptive features corresponding with specific task, hence can be employed for general object recognition models. These DNN-features are adjusted semantically and cognitively by over tens of millions parameters corresponding to the mechanism of human brain, therefore leads to more accurate and robust results. Motivated by it, in this thesis, we proposed DNN-based energy models to recognize object on multimodality images. For the aim of object recognition, the major contributions of this thesis can be summarized below: 1. We firstly proposed a new comprehensive autoencoder model to recognize the position and shape of prostate from magnetic resonance images. Different from the most autoencoder-based methods, we focused on positive samples to train the model in which the extracted features all come from prostate. After that, an image energy minimization scheme was applied to further improve the recognition accuracy. The proposed model was compared with three classic classifiers (i.e. support vector machine with radial basis function kernel, random forest, and naive Bayes), and demonstrated significant superiority for prostate recognition on magnetic resonance images. We further extended the proposed autoencoder model for saliency object detection on natural images, and the experimental validation proved the accurate and robust saliency object detection results of our model. 2. A general multi-contexts combined deep neural networks (MCDN) model was then proposed for object recognition from natural images and biomedical images. Under one uniform framework, our model was performed in multi-scale manner. Our model was applied for saliency object detection from natural images as well as prostate recognition from magnetic resonance images. Our experimental validation demonstrated that the proposed model was competitive to current state-of-the-art methods. 3. We designed a novel saliency image energy to finely segment salient objects on basis of our MCDN model. The region priors were taken into account in the energy function to avoid trivial errors. Our method outperformed state-of-the-art algorithms on five benchmarking datasets. In the experiments, we also demonstrated that our proposed saliency image energy can boost the results of other conventional saliency detection methods

    Contextual Encoder-Decoder Network for Visual Saliency Prediction

    Get PDF
    Predicting salient regions in natural images requires the detection of objects that are present in a scene. To develop robust representations for this challenging task, high-level visual features at multiple spatial scales must be extracted and augmented with contextual information. However, existing models aimed at explaining human fixation maps do not incorporate such a mechanism explicitly. Here we propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task. The architecture forms an encoder-decoder structure and includes a module with multiple convolutional layers at different dilation rates to capture multi-scale features in parallel. Moreover, we combine the resulting representations with global scene information for accurately predicting visual saliency. Our model achieves competitive and consistent results across multiple evaluation metrics on two public saliency benchmarks and we demonstrate the effectiveness of the suggested approach on five datasets and selected examples. Compared to state of the art approaches, the network is based on a lightweight image classification backbone and hence presents a suitable choice for applications with limited computational resources, such as (virtual) robotic systems, to estimate human fixations across complex natural scenes.Comment: Accepted Manuscrip

    Uncertainty-aware Salient Object Detection

    Get PDF
    Saliency detection models are trained to discover the region(s) of an image that attract human attention. According to whether depth data is used, static image saliency detection models can be divided into RGB image saliency detection models, and RGB-D image saliency detection models. The former predict salient regions of the RGB image, while the latter take both the RGB image and the depth data as input. Conventional saliency prediction models typically learn a deterministic mapping from images to the corresponding ground truth saliency maps without modeling the uncertainty of predictions, following the supervised learning pipeline. This thesis is dedicated to learning a conditional distribution over saliency maps, given an input image, and modeling the uncertainty of predictions. For RGB-D saliency detection, we present the first generative model based framework to achieve uncertainty-aware prediction. Our framework includes two main models: 1) a generator model and 2) an inference model. The generator model is an encoder-decoder saliency network. To infer the latent variable, we introduce two different solutions: i) a Conditional Variational Auto-encoder with an extra encoder to approximate the posterior distribution of the latent variable; and ii) an Alternating Back-Propagation technique, which directly samples the latent variable from the true posterior distribution. One drawback of above model is that it fails to explicitly model the connection between RGB image and depth data to achieve effective cooperative learning. We further introduce a novel latent variable model based complementary learning framework to explicitly model the complementary information between the two modes, namely the RGB mode and depth mode. Specifically, we first design a regularizer using mutual-information minimization to reduce the redundancy between appearance features from RGB and geometric features from depth in the latent space. Then we fuse the latent features of each mode to achieve multi-modal feature fusion. Extensive experiments on benchmark RGB-D saliency datasets illustrate the effectiveness of our framework. For RGB saliency detection, we propose a generative saliency prediction model based on the conditional generative cooperative network, where a conditional latent variable model and a conditional energy-based model are jointly trained to predict saliency in a cooperative manner. The latent variable model serves as a coarse saliency model to produce a fast initial prediction, which is then refined by Langevin revision of the energy-based model that serves as a fine saliency model. Apart from the fully supervised learning framework, we also investigate weakly supervised learning, and propose the first scribble-based weakly-supervised salient object detection model. In doing so, we first relabel an existing large-scale salient object detection dataset with scribbles, namely S-DUTS dataset. To mitigate the missing structure information in scribble annotation, we propose an auxiliary edge detection task to localize object edges explicitly, and a gated structure-aware loss to place constraints on the scope of structure to be recovered. To further reduce the labeling burden, we introduce a noise-aware encoder-decoder framework to disentangle a clean saliency predictor from noisy training examples, where the noisy labels are generated by unsupervised handcrafted feature-based methods. The whole model that represents noisy labels is a sum of the two sub-models. The goal of training the model is to estimate the parameters of both sub-models, and simultaneously infer the corresponding latent vector of each noisy label. We propose to train the model by using an alternating back-propagation algorithm. To prevent the network from converging to trivial solutions, we utilize an edge-aware smoothness loss to regularize hidden saliency maps to have similar structures as their corresponding images
    corecore