742 research outputs found

Priming Neural Networks

Authors: Biparva Mahdi; Rosenfeld Amir; Tsotsos John K.
Publication date: 16/11/2017

Visual priming is known to affect the human visual system, allowing detection of scene elements that may have been nearly unnoticeable before, such as camouflaged animals. This process has been shown to be an effect of top-down signaling in the visual system triggered by the cue. In this paper, we propose a mechanism to mimic the process of priming in the context of object detection and segmentation. We view priming as having a modulatory, cue-dependent effect on layers of features within a network. Our results show how such a process can be complementary to, and at times more effective than, simple post-processing applied to the output of the network, notably in cases where the object is hard to detect, such as under severe noise. Moreover, we find that the effects of priming are sometimes stronger when early visual layers are affected. Overall, our experiments confirm that top-down signals can go a long way in improving object detection and segmentation.

Comment: fixed error in author name
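
The abstract describes priming as a modulatory, cue-dependent effect on feature layers. One way to picture this is a per-channel gain computed from the cue. The sketch below is only an illustration under that assumption, not the authors' mechanism: `prime_features`, the `weights` matrix, and the `1 + w·cue` gain rule are all hypothetical.

```python
from typing import List

Channel = List[List[float]]  # one H x W feature channel


def prime_features(features: List[Channel],
                   cue: List[float],
                   weights: List[List[float]]) -> List[Channel]:
    """Scale each feature channel by a cue-dependent gain.

    features: C channels, each an H x W grid of activations.
    cue:      one-hot vector of length K naming the target category.
    weights:  C x K matrix of (hypothetical) learned modulation weights.
    """
    primed = []
    for channel, w_row in zip(features, weights):
        # Assumed gain rule: 1 + <w_row, cue>; a zero cue leaves features unchanged.
        gain = 1.0 + sum(w * c for w, c in zip(w_row, cue))
        primed.append([[gain * v for v in row] for row in channel])
    return primed


# Two channels of a 2 x 2 feature map, two possible cue categories.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[1.0, 1.0], [1.0, 1.0]]]
weights = [[0.5, 0.0],    # channel 0 is amplified by cue 0
           [-0.5, 0.0]]   # channel 1 is suppressed by cue 0

primed = prime_features(features, [1.0, 0.0], weights)
unprimed = prime_features(features, [0.0, 1.0], weights)
```

With cue 0 active, channel 0 is scaled by 1.5 and channel 1 by 0.5; with cue 1 active, both gains are 1 and the features pass through unchanged.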

arXiv.org e-Print Archive

Crossref

BiSeg: Simultaneous Instance Segmentation and Semantic Segmentation with Fully Convolutional Networks

Authors: Ito Satoshi; Kozakaya Tatsuo; Pham Viet-Quoc
Publication date: 01/01/2017

We present a simple and effective framework for simultaneous semantic segmentation and instance segmentation with Fully Convolutional Networks (FCNs). The method, called BiSeg, predicts instance segmentation as a posterior in Bayesian inference, where semantic segmentation is used as a prior. We extend the idea of position-sensitive score maps used in recent methods to a fusion of multiple score maps at different scales and partition modes, and adopt it as a robust likelihood for instance segmentation inference. As both Bayesian inference and map fusion are performed per pixel, BiSeg is a fully convolutional end-to-end solution that inherits all the advantages of FCNs. We demonstrate state-of-the-art instance segmentation accuracy on PASCAL VOC.

Comment: BMVC201
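
The per-pixel Bayesian step described in the abstract (semantic segmentation as prior, fused score maps as likelihood) can be sketched as follows. This is a simplified illustration, not the BiSeg implementation: it assumes a two-class (object versus background) Bayes rule and fuses score maps by a plain average, and the function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]  # H x W map of per-pixel probabilities


def fuse_likelihoods(score_maps: List[Grid]) -> Grid:
    """Fuse several score maps by averaging them pixel by pixel (assumed rule)."""
    n = len(score_maps)
    h, w = len(score_maps[0]), len(score_maps[0][0])
    return [[sum(m[i][j] for m in score_maps) / n for j in range(w)]
            for i in range(h)]


def pixel_posterior(prior: float, likelihood: float) -> float:
    """Two-class Bayes rule for a single pixel."""
    numerator = prior * likelihood
    denominator = numerator + (1.0 - prior) * (1.0 - likelihood)
    return numerator / denominator if denominator > 0 else 0.0


def instance_posterior(semantic_prior: Grid, score_maps: List[Grid]) -> Grid:
    """Combine the semantic prior with the fused likelihood at every pixel."""
    likelihood = fuse_likelihoods(score_maps)
    return [[pixel_posterior(p, l) for p, l in zip(p_row, l_row)]
            for p_row, l_row in zip(semantic_prior, likelihood)]


# A 1 x 2 image: the prior is uninformative at pixel 0 and confident at pixel 1.
semantic_prior = [[0.5, 0.9]]
score_maps = [[[0.7, 0.4]],   # hypothetical score map at one scale
              [[0.9, 0.6]]]   # hypothetical score map at another scale
posterior = instance_posterior(semantic_prior, score_maps)
```

Because every operation is applied independently per pixel, the whole computation maps onto element-wise tensor operations, which is consistent with the abstract's point that the approach stays fully convolutional.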

arXiv.org e-Print Archive

Crossref

A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection

Authors: Gupta Abhinav; Shrivastava Abhinav; Wang Xiaolong
Publication date: 11/04/2017

How do we learn an object detector that is invariant to occlusions and deformations? Our current solution is to use a data-driven strategy: collect large-scale datasets which have object instances under different conditions. The hope is that the final classifier can use these examples to learn invariances. But is it really possible to see all the occlusions in a dataset? We argue that, like categories, occlusions and object deformations also follow a long-tailed distribution. Some occlusions and deformations are so rare that they hardly happen; yet we want to learn a model invariant to such occurrences. In this paper, we propose an alternative solution. We propose to learn an adversarial network that generates examples with occlusions and deformations. The goal of the adversary is to generate examples that are difficult for the object detector to classify. In our framework, both the original detector and the adversary are learned in a joint manner. Our experimental results indicate a 2.3% mAP boost on VOC07 and a 2.6% mAP boost on the VOC2012 object detection challenge compared to the Fast-RCNN pipeline. We also release the code for this paper.

Comment: CVPR 2017 Camera Ready
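
The adversary's goal, as stated in the abstract, is to produce occluded examples that are hard for the detector. The sketch below illustrates that objective only: it replaces the learned adversarial network with a brute-force search over square occlusion windows on a single feature map, and `toy_loss` is a hypothetical stand-in for a detector loss.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]  # H x W feature map


def occlude(feature: Grid, top: int, left: int, size: int) -> Grid:
    """Return a copy of the feature map with a size x size block zeroed out."""
    return [[0.0 if top <= i < top + size and left <= j < left + size else v
             for j, v in enumerate(row)]
            for i, row in enumerate(feature)]


def hardest_occlusion(feature: Grid,
                      loss_fn: Callable[[Grid], float],
                      size: int) -> Tuple[float, int, int]:
    """Try every window position and keep the one that maximizes the loss."""
    h, w = len(feature), len(feature[0])
    best = None
    for top in range(h - size + 1):
        for left in range(w - size + 1):
            loss = loss_fn(occlude(feature, top, left, size))
            if best is None or loss > best[0]:
                best = (loss, top, left)
    return best


def toy_loss(feature: Grid) -> float:
    """Hypothetical detector loss: less remaining activation means a harder example."""
    return -sum(sum(row) for row in feature)


feature = [[0.1, 0.2, 0.1],
           [0.2, 0.9, 0.8],
           [0.1, 0.7, 0.9]]
loss, top, left = hardest_occlusion(feature, toy_loss, size=2)
hard_example = occlude(feature, top, left, 2)
```

Here the search picks the window at row 1, column 1, which covers the strongest activations. In a joint training setup, the detector would then be updated on `hard_example` while the adversary is updated to keep finding such regions.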

arXiv.org e-Print Archive

Crossref

Video Object Detection with an Aligned Spatial-Temporal Memory

Authors: Xiao Fanyi; Lee Yong Jae
Publication date: 26/07/2018

We introduce Spatial-Temporal Memory Networks for video object detection. At its core, a novel Spatial-Temporal Memory module (STMM) serves as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM's design enables full integration of pretrained backbone CNN weights, which we find to be critical for accurate detection. Furthermore, in order to tackle object motion in videos, we propose a novel MatchTrans module to align the spatial-temporal memory from frame to frame. Our method produces state-of-the-art results on the benchmark ImageNet VID dataset, and our ablative studies clearly demonstrate the contribution of our different design choices. We release our code and models at http://fanyix.cs.ucdavis.edu/project/stmn/project.html

arXiv.org e-Print Archive

Crossref

Biologically Inspired Visual System Architecture for Object Recognition in Autonomous Systems

Authors: Guterman Hugo; Malowany Dan
Publication date: 11/07/2020

Findings in recent years on the sensitivity of convolutional neural networks to additive noise, lighting conditions, and the completeness of the training dataset indicate that this technology still lacks the robustness needed for the autonomous robotics industry. In an attempt to bring computer vision algorithms closer to the capabilities of a human operator, the mechanisms of the human visual system were analyzed in this work. Recent studies show that the mechanisms behind the recognition process in the human brain include continuous generation of predictions based on prior knowledge of the world. These predictions enable rapid generation of contextual hypotheses that bias the outcome of the recognition process. This mechanism is especially advantageous in situations of uncertainty, when visual input is ambiguous. In addition, the human visual system continuously updates its knowledge about the world based on the gaps between its predictions and the visual feedback. Convolutional neural networks are feed-forward in nature and lack such top-down contextual attenuation mechanisms. As a result, although they process massive amounts of visual information during their operation, the information is not transformed into knowledge that can be used to generate contextual predictions and improve their performance. In this work, an architecture was designed that aims to integrate the concepts behind the top-down prediction and learning processes of the human visual system with state-of-the-art bottom-up object recognition models, e.g., deep convolutional neural networks. The work focuses on two mechanisms of the human visual system: anticipation-driven perception and reinforcement-driven learning. Imitating these top-down mechanisms, together with state-of-the-art bottom-up feed-forward algorithms, resulted in an accurate, robust, and continuously improving target recognition model.

arXiv.org e-Print Archive

Multidisciplinary Digital Publishing Institute