265 research outputs found

    Salient Object Detection Techniques in Computer Vision-A Survey.

    Full text link
    Detection and localization of regions of images that attract immediate human visual attention is currently an intensive area of research in computer vision. The capability of automatic identification and segmentation of such salient image regions has immediate consequences for applications in the field of computer vision, computer graphics, and multimedia. A large number of salient object detection (SOD) methods have been devised to effectively mimic the capability of the human visual system to detect the salient regions in images. These methods can be broadly categorized into two categories based on their feature engineering mechanism: conventional or deep learning-based. In this survey, most of the influential advances in image-based SOD from both conventional as well as deep learning-based categories have been reviewed in detail. Relevant saliency modeling trends with key issues, core techniques, and the scope for future research work have been discussed in the context of difficulties often faced in salient object detection. Results are presented for various challenging cases for some large-scale public datasets. Different metrics considered for assessment of the performance of state-of-the-art salient object detection models are also covered. Some future directions for SOD are presented towards end

    Biologically Inspired Object Tracking Using Center-Surround Saliency Mechanisms

    Full text link

    Computational Models of Perceptual Organization and Bottom-up Attention in Visual and Audio-Visual Environments

    Get PDF
    Figure Ground Organization (FGO) - inferring spatial depth ordering of objects in a visual scene - involves determining which side of an occlusion boundary (OB) is figure (closer to the observer) and which is ground (further away from the observer). Attention, the process that governs how only some part of sensory information is selected for further analysis based on behavioral relevance, can be exogenous, driven by stimulus properties such as an abrupt sound or a bright flash, the processing of which is purely bottom-up; or endogenous (goal-driven or voluntary), where top-down factors such as familiarity, aesthetic quality, etc., determine attentional selection. The two main objectives of this thesis are developing computational models of: (i) FGO in visual environments; (ii) bottom-up attention in audio-visual environments. In the visual domain, we first identify Spectral Anisotropy (SA), characterized by anisotropic distribution of oriented high frequency spectral power on the figure side and lack of it on the ground side, as a novel FGO cue, that can determine Figure/Ground (FG) relations at an OB with an accuracy exceeding 60%. Next, we show a non-linear Support Vector Machine based classifier trained on the SA features achieves an accuracy close to 70% in determining FG relations, the highest for a stand-alone local cue. We then show SA can be computed in a biologically plausible manner by pooling the Complex cell responses of different scales in a specific orientation, which also achieves an accuracy greater than or equal to 60% in determining FG relations. Next, we present a biologically motivated, feed forward model of FGO incorporating convexity, surroundedness, parallelism as global cues and SA, T-junctions as local cues, where SA is computed in a biologically plausible manner. Each local cue, when added alone, gives statistically significant improvement in the model's performance. The model with both local cues achieves higher accuracy than those of models with individual cues in determining FG relations, indicating SA and T-Junctions are not mutually contradictory. Compared to the model with no local cues, the model with both local cues achieves greater than or equal to 8.78% improvement in determining FG relations at every border location of images in the BSDS dataset. In the audio-visual domain, first we build a simple computational model to explain how visual search can be aided by providing concurrent, co-spatial auditory cues. Our model shows that adding a co-spatial, concurrent auditory cue can enhance the saliency of a weakly visible target among prominent visual distractors, the behavioral effect of which could be faster reaction time and/or better search accuracy. Lastly, a bottom-up, feed-forward, proto-object based audiovisual saliency map (AVSM) for the analysis of dynamic natural scenes is presented. We demonstrate that the performance of proto-object based AVSM in detecting and localizing salient objects/events is in agreement with human judgment. In addition, we show the AVSM computed as a linear combination of visual and auditory feature conspicuity maps captures a higher number of valid salient events compared to unisensory saliency maps

    Processing boundary and region features for perception

    Get PDF
    A fundamental task for any visual system is the accurate detection of objects from background information, for example, defining fruit from foliage or a predator in a forest. This is commonly referred to as figure-ground segregation, which occurs when the visual system locates differences in visual features across an image, such as colour or texture. Combinations of feature contrast define an object from its surrounds, though the exact nature of that combination is still debated. Two processes are likely to contribute to object conspicuity, the pooling of features within an object's bounds relative to those in the background ('region' contrast) and detecting feature contrast at the boundary itself ('boundary' contrast). Investigations of the relative contributions of these two processes to perception have produced sometimes contradictory findings, some of which can be explained by the methodology adopted in those studies. For example, results from several studies adopting search-based methodologies have advocated nonlinear interaction of the boundary and region processes, whereas results from more subjective methods have indicated a linear combination. This thesis aims to compare search and subjective methodologies to determine how visual features (region and boundary) interact, highlight limitations of these metrics, and then unpack the contributions of boundary and region processes in greater detail. The first and second experiments investigated the relative contributions of boundary strength, regional orientation, and regional spatial frequency to object conspicuity. This was achieved via a comparison of search and subjective methodologies, which, as mentioned, have previously produced conflicting results in this domain. The results advocated a relatively strong contribution of boundary features compared to region-based features, and replicated the apparent incongruence between findings from search-based and subjective metrics. Results from the search task suggest nonlinear interaction and those from the subjective task suggest linear interaction. A unifying model that reconciles these seemingly contradicting findings (and those in the literature) is then presented, which considers the effect of metric sensitivity and performance ceilings in the paradigms employed. In light of the findings from the first and second experiments that suggest a stronger contribution of boundary information to object conspicuity, the third and fourth experiments investigated boundary features in more detail. Anecdotal reports from observers in the earlier experiments suggest that the conspicuity of boundaries is modulated by information in the background, regardless of boundary structure. As such, the relative contributions of boundary-background contrast and boundary composition were investigated using a novel stimulus generation technique that enables their effective isolation. A novel metric for boundary composition that correlates well with perception is also outlined. Results for those experiments suggested a significant contribution of both sources of boundary information, though advocate a critical role for boundary-background contrast. The final experiment explored the contribution of region-based information to object conspicuity in more detail, specifically how higher-order image structure, such as the components of complex texture, contribute to conspicuity. A state-of-the-art texture synthesis model, which reproduces textures via mechanisms that mimic processes in the human visual system, is evaluated respect to its perceptual applicability. Previous evaluations of this synthesis model are extended via a novel approach that enables the isolation of the model's parameters (which simulate physiological mechanisms) for independent examination. An alternative metric for the efficacy of the model is also presented

    Cortical Surround Interactions and Perceptual Salience via Natural Scene Statistics

    Get PDF
    Spatial context in images induces perceptual phenomena associated with salience and modulates the responses of neurons in primary visual cortex (V1). However, the computational and ecological principles underlying contextual effects are incompletely understood. We introduce a model of natural images that includes grouping and segmentation of neighboring features based on their joint statistics, and we interpret the firing rates of V1 neurons as performing optimal recognition in this model. We show that this leads to a substantial generalization of divisive normalization, a computation that is ubiquitous in many neural areas and systems. A main novelty in our model is that the influence of the context on a target stimulus is determined by their degree of statistical dependence. We optimized the parameters of the model on natural image patches, and then simulated neural and perceptual responses on stimuli used in classical experiments. The model reproduces some rich and complex response patterns observed in V1, such as the contrast dependence, orientation tuning and spatial asymmetry of surround suppression, while also allowing for surround facilitation under conditions of weak stimulation. It also mimics the perceptual salience produced by simple displays, and leads to readily testable predictions. Our results provide a principled account of orientation-based contextual modulation in early vision and its sensitivity to the homogeneity and spatial arrangement of inputs, and lends statistical support to the theory that V1 computes visual salience

    Automatic extraction of regions of interest from images based on visual attention models

    Get PDF
    This thesis presents a method for the extraction of regions of interest (ROIs) from images. By ROIs we mean the most prominent semantic objects in the images, of any size and located at any position in the image. The novel method is based on computational models of visual attention (VA), operates under a completely bottom-up and unsupervised way and does not present con-straints in the category of the input images. At the core of the architecture is de model VA proposed by Itti, Koch and Niebur and the one proposed by Stentiford. The first model takes into account color, intensity, and orientation features and provides coordinates corresponding to the points of attention (POAs) in the image. The second model considers color features and provides rough areas of attention (AOAs) in the image. In the proposed architecture, the POAs and AOAs are combined to establish the contours of the ROIs. Two implementations of this architecture are presented, namely 'first version' and 'improved version'. The first version mainly on traditional morphological operations and was applied in two novel region-based image retrieval systems. In the first one, images are clustered on the basis of the ROIs, instead of the global characteristics of the image. This provides a meaningful organization of the database images, since the output clusters tend to contain objects belonging to the same category. In the second system, we present a combination of the traditional global-based with region-based image retrieval under a multiple-example query scheme. In the improved version of the architecture, the main stages are a spatial coherence analysis between both VA models and a multiscale representation of the AOAs. Comparing to the first one, the improved version presents more versatility, mainly in terms of the size of the extracted ROIs. The improved version was directly evaluated for a wide variety of images from different publicly available databases, with ground truth in the form of bounding boxes and true object contours. The performance measures used were precision, recall, F1 and area overlap. Experimental results are of very high quality, particularly if one takes into account the bottom-up and unsupervised nature of the approach.UOL; CAPESEsta tese apresenta um método para a extração de regiões de interesse (ROIs) de imagens. No contexto deste trabalho, ROIs são definidas como os objetos semânticos que se destacam em uma imagem, podendo apresentar qualquer tamanho ou localização. O novo método baseia-se em modelos computacionais de atenção visual (VA), opera de forma completamente bottom-up, não supervisionada e não apresenta restrições com relação à categoria da imagem de entrada. Os elementos centrais da arquitetura são os modelos de VA propostos por Itti-Koch-Niebur e Stentiford. O modelo de Itti-Koch-Niebur considera as características de cor, intensidade e orientação da imagem e apresenta uma resposta na forma de coordenadas, correspondentes aos pontos de atenção (POAs) da imagem. O modelo Stentiford considera apenas as características de cor e apresenta a resposta na forma de áreas de atenção na imagem (AOAs). Na arquitetura proposta, a combinação de POAs e AOAs permite a obtenção dos contornos das ROIs. Duas implementações desta arquitetura, denominadas 'primeira versão' e 'versão melhorada' são apresentadas. A primeira versão utiliza principalmente operações tradicionais de morfologia matemática. Esta versão foi aplicada em dois sistemas de recuperação de imagens com base em regiões. No primeiro, as imagens são agrupadas de acordo com as ROIs, ao invés das características globais da imagem. O resultado são grupos de imagens mais significativos semanticamente, uma vez que o critério utilizado são os objetos da mesma categoria contidos nas imagens. No segundo sistema, á apresentada uma combinação da busca de imagens tradicional, baseada nas características globais da imagem, com a busca de imagens baseada em regiões. Ainda neste sistema, as buscas são especificadas através de mais de uma imagem exemplo. Na versão melhorada da arquitetura, os estágios principais são uma análise de coerência espacial entre as representações de ambos modelos de VA e uma representação multi-escala das AOAs. Se comparada à primeira versão, esta apresenta maior versatilidade, especialmente com relação aos tamanhos das ROIs presentes nas imagens. A versão melhorada foi avaliada diretamente, com uma ampla variedade de imagens diferentes bancos de imagens públicos, com padrões-ouro na forma de bounding boxes e de contornos reais dos objetos. As métricas utilizadas na avaliação foram presision, recall, F1 e area of overlap. Os resultados finais são excelentes, considerando-se a abordagem exclusivamente bottom-up e não-supervisionada do método
    • …
    corecore