
    Salient Object Detection Techniques in Computer Vision-A Survey.

    Detection and localization of image regions that attract immediate human visual attention is currently an intensive area of research in computer vision. The capability of automatic identification and segmentation of such salient image regions has immediate consequences for applications in the fields of computer vision, computer graphics, and multimedia. A large number of salient object detection (SOD) methods have been devised to effectively mimic the capability of the human visual system to detect the salient regions in images. These methods can be broadly divided into two categories based on their feature engineering mechanism: conventional or deep learning-based. In this survey, most of the influential advances in image-based SOD from both the conventional and the deep learning-based categories are reviewed in detail. Relevant saliency modeling trends with key issues, core techniques, and the scope for future research work are discussed in the context of difficulties often faced in salient object detection. Results are presented for various challenging cases on some large-scale public datasets. Different metrics used to assess the performance of state-of-the-art salient object detection models are also covered. Some future directions for SOD are presented towards the end.

    Information selection and fusion in vision systems

    Handling the enormous amounts of data produced by data-intensive imaging systems, such as multi-camera surveillance systems and microscopes, is technically challenging. While image and video compression help to manage the data volumes, they do not address the basic problem of information overflow. In this PhD we tackle the problem in a more drastic way. We select information of interest to a specific vision task, and discard the rest. We also combine data from different sources into a single output product, which presents the information of interest to end users in a suitable, summarized format. We treat two types of vision systems. The first type is conventional light microscopes. During this PhD, we have exploited for the first time the potential of the curvelet transform for image fusion for depth-of-field extension, allowing us to combine the advantages of multi-resolution image analysis for image fusion with increased directional sensitivity. As a result, the proposed technique clearly outperforms state-of-the-art methods, both on real microscopy data and on artificially generated images. The second type is camera networks with overlapping fields of view. To enable joint processing in such networks, inter-camera communication is essential. Because of infrastructure costs, power consumption for wireless transmission, etc., transmitting high-bandwidth video streams between cameras should be avoided. Fortunately, recently designed 'smart cameras', which have on-board processing and communication hardware, allow distributing the required image processing over the cameras. This permits compactly representing useful information from each camera. We focus on representing information for people localization and observation, which are important tools for statistical analysis of room usage, quick localization of people in case of building fires, etc. 
To further save bandwidth, we select which cameras should be involved in a vision task and transmit observations only from the selected cameras. We provide an information-theoretically founded framework for general-purpose camera selection based on the Dempster-Shafer theory of evidence. Applied to tracking, it allows tracking people using a dynamic selection of as few as three cameras with the same accuracy as when using up to ten cameras.
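As a rough illustration of the Dempster-Shafer theory of evidence mentioned in this abstract, the sketch below implements the textbook Dempster's rule of combination for two sources. It is not the thesis's camera selection framework; the function name `combine`, the hypotheses `left` and `right`, and all mass values are hypothetical and chosen only for illustration.

```python
from itertools import product


def combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination.

    Each mass function maps a frozenset of hypotheses to a mass, and the
    masses of one function are assumed to sum to 1.
    Returns the fused mass function and the conflict mass K.
    """
    fused = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence supports the intersection of the two sets.
            fused[common] = fused.get(common, 0.0) + mass_a * mass_b
        else:
            # Disjoint sets contribute to the conflict mass.
            conflict += mass_a * mass_b
    if conflict >= 1.0 - 1e-12:
        raise ValueError("sources are in total conflict; rule is undefined")
    scale = 1.0 - conflict
    return {h: m / scale for h, m in fused.items()}, conflict


# Hypothetical example: two cameras report where a person is standing.
FRAME = frozenset({"left", "right"})
camera_a = {frozenset({"left"}): 0.6, FRAME: 0.4}
camera_b = {frozenset({"left"}): 0.5, frozenset({"right"}): 0.2, FRAME: 0.3}

fused, conflict = combine(camera_a, camera_b)
for hypothesis, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(hypothesis), round(mass, 4))
```

Mass assigned to the whole frame expresses ignorance, which is what lets a source abstain without voting against any hypothesis.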

    Human action recognition using saliency-based global and local features

    Recognising human actions from video sequences is one of the most important topics in computer vision and has been extensively researched during the last decades; however, it is still regarded as a challenging task especially in real scenarios due to difficulties mainly resulting from background clutter, partial occlusion, as well as changes in scale, viewpoint, lighting, and appearance. Human action recognition is involved in many applications, including video surveillance systems, human-computer interaction, and robotics for human behaviour characterisation. In this thesis, we aim to introduce new features and methods to enhance and develop human action recognition systems. Specifically, we have introduced three methods for human action recognition. In the first approach, we present a novel framework for human action recognition based on salient object detection and a combination of local and global descriptors. Saliency Guided Feature Extraction (SGFE) is proposed to detect salient objects and extract features on the detected objects. We then propose a simple strategy to identify and process only those video frames that contain salient objects. Processing salient objects instead of all the frames not only makes the algorithm more efficient, but more importantly also suppresses the interference of background pixels. We combine this approach with a new combination of local and global descriptors, namely 3D SIFT and Histograms of Oriented Optical Flow (HOOF). The resulting Saliency Guided 3D SIFT and HOOF (SGSH) feature is used along with a multi-class support vector machine (SVM) classifier for human action recognition. The second proposed method is a novel 3D extension of Gradient Location and Orientation Histograms (3D GLOH) which provides discriminative local features representing both the gradient orientation and their relative locations. 
We further propose a human action recognition system based on the Bag of Visual Words model, by combining the new 3D GLOH local features with Histograms of Oriented Optical Flow (HOOF) global features. Along with the idea from our first work to extract features only in salient regions, our overall system outperforms existing feature descriptors for human action recognition on challenging video datasets. Finally, we propose to extract minimal representative information, namely deforming skeleton graphs corresponding to foreground shapes, to effectively represent actions and remove the influence of changes of illumination, subject appearance and backgrounds. We propose a novel approach to action recognition based on matching of skeleton graphs, combining a static pairwise graph similarity measure using Optimal Subsequence Bijection with Dynamic Time Warping to robustly handle topological and temporal variations. We have evaluated the proposed methods by conducting extensive experiments on widely-used human action datasets including the KTH, the UCF Sports, TV Human Interaction (TVHI), Olympic Sports and UCF11 datasets. Experimental results show the effectiveness of our methods for action recognition.
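The temporal alignment step named in this abstract, Dynamic Time Warping, can be sketched in its classic form as follows. This is a minimal version under stated assumptions: the per-frame cost here is a plain absolute difference between numbers, standing in for the skeleton-graph similarity that the thesis computes with Optimal Subsequence Bijection, and the name `dtw_distance` is hypothetical.

```python
def dtw_distance(seq_a, seq_b, cost=lambda x, y: abs(x - y)):
    """Classic dynamic time warping distance between two sequences.

    `cost` compares one element of each sequence. The default absolute
    difference is a placeholder for a frame-to-frame dissimilarity.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # table[i][j] = cheapest alignment of seq_a[:i] with seq_b[:j]
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            table[i][j] = step + min(
                table[i - 1][j],      # seq_a advances, seq_b element repeats
                table[i][j - 1],      # seq_b advances, seq_a element repeats
                table[i - 1][j - 1],  # both advance
            )
    return table[n][m]


# The same pattern performed at two speeds aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because one element may be matched to several consecutive elements of the other sequence, two recordings of the same action at different speeds can still receive a small distance.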

    Processing boundary and region features for perception

    A fundamental task for any visual system is the accurate detection of objects from background information, for example, defining fruit from foliage or a predator in a forest. This is commonly referred to as figure-ground segregation, which occurs when the visual system locates differences in visual features across an image, such as colour or texture. Combinations of feature contrast define an object from its surrounds, though the exact nature of that combination is still debated. Two processes are likely to contribute to object conspicuity, the pooling of features within an object's bounds relative to those in the background ('region' contrast) and detecting feature contrast at the boundary itself ('boundary' contrast). Investigations of the relative contributions of these two processes to perception have produced sometimes contradictory findings, some of which can be explained by the methodology adopted in those studies. For example, results from several studies adopting search-based methodologies have advocated nonlinear interaction of the boundary and region processes, whereas results from more subjective methods have indicated a linear combination. This thesis aims to compare search and subjective methodologies to determine how visual features (region and boundary) interact, highlight limitations of these metrics, and then unpack the contributions of boundary and region processes in greater detail. The first and second experiments investigated the relative contributions of boundary strength, regional orientation, and regional spatial frequency to object conspicuity. This was achieved via a comparison of search and subjective methodologies, which, as mentioned, have previously produced conflicting results in this domain. The results advocated a relatively strong contribution of boundary features compared to region-based features, and replicated the apparent incongruence between findings from search-based and subjective metrics. 
Results from the search task suggest nonlinear interaction and those from the subjective task suggest linear interaction. A unifying model that reconciles these seemingly contradictory findings (and those in the literature) is then presented, which considers the effect of metric sensitivity and performance ceilings in the paradigms employed. In light of the findings from the first and second experiments that suggest a stronger contribution of boundary information to object conspicuity, the third and fourth experiments investigated boundary features in more detail. Anecdotal reports from observers in the earlier experiments suggest that the conspicuity of boundaries is modulated by information in the background, regardless of boundary structure. As such, the relative contributions of boundary-background contrast and boundary composition were investigated using a novel stimulus generation technique that enables their effective isolation. A novel metric for boundary composition that correlates well with perception is also outlined. Results from those experiments suggested a significant contribution of both sources of boundary information, though they advocate a critical role for boundary-background contrast. The final experiment explored the contribution of region-based information to object conspicuity in more detail, specifically how higher-order image structure, such as the components of complex texture, contributes to conspicuity. A state-of-the-art texture synthesis model, which reproduces textures via mechanisms that mimic processes in the human visual system, is evaluated with respect to its perceptual applicability. Previous evaluations of this synthesis model are extended via a novel approach that enables the isolation of the model's parameters (which simulate physiological mechanisms) for independent examination. An alternative metric for the efficacy of the model is also presented.

    Irish Machine Vision and Image Processing Conference Proceedings 2017


    Acquiring 3D scene information from 2D images

    In recent years, people have become increasingly acquainted with 3D technologies such as 3DTV, 3D movies and 3D virtual navigation of city environments in their daily life. Commercial 3D movies are now commonly available for consumers. Virtual navigation of our living environment as used on a personal computer has become a reality due to well-known web-based geographic applications using advanced imaging technologies. To enable such 3D applications, many technological challenges such as 3D content creation, 3D display technology and 3D content transmission need to be tackled and the solutions deployed at low cost. This thesis concentrates on the reconstruction of 3D scene information from multiple 2D images, aiming for an automatic and low-cost production of the 3D content. In this thesis, two multiple-view 3D reconstruction systems are proposed: a 3D modeling system for reconstructing the sparse 3D scene model from long video sequences captured with a hand-held consumer camcorder, and a depth reconstruction system for creating depth maps from multiple-view videos taken by multiple synchronized cameras. Both systems are designed to compute the 3D scene information in an automated way with minimal human intervention, in order to reduce the production cost of 3D content. Experimental results on real videos of hundreds and thousands of frames have shown that the two systems are able to accurately and automatically reconstruct the 3D scene information from 2D image data. The findings of this research are useful for emerging 3D applications such as 3D games, 3D visualization and 3D content production. Apart from designing and implementing the two proposed systems, we have developed three key scientific contributions to enable the two proposed 3D reconstruction systems. 
The first contribution is that we have designed a novel feature point matching algorithm that uses only a smoothness constraint for matching the points, which states that neighboring feature points in images tend to move with similar directions and magnitudes. The employed smoothness assumption is not only valid but also robust for most images with limited image motion, regardless of the camera motion and scene structure. Because of this, the algorithm obtains two major advantages. First, the algorithm is robust to illumination changes, as the employed smoothness constraint does not rely on any texture information. Second, the algorithm has a good capability to handle the drift of the feature points over time, as the drift can hardly lead to a violation of the smoothness constraint. This leads to a large number of feature points matched and tracked by the proposed algorithm, which significantly helps the subsequent 3D modeling process. Our feature point matching algorithm is specifically designed for matching and tracking feature points in image/video sequences where the image motion is limited. Our extensive experimental results show that the proposed algorithm is able to track at least 2.5 times as many feature points compared with the state-of-the-art algorithms, with a comparable or higher accuracy. This contributes significantly to the robustness of the 3D reconstruction process. The second contribution is that we have developed algorithms to detect critical configurations where the factorization-based 3D reconstruction degenerates. Based on the detection, we have proposed a sequence-dividing algorithm to divide a long sequence into subsequences, such that successful 3D reconstructions can be performed on individual subsequences with a high confidence. The partial reconstructions are merged later to obtain the 3D model of the complete scene. 
In the critical configuration detection algorithm, four critical configurations are detected: (1) coplanar 3D scene points, (2) pure camera rotation, (3) rotation around two camera centers, and (4) presence of excessive noise and outliers in the measurements. The configurations in cases (1), (2) and (4) will affect the rank of the Scaled Measurement Matrix (SMM). The number of camera centers in case (3) will affect the number of independent rows of the SMM. By examining the rank and the row space of the SMM, the above-mentioned critical configurations are detected. Based on the detection results, the proposed sequence-dividing algorithm divides a long sequence into subsequences, such that each subsequence is free of the four critical configurations in order to obtain successful 3D reconstructions on individual subsequences. Experimental results on both synthetic and real sequences have demonstrated that the above four critical configurations are robustly detected, and a long sequence of thousands of frames is automatically divided into subsequences, yielding successful 3D reconstructions. The proposed critical configuration detection and sequence-dividing algorithms provide an essential processing block for an automatic 3D reconstruction on long sequences. The third contribution is that we have proposed a coarse-to-fine multiple-view depth labeling algorithm to compute depth maps from multiple-view videos, where the accuracy of the resulting depth maps is gradually refined in multiple optimization passes. In the proposed algorithm, multiple-view depth reconstruction is formulated as an image-based labeling problem using the framework of Maximum A Posteriori (MAP) estimation on Markov Random Fields (MRF). The MAP-MRF framework allows the combination of various objective and heuristic depth cues to define the local penalty and the interaction energies, which provides a straightforward and computationally tractable formulation. 
Furthermore, the globally optimal MAP solution to depth labeling can be found by minimizing the local energies, using existing MRF optimization algorithms. The proposed algorithm contains the following three key contributions. (1) A graph construction algorithm is proposed to construct triangular meshes on over-segmentation maps, in order to exploit the color and the texture information for depth labeling. (2) Multiple depth cues are combined to define the local energies. Furthermore, the local energies are adapted to the local image content, in order to consider the varying nature of the image content for an accurate depth labeling. (3) Both the density of the graph nodes and the intervals of the depth labels are gradually refined in multiple labeling passes. By doing so, both the computational efficiency and the robustness of the depth labeling process are improved. The experimental results on real multiple-view videos show that the depth maps for the selected reference view are accurately reconstructed. Depth discontinuities are very well preserved.
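For readers unfamiliar with the MAP-MRF labeling formulation named in this abstract, its generic textbook form is written out below. The notation is an assumption, not taken from the thesis: $\mathcal{V}$ is the set of graph nodes, $\mathcal{E}$ the set of neighboring node pairs, $f_p$ the depth label of node $p$, $D_p$ a local penalty built from depth cues, and $V_{pq}$ an interaction energy between neighbors.

```latex
% Posterior of a labeling f given the images I, written as a Gibbs distribution
P(f \mid I) \;\propto\; \exp\bigl(-E(f)\bigr)

% Energy: local penalties plus pairwise interaction energies
E(f) \;=\; \sum_{p \in \mathcal{V}} D_p(f_p)
      \;+\; \sum_{(p,q) \in \mathcal{E}} V_{pq}(f_p, f_q)

% Maximizing the posterior is equivalent to minimizing the energy
\hat{f} \;=\; \arg\max_{f} P(f \mid I) \;=\; \arg\min_{f} E(f)
```

In this form, refining the node density changes $\mathcal{V}$ and $\mathcal{E}$, while refining the depth intervals changes the set of values each $f_p$ may take.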