70 research outputs found

    Representations for Cognitive Vision : a Review of Appearance-Based, Spatio-Temporal, and Graph-Based Approaches

    The emerging discipline of cognitive vision requires a proper representation of visual information including spatial and temporal relationships, scenes, events, semantics and context. This review article summarizes existing representational schemes in computer vision which might be useful for cognitive vision, and discusses promising future research directions. The various approaches are categorized according to appearance-based, spatio-temporal, and graph-based representations for cognitive vision. While the representation of objects has been covered extensively in computer vision research, both from a reconstruction as well as from a recognition point of view, cognitive vision will also require new ideas on how to represent scenes. We introduce new concepts for scene representations and discuss how these might be efficiently implemented in future cognitive vision systems.

    Multispectral Image Road Extraction Based Upon Automated Map Conflation

    Road network extraction from remotely sensed imagery enables many important and diverse applications such as vehicle tracking, drone navigation, and intelligent transportation studies. There are, however, a number of challenges to road detection from an image. Road pavement material, width, direction, and topology vary across a scene. Complete or partial occlusions caused by nearby buildings, trees, and the shadows cast by them, make maintaining road connectivity difficult. The problems posed by occlusions are exacerbated with the increasing use of oblique imagery from aerial and satellite platforms. Further, common objects such as rooftops and parking lots are made of materials similar or identical to road pavements. This problem of common materials is a classic case of a single land cover material existing for different land use scenarios. This work addresses these problems in road extraction from geo-referenced imagery by leveraging the OpenStreetMap digital road map to guide image-based road extraction. The crowd-sourced cartography has the advantages of worldwide coverage that is constantly updated. The derived road vectors follow only roads and so can serve to guide image-based road extraction with minimal confusion from occlusions and changes in road material. On the other hand, the vector road map has no information on road widths and misalignments between the vector map and the geo-referenced image are small but nonsystematic. Properly correcting misalignment between two geospatial datasets, also known as map conflation, is an essential step. A generic framework requiring minimal human intervention is described for multispectral image road extraction and automatic road map conflation. The approach relies on the road feature generation of a binary mask and a corresponding curvilinear image. A method for generating the binary road mask from the image by applying a spectral measure is presented. 
The spectral measure, called anisotropy-tunable distance (ATD), differs from conventional measures and is created to account for both changes of spectral direction and spectral magnitude in a unified fashion. The ATD measure is particularly suitable for differentiating urban targets such as roads and building rooftops. The curvilinear image provides estimates of the width and orientation of potential road segments. Road vectors derived from OpenStreetMap are then conflated to image road features by applying junction matching and intermediate point matching, followed by refinement with mean-shift clustering and morphological processing to produce a road mask with piecewise width estimates. The proposed approach is tested on a set of challenging, large, and diverse image data sets and the performance accuracy is assessed. The method is effective for road detection and width estimation, even in challenging scenarios where extensive occlusion occurs.

    3D Pedestrian Tracking and Virtual Reconstruction of Ceramic Vessels Using Geometric and Color Cues

    Object tracking using cameras has many applications ranging from monitoring children and the elderly, to behavior analysis, entertainment, and homeland security. This thesis concentrates on the problem of tracking person(s) of interest in crowded scenes (e.g., airports, train stations, malls, etc.), rendering their locations in time and space along with high quality close-up images of the person for recognition. The tracking is achieved using a combination of overhead cameras for 3D tracking and a network of pan-tilt-zoom (PTZ) cameras to obtain close-up frontal face images. Based on projective geometry, the overhead cameras track people using salient and easily computable feature points such as head points. When the obtained head point is not accurate enough, the color information of the head tops across subsequent frames is integrated to detect and track people. To capture the best frontal face images of a target across time, a PTZ camera scheduling scheme is proposed, where the 'best' PTZ camera is selected based on the capture quality (as close as possible to frontal view) and handoff success (response time needed by the newly selected camera to move from current to desired state) probabilities. The experiments show that the 3D tracking errors are very small (less than 5 cm with 14 people crowding an area of around 4 m²) and that the frontal face images are captured effectively, with most of them centered in the frames. Computational archaeology is becoming a success story of applying computational tools in the reconstruction of vessels obtained from digs, freeing the expert from hours of intensive labor in manually stitching shards into meaningful vessels. In this thesis, we concentrate on the use of geometric and color information of the fragments for 3D virtual reconstruction of broken ceramic vessels. Generic models generated by the experts as a rendition of what the original vessel may have looked like are also utilized. 
The generic models need not be identical to the original vessel, but are within a geometric transformation of it in most of its parts. The markings on the 3D surfaces of fragments and generic models are extracted based on their color cues. Ceramic fragments are then aligned against the corresponding generic models based on the geometric relation between the extracted markings. The alignments yield sub-scanner resolution fitting errors. Ph.D., Electrical Engineering -- Drexel University, 201
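
The PTZ camera selection described in the abstract above can be illustrated with a minimal sketch. The abstract says only that the 'best' camera is chosen from capture quality and handoff success probabilities; combining the two by multiplication is an assumption made here for illustration, and the names `PTZCandidate` and `select_best_camera` are hypothetical, not taken from the thesis.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PTZCandidate:
    """One candidate PTZ camera (hypothetical structure for illustration)."""
    name: str
    capture_quality: float   # probability of a near-frontal capture, in [0, 1]
    handoff_success: float   # probability the camera reaches the desired state in time, in [0, 1]


def select_best_camera(candidates: Sequence[PTZCandidate]) -> PTZCandidate:
    """Pick the camera with the highest combined score.

    ASSUMPTION: the two probabilities are combined by a simple product.
    The source abstract does not specify how they are combined.
    """
    if not candidates:
        raise ValueError("at least one candidate camera is required")
    return max(candidates, key=lambda c: c.capture_quality * c.handoff_success)


cameras = [
    PTZCandidate("cam_a", capture_quality=0.9, handoff_success=0.5),
    PTZCandidate("cam_b", capture_quality=0.7, handoff_success=0.8),
]
best = select_best_camera(cameras)
print(best.name)
```

Under this assumed scoring, a camera with a slightly worse viewing angle can still win if it is much more likely to reach its target pose in time.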

    Robust and real-time hand detection and tracking in monocular video

    In recent years, personal computing devices such as laptops, tablets and smartphones have become ubiquitous. Moreover, intelligent sensors are being integrated into many consumer devices such as eyeglasses, wristwatches and smart televisions. With the advent of touchscreen technology, a new human-computer interaction (HCI) paradigm arose that allows users to interface with their device in an intuitive manner. Using simple gestures, such as swipe or pinch movements, a touchscreen can be used to directly interact with a virtual environment. Nevertheless, touchscreens still form a physical barrier between the virtual interface and the real world. An increasingly popular field of research that tries to overcome this limitation is video-based gesture recognition, hand detection and hand tracking. Gesture-based interaction allows the user to directly interact with the computer in a natural manner by exploring a virtual reality using nothing but his own body language. In this dissertation, we investigate how robust hand detection and tracking can be accomplished under real-time constraints. In the context of human-computer interaction, real-time is defined as both low latency and low complexity, such that a complete video frame can be processed before the next one becomes available. Furthermore, for practical applications, the algorithms should be robust to illumination changes, camera motion, and cluttered backgrounds in the scene. Finally, the system should be able to initialize automatically, and to detect and recover from tracking failure. We study a wide variety of existing algorithms, and propose significant improvements and novel methods to build a complete detection and tracking system that meets these requirements. Hand detection, hand tracking and hand segmentation are related yet technically different challenges. 
Whereas detection deals with finding an object in a static image, tracking considers temporal information and is used to track the position of an object over time, throughout a video sequence. Hand segmentation is the task of estimating the hand contour, thereby separating the object from its background. Detection of hands in individual video frames allows us to automatically initialize our tracking algorithm, and to detect and recover from tracking failure. Human hands are highly articulated objects, consisting of finger parts that are connected with joints. As a result, the appearance of a hand can vary greatly, depending on the assumed hand pose. Traditional detection algorithms often assume that the appearance of the object of interest can be described using a rigid model and therefore cannot be used to robustly detect human hands. Therefore, we developed an algorithm that detects hands by exploiting their articulated nature. Instead of resorting to a template-based approach, we probabilistically model the spatial relations between different hand parts, and the centroid of the hand. Detecting hand parts, such as fingertips, is much easier than detecting a complete hand. Based on our model of the spatial configuration of hand parts, the detected parts can be used to obtain an estimate of the complete hand's position. To comply with the real-time constraints, we developed techniques to speed up the process by efficiently discarding unimportant information in the image. Experimental results show that our method is competitive with the state-of-the-art in object detection while reducing computational complexity by a factor of 1 000. Furthermore, we showed that our algorithm can also be used to detect other articulated objects such as persons or animals and is therefore not restricted to the task of hand detection. Once a hand has been detected, a tracking algorithm can be used to continuously track its position in time. 
We developed a probabilistic tracking method that can cope with uncertainty caused by image noise, incorrect detections, changing illumination, and camera motion. Furthermore, our tracking system automatically determines the number of hands in the scene, and can cope with hands entering or leaving the video canvas. We introduced several novel techniques that greatly increase tracking robustness, and that can also be applied in domains other than hand tracking. To achieve real-time processing, we investigated several techniques to reduce the search space of the problem, and deliberately employ methods that are easily parallelized on modern hardware. Experimental results indicate that our methods outperform the state-of-the-art in hand tracking, while providing a much lower computational complexity. One of the methods used by our probabilistic tracking algorithm is optical flow estimation. Optical flow is defined as a 2D vector field describing the apparent velocities of objects in a 3D scene, projected onto the image plane. Optical flow is known to be used by many insects and birds to visually track objects and to estimate their ego-motion. However, most optical flow estimation methods described in the literature are either too slow to be used in real-time applications, or are not robust to illumination changes and fast motion. We therefore developed an optical flow algorithm that can cope with large displacements, and that is illumination independent. Furthermore, we introduce a regularization technique that ensures a smooth flow-field. This regularization scheme effectively reduces the number of noisy and incorrect flow-vector estimates, while maintaining the ability to handle motion discontinuities caused by object boundaries in the scene. The above methods are combined into a hand tracking framework which can be used for interactive applications in unconstrained environments. 
To demonstrate the possibilities of gesture-based human-computer interaction, we developed a new type of computer display. This display is completely transparent, allowing multiple users to perform collaborative tasks while maintaining eye contact. Furthermore, our display produces an image that seems to float in thin air, such that users can touch the virtual image with their hands. This floating imaging display has been showcased at several national and international events and tradeshows. The research that is described in this dissertation has been evaluated thoroughly by comparing detection and tracking results with those obtained by state-of-the-art algorithms. These comparisons show that the proposed methods outperform most algorithms in terms of accuracy, while achieving a much lower computational complexity, resulting in a real-time implementation. Results are discussed in depth at the end of each chapter. This research further resulted in an international journal publication; a second journal paper that has been submitted and is under review at the time of writing this dissertation; nine international conference publications; a national conference publication; a commercial license agreement concerning the research results; two hardware prototypes of a new type of computer display; and a software demonstrator.

    Exploring the Internal Statistics: Single Image Super-Resolution, Completion and Captioning

    Image enhancement has drawn increasing attention as a means of improving image quality or interpretability. It aims to modify images to achieve a better perception for the human visual system or a more suitable representation for further analysis in a variety of applications such as medical imaging, remote sensing, and video surveillance. Based on different attributes of the given input images, enhancement tasks vary, e.g., noise removal, deblurring, resolution enhancement, prediction of missing pixels, etc. The latter two are usually referred to as image super-resolution and image inpainting (or completion). Image super-resolution and completion are numerically ill-posed problems. Multi-frame-based approaches make use of the presence of aliasing in multiple frames of the same scene. For cases where only one input image is available, it is extremely challenging to estimate the unknown pixel values. In this dissertation, we target single image super-resolution and completion by exploring the internal statistics within the input image and across scales. An internal gradient similarity-based single image super-resolution algorithm is first presented. Then we demonstrate that the proposed framework could be naturally extended to accomplish super-resolution and completion simultaneously. Afterwards, a hybrid learning-based single image super-resolution approach is proposed to benefit from both external and internal statistics. This framework hinges on image-level hallucination from externally learned regression models as well as gradient level pyramid self-awareness for edge and texture refinement. The framework is then employed to break the resolution limitation of the passive microwave imagery and to boost the tracking accuracy of the sea ice movements. To extend our research to the quality enhancement of the depth maps, a novel system is presented to handle circumstances where only one pair of registered low-resolution intensity and depth images is available. 
High quality RGB and depth images are generated by the system. Extensive experimental results have demonstrated the effectiveness of all the proposed frameworks both quantitatively and qualitatively. Unlike image super-resolution and completion, which belong to low-level vision research, image captioning is a high-level vision task related to the semantic understanding of an input image. It is a natural task for human beings. However, image captioning remains challenging from a computer vision point of view, especially because the task itself is ambiguous. In principle, descriptions of an image can talk about any visual aspects in it, varying from object attributes to scene features, or even refer to objects that are not depicted and to hidden interactions or connections that require common sense knowledge to analyze. Therefore, learning-based image captioning is in general a data-driven task, which relies on the training dataset. Descriptions in the majority of the existing image-sentence datasets are generated by humans under specific instructions. Real-world sentence data is rarely directly utilized for training since it is sometimes noisy and unbalanced, which makes it ‘imperfect’ for the training of the image captioning task. In this dissertation, we present a novel image captioning framework to deal with the uncontrolled image-sentence dataset where descriptions could be strongly or weakly correlated to the image content and of arbitrary length. A self-guiding learning process is proposed to fully reveal the internal statistics of the training dataset, to look into the learning process in a global way, and to generate descriptions that are syntactically correct and semantically sound.

    Change blindness: eradication of gestalt strategies

    Arrays of eight texture-defined rectangles were used as stimuli in a one-shot change blindness (CB) task where there was a 50% chance that one rectangle would change orientation between two successive presentations separated by an interval. CB was eliminated by cueing the target rectangle in the first stimulus, reduced by cueing in the interval and unaffected by cueing in the second presentation. This supports the idea that a representation was formed that persisted through the interval before being 'overwritten' by the second presentation (Landman et al, 2003 Vision Research 43 149–164). Another possibility is that participants used some kind of grouping or Gestalt strategy. To test this we changed the spatial position of the rectangles in the second presentation by shifting them along imaginary spokes (by ±1 degree) emanating from the central fixation point. There was no significant difference in performance between this and the standard task [F(1,4)=2.565, p=0.185]. This may suggest that Gestalt grouping is not used as a strategy in these tasks, and it gives further weight to the argument that objects may be stored in and retrieved from a pre-attentional store during this task.
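
The statistic reported in the abstract above can be cross-checked with a short sketch. It assumes the reported value comes from a standard F-test with (1, 4) degrees of freedom. In that case F equals the square of a t statistic with 4 degrees of freedom, so the p-value follows from the closed-form Student's t CDF for 4 degrees of freedom. The function names are hypothetical and only the standard library is used.

```python
import math


def t_cdf_df4(t: float) -> float:
    """Closed-form CDF of Student's t distribution with 4 degrees of freedom."""
    scaled = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(scaled)) * (1.0 - (t * t / scaled) / 12.0)


def p_value_f_1_4(f_stat: float) -> float:
    """Upper-tail p-value of an F(1, 4) statistic.

    Uses the identity F(1, n) = t(n)^2, so the upper tail of F equals
    the two-sided tail of t with the same denominator degrees of freedom.
    """
    if f_stat < 0:
        raise ValueError("an F statistic cannot be negative")
    t = math.sqrt(f_stat)
    return 2.0 * (1.0 - t_cdf_df4(t))


p = p_value_f_1_4(2.565)
print(f"p = {p:.3f}")
```

Under these assumptions the computed value lands close to the reported p=0.185, which is consistent with the statistic as printed.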

    Model and Appearance Based Analysis of Neuronal Morphology from Different Microscopy Imaging Modalities

    Neuronal morphology analysis is key to understanding how a brain works. This process requires a neuron imaging system with single-cell resolution; however, there is no feasible system for the human brain. Fortunately, knowledge gained from the model organism Drosophila melanogaster can be transferred to the human system. This dissertation explores the morphology analysis of Drosophila larvae at single-cell resolution in static images and image sequences, as well as across multiple microscopy imaging modalities. Our contributions are both computational methods for morphology quantification and analysis of the influence of the anatomical aspect. We develop novel model-and-appearance-based methods for morphology quantification and illustrate their significance in three neuroscience studies. Modeling the structure and dynamics of neuronal circuits helps explain how connectivity patterns are formed within a motor circuit and determine whether the connectivity map of neurons can be deduced from estimates of neuronal morphology. To address this problem, we study both boundary-based and centerline-based approaches for neuron reconstruction in static volumes. Neuronal mechanisms are related to the morphology dynamics, so the patterns of neuronal morphology changes are analyzed along with other aspects. In this case, the relationship between neuronal activity and morphology dynamics is explored to analyze locomotion procedures. Our tracking method models the morphology dynamics in the calcium image sequence designed for detecting neuronal activity. It follows the local-to-global design to handle calcium imaging issues and neuronal movement characteristics. Lastly, modeling the link between structural and functional development depicts the correlation between neuron growth and protein interactions. This requires the morphology analysis of different imaging modalities. 
It can be solved using part-wise volume segmentation with artificial templates, the standardized representation of neurons. Our method follows the global-to-local approach to solve both part-wise segmentation and registration across modalities. Our methods address common issues in automated morphology analysis, from extracting morphological features to tracking neurons, as well as mapping neurons across imaging modalities. The quantitative analysis delivered by our techniques enables a number of new applications and visualizations for advancing the investigation of phenomena in the nervous system.

    Scalable exploration of highly detailed and annotated 3D models

    With the widespread availability of mobile graphics terminals and WebGL-enabled browsers, 3D graphics over the Internet is thriving. Thanks to recent advances in 3D acquisition and modeling systems, high-quality 3D models are becoming increasingly common, and are now potentially available for ubiquitous exploration. In current 3D repositories, such as Blend Swap, 3D Café or Archive3D, 3D models available for download are mostly presented through a few user-selected static images. Online exploration is limited to simple orbiting and/or low-fidelity explorations of simplified models, since photorealistic rendering quality of complex synthetic environments is still hardly achievable within the real-time constraints of interactive applications, especially on low-powered mobile devices or script-based Internet browsers. Moreover, navigating inside 3D environments, especially on the now pervasive touch devices, is a non-trivial task, and usability is consistently improved by employing assisted navigation controls. In addition, 3D annotations are often used in order to integrate and enhance the visual information by providing spatially coherent contextual information, typically at the expense of introducing visual clutter. In this thesis, we focus on efficient representations for interactive exploration and understanding of highly detailed 3D meshes on common 3D platforms. For this purpose, we present several approaches exploiting constraints on the data representation for improving the streaming and rendering performance, and camera movement constraints in order to provide scalable navigation methods for interactive exploration of complex 3D environments. Furthermore, we study visualization and interaction techniques to improve the exploration and understanding of complex 3D models by exploiting guided motion control techniques to aid the user in discovering contextual information while avoiding cluttering the visualization. 
We demonstrate the effectiveness and scalability of our approaches both in large screen museum installations and on mobile devices, by performing interactive exploration of models ranging from 9M triangles to 940M triangles.

    Computer Vision-based Monitoring of Harvest Quality
