188 research outputs found

    Holistic interpretation of visual data based on topology: semantic segmentation of architectural facades

    The work presented in this dissertation is a step towards effectively incorporating contextual knowledge in the task of semantic segmentation. To date, with a few exceptions in the field, the use of context has been confined to the genre of the scene. Research has been directed towards enhancing appearance descriptors. While this is unarguably important, recent studies show that computer vision has reached a near-human level of performance in relying on these descriptors when objects have stable, distinctive surface properties and imaging conditions are proper. When these conditions are not met, humans exploit their knowledge about the intrinsic geometric layout of the scene to make local decisions. Computer vision lags behind when it comes to this asset. For this reason, we aim to bridge the gap by presenting algorithms for semantic segmentation of building facades that make use of scene topological aspects. We provide a classification scheme to carry out segmentation and recognition simultaneously. The algorithm solves a single optimization function and yields a semantic interpretation of facades, relying on the modeling power of probabilistic graphs and efficient discrete combinatorial optimization tools. We also tackle the same problem of semantic facade segmentation with a neural network approach. We attain accuracy figures that are on par with the state of the art in a fully automated pipeline, starting from pixelwise classifications obtained via Convolutional Neural Networks (CNN). These are then structurally validated through a cascade of Restricted Boltzmann Machines (RBM) and a Multi-Layer Perceptron (MLP) that regenerates the most likely layout. In the domain of architectural modeling, we consider geometric multi-model fitting. We introduce a novel guided sampling algorithm based on Minimum Spanning Trees (MST), which surpasses other propagation techniques in terms of robustness to noise. 
We make a number of additional contributions, such as a measure of model deviation that captures variations among fitted models.
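For readers unfamiliar with the data structure named in the abstract above, here is a minimal sketch of building a Euclidean minimum spanning tree over 2D points with Kruskal's algorithm. The abstract does not say how the dissertation uses the tree to guide sampling, so this shows only a generic MST construction; the function name and the sample points are hypothetical.

```python
import math


def minimum_spanning_tree(points):
    """Kruskal's algorithm on the complete Euclidean graph over `points`.

    Returns a list of (i, j, distance) edges forming the tree.
    This is a generic MST sketch, not the dissertation's sampling method.
    """
    n = len(points)
    # All pairwise edges, sorted by length (shortest first).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))  # union-find forest

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two components: keep it
            parent[root_i] = root_j
            tree.append((i, j, dist))
    return tree


# Hypothetical sample: three nearby points and one distant outlier.
sample_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
sample_tree = minimum_spanning_tree(sample_points)
```

On the sample, the outlier is attached by a single long edge, which illustrates why a tree over the data can expose points that lie far from the rest.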

    Exploring new representations and applications for motion analysis

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 153-164). The focus of motion analysis has been on estimating a flow vector for every pixel by matching intensities. In my thesis, I will explore motion representations beyond the pixel level and new applications to which these representations lead. I first focus on analyzing motion from video sequences. Traditional motion analysis suffers from the inappropriate modeling of the grouping relationship of pixels and from a lack of ground-truth data. Using layers as the interface for humans to interact with videos, we build a human-assisted motion annotation system to obtain ground-truth motion, missing in the literature, for natural video sequences. Furthermore, we show that with the layer representation, we can detect and magnify small motions to make them visible to human eyes. Then we move to a contour presentation to analyze the motion for textureless objects under occlusion. We demonstrate that simultaneous boundary grouping and motion analysis can solve challenging data, where the traditional pixel-wise motion analysis fails. In the second part of my thesis, I will show the benefits of matching local image structures instead of intensity values. We propose SIFT flow that establishes dense, semantically meaningful correspondence between two images across scenes by matching pixel-wise SIFT features. Using SIFT flow, we develop a new framework for image parsing by transferring the metadata information, such as annotation, motion and depth, from the images in a large database to an unknown query image. We demonstrate this framework using new applications such as predicting motion from a single image and motion synthesis via object transfer. 
Based on SIFT flow, we introduce a nonparametric scene parsing system using label transfer, with very promising experimental results suggesting that our system outperforms state-of-the-art techniques based on training classifiers. By Ce Liu. Ph.D.

    A brief survey of visual saliency detection

    Image-based recognition, 3D localization, and retro-reflectivity evaluation of high-quantity low-cost roadway assets for enhanced condition assessment

    Systematic condition assessment of high-quantity low-cost roadway assets such as traffic signs, guardrails, and pavement markings requires frequent reporting on location and up-to-date status of these assets. Today, most Departments of Transportation (DOTs) in the US collect data using camera-mounted vehicles to filter, annotate, organize, and present the data necessary for these assessments. However, the cost and complexity of collecting, analyzing, and reporting as-is conditions result in sparse and infrequent monitoring. Thus, some of the gains in efficiency are consumed by monitoring costs. This dissertation proposes to improve frequency, detail, and applicability of image-based condition assessment via automating detection, classification, and 3D localization of multiple types of high-quantity low-cost roadway assets using both images collected by the DOTs and online databases such as Google Street View images. To address the new requirements of the US Federal Highway Administration (FHWA), a new method is also developed that simulates nighttime visibility of traffic signs from images taken during daytime and measures their retro-reflectivity condition. To initiate detection and classification of high-quantity low-cost roadway assets from street-level images, a number of algorithms are proposed that automatically segment and localize high-level asset categories in 3D. The first set of algorithms focuses on the task of detecting and segmenting assets at high-level categories. More specifically, a method based on Semantic Texton Forest classifiers segments each geo-registered 2D video frame at the pixel level based on shape, texture, and color. A Structure from Motion (SfM) procedure reconstructs the road and its assets in 3D. Next, a voting scheme assigns the most observed asset category to each point in 3D. 
The experimental results from application of this method are promising; nevertheless, because this method relies on supervised ground-truth pixel labels for training, scaling it to various types of assets is challenging. To address this issue, a non-parametric image parsing method is proposed that leverages a lazy learning scheme for segmentation and recognition of roadway assets. The semi-supervised technique used in the proposed method does not need training and provides ground truth data in a more efficient manner. It is easily scalable to thousands of video frames captured during data collection. Once the high-level asset categories are detected, specific techniques need to be exploited to detect and classify the assets at a higher level of granularity. To this end, the performance of three computer vision algorithms is evaluated for classification of traffic signs in the presence of cluttered backgrounds and static and dynamic occlusions. Without making any prior assumptions about the location of traffic signs in 2D, the best performing method uses histograms of oriented gradients and color together with multiple one-vs-all Support Vector Machines, and classifies these assets into warning, regulatory, stop, and yield sign categories. To minimize the reliance on visual data collected by the DOTs and improve frequency and applicability of condition assessment, a new end-to-end procedure is presented that applies the above algorithms and creates a comprehensive inventory of traffic signs using Google Street View images. By processing images extracted using the Google Street View API and discriminative classification scores from all images that see a sign, the most probable 3D location of each traffic sign is derived and is shown on Google Earth using a dynamic heat map. A data card containing information about location, type, and condition of each detected traffic sign is also created. 
Finally, a computer vision-based algorithm is proposed that measures retro-reflectivity of traffic signs during daytime using a vehicle-mounted device. The algorithm simulates nighttime visibility of traffic signs from images taken during daytime and measures their retro-reflectivity. The technique is faster, cheaper, and safer than the state of the art, as it requires neither nighttime operation nor manual sign inspection. It also satisfies measurement guidelines set forth by FHWA both in terms of granularity and accuracy. To validate the techniques, new detailed video datasets and their ground truth were generated from a 2.2-mile smart road research facility and two interstate highways in the US. The comprehensive dataset contains over 11,000 annotated U.S. traffic sign images and exhibits large variations in sign pose, scale, background, illumination, and occlusion conditions. The performance of all algorithms was examined using these datasets. For retro-reflectivity measurement of traffic signs, experiments were conducted at different times of day and for different distances. Results were compared with a method recommended by ASTM standards. The experimental results show promise in scalability of these methods to reduce the time and effort required for developing road inventories, especially for assets such as guardrails and traffic lights that are not typically considered in 2D asset recognition methods, and also for multiple categories of traffic signs. The applicability of Google Street View images for inventory management purposes and the technique for retro-reflectivity measurement during daytime demonstrate strong potential in lowering inspection costs and improving safety in practical applications.
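The voting step mentioned in the abstract above (each 3D point receives the asset category it was most often observed as) can be sketched as a simple majority vote. This is an illustration under assumptions, not the dissertation's implementation: the function name, the (point_id, category) input format, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def vote_point_labels(observations):
    """Assign each 3D point its most frequently observed category.

    `observations` is an iterable of (point_id, category) pairs, one per
    frame in which the point was seen and labeled (an assumed format).
    Ties go to the category that was encountered first for that point.
    """
    votes = {}
    for point_id, category in observations:
        # One Counter per point accumulates per-category vote counts.
        votes.setdefault(point_id, Counter())[category] += 1
    return {
        point_id: counts.most_common(1)[0][0]
        for point_id, counts in votes.items()
    }


# Hypothetical labels: point 0 seen in three frames, point 1 in one.
sample_observations = [
    (0, "traffic sign"),
    (0, "traffic sign"),
    (0, "guardrail"),
    (1, "pavement marking"),
]
sample_labels = vote_point_labels(sample_observations)
```

A weighted variant (for example, weighting each vote by classifier confidence) would fit the same structure, but the abstract does not say whether the votes are weighted.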

    Deep Learning-Based Human Pose Estimation: A Survey

    Human pose estimation aims to locate the human body parts and build human body representation (e.g., body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been utilized in a wide range of applications including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although the recently developed deep learning-based solutions have achieved high performance in human pose estimation, there still remain challenges due to insufficient training data, depth ambiguities, and occlusion. The goal of this survey paper is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 240 research papers since 2014 are covered in this survey. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included. Quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, the challenges involved, applications, and future research directions are concluded. We also provide a regularly updated project page: https://github.com/zczcwh/DL-HPE

    Overview of Environment Perception for Intelligent Vehicles

    This paper presents a comprehensive literature review on environment perception for intelligent vehicles. The state-of-the-art algorithms and modeling methods for intelligent vehicles are given, with a summary of their pros and cons. Special attention is paid to methods for lane and road detection, traffic sign recognition, vehicle tracking, behavior analysis, and scene understanding. In addition, we provide information about datasets, common performance analysis, and perspectives on future research directions in this area.

    Multigranularity Representations for Human Inter-Actions: Pose, Motion and Intention

    Tracking people and their body pose in videos is a central problem in computer vision. Standard tracking representations reason about temporal coherence of detected people and body parts. They have difficulty tracking targets under partial occlusions or rare body poses, where detectors often fail, since the number of training examples is often too small to deal with the exponential variability of such configurations. We propose tracking representations that track and segment people and their body pose in videos by exploiting information at multiple detection and segmentation granularities when available: whole body, parts, or point trajectories. Detections and motion estimates provide contradictory information in case of false alarm detections or leaking motion affinities. We consolidate contradictory information via graph steering, an algorithm for simultaneous detection and co-clustering in a two-granularity graph of motion trajectories and detections, that corrects motion leakage between correctly detected objects, while being robust to false alarms or spatially inaccurate detections. We first present a motion segmentation framework that exploits long range motion of point trajectories and large spatial support of image regions. We show that the resulting video segments adapt to targets under partial occlusions and deformations. Second, we augment motion-based representations with object detection for dealing with motion leakage. We demonstrate how to combine dense optical flow trajectory affinities with repulsions from confident detections to reach a global consensus of detection and tracking in crowded scenes. Third, we study human motion and pose estimation. We segment hard-to-detect, fast-moving body limbs from their surrounding clutter and match them against pose exemplars to detect body pose under fast motion. We employ on-the-fly human body kinematics to improve tracking of body joints under wide deformations. 
We use motion segmentability of body parts for re-ranking a set of body joint candidate trajectories and jointly infer multi-frame body pose and video segmentation. We show empirically that such a multi-granularity tracking representation is worthwhile, obtaining significantly more accurate multi-object tracking and detailed body pose estimation in popular datasets.

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines are typically dependent on the images captured in real-world environments. This means that images might be affected by different sources of perturbations (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has, hence, attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before being processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, and it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
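The evaluation protocol in the abstract above perturbs test images with Gaussian and uniform noise at several levels. A minimal sketch of that perturbation step follows, using only the standard library. It does not implement the CORF operator or the classifier; the function names, the [0, 1] pixel range, and the noise levels are assumptions chosen for illustration.

```python
import random


def _clip(value):
    """Keep a pixel intensity inside the assumed [0, 1] range."""
    return min(1.0, max(0.0, value))


def add_gaussian_noise(image, sigma, rng):
    """Return a copy of `image` with zero-mean Gaussian noise added.

    `image` is a list of rows of floats in [0, 1] (an assumed format).
    """
    return [[_clip(px + rng.gauss(0.0, sigma)) for px in row] for row in image]


def add_uniform_noise(image, amplitude, rng):
    """Return a copy of `image` with noise drawn from U(-amplitude, amplitude)."""
    return [
        [_clip(px + rng.uniform(-amplitude, amplitude)) for px in row]
        for row in image
    ]


# Hypothetical 2x3 test image and noise levels; a seeded generator keeps
# the perturbed test sets reproducible across runs.
rng = random.Random(0)
clean_image = [[0.0, 0.5, 1.0], [0.25, 0.75, 0.5]]
gaussian_sets = {s: add_gaussian_noise(clean_image, s, rng) for s in (0.05, 0.1, 0.2)}
uniform_sets = {a: add_uniform_noise(clean_image, a, rng) for a in (0.05, 0.1, 0.2)}
```

Each perturbed set would then be fed to the models under comparison, with noise applied only at test time so that it remains unseen during training.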