
    People counting using an overhead fisheye camera

    As climate change concerns grow, reducing energy consumption is seen as one of many potential solutions. In the US, a considerable amount of energy is wasted in commercial buildings due to sub-optimal heating, ventilation and air conditioning that operates with no knowledge of the occupancy level in various rooms and open areas. In this thesis, I develop an approach to passive occupancy estimation that does not require occupants to carry any type of beacon, but instead uses an overhead camera with a fisheye lens (360-by-180-degree field of view). The difficulty with fisheye images is that occupants may appear not only upright but also upside-down, horizontal, and diagonal, so algorithms developed for typical side-mounted, standard-lens cameras tend to fail. As the top-performing people-detection algorithms today use deep learning, a logical step would be to develop and train a new neural-network model. However, no large fisheye-image datasets with person annotations exist to facilitate training such a model. Therefore, I developed two people-counting methods that leverage YOLO (version 3), a state-of-the-art object-detection method trained on standard datasets. In one approach, YOLO is applied to 24 rotated and highly-overlapping windows, and the results are post-processed to produce a people count. In the other approach, regions of interest are first extracted via background subtraction, and only windows that include such regions are supplied to YOLO and post-processed. I carried out an extensive experimental evaluation of both algorithms and showed their superior performance compared to a benchmark method.
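    The core geometric trick in the first approach is to rotate the fisheye image so that people in each window appear roughly upright for the standard detector, then map each detection back into the original fisheye frame. A minimal sketch of that rotation bookkeeping, assuming 24 evenly spaced windows around the image center (`window_angles` and `rotate_point` are illustrative helpers, not code from the thesis):

```python
import numpy as np

def window_angles(num_windows=24):
    """Evenly spaced rotation angles (degrees) for overlapping windows
    around the fisheye image center -- 24 windows gives 15-degree steps."""
    return [i * 360.0 / num_windows for i in range(num_windows)]

def rotate_point(pt, angle_deg, center):
    """Map a detection point from a window rotated by `angle_deg`
    back into original fisheye-image coordinates."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ (np.asarray(pt, dtype=float) - center) + center
```

    After detections from all windows are mapped back like this, the highly overlapping windows produce duplicate detections, so a post-processing step such as non-maximum suppression is needed before the final people count.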

    Learning to count with deep object features

    Learning to count is a learning strategy recently proposed in the literature for problems where estimating the number of object instances in a scene is the final objective. In this framework, the task of learning to detect and localize individual object instances is seen as a harder task that can be evaded by casting the problem as that of computing a regression value from hand-crafted image features. In this paper we explore the features that are learned when training a counting convolutional neural network, in order to understand their underlying representation. To this end we define a counting problem for MNIST data and show that the internal representation of the network is able to classify digits, despite the fact that no direct supervision was provided for them during training. We also present preliminary results on a deep network that is able to count the number of pedestrians in a scene. Comment: This paper has been accepted at the Deep Vision Workshop at CVPR 201
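    The counting-as-regression idea can be illustrated in a few lines: instead of detecting each instance, fit a regressor from a global image feature directly to the count. The sketch below uses synthetic blob images and a least-squares fit on total pixel intensity (a deliberately crude stand-in for the paper's CNN features; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_image(count, size=28):
    """Synthetic image with `count` small bright blobs (a toy stand-in
    for scenes containing `count` object instances)."""
    img = np.zeros((size, size))
    for _ in range(count):
        y, x = rng.integers(2, size - 2, 2)
        img[y - 1:y + 2, x - 1:x + 2] = 1.0
    return img

# Counting as regression: no detection or localization -- just map a
# scalar image feature (total intensity) to a count via least squares.
counts = rng.integers(0, 10, 200)
features = np.array([[make_image(c).sum(), 1.0] for c in counts])
w, *_ = np.linalg.lstsq(features, counts.astype(float), rcond=None)

def predict(img):
    """Estimate the instance count from the fitted linear model."""
    return float(np.array([img.sum(), 1.0]) @ w)
```

    The regressor works here because total intensity grows roughly linearly with the number of blobs; the paper's point is that a CNN trained this way nonetheless learns internal representations rich enough to classify the individual objects.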

    Understanding Traffic Density from Large-Scale Web Camera Data

    Understanding traffic density from large-scale web camera (webcam) videos is a challenging problem because such videos have low spatial and temporal resolution, high occlusion, and large perspective variation. To deeply understand traffic density, we explore both deep-learning-based and optimization-based methods. To avoid individual vehicle detection and tracking, both methods map the image into a vehicle density map, one based on rank-constrained regression and the other based on fully convolutional networks (FCN). The regression-based method learns different weights for different blocks of the image to increase the degrees of freedom of the weights and embed perspective information. The FCN-based method jointly estimates the vehicle density map and vehicle count with a residual learning framework to perform end-to-end dense prediction, allowing arbitrary image resolution and adapting to different vehicle scales and perspectives. We analyze and compare both methods, and use insights from the optimization-based method to improve the deep model. Since existing datasets do not cover all the challenges in our work, we collected and labelled a large-scale traffic video dataset containing 60 million frames from 212 webcams. Both methods are extensively evaluated and compared on different counting tasks and datasets. The FCN-based method significantly reduces the mean absolute error from 10.99 to 5.31 on the public TRANCOS dataset compared with the state-of-the-art baseline. Comment: Accepted by CVPR 2017. Preprint version was uploaded on http://welcome.isr.tecnico.ulisboa.pt/publications/understanding-traffic-density-from-large-scale-web-camera-data
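    Density-map counting, which both methods above rely on, has a simple invariant: each annotated vehicle contributes a Gaussian of unit mass, so the count is recovered by summing the map, with no detection step. A minimal sketch of building such a ground-truth density map from point annotations (the helper name and fixed `sigma` are assumptions for illustration; the paper's maps are predicted by the FCN, not hand-built):

```python
import numpy as np

def density_map(points, shape, sigma=2.0):
    """Ground-truth density map: one normalized Gaussian per annotated
    object, so the map integrates (sums) to the object count."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape)
    for (py, px) in points:
        g = np.exp(-((yy - py) ** 2 + (xx - px) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()  # normalize: each object contributes mass 1
    return dmap

# Three annotated vehicles -> the map sums to 3, regardless of blob size.
pts = [(10, 12), (30, 40), (22, 22)]
dmap = density_map(pts, (64, 64))
count = dmap.sum()
```

    An FCN trained to regress such maps inherits this property: summing its output over any region gives an estimated count for that region, which is what allows joint density and count estimation.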

    LCrowdV: Generating Labeled Videos for Simulation-based Crowd Behavior Learning

    We present a novel procedural framework to generate an arbitrary number of labeled crowd videos (LCrowdV). The resulting crowd video datasets are used to design accurate algorithms or training models for crowded-scene understanding. Our overall approach is composed of two components: a procedural simulation framework for generating crowd movements and behaviors, and a procedural rendering framework to generate different videos or images. Each video or image is automatically labeled based on the environment, number of pedestrians, density, behavior, flow, lighting conditions, viewpoint, noise, etc. Furthermore, we can increase the realism by combining synthetically-generated behaviors with real-world background videos. We demonstrate the benefits of LCrowdV over prior labeled crowd datasets by improving the accuracy of pedestrian detection and crowd behavior classification algorithms. LCrowdV will be released on the WWW.
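    The key advantage of procedural generation is that labels come for free: because the generator places every pedestrian itself, it can emit the count, density, and bounding boxes alongside each frame with no manual annotation. A toy sketch of that pattern (everything here is illustrative; LCrowdV's actual pipeline uses full crowd simulation and rendering):

```python
import numpy as np

rng = np.random.default_rng(1)

def synth_frame(num_peds, size=64):
    """Render a toy frame with `num_peds` crude pedestrian silhouettes
    and return it with labels derived directly from the generator."""
    frame = np.zeros((size, size), dtype=np.uint8)
    boxes = []
    for _ in range(num_peds):
        y, x = rng.integers(0, size - 8, 2)
        frame[y:y + 8, x:x + 3] = 255        # 3x8 "person" silhouette
        boxes.append((x, y, 3, 8))           # label is known by construction
    label = {"count": num_peds,
             "density": num_peds / (size * size),
             "boxes": boxes}
    return frame, label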