
    Counting with limited supervision

    Counting is among the first abstract analysis tasks that we learn, and it is the most fundamental way we can quantitatively understand data. From an early age, people are able to accurately count objects with minimal direction, even when the type of object to be counted is completely unknown. While prior machine-learning-based methods have addressed the problem of counting previously unseen kinds of objects, known as class-agnostic counting, they have all required both user input during deployment, in the form of exemplar images that define the type to be counted, and the locations of every object during training to act as supervision. In this thesis, we aim to advance automated counting methods with the goal of replicating the human ability to perform completely naive counting. To achieve this, we recognise that counting is composed, at its heart, of two different tasks: instance finding and repetition recognition. We explore these problems first in isolation and then together. Within this exploration, we introduce several paradigms to the field of class-agnostic counting, including exemplar-free counting, weak supervision, and simultaneous multi-class counting, as well as more abstract concepts that we believe should be considered, such as valid-but-unknown counts and the distinction between intrinsic and non-intrinsic tasks. Over the course of this thesis, we propose three methods which demonstrate that class-agnostic counting can be achieved with less information during both training and deployment than previously postulated. Specifically, we show that large sets of high-dimensional data can be clustered flexibly and accurately using only relational pairwise labels; that robust counting can be achieved on novel classes without requiring exemplar images to define the type during training or inference; and that, under certain conditions, such a method can be trained using only image-wise scalar count supervision. We also propose two datasets to facilitate training and to reliably evaluate the performance of these methods alongside other contemporary work. Together, these contributions create a strong base for counting in settings with limited supervision and minimal user input.
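
    The final claim above, that a class-agnostic counter can be trained using only image-wise scalar count supervision, can be illustrated with a minimal, hypothetical PyTorch sketch (not the thesis's actual architecture or loss): a small convolutional network predicts a non-negative density map, and the only training signal is the absolute error between the map's spatial sum and the ground-truth count. The names DensityCounter and count_loss are invented for the example.

    import torch
    import torch.nn as nn

    class DensityCounter(nn.Module):
        """Tiny fully convolutional counter: image -> non-negative density map."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 1, 1), nn.ReLU(),   # final ReLU keeps the density non-negative
            )

        def forward(self, x):
            return self.net(x)

    def count_loss(model, images, counts):
        """Weak supervision: the only label is one scalar count per image.
        The predicted count is the spatial sum of the density map."""
        density = model(images)                   # (B, 1, H, W)
        predicted = density.sum(dim=(1, 2, 3))    # (B,)
        return torch.abs(predicted - counts).mean()

    # Usage with random stand-in data for a real counting dataset.
    model = DensityCounter()
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
    images = torch.rand(4, 3, 128, 128)
    counts = torch.tensor([3.0, 7.0, 0.0, 12.0])
    loss = count_loss(model, images, counts)
    loss.backward()
    optimiser.step()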

    Scene Monitoring With A Forest Of Cooperative Sensors

    In this dissertation, we present vision-based scene interpretation methods for monitoring people and vehicles, in real time, within a busy environment using a forest of cooperative electro-optical (EO) sensors. We have developed novel video understanding algorithms with learning capability to detect and categorize people and vehicles, track them within a camera, and hand this information off across multiple networked cameras for multi-camera tracking. The ability to learn removes the need for extensive manual intervention, site models, and camera calibration, and provides adaptability to changing environmental conditions. For object detection and categorization in the video stream, a two-step detection procedure is used. First, regions of interest are determined using a novel hierarchical background subtraction algorithm that uses color and gradient information for interest region detection. Second, objects are located and classified within these regions using a weakly supervised learning mechanism based on co-training that employs motion and appearance features. The main contribution of this approach is that it is an online procedure in which separate views (features) of the data are used for co-training, while the combined view (all features) is used to make classification decisions in a single boosted framework. The advantage of this approach is that it requires only a few initial training samples and can automatically adjust its parameters online to improve detection and classification performance. Once objects are detected and classified, they are tracked in individual cameras. Single-camera tracking is performed using a voting-based approach that utilizes color and shape cues to establish correspondence within each camera. The tracker can handle multiple occluded objects. Next, the objects are tracked across a forest of cameras with non-overlapping views. This is a hard problem for two reasons. First, the observations of an object are often widely separated in time and space when viewed from non-overlapping cameras. Second, the appearance of an object in one camera view might be very different from its appearance in another camera view due to differences in illumination, pose, and camera properties. To deal with the first problem, the system learns the inter-camera relationships to constrain track correspondences. These relationships are learned as a multivariate probability density of space-time variables (object entry and exit locations, velocities, and inter-camera transition times) using Parzen windows. To handle the appearance change of an object as it moves from one camera to another, we show that all color transfer functions from a given camera to another camera lie in a low-dimensional subspace. The tracking algorithm learns this subspace using probabilistic principal component analysis and uses it for appearance matching. The proposed system learns the camera topology and the subspace of inter-camera color transfer functions during a training phase. Once training is complete, correspondences are assigned within a maximum a posteriori (MAP) estimation framework using both location and appearance cues. Extensive experiments and deployment of this system in realistic scenarios have demonstrated the robustness of the proposed methods. The proposed system was able to detect and classify targets and seamlessly track them across multiple cameras. It also generated a summary, in the form of key frames and textual descriptions of trajectories, for a monitoring officer to perform final analysis and response decisions. This level of interpretation was the goal of our research effort, and we believe that it is a significant step forward in the development of intelligent systems that can deal with the complexities of real-world scenarios.
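
    The inter-camera relationships described above are learned as a multivariate Parzen-window density over space-time variables. The snippet below is a minimal illustration of that general technique (not the dissertation's code) using SciPy's Gaussian kernel density estimator on synthetic data; the feature layout (exit location, entry location, exit speed, transition time) and all of the numbers are invented for the example.

    import numpy as np
    from scipy.stats import gaussian_kde

    # Hypothetical training data: one row per corresponded track pair between two
    # cameras with non-overlapping views. Columns: exit location (x, y) in camera A,
    # entry location (x, y) in camera B, exit speed, and inter-camera transition time.
    rng = np.random.default_rng(0)
    n = 500
    train = np.column_stack([
        rng.normal(310.0, 5.0, n), rng.normal(240.0, 5.0, n),   # exit location (px)
        rng.normal(20.0, 5.0, n),  rng.normal(250.0, 5.0, n),   # entry location (px)
        rng.normal(1.4, 0.2, n),                                # exit speed (m/s)
        rng.normal(8.0, 1.0, n),                                # transition time (s)
    ])

    # Parzen-window estimate of the joint space-time density: a Gaussian kernel is
    # placed on every training sample (bandwidth set by Scott's rule in gaussian_kde).
    space_time_density = gaussian_kde(train.T)

    # At correspondence time, score a candidate exit/entry pair: a high density means
    # the observation is consistent with the learned inter-camera relationship.
    candidate = np.array([312.0, 238.0, 22.0, 249.0, 1.5, 7.6])
    print("space-time likelihood:", space_time_density(candidate.reshape(-1, 1))[0])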

    Novel deep learning architectures for marine and aquaculture applications

    Alzayat Saleh's research applied artificial intelligence and machine learning to autonomously recognise fish and their morphological features from digital images. He created new deep learning architectures that solve various computer vision problems specific to the marine and aquaculture context, and found that these techniques can facilitate aquaculture management and environmental protection. Fisheries and conservation agencies can use his results for better monitoring strategies and sustainable fishing practices.

    Weighted Bayesian Gaussian Mixture Model for Roadside LiDAR Object Detection

    Background modeling is widely used in intelligent surveillance systems to detect moving targets by subtracting the static background components. Most roadside LiDAR object detection methods filter out foreground points by comparing new data points to pre-trained background references built from descriptive statistics over many frames (e.g., voxel density, number of neighbors, maximum distance). However, these solutions are inefficient under heavy traffic, and their parameter values are hard to transfer from one scenario to another. In early studies, the probabilistic background modeling methods widely used in video-based systems were considered unsuitable for roadside LiDAR surveillance because of the sparse and unstructured point cloud data. In this paper, the raw LiDAR data are transformed into a structured representation based on the elevation and azimuth value of each LiDAR point. This high-order tensor representation removes that barrier and allows efficient high-dimensional multivariate analysis for roadside LiDAR background modeling. The Bayesian Nonparametric (BNP) approach integrates the intensity value with the 3D measurements so that the measurement data are exploited in full. The proposed method was compared against two state-of-the-art roadside LiDAR background models, a computer vision benchmark, and deep learning baselines, and evaluated at the point, object, and path levels under heavy traffic and challenging weather. The resulting multimodal Weighted Bayesian Gaussian Mixture Model (GMM) can handle dynamic backgrounds with noisy measurements and substantially enhances infrastructure-based LiDAR object detection, from which various 3D models for smart city applications could be created.
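
    As a rough illustration of the two ideas in this abstract (not the authors' code), the sketch below maps raw (x, y, z, intensity) returns onto an elevation/azimuth grid and fits a Bayesian Gaussian mixture over the (range, intensity) samples of a single grid cell with scikit-learn's BayesianGaussianMixture. The grid resolution, the assumed vertical field of view, the synthetic samples, and the foreground threshold are all invented for the example.

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    def to_structured_grid(points, n_elev=32, n_azim=1024):
        """Map raw (x, y, z, intensity) returns onto an elevation/azimuth grid.
        Returns per-point cell indices and (range, intensity) features. The assumed
        vertical field of view (-25 to +15 degrees) is illustrative only."""
        x, y, z, intensity = points.T
        r = np.sqrt(x**2 + y**2 + z**2)
        azim = np.arctan2(y, x)                                    # [-pi, pi)
        elev = np.arcsin(z / np.maximum(r, 1e-6))
        azim_idx = ((azim + np.pi) / (2 * np.pi) * n_azim).astype(int) % n_azim
        elev_idx = np.clip(((elev + np.radians(25)) / np.radians(40) * n_elev).astype(int),
                           0, n_elev - 1)
        return elev_idx, azim_idx, np.column_stack([r, intensity])

    # Example of the structured representation on a random point cloud.
    rng = np.random.default_rng(1)
    cloud = rng.uniform(-1.0, 1.0, (5000, 4)) * np.array([30.0, 30.0, 2.0, 50.0])
    elev_idx, azim_idx, feats = to_structured_grid(cloud)

    # Background model for one grid cell: a Bayesian GMM over the (range, intensity)
    # samples that this cell has accumulated across many frames. The Dirichlet prior
    # suppresses unused mixture components.
    cell_samples = np.column_stack([rng.normal(24.0, 0.15, 400),   # static background range (m)
                                    rng.normal(12.0, 2.0, 400)])   # background intensity
    background = BayesianGaussianMixture(n_components=5,
                                         weight_concentration_prior=1e-2).fit(cell_samples)

    # A new return is flagged as foreground when its log-likelihood under the
    # background model is low (the -10.0 threshold is purely illustrative).
    new_return = np.array([[17.5, 40.0]])          # closer and brighter: likely a vehicle
    print("foreground:", background.score_samples(new_return)[0] < -10.0)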

    Object-Oriented Dynamics Learning through Multi-Level Abstraction

    Object-based approaches to learning action-conditioned dynamics have demonstrated promise for generalization and interpretability. However, existing approaches suffer from structural limitations and optimization difficulties in common environments with multiple dynamic objects. In this paper, we present a novel self-supervised learning framework, called Multi-level Abstraction Object-oriented Predictor (MAOP), which employs a three-level learning architecture that enables efficient object-based dynamics learning from raw visual observations. We also design a spatial-temporal relational reasoning mechanism for MAOP to support instance-level dynamics learning and handle partial observability. Our results show that MAOP significantly outperforms previous methods in terms of sample efficiency and generalization over novel environments when learning environment models. We also demonstrate that the learned dynamics models enable efficient planning in unseen environments, comparable to planning with the true environment models. In addition, MAOP learns semantically and visually interpretable disentangled representations. Comment: Accepted to the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2020.
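
    MAOP's three-level architecture is not reproduced here, but the general idea of object-oriented, action-conditioned dynamics prediction with relational reasoning can be sketched in a few lines of PyTorch. In this hypothetical ObjectDynamics module, pairwise relational effects between object state vectors are aggregated per object and combined with the action to predict each object's next state; the state and action dimensions, network sizes, and names are arbitrary choices for the example.

    import torch
    import torch.nn as nn

    class ObjectDynamics(nn.Module):
        """Minimal object-centric, action-conditioned dynamics step (not MAOP itself)."""
        def __init__(self, state_dim=4, action_dim=2, hidden=64):
            super().__init__()
            self.relation = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ReLU(),
                                          nn.Linear(hidden, hidden))
            self.update = nn.Sequential(nn.Linear(state_dim + hidden + action_dim, hidden),
                                        nn.ReLU(), nn.Linear(hidden, state_dim))

        def forward(self, states, action):
            # states: (B, N, state_dim), action: (B, action_dim)
            B, N, D = states.shape
            senders = states.unsqueeze(2).expand(B, N, N, D)
            receivers = states.unsqueeze(1).expand(B, N, N, D)
            # Aggregate pairwise relational effects for each receiving object.
            effects = self.relation(torch.cat([senders, receivers], dim=-1)).sum(dim=2)
            act = action.unsqueeze(1).expand(B, N, action.shape[-1])
            delta = self.update(torch.cat([states, effects, act], dim=-1))
            return states + delta                  # predicted next object states

    # Usage: 3 objects with (x, y, vx, vy) states and a 2-D action.
    model = ObjectDynamics()
    next_states = model(torch.rand(1, 3, 4), torch.rand(1, 2))
    print(next_states.shape)                       # torch.Size([1, 3, 4])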

    A review of silhouette extraction algorithms for use within visual hull pipelines

    Markerless motion capture would permit the study of human biomechanics in environments where marker-based systems are impractical, e.g. outdoors or underwater. The visual hull tool may enable such data to be recorded, but it requires accurate detection of the silhouette of the object in multiple camera views. This paper reviews the top-performing algorithms available to date for silhouette extraction, with the visual hull in mind as the downstream application; the rationale is that higher-quality silhouettes lead to higher-quality visual hulls, and consequently to better measurement of movement. This paper is the first attempt in the literature to compare silhouette extraction algorithms that belong to different fields of Computer Vision, namely background subtraction, semantic segmentation, and multi-view segmentation. It was found that several algorithms exist that would be substantial improvements over the silhouette extraction algorithms traditionally used in visual hull pipelines. In particular, FgSegNet v2 (a background subtraction algorithm), DeepLabv3+ JFT (a semantic segmentation algorithm), and Djelouah 2013 (a multi-view segmentation algorithm) are the most accurate and promising methods for the extraction of silhouettes from 2D images to date, and could seamlessly be integrated within a visual hull pipeline for studies of human movement or biomechanics.
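
    To make the silhouette-to-visual-hull pipeline concrete, the sketch below (not taken from the review) extracts silhouettes with OpenCV's stock MOG2 background subtractor, where a learned extractor such as FgSegNet v2 or DeepLabv3+ would normally be substituted, and then carves a voxel grid into a visual hull by keeping only the voxels that project inside every silhouette. The camera projection matrix, the voxel grid, and the random frames are all invented for the example.

    import numpy as np
    import cv2

    def extract_silhouettes(frames):
        """Per-frame foreground masks from a stock MOG2 background subtractor; a
        learned extractor (e.g. FgSegNet v2, DeepLabv3+) would slot in here instead."""
        subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
        return [subtractor.apply(frame) > 0 for frame in frames]

    def visual_hull(silhouettes, projections, voxels):
        """Keep a voxel if it projects inside the silhouette in every camera view.
        `projections` are hypothetical 3x4 camera matrices; `voxels` is (M, 3) centres."""
        homog = np.hstack([voxels, np.ones((len(voxels), 1))])      # (M, 4)
        keep = np.ones(len(voxels), dtype=bool)
        for sil, P in zip(silhouettes, projections):
            uvw = homog @ P.T                                       # (M, 3) homogeneous pixels
            u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
            v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
            inside = (u >= 0) & (u < sil.shape[1]) & (v >= 0) & (v < sil.shape[0])
            keep &= inside
            keep[inside] &= sil[v[inside], u[inside]]
        return voxels[keep]

    # Usage with one synthetic camera: random frames stand in for real footage and a
    # simple affine projection maps the [-1, 1]^3 voxel cube onto a 320x240 image.
    frames = [np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8) for _ in range(30)]
    silhouettes = extract_silhouettes(frames)
    P = np.array([[100.0, 0.0, 0.0, 160.0],
                  [0.0, 100.0, 0.0, 120.0],
                  [0.0, 0.0, 0.0, 1.0]])
    voxels = np.random.uniform(-1.0, 1.0, (2000, 3))
    print("hull voxels kept:", len(visual_hull([silhouettes[-1]], [P], voxels)))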