Counting with limited supervision
Counting is among the first abstract analysis tasks that we learn. It is the most fundamental way we can quantitatively understand data. From an early age, people are able to count objects accurately with minimal direction, even when the type of object to be counted is completely unknown. While prior machine-learning-based methods have addressed the problem of counting previously unseen kinds of objects, known as class-agnostic counting, they have all required both user input during deployment, in the form of exemplar images that define the type to be counted, and the location of every object during training to act as supervision.
In this thesis, we aim to advance automated counting methods with the goal of replicating the human ability to perform completely naive counting. To achieve this, we recognise that counting is, at its heart, composed of two different tasks: instance finding and repetition recognition. We explore these problems first in isolation and then together. Within this exploration, we introduce various paradigms to the field of class-agnostic counting, including exemplar-free counting, weak supervision, simultaneous multi-class counting, and more abstract concepts that we believe should be considered, such as valid-but-unknown counts and the distinction between intrinsic and non-intrinsic tasks.
Over the course of this thesis, we propose three methods which demonstrate that class-agnostic counting can be achieved with less information than previously postulated, during both training and deployment. Specifically, we show that large sets of high-dimensional data can be clustered flexibly and accurately using only relational pairwise labels, that robust counting can be achieved on novel classes without requiring exemplar images to define the type during training or inference, and additionally that, under certain conditions, such a method can be trained using only image-wise scalar count supervision.
We also propose two datasets to facilitate training and reliably evaluate the performance of said methods alongside other contemporary work. Together, these contributions create a strong base for counting in settings with limited supervision and minimal user input.
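The idea of training with only image-wise scalar count supervision can be illustrated by a loss that compares the integral of a predicted density map against a single ground-truth count per image. The sketch below is a minimal illustration of that supervision signal, not the thesis's actual method; the map shape and values are invented for the example.

```python
import numpy as np

def count_loss(density_map, true_count):
    """L1 loss between the integral of a predicted density map
    and a scalar ground-truth count (image-wise supervision only:
    no per-object location labels are needed)."""
    predicted_count = density_map.sum()
    return abs(predicted_count - true_count)

# Toy example: a 4x4 "density map" whose total mass should equal 3 objects
pred = np.full((4, 4), 3.0 / 16.0)  # uniform mass summing to 3.0
print(count_loss(pred, 3))          # ~0: the map already integrates to 3
```

Because the supervision touches only the sum, any density map with the right integral achieves zero loss; this is the sense in which scalar counts are a much weaker signal than per-object dot annotations.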
Scene Monitoring With A Forest Of Cooperative Sensors
In this dissertation, we present vision-based scene interpretation methods for monitoring of people and vehicles, in real time, within a busy environment using a forest of co-operative electro-optical (EO) sensors. We have developed novel video understanding algorithms with learning capability, to detect and categorize people and vehicles, track them within a camera, and hand off this information across multiple networked cameras for multi-camera tracking. The ability to learn prevents the need for extensive manual intervention, site models, and camera calibration, and provides adaptability to changing environmental conditions. For object detection and categorization in the video stream, a two-step detection procedure is used. First, regions of interest are determined using a novel hierarchical background subtraction algorithm that uses color and gradient information for interest region detection. Second, objects are located and classified from within these regions using a weakly supervised learning mechanism based on co-training that employs motion and appearance features. The main contribution of this approach is that it is an online procedure in which separate views (features) of the data are used for co-training, while the combined view (all features) is used to make classification decisions in a single boosted framework. The advantage of this approach is that it requires only a few initial training samples and can automatically adjust its parameters online to improve the detection and classification performance. Once objects are detected and classified, they are tracked in individual cameras. Single-camera tracking is performed using a voting-based approach that utilizes color and shape cues to establish correspondence in individual cameras. The tracker has the capability to handle multiple occluded objects. Next, the objects are tracked across a forest of cameras with non-overlapping views. This is a hard problem for two reasons.
First, the observations of an object are often widely separated in time and space when viewed from non-overlapping cameras. Second, the appearance of an object in one camera view might be very different from its appearance in another camera view due to the differences in illumination, pose, and camera properties. To deal with the first problem, the system learns the inter-camera relationships to constrain track correspondences. These relationships are learned in the form of a multivariate probability density of space-time variables (object entry and exit locations, velocities, and inter-camera transition times) using Parzen windows. To handle the appearance change of an object as it moves from one camera to another, we show that all color transfer functions from a given camera to another camera lie in a low-dimensional subspace. The tracking algorithm learns this subspace by using probabilistic principal component analysis and uses it for appearance matching. The proposed system learns the camera topology and the subspace of inter-camera color transfer functions during a training phase. Once the training is complete, correspondences are assigned using the maximum a posteriori (MAP) estimation framework using both the location and appearance cues. Extensive experiments and deployment of this system in realistic scenarios have demonstrated the robustness of the proposed methods. The proposed system was able to detect and classify targets, and seamlessly tracked them across multiple cameras. It also generated a summary, in terms of key frames and a textual description of trajectories, for a monitoring officer's final analysis and response decision. This level of interpretation was the goal of our research effort, and we believe that it is a significant step forward in the development of intelligent systems that can deal with the complexities of real-world scenarios.
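The Parzen-window estimate mentioned above can be sketched in one dimension, e.g. over inter-camera transition times: a Gaussian kernel is centred on each training observation and the kernels are averaged. The bandwidth, the sample values, and the restriction to one variable are all assumptions for illustration; the dissertation's density is multivariate over entry/exit locations, velocities, and transition times.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Gaussian Parzen-window (kernel density) estimate of a 1-D
    density at x, from training samples with bandwidth h."""
    samples = np.asarray(samples, dtype=float)
    kernels = np.exp(-0.5 * ((x - samples) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean()

# Hypothetical transition times (seconds) observed between two cameras
times = [8.2, 9.1, 8.8, 9.5, 8.5]
print(parzen_density(9.0, times, h=0.5))   # high: consistent with training
print(parzen_density(30.0, times, h=0.5))  # near zero: implausible transition
```

At correspondence time, a candidate match whose space-time variables fall in a low-density region is penalised, which is how the learned density constrains track hand-off between cameras.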
Novel deep learning architectures for marine and aquaculture applications
Alzayat Saleh's research was in the area of artificial intelligence and machine learning to autonomously recognise fish and their morphological features from digital images. Here he created new deep learning architectures that solved various computer vision problems specific to the marine and aquaculture context. He found that these techniques can facilitate aquaculture management and environmental protection. Fisheries and conservation agencies can use his results for better monitoring strategies and sustainable fishing practices.
Weighted Bayesian Gaussian Mixture Model for Roadside LiDAR Object Detection
Background modeling is widely used for intelligent surveillance systems to
detect moving targets by subtracting the static background components. Most
roadside LiDAR object detection methods filter out foreground points by
comparing new data points to pre-trained background references based on
descriptive statistics over many frames (e.g., voxel density, number of
neighbors, maximum distance). However, these solutions are inefficient under
heavy traffic, and parameter values are hard to transfer from one scenario to
another. In early studies, the probabilistic background modeling methods widely
used for the video-based system were considered unsuitable for roadside LiDAR
surveillance systems due to the sparse and unstructured point cloud data. In
this paper, the raw LiDAR data were transformed into a structured
representation based on the elevation and azimuth value of each LiDAR point.
With this high-order tensor representation, we remove the barrier to
efficient high-dimensional multivariate analysis for roadside LiDAR
background modeling. The proposed Bayesian Nonparametric (BNP) approach
integrates the intensity value with the 3D measurements so that the full 3D
and intensity information is exploited. The proposed method was compared
against two state-of-the-art roadside LiDAR background models, a computer
vision benchmark, and deep learning baselines, evaluated at the point,
object, and path levels under heavy traffic and in challenging weather. The
resulting multimodal Weighted Bayesian Gaussian Mixture Model (GMM) handles
dynamic backgrounds with noisy measurements and substantially improves
infrastructure-based LiDAR object detection, enabling a variety of 3D
modeling applications for smart cities.
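The structuring step described above, mapping raw LiDAR points onto an elevation-azimuth grid, can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the bin counts (32 elevation rings, 360 azimuth columns) and the choice to store range in each cell are assumptions.

```python
import numpy as np

def to_range_image(points, n_elev=32, n_azim=360):
    """Project raw (x, y, z) LiDAR points onto a structured
    elevation-azimuth grid, storing each point's range. Turning the
    sparse point cloud into this dense tensor is what makes per-cell
    multivariate background modeling (e.g. a GMM per cell) feasible."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    azim = np.degrees(np.arctan2(y, x)) % 360.0            # [0, 360)
    elev = np.degrees(np.arcsin(z / np.maximum(r, 1e-9)))  # [-90, 90]
    grid = np.zeros((n_elev, n_azim))
    ei = np.clip(((elev + 90.0) / 180.0 * n_elev).astype(int), 0, n_elev - 1)
    ai = np.clip((azim / 360.0 * n_azim).astype(int), 0, n_azim - 1)
    grid[ei, ai] = r
    return grid

# A single point straight ahead on the x-axis at range 1
pts = np.array([[1.0, 0.0, 0.0]])
img = to_range_image(pts)
print(img.shape)  # (32, 360)
```

With each (elevation, azimuth) cell observed repeatedly across frames, background modeling reduces to per-cell density estimation over range and intensity, which is exactly where a weighted mixture model applies.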
Object-Oriented Dynamics Learning through Multi-Level Abstraction
Object-based approaches for learning action-conditioned dynamics have
demonstrated promise for generalization and interpretability. However, existing
approaches suffer from structural limitations and optimization difficulties for
common environments with multiple dynamic objects. In this paper, we present a
novel self-supervised learning framework, called Multi-level Abstraction
Object-oriented Predictor (MAOP), which employs a three-level learning
architecture that enables efficient object-based dynamics learning from raw
visual observations. We also design a spatial-temporal relational reasoning
mechanism for MAOP to support instance-level dynamics learning and handle
partial observability. Our results show that MAOP significantly outperforms
previous methods in terms of sample efficiency and generalization over novel
environments for learning environment models. We also demonstrate that learned
dynamics models enable efficient planning in unseen environments, comparable to
true environment models. In addition, MAOP learns semantically and visually
interpretable disentangled representations.
Comment: Accepted to the Thirty-Fourth AAAI Conference on Artificial
Intelligence (AAAI), 2020.
A review of silhouette extraction algorithms for use within visual hull pipelines
© 2020 Informa UK Limited, trading as Taylor & Francis Group. Markerless motion capture would permit the study of human biomechanics in environments where marker-based systems are impractical, e.g. outdoors or underwater. The visual hull tool may enable such data to be recorded, but it requires the accurate detection of the silhouette of the object in multiple camera views. This paper reviews the top-performing algorithms available to date for silhouette extraction, with the visual hull in mind as the downstream application; the rationale is that higher-quality silhouettes would lead to higher-quality visual hulls, and consequently better measurement of movement. This paper is the first attempt in the literature to compare silhouette extraction algorithms that belong to different fields of Computer Vision, namely background subtraction, semantic segmentation, and multi-view segmentation. It was found that several algorithms exist that would be substantial improvements over the silhouette extraction algorithms traditionally used in visual hull pipelines. In particular, FgSegNet v2 (a background subtraction algorithm), DeepLabv3+ JFT (a semantic segmentation algorithm), and Djelouah 2013 (a multi-view segmentation algorithm) are the most accurate and promising methods for the extraction of silhouettes from 2D images to date, and could seamlessly be integrated within a visual hull pipeline for studies of human movement or biomechanics.
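The visual hull idea that motivates this review, keeping only the volume consistent with every silhouette, can be sketched with a toy voxel-carving example. This assumes just two orthographic views and axis-aligned projections for simplicity; real pipelines use many calibrated perspective cameras.

```python
import numpy as np

def visual_hull(sil_front, sil_side):
    """Carve a boolean voxel volume from two orthographic binary
    silhouettes: front view over the (y, z) plane and side view over
    the (x, z) plane. A voxel survives only if it projects inside the
    silhouette in *every* view, so better silhouettes give a tighter hull."""
    nx, nz = sil_side.shape
    ny, _ = sil_front.shape
    vol = np.ones((nx, ny, nz), dtype=bool)
    vol &= sil_front[None, :, :]  # carve along the x (viewing) axis
    vol &= sil_side[:, None, :]   # carve along the y (viewing) axis
    return vol

# A full 4x4 square silhouette in both views leaves a solid 4x4x4 cube
sq = np.ones((4, 4), dtype=bool)
print(visual_hull(sq, sq).sum())  # 64
```

Since every silhouette error carves away (or fails to carve) real volume in all downstream views, the hull's accuracy is bounded by the worst silhouette, which is why the review ranks extraction algorithms with the hull as the target application.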