147 research outputs found
LIPIcs, Volume 251, ITCS 2023, Complete Volume
Dynamic scene understanding: Pedestrian tracking from aerial devices.
Multiple Object Tracking (MOT) is the problem of following the trajectories of multiple objects through a sequence, generally a video. Pedestrians are among the most interesting subjects to track and recognize for purposes such as surveillance and safety. In recent years, Unmanned Aerial Vehicles (UAVs) have been viewed as a viable option for monitoring public areas, as they provide a low-cost method of data collection while covering large and difficult-to-reach areas. In this thesis, we present an online framework for pedestrian tracking and re-identification from aerial devices. This framework is based on learning a compact directional statistical distribution (the von Mises-Fisher distribution) for each person ID using a deep convolutional neural network. The distribution characteristics are trained to be invariant to clothing appearance and to transformations. In real-world scenarios, new pedestrians and objects can appear in the scene during deployment, and the model should detect them as Out Of Distribution (OOD). Thus, our framework also includes an OOD detection method adopted from [16], called Virtual Outlier Synthesis (VOS), which detects OOD samples by synthesizing virtual outliers in the embedding space in an online manner. To validate, analyze, and compare our approach, we use a large real-world benchmark dataset that contains detection, tracking, and identity annotations. The targets are captured at different viewing angles, places, and times by a "DJI Phantom 4" drone. We validate the effectiveness of the proposed framework by evaluating its detection, tracking, and long-term identification performance, as well as its classification performance between In Distribution (ID) and OOD samples. We show that the proposed methods in the framework can learn models that achieve their objectives.
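For reference, the von Mises-Fisher distribution named in this abstract has the standard density below on the unit sphere in p dimensions. The symbols (mean direction \mu, concentration \kappa, embedding x) follow the usual textbook convention and are assumptions here, not notation taken from the thesis.

```latex
% Standard von Mises-Fisher density for a unit vector x in R^p.
% \mu: unit-norm mean direction, \kappa \ge 0: concentration,
% I_v: modified Bessel function of the first kind of order v.
f_p(x;\mu,\kappa) = C_p(\kappa)\,\exp\!\left(\kappa\,\mu^{\top}x\right),
\qquad \lVert x\rVert = \lVert \mu\rVert = 1,
\qquad
C_p(\kappa) = \frac{\kappa^{p/2-1}}{(2\pi)^{p/2}\, I_{p/2-1}(\kappa)}.
```

A larger \kappa concentrates the mass around \mu, which is presumably what "compact" refers to for a per-identity distribution.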
Toward Efficient and Robust Computer Vision for Large-Scale Edge Applications
The past decade has witnessed remarkable advancements in computer vision and deep learning algorithms, ushering in a transformative wave of large-scale edge applications across various industries. These image processing methods, however, still encounter numerous challenges when it comes to meeting real-world demands, especially in terms of accuracy and latency at scale. Indeed, striking a balance among efficiency, robustness, and scalability remains a common obstacle. This dissertation investigates these issues in the context of different computer vision tasks, including image classification, semantic segmentation, depth estimation, and object detection. We introduce novel solutions, focusing on utilizing adjustable neural networks, joint multi-task architecture search, and generalized supervision interpolation. The first obstacle revolves around the ability to trade off between speed and accuracy in convolutional neural networks (CNNs) during inference on resource-constrained platforms. Despite their progress, CNNs are typically monolithic at runtime, which can present practical difficulties since computational budgets may vary over time. To address this, we introduce Any-Width Network, an adjustable-width CNN architecture that utilizes a novel Triangular Convolution module to enable fine-grained control over speed and accuracy during inference. The second challenge focuses on the computationally demanding nature of dense prediction tasks such as semantic segmentation and depth estimation. This issue becomes especially problematic for edge platforms with limited resources. To tackle this, we propose a novel and scalable framework named EDNAS. EDNAS leverages the synergistic relationship between Multi-Task Learning and hardware-aware Neural Architecture Search to significantly enhance on-device speed and accuracy of dense predictions. Finally, to improve the robustness of object detection, we introduce a novel data mixing augmentation.
While mixing techniques such as Mixup have proven successful in image classification, their application to object detection is non-trivial due to spatial misalignment, foreground/background distinction, and instance multiplicity. To address these issues, we propose a generalized data mixing principle, Supervision Interpolation, and its simple yet effective implementation, LossMix. By addressing these challenges, this dissertation aims to facilitate better efficiency, accuracy, and scalability of computer vision and deep learning algorithms and contribute to the advancement of large-scale edge applications across different domains. Doctor of Philosophy
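As background for the data mixing discussion above, here is a minimal sketch of standard Mixup for classification, the baseline technique the abstract names. It is not LossMix or Supervision Interpolation, whose details the abstract does not give. The function name `mixup` and the list-based inputs are illustrative assumptions.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=1.0, rng=random):
    """Blend two (input, one-hot label) pairs with a Beta-distributed weight.

    Standard Mixup for classification only; the function name and the
    plain-list representation are assumptions made for illustration.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]  # mixed input
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]  # mixed soft label
    return x, y, lam


# Usage: mix a class-0 example with a class-1 example.
x, y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
print(lam, x, y)
```

Blending labels like this assumes one label per image, which is why the abstract calls the extension to object detection non-trivial: boxes from two images do not align spatially and each image can contain many instances.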
Multimodal Dataset Distillation for Image-Text Retrieval
Dataset distillation methods offer the promise of reducing a large-scale
dataset down to a significantly smaller set of (potentially synthetic) training
examples, which preserve sufficient information for training a new model from
scratch. So far dataset distillation methods have been developed for image
classification. However, with the rise in capabilities of vision-language
models, and especially given the scale of datasets necessary to train these
models, the time is ripe to expand dataset distillation methods beyond image
classification. In this work, we take the first steps towards this goal by
expanding on the idea of trajectory matching to create a distillation method
for vision-language datasets. The key challenge is that vision-language
datasets do not have a set of discrete classes. To overcome this, our proposed
multimodal dataset distillation method jointly distills the images and their
corresponding language descriptions in a contrastive formulation. Since there
are no existing baselines, we compare our approach to three coreset selection
methods (strategic subsampling of the training dataset), which we adapt to the
vision-language setting. We demonstrate significant improvements on the
challenging Flickr30K and COCO retrieval benchmarks: the best coreset selection
method which selects 1000 image-text pairs for training is able to achieve only
5.6% image-to-text retrieval accuracy (recall@1); in contrast, our dataset
distillation approach almost doubles that with just 100 (an order of magnitude
fewer) training pairs. Comment: 28 pages, 11 figures
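The recall@1 metric quoted above can be sketched as follows. This is a simplified illustration that assumes one ground-truth text per image, stored at the same index; the benchmarks themselves may pair each image with several captions. The name `recall_at_k` and the toy similarity matrix are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Image-to-text recall@k.

    similarity[i][j] is the score between image i and text j.
    Assumption for this sketch: the correct text for image i is text i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring texts for image i.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return hits / len(similarity)


# Toy example: images 0 and 2 rank their own text first, image 1 does not.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, k=1))  # 2 of 3 images retrieve the right text first
```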
Benne: A Modular and Self-Optimizing Algorithm for Data Stream Clustering
In various real-world applications, ranging from the Internet of Things (IoT)
to social media and financial systems, data stream clustering is a critical
operation. This paper introduces Benne, a modular and highly configurable data
stream clustering algorithm designed to offer a nuanced balance between
clustering accuracy and computational efficiency. Benne distinguishes itself by
clearly demarcating four pivotal design dimensions: the summarizing data
structure, the window model for handling data temporality, the outlier
detection mechanism, and the refinement strategy for improving cluster quality.
This clear separation not only facilitates a granular understanding of the
impact of each design choice on the algorithm's performance but also enhances
the algorithm's adaptability to a wide array of application contexts. We
provide a comprehensive analysis of these design dimensions, elucidating the
challenges and opportunities inherent to each. Furthermore, we conduct a
rigorous performance evaluation of Benne, employing diverse configurations and
benchmarking it against existing state-of-the-art data stream clustering
algorithms. Our empirical results substantiate that Benne either matches or
surpasses competing algorithms in terms of clustering accuracy, processing
throughput, and adaptability to varying data stream characteristics. This
establishes Benne as a valuable asset for both practitioners and researchers in
the field of data stream mining.
DESIGN AND VERIFICATION OF AUTONOMOUS SYSTEMS IN THE PRESENCE OF UNCERTAINTIES
Autonomous systems offer hope of moving away from mechanized, unsafe, manual, and often inefficient practices. The last decade has seen several small but important steps toward making this dream a reality. These advancements have helped us achieve limited autonomy in several areas, such as driving, factory floors, surgery, wearables, and home assistants. Nevertheless, autonomous systems are required to operate in a wide range of environments with uncertainties (viz., sensor errors, timing errors, and the dynamic nature of the environment). Such environmental uncertainties, even when present in small amounts, can have a drastic impact on the safety of the system, thus hampering the goal of achieving a higher degree of autonomy, especially in safety-critical domains. To this end, the dissertation discusses formal techniques that can verify and design autonomous systems for safety, even in the presence of such uncertainties, allowing for their trustworthy deployment in the real world. Specifically, the dissertation first discusses monitoring techniques for autonomous systems from available (noisy) logs, and safety-verification techniques for autonomous system controllers under timing uncertainties. Second, using heterogeneous learning-based cloud computing models that can balance uncertainty in output and computation cost, the dissertation presents techniques for designing safe and performance-optimal autonomous systems. Doctor of Philosophy
Deep Neural Network Compression with Filter Pruning
The rapid development of convolutional neural networks (CNNs) in computer vision tasks has inspired researchers to deploy them on embedded or mobile devices. However, CNNs typically require a large amount of computation and a large memory footprint, limiting their deployment in such resource-limited systems. Therefore, how to compress complex networks while maintaining competitive performance has become a focus of attention in recent years. Within network compression, filter pruning methods, which achieve a structured compact model by finding and removing redundant filters, have attracted widespread attention. Inspired by previous work, this thesis focuses on obtaining a compact model while retaining as much of the original model’s performance as possible. In particular, to address the limitations of filter selection in existing popular pruning methods, several novel filter pruning strategies are proposed to find and remove redundant filters more accurately and so reduce the performance loss caused by pruning: a filter pruning method with an attention mechanism (Chapter 3), data-dependent filter pruning guided by LSTM (Chapter 4), and filter pruning with a uniqueness mechanism in the frequency domain (Chapter 5). This thesis first addresses the filter pruning issue from a global perspective. To this end, we propose a new scheme, termed Pruning Filter with an Attention Mechanism (PFAM). That is, by establishing the dependency/relationship between filters at each layer, we explore the long-term dependence between filters via an attention module in order to choose the to-be-pruned filters. Unlike prior approaches that identify the to-be-pruned filters simply based on their intrinsic properties, the less correlated filters are first pruned after the pruning step in the current training epoch and then reconstructed and updated during the subsequent training epoch.
Thus, the compressed network model can be achieved without the requirement for a pre-trained model since input data can be manipulated with the maximum information maintained when the original training strategy is executed. Next, it is noticed that most existing pruning algorithms seek to prune the filter layer by layer. Specifically, they guide filter pruning at each layer by setting a global pruning rate, which indicates that each convolutional layer is treated equally without regard to its depth and width. In this situation, we argue that the convolutional layers in the network also have varying degrees of significance. Besides, we propose that selecting the appropriate layers for pruning is more reasonable since it can result in more complexity reduction with less performance loss by keeping and removing more filters in those critical and nonsignificant layers, respectively. In order to do this, long short-term memory (LSTM) is employed to learn the hierarchical properties of a network and to generalize a global network pruning scheme. On top of that, we present a data-dependent soft pruning strategy named Squeeze-Excitation-Pruning (SEP), which does not physically prune any filters but removes specific kernels involved in calculating forward and backward propagations based on the pruning scheme. Doing so can further decrease the model’s performance decline while achieving a deep model compression. Lastly, we transfer the concept of relationship from the filter level to the feature map level because the feature maps can reflect the comprehensive information of both input data and filters. Hence, we propose Filter Pruning with Uniqueness Mechanism in the Frequency Domain (FPUM) to serve as a guideline for the filter pruning strategy by generating the correlation between feature maps. Specifically, we first transfer features to the frequency domain by Discrete Cosine Transform (DCT). 
Then, for each feature map, we compute a uniqueness score, which measures its probability of being replaced by others. Doing so allows us to prune the filters corresponding to the low-uniqueness maps without significant performance degradation. In addition, our strategy is more resistant to noise than spatial methods, further enhancing the network’s compactness while maintaining performance, as the critical pruning clues are more concentrated following DCT.
LIPIcs, Volume 261, ICALP 2023, Complete Volume
Ulcerative Colitis Mayo Endoscopic Scoring Classification with Active Learning and Generative Data Augmentation
Endoscopic imaging is commonly used to diagnose Ulcerative Colitis (UC) and
classify its severity. It has been shown that deep learning based methods are
effective in automated analysis of these images and can potentially be used to
aid medical doctors. Unleashing the full potential of these methods depends on
the availability of a large number of labeled images; however, obtaining and
labeling these images is quite challenging. In this paper, we propose an active
learning based generative augmentation method. The method involves generating a
large number of synthetic samples by training using a small dataset consisting
of real endoscopic images. The resulting data pool is narrowed down by using
active learning methods to select the most informative samples, which are then
used to train a classifier. We demonstrate the effectiveness of our method
through experiments on a publicly available endoscopic image dataset. The
results show that using synthesized samples in conjunction with active learning
leads to improved classification performance compared to using only the
original labeled examples: the baseline classification performance of 68.1%
increases to 74.5% in terms of Quadratic Weighted Kappa (QWK) score. Another
observation is that attaining equivalent performance using only real data
required three times as many images. Comment: 6 pages, 3 figures, to be published in IEEE International Conference
on Bioinformatics and Biomedicine (BIBM) 202
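The Quadratic Weighted Kappa metric reported above can be computed with the standard formula sketched below. This is a generic illustration, not the authors' evaluation code. The function name is hypothetical, and the choice of four classes in the usage lines assumes Mayo endoscopic scores 0 to 3.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """Cohen's kappa with quadratic weights for integer labels in [0, num_classes).

    Undefined (division by zero) if every label falls in a single class.
    """
    n = len(rater_a)
    # Observed confusion counts between the two label sequences.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2  # quadratic penalty
            expected = hist_a[i] * hist_b[j] / n  # counts expected under independence
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected


# Usage with four ordinal classes (assumed Mayo scores 0-3).
truth = [0, 1, 2, 3]
print(quadratic_weighted_kappa(truth, [0, 1, 2, 3], 4))  # perfect agreement
print(quadratic_weighted_kappa(truth, [3, 2, 1, 0], 4))  # fully reversed labels
```

Because the penalty grows with the squared distance between labels, confusing score 0 with score 3 costs far more than confusing adjacent scores, which suits an ordinal severity scale.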
Towards Sustainable Learning: Coresets for Data-efficient Deep Learning
To improve the efficiency and sustainability of learning deep models, we
propose CREST, the first scalable framework with rigorous theoretical
guarantees to identify the most valuable examples for training non-convex
models, particularly deep networks. To guarantee convergence to a stationary
point of a non-convex function, CREST models the non-convex loss as a series of
quadratic functions and extracts a coreset for each quadratic sub-region. In
addition, to ensure faster convergence of stochastic gradient methods such as
(mini-batch) SGD, CREST iteratively extracts multiple mini-batch coresets from
larger random subsets of training data, to ensure nearly-unbiased gradients
with small variances. Finally, to further improve scalability and efficiency,
CREST identifies the examples that have already been learned and excludes them
from the coreset selection pipeline. Our extensive experiments on several deep
networks trained on vision and NLP datasets, including CIFAR-10, CIFAR-100,
TinyImageNet, and SNLI, confirm that CREST speeds up training deep networks on
very large datasets by 1.7x to 2.5x with minimal loss in performance. By
analyzing the learning difficulty of the subsets selected by CREST, we show that
deep models benefit the most by learning from subsets of increasing difficulty
levels.