325 research outputs found
The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models
The computer vision world has been regaining enthusiasm for various
pre-trained models, including both classical ImageNet supervised pre-training
and recently emerged self-supervised pre-training such as SimCLR and MoCo.
Pre-trained weights often boost a wide range of downstream tasks including
classification, detection, and segmentation. Latest studies suggest that
pre-training benefits from gigantic model capacity. We are hereby curious and
ask: after pre-training, does a pre-trained model indeed have to stay large for
its downstream transferability?
In this paper, we examine supervised and self-supervised pre-trained models
through the lens of the lottery ticket hypothesis (LTH). LTH identifies highly
sparse matching subnetworks that can be trained in isolation from (nearly)
scratch yet still reach the full models' performance. We extend the scope of
LTH and question whether matching subnetworks still exist in pre-trained
computer vision models that enjoy the same downstream transfer performance.
Our extensive experiments convey an overall positive message: from all
pre-trained weights obtained by ImageNet classification, SimCLR, and MoCo, we
are consistently able to locate such matching subnetworks at 59.04% to 96.48%
sparsity that transfer universally to multiple downstream tasks, and whose
performance sees no degradation compared to using full pre-trained weights.
Further analyses reveal that subnetworks found from different pre-training tend
to yield diverse mask structures and perturbation sensitivities. We conclude
that the core LTH observations remain generally relevant in the pre-training
paradigm of computer vision, but more delicate discussions are needed in some
cases. Codes and pre-trained models will be made available at:
https://github.com/VITA-Group/CV_LTH_Pre-training.
Comment: CVPR 202
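The reported sparsity range can be checked with a little arithmetic. The sketch below assumes an iterative pruning schedule that removes 20% of the remaining weights in each round; that rate is an assumption, since the abstract does not state the schedule. The function name `sparsity_after_rounds` is hypothetical. Under this assumption, 4 and 15 rounds reproduce the 59.04% and 96.48% endpoints.

```python
def sparsity_after_rounds(rounds: int, prune_rate: float = 0.2) -> float:
    """Fraction of weights removed after `rounds` of iterative pruning.

    Each round removes `prune_rate` of the weights that are still present.
    The default of 0.2 is an assumption, not a value given in the abstract.
    """
    remaining = (1.0 - prune_rate) ** rounds  # fraction of weights still kept
    return 1.0 - remaining


if __name__ == "__main__":
    for k in (4, 15):
        print(f"{k} rounds -> {100 * sparsity_after_rounds(k):.2f}% sparsity")
```

With the assumed 20% rate, the sparsity after k rounds is 1 - 0.8^k, so each extra round removes a smaller absolute share of the original weights.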
COLT: Cyclic Overlapping Lottery Tickets for Faster Pruning of Convolutional Neural Networks
Pruning refers to the elimination of trivial weights from neural networks.
The sub-networks within an overparameterized model produced after pruning are
often called Lottery tickets. This research aims to generate winning lottery
tickets from a set of lottery tickets that can achieve similar accuracy to the
original unpruned network. We introduce a novel winning ticket called Cyclic
Overlapping Lottery Ticket (COLT) by data splitting and cyclic retraining of
the pruned network from scratch. We apply a cyclic pruning algorithm that keeps
only the overlapping weights of different pruned models trained on different
data segments. Our results demonstrate that COLT can achieve similar accuracies
(obtained by the unpruned model) while maintaining high sparsities. We show
that the accuracy of COLT is on par with the winning tickets of Lottery Ticket
Hypothesis (LTH) and, at times, is better. Moreover, COLTs can be generated
using fewer iterations than tickets generated by the popular Iterative
Magnitude Pruning (IMP) method. In addition, we also notice COLTs generated on
large datasets can be transferred to small ones without compromising
performance, demonstrating their generalizing capability. We conduct all our
experiments on the CIFAR-10, CIFAR-100, and TinyImageNet datasets and report
performance superior to that of state-of-the-art methods.
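The idea of keeping only the overlapping weights of pruned models can be sketched as a mask intersection. This is a minimal illustration, not the authors' implementation: it assumes magnitude-based pruning on flat weight lists, omits the cyclic retraining step, and uses hypothetical names (`magnitude_mask`, `overlap_mask`) and made-up weights.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that keeps the largest-magnitude weights.

    `sparsity` is the fraction of weights to remove (assumed criterion:
    smallest absolute value is pruned first).
    """
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by increasing magnitude; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def overlap_mask(mask_a, mask_b):
    """Keep a weight only if both pruned models kept it."""
    return [a & b for a, b in zip(mask_a, mask_b)]


# Toy weights standing in for two models trained on different data splits.
weights_a = [0.9, -0.1, 0.4, 0.05, -0.7, 0.3]
weights_b = [0.8, 0.6, -0.02, 0.1, -0.5, 0.45]

mask_a = magnitude_mask(weights_a, 0.5)
mask_b = magnitude_mask(weights_b, 0.5)
shared = overlap_mask(mask_a, mask_b)
print(mask_a, mask_b, shared)
```

Because the intersection can only remove entries, the shared mask is at least as sparse as either input mask, which is one way to read the claim that fewer pruning iterations are needed.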
Pruning Convolutional Neural Networks with Self-Supervision
Convolutional neural networks trained without supervision come close to
matching performance with supervised pre-training, but sometimes at the cost of
an even higher number of parameters. Extracting subnetworks from these large
unsupervised convnets with preserved performance is of particular interest to
make them less computationally intensive. Typical pruning methods operate
during training on a task while trying to maintain the performance of the
pruned network on the same task. However, in self-supervised feature learning,
the training objective is agnostic to the representation's transferability to
downstream tasks. Thus, preserving performance for this objective does not
ensure that the pruned subnetwork remains effective for solving downstream
tasks. In this work, we investigate the use of standard pruning methods,
developed primarily for supervised learning, for networks trained without
labels (i.e. on self-supervised tasks). We show that pruned masks obtained with
or without labels reach comparable performance when re-trained on labels,
suggesting that pruning operates similarly for self-supervised and supervised
learning. Interestingly, we also find that pruning preserves the transfer
performance of self-supervised subnetwork representations
Robust Tickets Can Transfer Better: Drawing More Transferable Subnetworks in Transfer Learning
Transfer learning leverages feature representations of deep neural networks
(DNNs) pretrained on source tasks with rich data to empower effective
finetuning on downstream tasks. However, the pretrained models are often
prohibitively large for delivering generalizable representations, which limits
their deployment on edge devices with constrained resources. To close this gap,
we propose a new transfer learning pipeline, which leverages our finding that
robust tickets can transfer better, i.e., subnetworks drawn with properly
induced adversarial robustness can win better transferability over vanilla
lottery ticket subnetworks. Extensive experiments and ablation studies validate
that our proposed transfer learning pipeline can achieve enhanced
accuracy-sparsity trade-offs across both diverse downstream tasks and sparsity
patterns, further enriching the lottery ticket hypothesis.
Comment: Accepted by DAC 202
HUMAN ACTIVITY RECOGNITION FROM EGOCENTRIC VIDEOS AND ROBUSTNESS ANALYSIS OF DEEP NEURAL NETWORKS
In recent years, there has been a significant amount of research on human activity classification relying either on Inertial Measurement Unit (IMU) data or on data from static cameras providing a third-person view. There has been relatively less work using wearable cameras, which provide an egocentric (first-person) view of the environment as seen by the wearer. Using only IMU data limits the variety and complexity of the activities that can be detected. Deep machine learning has achieved great success in image and video processing in recent years. Neural network based models provide improved accuracy in multiple fields in computer vision. However, there has been relatively less work focusing on designing specific models to improve the performance of egocentric image/video tasks. As deep neural networks keep improving the accuracy in computer vision tasks, the robustness and resilience of the networks should be improved as well so that they can be applied in safety-critical areas such as autonomous driving.
Motivated by these considerations, in the first part of the thesis, the problem of human activity detection and classification from egocentric cameras is addressed. First, a new method is presented to count the number of footsteps and compute the total traveled distance by using the data from the IMU sensors and camera of a smartphone. By incorporating data from multiple sensor modalities, and calculating the length of each step instead of using preset stride lengths and assuming equal-length steps, the proposed method provides much higher accuracy compared to commercially available step counting apps. After the application of footstep counting, more complicated human activities, such as steps of preparing a recipe and sitting on a sofa, are taken into consideration. Multiple classification methods, both non-deep-learning and deep-learning-based, are presented, which employ both egocentric camera and IMU data. Then, a Genetic Algorithm-based approach is employed to set the parameters of an activity classification network autonomously, and performance is compared with empirically set parameters.
Then, a new framework is introduced to reduce the computational cost of human temporal activity recognition from egocentric videos while maintaining the accuracy at a comparable level. The actor-critic model of reinforcement learning is applied to optical flow data to locate a bounding box around the region of interest, which is then used for clipping a sub-image from a video frame. A shallow and a deeper 3D convolutional neural network are designed to process the original image and the clipped image region, respectively.
Next, a systematic method is introduced that autonomously and simultaneously optimizes multiple parameters of any deep neural network by using a bi-generative adversarial network (Bi-GAN) guiding a genetic algorithm (GA). The proposed Bi-GAN allows the autonomous exploitation and choice of the number of neurons for the fully-connected layers, and the number of filters for the convolutional layers, from a large range of values. The Bi-GAN involves two generators, and two different models compete and improve each other progressively with a GAN-based strategy to optimize the networks during a GA evolution. In this analysis, three different neural network layers and datasets are taken into consideration:
3D convolutional layers for the ModelNet40 dataset. We applied the proposed approach to a 3D convolutional network by using the ModelNet40 dataset. ModelNet is a dataset of 3D point clouds. The goal is to perform shape classification over 40 shape classes.
LSTM layers for the UCI HAR dataset. The UCI HAR dataset is composed of Inertial Measurement Unit (IMU) data captured during activities of standing, sitting, laying, walking, walking upstairs and walking downstairs. These activities were performed by 30 subjects, and the 3-axial linear acceleration and 3-axial angular velocity were collected at a constant rate of 50 Hz.
2D convolutional layers for the Chars74k dataset. The Chars74k dataset contains 64 classes (0-9, A-Z, a-z): 7705 characters obtained from natural images, 3410 hand-drawn characters using a tablet PC, and 62992 synthesised characters from computer fonts, giving a total of over 74K images.
In the final part of the thesis, the robustness and resilience of neural network models are investigated with respect to adversarial examples (AEs) and autonomous driving conditions. The transferability of adversarial examples across a wide range of real-world computer vision tasks, including image classification, explicit content detection, optical character recognition (OCR), and object detection, is investigated. This represents the cybercriminal’s situation, where an ensemble of different detection mechanisms needs to be evaded all at once. A novel Dispersion Reduction (DR) attack is designed, which is a practical attack that overcomes existing attacks’ limitation of requiring task-specific loss functions by targeting the “dispersion” of the internal feature map. In the autonomous driving scenario, adversarial machine learning attacks against the complete visual perception pipeline are studied. A novel attack technique, tracker hijacking, that can effectively fool Multi-Object Tracking (MOT) using AEs on object detection is presented. Using this technique, successful AEs on as few as one single frame can move an existing object into or out of the headway of an autonomous vehicle to cause potential safety hazards
Why is Machine Learning Security so hard?
The increase of available data and computing power has fueled a wide application of machine learning (ML). At the same time, security concerns are raised: ML models were shown to be easily fooled by slight perturbations on their inputs. Furthermore, by querying a model and analyzing output and input pairs, an attacker can infer the training data or replicate the model, thereby harming the owner’s intellectual property. Also, altering the training data can lure the model into producing specific or generally wrong outputs at test time. So far, none of the attacks studied in the field has been satisfactorily defended. In this work, we shed light on these difficulties. We first consider classifier evasion or adversarial examples. The computation of such examples is an inherent problem, as opposed to a bug that can be fixed. We also show that adversarial examples often transfer from one model to another, different model. Afterwards, we point out that the detection of backdoors (a training-time attack) is hindered as natural backdoor-like patterns occur even in benign neural networks. The question whether a pattern is benign or malicious then turns into a question of intention, which is hard to tackle. A different kind of complexity is added with the large libraries nowadays in use to implement machine learning. We introduce an attack that alters the library, thereby decreasing the accuracy a user can achieve. In case the user is aware of the attack, however, it is straightforward to defeat. This is not the case for most classical attacks described above. Additional difficulty is added if several attacks are studied at once: we show that even if the model is configured for one attack to be less effective, another attack might perform even better. We conclude by pointing out the necessity of understanding the ML model under attack. On the one hand, as we have seen throughout the examples given here, understanding precedes defenses and attacks. 
On the other hand, an attack, even a failed one, often yields new insights and knowledge about the algorithm studied.
This work was supported by the German Federal Ministry of Education and Research (BMBF) through funding for the Center for IT-Security, Privacy and Accountability (CISPA) (FKZ: 16KIS0753
Reprogramming under constraints: Revisiting efficient and reliable transferability of lottery tickets
In the era of foundation models with huge pre-training budgets, the
downstream tasks have been shifted to the narrative of efficient and fast
adaptation. For classification-based tasks in the domain of computer vision,
the two most efficient approaches have been linear probing (LP) and visual
prompting/reprogramming (VP); the former aims to learn a classifier in the form
of a linear head on the features extracted by the pre-trained model, while the
latter maps the input data to the domain of the source data on which the model
was originally pre-trained. Although extensive studies have demonstrated the
differences between LP and VP in terms of downstream performance, we explore
the capabilities of the two aforementioned methods via the sparsity axis: (a)
Data sparsity: the impact of few-shot adaptation and (b) Model sparsity: the
impact of lottery tickets (LT). We demonstrate that LT are not universal
reprogrammers, i.e., for certain target datasets, reprogramming an LT yields
significantly lower performance than the reprogrammed dense model although
their corresponding upstream performance is similar. Further, we demonstrate
that the calibration of dense models is always superior to that of their
lottery ticket counterparts under both LP and VP regimes. Our empirical study
opens a new avenue of research into VP for sparse models and encourages further
understanding of the performance beyond the accuracy achieved by VP under
constraints of sparsity. Code and logs can be accessed at
https://github.com/landskape-ai/Reprogram_LT.
Comment: Preprin
Towards Compute-Optimal Transfer Learning
The field of transfer learning is undergoing a significant shift with the
introduction of large pretrained models which have demonstrated strong
adaptability to a variety of downstream tasks. However, the high computational
and memory requirements to finetune or use these models can be a hindrance to
their widespread use. In this study, we present a solution to this issue by
proposing a simple yet effective way to trade computational efficiency for
asymptotic performance which we define as the performance a learning algorithm
achieves as compute tends to infinity. Specifically, we argue that zero-shot
structured pruning of pretrained models allows them to increase compute
efficiency with minimal reduction in performance. We evaluate our method on the
Nevis'22 continual learning benchmark that offers a diverse set of transfer
scenarios. Our results show that pruning convolutional filters of pretrained
models can lead to more than 20% performance improvement in low computational
regimes
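Structured pruning of convolutional filters can be illustrated with a small ranking sketch. The abstract does not say how filters are scored, so the L1-norm criterion here is an assumption (a common heuristic), and the names `filter_l1_norm` and `prune_filters` and the toy 2x2 kernels are hypothetical. "Zero-shot" is read here as pruning the pretrained filters directly, with no retraining before the selection.

```python
def filter_l1_norm(filt):
    """Sum of absolute values of one filter's weights (assumed importance score)."""
    return sum(abs(w) for row in filt for w in row)


def prune_filters(filters, keep_ratio):
    """Keep the `keep_ratio` fraction of filters with the largest L1 norm.

    Returns the kept indices (in original order) and the kept filters.
    Whole filters are removed, so the layer gets narrower rather than sparse.
    """
    n_keep = max(1, int(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)),
                    key=lambda i: filter_l1_norm(filters[i]),
                    reverse=True)
    kept = sorted(ranked[:n_keep])
    return kept, [filters[i] for i in kept]


# Four toy 2x2 filters standing in for one pretrained convolutional layer.
layer = [
    [[1.0, -1.0], [0.5, 0.5]],   # L1 norm 3.0
    [[0.1, 0.0], [0.0, -0.1]],   # L1 norm 0.2
    [[2.0, 0.0], [0.0, -2.0]],   # L1 norm 4.0
    [[0.3, 0.3], [0.3, 0.3]],    # L1 norm 1.2
]

kept_idx, kept_filters = prune_filters(layer, keep_ratio=0.5)
print(kept_idx)
```

Removing entire filters shrinks the layer's output channels, which is why structured pruning reduces compute on ordinary hardware, unlike unstructured weight masks.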
Lottery Tickets in Evolutionary Optimization: On Sparse Backpropagation-Free Trainability
Is the lottery ticket phenomenon an idiosyncrasy of gradient-based training
or does it generalize to evolutionary optimization? In this paper we establish
the existence of highly sparse trainable initializations for evolution
strategies (ES) and characterize qualitative differences compared to gradient
descent (GD)-based sparse training. We introduce a novel signal-to-noise
iterative pruning procedure, which incorporates loss curvature information into
the network pruning step. This can enable the discovery of even sparser
trainable network initializations when using black-box evolution as compared to
GD-based optimization. Furthermore, we find that these initializations encode
an inductive bias, which transfers across different ES, related tasks and even
to GD-based training. Finally, we compare the local optima resulting from the
different optimization paradigms and sparsity levels. In contrast to GD, ES
explore diverse and flat local optima and do not preserve linear mode
connectivity across sparsity levels and independent runs. The results highlight
qualitative differences between evolution and gradient-based learning dynamics,
which can be uncovered by the study of iterative pruning procedures.
Comment: 13 pages, 11 figures, International Conference on Machine Learning
(ICML) 202