Adversarial Robustness and Robust Meta-Learning for Neural Networks
Despite the overwhelming success of neural networks for pattern recognition, these models behave categorically differently from humans. Adversarial examples, small perturbations that are often imperceptible to the human eye, easily fool neural networks, demonstrating that neural networks lack the robustness of human classifiers. This thesis comprises three parts. First, we motivate the study of defenses against adversarial examples with a case study on algorithmic trading, in which robustness may be critical for security reasons. Second, we develop methods for hardening neural networks against an adversary, especially in the low-data regime, where meta-learning methods achieve state-of-the-art results. Finally, we discuss several properties of the neural network models we use. These properties are of interest beyond robustness to adversarial examples, and they extend to the broad setting of deep learning.
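As background for the adversarial examples discussed above, the fast gradient sign method (FGSM) is a standard way to craft such small perturbations. The sketch below is a minimal, generic illustration, not this thesis's specific attack or defense; model, x, y, and epsilon are placeholder names.

```python
# Minimal FGSM sketch (generic illustration, not this thesis's method).
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.03):
    """Craft an adversarial example with one signed-gradient step."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Move each pixel slightly in the direction that increases the loss,
    # then clamp back to the valid image range.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```

Even with a budget as small as epsilon = 8/255, perturbations of this kind routinely flip the predictions of undefended classifiers.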
Practical Robust Learning Under Domain Shifts
As devices are constantly upgraded, the data we capture shifts over time. Despite the domain shifts among images, we as humans can set aside the differences and still recognize the content. For machines, however, these shifts pose a far greater challenge. Humans naturally adapt to visual changes in the environment without learning all over again, yet to make machines work in a changed environment we need new annotations from humans. The fundamental question is: can we make machines as adaptive as humans?
In this thesis, we work towards answering this question through advances in robust learning under domain shifts via domain adaptation. Our goal is to facilitate the transfer of information across domains while minimizing the need for human supervision.
To enable real systems with demonstrated robustness, the study of domain adaptation needs to move from ideals to realities. Current domain adaptation research rests on a few idealized assumptions that are not consistent with reality: i) that domains are perfectly sliced and domain labels are available; ii) that annotations from the target domain should be treated the same as those from the source domain; iii) that samples from the target domain are constantly accessible. In this thesis, we address the facts that true domain labels are hard to obtain, that target-domain labels can be exploited in better ways, and that in reality the target domain is often time-sensitive.
Regarding problem settings, this thesis covers the following scenarios of practical value: unsupervised multi-source domain adaptation, semi-supervised domain adaptation, and online domain adaptation. Three completed works are reviewed, one for each problem setting.
The first work proposes an adversarial learning strategy that learns a dynamic curriculum for source samples to maximize the utility of source labels from multiple domains. The model iteratively learns which domains or samples are best suited for aligning to the target. The intuition is to force the adversarial agent to constantly re-measure the transferability of latent domains over time so as to adversarially raise the error rate of the domain discriminator. The method removes the need for domain labels, yet outperforms other methods on four well-known benchmarks by significant margins; a sketch of the adversarial alignment mechanism it builds on appears below.
The second work addresses the problem that current methods fail to use target supervision effectively, treating source and target supervision without distinction. It points out that labeled target data needs to be distinguished from the source, and proposes to explicitly decompose the task into two sub-tasks: a semi-supervised learning task in the target domain and an unsupervised domain adaptation task across domains. The two sub-tasks can then better leverage their corresponding supervision and thus yield very different classifiers.
The third work is set in the context of online privacy, i.e. each online sample of the target domain is permanently deleted after being processed. The proposed framework utilizes labels from public data and predicts on the unlabeled, sensitive private data. To tackle the inevitable distribution shift from the public data to the private data, the work proposes a novel domain adaptation algorithm that directly targets the fundamental challenge of this online setting: the lack of diverse source-target data pairs.
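The first work's curriculum operates on top of a domain discriminator trained adversarially against the feature extractor. A common mechanism for this kind of alignment is a gradient-reversal layer in the style of DANN; the sketch below shows that generic mechanism, not the dynamic-curriculum method itself, and the usage names are hypothetical.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the
    backward pass, so features are trained to fool the discriminator."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage (hypothetical modules): the discriminator learns to tell domains
# apart, while the reversed gradient pushes features to be domain-invariant.
# domain_logits = discriminator(grad_reverse(features, lam=0.5))
```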
Cognitive Architectures for Language Agents
Recent efforts have augmented large language models (LLMs) with external
resources (e.g., the Internet) or internal control flows (e.g., prompt
chaining) for tasks requiring grounding or reasoning. However, these efforts
have largely been piecemeal, lacking a systematic framework for constructing a
fully-fledged language agent. To address this challenge, we draw on the rich
history of agent design in symbolic artificial intelligence to develop a
blueprint for a new wave of cognitive language agents. We first show that LLMs
have many of the same properties as production systems, and recent efforts to
improve their grounding or reasoning mirror the development of cognitive
architectures built around production systems. We then propose Cognitive
Architectures for Language Agents (CoALA), a conceptual framework to
systematize diverse methods for LLM-based reasoning, grounding, learning, and
decision making as instantiations of language agents in the framework. Finally,
we use the CoALA framework to highlight gaps and propose actionable directions
toward more capable language agents in the future.
Comment: 16 pages of main content, 10 pages of references, 5 figures. Equal
contribution among the first two authors, order decided by coin flip. A
CoALA-based repo of recent work on language agents:
https://github.com/ysymyth/awesome-language-agent
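CoALA is a conceptual framework rather than a library, but the decision loop it describes (retrieve from memory, reason with the LLM, act via grounding) can be summarized in pseudocode. The skeleton below is a hedged reading of that loop; llm, memory, and env are hypothetical objects, not an official API.

```python
# Hypothetical CoALA-style agent loop; all object names are illustrative.
def agent_loop(llm, memory, env, max_steps=10):
    observation = env.reset()
    for _ in range(max_steps):
        memory.add(observation)                 # learning: update memory
        context = memory.retrieve(observation)  # retrieval from memory
        action = llm.decide(context)            # reasoning: choose an action
        observation, done = env.step(action)    # grounding: act externally
        if done:
            break
```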
DPL: Decoupled Prompt Learning for Vision-Language Models
Prompt learning has emerged as an efficient and effective approach for
transferring foundational Vision-Language Models (e.g., CLIP) to downstream
tasks. However, current methods tend to overfit to seen categories, thereby
limiting their generalization ability for unseen classes. In this paper, we
propose a new method, Decoupled Prompt Learning (DPL), which reformulates the
attention in prompt learning to alleviate this problem. Specifically, we
theoretically investigate the collaborative process between prompts and
instances (i.e., image patches/text tokens) by reformulating the original
self-attention into four separate sub-processes. Through detailed analysis, we
observe that certain sub-processes can be strengthened by suitable
approximation techniques to bolster robustness and generalizability (a
simplified sketch of this decomposition follows the abstract). Furthermore, we
introduce language-conditioned textual prompting based on decoupled attention
to naturally preserve the generalization of text input. Our approach is
flexible for both visual and textual modalities, making it easily extendable to
multi-modal prompt learning. By combining the proposed techniques, our approach
achieves state-of-the-art performance on three representative benchmarks
encompassing 15 image recognition datasets, while remaining
parameter-efficient. Moreover, our DPL does not rely on any auxiliary
regularization task or extra training data, further demonstrating its
remarkable generalization ability.
Comment: 11 pages, 5 figures, 8 tables
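DPL's exact reweighting is not reproduced here, but the four-way decomposition it starts from can be made concrete: joint self-attention over concatenated prompt and instance tokens splits into prompt-to-prompt, prompt-to-instance, instance-to-prompt, and instance-to-instance sub-processes. The sketch below is a simplified illustration (query/key projections and softmax omitted), not the paper's implementation.

```python
import torch

def attention_subprocesses(prompts, tokens):
    """Split joint attention scores over [prompts; tokens] into four
    sub-blocks. prompts: (B, P, D); tokens: (B, N, D). Illustrative only:
    projections, softmax, and DPL's strengthening steps are omitted."""
    x = torch.cat([prompts, tokens], dim=1)               # (B, P+N, D)
    scores = x @ x.transpose(1, 2) / x.shape[-1] ** 0.5   # (B, P+N, P+N)
    P = prompts.shape[1]
    p2p = scores[:, :P, :P]  # prompt queries -> prompt keys
    p2t = scores[:, :P, P:]  # prompt queries -> instance keys
    t2p = scores[:, P:, :P]  # instance queries -> prompt keys
    t2t = scores[:, P:, P:]  # instance queries -> instance keys
    return p2p, p2t, t2p, t2t
```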
Meta learning for few shot learning
Few-shot learning aims to scale visual recognition to the open-ended growth of new classes with limited labelled examples, thus alleviating the data and computation bottlenecks of conventional deep learning. This thesis proposes a meta-learning (a.k.a. learning to learn) paradigm to tackle real-world few-shot learning challenges.
Firstly, we present a parameterized multi-metric based meta-learning algorithm (RelationNet2). Existing metric learning algorithms are typically based on training a single global deep embedding and metric to support image similarity matching, whereas we propose a deep comparison network, comprised of embedding and relation modules, that learns multiple non-linear distance metrics based on different levels of features simultaneously. Furthermore, images are represented as distributions rather than vectors via learned parameterized Gaussian noise regularization, reducing overfitting and enabling the use of deeper embeddings.
We next consider the fact that several recent competitors develop effective few-shot learners through strong conventional representations combined with very simple classifiers, questioning whether “meta-learning” is necessary or whether highly effective features are sufficient. To defend meta-learning, we take an approach that is agnostic to the off-the-shelf features and focus exclusively on meta-learning the final classifier layer. Specifically, we introduce MetaQDA, a Bayesian meta-learning extension of the quadratic discriminant analysis classifier, that is complementary to advances in feature representations, leading to high accuracy and state-of-the-art uncertainty calibration in predictions.
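For context, classical quadratic discriminant analysis, the classifier MetaQDA extends, fits one Gaussian per class over support features and labels queries by the highest (log-)posterior. The sketch below shows that baseline with equal class priors and ridge-regularized covariances; it is not the Bayesian MetaQDA algorithm itself, and it assumes at least two support shots per class.

```python
import numpy as np
from scipy.stats import multivariate_normal

def qda_fit_predict(support_x, support_y, query_x, reg=1e-3):
    """Plain QDA on few-shot features (equal priors, ML estimates).
    Assumes >= 2 support examples per class so covariances are defined."""
    classes = np.unique(support_y)
    log_post = []
    for c in classes:
        xc = support_x[support_y == c]
        mu = xc.mean(axis=0)
        # Ridge term keeps the covariance invertible in high dimensions.
        cov = np.cov(xc, rowvar=False) + reg * np.eye(xc.shape[1])
        log_post.append(multivariate_normal.logpdf(query_x, mu, cov))
    # Highest log-posterior wins; shape (Q, C) -> per-query class label.
    return classes[np.argmax(np.stack(log_post, axis=1), axis=1)]
```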
Finally, we investigate the extension of MetaQDA to more generalized real-world scenarios beyond the narrow standard few-shot benchmarks. Our model achieves strong classification accuracy on both many-shot and few-shot classes in generalized few-shot learning. In few-shot class-incremental learning, MetaQDA is inherently suited to scenarios where novel classes keep growing. As for open-set recognition, we calculate the probability of belonging to a novel class via Bayes' rule, maintaining high accuracy in both closed-set recognition and open-set rejection.
Overall, our contributions in few-shot meta-learning advance the state of the art under both accuracy and calibration metrics and explore a series of increasingly realistic problem settings, supporting more researchers and practitioners in future exploration.
Conditional Positional Encodings for Vision Transformers
We propose a conditional positional encoding (CPE) scheme for vision
Transformers. Unlike previous fixed or learnable positional encodings, which
are pre-defined and independent of input tokens, CPE is dynamically generated
and conditioned on the local neighborhood of the input tokens. As a result, CPE
can easily generalize to input sequences longer than those the model has seen
during training. Besides, CPE can keep the desired
translation-invariance in the image classification task, resulting in improved
classification accuracy. CPE can be effortlessly implemented with a simple
Position Encoding Generator (PEG), and it can be seamlessly incorporated into
the current Transformer framework. Built on PEG, we present Conditional
Position encoding Vision Transformer (CPVT). We demonstrate that CPVT
produces attention maps visually similar to those of models with learned
positional encodings. Benefiting from the conditional positional encoding
scheme, we obtain
state-of-the-art results on the ImageNet classification task compared with
vision Transformers to date. Our code will be made available at
https://github.com/Meituan-AutoML/CPVT
Comment: A general-purpose conditional position encoding for vision
transformers
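The PEG described above derives position information from the local neighborhood of tokens; a common realization is a depthwise convolution over the 2-D token grid with an identity shortcut. The sketch below follows that recipe but is a hedged reimplementation, not the authors' released code.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Position Encoding Generator sketch: a depthwise 3x3 convolution over
    the token grid injects position cues conditioned on local neighbors."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens, h, w):
        # tokens: (B, N, D) with N == h * w; any class token is assumed to
        # be handled separately before calling this module.
        b, n, d = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, d, h, w)
        feat = self.proj(feat) + feat               # conv encoding + identity
        return feat.flatten(2).transpose(1, 2)      # back to (B, N, D)
```

Because the convolution is translation-equivariant and depends only on token content, the resulting encoding extends naturally to longer sequences at test time.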