AIREPAIR: A Repair Platform for Neural Networks
We present AIREPAIR, a platform for repairing neural networks. It features
the integration of existing network repair tools. Based on AIREPAIR, one can
run different repair methods on the same model, thus enabling the fair
comparison of different repair techniques. We evaluate AIREPAIR with three
state-of-the-art repair tools on popular deep-learning datasets and models. Our
evaluation confirms the utility of AIREPAIR by comparing and analyzing the
results from different repair techniques. A demonstration is available at
https://youtu.be/UkKw5neeWhw
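As an illustrative sketch of the fair-comparison idea (the helper names below are hypothetical placeholders, not AIREPAIR's actual interface), such a platform amounts to running each repair back-end on a copy of the same model and scoring every result with an identical metric:

    import copy

    def evaluate(model, test_data):
        # Placeholder metric: accuracy of a callable model on labeled pairs.
        correct = sum(1 for x, y in test_data if model(x) == y)
        return correct / max(len(test_data), 1)

    def compare_repair_tools(model, train_data, test_data, repair_tools):
        # repair_tools maps a tool name to a function
        # (model, data) -> repaired model.
        results = {}
        for name, repair_fn in repair_tools.items():
            repaired = repair_fn(copy.deepcopy(model), train_data)
            results[name] = evaluate(repaired, test_data)
        return results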
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses in Foundation Models
A growing ecosystem of large, open-source foundation models has reduced the
labeled data and technical expertise necessary to apply machine learning to
many new problems. Yet foundation models pose a clear dual-use risk,
indiscriminately reducing the costs of building both harmful and beneficial
machine learning systems. To mitigate this risk, we propose the task blocking
paradigm, in which foundation models are trained with an additional mechanism
to impede adaptation to harmful tasks while retaining good performance on
desired tasks. We call the resulting models self-destructing models, inspired
by mechanisms that prevent adversaries from using tools for harmful purposes.
We present an algorithm for training self-destructing models leveraging
techniques from meta-learning and adversarial learning, showing that it can
largely prevent a BERT-based model from learning to perform gender
identification without harming the model's ability to perform profession
classification. We conclude with a discussion of future directions.
Comment: Presented at the First Workshop of Pre-training: Perspectives,
Pitfalls, and Paths Forward (ICML, 2022) and the New Frontiers in Adversarial
Machine Learning Workshop (ICML, 2022).
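As a minimal sketch of the meta-learning-plus-adversarial idea (not the paper's exact algorithm, and assuming a model wrapper exposing a differentiable loss(...) that can optionally be evaluated with substituted parameters), one training step could combine the two objectives like this:

    import torch

    def task_blocking_loss(model, desired_batch, harmful_batch,
                           inner_lr=1e-3, adv_weight=1.0):
        # Loss on the desired task with the current parameters.
        desired_loss = model.loss(desired_batch)

        # Simulated adversary: one differentiable fine-tuning step on the
        # harmful task.
        harmful_loss = model.loss(harmful_batch)
        grads = torch.autograd.grad(harmful_loss, tuple(model.parameters()),
                                    create_graph=True)
        adapted = [p - inner_lr * g
                   for p, g in zip(model.parameters(), grads)]

        # Harmful-task loss *after* the adversary's step, evaluated with the
        # adapted weights (assumed interface, e.g. via functional_call).
        post_adapt_loss = model.loss(harmful_batch, params=adapted)

        # Minimize desired-task loss while *maximizing* the adversary's
        # post-adaptation loss, impeding adaptation to the harmful task.
        return desired_loss - adv_weight * post_adapt_loss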
SpecAttack: Specification-Based Adversarial Training for Deep Neural Networks
Safety specification-based adversarial training aims to generate examples
violating a formal safety specification and therefore provides approaches for
repair. Maintaining high prediction accuracy while ensuring safe behavior
remains challenging. We therefore present SpecAttack, a query-efficient
counter-example generation and repair method for deep neural networks.
SpecAttack lets users specify safety constraints on a model and then searches
for inputs that violate these constraints. These violations are then used to
repair the neural network via re-training such that it becomes provably safe.
We evaluate SpecAttack's performance on the task of counter-example generation
and repair. Our experimental evaluation demonstrates that SpecAttack is in
most cases more query-efficient than comparable attacks and yields
counter-examples of higher quality, while its repair technique is more
efficient, maintains higher functional correctness, and provably guarantees
compliance with the safety specification.
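The counter-example-guided repair loop described above could be organized as follows (a sketch with injected helpers; SpecAttack's actual attack, retraining, and verification procedures are more involved, and spec.safe_label is an assumed interface):

    def repair_until_safe(model, spec, train_data,
                          find_violations, retrain, verify, max_rounds=10):
        # find_violations(model, spec) -> inputs violating the specification
        # retrain(model, data)         -> repaired model
        # verify(model, spec)          -> True iff the spec provably holds
        for _ in range(max_rounds):
            if verify(model, spec):
                return model                     # provably safe: done
            counter_examples = find_violations(model, spec)
            labeled = [(x, spec.safe_label(x)) for x in counter_examples]
            model = retrain(model, list(train_data) + labeled)
        raise RuntimeError("repair budget exhausted before the spec was proved")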
Editing Language Model-based Knowledge Graph Embeddings
Recent years have witnessed the empirical success of framing Knowledge
Graph (KG) embeddings via language models. However, language model-based KG
embeddings are usually deployed as static artifacts, which are challenging to
modify without re-training after deployment. To address this issue, we propose
a new task of editing language model-based KG embeddings in this paper. The
proposed task aims to enable data-efficient and fast updates to KG embeddings
without degrading performance on the rest of the knowledge. We build four new
datasets:
E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge
editing baselines, demonstrating the limited ability of previous models to
handle this challenging task. We further propose a simple yet strong
baseline dubbed KGEditor, which utilizes additional parametric layers of a
hypernetwork to edit/add facts. Comprehensive experimental results demonstrate
that KGEditor performs better when updating specific facts without affecting
the rest, and does so with low training resources. Code and datasets will be
available at https://github.com/zjunlp/PromptKG/tree/main/deltaKG
Comment: Work in progress and the project website is
https://zjunlp.github.io/project/KGE_Editing
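A minimal sketch of the hypernetwork idea (dimensions, names, and the low-rank form are illustrative assumptions, not KGEditor's exact architecture): a small network maps an embedding of the fact being edited to a weight delta applied at one extra layer, leaving the base model frozen:

    import torch
    import torch.nn as nn

    class HyperEditLayer(nn.Module):
        def __init__(self, hidden_dim, fact_dim, rank=8):
            super().__init__()
            self.hidden_dim, self.rank = hidden_dim, rank
            # Hypernetwork: fact embedding -> low-rank weight delta (A @ B).
            self.to_a = nn.Linear(fact_dim, hidden_dim * rank)
            self.to_b = nn.Linear(fact_dim, rank * hidden_dim)

        def forward(self, h, fact_emb):
            # h: (batch, hidden_dim); fact_emb: (fact_dim,) for one edit.
            a = self.to_a(fact_emb).view(self.hidden_dim, self.rank)
            b = self.to_b(fact_emb).view(self.rank, self.hidden_dim)
            delta = a @ b                  # (hidden_dim, hidden_dim) edit
            return h + h @ delta.T         # steer the frozen model's states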
Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors
Large pre-trained models decay over long-term deployment as input
distributions shift, user requirements change, or crucial knowledge gaps are
discovered. Recently, model editors have been proposed to modify a model's
behavior by adjusting its weights during deployment. However, when editing the
same model multiple times, these approaches quickly degrade a model's
performance on upstream data and forget how to fix previous errors. We propose
and study a novel Lifelong Model Editing setting, where streaming errors are
identified for a deployed model and the model is updated to correct its
predictions on those errors without influencing unrelated inputs, and without
access to training edits, exogenous datasets, or any upstream data for the
edited model. To approach this problem,
we introduce General Retrieval Adaptors for Continual Editing, or GRACE, which
learns to cache a chosen layer's activations in an adaptive codebook as edits
stream in, leaving original model weights frozen. GRACE can thus edit models
thousands of times in a row using only streaming errors, while minimally
influencing unrelated inputs. Experimentally, we show that GRACE improves over
recent model editors and generalizes to unseen inputs. Our code is available at
https://www.github.com/thartvigsen/grace
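In simplified form (epsilon handling and how values are learned are more elaborate in the paper), a GRACE-style codebook wraps one layer like this:

    import torch

    class GraceAdaptor:
        # Simplified codebook at one layer: keys are cached activations of
        # inputs that needed edits, values are replacement activations, and
        # epsilon is the deferral radius around each key.
        def __init__(self, epsilon=1.0):
            self.keys, self.values, self.epsilon = [], [], epsilon

        def add_edit(self, key, value):
            # Cache one edit; original model weights are never touched.
            self.keys.append(key.detach())
            self.values.append(value.detach())

        def __call__(self, activation):
            if not self.keys:
                return activation
            dists = torch.stack([torch.norm(activation - k)
                                 for k in self.keys])
            i = int(torch.argmin(dists))
            # Apply the cached edit only inside its deferral radius;
            # otherwise the frozen model's activation passes through.
            return self.values[i] if dists[i] < self.epsilon else activation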
Stable Knowledge Editing in Large Language Models
Efficient knowledge editing of large language models is crucial for replacing
obsolete information or incorporating specialized knowledge on a large scale.
However, previous methods implicitly assume that knowledge is localized and
isolated within the model, an assumption that oversimplifies the interconnected
nature of model knowledge. The localization premise results in incomplete
knowledge editing, whereas the isolation assumption may impair both other
knowledge and general abilities, introducing instability into the performance
of knowledge editing methods. To move beyond these assumptions, we introduce
StableKE, a method that adopts a novel perspective based on knowledge augmentation
rather than knowledge localization. To overcome the expense of human labeling,
StableKE integrates two automated knowledge augmentation strategies: a
Semantic Paraphrase Enhancement strategy, which diversifies knowledge
descriptions to facilitate the teaching of new information to the model, and a
Contextual Description Enrichment strategy, which expands the surrounding
knowledge to prevent the forgetting of related information. StableKE surpasses
other knowledge editing methods, demonstrating stability for both edited and
multi-hop knowledge, while also preserving unrelated knowledge and general
abilities. Moreover, StableKE can edit knowledge in ChatGPT.
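The two augmentation strategies could be assembled into an editing set roughly as follows (a sketch: paraphrase_fn and describe_entity_fn stand in for the automated generators, and passing the surrounding entities explicitly is an assumption):

    def build_editing_set(fact, entities, paraphrase_fn, describe_entity_fn,
                          n_paraphrases=5):
        # Semantic Paraphrase Enhancement: diverse rewrites of the edited
        # fact, so the new information is taught through many surface forms.
        paraphrases = [paraphrase_fn(fact) for _ in range(n_paraphrases)]
        # Contextual Description Enrichment: descriptions of surrounding
        # entities, rehearsed so related knowledge is not forgotten.
        contexts = [describe_entity_fn(e) for e in entities]
        # The union serves as ordinary fine-tuning data for the edit.
        return paraphrases + contexts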
Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation
Large language models (LLMs) have been widely used in various applications
but are known to suffer from issues related to untruthfulness and toxicity.
While parameter-efficient modules (PEMs) have demonstrated their effectiveness
in equipping models with new skills, leveraging PEMs for deficiency unlearning
remains underexplored. In this work, we propose a PEM operation approach,
namely Extraction-before-Subtraction (Ext-Sub), to enhance the truthfulness and
detoxification of LLMs through the integration of an ``expert'' PEM and an
``anti-expert'' PEM. Remarkably, even the anti-expert PEM possesses valuable
capabilities, owing to its proficiency in generating fabricated content, which
requires language modeling and logical narrative competence. Rather than
merely negating the parameters, our approach extracts and eliminates only the
deficiency capability within the anti-expert PEM while
preserving the general capabilities. To evaluate the effectiveness of our
approach in terms of truthfulness and detoxification, we conduct extensive
experiments on LLMs, encompassing additional abilities such as language
modeling and mathematical reasoning. Our empirical results demonstrate that our
approach effectively improves truthfulness and detoxification, while largely
preserving the fundamental abilities of LLMs.
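One plausible reading of extraction-before-subtraction on flattened PEM weight deltas (a hedged sketch, not the paper's exact operator): treat the component of the anti-expert delta aligned with the expert delta as shared general capability to keep, and subtract only the remainder:

    import torch

    def ext_sub(expert_delta, anti_expert_delta, alpha=1.0):
        e = expert_delta.flatten()
        a = anti_expert_delta.flatten()
        # "Extraction": the part of the anti-expert delta aligned with the
        # expert delta is treated as shared general capability and kept.
        shared = ((a @ e) / (e @ e)) * e
        deficiency = a - shared          # deficiency-specific component
        # "Subtraction": remove only the deficiency component.
        merged = e - alpha * deficiency
        return merged.view_as(expert_delta)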
Short-Term Plasticity Neurons Learning to Learn and Forget
Short-term plasticity (STP) is a mechanism that stores decaying memories in
synapses of the cerebral cortex. In computing practice, STP has been used, but
mostly in the niche of spiking neurons, even though theory predicts that it is
the optimal solution to certain dynamic tasks. Here we present a new type of
recurrent neural unit, the STP Neuron (STPN), which turns out to be strikingly
powerful. Its key mechanism is that synapses have a state, propagated through
time by a self-recurrent connection-within-the-synapse. This formulation
enables training the plasticity with backpropagation through time, resulting in
a form of learning to learn and forget in the short term. The STPN outperforms
all tested alternatives, i.e. RNNs, LSTMs, other models with fast weights, and
differentiable plasticity. We confirm this in both supervised and reinforcement
learning (RL), and in tasks such as Associative Retrieval, Maze Exploration,
Atari video games, and MuJoCo robotics. Moreover, we calculate that, in
neuromorphic or biological circuits, the STPN minimizes energy consumption
across models, as it depresses individual synapses dynamically. Based on these
results,
biological STP may have been a strong evolutionary attractor that maximizes
both efficiency and computational power. The STPN now brings these neuromorphic
advantages also to a broad spectrum of machine learning practice. Code is
available at https://github.com/NeuromorphicComputing/stpn
Comment: Accepted at ICML 2022
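A minimal sketch of the mechanism (the paper's exact update and parameterization differ): alongside a slow weight G, each synapse carries a fast state F that decays and is re-written by local pre/post activity; because the decay and plasticity rates are ordinary parameters, the plasticity rule itself is trained with backpropagation through time:

    import torch
    import torch.nn as nn

    class STPNCell(nn.Module):
        def __init__(self, n_in, n_out):
            super().__init__()
            self.G = nn.Parameter(0.1 * torch.randn(n_out, n_in))  # slow weights
            self.lambda_ = nn.Parameter(torch.full((n_out, n_in), 0.9))   # decay
            self.gamma = nn.Parameter(torch.full((n_out, n_in), 0.01))    # plasticity

        def forward(self, x, F):
            # x: (batch, n_in); F: (n_out, n_in) synapse state shared across
            # the batch (a simplification: the paper keeps per-sequence states).
            h = torch.tanh(x @ (self.G + F).T)
            # Self-recurrent synapse state: decay, then a Hebbian-style update
            # from the batch-averaged outer product of post- and pre-activity.
            F_next = self.lambda_ * F + self.gamma * torch.einsum(
                'bo,bi->oi', h, x) / x.shape[0]
            return h, F_next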
Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies
Large language models (LLMs) have demonstrated remarkable performance across
a wide array of NLP tasks. However, their efficacy is undermined by undesired
and inconsistent behaviors, including hallucination, unfaithful reasoning, and
toxic content. A promising approach to rectify these flaws is self-correction,
where the LLM itself is prompted or guided to fix problems in its own output.
Techniques leveraging automated feedback -- either produced by the LLM itself
or some external system -- are of particular interest as they are a promising
way to make LLM-based solutions more practical and deployable with minimal
human feedback. This paper presents a comprehensive review of this emerging
class of techniques. We analyze and taxonomize a wide array of recent work
utilizing these strategies, including training-time, generation-time, and
post-hoc correction. We also summarize the major applications of this strategy
and conclude by discussing future directions and challenges.
Comment: Work in Progress.
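A generic post-hoc correction loop of the kind the survey taxonomizes might look like this (llm and critic are placeholder callables; real systems vary widely in how feedback is produced and consumed):

    def self_correct(llm, critic, prompt, max_rounds=3):
        # llm(prompt) -> text; critic(prompt, answer) -> (ok, feedback_text).
        answer = llm(prompt)
        for _ in range(max_rounds):
            ok, feedback = critic(prompt, answer)  # automated, not human
            if ok:
                break
            # Refinement step: feed the critique back into the model.
            answer = llm(f"{prompt}\nPrevious answer: {answer}\n"
                         f"Feedback: {feedback}\nRevise the answer accordingly.")
        return answer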