Embracing Unknown Step by Step: Towards Reliable Sparse Training in Real World
Sparse training has emerged as a promising method for resource-efficient deep
neural networks (DNNs) in real-world applications. However, the reliability of
sparse models remains a crucial concern, particularly in detecting unknown
out-of-distribution (OOD) data. This study addresses the knowledge gap by
investigating the reliability of sparse training from an OOD perspective and
reveals that sparse training exacerbates OOD unreliability. The lack of unknown
information and the sparse constraints hinder the effective exploration of
weight space and accurate differentiation between known and unknown knowledge.
To tackle these challenges, we propose a new unknown-aware sparse training
method, which incorporates a loss modification, auto-tuning strategy, and a
voting scheme to guide weight space exploration and mitigate confusion between
known and unknown information without incurring significant additional costs or
requiring access to additional OOD data. Theoretical insights demonstrate how
our method reduces model confidence when faced with OOD samples. Empirical
experiments across multiple datasets, model architectures, and sparsity levels
validate the effectiveness of our method, with improvements of up to
\textbf{8.4\%} in AUROC while maintaining comparable or higher accuracy and
calibration. This research enhances the understanding and readiness of sparse
DNNs for deployment in resource-limited applications. Our code is available at
\url{https://github.com/StevenBoys/MOON}.
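The abstract describes a loss modification that lowers confidence on unknown inputs without extra OOD data, applied under a sparsity constraint. The sketch below is only a generic illustration of those two ingredients, a confidence-penalty (entropy) term and a re-applied sparse mask, with hypothetical names such as `unknown_aware_loss` and `lambda_unknown`; the actual MOON loss, auto-tuning strategy, and voting scheme are defined in the paper.

```python
# Illustrative sketch only (not the paper's exact MOON loss): cross-entropy
# plus an entropy bonus that keeps predictions less peaked, and a helper that
# re-imposes the sparse mask after each optimizer step. Names are assumptions.
import torch
import torch.nn.functional as F

def unknown_aware_loss(logits, targets, lambda_unknown=0.1):
    """Cross-entropy minus a weighted entropy term; higher entropy means
    lower confidence, which helps on OOD-like inputs."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return ce - lambda_unknown * entropy

def apply_sparse_mask(model, masks):
    """Zero out weights outside the fixed sparsity pattern."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])
```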
Towards Robust Pruning: An Adaptive Knowledge-Retention Pruning Strategy for Language Models
The pruning objective has recently been extended beyond accuracy and sparsity to
robustness in language models. Despite this, existing methods struggle to
maintain robustness against adversarial attacks as model sparsity increases,
and they require a retraining process. As we enter the era of large language
models, these issues become increasingly prominent. This paper
proposes that the robustness of language models is proportional to the extent
of pre-trained knowledge they encompass. Accordingly, we introduce a
post-training pruning strategy designed to faithfully replicate the embedding
space and feature space of dense language models, aiming to conserve more
pre-trained knowledge during the pruning process. In this setup, each layer's
reconstruction error not only originates from itself but also includes
cumulative error from preceding layers, followed by an adaptive rectification.
Compared to other state-of-the-art baselines, our approach demonstrates a
superior balance between accuracy, sparsity, robustness, and pruning cost with
BERT on the SST2, IMDB, and AGNews datasets, marking a significant stride
towards robust pruning in language models.
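To make the layer-wise reconstruction idea concrete, here is a hedged sketch (not the paper's actual algorithm): a linear layer is pruned by magnitude, and the surviving weights are refit by least squares so the sparse layer reproduces the dense layer's outputs on calibration activations. Feeding the pruned model's activations while targeting the dense model's outputs is what lets the cumulative error from earlier layers be rectified here; `prune_and_rectify` and its arguments are illustrative.

```python
# Hedged sketch of magnitude pruning followed by least-squares rectification.
import torch

def prune_and_rectify(weight, x_sparse, x_dense, sparsity=0.5):
    """weight: (out, in) dense weights; x_sparse/x_dense: (n, in) calibration
    activations from the pruned and dense models, respectively."""
    # 1) Magnitude-based pruning mask.
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()
    # 2) The dense layer's target outputs.
    target = x_dense @ weight.t()
    # 3) Refit each output row on its surviving inputs, so errors accumulated
    #    in earlier pruned layers are corrected at this layer.
    new_w = torch.zeros_like(weight)
    for i in range(weight.shape[0]):
        keep = mask[i].bool()
        if keep.any():
            sol = torch.linalg.lstsq(x_sparse[:, keep], target[:, i]).solution
            new_w[i, keep] = sol
    return new_w * mask
```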
Breaking through Deterministic Barriers: Randomized Pruning Mask Generation and Selection
It is widely acknowledged that large and sparse models have higher accuracy
than small and dense models under the same model size constraints. This
motivates us to train a large model and then remove its redundant neurons or
weights by pruning. Most existing works prune networks in a deterministic
way, so performance depends solely on a single pruning criterion and thus
lacks variety. Instead, in this paper, we propose a model pruning strategy
that first generates several pruning masks in a designed, randomized way.
Subsequently, along with an effective mask-selection rule, the optimal mask is
chosen from the pool of mask candidates. To further enhance efficiency, we
introduce an early mask evaluation strategy, mitigating the overhead associated
with training multiple masks. Our extensive experiments demonstrate that this
approach achieves state-of-the-art performance across eight datasets from GLUE,
particularly excelling at high levels of sparsity.
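The generate-then-select procedure can be pictured with the minimal sketch below; the sampling scheme and names such as `sample_mask` are assumptions rather than the paper's exact recipe. Several masks are drawn at random with survival probability tied to weight magnitude, each candidate is scored with a cheap early evaluation, and the best-scoring mask is kept.

```python
# Illustrative randomized mask generation and selection for one weight tensor.
import copy
import torch

def sample_mask(weight, sparsity=0.5):
    """Randomly keep (1 - sparsity) of the entries, favoring large magnitudes."""
    keep = int(weight.numel() * (1 - sparsity))
    probs = weight.abs().flatten()
    probs = probs / probs.sum()
    idx = torch.multinomial(probs, keep, replacement=False)
    mask = torch.zeros_like(probs)
    mask[idx] = 1.0
    return mask.view_as(weight)

def select_best_mask(model, weight_name, eval_fn, n_candidates=8):
    """eval_fn(model) -> validation score (higher is better), kept cheap so
    candidates can be screened early."""
    weight = dict(model.named_parameters())[weight_name]
    best_score, best_mask = float("-inf"), None
    for _ in range(n_candidates):
        mask = sample_mask(weight.data)
        candidate = copy.deepcopy(model)
        with torch.no_grad():
            dict(candidate.named_parameters())[weight_name].mul_(mask)
        score = eval_fn(candidate)
        if score > best_score:
            best_score, best_mask = score, mask
    return best_mask
```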
LMLFM: Longitudinal Multi-Level Factorization Machine
We consider the problem of learning predictive models from longitudinal data,
consisting of irregularly repeated, sparse observations from a set of
individuals over time. Such data often exhibit {\em longitudinal correlation}
(LC) (correlations among observations for each individual over time), {\em
cluster correlation} (CC) (correlations among individuals that have similar
characteristics), or both. These correlations are often accounted for using
{\em mixed effects models} that include {\em fixed effects} and {\em random
effects}, where the fixed effects capture the regression parameters that are
shared by all individuals, whereas random effects capture those parameters that
vary across individuals. However, the current state-of-the-art methods are
unable to select the most predictive fixed effects and random effects from a
large number of variables, while accounting for complex correlation structure
in the data and non-linear interactions among the variables. We propose
Longitudinal Multi-Level Factorization Machine (LMLFM), to the best of our
knowledge, the first model to address these challenges in learning predictive
models from longitudinal data. We establish the convergence properties, and
analyze the computational complexity, of LMLFM. We present results of
experiments with both simulated and real-world longitudinal data which show
that LMLFM outperforms the state-of-the-art methods in terms of predictive
accuracy, variable selection ability, and scalability to data with a large
number of variables. The code and supplemental material are available at
\url{https://github.com/junjieliang672/LMLFM}.
Comment: Accepted at the Thirty-Fourth AAAI Conference on Artificial Intelligence.
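For readers unfamiliar with factorization machines, the toy sketch below shows the general shape of a second-order FM predictor with a per-individual random effect added on top of shared fixed effects; it is illustrative only and does not reproduce LMLFM's multi-level formulation.

```python
# Toy factorization-machine predictor with a per-individual offset.
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM: bias + linear term + pairwise interactions via factors V."""
    linear = w0 + x @ w
    # 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    pairwise = 0.5 * (((x @ V) ** 2) - (x ** 2) @ (V ** 2)).sum()
    return linear + pairwise

def predict_with_random_effects(x, individual_id, fixed, random_effects):
    """fixed = (w0, w, V) shared by all individuals; random_effects maps an
    individual id to its own offset (a very simplified 'random effect')."""
    w0, w, V = fixed
    return fm_predict(x, w0, w, V) + random_effects.get(individual_id, 0.0)
```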
How Do We Move: Modeling Human Movement with System Dynamics
Modeling how humans move through space is useful for policy-making in
transportation, public safety, and public health. Human movement can be viewed
as a dynamic process in which humans transition between states (\eg, locations)
over time. In the human world, where intelligent agents like humans or vehicles
with human drivers play an important role, the states of agents mostly describe
human activities, and state transitions are influenced by both human decisions
and physical constraints from the real-world system (\eg, agents need to spend
time to move over a certain distance). Therefore, the modeling of state
transitions should include the modeling of the agent's decision process and the
physical system dynamics. In this paper, we propose a method to model state
transitions in human movement from a novel perspective, by learning the
decision model and integrating the system dynamics. Our method learns human
movement with Generative Adversarial Imitation Learning and integrates the
stochastic constraints from system dynamics in the learning process. To the
best of our knowledge, we are the first to learn to model the state transition
of moving agents with system dynamics. In extensive experiments on real-world
datasets, we demonstrate that the proposed method can generate trajectories
similar to real-world ones, and outperform the state-of-the-art methods in
predicting the next location and generating long-term future trajectories.
Comment: Accepted by AAAI 2021; appendices included. 12 pages, 8 figures. In
Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence
(AAAI'21), Feb 2021.
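One simple way to picture "integrating stochastic constraints from system dynamics" is to mask the learned policy's next-state distribution by physical reachability, as in the hedged sketch below. This illustrates the general idea rather than the paper's mechanism, and all names (`constrained_next_state`, `speed`, `dt`) are assumptions.

```python
# Illustrative feasibility-masked next-location sampling.
import numpy as np

def constrained_next_state(policy_probs, current_loc, locations, speed, dt):
    """policy_probs: (n_states,) distribution from the learned policy;
    locations: (n_states, 2) coordinates. States farther than the agent can
    travel in one step of length dt get zero probability."""
    dists = np.linalg.norm(locations - locations[current_loc], axis=1)
    feasible = dists <= speed * dt
    masked = policy_probs * feasible
    if masked.sum() == 0:  # nothing reachable: stay put
        return current_loc
    masked = masked / masked.sum()
    return int(np.random.choice(len(masked), p=masked))
```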
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
Augmented Language Models (ALMs) blend the reasoning capabilities of Large
Language Models (LLMs) with tools that allow for knowledge retrieval and action
execution. Existing ALM systems trigger LLM thought processes while pulling
observations from these tools in an interleaved fashion. Specifically, an LLM
reasons to call an external tool, gets halted to fetch the tool's response, and
then decides the next action based on all preceding response tokens. Such a
paradigm, though straightforward and easy to implement, often incurs substantial
computational overhead from redundant prompts and repeated execution. This
study addresses such challenges for the first time, proposing a modular
paradigm ReWOO (Reasoning WithOut Observation) that detaches the reasoning
process from external observations, thus significantly reducing token
consumption. Comprehensive evaluations across six public NLP benchmarks and a
curated dataset reveal consistent performance enhancements with our proposed
methodology. Notably, ReWOO achieves 5x token efficiency and 4% accuracy
improvement on HotpotQA, a multi-step reasoning benchmark. Furthermore, ReWOO
demonstrates robustness under tool-failure scenarios. Beyond prompt efficiency,
decoupling parametric modules from non-parametric tool calls enables
instruction fine-tuning to offload LLMs into smaller language models, thus
substantially reducing model parameters. Our illustrative work offloads
reasoning ability from a 175B GPT-3.5 model into 7B LLaMA, demonstrating the
significant potential for truly efficient and scalable ALM systems.
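The decoupling described above can be sketched in a few lines: one planner call produces a full plan with evidence placeholders, a worker fills the placeholders by calling tools, and a single solver call composes the answer. The sketch below is a minimal illustration, not the authors' implementation; `call_llm` and `TOOLS` stand in for whatever LLM client and tool set are actually used, and tools are assumed to return strings.

```python
# Minimal plan -> work -> solve loop with evidence placeholders (#E1, #E2, ...).
import re

def rewoo(question, call_llm, TOOLS):
    # 1) Plan once, without pausing for observations.
    plan = call_llm(
        "Write a step-by-step plan. For each step emit a line like\n"
        "#E<k> = <Tool>[<input, which may reference earlier #E tags>]\n"
        f"Question: {question}"
    )
    # 2) Worker: run the tools, substituting earlier evidence into later inputs.
    evidence = {}
    for tag, tool, arg in re.findall(r"(#E\d+)\s*=\s*(\w+)\[(.*?)\]", plan):
        for known_tag, value in evidence.items():
            arg = arg.replace(known_tag, value)
        evidence[tag] = TOOLS[tool](arg)
    # 3) Solve with the plan and all collected evidence in one final call.
    return call_llm(
        f"Question: {question}\nPlan:\n{plan}\nEvidence: {evidence}\nAnswer:"
    )
```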
FP8-BERT: Post-Training Quantization for Transformer
Transformer-based models, such as BERT, have been applied to a wide range of
natural language processing tasks. However, one inevitable side effect is that
they incur substantial memory and inference costs when deployed in production.
Quantization is one of the most popular ways to alleviate these costs.
However, the previous 8-bit quantization strategy based on INT8 data format
either suffers from the degradation of accuracy in a Post-Training Quantization
(PTQ) fashion or requires an expensive Quantization-Aware Training (QAT)
process. Recently, a new numeric format, FP8 (i.e., 8-bit floating point), has
been proposed and is supported on commercial AI computing platforms such as the
NVIDIA H100.
In this paper, we empirically validate the effectiveness of FP8 as a way to do
Post-Training Quantization without significant loss of accuracy, with a simple
calibration and format conversion process. We adopt the FP8 standard proposed
by NVIDIA Corp. (2022) in our extensive experiments with BERT variants on the
GLUE and SQuAD v1.1 datasets, and show that PTQ with FP8 significantly improves
accuracy over INT8, approaching that of the full-precision model.
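A minimal sketch of per-tensor FP8 post-training quantization with max calibration is shown below. It simulates E4M3 rounding by casting to `torch.float8_e4m3fn` and back (this dtype is available in recent PyTorch releases); the single-scale max-calibration rule is a common convention and may differ from the exact procedure in the paper or in NVIDIA's FP8 specification.

```python
# Hedged per-tensor FP8 (E4M3) fake quantization with max calibration.
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def calibrate_scale(calibration_tensors):
    """One scale per tensor: map the observed max magnitude onto the FP8 range."""
    amax = max(t.abs().max().item() for t in calibration_tensors)
    return amax / E4M3_MAX if amax > 0 else 1.0

def fp8_quant_dequant(x, scale):
    """Scale into FP8 range, round by casting to E4M3, then scale back."""
    q = (x / scale).to(torch.float8_e4m3fn)
    return q.to(x.dtype) * scale
```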
Students' Perceptions and Preferences of Generative Artificial Intelligence Feedback for Programming
The rapid evolution of artificial intelligence (AI), specifically large
language models (LLMs), has opened opportunities for various educational
applications. This paper explored the feasibility of utilizing ChatGPT, one of
the most popular LLMs, for automating feedback for Java programming assignments
in an introductory computer science (CS1) class. Specifically, this study
focused on three questions: 1) To what extent do students view LLM-generated
feedback as formative? 2) How do students see the comparative affordances of
feedback prompts that include their code, vs. those that exclude it? 3) What
enhancements do students suggest for improving AI-generated feedback? To
address these questions, we generated automated feedback using the ChatGPT API
for four lab assignments in the CS1 class. The survey results revealed that
students perceived the feedback as aligning well with formative feedback
guidelines established by Shute. Additionally, students showed a clear
preference for feedback generated by including the students' code as part of
the LLM prompt, and our thematic study indicated that the preference was mainly
attributed to the specificity, clarity, and corrective nature of the feedback.
Moreover, this study found that students generally expected specific and
corrective feedback with sufficient code examples, but had divergent opinions on
the tone of the feedback. This study demonstrated that ChatGPT could generate
Java programming assignment feedback that students perceived as formative. It
also offered insights into the specific improvements that would make the
ChatGPT-generated feedback useful for students.
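For context, generating "code-included" feedback could look roughly like the sketch below, which sends the assignment description and the student's code to the OpenAI chat API. The prompt wording, rubric, and model name are assumptions, not the study's actual materials.

```python
# Illustrative feedback generation with the student's code in the prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_feedback(assignment_description: str, student_code: str) -> str:
    prompt = (
        "You are a CS1 teaching assistant. Give formative feedback on the Java "
        "submission below: note what is correct, what is wrong, and concrete "
        "next steps. Be specific and constructive.\n\n"
        f"Assignment:\n{assignment_description}\n\nStudent code:\n{student_code}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```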
Dynamic Sparse Training via Balancing the Exploration-Exploitation Trade-off
Over-parameterized deep neural networks (DNNs) have shown high prediction
accuracy in many applications. Although effective, the large number of
parameters hinders their adoption on resource-limited devices and has an
outsize environmental impact. Sparse training (using a fixed number of nonzero
weights in each iteration) could significantly mitigate the training costs by
reducing the model size. However, existing sparse training methods mainly use
either random-based or greedy-based drop-and-grow strategies, resulting in
local minima and low accuracy. In this work, we treat dynamic sparse training
as a sparse connectivity search problem and design an acquisition function that
balances exploitation and exploration to escape from local optima and saddle
points. We further provide theoretical guarantees for the proposed method and
clarify its convergence properties.
Experimental results show that sparse models (up to 98\% sparsity) obtained by
our proposed method outperform the SOTA sparse training methods on a wide
variety of deep learning tasks. On VGG-19 / CIFAR-100, ResNet-50 / CIFAR-10,
and ResNet-50 / CIFAR-100, our method achieves even higher accuracy than dense
models. On ResNet-50 / ImageNet, the proposed method achieves up to 8.2\%
accuracy improvement compared to SOTA sparse training methods.
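A drop-and-grow update driven by an acquisition-style score can be sketched as follows. This is a hedged illustration of the exploration-exploitation idea, not the paper's actual acquisition function: exploitation favors inactive connections with large gradient magnitude, while a random exploration bonus weighted by `beta` (an assumed hyperparameter) helps the connectivity search move away from local optima and saddle points.

```python
# Hedged drop-and-grow step for one weight tensor and its binary mask.
import torch

def drop_and_grow(weight, grad, mask, n_update, beta=0.1):
    """weight, grad, and mask share a shape; n_update connections are dropped
    and the same number are regrown, keeping overall sparsity fixed."""
    active = mask.bool()  # snapshot of the current connectivity
    # Drop: remove the smallest-magnitude active weights.
    drop_scores = torch.where(active, weight.abs(),
                              torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(drop_scores.view(-1), n_update, largest=False).indices
    mask.view(-1)[drop_idx] = 0.0
    # Grow: among previously inactive connections, pick the best
    # exploitation (gradient magnitude) + exploration (random bonus) score.
    acq = grad.abs() + beta * torch.rand_like(grad)
    acq = torch.where(~active, acq, torch.full_like(acq, float("-inf")))
    grow_idx = torch.topk(acq.view(-1), n_update).indices
    mask.view(-1)[grow_idx] = 1.0
    weight.data.view(-1)[grow_idx] = 0.0  # newly grown connections start at zero
    return mask
```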