Freeze then Train: Towards Provable Representation Learning under Spurious Correlations and Feature Noise
The existence of spurious correlations such as image backgrounds in the
training environment can make empirical risk minimization (ERM) perform badly
in the test environment. To address this problem, Kirichenko et al. (2022)
empirically found that the core features related to the outcome can still be
learned well even in the presence of spurious correlations. This suggests a
promising strategy: first train a feature learner rather than a classifier,
and then perform linear probing (last-layer retraining) in the test
environment. However, a theoretical understanding of when and why this approach
works is lacking. In this paper, we find that core features are only learned
well when their associated non-realizable noise is smaller than that of
spurious features, which is not necessarily true in practice. We provide both
theory and experiments to support this finding and to illustrate the
importance of non-realizable noise. Moreover, we propose an algorithm called
Freeze then Train (FTT), which first freezes certain salient features and then
trains the rest of the features using ERM. We theoretically show that FTT
preserves features that are more beneficial to test-time probing. Across two
commonly used spurious-correlation datasets, FTT outperforms ERM, IRM, JTT and
CVaR-DRO, with a substantial improvement in accuracy (by 4.5%) when the
feature noise is large. FTT also performs better on general distribution shift
benchmarks.
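To make the two-stage recipe concrete, here is a minimal PyTorch-style sketch of the freeze-then-train idea, assuming the frozen salient features and the ERM-trained features live in separate sub-networks; the module, loader, and hyperparameter names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical two-block feature extractor: `frozen_net` holds the salient
# features kept fixed, `trained_net` holds the features updated with ERM.
class FTTFeatures(nn.Module):
    def __init__(self, d_in, d_frozen, d_trained):
        super().__init__()
        self.frozen_net = nn.Linear(d_in, d_frozen)
        self.trained_net = nn.Linear(d_in, d_trained)
        for p in self.frozen_net.parameters():    # stage 1: freeze salient features
            p.requires_grad_(False)

    def forward(self, x):
        return torch.cat([self.frozen_net(x), self.trained_net(x)], dim=-1)

def erm_train(features, head, train_loader, epochs=10, lr=1e-3):
    """Stage 2: ERM on the trainable features plus a classification head."""
    params = [p for p in list(features.parameters()) + list(head.parameters())
              if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x, y in train_loader:
            loss = nn.functional.cross_entropy(head(features(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()

def linear_probe(features, probe_loader, num_classes, epochs=10, lr=1e-3):
    """Stage 3: last-layer retraining (linear probing) in the test environment."""
    d = features.frozen_net.out_features + features.trained_net.out_features
    head = nn.Linear(d, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in probe_loader:
            with torch.no_grad():
                z = features(x)                   # features stay fixed at probe time
            loss = nn.functional.cross_entropy(head(z), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return head
```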
A study of conceptual language similarity: comparison and evaluation
An interesting line of research in natural language processing (NLP) aims to
incorporate linguistic typology to bridge linguistic diversity and assist the
research of low-resource languages. While most works construct linguistic
similarity measures based on lexical or typological features, such as word
order and verbal inflection, recent work has introduced a novel approach to
defining language similarity based on how languages represent basic concepts,
which is complementary to existing similarity measures. In this work, we study
conceptual similarity in detail and evaluate it extensively on a binary
classification task.
Discovering Latent Knowledge in Language Models Without Supervision
Existing techniques for training language models can be misaligned with the
truth: if we train models with imitation learning, they may reproduce errors
that humans make; if we train them to generate text that humans rate highly,
they may output errors that human evaluators can't detect. We propose
circumventing this issue by directly finding latent knowledge inside the
internal activations of a language model in a purely unsupervised way.
Specifically, we introduce a method for accurately answering yes-no questions
given only unlabeled model activations. It works by finding a direction in
activation space that satisfies logical consistency properties, such as that a
statement and its negation have opposite truth values. We show that despite
using no supervision and no model outputs, our method can recover diverse
knowledge represented in large language models: across 6 models and 10
question-answering datasets, it outperforms zero-shot accuracy by 4% on
average. We also find that it cuts prompt sensitivity in half and continues to
maintain high accuracy even when models are prompted to generate incorrect
answers. Our results provide an initial step toward discovering what language
models know, distinct from what they say, even when we don't have access to
explicit ground truth labels. (ICLR 2023)
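As a rough illustration, the sketch below fits a probe over paired activations of a statement and its negation so that their predicted truth values are logically consistent, in the spirit of the description above; the exact objective, the activation preprocessing, and all names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# `acts_pos`/`acts_neg` are hidden-state tensors of shape [n, d] for a batch of
# statements and their negations (assumed already extracted and normalized).
def fit_consistency_direction(acts_pos, acts_neg, steps=1000, lr=1e-3):
    d = acts_pos.shape[1]
    probe = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())  # direction + bias in activation space
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        p_pos = probe(acts_pos).squeeze(-1)    # P(statement is true)
        p_neg = probe(acts_neg).squeeze(-1)    # P(negation is true)
        consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()     # opposite truth values
        confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()  # discourage the trivial 0.5/0.5 solution
        loss = consistency + confidence
        opt.zero_grad(); loss.backward(); opt.step()
    return probe

# Usage: answer yes/no by averaging p_pos and 1 - p_neg and thresholding at 0.5.
```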
Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs
In comparative linguistics, colexification refers to the phenomenon of a
lexical form conveying two or more distinct meanings. Existing work on
colexification patterns relies on annotated word lists, limiting scalability
and usefulness in NLP. In contrast, we identify colexification patterns of more
than 2,000 concepts across 1,335 languages directly from an unannotated
parallel corpus. We then propose simple and effective methods to build
multilingual graphs from the colexification patterns: ColexNet and ColexNet+.
ColexNet's nodes are concepts and its edges are colexifications. In ColexNet+,
concept nodes are additionally linked through intermediate nodes, each
representing an ngram in one of 1,334 languages. We use ColexNet+ to train
$\overrightarrow{\text{ColexNet+}}$, high-quality multilingual embeddings that
are well-suited for transfer learning. In our experiments, we first show that
ColexNet achieves high recall on CLICS, a dataset of crosslingual
colexifications. We then evaluate $\overrightarrow{\text{ColexNet+}}$ on
roundtrip translation, sentence retrieval and sentence classification and show
that our embeddings surpass several transfer learning baselines. This
demonstrates the benefits of using colexification as a source of information in
multilingual NLP. (Findings of EMNLP 2023)
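A minimal sketch of how a ColexNet-style graph could be assembled from extracted colexification patterns is shown below; the input format (`form_concepts`) and the edge weighting are assumptions made for illustration, not the paper's pipeline.

```python
import itertools
from collections import defaultdict
import networkx as nx

# Assumed input: form_concepts[(lang, surface_form)] is the set of concepts
# that this form expresses in that language (as extracted from a parallel corpus).
def build_colexnet(form_concepts):
    weights = defaultdict(int)
    for (_lang, _form), concepts in form_concepts.items():
        for c1, c2 in itertools.combinations(sorted(concepts), 2):
            weights[(c1, c2)] += 1               # one observed colexification
    graph = nx.Graph()
    for (c1, c2), w in weights.items():
        graph.add_edge(c1, c2, weight=w)         # nodes = concepts, edges = colexifications
    return graph

toy = {
    ("spa", "dedo"): {"FINGER", "TOE"},          # Spanish colexifies FINGER and TOE
    ("eng", "finger"): {"FINGER"},
}
print(build_colexnet(toy).edges(data=True))
```

ColexNet+ would additionally insert the surface forms (ngrams) as intermediate nodes between the concept nodes they express; the sketch above only covers the concept-level graph.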
DOF: Accelerating High-order Differential Operators with Forward Propagation
Solving partial differential equations (PDEs) efficiently is essential for
analyzing complex physical systems. Recent advancements in leveraging deep
learning for solving PDEs have shown significant promise. However, machine
learning methods, such as Physics-Informed Neural Networks (PINN), face
challenges in handling high-order derivatives of neural network-parameterized
functions. Inspired by Forward Laplacian, a recent method of accelerating
Laplacian computation, we propose an efficient computational framework,
Differential Operator with Forward-propagation (DOF), for calculating general
second-order differential operators without losing any precision. We provide
rigorous proof of the advantages of our method over existing methods,
demonstrating a twofold improvement in efficiency and reduced memory
consumption on any architecture. Empirical results illustrate that our method
surpasses traditional automatic differentiation (AutoDiff) techniques,
achieving a 2x improvement on the MLP structure and nearly a 20x improvement
on MLPs with Jacobian sparsity.
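For flavour, the sketch below evaluates a general second-order operator sum_ij A_ij d^2 f / dx_i dx_j with nested forward-mode (JVP) passes in JAX; it only illustrates forward-propagated second derivatives and is not the paper's DOF framework.

```python
import jax
import jax.numpy as jnp

# Second directional derivative u^T H(x) v of a scalar function f, computed
# with two nested forward-mode passes (no reverse-mode sweep).
def second_directional(f, x, u, v):
    g = lambda y: jax.jvp(f, (y,), (u,))[1]      # y -> grad f(y) . u
    return jax.jvp(g, (x,), (v,))[1]             # directional derivative of g along v

# Illustrative (O(d^2) JVPs) evaluation of sum_ij A_ij d2f/dx_i dx_j.
def quadratic_form_operator(f, x, A):
    d = x.shape[0]
    basis = jnp.eye(d)
    H = jnp.array([[second_directional(f, x, basis[i], basis[j]) for j in range(d)]
                   for i in range(d)])
    return jnp.sum(A * H)                        # A = I gives the Laplacian (trace of the Hessian)

f = lambda x: jnp.sum(jnp.sin(x)) * jnp.sum(x ** 2)
x = jnp.arange(1.0, 4.0)
print(quadratic_form_operator(f, x, jnp.eye(3)))  # Laplacian of f at x
```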
Effects of Meteorology Changes on Inter-Annual Variations of Aerosol Optical Depth and Surface PM2.5 in China—Implications for PM2.5 Remote Sensing
PM2.5 retrieval from satellite-observed aerosol optical depth (AOD) is still challenging due to the strong impact of meteorology. We investigate the influences of meteorology changes on the inter-annual variations of AOD and surface PM2.5 in China between 2006 and 2017 using a nested 3D chemical transport model, GEOS-Chem, by fixing emissions at the 2006 level. We then identify the major meteorological elements controlling the inter-annual variations of AOD and surface PM2.5 using multiple linear regression. We find larger influences of meteorology changes on trends of AOD than on those of surface PM2.5. On the seasonal scale, meteorology changes are beneficial to AOD and surface PM2.5 reduction in spring (1–50%) but show an adverse effect on aerosol reduction in summer. In addition, the major meteorological elements influencing variations of AOD and PM2.5 are similar between spring and fall. In winter, meteorology changes are favorable to AOD reduction (−0.007 yr⁻¹, −1.2% yr⁻¹; p < 0.05) but enhance surface PM2.5 between 2006 and 2017. The difference in winter is mainly attributed to the stable boundary layer that isolates surface PM2.5 from the aerosol aloft. The significant decrease in AOD over the years is related to the increase in meridional wind speed at 850 hPa over the North China Plain (NCP) (p < 0.05). The increase of surface PM2.5 in NCP in winter is possibly related to increased temperature inversion and more stable stratification in the boundary layer. This suggests that previous estimates of wintertime surface PM2.5 using satellite measurements of AOD corrected by meteorological elements should be used with caution. Our findings provide potential meteorological elements that might improve the retrieval of surface PM2.5 from satellite-observed AOD on the seasonal scale.
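As a sketch of the attribution step, one could regress the fixed-emission inter-annual anomalies of AOD (or surface PM2.5) on candidate meteorological elements with multiple linear regression; the predictors and data below are random placeholders, and the actual variable set and preprocessing follow the paper, not this snippet.

```python
import numpy as np

years = np.arange(2006, 2018)
rng = np.random.default_rng(0)
met = {                                          # standardized seasonal-mean anomalies (placeholders)
    "v_wind_850hPa": rng.standard_normal(len(years)),
    "boundary_layer_height": rng.standard_normal(len(years)),
    "temperature_inversion": rng.standard_normal(len(years)),
}
aod_anomaly = rng.standard_normal(len(years))    # stand-in for the fixed-emission GEOS-Chem output

X = np.column_stack([np.ones(len(years))] + list(met.values()))
coef, *_ = np.linalg.lstsq(X, aod_anomaly, rcond=None)
for name, beta in zip(["intercept"] + list(met.keys()), coef):
    print(f"{name}: {beta:+.3f}")                # sign/size indicate each element's contribution
```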
Towards Revealing the Mystery behind Chain of Thought: a Theoretical Perspective
Recent studies have discovered that Chain-of-Thought prompting (CoT) can
dramatically improve the performance of Large Language Models (LLMs),
particularly when dealing with complex tasks involving mathematics or
reasoning. Despite the enormous empirical success, the underlying mechanisms
behind CoT and how it unlocks the potential of LLMs remain elusive. In this
paper, we take a first step towards theoretically answering these questions.
Specifically, we examine the capacity of LLMs with CoT in solving fundamental
mathematical and decision-making problems. We start by giving an impossibility
result showing that any bounded-depth Transformer cannot directly output
correct answers for basic arithmetic/equation tasks unless the model size grows
super-polynomially with respect to the input length. In contrast, we then prove
by construction that autoregressive Transformers of a constant size suffice to
solve both tasks by generating CoT derivations using a commonly-used math
language format. Moreover, we show that LLMs with CoT are capable of solving
a general class of decision-making problems known as Dynamic Programming,
thus justifying the power of CoT in tackling complex real-world tasks.
Finally, extensive
experiments on four tasks show that, while Transformers always fail to predict
the answers directly, they can consistently learn to generate correct solutions
step-by-step given sufficient CoT demonstrations.
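To illustrate the kind of step-by-step derivation referred to above, here is a toy generator that expands a parenthesized arithmetic expression into intermediate steps instead of outputting the answer directly; it only mimics the CoT derivation format and is unrelated to the paper's Transformer constructions.

```python
import re

# Reduce one innermost parenthesized sub-expression per step, recording each
# intermediate expression, until only the final answer remains.
def cot_derivation(expr: str):
    steps = [expr]
    pattern = re.compile(r"\(([^()]+)\)")        # an innermost parenthesized pair
    while True:
        m = pattern.search(expr)
        if m is None:
            break
        value = eval(m.group(1))                 # toy setting: digits and operators only
        expr = expr[:m.start()] + str(value) + expr[m.end():]
        steps.append(expr)
    steps.append(str(eval(expr)))
    return steps

print("\n".join(cot_derivation("((2+3)*(4-1))+7")))
# ((2+3)*(4-1))+7 -> (5*(4-1))+7 -> (5*3)+7 -> 15+7 -> 22
```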