Margin Maximization in Attention Mechanism
Attention mechanism is a central component of the transformer architecture
which led to the phenomenal success of large language models. However, the
theoretical principles underlying the attention mechanism are poorly
understood, especially its nonconvex optimization dynamics. In this work, we
explore the seminal softmax-attention model $f(X) = \langle Xv,\, \mathrm{softmax}(XWp) \rangle$, where
$X$ is the token sequence and
$(v, W, p)$ are tunable parameters. We
prove that running gradient descent on $p$, or equivalently
$W$, converges in direction to a max-margin solution that
separates locally-optimal tokens from non-optimal ones. This clearly
formalizes attention as a token-separation mechanism. Remarkably, our results
are applicable to general data and precisely characterize the optimality
of tokens in terms of the value embeddings $Xv$ and problem
geometry. We also provide a broader regularization path analysis that
establishes the margin maximizing nature of attention even for nonlinear
prediction heads. When optimizing $v$ and $p$
simultaneously with logistic loss, we identify conditions under which the
regularization paths directionally converge to their respective hard-margin SVM
solutions, where $v$ separates the input features based on their
labels. Interestingly, the SVM formulation of $p$ is influenced by
the support vector geometry of $v$. Finally, we verify our
theoretical findings via numerical experiments and provide insights.
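As a concrete illustration, here is a minimal sketch (our own, not the authors' code) of the attention model above, training only $p$ with logistic loss; the data, dimensions, and learning rate are arbitrary assumptions:

```python
import torch

T, d = 6, 4
torch.manual_seed(0)
X = torch.randn(T, d)                    # token sequence (one training example)
y = torch.tensor(1.0)                    # binary label in {-1, +1}
v = torch.randn(d)                       # fixed prediction head / value weights
W = torch.eye(d)                         # fixed key-query matrix
p = torch.zeros(d, requires_grad=True)   # tunable attention parameter

opt = torch.optim.SGD([p], lr=0.5)
for _ in range(500):
    attn = torch.softmax(X @ W @ p, dim=0)       # attention over the T tokens
    f = torch.dot(X @ v, attn)                   # f(X) = <Xv, softmax(XWp)>
    loss = torch.nn.functional.softplus(-y * f)  # logistic loss log(1 + e^{-yf})
    opt.zero_grad(); loss.backward(); opt.step()

# As training proceeds, ||p|| grows and softmax(XWp) concentrates on the
# token(s) selected by the max-margin direction; per the paper, p/||p||
# converges in direction to the corresponding SVM-like separator.
print(torch.softmax(X @ W @ p, dim=0))
```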
Dissecting Chain-of-Thought: A Study on Compositional In-Context Learning of MLPs
Chain-of-thought (CoT) is a method that enables language models to handle
complex reasoning tasks by decomposing them into simpler steps. Despite its
success, the underlying mechanics of CoT are not yet fully understood. In an
attempt to shed light on this, our study investigates the impact of CoT on the
ability of transformers to in-context learn a simple-to-study, yet general
family of compositional functions: multi-layer perceptrons (MLPs). In this
setting, we reveal that the success of CoT can be attributed to breaking down
in-context learning of a compositional function into two distinct phases:
focusing on data related to each step of the composition and in-context
learning the single-step composition function. Through both experimental and
theoretical evidence, we demonstrate how CoT significantly reduces the sample
complexity of in-context learning (ICL) and facilitates the learning of complex
functions that non-CoT methods struggle with. Furthermore, we illustrate how
transformers can transition from vanilla in-context learning to mastering a
compositional function with CoT by simply incorporating an additional layer
that performs the necessary filtering for CoT via the attention mechanism. In
addition to these test-time benefits, we highlight how CoT accelerates
pretraining by learning shortcuts to represent complex functions and how
filtering plays an important role in pretraining. These findings collectively
provide insights into the mechanics of CoT, inviting further investigation of
its role in complex reasoning tasks.
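To make the two-phase picture concrete, here is a toy sketch (our construction, not the paper's experiments) where exposing the chain-of-thought intermediate turns learning a two-layer composition into two easy single-step problems; all names and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((2, 3))
f = lambda x: np.maximum(A @ x, 0.0)   # step 1: hidden MLP layer (ReLU)
g = lambda z: B @ z                    # step 2: output layer (linear)

X = rng.standard_normal((100, 3))
Z = np.array([f(x) for x in X])        # CoT exposes the intermediate f(x)
Y = np.array([g(z) for z in Z])        # final targets g(f(x))

# Non-CoT supervision gives only (X, Y): the learner must fit the full
# composition g∘f at once. CoT supervision gives (X, Z, Y): each phase is
# a single-step fit. E.g., the linear step is recovered exactly from (Z, Y):
B_hat = np.linalg.lstsq(Z, Y, rcond=None)[0].T
print(np.allclose(B_hat, B))           # True: one easy step instead of g∘f
```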
Addressing Variable Dependency in GNN-based SAT Solving
The Boolean satisfiability problem (SAT) is fundamental to many applications.
Existing works have used graph neural networks (GNNs) for (approximate) SAT
solving. Typical GNN-based end-to-end SAT solvers predict SAT solutions
concurrently. We show that for a group of symmetric SAT problems, the
concurrent prediction is guaranteed to produce a wrong answer because it
neglects the dependency among Boolean variables in SAT problems. We propose
AsymSAT, a GNN-based architecture which integrates recurrent neural networks to
generate dependent predictions for variable assignments. The experimental results
show that dependent variable prediction extends the solving capability of the
GNN-based method, as it improves the number of solved SAT instances on large
test sets.
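The symmetry argument can be checked on a two-variable instance; the snippet below (ours, for illustration) shows why a predictor that assigns symmetric variables identical values must fail, and how dependent, sequential decoding in the spirit of AsymSAT escapes this:

```python
from itertools import product

# Symmetric instance: (x1 OR x2) AND (NOT x1 OR NOT x2), i.e. x1 XOR x2.
# Swapping x1 and x2 maps the formula to itself, so a permutation-
# equivariant GNN predicting all variables concurrently must set x1 == x2.
def sat(x1, x2):
    return (x1 or x2) and (not x1 or not x2)

print([a for a in product([False, True], repeat=2) if sat(*a)])
# -> [(False, True), (True, False)]: every solution needs x1 != x2,
# so concurrent prediction is guaranteed wrong on this instance.

# A dependent (sequential) decoder conditions later predictions on
# earlier choices, breaking the symmetry:
x1 = False          # decide x1 first (either choice works)
x2 = not x1         # x2 is predicted given x1
print(sat(x1, x2))  # True
```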
Mechanics of Next Token Prediction with Self-Attention
Transformer-based language models are trained on large datasets to predict
the next token given an input sequence. Despite this simple training objective,
they have led to revolutionary advances in natural language processing.
Underlying this success is the self-attention mechanism. In this work, we ask:
what does a single self-attention layer learn from next-token prediction? We show that training
self-attention with gradient descent learns an automaton which generates the
next token in two distinct steps:
(1) Hard retrieval: given the input sequence, self-attention precisely selects
the high-priority input tokens associated with
the last input token. (2) Soft composition: it
then creates a convex combination of the high-priority tokens from which the
next token can be sampled. Under suitable conditions, we rigorously
characterize these mechanics through a directed graph over tokens extracted
from the training data. We prove that gradient descent implicitly discovers the
strongly-connected components (SCC) of this graph and self-attention learns to
retrieve the tokens that belong to the highest-priority SCC available in the
context window. Our theory relies on decomposing the model weights into a
directional component and a finite component that correspond to hard retrieval
and soft composition steps respectively. This also formalizes a related
implicit bias formula conjectured in [Tarzanagh et al. 2023]. We hope that
these findings shed light on how self-attention processes sequential data and
pave the path toward demystifying more complex architectures.
Comment: Accepted to AISTATS 2024
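The token graph from this abstract is easy to build; the sketch below (our illustration, with a made-up toy corpus) constructs the "w follows u" digraph and reads off its strongly-connected components:

```python
import networkx as nx

corpus = ["a b c a", "b c a d", "d e d"]   # toy training sequences (assumed)
G = nx.DiGraph()
for seq in corpus:
    toks = seq.split()
    G.add_edges_from(zip(toks, toks[1:]))  # edge u -> w whenever w follows u

print(list(nx.strongly_connected_components(G)))
# e.g. [{'a', 'b', 'c'}, {'d', 'e'}]; per the paper, self-attention learns to
# retrieve tokens from the highest-priority SCC available in the context window.
```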
LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching
The recent advancements in text-to-3D generation mark a significant milestone
in generative models, unlocking new possibilities for creating imaginative 3D
assets across various real-world scenarios. While these methods
have shown promise, they often fall short in rendering
detailed and high-quality 3D models. This problem is especially prevalent as
many methods rely on Score Distillation Sampling (SDS). This paper
identifies a notable deficiency of SDS: it yields inconsistent and
low-quality update directions for the 3D model, causing an over-smoothing
effect. To address this, we propose a novel approach called Interval Score
Matching (ISM). ISM employs deterministic diffusing trajectories and utilizes
interval-based score matching to counteract over-smoothing. Furthermore, we
incorporate 3D Gaussian Splatting into our text-to-3D generation pipeline.
Extensive experiments show that our model largely outperforms the
state-of-the-art in quality and training efficiency.
Comment: The first two authors contributed equally to this work. Our code will be available at: https://github.com/EnVision-Research/LucidDreamer
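For orientation, the SDS update that the paper diagnoses can be written as below (the SDS form is from DreamFusion; the ISM line is our hedged paraphrase of the interval idea, not necessarily the paper's exact formulation):

```latex
% SDS update direction for 3D parameters \theta (DreamFusion):
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\Big[ w(t)\,
      \big(\epsilon_\phi(x_t;\, y, t) - \epsilon\big)\,
      \tfrac{\partial x}{\partial \theta} \Big]
% ISM (roughly): swap the noisy residual \epsilon_\phi(x_t) - \epsilon for a
% score difference over an interval [s, t] along a deterministic,
% DDIM-inverted trajectory, \epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_s; y, s),
% removing the high-variance \epsilon term blamed for over-smoothing.
```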
Analysis on the response of the dip slope with weak layer to earthquake
Taking the dip slope with weak strata in the south of the Fushun west open-pit mine as the reference prototype, numerical simulations of a dip slope with weak strata were conducted in the FLAC3D software, covering the selection and input of actual ground motions, the boundary conditions of the slope model, the rock mass parameters, and the mesh generation. The response of the dip slope with weak strata under an earthquake was investigated by analyzing the acceleration and velocity at the monitoring points. The results revealed that: (1) the thickness of the weak layer is a critical factor affecting the response characteristics of a slope with a single weak layer under an earthquake, and it has a greater impact on the stability of the slope under earthquake load than the dip angle of a single weak layer; (2) based on the horizontal velocity of monitoring point 2#, at the intersection of the weak layer and the slope surface, the thickness has a significant influence on the velocity in the X direction; (3) for a slope with two weak layers, the failure response at the intersections of the weak layers and the slope surface changes as the slope height increases, and the acceleration amplitude and velocity change at monitoring point 3# are more noticeable than at monitoring point 2#. The response of the dip slope under an earthquake is related to the dip angle, thickness, number, and location of the weak layers. Therefore, the coupled effect of earthquake loading and weak-layer characteristics on slope stability should be thoroughly considered in slope treatment and protection
Mechanistic study of visible light-driven CdS or g-C3N4-catalyzed C–H direct trifluoromethylation of (hetero)arenes using CF3SO2Na as the trifluoromethyl source
Mild and sustainable methods for the direct C–H trifluoromethylation of (hetero)arenes without any base or strong oxidant are in extremely high demand. Here, we report that the photo-generated electron-hole pairs of classical semiconductors (CdS or g-C3N4) under visible light excitation can effectively drive the C–H trifluoromethylation of (hetero)arenes with stable and inexpensive CF3SO2Na as the trifluoromethyl (TFM) source via a radical pathway. Reactions mediated by either CdS or g-C3N4 can efficiently transform CF3SO2Na into the •CF3 radical and further afford the desired benzotrifluoride derivatives in moderate to good yields. After the visible light-initiated photocatalytic process, the key elements (such as F, S and C) derived from the starting TFM source CF3SO2Na exhibited different chemical forms compared with those in other oxidative reactions. The photogenerated electrons were trapped by O2 chemisorbed on the photocatalysts to form the superoxide radical anion (O2•−), which further attacks the •CF3 radical to generate the inorganic products F− and CO2. This resulted in a low utilization efficiency of •CF3 (<50%). When nitroaromatic compounds and CF3SO2Na served as the starting materials under an inert atmosphere, the photoexcited electrons could instead be directed to reduce the nitro group to an amino group rather than being trapped by O2, while the photogenerated holes oxidized CF3SO2− into •CF3. The photogenerated electrons and holes were thus engaged in reductive and oxidative paths, respectively, and the desired product, a trifluoromethylated aniline, was obtained successfully via one-pot free-radical synthesis.
Establishment of a viable cell detection system for microorganisms in wine based on ethidium monoazide and quantitative PCR
Fermentability and the contamination level of wine can be assessed through the detection of viable fermentation-related and spoilage-related microorganisms. Ethidium monoazide in combination with quantitative PCR (EMA-qPCR) has been considered a promising method to enumerate viable cells. Milling for 80 s with Ø 500 μm glass beads is demonstrated to be optimal for extracting DNA from yeasts, lactic acid bacteria (LAB) and acetic acid bacteria (AAB) in wine for use as a PCR template. EMA-qPCR results from experiments using DNA extracted by this method correlate well with the results of a plating assay (R² > 0.99), and a PCR efficiency between 96% and 105% was obtained. Moreover, for all of these microorganisms, EMA treatment of pure cultures at a low concentration (10 μg/mL) with 20 min of photoactivation effectively differentiated viable from non-viable cells and had no effect on viable cells. Due to sublethal injury to some cells, underestimation of cell counts was found in most of the wine samples tested with the EMA-qPCR method, and a 40-min incubation in recovery medium could completely offset this error. Our results suggest an optimal glass-bead DNA extraction method and EMA treatment suitable for all of the main microorganisms in wine. The EMA-qPCR method was successfully applied to quantify yeasts, Saccharomyces cerevisiae (S. cerevisiae), LAB, non-Oenococcus oeni LAB (non-O. oeni LAB) and AAB in wine samples.
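As a side note on the reported 96-105% range: qPCR amplification efficiency is conventionally computed from the standard-curve slope, which the short check below (ours, with illustrative slopes not taken from the paper) reproduces:

```python
# Efficiency from the standard-curve slope (Cq vs. log10 template amount):
# E = 10**(-1/slope) - 1; a slope of about -3.32 corresponds to E = 100%.
for slope in (-3.42, -3.21):            # illustrative slopes (assumed)
    eff = 10 ** (-1 / slope) - 1
    print(f"slope {slope}: efficiency {100 * eff:.0f}%")
# -> ~96% and ~105%, bracketing the range reported in the abstract.
```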