
    Margin Maximization in Attention Mechanism

    Full text link
    The attention mechanism is a central component of the transformer architecture, which led to the phenomenal success of large language models. However, the theoretical principles underlying the attention mechanism are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$, where $\boldsymbol{X}$ is the token sequence and $(\boldsymbol{v},\boldsymbol{W},\boldsymbol{p})$ are tunable parameters. We prove that running gradient descent on $\boldsymbol{p}$, or equivalently on $\boldsymbol{W}$, converges in direction to a max-margin solution that separates locally-optimal tokens from non-optimal ones. This clearly formalizes attention as a token-separation mechanism. Remarkably, our results are applicable to general data and precisely characterize the optimality of tokens in terms of the value embeddings $\boldsymbol{Xv}$ and the problem geometry. We also provide a broader regularization-path analysis that establishes the margin-maximizing nature of attention even for nonlinear prediction heads. When optimizing $\boldsymbol{v}$ and $\boldsymbol{p}$ simultaneously with logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions, where $\boldsymbol{v}$ separates the input features based on their labels. Interestingly, the SVM formulation of $\boldsymbol{p}$ is influenced by the support-vector geometry of $\boldsymbol{v}$. Finally, we verify our theoretical findings via numerical experiments and provide insights.
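
    As a concrete reference, here is a minimal numerical sketch of the single-head softmax-attention model $f(\boldsymbol{X})=\langle \boldsymbol{Xv}, \texttt{softmax}(\boldsymbol{XWp})\rangle$ studied above (plain numpy; the shapes and random data are illustrative only).

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_score(X, v, W, p):
    """Softmax-attention model f(X) = <Xv, softmax(XWp)>.

    X : (T, d) token sequence, v : (d,) value weights,
    W : (d, d) key-query matrix, p : (d,) attention parameter.
    """
    values = X @ v               # per-token value scores, shape (T,)
    attn = softmax(X @ W @ p)    # attention weights over the T tokens
    return values @ attn         # convex combination of the value scores

# Toy usage with random data.
rng = np.random.default_rng(0)
T, d = 5, 3
X = rng.normal(size=(T, d))
v, p = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, d))
print(attention_score(X, v, W, p))
```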

    Dissecting Chain-of-Thought: A Study on Compositional In-Context Learning of MLPs

    Full text link
    Chain-of-thought (CoT) is a method that enables language models to handle complex reasoning tasks by decomposing them into simpler steps. Despite its success, the underlying mechanics of CoT are not yet fully understood. In an attempt to shed light on this, our study investigates the impact of CoT on the ability of transformers to in-context learn a simple-to-study, yet general, family of compositional functions: multi-layer perceptrons (MLPs). In this setting, we reveal that the success of CoT can be attributed to breaking down in-context learning of a compositional function into two distinct phases: focusing on the data related to each step of the composition and in-context learning the single-step composition function. Through both experimental and theoretical evidence, we demonstrate how CoT significantly reduces the sample complexity of in-context learning (ICL) and facilitates the learning of complex functions that non-CoT methods struggle with. Furthermore, we illustrate how transformers can transition from vanilla in-context learning to mastering a compositional function with CoT by simply incorporating an additional layer that performs the necessary filtering for CoT via the attention mechanism. In addition to these test-time benefits, we highlight how CoT accelerates pretraining by learning shortcuts to represent complex functions and how filtering plays an important role in pretraining. These findings collectively provide insights into the mechanics of CoT, inviting further investigation of its role in complex reasoning tasks.
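
    To make the setup concrete, the sketch below (plain numpy; the prompt construction is illustrative, not the paper's exact protocol) contrasts the data a transformer sees without CoT, where each in-context example is only an (input, output) pair, with CoT, where the intermediate activation of a 2-layer MLP is exposed as an extra step, splitting one compositional task into two single-step ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_examples = 4, 3, 8

# A random 2-layer MLP: y = W2 @ relu(W1 @ x).
W1 = rng.normal(size=(k, d))
W2 = rng.normal(size=(1, k))
relu = lambda z: np.maximum(z, 0)

def make_prompts():
    plain, cot = [], []
    for _ in range(n_examples):
        x = rng.normal(size=d)
        h = relu(W1 @ x)          # intermediate step of the composition
        y = W2 @ h
        plain.append((x, y))      # non-CoT: only (input, output) pairs
        cot.append((x, h, y))     # CoT: the hidden step is spelled out
    return plain, cot

plain_prompt, cot_prompt = make_prompts()
# With CoT the model can treat x -> h and h -> y as two single-step
# subproblems; without CoT it must learn the full composition at once.
print(len(plain_prompt), len(cot_prompt))
```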

    Addressing Variable Dependency in GNN-based SAT Solving

    Full text link
    The Boolean satisfiability problem (SAT) is fundamental to many applications. Existing works have used graph neural networks (GNNs) for (approximate) SAT solving. Typical GNN-based end-to-end SAT solvers predict SAT solutions concurrently. We show that for a group of symmetric SAT problems, concurrent prediction is guaranteed to produce a wrong answer because it neglects the dependency among Boolean variables in SAT problems. We propose AsymSAT, a GNN-based architecture which integrates recurrent neural networks to generate dependent predictions for variable assignments. The experimental results show that dependent variable prediction extends the solving capability of the GNN-based method, as it increases the number of solved SAT instances on large test sets.
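
    A hedged sketch of the general idea of dependent (sequential) assignment prediction, in PyTorch; the class and its wiring are illustrative and not AsymSAT's actual architecture: each variable's prediction is conditioned on the assignments already decoded, unlike concurrent per-variable prediction.

```python
import torch
import torch.nn as nn

class SequentialAssignmentHead(nn.Module):
    """Predict variable assignments one at a time, conditioning each
    prediction on the assignments already made (illustrative sketch only)."""
    def __init__(self, emb_dim, hidden_dim=64):
        super().__init__()
        # Input = variable embedding + previous assignment (as a scalar).
        self.rnn = nn.GRUCell(emb_dim + 1, hidden_dim)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, var_embs):                    # var_embs: (num_vars, emb_dim)
        h = var_embs.new_zeros(1, self.rnn.hidden_size)
        prev = var_embs.new_zeros(1, 1)
        assignments = []
        for emb in var_embs:                        # decode variables in order
            inp = torch.cat([emb.unsqueeze(0), prev], dim=1)
            h = self.rnn(inp, h)
            p = torch.sigmoid(self.out(h))          # P(variable = True)
            prev = (p > 0.5).float()                # feed the decision back in
            assignments.append(prev.squeeze())
        return torch.stack(assignments)

# Toy usage: embeddings as produced by some GNN over the SAT formula.
gnn_embeddings = torch.randn(5, 16)
print(SequentialAssignmentHead(emb_dim=16)(gnn_embeddings))
```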

    Mechanics of Next Token Prediction with Self-Attention

    Full text link
    Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: (1) Hard retrieval: Given the input sequence, self-attention precisely selects the high-priority input tokens associated with the last input token. (2) Soft composition: It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCCs) of this graph and that self-attention learns to retrieve the tokens belonging to the highest-priority SCC available in the context window. Our theory relies on decomposing the model weights into a directional component and a finite component that correspond to the hard-retrieval and soft-composition steps, respectively. This also formalizes a related implicit-bias formula conjectured in [Tarzanagh et al. 2023]. We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures. Comment: Accepted to AISTATS 2024.
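
    The graph construction below is an illustrative sketch (using networkx; the edge definition is an assumption, not necessarily the paper's exact one) of the kind of analysis involved: build a directed graph over tokens from consecutive pairs in the training sequences and compute its strongly connected components.

```python
import networkx as nx

# Toy training corpus of token sequences.
sequences = [
    ["a", "b", "c", "a"],
    ["b", "c", "a", "b"],
    ["c", "d", "e"],
]

# Illustrative construction: add a directed edge from each token to the
# token that follows it in a training sequence.
G = nx.DiGraph()
for seq in sequences:
    for cur, nxt in zip(seq, seq[1:]):
        G.add_edge(cur, nxt)

# Strongly connected components partition the token graph; the theory says
# attention ends up retrieving tokens from the highest-priority SCC
# available in the context window.
sccs = list(nx.strongly_connected_components(G))
print(sccs)   # e.g. [{'a', 'b', 'c'}, {'d'}, {'e'}]
```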

    LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

    Full text link
    Recent advancements in text-to-3D generation mark a significant milestone in generative models, unlocking new possibilities for creating imaginative 3D assets across various real-world scenarios. While these methods have shown promise, they often fall short of rendering detailed, high-quality 3D models. The problem is especially prevalent because many methods build on Score Distillation Sampling (SDS). This paper identifies a notable deficiency of SDS: it produces inconsistent and low-quality updating directions for the 3D model, causing an over-smoothing effect. To address this, we propose a novel approach called Interval Score Matching (ISM). ISM employs deterministic diffusing trajectories and utilizes interval-based score matching to counteract over-smoothing. Furthermore, we incorporate 3D Gaussian Splatting into our text-to-3D generation pipeline. Extensive experiments show that our model largely outperforms the state of the art in quality and training efficiency. Comment: The first two authors contributed equally to this work. Our code will be available at: https://github.com/EnVision-Research/LucidDreamer
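
    For context, here is a hedged pseudocode-style sketch of the standard SDS update direction whose noisiness the paper critiques (render, denoiser, and the weighting are placeholders; this is not the paper's ISM).

```python
import torch

def sds_update_direction(theta, render, denoiser, text_emb, alphas, sigmas):
    """Sketch of a Score Distillation Sampling step (placeholders throughout;
    this illustrates the update direction the paper critiques, not ISM)."""
    x = render(theta)                               # differentiable render of the 3D model
    t = torch.randint(1, len(alphas), (1,)).item()  # random diffusion timestep
    eps = torch.randn_like(x)                       # fresh noise each step -> inconsistent targets
    x_t = alphas[t] * x + sigmas[t] * eps           # noised rendering
    with torch.no_grad():
        eps_hat = denoiser(x_t, t, text_emb)        # text-conditioned noise prediction
    w = sigmas[t] ** 2                              # an illustrative weighting choice
    # SDS pushes (eps_hat - eps) back through the renderer as the gradient on theta.
    x.backward(gradient=w * (eps_hat - eps))
    return theta.grad
```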

    Analysis on the response of the dip slope with weak layer to earthquake

    Get PDF
    Taking the dip slope with weak strata in the south of the Fushun west open-pit mine as the reference prototype, numerical simulations of the dip slope with weak strata were conducted in the FLAC3D software, covering the simulation of actual ground motion and ground-motion input, the boundary conditions of the slope model, rock-mass parameters, and grid-model division. The seismic response of the dip slope with weak strata was investigated by analyzing the acceleration and velocity at the monitoring points. The results revealed that: (1) The thickness of the weak layer is a critical factor affecting the response characteristics of a slope with a single weak layer under an earthquake, and it has a greater impact on slope stability under earthquake load than the dip angle of a single weak layer. (2) Based on the horizontal velocity of monitoring point 2# at the intersection of the weak layer and the slope surface, the layer thickness has a significant influence on the velocity in the X direction. (3) For the slope with two weak layers, the failure response at the intersection of the weak layers and the slope surface changes as the slope height increases; the changes in acceleration amplitude and velocity at monitoring point 3#, with two weak layers, are more noticeable than at monitoring point 2#. The seismic response of the dip slope is related to the dip angle, thickness, number, and location of the weak layers. Therefore, the coupled effect of earthquake and weak-layer characteristics on slope stability should be thoroughly considered in slope treatment and protection.

    Mechanistic study of visible light-driven CdS or g-C3N4-catalyzed C–H direct trifluoromethylation of (hetero)arenes using CF3SO2Na as the trifluoromethyl source

    Get PDF
    Mild and sustainable methods for the C–H direct trifluoromethylation of (hetero)arenes without any base or strong oxidant are in extremely high demand. Here, we report that the photo-generated electron-hole pairs of classical semiconductors (CdS or g-C3N4) under visible-light excitation are effective in driving the C–H trifluoromethylation of (hetero)arenes with stable and inexpensive CF3SO2Na as the trifluoromethyl (TFM) source via a radical pathway. Either the CdS- or the g-C3N4-propagated reaction can efficiently transform CF3SO2Na into the •CF3 radical and further afford the desired benzotrifluoride derivatives in moderate to good yields. After the visible-light-initiated photocatalytic process, the key elements (such as F, S and C) derived from the starting TFM source, CF3SO2Na, exhibited different chemical forms compared to those in other oxidative reactions. The photogenerated electron was trapped by chemisorbed O2 on the photocatalysts to form the superoxide radical anion (O2•−), which further attacks the •CF3 radical to generate the inorganic products F− and CO2. This resulted in a low utilization efficiency of •CF3 (<50%). When nitroaromatic compounds and CF3SO2Na served as the starting materials in an inert atmosphere, the photoexcited electrons could be directed to reduce the nitro group to an amino group rather than being trapped by O2. Meanwhile, the photogenerated holes oxidize CF3SO2− into •CF3. The photogenerated electrons and holes were thus engaged in reductive and oxidative paths, respectively. The desired product, a trifluoromethylated aniline, was obtained successfully via one-pot free-radical synthesis.

    Establishment of a viable cell detection system for microorganisms in wine based on ethidium monoazide and quantitative PCR

    Get PDF
    Fermentability and the contamination level of wine can be assessed through the detection of viable fermentation-related and spoilage-related microorganisms. Ethidium monoazide in combination with quantitative PCR (EMA-qPCR) has been considered a promising method to enumerate viable cells. Milling for 80 s with Ø 500 μm glass beads is demonstrated to be optimal for extracting DNA from yeasts, lactic acid bacteria (LAB) and acetic acid bacteria (AAB) in wine for use as a PCR template. EMA-qPCR results from experiments using DNA extracted by this method correlate well with the results of a plating assay (R² > 0.99), and a PCR efficiency between 96% and 105% was obtained. Moreover, for all of these microorganisms, EMA treatment of pure cultures at a low concentration (10 μg/mL) with 20 min of photoactivation resulted in effective differentiation between viable and non-viable cells and had no effect on viable cells. Due to sublethal injury to some cells, underestimation of cell counts was found in most of the wine samples tested using the EMA-qPCR method, and a 40-min incubation in recovery medium could completely offset this error. Our results suggest an optimal glass-bead DNA extraction method and EMA treatment suitable for all of the main microorganisms in wine. The EMA-qPCR method was successfully applied to quantify yeasts, Saccharomyces cerevisiae (S. cerevisiae), LAB, non-Oenococcus oeni LAB (non-O. oeni LAB), and AAB in wine samples.
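
    As a brief aside on how such efficiency figures are commonly derived (the standard-curve approach; the numbers below are hypothetical and not from this study), PCR efficiency is typically computed from the slope of Cq versus log10 template amount as E = 10^(-1/slope) - 1.

```python
import numpy as np

# Hypothetical standard-curve data: 10-fold serial dilutions and the
# quantification cycles (Cq) measured for each (illustrative values only).
log10_copies = np.array([6, 5, 4, 3, 2])
cq = np.array([15.1, 18.5, 21.8, 25.2, 28.6])

# Linear fit of Cq against log10(copies); the slope gives the efficiency.
slope, intercept = np.polyfit(log10_copies, cq, 1)
efficiency = 10 ** (-1 / slope) - 1   # 100% corresponds to a slope of about -3.32

print(f"slope = {slope:.2f}, efficiency = {efficiency:.1%}")
```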