Bit Cipher -- A Simple yet Powerful Word Representation System that Integrates Efficiently with Language Models
While Large Language Models (LLMs) become ever more dominant, classic
pre-trained word embeddings sustain their relevance through computational
efficiency and nuanced linguistic interpretation. Drawing on recent studies
demonstrating that GloVe and word2vec optimizations both converge towards
log-co-occurrence matrix variants, we construct a novel word
representation system called Bit-cipher that eliminates the need for
backpropagation while leveraging contextual information and hyper-efficient
dimensionality reduction techniques based on unigram frequency, providing
strong interpretability alongside efficiency. We use the bit-cipher algorithm
to train word vectors via a two-step process that critically relies on a
hyperparameter -- bits -- that controls the vector dimension. While the first
step trains the bit-cipher, the second utilizes it under two different
aggregation modes -- summation or concatenation -- to produce contextually rich
representations from word co-occurrences. We extend our investigation into
bit-cipher's efficacy, performing probing experiments on part-of-speech (POS)
tagging and named entity recognition (NER) to assess its competitiveness with
classic embeddings like word2vec and GloVe. Additionally, we explore its
applicability in LM training and fine-tuning. By replacing embedding layers
with cipher embeddings, our experiments illustrate the notable efficiency of
cipher in accelerating the training process and attaining better optima
compared to conventional training paradigms. Experiments on the integration of
bit-cipher embedding layers with RoBERTa, T5, and OPT, prior to or as a
substitute for fine-tuning, showcase a promising enhancement to transfer
learning, allowing rapid model convergence while preserving competitive
performance.
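The two-step process above can be illustrated with a minimal sketch. Everything here is hypothetical scaffolding rather than the authors' exact scheme: we assume a cipher code is a `bits`-length binary vector derived from a word's unigram-frequency rank, and that the contextual representation aggregates the codes of co-occurring words by summation (fixed dimension) or concatenation (window-scaled dimension), as the abstract describes.

```python
def bit_cipher_codes(vocab_by_freq, bits):
    """Hypothetical sketch: assign each word a `bits`-length binary code
    derived from its unigram-frequency rank (not the authors' exact scheme)."""
    return {word: [(rank >> b) & 1 for b in range(bits)]
            for rank, word in enumerate(vocab_by_freq)}

def contextual_vector(tokens, i, codes, window=2, mode="sum"):
    """Aggregate the cipher codes of words co-occurring with tokens[i],
    under the two aggregation modes named in the abstract:
    summation (keeps dimension = bits) or concatenation."""
    neighbors = [tokens[j] for j in range(max(0, i - window),
                                          min(len(tokens), i + window + 1))
                 if j != i and tokens[j] in codes]
    vecs = [codes[w] for w in neighbors]
    if mode == "sum":
        bits = len(next(iter(codes.values())))
        return [sum(v[b] for v in vecs) for b in range(bits)]
    return [x for v in vecs for x in v]  # mode == "concat"
```

Note that neither step requires gradient updates, which is the source of the efficiency claim; the `bits` hyperparameter directly controls the vector dimension.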
Reducing the Need for Backpropagation and Discovering Better Optima With Explicit Optimizations of Neural Networks
Iterative differential approximation methods that rely upon backpropagation
have enabled the optimization of neural networks; however, at present, they
remain computationally expensive, especially when training models at scale. In
this paper, we propose a computationally efficient alternative for optimizing
neural networks that can both reduce the costs of scaling neural networks and
provide high-efficiency optimizations for low-resource applications. We derive
an explicit solution to a simple feed-forward language model (LM) by
mathematically analyzing its gradients. This solution generalizes from
single-layer LMs to the class of all single-layer feed-forward
softmax-activated neural models trained on positive-valued features, as is
demonstrated by our extension of this solution to MNIST digit
classification. For both LM and digit classifiers, we find computationally that
explicit solutions perform near-optimally in experiments showing that 1)
iterative optimization only marginally improves the explicit solution
parameters and 2) randomly initialized parameters iteratively optimize towards
the explicit solution. We also preliminarily apply the explicit solution
locally by layer in multi-layer networks and discuss how the solution's
computational savings increase with model complexity -- for both single- and
multi-layer applications of the explicit solution, we emphasize that the optima
achieved cannot be reached by backpropagation alone, i.e., better optima appear
discoverable only after explicit solutions are applied. Finally, we discuss the
solution's computational savings alongside its impact on model interpretability
and suggest future directions for the derivation of explicit solutions to
complex and multi-layer architectures.
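To make the idea concrete, here is a hedged toy sketch of a backprop-free fit for a single-layer softmax model over positive-valued features. The closed form below (weights as smoothed log feature-class co-occurrence rates) is one plausible reading of such an explicit solution; the paper's exact derivation and normalization may differ, and all function names are hypothetical.

```python
import math

def explicit_softmax_weights(X, y, n_classes, alpha=1.0):
    """Hypothetical closed-form fit: set each class's weight vector to
    smoothed log co-occurrence rates of features with that class, with
    no iterative gradient updates. A sketch, not the paper's derivation."""
    n_feats = len(X[0])
    counts = [[alpha] * n_feats for _ in range(n_classes)]  # additive smoothing
    for xi, yi in zip(X, y):
        for f, v in enumerate(xi):
            counts[yi][f] += v
    W = []
    for c in range(n_classes):
        total = sum(counts[c])
        W.append([math.log(counts[c][f] / total) for f in range(n_feats)])
    return W

def predict(W, x):
    """Argmax over linear scores, matching softmax's decision rule."""
    scores = [sum(w_f * v for w_f, v in zip(Wc, x)) for Wc in W]
    return scores.index(max(scores))
```

Under this reading, such a closed form could also serve as the warm start for further iterative refinement, consistent with the abstract's claim that randomly initialized parameters iteratively optimize towards the explicit solution.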
Explicit Foundation Model Optimization with Self-Attentive Feed-Forward Neural Units
Iterative approximation methods using backpropagation enable the optimization
of neural networks, but they remain computationally expensive, especially when
used at scale. This paper presents an efficient alternative for optimizing
neural networks that reduces the costs of scaling neural networks and provides
high-efficiency optimizations for low-resource applications. We will discuss a
general result about feed-forward neural networks and then extend this solution
to compositional (multi-layer) networks, which are applied to a simplified
transformer block containing feed-forward and self-attention layers. These
models are used to train highly-specified and complex multi-layer neural
architectures that we refer to as self-attentive feed-forward unit (SAFFU)
layers, which we use to develop a transformer that appears to generalize well
over small, cognitively-feasible volumes of data. Testing demonstrates that
explicit solutions outperform models optimized by backpropagation alone.
Moreover, further application of backpropagation after explicit solutions leads
to better optima from smaller scales of data; explicit-solution warm starts thus
enable training effective models from much less data. We then carry out
ablation experiments training a roadmap of about 250 transformer models over
1 million tokens to determine ideal settings. We find that multiple different
architectural variants produce highly-performant models, and discover from this
ablation that some of the best are not the most parameterized. This appears to
indicate that well-generalized models can be reached with less data via
explicit solutions, and that architectural exploration using explicit solutions
pays dividends in guiding the search for efficient variants with fewer
parameters, which could be incorporated into low-resource hardware where AI
might be embodied.
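The simplified block combining feed-forward and self-attention layers can be sketched minimally as follows. This is a generic single-head attention followed by a ReLU feed-forward map, not the authors' SAFFU specification; all weight names and the composition order are assumptions for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def attention_ff_block(X, Wq, Wk, Wv, Wff):
    """Hypothetical minimal block pairing self-attention with a feed-forward
    layer (a sketch, not the SAFFU definition): each position attends over
    all positions, then passes through a ReLU feed-forward map."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    d = len(Q[0])
    out = []
    for q in Q:
        # Scaled dot-product attention weights over all positions.
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                          for k in K])
        ctx = [sum(w * v[j] for w, v in zip(scores, V))
               for j in range(len(V[0]))]
        out.append([max(0.0, h) for h in matvec(Wff, ctx)])  # ReLU feed-forward
    return out
```

An explicit solution in this setting would fit the block's weight matrices directly from data statistics, layer by layer, with backpropagation applied only afterwards as refinement.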
Lexical mechanics: Partitions, mixtures, and context
Highly structured for efficient communication, natural languages are complex systems. Unlike in their computational cousins, functions and meanings in natural languages are relative, frequently prescribed to symbols through unexpected social processes. Despite grammar and definition, the presence of metaphor can leave unwitting language users in the dark, so to speak. This is not problematic, but rather an important operational feature of languages, since the lifting of meaning onto higher-order structures allows individuals to compress descriptions of regularly-conveyed information. This compressed terminology, often only appropriate when taken locally (in context), is beneficial in an enormous world of novel experience. However, what is natural for a human to process can be tremendously difficult for a computer. When a sequence of words (a phrase) is to be taken as a unit, suppose the choice of words in the phrase is subordinate to the choice of the phrase, i.e., there exists an inter-word dependence owed to membership within a common phrase. This word selection process is not one of independent selection, and so is capable of generating word-frequency distributions that are not accessible via independent selection processes. We have shown in Ch. 2 through analysis of thousands of English texts that empirical word-frequency distributions possess these word-dependence anomalies, while phrase-frequency distributions do not. In doing so, this study has also led to the development of a novel, general, and mathematical framework for the generation of frequency data for phrases, opening up the field of mass-preserving mesoscopic lexical analyses. A common oversight in many studies of the generation and interpretation of language is the assumption that separate discourses are independent. However, even when separate texts are each produced by means of independent word selection, it is possible for their composite distribution of words to exhibit dependence. 
Succinctly, different texts may use a common word or phrase for different meanings, and so exhibit disproportionate usages when juxtaposed. To support this theory, we have shown in Ch. 3 that the act of combining distinct texts to form large 'corpora' results in word-dependence irregularities. This not only settles a 15-year discussion, challenging the current major theory, but also highlights an important practice necessary for successful computational analysis---the retention of meaningful separations in language. We must also consider how language speakers and listeners navigate such a combinatorially vast space for meaning. Dictionaries (or, the collective editorial communities behind them) are smart. They know all about the lexical objects they define, but we ask about the latent information they hold, or should hold, about related, undefined objects. Based solely on the text as data, in Ch. 4 we build on our result in Ch. 2 and develop a model of context defined by the structural similarities of phrases. We then apply this model to define measures of meaning in a corpus-guided experiment, computationally detecting entries missing from a massive, collaborative online dictionary known as the Wiktionary.
Text mixing shapes the anatomy of rank-frequency distributions
Natural languages are full of rules and exceptions. One of the most famous quantitative rules is Zipf's law, which states that the frequency of occurrence of a word is approximately inversely proportional to its rank. Though this law of ranks has been found to hold across disparate texts and forms of data, analyses of increasingly large corpora since the late 1990s have revealed the existence of two scaling regimes. These regimes have thus far been explained by a hypothesis suggesting a separability of languages into core and noncore lexica. Here we present and defend an alternative hypothesis that the two scaling regimes result from the act of aggregating texts. We observe that text mixing leads to an effective decay of word introduction, which we show provides accurate predictions of the location and severity of breaks in scaling. Upon examining large corpora from 10 languages in the Project Gutenberg eBooks collection, we find emphatic empirical support for the universality of our claim.
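The rank-frequency relation at the heart of this analysis is easy to compute. The toy sketch below counts word frequencies, ranks them, and estimates the Zipf exponent as the log-log regression slope; a pure Zipf distribution gives a slope near -1, while the mixed corpora studied here would show the break between two scaling regimes rather than a single slope.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Rank-frequency pairs: the most frequent word gets rank 1."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))

def zipf_exponent(rf):
    """Least-squares slope of log-frequency vs. log-rank.
    A toy illustration only: real aggregated corpora exhibit the
    two scaling regimes discussed above, not a single exponent."""
    xs = [math.log(r) for r, _ in rf]
    ys = [math.log(f) for _, f in rf]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

Applying this per-text versus over a concatenated corpus is exactly the kind of comparison the text-mixing hypothesis turns on: the aggregate shows a break in scaling that the individual texts lack.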