Dynamic Masking Rate Schedules for MLM Pretraining
Most works on transformers trained with the Masked Language Modeling (MLM)
objective use the original BERT model's fixed masking rate of 15%. Our work
instead dynamically schedules the masking ratio throughout training. We found
that linearly decreasing the masking rate from 30% to 15% over the course of
pretraining improves average GLUE accuracy by 0.46% in BERT-base, compared to a
standard 15% fixed rate. Further analyses demonstrate that the gains from
scheduling come from being exposed to both high and low masking rate regimes.
Our results demonstrate that masking rate scheduling is a simple way to improve
the quality of masked language models and achieve up to a 1.89x speedup in
pretraining.
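The linear schedule described in this abstract (30% down to 15% over pretraining) can be illustrated with a minimal sketch. The function name `linear_masking_rate`, its parameters, and the per-step update granularity are assumptions made for illustration; the abstract does not specify how the authors implemented the schedule.

```python
def linear_masking_rate(step: int, total_steps: int,
                        start_rate: float = 0.30,
                        end_rate: float = 0.15) -> float:
    """Linearly interpolate the masking rate from start_rate to end_rate.

    Hypothetical helper: the 30% and 15% endpoints come from the abstract,
    everything else (names, per-step updates) is an assumption.
    """
    if total_steps <= 1:
        return end_rate
    # Clamp progress to [0, 1] so steps past the end keep the final rate.
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start_rate + (end_rate - start_rate) * progress


if __name__ == "__main__":
    total = 10_000
    for step in (0, 2_500, 5_000, 7_500, total - 1):
        print(step, round(linear_masking_rate(step, total), 4))
```

In a training loop, the returned rate would be passed to whatever routine selects tokens to mask for the current batch.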
Reduce, Reuse, Recycle: Improving Training Efficiency with Distillation
Methods for improving the efficiency of deep network training (i.e. the
resources required to achieve a given level of model quality) are of immediate
benefit to deep learning practitioners. Distillation is typically used to
compress models or improve model quality, but it's unclear if distillation
actually improves training efficiency. Can the quality improvements of
distillation be converted into training speed-ups, or do they simply increase
final model quality with no resource savings? We conducted a series of
experiments to investigate whether and how distillation can be used to
accelerate training using ResNet-50 trained on ImageNet and BERT trained on C4
with a masked language modeling objective and evaluated on GLUE, using common
enterprise hardware (8x NVIDIA A100). We found that distillation can speed up
training by up to 1.96x in ResNet-50 trained on ImageNet and up to 1.42x on
BERT when evaluated on GLUE. Furthermore, distillation for BERT yields optimal
results when it is only performed for the first 20-50% of training. We also
observed that training with distillation is almost always more efficient than
training without distillation, even when using the poorest-quality model as a
teacher, in both ResNet-50 and BERT. Finally, we found that it's possible to
gain the benefit of distilling from an ensemble of teacher models, which has
O(n) runtime cost, by randomly sampling a single teacher from the pool of
teacher models on each step, which has only an O(1) runtime cost. Taken
together, these results show that distillation can substantially improve
training efficiency in both image classification and language modeling, and
that a few simple optimizations to distillation protocols can further enhance
these efficiency improvements.
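Two of the protocol optimizations mentioned in this abstract, sampling one teacher per step and distilling only during the early part of training, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the KL-divergence loss, the temperature parameter, and all function names (`sampled_teacher_loss`, `distillation_active`) are assumptions, and teachers are modelled as plain callables that return logits.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sampled_teacher_loss(student_logits, teachers, inputs, rng, temperature=1.0):
    """Distill from ONE randomly chosen teacher instead of the whole ensemble.

    Only a single teacher forward pass runs per step, so the per-step cost
    does not grow with the number of teachers in the pool.
    """
    teacher = rng.choice(teachers)
    teacher_probs = softmax(teacher(inputs), temperature)
    student_probs = softmax(student_logits, temperature)
    return kl_divergence(teacher_probs, student_probs)


def distillation_active(step, total_steps, fraction=0.5):
    """True while training is inside the early distillation window.

    The abstract reports 20-50% of training as the best window for BERT;
    the default of 0.5 here is just one point in that range.
    """
    return step < fraction * total_steps


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical teachers: fixed logits stand in for real model outputs.
    teachers = [lambda x: [1.0, 2.0, 3.0], lambda x: [3.0, 2.0, 1.0]]
    for step in range(4):
        if distillation_active(step, total_steps=4, fraction=0.5):
            print(step, sampled_teacher_loss([0.0, 0.0, 0.0], teachers, None, rng))
        else:
            print(step, "task loss only")
```

In practice the distillation term would be added to the task loss while the window is active and dropped afterwards.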
Sudden Drops in the Loss: Syntax Acquisition, Phase Transitions, and Simplicity Bias in MLMs
Most interpretability research in NLP focuses on understanding the behavior
and features of a fully trained model. However, certain insights into model
behavior may only be accessible by observing the trajectory of the training
process. We present a case study of syntax acquisition in masked language
models (MLMs) that demonstrates how analyzing the evolution of interpretable
artifacts throughout training deepens our understanding of emergent behavior.
In particular, we study Syntactic Attention Structure (SAS), a naturally
emerging property of MLMs wherein specific Transformer heads tend to focus on
specific syntactic relations. We identify a brief window in pretraining when
models abruptly acquire SAS, concurrent with a steep drop in loss. This
breakthrough precipitates the subsequent acquisition of linguistic
capabilities. We then examine the causal role of SAS by manipulating SAS during
training, and demonstrate that SAS is necessary for the development of
grammatical capabilities. We further find that SAS competes with other
beneficial traits during training, and that briefly suppressing SAS improves
model quality. These findings offer an interpretation of a real-world example
of both simplicity bias and breakthrough training dynamics.
Comment: ICLR 2024 camera-ready
Hemodialysis Graft with Blind Loop Inflow Segment Treated with Stent Placement
Peer Reviewed
http://deepblue.lib.umich.edu/bitstream/2027.42/74924/1/j.1525-139X.2008.00460.x.pd
Multiple, distinct intercontinental lineages but isolation of Australian populations in a cosmopolitan lichen-forming Fungal Taxon, Psora decipiens (Psoraceae, Ascomycota)
Multiple drivers shape the spatial distribution of species, including dispersal capacity, niche incumbency, climate variability, orographic barriers, and plate tectonics. However, biogeographic patterns of fungi commonly do not fit conventional expectations based on studies of animals and plants. Fungi, in general, are known to occur across exceedingly broad, intercontinental distributions, including some important components of biological soil crust communities (BSCs). However, molecular data often reveal unexpected biogeographic patterns in lichenized fungal species that are assumed to have cosmopolitan distributions. The lichen-forming fungal species Psora decipiens is found on all continents except Antarctica and occurs in BSCs across diverse habitats, ranging from hot, arid deserts to alpine habitats. In order to better understand factors that shape population structure in cosmopolitan lichen-forming fungal species, we investigated biogeographic patterns in the cosmopolitan taxon P. decipiens, along with the closely related taxa P. crenata and P. saviczii. We generated a multi-locus sequence dataset based on a worldwide sampling of these taxa in order to reconstruct evolutionary relationships and explore phylogeographic patterns. Neither P. crenata nor P. decipiens was recovered as monophyletic, and P. saviczii specimens were recovered as a monophyletic clade closely related to a number of lineages comprised of specimens representing P. decipiens. Striking phylogeographic patterns were observed for P. crenata, with populations from distinct geographic regions belonging to well-separated, monophyletic lineages. South African populations of P. crenata were further divided into well-supported sub-clades. While well-supported phylogenetic substructure was also observed for the nominal taxon P. decipiens, nearly all lineages were comprised of specimens collected from intercontinental populations. However, all Australian specimens representing P. decipiens were recovered within a single well-supported monophyletic clade consisting solely of Australian samples. Our study supports up to 10 candidate species-level lineages in P. decipiens, based on genealogical concordance and coalescent-based species delimitation analyses. Our results support the general pattern of the biogeographic isolation of lichen-forming fungal populations in Australia, even in cases where closely related congeners have documented intercontinental distributions. Our study has important implications for understanding factors influencing diversification and distributions of lichens associated with BSC.
This research was funded, in part, by a start-up grant from BYU College of Life Sciences to SL; MarW’s and MatW’s work was done within the European Soil Crust Project SCIN (Büdel et al., 2014) funded by the ERA-Net BiodivERsA program, with the national funder The Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning (FORMAS).
The varved succession of Crawford Lake, Milton, Ontario, Canada as a candidate Global boundary Stratotype Section and Point for the Anthropocene series
An annually laminated succession in Crawford Lake, Ontario, Canada is proposed as the Global boundary Stratotype Section and Point (GSSP) for the Anthropocene as a series/epoch with a base dated at 1950 CE. Varve couplets of organic matter capped by calcite precipitated each summer in alkaline surface waters reflect environmental change at global to local scales. Spheroidal carbonaceous particles and nitrogen isotopes record an increase in fossil fuel combustion in the early 1950s, coinciding with fallout from nuclear and thermonuclear testing: 239+240Pu and 14C:12C, the latter more than compensating for the effects of old carbon in this dolomitic basin. Rapid industrial expansion in the North American Great Lakes region led to enhanced leaching of terrigenous elements by acid precipitation during the Great Acceleration, and calcite precipitation was reduced, producing thin calcite laminae around the GSSP that is marked by a sharp decline in elm pollen (Dutch Elm disease). The lack of bioturbation in well-oxygenated bottom waters, supported by the absence of fossil pigments from obligately anaerobic purple sulfur bacteria, is attributed to elevated salinities and high alkalinity below the chemocline. This aerobic depositional environment, unusual in a meromictic lake, inhibits the mobilization of 239Pu, the proposed primary stratigraphic guide for the Anthropocene.