35 research outputs found
Contextualized Diachronic Word Representations
International audienceDiachronic word embeddings play a key role in capturing interesting patterns about how language evolves over time. Most of the existing work focuses on studying corpora spanning across several decades, which is understandably still not a possibility when working on social media-based user-generated content. In this work, we address the problem of studying semantic changes in a large Twitter corpus collected over five years, a much shorter period than what is usually the norm in di-achronic studies. We devise a novel attentional model, based on Bernoulli word embeddings, that are conditioned on contextual extra-linguistic (social) features such as network, spatial and socioeconomic variables, which are associated with Twitter users, as well as topic-based features. We posit that these social features provide an inductive bias that helps our model to overcome the narrow time-span regime problem. Our extensive experiments reveal that our proposed model is able to capture subtle semantic shifts without being biased towards frequency cues and also works well when certain con-textual features are absent. Our model fits the data better than current state-of-the-art dynamic word embedding models and therefore is a promising tool to study diachronic semantic changes over small time periods
What does BERT learn about the structure of language?
International audienceBERT is a recent language representation model that has surprisingly performed well in diverse language understanding benchmarks. This result indicates the possibility that BERT networks capture structural information about language. In this work, we provide novel support for this claim by performing a series of experiments to unpack the elements of English language structure learned by BERT. We first show that BERT's phrasal representation captures phrase-level information in the lower layers. We also show that BERT's intermediate layers encode a rich hierarchy of linguistic information, with surface features at the bottom, syntactic features in the middle and semantic features at the top. BERT turns out to require deeper layers when long-distance dependency information is required, e.g.~to track subject-verb agreement. Finally, we show that BERT representations capture linguistic information in a compositional way that mimics classical, tree-like structures
LLM Performance Predictors are good initializers for Architecture Search
Large language models (LLMs) have become an integral component in solving a
wide range of NLP tasks. In this work, we explore a novel use case of using
LLMs to build performance predictors (PP): models that, given a specific deep
neural network architecture, predict its performance on a downstream task. We
design PP prompts for LLMs consisting of: (i) role: description of the role
assigned to the LLM, (ii) instructions: set of instructions to be followed by
the LLM to carry out performance prediction, (iii) hyperparameters: a
definition of each architecture-specific hyperparameter and (iv)
demonstrations: sample architectures along with their efficiency metrics and
'training from scratch' performance. For machine translation (MT) tasks, we
discover that GPT-4 with our PP prompts (LLM-PP) can predict the performance of
architecture with a mean absolute error matching the SOTA and a marginal
degradation in rank correlation coefficient compared to SOTA performance
predictors. Further, we show that the predictions from LLM-PP can be distilled
to a small regression model (LLM-Distill-PP). LLM-Distill-PP models
surprisingly retain the performance of LLM-PP largely and can be a
cost-effective alternative for heavy use cases of performance estimation.
Specifically, for neural architecture search (NAS), we propose a Hybrid-Search
algorithm for NAS (HS-NAS), which uses LLM-Distill-PP for the initial part of
search, resorting to the baseline predictor for rest of the search. We show
that HS-NAS performs very similar to SOTA NAS across benchmarks, reduces search
hours by 50% roughly, and in some cases, improves latency, GFLOPs, and model
size
Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora
International audienceThe problem of comparing two bodies of text and searching for words that differ in their usage between them arises often in digital humanities and computational social science. This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large. However, these methods often require extensive filtering of the vocabulary to perform well, and-as we show in this work-result in unstable, and hence less reliable, results. We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word. The method is simple, interpretable and stable. We demonstrate its effectiveness in 9 different setups, considering different corpus splitting criteria (age, gender and profession of tweet authors, time of tweet) and different languages (English, French and Hebrew)
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in
Neural Machine Translation (NMT) tasks. Existing works in MoE mostly consider a
homogeneous design where the same number of experts of the same size are placed
uniformly throughout the network. Furthermore, existing MoE works do not
consider computational constraints (e.g., FLOPs, latency) to guide their
design. To this end, we develop AutoMoE -- a framework for designing
heterogeneous MoE's under computational constraints. AutoMoE leverages Neural
Architecture Search (NAS) to obtain efficient sparse MoE sub-transformers with
4x inference speedup (CPU) and FLOPs reduction over manually designed
Transformers, with parity in BLEU score over dense Transformer and within 1
BLEU point of MoE SwitchTransformer, on aggregate over benchmark datasets for
NMT. Heterogeneous search space with dense and sparsely activated Transformer
modules (e.g., how many experts? where to place them? what should be their
sizes?) allows for adaptive compute -- where different amounts of computations
are used for different tokens in the input. Adaptivity comes naturally from
routing decisions which send tokens to experts of different sizes. AutoMoE
code, data, and trained models are available at https://aka.ms/AutoMoE.Comment: ACL 2023 Finding
Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints
Autocomplete is a task where the user inputs a piece of text, termed prompt,
which is conditioned by the model to generate semantically coherent
continuation. Existing works for this task have primarily focused on datasets
(e.g., email, chat) with high frequency user prompt patterns (or focused
prompts) where word-based language models have been quite effective. In this
work, we study the more challenging setting consisting of low frequency user
prompt patterns (or broad prompts, e.g., prompt about 93rd academy awards) and
demonstrate the effectiveness of character-based language models. We study this
problem under memory-constrained settings (e.g., edge devices and smartphones),
where character-based representation is effective in reducing the overall model
size (in terms of parameters). We use WikiText-103 benchmark to simulate broad
prompts and demonstrate that character models rival word models in exact match
accuracy for the autocomplete task, when controlled for the model size. For
instance, we show that a 20M parameter character model performs similar to an
80M parameter word model in the vanilla setting. We further propose novel
methods to improve character models by incorporating inductive bias in the form
of compositional information and representation transfer from large word
models
ELMoLex: Connecting ELMo and Lexicon features for Dependency Parsing
International audienceIn this paper, we present the details of the neural dependency parser and the neu-ral tagger submitted by our team 'ParisNLP' to the CoNLL 2018 Shared Task on parsing from raw text to Universal Dependencies. We augment the deep Biaffine (BiAF) parser (Dozat and Manning, 2016) with novel features to perform competitively: we utilize an indomain version of ELMo features (Peters et al., 2018) which provide context-dependent word representations; we utilize disambiguated, embedded, morphosyntactic features from lexicons (Sagot, 2018), which complements the existing feature set. Henceforth , we call our system 'ELMoLex'. In addition to incorporating character embed-dings, ELMoLex leverage pre-trained word vectors, ELMo and morphosyntactic features (whenever available) to correctly handle rare or unknown words which are prevalent in languages with complex morphology. ELMoLex 1 ranked 11th by Labeled Attachment Score metric (70.64%), Morphology-aware LAS metric (55.74%) and ranked 9th by Bilexical dependency metric (60.70%). In an extrinsic evaluation setup, ELMoLex ranked 7 th for Event Extraction, Negation Resolution tasks and 11th for Opinion Analysis task by F1 score
Prevalence and architecture of de novo mutations in developmental disorders.
The genomes of individuals with severe, undiagnosed developmental disorders are enriched in damaging de novo mutations (DNMs) in developmentally important genes. Here we have sequenced the exomes of 4,293 families containing individuals with developmental disorders, and meta-analysed these data with data from another 3,287 individuals with similar disorders. We show that the most important factors influencing the diagnostic yield of DNMs are the sex of the affected individual, the relatedness of their parents, whether close relatives are affected and the parental ages. We identified 94 genes enriched in damaging DNMs, including 14 that previously lacked compelling evidence of involvement in developmental disorders. We have also characterized the phenotypic diversity among these disorders. We estimate that 42% of our cohort carry pathogenic DNMs in coding sequences; approximately half of these DNMs disrupt gene function and the remainder result in altered protein function. We estimate that developmental disorders caused by DNMs have an average prevalence of 1 in 213 to 1 in 448 births, depending on parental age. Given current global demographics, this equates to almost 400,000 children born per year
Distinct genetic architectures for syndromic and nonsyndromic congenital heart defects identified by exome sequencing.
Congenital heart defects (CHDs) have a neonatal incidence of 0.8-1% (refs. 1,2). Despite abundant examples of monogenic CHD in humans and mice, CHD has a low absolute sibling recurrence risk (∼2.7%), suggesting a considerable role for de novo mutations (DNMs) and/or incomplete penetrance. De novo protein-truncating variants (PTVs) have been shown to be enriched among the 10% of 'syndromic' patients with extra-cardiac manifestations. We exome sequenced 1,891 probands, including both syndromic CHD (S-CHD, n = 610) and nonsyndromic CHD (NS-CHD, n = 1,281). In S-CHD, we confirmed a significant enrichment of de novo PTVs but not inherited PTVs in known CHD-associated genes, consistent with recent findings. Conversely, in NS-CHD we observed significant enrichment of PTVs inherited from unaffected parents in CHD-associated genes. We identified three genome-wide significant S-CHD disorders caused by DNMs in CHD4, CDK13 and PRKD1. Our study finds evidence for distinct genetic architectures underlying the low sibling recurrence risk in S-CHD and NS-CHD