OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining
Pretraining multilingual language models from scratch requires considerable
computational resources and substantial training data. Therefore, a more
efficient method is to adapt existing pretrained language models (PLMs) to new
languages via vocabulary extension and continued pretraining. However, this
method usually initializes the embeddings of new subwords randomly and
introduces substantially more embedding parameters into the language model,
thus reducing efficiency. To address these issues, we propose a novel
framework: \textbf{O}ne \textbf{F}or \textbf{A}ll (\textbf{\textsc{Ofa}}),
which wisely initializes the embeddings of unseen subwords from target
languages and thus can adapt a PLM to multiple languages efficiently and
effectively. \textsc{Ofa} takes advantage of external well-aligned multilingual
word embeddings and injects the alignment knowledge into the new embeddings. In
addition, \textsc{Ofa} applies matrix factorization and replaces the cumbersome
embedding matrix with two lower-dimensional matrices, which significantly
reduces the number of parameters without sacrificing performance. Through extensive
experiments, we show that models initialized with \textsc{Ofa} are efficient and
outperform several baselines. \textsc{Ofa} not only accelerates the convergence
of continued pretraining, which helps under a limited computation budget, but
also improves zero-shot cross-lingual transfer on a wide range of downstream
tasks. We make our code and models publicly available.
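The two core ideas can be sketched in a few lines of NumPy (an illustrative sketch, not the released \textsc{Ofa} code; the top-k softmax weighting and all function names here are assumptions): initialize each unseen subword embedding as a similarity-weighted average of source-subword embeddings, with similarities taken from the external aligned space, and factorize the embedding matrix into two lower-dimensional ones.

```python
import numpy as np

def init_new_embeddings(src_emb, src_ext, tgt_ext, top_k=3):
    """Initialize unseen target-subword embeddings as similarity-weighted
    averages of source-subword embeddings; similarities come from an
    external, well-aligned embedding space (src_ext / tgt_ext)."""
    def unit(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sim = unit(tgt_ext) @ unit(src_ext).T          # (n_tgt, n_src) cosine sims
    new_emb = np.zeros((tgt_ext.shape[0], src_emb.shape[1]))
    for i, row in enumerate(sim):
        idx = np.argsort(row)[-top_k:]             # k most similar source subwords
        w = np.exp(row[idx]); w /= w.sum()         # softmax weights
        new_emb[i] = w @ src_emb[idx]
    return new_emb

def factorize_embeddings(emb, rank):
    """Replace a (vocab, dim) embedding matrix with two lower-dimensional
    matrices via truncated SVD, so emb ~ lo @ up."""
    u, s, vt = np.linalg.svd(emb, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]       # (vocab, rank), (rank, dim)
```

With a rank well below the hidden dimension, the factorization cuts embedding parameters from vocab x dim to vocab x rank + rank x dim.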
Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.
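NER baselines of this kind are conventionally scored with exact-match entity-level F1 over labeled spans. A minimal sketch of span extraction from BIO tags and the resulting metric (illustrative only; this is not the UNER evaluation code):

```python
def extract_spans(tags):
    """Convert a BIO tag sequence into (start, end, type) spans,
    end exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):         # sentinel flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside and start is not None:       # current span ends here
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 over labeled spans."""
    gold, pred = set(extract_spans(gold_tags)), set(extract_spans(pred_tags))
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

A predicted span counts only if both its boundaries and its type match the gold annotation, which is what makes cross-lingually consistent span boundaries essential for comparable scores.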
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, “help” from related languages, and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500
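Continued pretraining of an XLM-R-style model means further masked language modeling on the new corpus. A minimal sketch of the standard BERT-style 80/10/10 masking step at the heart of that objective (illustrative; not the Glot500 training code, and the function name is an assumption):

```python
import numpy as np

def mlm_mask(token_ids, mask_id, vocab_size, rng, p=0.15):
    """BERT-style masking: select ~p of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged.
    Labels are -100 (ignored by the loss) everywhere else."""
    ids = np.array(token_ids)
    labels = np.full_like(ids, -100)
    sel = rng.random(ids.shape) < p                # positions to predict
    labels[sel] = ids[sel]                         # remember the originals
    r = rng.random(ids.shape)
    ids[sel & (r < 0.8)] = mask_id                 # 80%: replace with [MASK]
    rand = sel & (r >= 0.8) & (r < 0.9)            # 10%: replace with random token
    ids[rand] = rng.integers(0, vocab_size, rand.sum())
    return ids, labels                             # remaining 10%: left unchanged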
Hierarchical Attention Network with Pairwise Loss for Chinese Zero Pronoun Resolution
Recent neural network methods for Chinese zero pronoun resolution did not consider bidirectional attention between zero pronouns and candidate antecedents, and simply treated the task as classification, ignoring the relationship between different candidates of a zero pronoun. To solve these problems, we propose a Hierarchical Attention Network with Pairwise Loss (HAN-PL) for Chinese zero pronoun resolution. In the proposed HAN-PL, we design a two-layer attention model to generate more powerful representations for zero pronouns and candidate antecedents. Furthermore, we propose a novel pairwise loss by introducing the correct-antecedent similarity constraint and the pairwise-margin loss, making the learned model more discriminative. Extensive experiments have been conducted on the OntoNotes 5.0 dataset, and our model achieves state-of-the-art performance in the task of Chinese zero pronoun resolution.
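The pairwise-margin component can be sketched as a hinge over incorrect candidates (an illustrative simplification under assumed scoring; the paper's full loss also includes the correct-antecedent similarity constraint, omitted here):

```python
def pairwise_margin_loss(scores, correct, margin=0.1):
    """Hinge the correct antecedent's score against every incorrect
    candidate: sum_j max(0, margin - s_correct + s_j)."""
    s_c = scores[correct]
    return sum(
        max(0.0, margin - s_c + s)
        for j, s in enumerate(scores)
        if j != correct
    )
```

Unlike a per-candidate classification loss, this term is zero only when the correct antecedent outscores every competitor by at least the margin, which is what makes the model discriminative between candidates.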
RNA Aptamers with Specificity for Heparosan and Chondroitin Glycosaminoglycans
In this study, two groups of RNA aptamers have been selected against two
main classes of glycosaminoglycans (GAGs), heparosan and chondroitin, which
have proven difficult to detect specifically in biological samples. GAGs are
linear, anionic, polydisperse polysaccharides found ubiquitously in nature,
yet their detection remains problematic. GAGs comprise repeating disaccharide
units consisting of uronic acid and hexosamine residues that are often
sulfated at various
positions. Monoclonal antibodies are frequently used in biology and
medicine to recognize various biological analytes with high affinity
and specificity. However, GAGs are conserved across the whole animal
phylogenic tree and are nonimmunogenic in hosts traditionally used
for natural antibody generation. Thus, it has been challenging to
obtain high affinity, selective antibodies that recognize various
GAGs. In the absence of anti-GAG antibodies, glycobiologists have
relied on the use of specific enzymes to convert GAGs to oligosaccharides
for analysis by mass spectrometry. Unfortunately, while these methods
are sensitive, they can be labor-intensive and cannot be used for
in situ detection of intact GAGs in cells and tissues. Aptamers are
single-stranded oligonucleotide (DNA or RNA) ligands capable of high
selectivity and high affinity detection of biological analytes. Aptamers
can be developed in vitro by the systematic evolution of ligands by
exponential enrichment (SELEX) to recognize nonimmunogenic targets,
including neutral carbohydrates. This study utilizes the SELEX method
to generate RNA aptamers that bind specifically to the unmodified GAGs
heparosan and chondroitin. Binding confirmation and cross-screening against
other GAGs were performed using confocal microscopy, yielding three aptamers
specific to each target. The affinity constant of each RNA aptamer was
obtained from the fluorescent output after interaction with the respective
GAG target immobilized on plates; the KD values were determined to be 0.71–1.0 μM
for all aptamers. With successful chemical modification (to stabilize the
RNA aptamers in biological systems) and fluorescent tagging (to visualize
them), these aptamers could serve as specific detection reagents for these
important GAGs in biological samples.
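KD values of this kind are typically obtained by fitting a one-site saturation binding curve to the fluorescence readings at a series of ligand concentrations. A minimal sketch (illustrative; a simple grid-search least-squares fit, not the authors' analysis pipeline):

```python
import numpy as np

def binding_curve(conc, fmax, kd):
    """One-site saturation binding: F = Fmax * [L] / (KD + [L])."""
    return fmax * conc / (kd + conc)

def fit_kd(conc, signal, kd_grid):
    """Grid-search KD; for each candidate KD the best-fit Fmax has a
    closed form (linear least squares on the known basis curve)."""
    best_kd, best_fmax, best_err = None, None, np.inf
    for kd in kd_grid:
        basis = conc / (kd + conc)
        fmax = (basis @ signal) / (basis @ basis)  # optimal amplitude for this KD
        err = ((signal - fmax * basis) ** 2).sum()
        if err < best_err:
            best_kd, best_fmax, best_err = kd, fmax, err
    return best_kd, best_fmax
```

At the fitted KD, the signal is half of Fmax, which is why KD is read directly off the midpoint of the saturation curve.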
Blending Advertising with Organic Content in E-commerce via Virtual Bids
It has become increasingly common for sponsored content (i.e., paid ads) and non-sponsored content to be displayed jointly to users, especially on e-commerce platforms, so the two types of content may interact to influence users' engagement behavior. In general, sponsored content helps brands achieve their marketing goals and provides ad revenue to the platforms. In contrast, non-sponsored content contributes to the long-term health of the platform by increasing users' engagement. A key challenge for platforms is learning how to blend the two types of content while accounting for their interactions and balancing these business objectives. This paper proposes a system built for this purpose and applied to product detail pages of JD.COM, an e-commerce company. The system achieves three objectives: (a) optimization of competing business objectives via Virtual Bids, allowing the platform to express its valuation of these objectives; (b) modeling of users' click behavior that explicitly considers the influence exerted by the sponsored and non-sponsored content displayed alongside, through a deep learning approach; and (c) a Vickrey-Clarke-Groves (VCG) auction design compatible with the allocation of ads and its induced externalities. Experiments are presented demonstrating the performance of the proposed system. Moreover, our approach is fully deployed and serves all traffic through JD.COM's mobile application.
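The VCG principle charges each participant the externality it imposes on the others. A minimal sketch over a finite set of candidate page allocations (illustrative; the deployed system's allocation space, virtual-bid weighting, and externality model are far richer):

```python
def vcg(values):
    """values[b][a] = bidder b's value for page allocation a.
    Picks the welfare-maximizing allocation; charges each bidder its
    externality: (others' best welfare without b) minus (others'
    welfare at the chosen allocation)."""
    n_alloc = len(values[0])
    welfare = [sum(v[a] for v in values) for a in range(n_alloc)]
    a_star = max(range(n_alloc), key=welfare.__getitem__)
    payments = []
    for b, vb in enumerate(values):
        others_at = welfare[a_star] - vb[a_star]
        others_best = max(welfare[a] - vb[a] for a in range(n_alloc))
        payments.append(others_best - others_at)
    return a_star, payments
```

Encoding a single-item auction as two allocations ("item to bidder 0" vs. "item to bidder 1") recovers the familiar second-price outcome, which is the sanity check below.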
Dietary Fiber Intake and Endometrial Cancer Risk: A Systematic Review and Meta-Analysis
Epidemiological studies are inconclusive regarding the association between dietary fiber intake and endometrial cancer risk. Thus, we aimed to conduct a meta-analysis to clarify the association between dietary fiber and endometrial cancer risk. We searched the PubMed and ISI Web databases for relevant studies through March 2018. The association between dietary fiber and endometrial cancer risk was evaluated by conducting a meta-analysis including 3 cohort and 12 case–control studies. A significant negative association was observed between total dietary fiber intake and endometrial cancer risk in 11 case–control studies (odds ratios (OR) 0.76, 95% confidence interval (CI): 0.64–0.89, I2 = 35.2%, p = 0.117), but a marginal positive association was observed in three cohort studies (relative risk (RR) 1.22, 95% CI: 1.00–1.49, I2 = 0.0%, p = 0.995). Particularly, a negative association was observed in North America (OR = 0.70, 95% CI: 0.59–0.83, I2 = 8.9%, p = 0.362). In addition, a positive association was observed for cereal fiber (RR = 1.26, 95% CI: 1.03–1.52, I2 = 0.0%, p = 0.530, 3 cohort studies) and a negative association was observed for vegetable fiber (OR = 0.74, 95% CI: 0.58–0.94, I2 = 0.0%, p = 0.445, 3 case–control studies). In conclusion, negative associations with endometrial cancer risk were observed for higher total dietary fiber intake and higher vegetable fiber intake in the case–control studies. However, results from the cohort studies suggested positive relationships of higher total fiber intake and higher cereal fiber intake with endometrial cancer risk.
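Pooled estimates like those above follow standard inverse-variance meta-analysis on the log scale. A minimal fixed-effect sketch that recovers each study's standard error from its reported 95% CI (illustrative; the published analysis may use random-effects models where heterogeneity warrants):

```python
import math

def pool_odds_ratios(ors, cis, z=1.96):
    """Fixed-effect inverse-variance pooling on the log-OR scale.
    Each study's standard error is recovered from its 95% CI:
    se = (ln(upper) - ln(lower)) / (2 * z)."""
    num = den = 0.0
    for or_, (lo, hi) in zip(ors, cis):
        se = (math.log(hi) - math.log(lo)) / (2 * z)
        w = 1.0 / se ** 2                          # inverse-variance weight
        num += w * math.log(or_)
        den += w
    log_pooled = num / den
    se_pooled = math.sqrt(1.0 / den)
    return math.exp(log_pooled), (
        math.exp(log_pooled - z * se_pooled),
        math.exp(log_pooled + z * se_pooled),
    )
```

Because pooling happens on the log scale, a single study pools to itself, and studies with tighter CIs dominate the combined estimate.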