14 research outputs found

    OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining

    Full text link
    Pretraining multilingual language models from scratch requires considerable computational resources and substantial training data. A more efficient method is therefore to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually initializes the embeddings of new subwords randomly and adds substantially more embedding parameters to the language model, reducing efficiency. To address these issues, we propose a novel framework, One For All (OFA), which wisely initializes the embeddings of unseen subwords from target languages and can thus adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual word embeddings and injects the alignment knowledge into the new embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which significantly reduces the number of parameters without sacrificing performance. Through extensive experiments, we show that models initialized with OFA are efficient and outperform several baselines. OFA not only accelerates the convergence of continued pretraining, which benefits settings with a limited computation budget, but also improves zero-shot crosslingual transfer on a wide range of downstream tasks. We make our code and models publicly available.
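    The abstract describes two ideas: initializing unseen subword embeddings from external well-aligned multilingual word vectors, and factorizing the embedding matrix into two lower-dimensional matrices. A minimal sketch of both, using toy random data and hypothetical dimensions rather than the released OFA implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical shapes): source PLM embeddings and external
# well-aligned multilingual word vectors for seen and unseen subwords.
V_src, V_new, d, d_ext, r = 8, 3, 16, 10, 4
E_src = rng.normal(size=(V_src, d))        # source-model subword embeddings
W_src = rng.normal(size=(V_src, d_ext))    # external vectors, seen subwords
W_new = rng.normal(size=(V_new, d_ext))    # external vectors, unseen subwords

# 1) Initialize each unseen subword as a similarity-weighted average of
#    seen-subword embeddings, with similarity measured in the external space.
sim = W_new @ W_src.T                                           # (V_new, V_src)
weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # softmax rows
E_new = weights @ E_src                                         # (V_new, d)

# 2) Factorize the full embedding matrix into two lower-dimensional matrices,
#    cutting parameters from (V_src+V_new)*d to (V_src+V_new)*r + r*d.
E_full = np.vstack([E_src, E_new])
U_left, s, Vt = np.linalg.svd(E_full, full_matrices=False)
P = U_left[:, :r] * s[:r]                  # per-subword coordinates
U = Vt[:r]                                 # shared projection
E_approx = P @ U                           # low-rank reconstruction

print(E_full.shape, P.shape, U.shape)
```

The rank `r` trades parameter savings against reconstruction quality; here it is chosen arbitrarily for illustration.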

    Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

    Get PDF
    We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines in both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.

    Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

    Get PDF
    The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, “help” from related languages, and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages but instead strive to support as many languages as possible, to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.

    Hierarchical Attention Network with Pairwise Loss for Chinese Zero Pronoun Resolution

    No full text
    Recent neural network methods for Chinese zero pronoun resolution did not take bidirectional attention between zero pronouns and candidate antecedents into consideration, and simply treated the task as a classification task, ignoring the relationships between different candidates of a zero pronoun. To solve these problems, we propose a Hierarchical Attention Network with Pairwise Loss (HAN-PL) for Chinese zero pronoun resolution. In the proposed HAN-PL, we design a two-layer attention model to generate more powerful representations for zero pronouns and candidate antecedents. Furthermore, we propose a novel pairwise loss by introducing a correct-antecedent similarity constraint and a pairwise-margin loss, making the learned model more discriminative. Extensive experiments have been conducted on the OntoNotes 5.0 dataset, and our model achieves state-of-the-art performance on the task of Chinese zero pronoun resolution.
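    The pairwise-margin idea can be sketched as a hinge loss that pushes every correct antecedent's score above every incorrect candidate's score by a fixed margin. This is an illustrative stand-in for the paper's loss (which additionally includes a correct-antecedent similarity constraint), with made-up scores:

```python
import numpy as np

def pairwise_margin_loss(scores, gold_mask, margin=0.5):
    """Hinge loss over all (correct, incorrect) candidate pairs: each
    correct antecedent should outscore each incorrect candidate by
    at least `margin`. A sketch, not the paper's exact formulation."""
    pos = scores[gold_mask]              # scores of correct antecedents
    neg = scores[~gold_mask]             # scores of incorrect candidates
    diffs = margin - pos[:, None] + neg[None, :]   # all pos/neg pairs
    return np.maximum(diffs, 0.0).mean()

# One zero pronoun with three candidates; the first is the true antecedent.
scores = np.array([1.0, 0.9, 0.2])
gold = np.array([True, False, False])
print(pairwise_margin_loss(scores, gold))  # only the 1.0-vs-0.9 pair violates
```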

    RNA Aptamers with Specificity for Heparosan and Chondroitin Glycosaminoglycans

    No full text
    In this study, two respective groups of RNA aptamers have been selected against two main classes of glycosaminoglycans (GAGs), heparosan and chondroitin, which have proven difficult to detect specifically in biological samples. GAGs are linear, anionic, polydisperse polysaccharides found ubiquitously in nature, yet their detection remains problematic. GAGs comprise repeating disaccharide units consisting of uronic acid and hexosamine residues that are often also sulfated at various positions. Monoclonal antibodies are frequently used in biology and medicine to recognize various biological analytes with high affinity and specificity. However, GAGs are conserved across the whole animal phylogenetic tree and are nonimmunogenic in hosts traditionally used for natural antibody generation. Thus, it has been challenging to obtain high-affinity, selective antibodies that recognize various GAGs. In the absence of anti-GAG antibodies, glycobiologists have relied on specific enzymes to convert GAGs to oligosaccharides for analysis by mass spectrometry. Unfortunately, while these methods are sensitive, they can be labor-intensive and cannot be used for in situ detection of intact GAGs in cells and tissues. Aptamers are single-stranded oligonucleotide (DNA or RNA) ligands capable of highly selective, high-affinity detection of biological analytes. Aptamers can be developed in vitro by the systematic evolution of ligands by exponential enrichment (SELEX) to recognize nonimmunogenic targets, including neutral carbohydrates. This study utilizes the SELEX method to generate RNA aptamers that specifically bind the unmodified GAGs heparosan and chondroitin. Binding confirmation and cross-screening with other GAGs were performed using confocal microscopy, affording three aptamers specific to each target. The affinity constant of each RNA aptamer was obtained from the fluorescent output after interaction with the respective GAG target immobilized on plates; the KD values were determined to be 0.71–1.0 μM for all aptamers. With successful chemical modification (to stabilize the RNA aptamers in biological systems) and fluorescent tagging (to visualize them), these aptamers could serve as specific detection reagents for these important GAGs in biological samples.
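    The KD determination step can be illustrated with a one-site binding isotherm. The concentrations, signal, and grid-search fit below are hypothetical stand-ins for the plate-based fluorescence assay; only the 0.71–1.0 μM KD range comes from the abstract:

```python
import numpy as np

def fraction_bound(conc, kd):
    """One-site binding isotherm: fraction of immobilized target occupied
    by aptamer at free aptamer concentration `conc` (same units as kd)."""
    return conc / (kd + conc)

# Simulated fluorescence readout for an aptamer with KD = 0.8 uM
# (a hypothetical value inside the reported 0.71-1.0 uM range).
conc = np.array([0.1, 0.25, 0.5, 1.0, 2.0, 4.0])  # uM
signal = fraction_bound(conc, kd=0.8)

# Recover KD by a coarse grid search minimizing squared error,
# standing in for the nonlinear fit used with real plate-reader data.
grid = np.linspace(0.1, 2.0, 1901)                 # 0.001 uM steps
sse = [((fraction_bound(conc, kd) - signal) ** 2).sum() for kd in grid]
kd_hat = float(grid[int(np.argmin(sse))])
print(round(kd_hat, 2))
```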

    Blending Advertising with Organic Content in E-commerce via Virtual Bids

    No full text
    It has become increasingly common for sponsored content (i.e., paid ads) and non-sponsored content to be displayed jointly to users, especially on e-commerce platforms. The two types of content can therefore interact to influence users' engagement behavior. In general, sponsored content helps brands achieve their marketing goals and provides ad revenue to the platform, whereas non-sponsored content contributes to the long-term health of the platform by increasing user engagement. A key challenge for platforms is how to blend the two types of content so that their interactions are taken into account while balancing these business objectives. This paper proposes a system built for this purpose and applied to product detail pages of JD.COM, an e-commerce company. The system achieves three objectives: (a) optimization of competing business objectives via Virtual Bids, which let the platform express its valuation of these objectives; (b) modeling of users' click behavior that explicitly considers the influence exerted by the sponsored and non-sponsored content displayed alongside, via a deep learning approach; and (c) a Vickrey-Clarke-Groves (VCG) auction design compatible with the allocation of ads and its induced externalities. Experiments are presented demonstrating the performance of the proposed system. Moreover, our approach is fully deployed and serves all traffic through JD.COM's mobile application.
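    A minimal sketch of the auction ideas in (a) and (c): virtual bids fold the platform's valuation of organic content into each candidate's score, and a VCG rule for identical slots charges winners the highest losing score. This is a simplification for illustration; the paper's design handles slot-dependent allocation and ad externalities, and all names and numbers below are hypothetical:

```python
def vcg_unit_demand(bids, k):
    """Allocate k identical slots to the k highest bids; with identical
    slots and unit demand, each winner's VCG payment reduces to the
    highest losing bid (0 if there are no losers)."""
    order = sorted(range(len(bids)), key=lambda i: -bids[i])
    winners = order[:k]
    price = bids[order[k]] if len(bids) > k else 0.0
    return winners, [price] * len(winners)

# Virtual bids price organic content into the auction: each candidate's
# score is its ad bid plus a platform-set virtual bid reflecting
# engagement value (illustrative values).
ad_bids = {"ad_1": 3.0, "ad_2": 1.0, "organic_1": 0.0}
virtual = {"ad_1": 0.0, "ad_2": 0.0, "organic_1": 2.0}
items = list(ad_bids)
scores = [ad_bids[i] + virtual[i] for i in items]
winners, prices = vcg_unit_demand(scores, k=2)
print([items[w] for w in winners], prices)  # organic_1 beats ad_2 on score
```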

    Dietary Fiber Intake and Endometrial Cancer Risk: A Systematic Review and Meta-Analysis

    No full text
    Epidemiological studies are inconclusive regarding the association between dietary fiber intake and endometrial cancer risk. We therefore conducted a meta-analysis to clarify this association. We searched the PubMed and ISI Web databases for relevant studies through March 2018. The association between dietary fiber and endometrial cancer risk was evaluated in a meta-analysis including 3 cohort and 12 case–control studies. A significant negative association was observed between total dietary fiber intake and endometrial cancer risk in 11 case–control studies (odds ratio (OR) 0.76, 95% confidence interval (CI): 0.64–0.89, I2 = 35.2%, p = 0.117), but a marginal positive association was observed in three cohort studies (relative risk (RR) 1.22, 95% CI: 1.00–1.49, I2 = 0.0%, p = 0.995). In particular, a negative association was observed in North America (OR = 0.70, 95% CI: 0.59–0.83, I2 = 8.9%, p = 0.362). In addition, a positive association was observed for cereal fiber (RR = 1.26, 95% CI: 1.03–1.52, I2 = 0.0%, p = 0.530, 3 cohort studies) and a negative association for vegetable fiber (OR = 0.74, 95% CI: 0.58–0.94, I2 = 0.0%, p = 0.445, 3 case–control studies). In conclusion, negative associations with endometrial cancer risk were observed for higher total dietary fiber intake and higher vegetable fiber intake in the case–control studies, whereas the cohort studies suggested positive relationships of higher total fiber intake and higher cereal fiber intake with endometrial cancer risk.
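    Pooled figures such as OR 0.76 (95% CI: 0.64–0.89) are typically produced by fixed-effect inverse-variance weighting on the log scale, with each study's variance recovered from its confidence interval. A sketch with illustrative numbers, not the actual studies from this meta-analysis:

```python
import math

def pooled_or(ors, cis):
    """Fixed-effect inverse-variance pooling of odds ratios on the log
    scale; each study's standard error is recovered from its 95% CI."""
    logs, weights = [], []
    for or_, (lo, hi) in zip(ors, cis):
        se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # CI width -> SE
        logs.append(math.log(or_))
        weights.append(1.0 / se ** 2)                    # inverse variance
    pooled = sum(w * l for w, l in zip(weights, logs)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se_pooled),
            math.exp(pooled + 1.96 * se_pooled))

# Three hypothetical case-control studies (OR with 95% CI each).
or_hat, lo, hi = pooled_or([0.8, 0.7, 0.75],
                           [(0.6, 1.05), (0.5, 0.95), (0.55, 1.0)])
print(round(or_hat, 3), round(lo, 3), round(hi, 3))
```

A random-effects model (e.g. DerSimonian-Laird) would widen the interval when between-study heterogeneity (I2) is high.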