OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining
Pretraining multilingual language models from scratch requires considerable
computational resources and substantial training data. Therefore, a more
efficient method is to adapt existing pretrained language models (PLMs) to new
languages via vocabulary extension and continued pretraining. However, this
method usually initializes the embeddings of new subwords randomly and
introduces substantially more embedding parameters into the language model,
thus reducing efficiency. To address these issues, we propose a novel
framework: \textbf{O}ne \textbf{F}or \textbf{A}ll (\textbf{\textsc{Ofa}}),
which wisely initializes the embeddings of unseen subwords from target
languages and thus can adapt a PLM to multiple languages efficiently and
effectively. \textsc{Ofa} takes advantage of external well-aligned multilingual
word embeddings and injects the alignment knowledge into the new embeddings. In
addition, \textsc{Ofa} applies matrix factorization and replaces the cumbersome
embedding matrix with two lower-dimensional matrices, which significantly
reduces the number of parameters without sacrificing performance. Through extensive
experiments, we show that models initialized with \textsc{Ofa} are efficient and
outperform several baselines. \textsc{Ofa} not only accelerates the convergence
of continued pretraining, which helps under a limited computation budget, but
also improves zero-shot cross-lingual transfer on a wide range of downstream
tasks. We make our code and models publicly available.
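The two core ideas can be sketched in a few lines of NumPy (an illustrative sketch, not the released \textsc{Ofa} code; the top-k softmax weighting and all function names here are assumptions): initialize each unseen subword embedding as a similarity-weighted average of source-subword embeddings, with similarities taken from the external aligned space, and factorize the embedding matrix into two lower-dimensional ones.

```python
import numpy as np

def init_new_embeddings(src_emb, src_ext, tgt_ext, top_k=3):
    """Initialize unseen target-subword embeddings as similarity-weighted
    averages of source-subword embeddings; similarities come from an
    external, well-aligned embedding space (src_ext / tgt_ext)."""
    def unit(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    sim = unit(tgt_ext) @ unit(src_ext).T          # (n_tgt, n_src) cosine sims
    new_emb = np.zeros((tgt_ext.shape[0], src_emb.shape[1]))
    for i, row in enumerate(sim):
        idx = np.argsort(row)[-top_k:]             # k most similar source subwords
        w = np.exp(row[idx]); w /= w.sum()         # softmax weights
        new_emb[i] = w @ src_emb[idx]
    return new_emb

def factorize_embeddings(emb, rank):
    """Replace a (vocab, dim) embedding matrix with two lower-dimensional
    matrices via truncated SVD, so emb ~ lo @ up."""
    u, s, vt = np.linalg.svd(emb, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]       # (vocab, rank), (rank, dim)
```

With a rank well below the hidden dimension, the factorization cuts embedding parameters from vocab x dim to vocab x rank + rank x dim.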
Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.
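NER baselines of this kind are conventionally scored with exact-match entity-level F1 over labeled spans. A minimal sketch of span extraction from BIO tags and the resulting metric (illustrative only; this is not the UNER evaluation code):

```python
def extract_spans(tags):
    """Convert a BIO tag sequence into (start, end, type) spans,
    end exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):         # sentinel flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside and start is not None:       # current span ends here
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 over labeled spans."""
    gold, pred = set(extract_spans(gold_tags)), set(extract_spans(pred_tags))
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

A predicted span counts only if both its boundaries and its type match the gold annotation, which is what makes cross-lingually consistent span boundaries essential for comparable scores.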
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, “help” from related languages, and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500
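Continued pretraining of an XLM-R-style model means further masked language modeling on the new corpus. A minimal sketch of the standard BERT-style 80/10/10 masking step at the heart of that objective (illustrative; not the Glot500 training code, and the function name is an assumption):

```python
import numpy as np

def mlm_mask(token_ids, mask_id, vocab_size, rng, p=0.15):
    """BERT-style masking: select ~p of positions as prediction targets;
    of those, 80% become [MASK], 10% a random token, 10% stay unchanged.
    Labels are -100 (ignored by the loss) everywhere else."""
    ids = np.array(token_ids)
    labels = np.full_like(ids, -100)
    sel = rng.random(ids.shape) < p                # positions to predict
    labels[sel] = ids[sel]                         # remember the originals
    r = rng.random(ids.shape)
    ids[sel & (r < 0.8)] = mask_id                 # 80%: replace with [MASK]
    rand = sel & (r >= 0.8) & (r < 0.9)            # 10%: replace with random token
    ids[rand] = rng.integers(0, vocab_size, rand.sum())
    return ids, labels                             # remaining 10%: left unchanged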
Hierarchical Attention Network with Pairwise Loss for Chinese Zero Pronoun Resolution
Recent neural network methods for Chinese zero pronoun resolution did not consider bidirectional attention between zero pronouns and candidate antecedents, and simply treated the task as classification, ignoring the relationship between different candidates of a zero pronoun. To solve these problems, we propose a Hierarchical Attention Network with Pairwise Loss (HAN-PL) for Chinese zero pronoun resolution. In the proposed HAN-PL, we design a two-layer attention model to generate more powerful representations for zero pronouns and candidate antecedents. Furthermore, we propose a novel pairwise loss by introducing the correct-antecedent similarity constraint and the pairwise-margin loss, making the learned model more discriminative. Extensive experiments have been conducted on the OntoNotes 5.0 dataset, and our model achieves state-of-the-art performance in the task of Chinese zero pronoun resolution.
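The pairwise-margin component can be sketched as a hinge over incorrect candidates (an illustrative simplification under assumed scoring; the paper's full loss also includes the correct-antecedent similarity constraint, omitted here):

```python
def pairwise_margin_loss(scores, correct, margin=0.1):
    """Hinge the correct antecedent's score against every incorrect
    candidate: sum_j max(0, margin - s_correct + s_j)."""
    s_c = scores[correct]
    return sum(
        max(0.0, margin - s_c + s)
        for j, s in enumerate(scores)
        if j != correct
    )
```

Unlike a per-candidate classification loss, this term is zero only when the correct antecedent outscores every competitor by at least the margin, which is what makes the model discriminative between candidates.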
RNA Aptamers with Specificity for Heparosan and Chondroitin Glycosaminoglycans
In this study, two groups of RNA aptamers have been selected against two
main classes of glycosaminoglycans (GAGs), heparosan and chondroitin, which
have proven difficult to detect specifically in biological samples. GAGs are
linear, anionic, polydisperse polysaccharides found ubiquitously in nature,
yet their detection remains problematic. GAGs comprise repeating disaccharide
units consisting of uronic acid and hexosamine residues that are often
sulfated at various
positions. Monoclonal antibodies are frequently used in biology and
medicine to recognize various biological analytes with high affinity
and specificity. However, GAGs are conserved across the whole animal
phylogenic tree and are nonimmunogenic in hosts traditionally used
for natural antibody generation. Thus, it has been challenging to
obtain high affinity, selective antibodies that recognize various
GAGs. In the absence of anti-GAG antibodies, glycobiologists have
relied on the use of specific enzymes to convert GAGs to oligosaccharides
for analysis by mass spectrometry. Unfortunately, while these methods
are sensitive, they can be labor-intensive and cannot be used for
in situ detection of intact GAGs in cells and tissues. Aptamers are
single-stranded oligonucleotide (DNA or RNA) ligands capable of high
selectivity and high affinity detection of biological analytes. Aptamers
can be developed in vitro by the systematic evolution of ligands by
exponential enrichment (SELEX) to recognize nonimmunogenic targets,
including neutral carbohydrates. This study utilizes the SELEX method
to generate RNA aptamers that bind specifically to the unmodified GAGs
heparosan and chondroitin. Binding confirmation and cross-screening against
other GAGs were performed using confocal microscopy, yielding three aptamers
specific to each target. The affinity constant of each RNA aptamer was
obtained from the fluorescent output after interaction with the respective
GAG target immobilized on plates; the KD values were determined to be 0.71–1.0 μM
for all aptamers. With successful chemical modification (to stabilize the
RNA aptamers in biological systems) and fluorescent tagging (to visualize
them), these aptamers could serve as specific detection reagents for these
important GAGs in biological samples.
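KD values of this kind are typically obtained by fitting a one-site saturation binding curve to the fluorescence readings at a series of ligand concentrations. A minimal sketch (illustrative; a simple grid-search least-squares fit, not the authors' analysis pipeline):

```python
import numpy as np

def binding_curve(conc, fmax, kd):
    """One-site saturation binding: F = Fmax * [L] / (KD + [L])."""
    return fmax * conc / (kd + conc)

def fit_kd(conc, signal, kd_grid):
    """Grid-search KD; for each candidate KD the best-fit Fmax has a
    closed form (linear least squares on the known basis curve)."""
    best_kd, best_fmax, best_err = None, None, np.inf
    for kd in kd_grid:
        basis = conc / (kd + conc)
        fmax = (basis @ signal) / (basis @ basis)  # optimal amplitude for this KD
        err = ((signal - fmax * basis) ** 2).sum()
        if err < best_err:
            best_kd, best_fmax, best_err = kd, fmax, err
    return best_kd, best_fmax
```

At the fitted KD, the signal is half of Fmax, which is why KD is read directly off the midpoint of the saturation curve.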
Blending Advertising with Organic Content in E-commerce via Virtual Bids
It has become increasingly common for sponsored content (i.e., paid ads) and non-sponsored content to be displayed jointly to users, especially on e-commerce platforms, so the two types of content may interact to influence users' engagement behavior. In general, sponsored content helps brands achieve their marketing goals and provides ad revenue to the platforms. In contrast, non-sponsored content contributes to the long-term health of the platform by increasing users' engagement. A key challenge for platforms is learning how to blend the two types of content while accounting for their interactions and balancing these business objectives. This paper proposes a system built for this purpose and applied to product detail pages of JD.COM, an e-commerce company. The system achieves three objectives: (a) optimization of competing business objectives via Virtual Bids, allowing the platform to express its valuation of these objectives; (b) modeling of users' click behavior that explicitly considers the influence exerted by the sponsored and non-sponsored content displayed alongside, through a deep learning approach; and (c) a Vickrey-Clarke-Groves (VCG) auction design compatible with the allocation of ads and its induced externalities. Experiments are presented demonstrating the performance of the proposed system. Moreover, our approach is fully deployed and serves all traffic through JD.COM's mobile application.
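The VCG principle charges each participant the externality it imposes on the others. A minimal sketch over a finite set of candidate page allocations (illustrative; the deployed system's allocation space, virtual-bid weighting, and externality model are far richer):

```python
def vcg(values):
    """values[b][a] = bidder b's value for page allocation a.
    Picks the welfare-maximizing allocation; charges each bidder its
    externality: (others' best welfare without b) minus (others'
    welfare at the chosen allocation)."""
    n_alloc = len(values[0])
    welfare = [sum(v[a] for v in values) for a in range(n_alloc)]
    a_star = max(range(n_alloc), key=welfare.__getitem__)
    payments = []
    for b, vb in enumerate(values):
        others_at = welfare[a_star] - vb[a_star]
        others_best = max(welfare[a] - vb[a] for a in range(n_alloc))
        payments.append(others_best - others_at)
    return a_star, payments
```

Encoding a single-item auction as two allocations ("item to bidder 0" vs. "item to bidder 1") recovers the familiar second-price outcome, which is the sanity check below.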
Dietary Fiber Intake and Endometrial Cancer Risk: A Systematic Review and Meta-Analysis
Epidemiological studies are inconclusive regarding the association between dietary fiber intake and endometrial cancer risk. Thus, we aimed to conduct a meta-analysis to clarify the association between dietary fiber and endometrial cancer risk. We searched the PubMed and ISI Web databases for relevant studies through March 2018. The association between dietary fiber and endometrial cancer risk was evaluated by conducting a meta-analysis including 3 cohort and 12 case–control studies. A significant negative association was observed between total dietary fiber intake and endometrial cancer risk in 11 case–control studies (odds ratios (OR) 0.76, 95% confidence interval (CI): 0.64–0.89, I2 = 35.2%, p = 0.117), but a marginal positive association was observed in three cohort studies (relative risk (RR) 1.22, 95% CI: 1.00–1.49, I2 = 0.0%, p = 0.995). Particularly, a negative association was observed in North America (OR = 0.70, 95% CI: 0.59–0.83, I2 = 8.9%, p = 0.362). In addition, a positive association was observed for cereal fiber (RR = 1.26, 95% CI: 1.03–1.52, I2 = 0.0%, p = 0.530, 3 cohort studies) and a negative association was observed for vegetable fiber (OR = 0.74, 95% CI: 0.58–0.94, I2 = 0.0%, p = 0.445, 3 case–control studies). In conclusion, negative associations with endometrial cancer risk were observed for higher total dietary fiber intake and higher vegetable fiber intake in the case–control studies. However, results from the cohort studies suggested positive relationships of higher total fiber intake and higher cereal fiber intake with endometrial cancer risk.
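Pooled estimates like those above follow standard inverse-variance meta-analysis on the log scale. A minimal fixed-effect sketch that recovers each study's standard error from its reported 95% CI (illustrative; the published analysis may use random-effects models where heterogeneity warrants):

```python
import math

def pool_odds_ratios(ors, cis, z=1.96):
    """Fixed-effect inverse-variance pooling on the log-OR scale.
    Each study's standard error is recovered from its 95% CI:
    se = (ln(upper) - ln(lower)) / (2 * z)."""
    num = den = 0.0
    for or_, (lo, hi) in zip(ors, cis):
        se = (math.log(hi) - math.log(lo)) / (2 * z)
        w = 1.0 / se ** 2                          # inverse-variance weight
        num += w * math.log(or_)
        den += w
    log_pooled = num / den
    se_pooled = math.sqrt(1.0 / den)
    return math.exp(log_pooled), (
        math.exp(log_pooled - z * se_pooled),
        math.exp(log_pooled + z * se_pooled),
    )
```

Because pooling happens on the log scale, a single study pools to itself, and studies with tighter CIs dominate the combined estimate.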