13 research outputs found

Query Rewriting for Retrieval-Augmented Large Language Models

Author: Duan Nan
Gong Yeyun
He Pengcheng
Ma Xinbei
Zhao Hai
Publication date: 22/10/2023

Large Language Models (LLMs) act as powerful, black-box readers in the retrieve-then-read pipeline, making remarkable progress in knowledge-intensive tasks. This work introduces a new framework for retrieval-augmented LLMs, Rewrite-Retrieve-Read, which replaces the previous retrieve-then-read pipeline and approaches the problem from the perspective of query rewriting. Unlike prior studies focusing on adapting either the retriever or the reader, our approach adapts the search query itself, because there is inevitably a gap between the input text and the knowledge needed in retrieval. We first prompt an LLM to generate the query, then use a web search engine to retrieve contexts. Furthermore, to better align the query with the frozen modules, we propose a trainable scheme for our pipeline. A small language model is adopted as a trainable rewriter to cater to the black-box LLM reader. The rewriter is trained with reinforcement learning, using feedback from the LLM reader. Evaluation is conducted on downstream tasks: open-domain QA and multiple-choice QA. Experimental results show consistent performance improvement, indicating that our framework is effective and scalable and offers a new approach to retrieval-augmented LLMs. Comment: EMNLP202

arXiv.org e-Print Archive
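
The three stages named in the abstract above (rewrite, retrieve, read) can be sketched as a simple function composition. This is only an illustrative sketch, not the authors' implementation: the function names, the toy corpus, and the stand-in rewriter, search, and reader below are all hypothetical, and the reinforcement-learning training of the rewriter is not shown.

```python
from typing import Callable, List


def rewrite_retrieve_read(
    question: str,
    rewrite: Callable[[str], str],
    search: Callable[[str], List[str]],
    read: Callable[[str, List[str]], str],
) -> str:
    """Run the three stages in order (all three callables are assumptions)."""
    query = rewrite(question)        # stage 1: turn the input into a search query
    contexts = search(query)         # stage 2: retrieve contexts for that query
    return read(question, contexts)  # stage 3: reader answers the original input


# Hypothetical toy stand-ins so the sketch runs without a model or network access.
def toy_rewrite(question: str) -> str:
    return question.rstrip("?").lower()


def toy_search(query: str) -> List[str]:
    corpus = {"capital of france": ["Paris is the capital of France."]}
    return corpus.get(query, [])


def toy_read(question: str, contexts: List[str]) -> str:
    return contexts[0] if contexts else "unknown"


if __name__ == "__main__":
    print(rewrite_retrieve_read("Capital of France?", toy_rewrite, toy_search, toy_read))
```

Passing the stages as callables keeps the reader and the search engine interchangeable, which matches the abstract's framing of them as frozen, black-box modules with only the rewriter being adapted.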

CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion

Author: Dong Anlei
Duan Nan
Gong Yeyun
He Xingwei
Jiao Jian
Jin A-Long
Yiu Siu Ming
Zhang Hang
Publication date: 29/10/2023

The dual-encoder has become the de facto architecture for dense retrieval. Typically, it computes the latent representations of the query and document independently, thus failing to fully capture the interactions between the query and document. To alleviate this, recent research has focused on obtaining query-informed document representations. During training, such methods expand the document with a real query, but during inference they replace the real query with a generated one. This inconsistency between training and inference causes the dense retrieval model to prioritize query information while disregarding the document when computing the document representation. Consequently, it performs even worse than the vanilla dense retrieval model because its performance heavily relies on the relevance between the generated queries and the real query. In this paper, we propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query. By doing so, the retrieval model learns to extend its attention from the document alone to both the document and query, resulting in high-quality query-informed document representations. Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models. Comment: Accepted to EMNLP 202

arXiv.org e-Print Archive

AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators

Author: Chen Weizhu
Duan Nan
Gong Yeyun
He Xingwei
Jiao Jian
Jin A-Long
Lin Chen
Lin Zhenghao
Yiu Siu Ming
Zhang Hang
Publication date: 29/03/2023

Many natural language processing (NLP) tasks rely on labeled data to train machine learning models to achieve high performance. However, data annotation can be a time-consuming and expensive process, especially when the task involves a large amount of data or requires specialized domain knowledge. Recently, GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks. In this paper, we first claim that large language models (LLMs), such as GPT-3.5, can serve as excellent crowdsourced annotators when provided with sufficient guidance and demonstrated examples. To make LLMs better annotators, we propose a two-step approach, 'explain-then-annotate'. More precisely, we begin by creating prompts for every demonstrated example, which we then use to prompt an LLM to explain why the specific ground truth answer/label was chosen for that particular example. Following this, we construct the few-shot chain-of-thought prompt with the self-generated explanations and employ it to annotate the unlabeled data. We conduct experiments on three tasks: user input and keyword relevance assessment, BoolQ, and WiC. The annotation results from GPT-3.5 surpass those from crowdsourced annotation for user input and keyword relevance assessment. For the other two tasks, GPT-3.5 achieves results that are comparable to those obtained through crowdsourced annotation.

arXiv.org e-Print Archive
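
The two-step 'explain-then-annotate' procedure described in the abstract above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's actual prompts or code: the prompt wording, the function names, and the `llm` callable (any function mapping a prompt string to a completion string) are hypothetical.

```python
from typing import Callable, List, Tuple


def build_explanation_prompt(text: str, label: str) -> str:
    """Step 1 prompt (hypothetical wording): ask why the gold label is correct."""
    return f"Input: {text}\nLabel: {label}\nExplain briefly why this label is correct."


def build_few_shot_cot_prompt(demos: List[Tuple[str, str, str]], new_text: str) -> str:
    """Step 2 prompt: demonstrations as (text, explanation, label), then the new input."""
    parts = [
        f"Input: {text}\nExplanation: {explanation}\nLabel: {label}"
        for text, explanation, label in demos
    ]
    parts.append(f"Input: {new_text}\nExplanation:")
    return "\n\n".join(parts)


def explain_then_annotate(
    labeled: List[Tuple[str, str]],
    unlabeled: List[str],
    llm: Callable[[str], str],
) -> List[str]:
    """Generate explanations for labeled examples, then annotate unlabeled ones."""
    # Step 1: the model explains each demonstrated example's ground-truth label.
    demos = [
        (text, llm(build_explanation_prompt(text, label)), label)
        for text, label in labeled
    ]
    # Step 2: the self-generated explanations form a few-shot chain-of-thought prompt.
    return [llm(build_few_shot_cot_prompt(demos, text)) for text in unlabeled]
```

In practice `llm` would wrap a call to whichever model is used; the raw completions returned here would still need parsing to extract the final label.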

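Construção de um protótipo de Data Warehouse como suporte ao sistema de informação numa instituição de ensino superior [Construction of a Data Warehouse Prototype to Support the Information System at a Higher Education Institution]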

Author: Auwerx Johan
Cerione Richard A.
Chen Wei
Choi Brian Hyun
Du Jintang
Hao Quan
He Bin
Jiang Hong
Khan Saba
Kim Jun Huyn
Kim Jungwoo
Lin Hening
Su Xiaoyang
Woo Jimin
Yu Jiu Jiu
Zhang Sheng
Zhou Yeyun
Publication venue: Universidade de Evora
Publication date: 01/01/2011

One difficulty in extracting information within an organization is the lack of integration of existing data, which are scattered across various formats: word processing files, spreadsheets, databases, and other sources. To address this problem, this work proposes a Data Warehouse model that organizes, stores, and integrates information from other formats and systems in a single database, for future use in supporting decision making. The Data Warehousing community currently follows two main approaches: one advocated by William H. Inmon, which is more data-centric, and one by Ralph Kimball, which is more project-centric. A case study was developed with the proposed methodology to verify and evaluate its applicability at the Polytechnic Institute of Tomar.

Infoscience - École polytechnique fédérale de Lausanne

PubMed Central

Repositório Científico da Universidade de Évora

HKU Scholars Hub

PROM: A Phrase-level Copying Mechanism with Pre-training for Abstractive Summarization

Author: Duan Nan
Gong Yeyun
He Pengcheng
Ma Xinbei
Zhao Hai
Publication date: 11/05/2023

Building on the remarkable achievements of pre-trained language models in abstractive summarization, the copying mechanism has proved helpful by improving factuality, stability, and overall performance. This work proposes PROM, a new PhRase-level cOpying Mechanism that enhances attention on n-grams and can be applied to zero-shot summarization with pre-training. PROM adds an indicator layer to explicitly pick up tokens in n-grams that can be copied from the source, and calculates an auxiliary loss for the copying prediction. Empirical studies show that PROM makes significant improvements in fine-tuning on benchmarks. In the zero-shot setting, PROM is utilized in self-supervised pre-training on raw corpora and provides new general baselines on a wide range of summarization datasets. Further analysis shows that PROM performs more reasonable copying and contributes to faithfulness.

arXiv.org e-Print Archive

A Linkage Framework for the China National Emission Trading System (CETS): Insight from Key Global Carbon Markets

Author: Anil Kumar Shrestha
Chunguang Sheng
Chunyu Pan
Guangyu Wang
Jinliang Li
John L. Innes
John-O. Niles
Kevin Xinwei Wang
Nuyun Li
Yeyun He
Publication venue: Multidisciplinary Digital Publishing Institute
Publication date: 01/07/2021

Given that international collaborative efforts to reduce greenhouse gas (GHG) emissions are urgent and crucial, a critical understanding of the challenges and opportunities of linking China’s newly established national ETS with existing domestic or regional ETSs is essential in order to achieve global emission targets, and may attract other jurisdictions to join in global carbon market development. Against this backdrop, we used thematic analysis to examine the experiences, lessons, and insights from three key global carbon markets, namely North America, the EU, and China, in terms of the barriers to linking the global carbon market, with a focus on China. The four most commonly cited linkage design elements (barriers) were the legal basis; monitoring, reporting, and verification; political feasibility; and the price-management mechanism. Like-minded jurisdictions with similar political views and design features will have a higher chance of linking. Additionally, sustaining market liquidity, widening sectoral coverage, minimizing carbon leakage, ensuring offset quality, and setting transparent allowance and cap-setting rules are crucial steps towards linkage. These outcomes can be used as an ETS linkage-ready design framework for CETS and ETSs under development to overcome barriers to future international ETS linkages.

Directory of Open Access Journals

University of British Columbia: cIRcle - UBC's Information Repository

Bamboo as a Nature-Based Solution (NbS) for Climate Change Mitigation: Biomass, Products, and Carbon Credits

Author: Chen Jialu
He Yeyun
Kozak Robert A.
Li Jinliang
Li Nuyun
Pan Chunyu
Sheng Chunguang
Shrestha Anil K.
Wang Guangyu
Zhou Guomo
Publication venue: Multidisciplinary Digital Publishing Institute
Publication date: 01/08/2023

Bamboo, a rapidly growing woody grass prevalent in pan-tropical zones, holds promising potential as a nature-based solution (NbS) for climate change mitigation. In this systematic review of 91 research articles, we critically assess the scope and constraints of bamboo’s role in mitigating climate change across three dimensions: as a carbon sink in biomass form, as carbon storage in bamboo products, and as a contributor to carbon project credits. Our analysis reveals that existing studies disproportionately focus on a limited set of 36 species, such as Phyllostachys pubescens and Bambusa vulgaris, with geographic concentration in Asia (91%) and limited studies from Africa (7%) and South America (1%). While many studies emphasize the carbon-saving benefits of bamboo products compared with traditional goods, there is a noticeable gap in comprehensive evaluations of carbon pools from individual bamboo forests encompassing all product varieties. While bamboo forests offer significant carbon trading potential, their global role is restricted by the absence of internationally accepted methodologies and by debates about classifying bamboo as a tree species. This extensive review highlights the multifaceted value of bamboo in climate change mitigation, underscoring its significance as a critical component for informed policymaking and the development of sustainable practices in future climate strategies worldwide.

Directory of Open Access Journals

University of British Columbia: cIRcle - UBC's Information Repository

Genome-Wide Analysis of the Biosynthesis and Deactivation of Gibberellin-Dioxygenases Gene Family in Camellia sinensis (L.) O. Kuntze

Author: Changjun Jiang
Cheng Pan
Jiayue Jiang
Kunhong Tian
Leigang Wang
Qilu Sun
Qiuyan Ban
Yan He
Yeyun Li
Yuanfei Yang
Yuting Pan
Publication venue: MDPI AG
Publication date: 01/09/2017

Gibberellins (GAs), a class of diterpenoid phytohormones, play a key role in regulating diverse processes throughout the life cycle of plants. Bioactive GA levels are rapidly regulated by Gibberellin-dioxygenases (GAox), which are involved in the biosynthesis and deactivation of gibberellin. In this manuscript, a comprehensive genome-wide analysis was carried out to find all GAox in Camellia sinensis. For the first time in a tea plant, 14 CsGAox genes, containing two domains, DIOX_N (PF14226) and 2OG-FeII_Oxy (PF03171), were identified. These genes all belong to the 2-oxoglutarate-dependent dioxygenases (2-ODD), including four CsGA20ox (EC: 1.14.11.12), three CsGA3ox (EC: 1.14.11.15), and seven CsGA2ox (EC: 1.14.11.13). According to the phylogenetic classification used in Arabidopsis, the CsGAox genes spanned five subgroups. Each CsGAox shows tissue-specific expression patterns, although these vary greatly. Some candidate genes that may play an important role in response to external abiotic stresses have been identified on the basis of these patterns, such as CsGA20ox2, CsGA3ox2, CsGA3ox3, CsGA2ox1, CsGA2ox2, and CsGA2ox4. The bioactive GA levels may be closely related to the GA20ox, GA3ox, and GA2ox genes. In addition, the candidate genes could be used as marker genes for abiotic stress resistance breeding in tea plants.

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

Uptake, Translocation, Metabolism, and Distribution of Glyphosate in Nontarget Tea Plant (Camellia sinensis L.)

Author: Jie Zhou
Lili He
Mengmeng Tong
Ruyan Hou
Wanjun Gao
Weiting Jiao
Yeyun Li

The uptake, translocation, metabolism, and distribution behavior of glyphosate in nontarget tea plant were investigated. Negative effects on the tea saplings appeared when the nutrient solution contained glyphosate above 200 mg L⁻¹. Glyphosate was highest in the roots of the tea plant, where it was also metabolized to aminomethyl phosphonic acid (AMPA). The glyphosate and AMPA in the roots were transported through the xylem or phloem to the stems and leaves. The amount of AMPA in the entire tea plant was less than 6.0% of the amount of glyphosate. The glyphosate level in fresh tea shoots was lower than that in mature leaves on each day. These results indicated that free glyphosate in the soil can be continuously absorbed by, metabolized in, and transported from the roots of the tea tree into edible leaves, and therefore, free glyphosate residues in the soil should be controlled to produce teas free of glyphosate.

FigShare

Genome-Wide Analysis of the Biosynthesis and Deactivation of Gibberellin-Dioxygenases Gene Family in Camellia sinensis (L.) O. Kuntze

Author: Changjun Jiang
Cheng Pan
Jiayue Jiang
Kunhong Tian
Leigang Wang
Qilu Sun
Qiuyan Ban
Yan He
Yeyun Li
Yuanfei Yang
Yuting Pan
Publication venue: MDPI AG

Crossref