2,451 research outputs found

Enable Language Models to Implicitly Learn Self-Improvement From Data

Author: Hou Le
Ji Heng
Li Yunxuan
Lu Tianjian
Wang Ziqi
Wu Yuexin
Yu Hongkun
Publication date: 05/10/2023

Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks. However, the inherent open-ended nature of these tasks implies that there is always room for improvement in the quality of model responses. To address this challenge, various approaches have been proposed to enhance the performance of LLMs. There has been a growing focus on enabling LLMs to self-improve their response quality, thereby reducing the reliance on extensive human annotation efforts for collecting diverse and high-quality training data. Recently, prompting-based methods have been widely explored among self-improvement methods owing to their effectiveness, efficiency, and convenience. However, those methods usually require explicitly and thoroughly written rubrics as inputs to LLMs. It is expensive and challenging to manually derive and provide all necessary rubrics with a real-world complex goal for improvement (e.g., being more helpful and less harmful). To this end, we propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data. PIT only requires preference data that are used to train reward models without extra human efforts. Specifically, we reformulate the training objective of reinforcement learning from human feedback (RLHF) -- instead of maximizing response quality for a given input, we maximize the quality gap of the response conditioned on a reference response. In this way, PIT is implicitly trained with the improvement goal of better aligning with human preferences. Experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods. Comment: 28 pages, 5 figures, 4 tables

arXiv.org e-Print Archive

Augmentation with Projection: Towards an Effective and Efficient Data Augmentation Paradigm for Distillation

Author: Hou Le
Ji Heng
Li Jing
Liu Daogao
Liu Frederick
Wang Ziqi
Wu Yuexin
Yu Hongkun
Publication date: 10/03/2023

Knowledge distillation is one of the primary methods of transferring knowledge from large to small models. However, it requires massive task-specific data, which may not be available in many real-world applications. Data augmentation methods such as representation interpolation, token replacement, or augmentation with models are applied to tackle this problem. However, these data augmentation methods either potentially cause shifts in decision boundaries (representation interpolation), are not expressive enough (token replacement), or introduce too much computational overhead (augmentation with models). To this end, we propose AugPro (Augmentation with Projection), an effective and efficient data augmentation method for distillation. Our method builds on top of representation interpolation augmentation methods to maintain the diversity of expressions and converts the augmented data to tokens to avoid shifting decision boundaries. It uses simple operations that come with little computational overhead. The results on multiple GLUE tasks show that our method can improve distillation performance by a large margin at a low time cost. Code is available at https://github.com/google-research/google-research/tree/master/augpro. Comment: 20 pages, 5 figures. Accepted by ICLR 202

arXiv.org e-Print Archive

Speeding up tandem mass spectrometry-based database searching by longest common prefix

Author: Chi Hao
Fu Yan
He Si-Min
Li You
Sun Rui-Xiang
Wang Le-Heng
Wu Yan-Jie
Zhou Chen
Publication venue: BioMed Central
Publication date: 01/01/2010

Crossref

Springer - Publisher Connector

PubMed Central

Development of Simple Sequence Repeats (SSR) Markers in Setaria italica (Poaceae) and Cross-Amplification in Related Species

Author: Chang-Sheng Kuoh
Chih-Yun Chiang
Cipriani
Doust
Ellegren
Fukunaga
Graham
Gupta
Hakki
Heng-Sheng Lin
Hirano
Jarne
Jia
Jia
Kalinowski
Kantety
Knapik
Le Thierry D’ennequin
Lunt
Metais
Roder
Rousset
Rozen
ScHontz
Song-Bin Chang
Wang
Zane
Publication venue: Molecular Diversity Preservation International (MDPI)
Publication date: 01/11/2011

Foxtail millet is one of the world’s oldest cultivated crops. It has been adopted as a model organism for providing a deeper understanding of plant biology. In this study, 45 simple sequence repeats (SSR) markers of Setaria italica were developed. These polymorphic markers were screened in 223 samples from 12 foxtail millet populations around Taiwan. The most common dinucleotide and trinucleotide repeat motifs are AC/TG (84.21%) and CAT (46.15%). The average number of alleles (Na), the average observed heterozygosity (Ho), and the average expected heterozygosity (He) are 3.73, 0.714, and 0.587, respectively. In addition, 24 SSR markers showed transferability to six related Poaceae species. These new markers provide tools for examining genetic relatedness among foxtail millet populations and other related species. They are suitable for germplasm management and protection in Poaceae.

Multidisciplinary Digital Publishing Institute

Crossref

Directory of Open Access Journals

PubMed Central

Line-Monitoring, Hyperspectral Fluorescence Setup for Simultaneous Multi-Analyte Biosensing

Author: Bro
Castillo
Fang
Glasenapp
Heng Shi
Huang
Hui Ma
Le Liu
Liu
Liu
Martinez
Moczko
Moczko
Murchie
Nagarajan
Nicolini
Roy
Schena
Sinclair
Soelberg
Suihua Ma
Sunan Deng
Vala
Wang
Whitesides
Wu
Yanhong Ji
Yonghong He
Zhiyi Liu
Publication venue: Molecular Diversity Preservation International (MDPI)
Publication date: 01/10/2011

Conventional fluorescence scanners utilize multiple filters to distinguish different fluorescent labels, and problems arise because of this filter-based mechanism. In this work we propose a line-monitoring, hyperspectral fluorescence technique which is designed and optimized for applications in multi-channel microfluidic systems. In contrast to the filter-based mechanism, which only records fluorescent intensities, the hyperspectral technique records the full spectrum for every point on the sample plane. Multivariate data exploitation is then applied to spectra analysis to determine ratios of different fluorescent labels and eliminate unwanted artifacts. This sensor is designed to monitor multiple fluidic channels simultaneously, providing the potential for multi-analyte biosensing. The detection sensitivity is approximately 0.81 fluors/μm², and the sensor is shown to have good homogeneity. Finally, a model experiment of detecting short oligonucleotides has demonstrated the biomedical application of this hyperspectral fluorescence biosensor.

Crossref

Directory of Open Access Journals

PubMed Central

An Updated Search of Steady TeV $\gamma$-Ray Point Sources in Northern Hemisphere Using the Tibet Air Shower Array

Using the data taken from Tibet II High Density (HD) Array (1997 February-1999 September) and Tibet-III array (1999 November-2005 November), our previous northern sky survey for TeV $\gamma$-ray point sources has now been updated by a factor of 2.8 improved statistics. From $0.0^{\circ}$ to $60.0^{\circ}$ in declination (Dec) range, no new TeV $\gamma$-ray point sources with sufficiently high significance were identified while the well-known Crab Nebula and Mrk421 remain to be the brightest TeV $\gamma$-ray sources within the field of view of the Tibet air shower array. Based on the currently available data and at the 90% confidence level (C.L.), the flux upper limits for different power law index assumption are re-derived, which are approximately improved by 1.7 times as compared with our previous reported limits. Comment: This paper has been accepted by hepn

arXiv.org e-Print Archive

Crossref

Human genomic Z-DNA segments probed by the Zα domain of ADAR1

Author: Ambrose
Bacolla
Black
Champ
Charlesworth
Dickerson
Droge
Floridia
Garner
Ha
Heller
Heng Li
Henikoff
Herbert
Herbert
Jie Xiao
Jinming Li
Khuu
Kim
Kouzine
Le Lu
Liu
Liu
Marschall
Muller
Oh
Peter Dröge
Reynolds
Rich
Rothenburg
Schade
Schroth
Schwartz
Shu Feng
Sinden
Sullivan
Sumer
Wahls
Wang
Wang
Wang
Wang
Wittig
Wolfl
Wong
Zimmerman
Publication venue: Oxford University Press

Double-stranded DNA is a dynamic molecule that adopts different secondary structures. Experimental evidence indicates Z-DNA plays roles in DNA transactions such as transcription, chromatin remodeling and recombination. Furthermore, our computational analysis revealed that sequences with high Z-DNA forming potential at moderate levels of DNA supercoiling are enriched in human promoter regions. However, the actual distribution of Z-DNA segments in genomes of mammalian cells has been elusive due to the unstable nature of Z-DNA and lack of specific probes. Here we present a first human genome map of the most stable Z-DNA segments obtained with A549 tumor cells. We used the Z-DNA binding domain, Zα, of the RNA editing enzyme ADAR1 as a probe in conjunction with a novel chromatin affinity precipitation strategy. By applying stringent selection criteria, we identified 186 genomic Z-DNA hotspots. Interestingly, 46 hotspots were located in centromeres of 13 human chromosomes. There was a very strong correlation between these hotspots and high densities of single nucleotide polymorphism. Our study indicates that genetic instability and rapid evolution of human centromeres might, at least in part, be driven by Z-DNA segments. Contrary to in silico predictions, however, we found that only two of the 186 hotspots were located in promoter regions.

Crossref

PubMed Central