10 research outputs found

    MoDS: Model-oriented Data Selection for Instruction Tuning

    Instruction tuning has become the de facto method to equip large language models (LLMs) with the ability to follow user instructions. Usually, hundreds of thousands or millions of instruction-following pairs are employed to fine-tune the foundation LLMs. Recently, some studies have shown that a small amount of high-quality instruction data is enough. However, how to select appropriate instruction data for a given LLM is still an open problem. To address this problem, in this paper we present a model-oriented data selection (MoDS) approach, which selects instruction data based on new criteria considering three aspects: quality, coverage and necessity. First, our approach uses a quality evaluation model to extract a high-quality subset from the original instruction dataset, and then applies an algorithm to further select from the high-quality subset a seed instruction dataset with good coverage. The seed dataset is used to fine-tune the foundation LLM to obtain an initial instruction-following LLM. Finally, we develop a necessity evaluation model to identify the instruction data on which the initial instruction-following LLM performs poorly, and treat them as necessary instructions for further improving the LLM. In this way, we obtain a small, high-quality, broad-coverage and high-necessity subset of the original instruction datasets. Experimental results show that the model fine-tuned with 4,000 instruction pairs selected by our approach can perform better than the model fine-tuned with the full original dataset, which includes 214k instruction data.
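
    The three selection stages described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Item` record, its precomputed score fields, the thresholds, and the use of greedy farthest-point selection for the coverage step are all hypothetical stand-ins (the abstract only says "an algorithm"), and the fine-tuning step is left as a comment.

```python
from collections import namedtuple
from math import dist

# Hypothetical record: the score fields stand in for the outputs of the
# quality and necessity evaluation models, which are not reproduced here.
Item = namedtuple("Item", "text quality embedding response_score")


def quality_filter(items, threshold):
    """Stage 1: keep items whose quality score reaches the threshold."""
    return [x for x in items if x.quality >= threshold]


def coverage_select(items, k):
    """Stage 2 (assumed algorithm): greedy farthest-point selection."""
    if not items or k <= 0:
        return []
    chosen = [items[0]]
    remaining = list(items[1:])
    while remaining and len(chosen) < k:
        # Pick the item farthest from its nearest already-chosen item.
        best = max(
            remaining,
            key=lambda x: min(dist(x.embedding, c.embedding) for c in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


def necessity_select(items, seed, threshold):
    """Stage 3: keep non-seed items that the initial model answers poorly."""
    return [x for x in items if x not in seed and x.response_score < threshold]


def mods_select(items, quality_threshold, seed_size, necessity_threshold):
    high_quality = quality_filter(items, quality_threshold)
    seed = coverage_select(high_quality, seed_size)
    # In a real pipeline the seed set would fine-tune the foundation LLM here,
    # and response_score would be computed from that model's outputs.
    necessary = necessity_select(high_quality, seed, necessity_threshold)
    return seed + necessary


# Toy data with made-up scores and 2-D embeddings.
toy = [
    Item("a", 0.90, (0.0, 0.0), 0.2),
    Item("b", 0.80, (0.1, 0.0), 0.9),
    Item("c", 0.95, (5.0, 5.0), 0.8),
    Item("d", 0.30, (9.0, 9.0), 0.1),
    Item("e", 0.85, (2.0, 2.0), 0.1),
]
selected = mods_select(toy, quality_threshold=0.5, seed_size=2, necessity_threshold=0.5)
print([x.text for x in selected])  # ['a', 'c', 'e']
```

    In the toy run, "d" is dropped for low quality, "a" and "c" form the seed because they are far apart, and "e" is added because its (made-up) response score is low.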

    ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model

    During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping LLMs' capabilities. To accelerate research on LLMs, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpora focus mainly on English, and there is still a lack of a complete tool-chain for extracting clean texts from web data. Furthermore, fine-grained information about the corpus, e.g. the quality of each text, is missing. To address these challenges, we propose in this paper a new complete tool-chain, EvalWeb, to extract clean Chinese texts from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicitly noisy texts from the raw crawled web contents. Second, a well-designed evaluation model is leveraged to assess the remaining relatively clean data, and each text is assigned a specific quality score. Finally, an appropriate threshold can be used to select high-quality pre-training data for Chinese. Using our proposed approach, we release the largest and latest large-scale high-quality Chinese web text dataset, ChineseWebText, which consists of 1.42 TB of data, with each text associated with a quality score, allowing LLM researchers to choose data according to the desired quality thresholds. We also release a much cleaner subset of 600 GB of Chinese data with quality exceeding 90%.

    Sememe knowledge and auxiliary information enhanced approach for sarcasm detection

    Sarcasm expression is a pervasive literary technique in which people intentionally express the opposite of what is implied. Accurate detection of sarcasm in a text can facilitate the understanding of speakers' true intentions and benefit other natural language processing tasks, especially sentiment analysis. Since sarcasm is a kind of implicit sentiment expression and speakers deliberately confuse the audience, it is challenging to detect sarcasm from text alone. Existing approaches based on machine learning and deep learning achieve unsatisfactory performance when handling sarcastic text that uses complex expressions or requires specific background knowledge to understand. In particular, due to the characteristics of the Chinese language itself, sarcasm detection in Chinese is more difficult. To alleviate this dilemma in Chinese sarcasm detection, we propose a sememe and auxiliary enhanced attention neural model, SAAG. At the word level, we introduce sememe knowledge to enhance the representation learning of Chinese words. A sememe is the minimum unit of meaning, which provides a fine-grained portrayal of a word. At the sentence level, we leverage auxiliary information, such as the news title, to learn the representation of the context and background of the sarcasm expression. Then, we construct the representation of the text expression progressively and dynamically. The evaluation on a sarcasm dataset, consisting of comments on news text, reveals that our proposed approach is effective and outperforms the state-of-the-art models.

    Multi-perspective contrastive learning framework guided by sememe knowledge and label information for sarcasm detection

    Sarcasm is a prevailing rhetorical device that intentionally uses words whose literal meaning is the opposite of the intended meaning. Because of this deliberate ambiguity, accurately detecting sarcasm can aid the comprehension of users' real intentions. Therefore, sarcasm detection is a critical and challenging task for sentiment analysis. In previous research, neural network-based models are generally unsatisfactory when dealing with complex sarcastic expressions. To ameliorate this situation, we propose a multi-perspective contrastive learning framework for sarcasm detection, called SLGC, which is guided by sememe knowledge and label information and built on a pre-trained neural model. For the in-instance perspective, we leverage the sememe, the minimum unit of meaning, to guide contrastive learning to produce high-quality sentence representations. For the between-instance perspective, we utilize label information to guide contrastive learning to mine potential interaction relationships between sarcastic expressions. Experiments on two public benchmark sarcasm detection datasets demonstrate that our approach significantly outperforms the current state-of-the-art model.

    Genome-Wide Dissection of Quan 9311A Breeding Process and Application Advantages

    Germplasm resource innovation is a crucial factor for cultivar development, particularly within the context of hybrid rice breeding based on the three-line system. Quan 9311A, a cytoplasmic male sterile (CMS) line, has been successfully cultivated using rice restoration materials and extensively employed as a female parent in hybrid breeding programs in China. This line was developed by crossing the CMS line Zhong 9A with a two-line restorer line 93-11, with the intention of eliminating the restoring ability of 93-11 while retaining the sterility gene WA352c from Zhong 9A. Quan 9311A effectively amalgamates the most favorable agronomic traits from both parental lines. In this study, the relationship between phenotypic characteristics and the known functional genes of Quan 9311A was analyzed using the rice genome navigation technology based on whole-genome sequencing. The findings revealed that Quan 9311A harbors multiple superior alleles from both 93-11 and Zhong 9A, providing exceptional agronomic traits that are unavailable in earlier CMS lines. Despite the removal of the fertility restorer gene Rf3 from 93-11, numerous chromosomal segments from 93-11 persist in the Quan 9311A genome. Furthermore, the hybrid rice Quanyousimiao (QYSM) and the restorer line Wushansimiao (WSSM) were used as examples to illustrate the important role of Quan 9311A as the female parent in heterosis. It was found that QYSM carries a great number of superior alleles, which accounts for its high grain yield and wide adaptability. These insights not only advanced the utilization of hybrid rice pairing groups but also provided guidance for future breeding endeavors. The study introduced innovative concepts to further integrate genomics with traditional breeding techniques. Ultimately, Quan 9311A marked a significant milestone in rice breeding technology, opening up novel avenues for hybrid rice development.

    Exploring the impact of prenatal perfluoroalkyl and polyfluoroalkyl substances exposure on blood pressure in early childhood: A longitudinal analysis

    Previous research investigating the correlation between prenatal exposure to per- and polyfluoroalkyl substances (PFAS) and subsequent blood pressure (BP) in offspring has yielded limited and contradictory findings. This study was conducted to investigate the potential relationship between maternal PFAS levels during pregnancy and subsequent BP in early childhood. A total of 129 expectant mothers from the Shanghai Birth Cohort were included in the study. Using high-performance liquid chromatography/tandem mass spectrometry, we measured ten PFAS compounds in maternal plasma throughout the pregnancy. When the children reached the age of 4, we examined their systolic BP (SBP) and diastolic BP (DBP), along with mean arterial pressure (MAP) and pulse pressure (PP). Data interpretation employed multiple linear and logistic regression models, complemented by Bayesian kernel machine regression (BKMR). We found that the majority of PFAS concentrations remained stable during pregnancy. The linear and BKMR models indicated a positive relationship between the PFAS mixture in maternal plasma and offspring's DBP and MAP, with perfluorohexanesulphonic acid (PFHxS) having the most significant influence (PFHxS and DBP [first trimester: β=3.03, 95% CI: (1.01, 5.05); second trimester: β=2.35, 95% CI: (0.94, 3.75); third trimester: β=2.57, 95% CI: (0.80, 4.34)]; MAP [first trimester: β=2.55, 95% CI: (0.64, 4.45); second trimester: β=2.28, 95% CI: (0.95, 3.61); third trimester: β=2.35, 95% CI: (0.68, 4.01)]). Logistic regression highlighted an increased risk of prehypertension and hypertension in offspring with higher maternal PFHxS concentrations during all three trimesters [first trimester: OR=2.53, 95% CI: (1.11, 5.79); second trimester: OR=2.05, 95% CI: (1.11, 3.78); third trimester: OR=3.08, 95% CI: (1.40, 6.79)]. A positive correlation was identified between the half-lives of PFAS and the odds ratio (OR) of prehypertension and hypertension in childhood (β=0.139, P=0.010). In conclusion, this research found maternal plasma PFAS concentrations to be positively associated with BP in offspring, with PFHxS showing the most significant influence. This correlation remained consistent throughout pregnancy, and this effect was proportional to the half-lives of PFAS.