152 research outputs found

Domain-Agnostic Molecular Generation with Self-feedback

Author: Chen Huajun
Chen Zhuo
Fan Xiaohui
Fang Yin
Zhang Ningyu
Publication date: 01/09/2023

The generation of molecules with desired properties has gained tremendous popularity, revolutionizing the way scientists design molecular structures and providing valuable support for chemical and drug design. However, despite the potential of language models in molecule generation, they face numerous challenges such as the generation of syntactically or chemically flawed molecules, narrow domain focus, and limitations in creating diverse and directionally feasible molecules due to a dearth of annotated data or external molecular databases. To this end, we introduce MolGen, a pre-trained molecular language model tailored specifically for molecule generation. MolGen acquires intrinsic structural and grammatical insights by reconstructing over 100 million molecular SELFIES, while facilitating knowledge transfer between different domains through domain-agnostic molecular prefix tuning. Moreover, we present a self-feedback paradigm that inspires the pre-trained model to align with the ultimate goal of producing molecules with desirable properties. Extensive experiments on well-known benchmarks confirm MolGen's optimization capabilities, encompassing penalized logP, QED, and molecular docking properties. Further analysis shows that MolGen can accurately capture molecule distributions, implicitly learn their structural characteristics, and efficiently explore chemical space. The pre-trained model, codes, and datasets are publicly available for future research at https://github.com/zjunlp/MolGen. Comment: Work in progress. Add results of binding affinit

arXiv.org e-Print Archive

Graph Sampling-based Meta-Learning for Molecular Property Prediction

Author: Chen Huajun
Ding Keyan
Fang Yin
Wu Bin
Zhang Qiang
Zhuang Xiang
Publication date: 29/06/2023

Molecular properties are usually observed with a limited number of samples, and researchers have considered property prediction as a few-shot problem. One important fact that has been ignored by prior works is that each molecule can be recorded with several different properties simultaneously. To effectively utilize many-to-many correlations of molecules and properties, we propose a Graph Sampling-based Meta-learning (GS-Meta) framework for few-shot molecular property prediction. First, we construct a Molecule-Property relation Graph (MPG): molecules and properties are nodes, while property labels decide edges. Second, to utilize the topological information of the MPG, we reformulate an episode in meta-learning as a subgraph of the MPG, containing a target property node, molecule nodes, and auxiliary property nodes. Third, as episodes in the form of subgraphs are no longer independent of each other, we propose to schedule the subgraph sampling process with a contrastive loss function, which considers the consistency and discrimination of subgraphs. Extensive experiments on 5 commonly-used benchmarks show GS-Meta consistently outperforms state-of-the-art methods by 5.71%-6.93% in ROC-AUC and verify the effectiveness of each proposed module. Our code is available at https://github.com/HICAI-ZJU/GS-Meta. Comment: Accepted by IJCAI 202

arXiv.org e-Print Archive

Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering

Author: Chen Huajun
Chen Zhuo
Cheng Lei
Fang Yin
Li Fangming
Lu Yanxi
Zhang Wen
Zhang Yichi
Publication date: 11/11/2023

Recently, the development of large language models (LLMs) has attracted wide attention in academia and industry. Deploying LLMs to real scenarios is one of the key directions in the current Internet industry. In this paper, we present a novel pipeline to apply LLMs for domain-specific question answering (QA) that incorporates domain knowledge graphs (KGs), addressing an important direction of LLM application. As a real-world application, the content generated by LLMs should be user-friendly to serve the customers. Additionally, the model needs to utilize domain knowledge properly to generate reliable answers. These two issues are the major difficulties in LLM application, as vanilla fine-tuning cannot adequately address them. We think both requirements can be unified as a model preference problem in which the model needs to align with humans to achieve practical application. Thus, we introduce Knowledgeable Preference AlignmenT (KnowPAT), which constructs two kinds of preference sets, called the style preference set and the knowledge preference set, to tackle the two issues. Besides, we design a new alignment objective to align the LLM preference with human preference, aiming to train a better LLM for real-scenario domain-specific QA to generate reliable and user-friendly answers. Adequate experiments and comprehensive comparisons with 15 baseline methods demonstrate that our KnowPAT outperforms existing pipelines for real-scenario domain-specific QA with LLMs. Our code is open-source at https://github.com/zjukg/KnowPAT. Comment: Work in progress. Code is available at https://github.com/zjukg/KnowPA

arXiv.org e-Print Archive

Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models

Author: Chen Huajun
Chen Zhuo
Fan Xiaohui
Fang Yin
Huang Rui
Liang Xiaozhuan
Liu Kangwei
Zhang Ningyu
Publication date: 29/08/2023

Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a meticulously curated, comprehensive instruction dataset expressly designed for the biomolecular realm. Mol-Instructions is composed of three pivotal components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions, each curated to enhance the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on the representative LLM, we underscore the potency of Mol-Instructions to enhance the adaptability and cognitive acuity of large models within the complex sphere of biomolecular studies, thereby promoting advancements in the biomolecular research community. Mol-Instructions is made publicly accessible for future research endeavors and will be subjected to continual updates for enhanced applicability. Comment: Project homepage: https://github.com/zjunlp/Mol-Instructions. Add quantitative evaluation

arXiv.org e-Print Archive

DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning

Author: Chen Huajun
Chen Jiaoyan
Chen Zhuo
Fang Yin
Geng Yuxia
Huang Yufeng
Pan Jeff Z.
Zhang Wen
Publication date: 16/02/2023

Zero-shot learning (ZSL) aims to predict unseen classes whose samples have never appeared during training. Among the most effective and widely used kinds of semantic information for zero-shot image classification are attributes, which are annotations of class-level visual characteristics. However, current methods often fail to discriminate subtle visual distinctions between images, due not only to the shortage of fine-grained annotations but also to attribute imbalance and co-occurrence. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) developed a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images; (2) applied an attribute-level contrastive learning strategy to further enhance the model's discrimination on fine-grained visual characteristics against the attribute co-occurrence and imbalance; (3) proposed a multi-task learning policy for considering multi-model objectives. We find that our DUET can achieve state-of-the-art performance on three standard ZSL benchmarks and a knowledge graph equipped ZSL benchmark. Its components are effective and its predictions are interpretable. Comment: AAAI 2023 (Oral). Repository: https://github.com/zjukg/DUE

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Massively parallel pyrosequencing-based transcriptome analyses of small brown planthopper (Laodelphax striatellus), a vector insect transmitting rice stripe virus (RSV)

Author: Chen Xiaoying
Fang Rongxiang
Guo Hongyan
Qian Wei
Wang Shengyue
Zhang Fujie
Zheng Huajun
Zhou Tong
Zhou Yijun
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The small brown planthopper (Laodelphax striatellus) is an important agricultural pest that not only damages rice plants by sap-sucking, but also acts as a vector that transmits rice stripe virus (RSV), which can cause even more serious yield loss. Despite being a model organism for studying entomology, population biology, plant protection, and molecular interactions among plants, viruses and insects, only a few genomic sequences are available for this species. To investigate its transcriptome and determine the differences between viruliferous and naïve L. striatellus, we employed 454-FLX high-throughput pyrosequencing to generate EST databases of this insect. Results: We obtained 201,281 and 218,681 high-quality reads from viruliferous and naïve L. striatellus, respectively, with an average read length of 230 bp. These reads were assembled into contigs and two EST databases were generated. When all reads were combined, 16,885 contigs and 24,607 singletons (a total of 41,492 unigenes) were obtained, which represents a transcriptome of the insect. BlastX search against the NCBI-NR database revealed that only 6,873 (16.6%) of these unigenes have significant matches. Comparison of the distribution of GO classification among viruliferous, naïve, and combined EST databases indicated that these libraries are broadly representative of the L. striatellus transcriptomes. Functionally diverse transcripts from RSV, endosymbiotic bacteria Wolbachia and yeast-like symbiotes were identified, which reflects the possible lifestyles of these microbial symbionts that live in the cells of the host insect. Comparative genomic analysis revealed that L. striatellus encodes innate immunity regulatory systems similar to those of other insects, such as RNA interference, JAK/STAT and partial Imd cascades, which might be involved in defense against viral infection. In addition, we determined the differences in gene expression between vector and naïve samples, which generated a list of candidate genes that are potentially involved in the symbiosis of L. striatellus and RSV. Conclusions: To our knowledge, the present study is the first description of a genomic project for L. striatellus. The identification of transcripts from RSV, Wolbachia, yeast-like symbiotes and genes abundantly expressed in viruliferous insects provides a starting point for investigating the molecular basis of symbiosis among these organisms.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Rethinking Uncertainly Missing and Ambiguous Visual Modality in Multi-Modal Entity Alignment

Author: Chen Huajun
Chen Jiaoyan
Chen Zhuo
Fang Yin
Guo Lingbing
Li Yangning
Pan Jeff Z.
Zhang Wen
Zhang Yichi
Publication venue: Springer Nature Switzerland AG
Publication date: 27/10/2023

Edinburgh Research Explorer

Mechanical and Electrical Properties of a CFETR CSMC Conductor under Transverse Mechanical Loadings

Author: Hao Qiangwang
Jin Huan
Liu Fang
Liu Huajun
Nijhuis A.
Qin Jinggang
Shi Yi
Wu Yu
Yagotintsev K.
Zhou Chao
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/09/2018

The central solenoid model coil (CSMC) project of the China Fusion Engineering Test Reactor was launched in 2014 to verify the technological feasibility of a large-scale superconducting magnet at the Institute of Plasma Physics, Chinese Academy of Sciences. The short twist pitch design recommended by CEA is chosen for the CSMC Nb3Sn cable-in-conduit conductors. In order to better understand the evolution of transport properties and coupling losses related to the effect of electromagnetic load cycles, the mechanical and electrical properties were measured and investigated employing a special cryogenic press facility for the transverse mechanical loadings. The results show that the transverse compression (dy) versus applied load force (Fy) differs between the first and subsequent loading cycles. This mechanical behavior can be interpreted by the combination of strand bending between the crossovers and strand deformation at the crossovers. The fitting relations of dy versus Fy are also presented. The evolution of interstrand contact resistance (Rc) in the cabling stages with cyclic history and pressure effects is discussed. In addition, a fitting relation of Rc versus Fy is presented based on a combination of strand microsliding and copper matrix resistivity. A clear correlation between intrapetal resistance Rc and coupling loss is also found.

University of Twente Research Information

Functional soil organic matter fractions in response to long-term fertilization in upland and paddy systems in South China

Author: Blagodatskaya Evgenia
Fang Huajun
Kuzyakov Yakov
Li Zhongfang
Liu Kailou
Lou Yilai
Meersmans Jeroen
Tian Jing
Yang Fan
Yang Hao
Zhou Yi
Publication venue: Elsevier BV
Publication date: 01/03/2018

Soil organic matter (SOM) and its fractions play key roles in optimizing crop yield and improving soil quality. However, how functional SOM fractions respond to long-term fertilization, and their relative importance for C sequestration, have received little attention. In this study, we determined the effects of long-term fertilization on six functional SOM fractions (unprotected, physically protected, physico-biochemically protected, physico-chemically protected, chemically protected, and biochemically protected) based on two long-term fertilization experiments carried out in South China. The unprotected coarse particulate organic matter (cPOM) and the biochemically and chemically protected silt-sized fractions (NH-dSilt and H-dSilt) were the primary C storage fractions under long-term fertilization, accounting for 23.6–46.2%, 15.7–19.4%, and 14.4–17.4% of the total soil organic carbon (SOC) content in upland soil and 19.5–29.3%, 9.9–15.5%, and 14.2–17.2% of the total SOC content in paddy soil, respectively. Compared with the control treatment (CK) in upland soil, the application of manure combined with mineral NPK (NPKM) resulted in an increase in the SOC content in the cPOM, the pure physically protected fraction (iPOM), the physico-chemically protected fraction (H-μSilt), and the chemically protected fraction (H-dSilt) by 233%, 166%, 124%, and 58%, respectively. Besides, the SOC increases in upland soil, expressed as SOC content per unit of total SOC, were the highest for iPOM, H-μSilt, cPOM and H-dSilt, at 283%, 248%, 194%, and 105%, respectively. In paddy soil, the highest increase per unit of total SOC was in H-dSilt (190%), followed by H-dClay (156%) and H-μSilt (155%). These results suggested that the upland soil could stabilize more C through the pure physical protection mechanism, whereas the chemical protection mechanism played a more important role in paddy soil. The chemical protection mechanism within the microaggregates played important roles in sequestering C in both upland and paddy soils. Overall, the different responses of functional SOM fractions to long-term fertilization indicate different mechanisms for SOM cycling in terms of C sequestration under upland and paddy systems.

Crossref

Cranfield CERES