
141 research outputs found

Domain-Agnostic Molecular Generation with Self-feedback

Authors: Chen Huajun; Chen Zhuo; Fan Xiaohui; Fang Yin; Zhang Ningyu
Publication date: 01/09/2023

The generation of molecules with desired properties has gained tremendous popularity, revolutionizing the way scientists design molecular structures and providing valuable support for chemical and drug design. However, despite the potential of language models in molecule generation, they face numerous challenges such as the generation of syntactically or chemically flawed molecules, narrow domain focus, and limitations in creating diverse and directionally feasible molecules due to a dearth of annotated data or external molecular databases. To this end, we introduce MolGen, a pre-trained molecular language model tailored specifically for molecule generation. MolGen acquires intrinsic structural and grammatical insights by reconstructing over 100 million molecular SELFIES, while facilitating knowledge transfer between different domains through domain-agnostic molecular prefix tuning. Moreover, we present a self-feedback paradigm that inspires the pre-trained model to align with the ultimate goal of producing molecules with desirable properties. Extensive experiments on well-known benchmarks confirm MolGen's optimization capabilities, encompassing penalized logP, QED, and molecular docking properties. Further analysis shows that MolGen can accurately capture molecule distributions, implicitly learn their structural characteristics, and efficiently explore chemical space. The pre-trained model, codes, and datasets are publicly available for future research at https://github.com/zjunlp/MolGen.
Comment: Work in progress. Add results of binding affinit

arXiv.org e-Print Archive

Differential Responses and Controls of Soil CO2 and N2O Fluxes to Experimental Warming and Nitrogen Fertilization in a Subalpine Coniferous Spruce (Picea asperata Mast.) Plantation Forest

Authors: Hui Dafeng; Li Dandan; Liu Qing; Luo Yiqi; Yin Huajun
Publication venue: Digital Scholarship @ Tennessee State University
Publication date: 17/09/2019

Emissions of greenhouse gases (GHG) such as CO2 and N2O from soils are affected by many factors such as climate change, soil carbon content, and soil nutrient conditions. However, the response patterns and controls of soil CO2 and N2O fluxes to global warming and nitrogen (N) fertilization are still not clear in subalpine forests. To address this issue, we conducted an eight-year field experiment with warming and N fertilization treatments in a subalpine coniferous spruce (Picea asperata Mast.) plantation forest in China. Soil CO2 and N2O fluxes were measured using a static chamber method, and soils were sampled to analyze soil carbon and N contents, soil microbial substrate utilization (MSU) patterns, and microbial functional diversity. Results showed that the mean annual CO2 and N2O fluxes were 36.04 ± 3.77 mg C m−2 h−1 and 0.51 ± 0.11 µg N m−2 h−1, respectively. Soil CO2 flux was affected only by warming, while soil N2O flux was significantly enhanced by N fertilization and its interaction with warming. Warming enhanced dissolved organic carbon (DOC) and MSU, reduced soil organic carbon (SOC) and microbial biomass carbon (MBC), and constrained microbial metabolic activity and microbial functional diversity, resulting in a decrease in soil CO2 emission. Structural equation modeling indicated that MSU had a dominant direct negative effect on soil CO2 flux but a direct positive effect on soil N2O flux. DOC and MBC had indirect positive effects on soil CO2 flux, while soil NH4+-N had a direct negative effect on soil CO2 and N2O fluxes. This study revealed different response patterns and controlling factors of soil CO2 and N2O fluxes in the subalpine plantation forest, and highlighted the importance of soil microbial contributions to GHG fluxes under climate warming and N deposition.

Digital Scholarship @ Tennessee State University

Graph Sampling-based Meta-Learning for Molecular Property Prediction

Authors: Chen Huajun; Ding Keyan; Fang Yin; Wu Bin; Zhang Qiang; Zhuang Xiang
Publication date: 29/06/2023

Molecular properties are usually observed with a limited number of samples, and researchers have considered property prediction as a few-shot problem. One important fact that has been ignored by prior works is that each molecule can be recorded with several different properties simultaneously. To effectively utilize many-to-many correlations of molecules and properties, we propose a Graph Sampling-based Meta-learning (GS-Meta) framework for few-shot molecular property prediction. First, we construct a Molecule-Property relation Graph (MPG): molecules and properties are nodes, while property labels decide edges. Second, to utilize the topological information of the MPG, we reformulate an episode in meta-learning as a subgraph of the MPG, containing a target property node, molecule nodes, and auxiliary property nodes. Third, as episodes in the form of subgraphs are no longer independent of each other, we propose to schedule the subgraph sampling process with a contrastive loss function, which considers the consistency and discrimination of subgraphs. Extensive experiments on 5 commonly-used benchmarks show GS-Meta consistently outperforms state-of-the-art methods by 5.71%-6.93% in ROC-AUC and verify the effectiveness of each proposed module. Our code is available at https://github.com/HICAI-ZJU/GS-Meta.
Comment: Accepted by IJCAI 202
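The graph construction and episode sampling described in this abstract can be illustrated with a small sketch. It is not the GS-Meta implementation: the toy labels, the function names `build_mpg` and `sample_episode`, and the episode dictionary layout are all hypothetical, and plain uniform random sampling stands in for the contrastive-loss-based scheduling the abstract mentions.

```python
import random
from collections import defaultdict

# Toy label table (hypothetical): (molecule, property) -> binary label.
LABELS = {
    ("mol_a", "toxicity"): 1,
    ("mol_a", "solubility"): 0,
    ("mol_b", "toxicity"): 0,
    ("mol_b", "permeability"): 1,
    ("mol_c", "toxicity"): 1,
    ("mol_c", "solubility"): 1,
    ("mol_d", "solubility"): 0,
}


def build_mpg(labels):
    """Build a bipartite molecule-property graph; each known label is an edge."""
    by_property = defaultdict(dict)
    by_molecule = defaultdict(dict)
    for (mol, prop), y in labels.items():
        by_property[prop][mol] = y
        by_molecule[mol][prop] = y
    return by_property, by_molecule


def sample_episode(by_property, by_molecule, target, n_mols, n_aux, rng):
    """Sample one episode as a subgraph around a target property node.

    Assumption: molecules and auxiliary properties are drawn uniformly at
    random, which is a simplification of scheduled subgraph sampling.
    """
    mols = rng.sample(sorted(by_property[target]), n_mols)
    # Auxiliary properties are other properties recorded for the sampled molecules.
    aux_pool = sorted({p for m in mols for p in by_molecule[m]} - {target})
    aux = rng.sample(aux_pool, min(n_aux, len(aux_pool)))
    # Keep only the labeled edges between sampled molecules and selected properties.
    edges = {
        (m, p): by_molecule[m][p]
        for m in mols
        for p in [target, *aux]
        if p in by_molecule[m]
    }
    return {"target": target, "molecules": mols, "aux": aux, "edges": edges}


by_property, by_molecule = build_mpg(LABELS)
episode = sample_episode(by_property, by_molecule, "toxicity", 2, 1, random.Random(0))
```

Because two episodes can share molecule or property nodes, they are not independent draws, which is the situation the abstract's scheduling step is meant to address.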

arXiv.org e-Print Archive

Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering

Authors: Chen Huajun; Chen Zhuo; Cheng Lei; Fang Yin; Li Fangming; Lu Yanxi; Zhang Wen; Zhang Yichi
Publication date: 11/11/2023

Recently, the development of large language models (LLMs) has attracted wide attention in academia and industry. Deploying LLMs to real scenarios is one of the key directions in the current Internet industry. In this paper, we present a novel pipeline to apply LLMs for domain-specific question answering (QA) that incorporates domain knowledge graphs (KGs), addressing an important direction of LLM application. As a real-world application, the content generated by LLMs should be user-friendly to serve the customers. Additionally, the model needs to utilize domain knowledge properly to generate reliable answers. These two issues are the major difficulties in LLM application, as vanilla fine-tuning cannot adequately address them. We think both requirements can be unified as a model preference problem: the model's preferences need to align with those of humans to achieve practical application. Thus, we introduce Knowledgeable Preference AlignmenT (KnowPAT), which constructs two kinds of preference sets, a style preference set and a knowledge preference set, to tackle the two issues respectively. In addition, we design a new alignment objective to align the LLM preference with human preference, aiming to train a better LLM for real-scenario domain-specific QA that generates reliable and user-friendly answers. Extensive experiments and comprehensive comparisons with 15 baseline methods demonstrate that KnowPAT outperforms them as a pipeline for real-scenario domain-specific QA with LLMs. Our code is open-source at https://github.com/zjukg/KnowPAT.
Comment: Work in progress. Code is available at https://github.com/zjukg/KnowPA

arXiv.org e-Print Archive

Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models

Authors: Chen Huajun; Chen Zhuo; Fan Xiaohui; Fang Yin; Huang Rui; Liang Xiaozhuan; Liu Kangwei; Zhang Ningyu
Publication date: 29/08/2023

Large Language Models (LLMs), with their remarkable task-handling capabilities and innovative outputs, have catalyzed significant advancements across a spectrum of fields. However, their proficiency within specialized domains such as biomolecular studies remains limited. To address this challenge, we introduce Mol-Instructions, a meticulously curated, comprehensive instruction dataset expressly designed for the biomolecular realm. Mol-Instructions is composed of three pivotal components: molecule-oriented instructions, protein-oriented instructions, and biomolecular text instructions, each curated to enhance the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors. Through extensive instruction tuning experiments on the representative LLM, we underscore the potency of Mol-Instructions to enhance the adaptability and cognitive acuity of large models within the complex sphere of biomolecular studies, thereby promoting advancements in the biomolecular research community. Mol-Instructions is made publicly accessible for future research endeavors and will be subjected to continual updates for enhanced applicability.
Comment: Project homepage: https://github.com/zjunlp/Mol-Instructions. Add quantitative evaluation

arXiv.org e-Print Archive

DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning

Authors: Chen Huajun; Chen Jiaoyan; Chen Zhuo; Fang Yin; Geng Yuxia; Huang Yufeng; Pan Jeff Z.; Zhang Wen
Publication date: 16/02/2023

Zero-shot learning (ZSL) aims to predict unseen classes whose samples have never appeared during training. Among the most effective and widely used forms of semantic information for zero-shot image classification are attributes, which are annotations of class-level visual characteristics. However, current methods often fail to discriminate subtle visual distinctions between images, due not only to the shortage of fine-grained annotations but also to attribute imbalance and co-occurrence. In this paper, we present a transformer-based end-to-end ZSL method named DUET, which integrates latent semantic knowledge from pre-trained language models (PLMs) via a self-supervised multi-modal learning paradigm. Specifically, we (1) develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images; (2) apply an attribute-level contrastive learning strategy to further enhance the model's discrimination of fine-grained visual characteristics against attribute co-occurrence and imbalance; and (3) propose a multi-task learning policy for considering multi-model objectives. We find that DUET achieves state-of-the-art performance on three standard ZSL benchmarks and a knowledge graph equipped ZSL benchmark. Its components are effective and its predictions are interpretable.
Comment: AAAI 2023 (Oral). Repository: https://github.com/zjukg/DUE

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Shrub type dominates the vertical distribution of leaf C : N : P stoichiometry across an extensive altitudinal gradient

Authors: Hu Jun; Li Dandan; Li Ting; Liu Qing; Reich Peter B.; Yin Chunying; Yin Huajun; Yu Qiannan; Zhao Chunzhang; Zhao Ning; Zhao Wenqiang
Publication venue: Copernicus GmbH
Publication date: 01/01/2018

Understanding leaf stoichiometric patterns is crucial for improving predictions of plant responses to environmental changes. Leaf stoichiometry of terrestrial ecosystems has been widely investigated along latitudinal and longitudinal gradients. However, very little is known about the vertical distribution of leaf C : N : P and the relative effects of environmental parameters, especially for shrubs. Here, we analyzed the shrub leaf C, N and P patterns in 125 mountainous sites over an extensive altitudinal gradient (523-4685 m) on the Tibetan Plateau. Results showed that the shrub leaf C and C : N were 7.3-47.5% higher than those of other regional and global flora, whereas the leaf N and N : P were 10.2-75.8% lower. Leaf C increased with rising altitude and decreasing temperature, supporting the physiological acclimation mechanism that high leaf C (e.g., alpine or evergreen shrub) could balance the cell osmotic pressure and resist freezing. The largest leaf N and high leaf P occurred in the valley region (altitude 1500 m), likely due to the large nutrient leaching from higher elevations, faster litter decomposition and nutrient resorption ability of deciduous broadleaf shrub. Leaf N : P ratio further indicated increasing N limitation at higher altitudes. Interestingly, drought severity was the only climatic factor positively correlated with leaf N and P, which was more appropriate for evaluating the impact of water status than precipitation. Among the shrub ecosystem and functional types (alpine, subalpine, montane, valley, evergreen, deciduous, broadleaf, and conifer), their leaf element contents and responses to environments were remarkably different. Shrub type was the largest contributor to the total variations in leaf stoichiometry, while climate indirectly affected the leaf C : N : P via its interactive effects on shrub type or soil.
Collectively, the large heterogeneity in shrub type was the most important factor explaining the overall leaf C : N : P variations, despite the broad climate gradient on the plateau. Temperature- and drought-induced shifts in shrub type distribution will influence the nutrient accumulation in mountainous shrubs. © Author(s) 2018

Western Sydney ResearchDirect

Rethinking Uncertainly Missing and Ambiguous Visual Modality in Multi-Modal Entity Alignment

Authors: Chen Huajun; Chen Jiaoyan; Chen Zhuo; Fang Yin; Guo Lingbing; Li Yangning; Pan Jeff Z.; Zhang Wen; Zhang Yichi
Publication venue: Springer Nature Switzerland AG
Publication date: 27/10/2023

Edinburgh Research Explorer

Towards Semantic e-Science for Traditional Chinese Medicine

Authors: Chen Huajun; Cui Meng; Deng Shuiguang; Feng Yi; Jiang Xiaohong; Mao Yuxin; Tang Jinming; Wu Zhaohui; Yin Aining; Zheng Xiaoqing; Zhou Chunying
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Recent advances in Web and information technologies, together with the increasing decentralization of organizational structures, have resulted in massive amounts of information resources and domain-specific services in Traditional Chinese Medicine (TCM). The massive volume and diversity of information and services available have made it difficult to achieve seamless and interoperable e-Science for knowledge-intensive disciplines like TCM. Therefore, information integration and service coordination are two major challenges in e-Science for TCM. We still lack sophisticated approaches to integrate scientific data and services for TCM e-Science.
Results: We present a comprehensive approach to build dynamic and extendable e-Science applications for knowledge-intensive disciplines like TCM based on semantic and knowledge-based techniques. The semantic e-Science infrastructure for TCM supports large-scale database integration and service coordination in a virtual organization. We use domain ontologies to integrate TCM database resources and services in a semantic cyberspace and deliver a semantically superior experience including browsing, searching, querying and knowledge discovering to users. We have developed a collection of semantic-based toolkits to support TCM scientists and researchers in information sharing and collaborative research.
Conclusion: Semantic and knowledge-based techniques are suitable for knowledge-intensive disciplines like TCM. It is possible to build on-demand e-Science systems for TCM based on existing semantic and knowledge-based techniques. The approach presented in the paper integrates heterogeneous distributed TCM databases and services, and provides scientists with a semantically superior experience to support collaborative research in the TCM discipline.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

MEAformer: Multi-modal Entity Alignment Transformer for Meta Modality Hybrid

Authors: Chen Huajun; Chen Jiaoyan; Chen Zhuo; Fang Yin; Geng Yuxia; Guo Lingbing; Huang Yufeng; Pan Jeff Z.; Song Wenting; Zhang Wen; Zhang Yichi
Publication date: 20/04/2023

As an important variant of entity alignment (EA), multi-modal entity alignment (MMEA) aims to discover identical entities across different knowledge graphs (KGs) with relevant images attached. We noticed that current MMEA algorithms all globally adopt the KG-level modality fusion strategies for multi-modal entity representation but ignore the variation in modality preferences for individual entities, hurting the robustness to potential noise involved in modalities (e.g., blurry images and relations). In this paper, we present MEAformer, a multi-modal entity alignment transformer approach for meta modality hybrid, which dynamically predicts the mutual correlation coefficients among modalities for entity-level feature aggregation. A modal-aware hard entity replay strategy is further proposed for addressing vague entity details. Experimental results show that our model not only achieves SOTA performance on multiple training scenarios including supervised, unsupervised, iterative, and low resource, but also has a comparable number of parameters, optimistic speed, and good interpretability. Our code and data are available at https://github.com/zjukg/MEAformer.
Comment: Repository: https://github.com/zjukg/MEAforme
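The contrast this abstract draws between KG-level and entity-level modality fusion can be made concrete with a minimal sketch. It is not MEAformer itself: the function names `softmax` and `fuse_entity` are hypothetical, and the per-modality scores are passed in by hand here, whereas the abstract describes a transformer that predicts the modality coefficients.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_entity(embeddings, scores):
    """Fuse one entity's modality embeddings with entity-specific weights.

    embeddings: dict mapping modality name -> vector (all the same length)
    scores:     dict mapping modality name -> float; a higher score means the
                modality is trusted more for this entity (assumed given here)
    """
    names = sorted(embeddings)
    weights = softmax([scores[n] for n in names])
    dim = len(embeddings[names[0]])
    # Weighted sum of the modality vectors, computed per dimension.
    fused = [
        sum(w * embeddings[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# An entity with a blurry image gets a low image score, so its fused
# representation leans on the relation embedding instead.
fused, weights = fuse_entity(
    {"image": [1.0, 0.0], "relation": [0.0, 1.0]},
    {"image": -2.0, "relation": 2.0},
)
```

A KG-level strategy would correspond to calling `fuse_entity` with the same `scores` for every entity; letting the scores vary per entity is what allows noisy modalities to be down-weighted case by case.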

arXiv.org e-Print Archive