141 research outputs found
Domain-Agnostic Molecular Generation with Self-feedback
The generation of molecules with desired properties has gained tremendous
popularity, revolutionizing the way scientists design molecular structures and
providing valuable support for chemical and drug design. However, despite the
potential of language models in molecule generation, they face numerous
challenges such as the generation of syntactically or chemically flawed
molecules, narrow domain focus, and limitations in creating diverse and
directionally feasible molecules due to a dearth of annotated data or external
molecular databases. To this end, we introduce MolGen, a pre-trained molecular
language model tailored specifically for molecule generation. MolGen acquires
intrinsic structural and grammatical insights by reconstructing over 100
million molecular SELFIES, while facilitating knowledge transfer between
different domains through domain-agnostic molecular prefix tuning. Moreover, we
present a self-feedback paradigm that inspires the pre-trained model to align
with the ultimate goal of producing molecules with desirable properties.
Extensive experiments on well-known benchmarks confirm MolGen's optimization
capabilities, encompassing penalized logP, QED, and molecular docking
properties. Further analysis shows that MolGen can accurately capture molecule
distributions, implicitly learn their structural characteristics, and
efficiently explore chemical space. The pre-trained model, codes, and datasets
are publicly available for future research at https://github.com/zjunlp/MolGen.
Comment: Work in progress. Add results of binding affinit
Differential Responses and Controls of Soil CO2 and N2O Fluxes to Experimental Warming and Nitrogen Fertilization in a Subalpine Coniferous Spruce (Picea asperata Mast.) Plantation Forest
Emissions of greenhouse gases (GHG) such as CO2 and N2O from soils are affected by many factors such as climate change, soil carbon content, and soil nutrient conditions. However, the response patterns and controls of soil CO2 and N2O fluxes to global warming and nitrogen (N) fertilization are still not clear in subalpine forests. To address this issue, we conducted an eight-year field experiment with warming and N fertilization treatments in a subalpine coniferous spruce (Picea asperata Mast.) plantation forest in China. Soil CO2 and N2O fluxes were measured using a static chamber method, and soils were sampled to analyze soil carbon and N contents, soil microbial substrate utilization (MSU) patterns, and microbial functional diversity. Results showed that the mean annual CO2 and N2O fluxes were 36.04 ± 3.77 mg C m−2 h−1 and 0.51 ± 0.11 µg N m−2 h−1, respectively. Soil CO2 flux was only affected by warming, while soil N2O flux was significantly enhanced by N fertilization and its interaction with warming. Warming enhanced dissolved organic carbon (DOC) and MSU, reduced soil organic carbon (SOC) and microbial biomass carbon (MBC), and constrained the microbial metabolic activity and microbial functional diversity, resulting in a decrease in soil CO2 emission. Structural equation modeling indicated that MSU had a dominant direct negative effect on soil CO2 flux but a direct positive effect on soil N2O flux. DOC and MBC had indirect positive effects on soil CO2 flux, while soil NH4+-N had a direct negative effect on soil CO2 and N2O fluxes. This study revealed different response patterns and controlling factors of soil CO2 and N2O fluxes in the subalpine plantation forest, and highlighted the importance of soil microbial contributions to GHG fluxes under climate warming and N deposition.
Graph Sampling-based Meta-Learning for Molecular Property Prediction
Molecular property is usually observed with a limited number of samples, and
researchers have considered property prediction as a few-shot problem. One
important fact that has been ignored by prior works is that each molecule can
be recorded with several different properties simultaneously. To effectively
utilize many-to-many correlations of molecules and properties, we propose a
Graph Sampling-based Meta-learning (GS-Meta) framework for few-shot molecular
property prediction. First, we construct a Molecule-Property relation Graph
(MPG): molecules and properties are nodes, while property labels decide edges.
Then, to utilize the topological information of MPG, we reformulate an episode
in meta-learning as a subgraph of the MPG, containing a target property node,
molecule nodes, and auxiliary property nodes. Third, as episodes in the form of
subgraphs are no longer independent of each other, we propose to schedule the
subgraph sampling process with a contrastive loss function, which considers the
consistency and discrimination of subgraphs. Extensive experiments on 5
commonly-used benchmarks show GS-Meta consistently outperforms state-of-the-art
methods by 5.71%-6.93% in ROC-AUC and verify the effectiveness of each proposed
module. Our code is available at https://github.com/HICAI-ZJU/GS-Meta.
Comment: Accepted by IJCAI 202
Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering
Recently, the development of large language models (LLMs) has attracted wide
attention in academia and industry. Deploying LLMs to real scenarios is one of
the key directions in the current Internet industry. In this paper, we present
a novel pipeline to apply LLMs for domain-specific question answering (QA) that
incorporates domain knowledge graphs (KGs), addressing an important direction
of LLM application. As a real-world application, the content generated by LLMs
should be user-friendly to serve the customers. Additionally, the model needs
to utilize domain knowledge properly to generate reliable answers. These two
issues are the two major difficulties in applying LLMs, as vanilla
fine-tuning cannot adequately address them. We think both requirements can be
unified as a model preference problem, in which the model must align with humans to
achieve practical application. Thus, we introduce Knowledgeable Preference
AlignmenT (KnowPAT), which constructs two kinds of preference sets, a style
preference set and a knowledge preference set, to tackle the two
issues. Besides, we design a new alignment objective to align the LLM
preference with human preference, aiming to train a better LLM for
real-scenario domain-specific QA to generate reliable and user-friendly
answers. Adequate experiments and comprehensive comparisons with 15 baseline methods
demonstrate that KnowPAT is a superior pipeline for real-scenario
domain-specific QA with LLMs. Our code is open-source at
https://github.com/zjukg/KnowPAT.
Comment: Work in progress. Code is available at
https://github.com/zjukg/KnowPA
Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
Large Language Models (LLMs), with their remarkable task-handling
capabilities and innovative outputs, have catalyzed significant advancements
across a spectrum of fields. However, their proficiency within specialized
domains such as biomolecular studies remains limited. To address this
challenge, we introduce Mol-Instructions, a meticulously curated, comprehensive
instruction dataset expressly designed for the biomolecular realm.
Mol-Instructions is composed of three pivotal components: molecule-oriented
instructions, protein-oriented instructions, and biomolecular text
instructions, each curated to enhance the understanding and prediction
capabilities of LLMs concerning biomolecular features and behaviors. Through
extensive instruction tuning experiments on the representative LLM, we
underscore the potency of Mol-Instructions to enhance the adaptability and
cognitive acuity of large models within the complex sphere of biomolecular
studies, thereby promoting advancements in the biomolecular research community.
Mol-Instructions is made publicly accessible for future research endeavors and
will be subjected to continual updates for enhanced applicability.
Comment: Project homepage: https://github.com/zjunlp/Mol-Instructions. Add
quantitative evaluation
DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning
Zero-shot learning (ZSL) aims to predict unseen classes whose samples have
never appeared during training. One of the most effective and widely used
kinds of semantic information for zero-shot image classification is attributes, which
are annotations for class-level visual characteristics. However, the current
methods often fail to discriminate those subtle visual distinctions between
images due to not only the shortage of fine-grained annotations, but also the
attribute imbalance and co-occurrence. In this paper, we present a
transformer-based end-to-end ZSL method named DUET, which integrates latent
semantic knowledge from the pre-trained language models (PLMs) via a
self-supervised multi-modal learning paradigm. Specifically, we (1) developed a
cross-modal semantic grounding network to investigate the model's capability of
disentangling semantic attributes from the images; (2) applied an
attribute-level contrastive learning strategy to further enhance the model's
discrimination on fine-grained visual characteristics against the attribute
co-occurrence and imbalance; (3) proposed a multi-task learning policy for
considering multi-model objectives. We find that our DUET can achieve
state-of-the-art performance on three standard ZSL benchmarks and a knowledge
graph equipped ZSL benchmark. Its components are effective and its predictions
are interpretable.
Comment: AAAI 2023 (Oral). Repository: https://github.com/zjukg/DUE
Shrub type dominates the vertical distribution of leaf C : N : P stoichiometry across an extensive altitudinal gradient
Understanding leaf stoichiometric patterns is crucial for improving predictions of plant responses to environmental changes. Leaf stoichiometry of terrestrial ecosystems has been widely investigated along latitudinal and longitudinal gradients. However, very little is known about the vertical distribution of leaf C : N : P and the relative effects of environmental parameters, especially for shrubs. Here, we analyzed the shrub leaf C, N and P patterns in 125 mountainous sites over an extensive altitudinal gradient (523-4685 m) on the Tibetan Plateau. Results showed that the shrub leaf C and C : N were 7.3-47.5% higher than those of other regional and global flora, whereas the leaf N and N : P were 10.2-75.8% lower. Leaf C increased with rising altitude and decreasing temperature, supporting the physiological acclimation mechanism that high leaf C (e.g., alpine or evergreen shrub) could balance the cell osmotic pressure and resist freezing. The largest leaf N and high leaf P occurred in the valley region (altitude 1500 m), likely due to the large nutrient leaching from higher elevations, faster litter decomposition and nutrient resorption ability of deciduous broadleaf shrub. Leaf N : P ratio further indicated increasing N limitation at higher altitudes. Interestingly, drought severity was the only climatic factor positively correlated with leaf N and P, which was more appropriate for evaluating the impact of water status than precipitation. Among the shrub ecosystem and functional types (alpine, subalpine, montane, valley, evergreen, deciduous, broadleaf, and conifer), their leaf element contents and responses to environments were remarkably different. Shrub type was the largest contributor to the total variations in leaf stoichiometry, while climate indirectly affected the leaf C : N : P via its interactive effects on shrub type or soil.
Collectively, the large heterogeneity in shrub type was the most important factor explaining the overall leaf C : N : P variations, despite the broad climate gradient on the plateau. Temperature- and drought-induced shifts in shrub type distribution will influence the nutrient accumulation in mountainous shrubs. © Author(s) 2018
Towards Semantic e-Science for Traditional Chinese Medicine
Background: Recent advances in Web and information technologies with the increasing decentralization of organizational structures have resulted in massive amounts of information resources and domain-specific services in Traditional Chinese Medicine (TCM). The massive volume and diversity of information and services available have made it difficult to achieve seamless and interoperable e-Science for knowledge-intensive disciplines like TCM. Therefore, information integration and service coordination are two major challenges in e-Science for TCM. We still lack sophisticated approaches to integrate scientific data and services for TCM e-Science.
Results: We present a comprehensive approach to build dynamic and extendable e-Science applications for knowledge-intensive disciplines like TCM based on semantic and knowledge-based techniques. The semantic e-Science infrastructure for TCM supports large-scale database integration and service coordination in a virtual organization. We use domain ontologies to integrate TCM database resources and services in a semantic cyberspace and deliver a semantically superior experience including browsing, searching, querying and knowledge discovery to users. We have developed a collection of semantic-based toolkits to help TCM scientists and researchers with information sharing and collaborative research.
Conclusion: Semantic and knowledge-based techniques are suitable for knowledge-intensive disciplines like TCM. It is possible to build an on-demand e-Science system for TCM based on existing semantic and knowledge-based techniques. The approach presented in the paper integrates heterogeneous distributed TCM databases and services, and provides scientists with a semantically superior experience to support collaborative research in the TCM discipline.
MEAformer: Multi-modal Entity Alignment Transformer for Meta Modality Hybrid
As an important variant of entity alignment (EA), multi-modal entity
alignment (MMEA) aims to discover identical entities across different knowledge
graphs (KGs) with relevant images attached. We noticed that current MMEA
algorithms all globally adopt the KG-level modality fusion strategies for
multi-modal entity representation but ignore the variation in modality
preferences for individual entities, hurting the robustness to potential noise
involved in modalities (e.g., blurry images and relations). In this paper, we
present MEAformer, a multi-modal entity alignment transformer approach for meta
modality hybrid, which dynamically predicts the mutual correlation coefficients
among modalities for entity-level feature aggregation. A modal-aware hard
entity replay strategy is further proposed for addressing vague entity details.
Experimental results show that our model not only achieves SOTA performance on
multiple training scenarios including supervised, unsupervised, iterative, and
low resource, but also has a comparable number of parameters, optimistic speed,
and good interpretability. Our code and data are available at
https://github.com/zjukg/MEAformer.
Comment: Repository: https://github.com/zjukg/MEAforme