152 research outputs found
Domain-Agnostic Molecular Generation with Self-feedback
The generation of molecules with desired properties has gained tremendous
popularity, revolutionizing the way scientists design molecular structures and
providing valuable support for chemical and drug design. However, despite the
potential of language models in molecule generation, they face numerous
challenges such as the generation of syntactically or chemically flawed
molecules, narrow domain focus, and limitations in creating diverse and
directionally feasible molecules due to a dearth of annotated data or external
molecular databases. To this end, we introduce MolGen, a pre-trained molecular
language model tailored specifically for molecule generation. MolGen acquires
intrinsic structural and grammatical insights by reconstructing over 100
million molecular SELFIES, while facilitating knowledge transfer between
different domains through domain-agnostic molecular prefix tuning. Moreover, we
present a self-feedback paradigm that inspires the pre-trained model to align
with the ultimate goal of producing molecules with desirable properties.
Extensive experiments on well-known benchmarks confirm MolGen's optimization
capabilities, encompassing penalized logP, QED, and molecular docking
properties. Further analysis shows that MolGen can accurately capture molecule
distributions, implicitly learn their structural characteristics, and
efficiently explore chemical space. The pre-trained model, codes, and datasets
are publicly available for future research at https://github.com/zjunlp/MolGen.
Comment: Work in progress. Add results of binding affinit
Graph Sampling-based Meta-Learning for Molecular Property Prediction
Molecular property is usually observed with a limited number of samples, and
researchers have considered property prediction as a few-shot problem. One
important fact that has been ignored by prior works is that each molecule can
be recorded with several different properties simultaneously. To effectively
utilize many-to-many correlations of molecules and properties, we propose a
Graph Sampling-based Meta-learning (GS-Meta) framework for few-shot molecular
property prediction. First, we construct a Molecule-Property relation Graph
(MPG): molecules and properties are nodes, while property labels decide edges.
Second, to utilize the topological information of MPG, we reformulate an episode
in meta-learning as a subgraph of the MPG, containing a target property node,
molecule nodes, and auxiliary property nodes. Third, as episodes in the form of
subgraphs are no longer independent of each other, we propose to schedule the
subgraph sampling process with a contrastive loss function, which considers the
consistency and discrimination of subgraphs. Extensive experiments on 5
commonly-used benchmarks show GS-Meta consistently outperforms state-of-the-art
methods by 5.71%-6.93% in ROC-AUC and verify the effectiveness of each proposed
module. Our code is available at https://github.com/HICAI-ZJU/GS-Meta.
Comment: Accepted by IJCAI 202
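The episode construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy labels, and the data layout are all hypothetical, and the contrastive scheduling of subgraph sampling is replaced here by plain uniform random sampling.

```python
import random
from collections import defaultdict


def build_mpg(labels):
    """Build a toy Molecule-Property relation Graph (MPG).

    `labels` maps (molecule, property) pairs to a 0/1 label (hypothetical
    layout). Each labelled pair becomes an edge; the result is indexed by
    property node for convenient lookup.
    """
    by_property = defaultdict(dict)
    for (mol, prop), y in labels.items():
        by_property[prop][mol] = y
    return by_property


def sample_episode(by_property, target, n_mols, n_aux, rng):
    """Sample one episode as a subgraph: a target property node, molecule
    nodes, and auxiliary property nodes.

    Assumption: uniform sampling stands in for the scheduled, contrastive
    sampling that the abstract describes.
    """
    mols = rng.sample(sorted(by_property[target]), n_mols)
    candidates = sorted(p for p in by_property if p != target)
    aux = rng.sample(candidates, min(n_aux, len(candidates)))
    # Keep only edges whose molecule carries a label for that property.
    edges = [
        (m, p, by_property[p][m])
        for p in [target] + aux
        for m in mols
        if m in by_property[p]
    ]
    return {"target": target, "molecules": mols,
            "aux_properties": aux, "edges": edges}


# Toy labels: four molecules, three properties, some pairs unlabelled.
toy_labels = {
    ("m1", "tox"): 1, ("m2", "tox"): 0, ("m3", "tox"): 1, ("m4", "tox"): 0,
    ("m1", "sol"): 0, ("m2", "sol"): 1, ("m3", "sol"): 1,
    ("m2", "bbb"): 1, ("m4", "bbb"): 0,
}
mpg = build_mpg(toy_labels)
episode = sample_episode(mpg, "tox", n_mols=2, n_aux=2, rng=random.Random(0))
```

Because a molecule can be labelled for several properties at once, the sampled subgraph carries auxiliary edges alongside the target-property edges, which is the many-to-many information the abstract says earlier few-shot setups ignored.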
Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering
Recently, the development of large language models (LLMs) has attracted wide
attention in academia and industry. Deploying LLMs to real scenarios is one of
the key directions in the current Internet industry. In this paper, we present
a novel pipeline to apply LLMs for domain-specific question answering (QA) that
incorporates domain knowledge graphs (KGs), addressing an important direction
of LLM application. As a real-world application, the content generated by LLMs
should be user-friendly to serve the customers. Additionally, the model needs
to utilize domain knowledge properly to generate reliable answers. These two
issues are the two major difficulties in LLM application, as vanilla
fine-tuning cannot adequately address them. We think both requirements can be
unified as the model preference problem that needs to align with humans to
achieve practical application. Thus, we introduce Knowledgeable Preference
AlignmenT (KnowPAT), which constructs two kinds of preference sets, a style
preference set and a knowledge preference set, to tackle the two
issues. Besides, we design a new alignment objective to align the LLM
preference with human preference, aiming to train a better LLM for
real-scenario domain-specific QA to generate reliable and user-friendly
answers. Adequate experiments and comprehensive comparisons with 15 baseline
methods demonstrate that our KnowPAT is a superior pipeline for real-scenario
domain-specific QA with LLMs. Our code is open-source at
https://github.com/zjukg/KnowPAT.
Comment: Work in progress. Code is available at
https://github.com/zjukg/KnowPA
Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
Large Language Models (LLMs), with their remarkable task-handling
capabilities and innovative outputs, have catalyzed significant advancements
across a spectrum of fields. However, their proficiency within specialized
domains such as biomolecular studies remains limited. To address this
challenge, we introduce Mol-Instructions, a meticulously curated, comprehensive
instruction dataset expressly designed for the biomolecular realm.
Mol-Instructions is composed of three pivotal components: molecule-oriented
instructions, protein-oriented instructions, and biomolecular text
instructions, each curated to enhance the understanding and prediction
capabilities of LLMs concerning biomolecular features and behaviors. Through
extensive instruction tuning experiments on the representative LLM, we
underscore the potency of Mol-Instructions to enhance the adaptability and
cognitive acuity of large models within the complex sphere of biomolecular
studies, thereby promoting advancements in the biomolecular research community.
Mol-Instructions is made publicly accessible for future research endeavors and
will be subjected to continual updates for enhanced applicability.
Comment: Project homepage: https://github.com/zjunlp/Mol-Instructions. Add
quantitative evaluation
DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning
Zero-shot learning (ZSL) aims to predict unseen classes whose samples have
never appeared during training. Among the most effective and widely used
kinds of semantic information for zero-shot image classification are attributes,
which are annotations for class-level visual characteristics. However, the current
methods often fail to discriminate those subtle visual distinctions between
images due to not only the shortage of fine-grained annotations, but also the
attribute imbalance and co-occurrence. In this paper, we present a
transformer-based end-to-end ZSL method named DUET, which integrates latent
semantic knowledge from the pre-trained language models (PLMs) via a
self-supervised multi-modal learning paradigm. Specifically, we (1) developed a
cross-modal semantic grounding network to investigate the model's capability of
disentangling semantic attributes from the images; (2) applied an
attribute-level contrastive learning strategy to further enhance the model's
discrimination on fine-grained visual characteristics against the attribute
co-occurrence and imbalance; (3) proposed a multi-task learning policy for
considering multi-model objectives. We find that our DUET can achieve
state-of-the-art performance on three standard ZSL benchmarks and a knowledge
graph equipped ZSL benchmark. Its components are effective and its predictions
are interpretable.
Comment: AAAI 2023 (Oral). Repository: https://github.com/zjukg/DUE
Massively parallel pyrosequencing-based transcriptome analyses of small brown planthopper (Laodelphax striatellus), a vector insect transmitting rice stripe virus (RSV)
Background: The small brown planthopper (Laodelphax striatellus) is an important agricultural pest that not only damages rice plants by sap-sucking, but also acts as a vector that transmits rice stripe virus (RSV), which can cause even more serious yield loss. Despite being a model organism for studying entomology, population biology, plant protection, molecular interactions among plants, viruses and insects, only a few genomic sequences are available for this species. To investigate its transcriptome and determine the differences between viruliferous and naïve L. striatellus, we employed 454-FLX high-throughput pyrosequencing to generate EST databases of this insect.
Results: We obtained 201,281 and 218,681 high-quality reads from viruliferous and naïve L. striatellus, respectively, with an average read length as 230 bp. These reads were assembled into contigs and two EST databases were generated. When all reads were combined, 16,885 contigs and 24,607 singletons (a total of 41,492 unigenes) were obtained, which represents a transcriptome of the insect. BlastX search against the NCBI-NR database revealed that only 6,873 (16.6%) of these unigenes have significant matches. Comparison of the distribution of GO classification among viruliferous, naïve, and combined EST databases indicated that these libraries are broadly representative of the L. striatellus transcriptomes. Functionally diverse transcripts from RSV, endosymbiotic bacteria Wolbachia and yeast-like symbiotes were identified, which reflects the possible lifestyles of these microbial symbionts that live in the cells of the host insect. Comparative genomic analysis revealed that L. striatellus encodes similar innate immunity regulatory systems as other insects, such as RNA interference, JAK/STAT and partial Imd cascades, which might be involved in defense against viral infection. In addition, we determined the differences in gene expression between vector and naïve samples, which generated a list of candidate genes that are potentially involved in the symbiosis of L. striatellus and RSV.
Conclusions: To our knowledge, the present study is the first description of a genomic project for L. striatellus. The identification of transcripts from RSV, Wolbachia, yeast-like symbiotes and genes abundantly expressed in viruliferous insect, provided a starting-point for investigating the molecular basis of symbiosis among these organisms.
Mechanical and Electrical Properties of a CFETR CSMC Conductor under Transverse Mechanical Loadings
The central solenoid model coil (CSMC) project of the China Fusion Engineering Test Reactor was launched in 2014 to verify the technological feasibility of a large-scale superconducting magnet at the Institute of Plasma Physics, Chinese Academy of Sciences. The short twist pitch design recommended by CEA is chosen for the CSMC Nb3Sn cable-in-conduit conductors. In order to better understand the evolution of transport properties and coupling losses related to the effect of electromagnetic load cycles, the mechanical and electrical properties were measured and investigated employing a special cryogenic press facility for the transverse mechanical loadings. The results show that the transverse compression (dy) versus applied load force (Fy) differs between the first and subsequent loading cycles. This mechanical behavior can be interpreted as the combination of strand bending between the crossovers and strand deformation at the crossovers. The fitting relations of dy versus Fy are also presented. The evolution of interstrand contact resistance (Rc) in the cabling stages with cyclic history and pressure effects is discussed. In addition, a fitting relation of Rc versus Fy is presented based on a combination of strand microsliding and copper matrix resistivity. A clear correlation between intrapetal resistance Rc and coupling loss is also found.
Functional soil organic matter fractions in response to long-term fertilization in upland and paddy systems in South China
Soil organic matter (SOM) and its fractions play key roles in optimizing crop yield and improving soil quality. However, how functional SOM fractions respond to long-term fertilization and their relative importance for C sequestration have been less addressed. In this study, we determined the effects of long-term fertilization on six functional SOM fractions (unprotected, physically protected, physico-biochemically protected, physico-chemically protected, chemically protected, and biochemically protected) based on two long-term fertilization experiments carried out in South China. The unprotected coarse particulate organic matter (cPOM) and the biochemically and chemically protected silt-sized fractions (NH-dSilt and H-dSilt) were the primary C storage fractions under long-term fertilization, accounting for 23.6–46.2%, 15.7–19.4%, and 14.4–17.4% of the total soil organic carbon (SOC) content in upland soil and 19.5–29.3%, 9.9–15.5%, and 14.2–17.2% of the total SOC content in paddy soil, respectively. Compared with the control treatment (CK) in upland soil, the application of manure combined with mineral NPK (NPKM) resulted in an increase in the SOC content in the cPOM, the pure physically protected fraction (iPOM), the physico-chemically protected fraction (H-μSilt), and the chemically protected fraction (H-dSilt) by 233%, 166%, 124%, and 58%, respectively. Besides, the SOC increases in upland soil, expressed as SOC content per unit of total SOC, were the highest for iPOM, H-μSilt, cPOM and H-dSilt, as large as 283%, 248%, 194%, and 105%, respectively. In paddy soil, the highest increase per unit of total SOC was in H-dSilt (190%), followed by H-dClay (156%) and H-μSilt (155%). These results suggested that the upland soil could stabilize more C through the pure physical protection mechanism, whereas the chemical protection mechanism played a more important role in paddy soil. The chemical protection mechanism within the microaggregates played important roles in sequestering C in both upland and paddy soils. Overall, the different responses of functional SOM fractions to long-term fertilization indicate different mechanisms for SOM cycling in terms of C sequestration under upland and paddy systems.