Large-scale event extraction from literature with multi-level gene normalization
Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/).
Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.
TopologyNet: Topology based deep convolutional neural networks for biomolecular property predictions
Although deep learning approaches have had tremendous success in image, video
and audio processing, computer vision, and speech recognition, their
applications to three-dimensional (3D) biomolecular structural data sets have
been hindered by the entangled geometric complexity and biological complexity.
We introduce topology, i.e., element specific persistent homology (ESPH), to
untangle geometric complexity and biological complexity. ESPH represents 3D
complex geometry by one-dimensional (1D) topological invariants and retains
crucial biological information via a multichannel image representation. It is
able to reveal hidden structure-function relationships in biomolecules. We
further integrate ESPH and convolutional neural networks to construct a
multichannel topological neural network (TopologyNet) for the predictions of
protein-ligand binding affinities and protein stability changes upon mutation.
To overcome the limitations to deep learning arising from small and noisy
training sets, we present a multitask topological convolutional neural network
(MT-TCNN). We demonstrate that the present TopologyNet architectures outperform
other state-of-the-art methods in the predictions of protein-ligand binding
affinities, globular protein mutation impacts, and membrane protein mutation
impacts.
Comment: 20 pages, 8 figures, 5 tables
Pathway-Based Genomics Prediction using Generalized Elastic Net.
We present a novel regularization scheme called the Generalized Elastic Net (GELnet) that incorporates gene pathway information into feature selection. The proposed formulation is applicable to a wide variety of problems in which the interpretation of predictive features using known molecular interactions is desired. The method naturally steers solutions toward sets of mechanistically interlinked genes. Using experiments on synthetic data, we demonstrate that pathway-guided results maintain, and often improve, the accuracy of predictors even in cases where the full gene network is unknown. We apply the method to predict the drug response of breast cancer cell lines. GELnet is able to reveal genetic determinants of sensitivity and resistance for several compounds. In particular, for an EGFR/HER2 inhibitor, it finds a possible trans-differentiation resistance mechanism missed by the corresponding pathway-agnostic approach.
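The abstract does not state the penalty itself. As a rough illustration only, the sketch below assumes a common network-regularized form: a weighted L1 term plus a quadratic term w^T P w, with P taken to be the graph Laplacian of a gene network. The function names, the choice of P, and the toy network are all hypothetical and are not taken from the paper.

```python
def graph_laplacian(n, edges):
    """Unweighted graph Laplacian (degree matrix minus adjacency) as nested lists."""
    lap = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
    return lap


def network_penalty(w, d, P, lam1, lam2):
    """Assumed penalty: lam1 * sum_j d_j |w_j| + (lam2 / 2) * w^T P w."""
    l1 = sum(dj * abs(wj) for dj, wj in zip(d, w))
    n = len(w)
    quad = sum(w[i] * P[i][j] * w[j] for i in range(n) for j in range(n))
    return lam1 * l1 + 0.5 * lam2 * quad


# Toy network: genes 0 and 1 interact, gene 2 is isolated.
P = graph_laplacian(3, [(0, 1)])
d = [1.0, 1.0, 1.0]
linked = network_penalty([1.0, 1.0, 0.0], d, P, lam1=0.1, lam2=1.0)
unlinked = network_penalty([1.0, 0.0, 1.0], d, P, lam1=0.1, lam2=1.0)
print(linked, unlinked)
```

With a Laplacian, the quadratic term equals the sum of squared weight differences across network edges, so a model that spreads equal weight over two interacting genes is penalized less than one that selects two non-interacting genes. That is one way a regularizer can steer solutions toward interlinked genes.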
REinforcement learning based Adaptive samPling: REAPing Rewards by Exploring Protein Conformational Landscapes
One of the key limitations of Molecular Dynamics simulations is the
computational intractability of sampling protein conformational landscapes
associated with either large system size or long timescales. To overcome this
bottleneck, we present the REinforcement learning based Adaptive samPling
(REAP) algorithm that aims to efficiently sample conformational space by
learning the relative importance of each reaction coordinate as it samples the
landscape. To achieve this, the algorithm uses concepts from the field of
reinforcement learning, a subset of machine learning, which rewards sampling
along important degrees of freedom and disregards others that do not facilitate
exploration or exploitation. We demonstrate the effectiveness of REAP by
comparing the sampling to long continuous MD simulations and least-counts
adaptive sampling on two model landscapes (L-shaped and circular), and
realistic systems such as alanine dipeptide and Src kinase. In all four
systems, the REAP algorithm consistently demonstrates its ability to explore
conformational space faster than the other two methods when comparing the
expected values of the landscape discovered for a given amount of time. The key
advantage of REAP is on-the-fly estimation of the importance of collective
variables, which makes it particularly useful for systems with limited
structural information.
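The abstract describes rewarding sampling along important reaction coordinates but gives no formula. The sketch below is a simplified illustration under stated assumptions: each candidate state is scored by a weighted, standardized distance from the mean of the states seen so far, and the highest-scoring states are chosen as restart points. The reward form, the function names, and the fixed weights are hypothetical; the step in which the weights are learned during sampling is omitted.

```python
from statistics import mean, pstdev


def rewards(states, weights):
    """Score each state by sum_i w_i * |theta_i - mu_i| / sigma_i (assumed form)."""
    n_cv = len(weights)
    mu = [mean(s[i] for s in states) for i in range(n_cv)]
    # Fall back to 1.0 when a coordinate has zero spread, to avoid dividing by zero.
    sigma = [pstdev([s[i] for s in states]) or 1.0 for i in range(n_cv)]
    return [
        sum(w * abs(s[i] - mu[i]) / sigma[i] for i, w in enumerate(weights))
        for s in states
    ]


def pick_restarts(states, weights, k):
    """Return the k states with the highest reward as new simulation starting points."""
    r = rewards(states, weights)
    order = sorted(range(len(states)), key=lambda i: r[i], reverse=True)
    return [states[i] for i in order[:k]]


# Four states described by two collective variables; only the second is weighted.
states = [(0.0, 0.0), (0.1, 0.0), (0.0, 3.0), (0.1, 0.1)]
print(pick_restarts(states, weights=(0.0, 1.0), k=1))
```

Putting all weight on the second coordinate selects the state that lies furthest along it, which is the sense in which unimportant degrees of freedom are disregarded.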
Model Reduction Tools For Phenomenological Modeling of Input-Controlled Biological Circuits
We present a Python-based software package to automatically obtain phenomenological models of input-controlled synthetic biological circuits that guide the design using chemical reaction-level descriptive models. From the parts and mechanism description of a synthetic biological circuit, it is easy to obtain a chemical reaction model of the circuit under the assumptions of mass-action kinetics using various existing tools. However, using these models to guide design decisions during an experiment is difficult due to a large number of reaction rate parameters and species in the model. Hence, phenomenological models are often developed that describe the effective relationships among the circuit inputs, outputs, and only the key states and parameters. In this paper, we present an algorithm to obtain these phenomenological models in an automated manner using a Python package for circuits with inputs that control the desired outputs. This model reduction approach combines the common assumptions of time-scale separation, conservation laws, and species' abundance to obtain the reduced models that can be used for design of synthetic biological circuits. We consider an example of a simple gene expression circuit and another example of a layered genetic feedback control circuit to demonstrate the use of the model reduction procedure.
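To make the time-scale separation assumption concrete, the sketch below reduces a simple gene expression model by hand. It does not use the package described in the abstract. The two-state model, the parameter values, and the function name are assumptions chosen for illustration: mRNA m is taken to be fast, so it is replaced by its quasi-steady-state value k_tx * u / d_m in the protein equation.

```python
def simulate(u, t_end=100.0, dt=0.001, k_tx=10.0, d_m=10.0, k_tl=1.0, d_p=0.1):
    """Forward-Euler integration of a full and a reduced gene expression model.

    Full model:    dm/dt = k_tx * u - d_m * m
                   dp/dt = k_tl * m - d_p * p
    Reduced model: dp/dt = (k_tl * k_tx / d_m) * u - d_p * p
    """
    m = p_full = p_red = 0.0
    for _ in range(int(t_end / dt)):
        dm = k_tx * u - d_m * m
        dp_full = k_tl * m - d_p * p_full
        dp_red = (k_tl * k_tx / d_m) * u - d_p * p_red
        m += dt * dm
        p_full += dt * dp_full
        p_red += dt * dp_red
    return p_full, p_red


p_full, p_red = simulate(u=1.0)
print(p_full, p_red)
```

Because mRNA here degrades 100 times faster than protein, the one-state reduced model tracks the protein level of the two-state model closely while exposing a single lumped gain, k_tl * k_tx / d_m, from input to output.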
Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models
Large Language Models (LLMs), with their remarkable task-handling
capabilities and innovative outputs, have catalyzed significant advancements
across a spectrum of fields. However, their proficiency within specialized
domains such as biomolecular studies remains limited. To address this
challenge, we introduce Mol-Instructions, a meticulously curated, comprehensive
instruction dataset expressly designed for the biomolecular realm.
Mol-Instructions is composed of three pivotal components: molecule-oriented
instructions, protein-oriented instructions, and biomolecular text
instructions, each curated to enhance the understanding and prediction
capabilities of LLMs concerning biomolecular features and behaviors. Through
extensive instruction tuning experiments on a representative LLM, we
underscore the potency of Mol-Instructions to enhance the adaptability and
cognitive acuity of large models within the complex sphere of biomolecular
studies, thereby promoting advancements in the biomolecular research community.
Mol-Instructions is made publicly accessible for future research endeavors and
will be subjected to continual updates for enhanced applicability.
Comment: Project homepage: https://github.com/zjunlp/Mol-Instructions. Add
quantitative evaluation
Similarity-based virtual screening using 2D fingerprints
This paper summarises recent work at the University of Sheffield on virtual screening methods that use 2D fingerprint measures of structural similarity. A detailed comparison of a large number of similarity coefficients demonstrates that the well-known Tanimoto coefficient remains the method of choice for the computation of fingerprint-based similarity, despite possessing some inherent biases related to the sizes of the molecules that are being sought. Group fusion involves combining the results of similarity searches based on multiple reference structures and a single similarity measure. We demonstrate the effectiveness of this approach to screening, and also describe an approximate form of group fusion, turbo similarity searching, that can be used when just a single reference structure is available.
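A minimal sketch of the two ideas named in this abstract: the Tanimoto coefficient on binary fingerprints, and group fusion over several reference structures. Fingerprints are represented as sets of on-bit indices. Taking the maximum similarity as the fusion rule is an assumption made here for illustration, and the molecule names and bit sets are invented toy data.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: shared on-bits divided by on-bits set in either fingerprint."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def group_fusion(references, database, fuse=max):
    """Rank database molecules by fusing their similarities to every reference."""
    scores = {
        name: fuse(tanimoto(fp, ref) for ref in references)
        for name, fp in database.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: two reference actives and three database molecules.
references = [{1, 2, 3, 4}, {3, 4, 5, 6}]
database = {"mol_a": {1, 2, 3, 4, 5}, "mol_b": {7, 8, 9}, "mol_c": {3, 4, 5}}
ranking = group_fusion(references, database)
print(ranking)
```

Here mol_c scores only 0.4 against the first reference but 0.75 against the second, so fusing over both references ranks it well above mol_b, which a single-reference search from the first structure alone would separate less clearly.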