10,045 research outputs found
Using Neural Networks for Relation Extraction from Biomedical Literature
Using different sources of information to support automated extracting of
relations between biomedical concepts contributes to the development of our
understanding of biological systems. The primary comprehensive source of these
relations is biomedical literature. Several relation extraction approaches have
been proposed to identify relations between concepts in biomedical literature,
namely, using neural networks algorithms. The use of multichannel architectures
composed of multiple data representations, as in deep neural networks, is
leading to state-of-the-art results. The right combination of data
representations can eventually lead us to even higher evaluation scores in
relation extraction tasks. Thus, biomedical ontologies play a fundamental role
by providing semantic and ancestry information about an entity. The
incorporation of biomedical ontologies has already been proved to enhance
previous state-of-the-art results.Comment: Artificial Neural Networks book (Springer) - Chapter 1
Diffusion Component Analysis: Unraveling Functional Topology in Biological Networks
Complex biological systems have been successfully modeled by biochemical and
genetic interaction networks, typically gathered from high-throughput (HTP)
data. These networks can be used to infer functional relationships between
genes or proteins. Using the intuition that the topological role of a gene in a
network relates to its biological function, local or diffusion based
"guilt-by-association" and graph-theoretic methods have had success in
inferring gene functions. Here we seek to improve function prediction by
integrating diffusion-based methods with a novel dimensionality reduction
technique to overcome the incomplete and noisy nature of network data. In this
paper, we introduce diffusion component analysis (DCA), a framework that plugs
in a diffusion model and learns a low-dimensional vector representation of each
node to encode the topological properties of a network. As a proof of concept,
we demonstrate DCA's substantial improvement over state-of-the-art
diffusion-based approaches in predicting protein function from molecular
interaction networks. Moreover, our DCA framework can integrate multiple
networks from heterogeneous sources, consisting of genomic information,
biochemical experiments and other resources, to even further improve function
prediction. Yet another layer of performance gain is achieved by integrating
the DCA framework with support vector machines that take our node vector
representations as features. Overall, our DCA framework provides a novel
representation of nodes in a network that can be used as a plug-in architecture
to other machine learning algorithms to decipher topological properties of and
obtain novel insights into interactomes.Comment: RECOMB 201
edge2vec: Representation learning using edge semantics for biomedical knowledge discovery
Representation learning provides new and powerful graph analytical approaches
and tools for the highly valued data science challenge of mining knowledge
graphs. Since previous graph analytical methods have mostly focused on
homogeneous graphs, an important current challenge is extending this
methodology for richly heterogeneous graphs and knowledge domains. The
biomedical sciences are such a domain, reflecting the complexity of biology,
with entities such as genes, proteins, drugs, diseases, and phenotypes, and
relationships such as gene co-expression, biochemical regulation, and
biomolecular inhibition or activation. Therefore, the semantics of edges and
nodes are critical for representation learning and knowledge discovery in real
world biomedical problems. In this paper, we propose the edge2vec model, which
represents graphs considering edge semantics. An edge-type transition matrix is
trained by an Expectation-Maximization approach, and a stochastic gradient
descent model is employed to learn node embedding on a heterogeneous graph via
the trained transition matrix. edge2vec is validated on three biomedical domain
tasks: biomedical entity classification, compound-gene bioactivity prediction,
and biomedical information retrieval. Results show that by considering
edge-types into node embedding learning in heterogeneous graphs,
\textbf{edge2vec}\ significantly outperforms state-of-the-art models on all
three tasks. We propose this method for its added value relative to existing
graph analytical methodology, and in the real world context of biomedical
knowledge discovery applicability.Comment: 10 page
Identification of disease-causing genes using microarray data mining and gene ontology
Background: One of the best and most accurate methods for identifying disease-causing genes is monitoring gene expression values in different samples using microarray technology. One of the shortcomings of microarray data is that they provide a small quantity of samples with respect to the number of genes. This problem reduces the classification accuracy of the methods, so gene selection is essential to improve the predictive accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVMRFE) has become one of the leading methods, but its performance can be reduced because of the small sample size, noisy data and the fact that the method does not remove redundant genes.
Methods: We propose a novel framework for gene selection which uses the advantageous features of conventional methods and addresses their weaknesses. In fact, we have combined the Fisher method and SVMRFE to utilize the advantages of a filtering method as well as an embedded method. Furthermore, we have added a redundancy reduction stage to address the weakness of the Fisher method and SVMRFE. In addition to gene expression values, the proposed method uses Gene Ontology which is a reliable source of information on genes. The use of Gene Ontology can compensate, in part, for the limitations of microarrays, such as having a small number of samples and erroneous measurement results.
Results: The proposed method has been applied to colon, Diffuse Large B-Cell Lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that our method has improved classification performance in terms of accuracy, sensitivity and specificity. In addition, the study of the molecular function of selected genes strengthened the hypothesis that these genes are involved in the process of cancer growth.
Conclusions: The proposed method addresses the weakness of conventional methods by adding a redundancy reduction stage and utilizing Gene Ontology information. It predicts marker genes for colon, DLBCL and prostate cancer with a high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for a cure for cancers
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Effective Feature Representation for Clinical Text Concept Extraction
Crucial information about the practice of healthcare is recorded only in
free-form text, which creates an enormous opportunity for high-impact NLP.
However, annotated healthcare datasets tend to be small and expensive to
obtain, which raises the question of how to make maximally efficient uses of
the available data. To this end, we develop an LSTM-CRF model for combining
unsupervised word representations and hand-built feature representations
derived from publicly available healthcare ontologies. We show that this
combined model yields superior performance on five datasets of diverse kinds of
healthcare text (clinical, social, scientific, commercial). Each involves the
labeling of complex, multi-word spans that pick out different healthcare
concepts. We also introduce a new labeled dataset for identifying the treatment
relations between drugs and diseases
- …