2,893 research outputs found
edge2vec: Representation learning using edge semantics for biomedical knowledge discovery
Representation learning provides new and powerful graph analytical approaches
and tools for the highly valued data science challenge of mining knowledge
graphs. Since previous graph analytical methods have mostly focused on
homogeneous graphs, an important current challenge is extending this
methodology for richly heterogeneous graphs and knowledge domains. The
biomedical sciences are such a domain, reflecting the complexity of biology,
with entities such as genes, proteins, drugs, diseases, and phenotypes, and
relationships such as gene co-expression, biochemical regulation, and
biomolecular inhibition or activation. Therefore, the semantics of edges and
nodes are critical for representation learning and knowledge discovery in real
world biomedical problems. In this paper, we propose the edge2vec model, which
represents graphs considering edge semantics. An edge-type transition matrix is
trained by an Expectation-Maximization approach, and a stochastic gradient
descent model is employed to learn node embedding on a heterogeneous graph via
the trained transition matrix. edge2vec is validated on three biomedical domain
tasks: biomedical entity classification, compound-gene bioactivity prediction,
and biomedical information retrieval. Results show that by considering
edge-types into node embedding learning in heterogeneous graphs,
\textbf{edge2vec}\ significantly outperforms state-of-the-art models on all
three tasks. We propose this method for its added value relative to existing
graph analytical methodology, and in the real world context of biomedical
knowledge discovery applicability.Comment: 10 page
TOP2A and EZH2 Provide Early Detection of an Aggressive Prostate Cancer Subgroup.
Purpose: Current clinical parameters do not stratify indolent from aggressive prostate cancer. Aggressive prostate cancer, defined by the progression from localized disease to metastasis, is responsible for the majority of prostate cancer–associated mortality. Recent gene expression profiling has proven successful in predicting the outcome of prostate cancer patients; however, they have yet to provide targeted therapy approaches that could inhibit a patient\u27s progression to metastatic disease. Experimental Design: We have interrogated a total of seven primary prostate cancer cohorts (n = 1,900), two metastatic castration-resistant prostate cancer datasets (n = 293), and one prospective cohort (n = 1,385) to assess the impact of TOP2A and EZH2 expression on prostate cancer cellular program and patient outcomes. We also performed IHC staining for TOP2A and EZH2 in a cohort of primary prostate cancer patients (n = 89) with known outcome. Finally, we explored the therapeutic potential of a combination therapy targeting both TOP2A and EZH2 using novel prostate cancer–derived murine cell lines. Results: We demonstrate by genome-wide analysis of independent primary and metastatic prostate cancer datasets that concurrent TOP2A and EZH2 mRNA and protein upregulation selected for a subgroup of primary and metastatic patients with more aggressive disease and notable overlap of genes involved in mitotic regulation. Importantly, TOP2A and EZH2 in prostate cancer cells act as key driving oncogenes, a fact highlighted by sensitivity to combination-targeted therapy. Conclusions: Overall, our data support further assessment of TOP2A and EZH2 as biomarkers for early identification of patients with increased metastatic potential that may benefit from adjuvant or neoadjuvant targeted therapy approaches. ©2017 AACR
Transforming Graph Representations for Statistical Relational Learning
Relational data representations have become an increasingly important topic
due to the recent proliferation of network datasets (e.g., social, biological,
information networks) and a corresponding increase in the application of
statistical relational learning (SRL) algorithms to these domains. In this
article, we examine a range of representation issues for graph-based relational
data. Since the choice of relational data representation for the nodes, links,
and features can dramatically affect the capabilities of SRL algorithms, we
survey approaches and opportunities for relational representation
transformation designed to improve the performance of these algorithms. This
leads us to introduce an intuitive taxonomy for data representation
transformations in relational domains that incorporates link transformation and
node transformation as symmetric representation tasks. In particular, the
transformation tasks for both nodes and links include (i) predicting their
existence, (ii) predicting their label or type, (iii) estimating their weight
or importance, and (iv) systematically constructing their relevant features. We
motivate our taxonomy through detailed examples and use it to survey and
compare competing approaches for each of these tasks. We also discuss general
conditions for transforming links, nodes, and features. Finally, we highlight
challenges that remain to be addressed
Projection Based Models for High Dimensional Data
In recent years, many machine learning applications have arisen which deal with the
problem of finding patterns in high dimensional data. Principal component analysis
(PCA) has become ubiquitous in this setting. PCA performs dimensionality reduction
by estimating latent factors which minimise the reconstruction error between
the original data and its low-dimensional projection. We initially consider a situation
where influential observations exist within the dataset which have a large,
adverse affect on the estimated PCA model. We propose a measure of “predictive
influence” to detect these points based on the contribution of each point to the
leave-one-out reconstruction error of the model using an analytic PRedicted REsidual
Sum of Squares (PRESS) statistic. We then develop a robust alternative to PCA
to deal with the presence of influential observations and outliers which minimizes
the predictive reconstruction error.
In some applications there may be unobserved clusters in the data, for which
fitting PCA models to subsets of the data would provide a better fit. This is known
as the subspace clustering problem. We develop a novel algorithm for subspace
clustering which iteratively fits PCA models to subsets of the data and assigns observations
to clusters based on their predictive influence on the reconstruction error.
We study the convergence of the algorithm and compare its performance to a number
of subspace clustering methods on simulated data and in real applications from
computer vision involving clustering object trajectories in video sequences and images
of faces.
We extend our predictive clustering framework to a setting where two high-dimensional
views of data have been obtained. Often, only either clustering or predictive modelling is performed between the views. Instead, we aim to recover
clusters which are maximally predictive between the views. In this setting two block
partial least squares (TB-PLS) is a useful model. TB-PLS performs dimensionality
reduction in both views by estimating latent factors that are highly predictive. We
fit TB-PLS models to subsets of data and assign points to clusters based on their
predictive influence under each model which is evaluated using a PRESS statistic.
We compare our method to state of the art algorithms in real applications in webpage
and document clustering and find that our approach to predictive clustering
yields superior results.
Finally, we propose a method for dynamically tracking multivariate data streams
based on PLS. Our method learns a linear regression function from multivariate
input and output streaming data in an incremental fashion while also performing
dimensionality reduction and variable selection. Moreover, the recursive regression
model is able to adapt to sudden changes in the data generating mechanism and also
identifies the number of latent factors. We apply our method to the enhanced index
tracking problem in computational finance
- …