2,893 research outputs found

    edge2vec: Representation learning using edge semantics for biomedical knowledge discovery

    Full text link
    Representation learning provides new and powerful graph analytical approaches and tools for the highly valued data science challenge of mining knowledge graphs. Since previous graph analytical methods have mostly focused on homogeneous graphs, an important current challenge is extending this methodology for richly heterogeneous graphs and knowledge domains. The biomedical sciences are such a domain, reflecting the complexity of biology, with entities such as genes, proteins, drugs, diseases, and phenotypes, and relationships such as gene co-expression, biochemical regulation, and biomolecular inhibition or activation. Therefore, the semantics of edges and nodes are critical for representation learning and knowledge discovery in real world biomedical problems. In this paper, we propose the edge2vec model, which represents graphs considering edge semantics. An edge-type transition matrix is trained by an Expectation-Maximization approach, and a stochastic gradient descent model is employed to learn node embedding on a heterogeneous graph via the trained transition matrix. edge2vec is validated on three biomedical domain tasks: biomedical entity classification, compound-gene bioactivity prediction, and biomedical information retrieval. Results show that by considering edge-types into node embedding learning in heterogeneous graphs, \textbf{edge2vec}\ significantly outperforms state-of-the-art models on all three tasks. We propose this method for its added value relative to existing graph analytical methodology, and in the real world context of biomedical knowledge discovery applicability.Comment: 10 page

    TOP2A and EZH2 Provide Early Detection of an Aggressive Prostate Cancer Subgroup.

    Get PDF
    Purpose: Current clinical parameters do not stratify indolent from aggressive prostate cancer. Aggressive prostate cancer, defined by the progression from localized disease to metastasis, is responsible for the majority of prostate cancer–associated mortality. Recent gene expression profiling has proven successful in predicting the outcome of prostate cancer patients; however, they have yet to provide targeted therapy approaches that could inhibit a patient\u27s progression to metastatic disease. Experimental Design: We have interrogated a total of seven primary prostate cancer cohorts (n = 1,900), two metastatic castration-resistant prostate cancer datasets (n = 293), and one prospective cohort (n = 1,385) to assess the impact of TOP2A and EZH2 expression on prostate cancer cellular program and patient outcomes. We also performed IHC staining for TOP2A and EZH2 in a cohort of primary prostate cancer patients (n = 89) with known outcome. Finally, we explored the therapeutic potential of a combination therapy targeting both TOP2A and EZH2 using novel prostate cancer–derived murine cell lines. Results: We demonstrate by genome-wide analysis of independent primary and metastatic prostate cancer datasets that concurrent TOP2A and EZH2 mRNA and protein upregulation selected for a subgroup of primary and metastatic patients with more aggressive disease and notable overlap of genes involved in mitotic regulation. Importantly, TOP2A and EZH2 in prostate cancer cells act as key driving oncogenes, a fact highlighted by sensitivity to combination-targeted therapy. Conclusions: Overall, our data support further assessment of TOP2A and EZH2 as biomarkers for early identification of patients with increased metastatic potential that may benefit from adjuvant or neoadjuvant targeted therapy approaches. ©2017 AACR

    Transforming Graph Representations for Statistical Relational Learning

    Full text link
    Relational data representations have become an increasingly important topic due to the recent proliferation of network datasets (e.g., social, biological, information networks) and a corresponding increase in the application of statistical relational learning (SRL) algorithms to these domains. In this article, we examine a range of representation issues for graph-based relational data. Since the choice of relational data representation for the nodes, links, and features can dramatically affect the capabilities of SRL algorithms, we survey approaches and opportunities for relational representation transformation designed to improve the performance of these algorithms. This leads us to introduce an intuitive taxonomy for data representation transformations in relational domains that incorporates link transformation and node transformation as symmetric representation tasks. In particular, the transformation tasks for both nodes and links include (i) predicting their existence, (ii) predicting their label or type, (iii) estimating their weight or importance, and (iv) systematically constructing their relevant features. We motivate our taxonomy through detailed examples and use it to survey and compare competing approaches for each of these tasks. We also discuss general conditions for transforming links, nodes, and features. Finally, we highlight challenges that remain to be addressed

    Projection Based Models for High Dimensional Data

    Get PDF
    In recent years, many machine learning applications have arisen which deal with the problem of finding patterns in high dimensional data. Principal component analysis (PCA) has become ubiquitous in this setting. PCA performs dimensionality reduction by estimating latent factors which minimise the reconstruction error between the original data and its low-dimensional projection. We initially consider a situation where influential observations exist within the dataset which have a large, adverse affect on the estimated PCA model. We propose a measure of “predictive influence” to detect these points based on the contribution of each point to the leave-one-out reconstruction error of the model using an analytic PRedicted REsidual Sum of Squares (PRESS) statistic. We then develop a robust alternative to PCA to deal with the presence of influential observations and outliers which minimizes the predictive reconstruction error. In some applications there may be unobserved clusters in the data, for which fitting PCA models to subsets of the data would provide a better fit. This is known as the subspace clustering problem. We develop a novel algorithm for subspace clustering which iteratively fits PCA models to subsets of the data and assigns observations to clusters based on their predictive influence on the reconstruction error. We study the convergence of the algorithm and compare its performance to a number of subspace clustering methods on simulated data and in real applications from computer vision involving clustering object trajectories in video sequences and images of faces. We extend our predictive clustering framework to a setting where two high-dimensional views of data have been obtained. Often, only either clustering or predictive modelling is performed between the views. Instead, we aim to recover clusters which are maximally predictive between the views. In this setting two block partial least squares (TB-PLS) is a useful model. TB-PLS performs dimensionality reduction in both views by estimating latent factors that are highly predictive. We fit TB-PLS models to subsets of data and assign points to clusters based on their predictive influence under each model which is evaluated using a PRESS statistic. We compare our method to state of the art algorithms in real applications in webpage and document clustering and find that our approach to predictive clustering yields superior results. Finally, we propose a method for dynamically tracking multivariate data streams based on PLS. Our method learns a linear regression function from multivariate input and output streaming data in an incremental fashion while also performing dimensionality reduction and variable selection. Moreover, the recursive regression model is able to adapt to sudden changes in the data generating mechanism and also identifies the number of latent factors. We apply our method to the enhanced index tracking problem in computational finance

    Large-scale Machine Learning in High-dimensional Datasets

    Get PDF
    • …
    corecore