14 research outputs found

    Bipartite Heterogeneous Network Method Based on Co-neighbor for MiRNA-Disease Association Prediction

    Get PDF
    In recent years, miRNA variation and dysregulation have been found to be closely related to human tumors, and identifying miRNA-disease associations is helpful for understanding the mechanisms of disease or tumor development and is greatly significant for the prognosis, diagnosis, and treatment of human diseases. This article proposes a Bipartite Heterogeneous network link prediction method based on co-neighbor to predict miRNA-disease association (BHCN). According to the structural characteristics of the bipartite network, the concept of bipartite network co-neighbors is proposed, and the co-neighbors were used to represent the probability of association between disease and miRNA. To predict the isolated diseases and the new miRNA based on the association probability expressed by co-neighbors, we utilized the similarity between disease nodes and the similarity between miRNA nodes in heterogeneous networks to represent the association probability between disease and miRNA. The model's predictive performance was evaluated by the leave-one-out cross validation (LOOCV) on different datasets. The AUC value of BHCN on the gold benchmark dataset was 0.7973, and the AUC obtained on the prediction dataset was 0.9349, which was better than that of the classic global algorithm. In this case study, we conducted predictive studies on breast neoplasms and colon neoplasms. Most of the top 50 predicted results were confirmed by three databases, namely, HMDD, miR2disease, and dbDEMC, with accuracy rates of 96 and 82%. In addition, BHCN can be used for predicting isolated diseases (without any known associated diseases) and new miRNAs (without any known associated miRNAs). In the isolated disease case study, the top 50 of breast neoplasm and colon neoplasm potentials associated with miRNAs predicted an accuracy of 100 and 96%, respectively, thereby demonstrating the favorable predictive power of BHCN for potentially relevant miRNAs

    LLCMDA: A Novel Method for Predicting miRNA Gene and Disease Relationship Based on Locality-Constrained Linear Coding

    Get PDF
    MiRNAs are small non-coding regulatory RNAs which are associated with multiple diseases. Increasing evidence has shown that miRNAs play important roles in various biological and physiological processes. Therefore, the identification of potential miRNA-disease associations could provide new clues to understanding the mechanism of pathogenesis. Although many traditional methods have been successfully applied to discover part of the associations, they are in general time-consuming and expensive. Consequently, computational-based methods are urgently needed to predict the potential miRNA-disease associations in a more efficient and resources-saving way. In this paper, we propose a novel method to predict miRNA-disease associations based on Locality-constrained Linear Coding (LLC). Specifically, we first reconstruct similarity networks for both miRNAs and diseases using LLC and then apply label propagation on the similarity networks to get relevant scores. To comprehensively verify the performance of the proposed method, we compare our method with several state-of-the-art methods under different evaluation metrics. Moreover, two types of case studies conducted on two common diseases further demonstrate the validity and utility of our method. Extensive experimental results indicate that our method can effectively predict potential associations between miRNAs and diseases

    MDA-SKF: Similarity Kernel Fusion for Accurately Discovering miRNA-Disease Association

    Get PDF
    Identifying accurate associations between miRNAs and diseases is beneficial for diagnosis and treatment of human diseases. It is especially important to develop an efficient method to detect the association between miRNA and disease. Traditional experimental method has high precision, but its process is complicated and time-consuming. Various computational methods have been developed to uncover potential associations based on an assumption that similar miRNAs are always related to similar diseases. In this paper, we propose an accurate method, MDA-SKF, to uncover potential miRNA-disease associations. We first extract three miRNA similarity kernels (miRNA functional similarity, miRNA sequence similarity, Hamming profile similarity for miRNA) and three disease similarity kernels (disease semantic similarity, disease functional similarity, Hamming profile similarity for disease) in two subspaces, respectively. Then, due to limitations that some initial information may be lost in the process and some noises may be exist in integrated similarity kernel, we propose a novel Similarity Kernel Fusion (SKF) method to integrate multiple similarity kernels. Finally, we utilize the Laplacian Regularized Least Squares (LapRLS) method on the integrated kernel to find potential associations. MDA-SKF is evaluated by three evaluation methods, including global leave-one-out cross validation (LOOCV) and local LOOCV and 5-fold cross validation (CV), and achieves AUCs of 0.9576, 0.8356, and 0.9557, respectively. Compared with existing seven methods, MDA-SKF has outstanding performance on global LOOCV and 5-fold. We also test case studies to further analyze the performance of MDA-SKF on 32 diseases. Furthermore, 3200 candidate associations are obtained and a majority of them can be confirmed. It demonstrates that MDA-SKF is an accurate and efficient computational tool for guiding traditional experiments

    mintRULS: Prediction of miRNA-mRNA Target Site Interactions Using Regularized Least Square Method

    Get PDF
    Identification of miRNA-mRNA interactions is critical to understand the new paradigms in gene regulation. Existing methods show suboptimal performance owing to inappropriate feature selection and limited integration of intuitive biological features of both miRNAs and mRNAs. The present regularized least square-based method, mintRULS, employs features of miRNAs and their target sites using pairwise similarity metrics based on free energy, sequence and repeat identities, and target site accessibility to predict miRNA-target site interactions. We hypothesized that miRNAs sharing similar structural and functional features are more likely to target the same mRNA, and conversely, mRNAs with similar features can be targeted by the same miRNA. Our prediction model achieved an impressive AUC of 0.93 and 0.92 in LOOCV and LmiTOCV settings, respectively. In comparison, other popular tools such as miRDB, TargetScan, MBSTAR, RPmirDIP, and STarMir scored AUCs at 0.73, 0.77, 0.55, 0.84, and 0.67, respectively, in LOOCV setting. Similarly, mintRULS outperformed other methods using metrics such as accuracy, sensitivity, specificity, and MCC. Our method also demonstrated high accuracy when validated against experimentally derived data from condition- and cell-specific studies and expression studies of miRNAs and target genes, both in human and mouse

    Assisted Network Analysis in Cancer Genomics

    Get PDF
    Cancer is a molecular disease. In the past two decades, we have witnessed a surge of high- throughput profiling in cancer research and corresponding development of high-dimensional statistical techniques. In this dissertation, the focus is on gene expression, which has played a uniquely important role in cancer research. Compared to some other types of molecular measurements, for example DNA changes, gene expressions are “closer” to cancer outcomes. In addition, processed gene expression data have good statistical properties, in particular, continuity. In the “early” cancer gene expression data analysis, attention has been on marginal properties such as mean and variance. Genes function in a coordinated way. As such, techniques that take a system perspective have been developed to also take into account the interconnections among genes. Among such techniques, graphical models, with lucid biological interpretations and satisfactory statistical properties, have attracted special attention. Graphical model-based analysis can not only lead to a deeper understanding of genes’ properties but also serve as a basis for other analyses, for example, regression and clustering. Cancer molecular studies usually have limited sizes. In the graphical model- based analysis, the number of parameters to be estimated gets squared. Combined together, they lead to a serious lack of information.The overarching goal of this dissertation is to conduct more effective graphical model analysis for cancer gene expression studies. One literature review and three methodological projects have been conducted. The overall strategy is to borrow strength from additional information so as to assist gene expression graphical model estimation. In the first chapter, the literature review is conducted. The methods developed in Chapter 2 and Chapter 4 take advantage of information on regulators of gene expressions (such as methylation, copy number variation, microRNA, and others). As they belong to the vertical data integration framework, we first provide a review of such data integration for gene expression data in Chapter 1. Additional, graphical model-based analysis for gene expression data is reviewed. Research reported in this chapter has led to a paper published in Briefings in Bioinformat- ics. In Chapters 2-4, to accommodate the extreme complexity of information-borrowing for graphical models, three different approaches have been proposed. In Chapter 2, two graphical models, with a gene-expression-only one and a gene-expression-regulator one, are simultaneously considered. A biologically sensible hierarchy between the sparsity structures of these two networks is developed, which is the first of its kind. This hierarchy is then used to link the estimation of the two graphical models. This work has led to a paper published in Genetic Epidemiology. In Chapter 3, additional information is mined from published literature, for example, those deposited at PubMed. The consideration is that published studies have been based on many independent experiments and can contain valuable in- formation on genes’ interconnections. The challenge is to recognize that such information can be partial or even wrong. A two-step approach, consisting of information-guided and information-incorporated estimations, is developed. This work has led to a paper published in Biometrics. In Chapter 4, we slightly shift attention and examine the difference in graphs, which has important implications for understanding cancer development and progression. Our strategy is to link changes in gene expression graphs with those in regulator graphs, which means additional information for estimation. It is noted that to make individual chapters standing-alone, there can be minor overlapping in descriptions. All methodological developments in this research fit the advanced penalization paradigm, which has been popular for cancer gene expression and other molecular data analysis. This methodological coherence is highly desirable. For the methods described in Chapters 2- 4, we have developed new penalized estimations which have lucid interpretations and can directly lead to variable selection (and so sparse and interpretable graphs). We have also developed effective computational algorithms and R codes, which have been made publicly available at Dr. Shuangge Ma’s Github software repository. For the methods described in Chapters 2 and 3, statistical properties under ultrahigh dimensional settings and mild regularity conditions have been established, providing the proposed methods a uniquely strong ground. Statistical properties for the method developed in Chapter 4 are relatively straightforward and hence are omitted. For all the proposed methods, we have conducted extensive simulations, comparisons with the most relevant competitors, and data analysis. The practical advantage is fully established. Overall, this research has delivered a practically sensible information-incorporating strategy for improving graphical model-based analysis for cancer gene expression data, multiple highly competitive methods, R programs that can have broad utilization, and new findings for multiple cancer types

    Learning by Fusing Heterogeneous Data

    Get PDF
    It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives. This often gives rise to data that are represented in totally different input spaces. A basic premise behind the study of learning from heterogeneous data is that in many such cases, there exists some correspondence among certain input dimensions of different input spaces. In our work we found that a key bottleneck that prevents us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities and tasks. We develop interesting and accurate data fusion methods for predictive modeling, which reduce or entirely eliminate some of the basic feature engineering steps that were needed in the past when inferring prediction models from disparate data. In addition, our work has a wide range of applications of which we focus on those from molecular and systems biology: it can help us predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further studies, mine disease associations, detect drug toxicity and regress cancer patient survival data. Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans across heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms are capable of retaining the relational structure of a data system during model inference, which turns out to be vital for good performance of data fusion in certain applications. Our recent work included the study of network inference from many potentially nonidentical data distributions and its application to cancer genomic data. We also model the epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways. A central topic of our Thesis is also the analysis of large data compendia as predictions about certain phenomena, such as associations between diseases and involvement of genes in a certain phenotype, are only possible when dealing with lots of data. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model up to date. We also make interesting observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets

    Following the trail of cellular signatures : computational methods for the analysis of molecular high-throughput profiles

    Get PDF
    Over the last three decades, high-throughput techniques, such as next-generation sequencing, microarrays, or mass spectrometry, have revolutionized biomedical research by enabling scientists to generate detailed molecular profiles of biological samples on a large scale. These profiles are usually complex, high-dimensional, and often prone to technical noise, which makes a manual inspection practically impossible. Hence, powerful computational methods are required that enable the analysis and exploration of these data sets and thereby help researchers to gain novel insights into the underlying biology. In this thesis, we present a comprehensive collection of algorithms, tools, and databases for the integrative analysis of molecular high-throughput profiles. We developed these tools with two primary goals in mind. The detection of deregulated biological processes in complex diseases, like cancer, and the identification of driving factors within those processes. Our first contribution in this context are several major extensions of the GeneTrail web service that make it one of the most comprehensive toolboxes for the analysis of deregulated biological processes and signaling pathways. GeneTrail offers a collection of powerful enrichment and network analysis algorithms that can be used to examine genomic, epigenomic, transcriptomic, miRNomic, and proteomic data sets. In addition to approaches for the analysis of individual -omics types, our framework also provides functionality for the integrative analysis of multi-omics data sets, the investigation of time-resolved expression profiles, and the exploration of single-cell experiments. Besides the analysis of deregulated biological processes, we also focus on the identification of driving factors within those processes, in particular, miRNAs and transcriptional regulators. For miRNAs, we created the miRNA pathway dictionary database miRPathDB, which compiles links between miRNAs, target genes, and target pathways. Furthermore, it provides a variety of tools that help to study associations between them. For the analysis of transcriptional regulators, we developed REGGAE, a novel algorithm for the identification of key regulators that have a significant impact on deregulated genes, e.g., genes that show large expression differences in a comparison between disease and control samples. To analyze the influence of transcriptional regulators on deregulated biological processes,, we also created the RegulatorTrail web service. In addition to REGGAE, this tool suite compiles a range of powerful algorithms that can be used to identify key regulators in transcriptomic, proteomic, and epigenomic data sets. Moreover, we evaluate the capabilities of our tool suite through several case studies that highlight the versatility and potential of our framework. In particular, we used our tools to conducted a detailed analysis of a Wilms' tumor data set. Here, we could identify a circuitry of regulatory mechanisms, including new potential biomarkers, that might contribute to the blastemal subtype's increased malignancy, which could potentially lead to new therapeutic strategies for Wilms' tumors. In summary, we present and evaluate a comprehensive framework of powerful algorithms, tools, and databases to analyze molecular high-throughput profiles. The provided methods are of broad interest to the scientific community and can help to elucidate complex pathogenic mechanisms.Heutzutage werden molekulare Hochdurchsatzmessverfahren, wie Hochdurchsatzsequenzierung, Microarrays, oder Massenspektrometrie, regelmäßig angewendet, um Zellen im großen Stil und auf verschiedenen molekularen Ebenen zu charakterisieren. Die dabei generierten Datensätze sind in der Regel hochdimensional und oft verrauscht. Daher werden leistungsfähige computergestützte Anwendungen benötigt, um deren Analyse zu ermöglichen. In dieser Arbeit präsentieren wir eine Reihe von effektiven Algorithmen, Programmen, und Datenbaken für die Analyse von molekularen Hochdurchsetzdatensätzen. Diese Ansätze wurden entwickelt, um deregulierte biologische Prozesse zu untersuchen und in diesen wichtige Schlüsselmoleküle zu identifizieren. Zusätzlich wurden eine Reihe von Analysen durchgeführt um die verschiedenen Methoden zu evaluieren. Zu diesem Zweck haben wir insbesondere eine Wilmstumor Studie durchgeführt, in der wir verschiedene regulatorische Mechanismen und dazugehörige Biomarker identifizieren konnten, die für die erhöhte Malignität von Wilmstumoren mit blastemreichen Subtyp verantwortlich sein könnten. Diese Erkenntnisse könnten in der Zukunft zu einer verbesserten Behandlung dieser Tumore führen. Diese Ergebnisse zeigen eindrucksvoll, dass unsere Ansätze in der Lage sind, verschiedene molekulare Hochdurchsatzmessungen auszuwerten und dabei helfen können pathogene Mechanismen im Zusammenhang mit Krebs oder anderen komplexen Krankheiten aufzuklären

    Predicting MicroRNA-Disease Associations Using Kronecker Regularized Least Squares Based on Heterogeneous Omics Data

    No full text

    Modularity-based approaches to community detection in multilayer networks with applications toward precision medicine

    Get PDF
    Networks have become an important tool for the analysis of complex systems across many different disciplines including computer science, biology, chemistry, social sciences, and importantly, cancer medicine. Networks in the real world typically exhibit many forms of higher order organization. The subfield of networks analysis known as community detection aims to provide tools for discovering and interpreting the global structure of a networks-based on the connectivity patterns of its edges. In this thesis, we provide an overview of the methods for community detection in networks with an emphasis on modularity-based approaches. We discuss several caveats and drawbacks of currently available methods. We also review the success that network analyses have had in interpreting large scale 'omics' data in the context of cancer biology. In the second and third chapters, we present CHAMP and multimodbp, two useful community detection tools that seek to overcome several of the deficiencies in modularity-based community detection. In the final chapter, we develop a networks-based significance test for addressing an important question in the field of oncology: are mutations in DNA damage repair genes associated with elevated levels of tumor mutational burden. We apply the tools of network analysis to this question and showcase how this approach yields new insight into the structure of the problem, revealing what we call the TMB Paradox. We close by demonstrating the clinical utility of our findings in predicting patient response to novel immunotherapies.Doctor of Philosoph

    Big Data Analytics and Information Science for Business and Biomedical Applications

    Get PDF
    The analysis of Big Data in biomedical as well as business and financial research has drawn much attention from researchers worldwide. This book provides a platform for the deep discussion of state-of-the-art statistical methods developed for the analysis of Big Data in these areas. Both applied and theoretical contributions are showcased
    corecore