92 research outputs found

    Gene2DisCo : gene to disease using disease commonalities

    Get PDF
    OBJECTIVE: Finding the human genes co-causing complex diseases, also known as "disease-genes", is one of the emerging and challenging tasks in biomedicine. This process, termed gene prioritization (GP), is characterized by a scarcity of known disease-genes for most diseases, and by a vast amount of heterogeneous data, usually encoded into networks describing different types of functional relationships between genes. In addition, different diseases may share common profiles (e.g. genetic or therapeutic profiles), and exploiting disease commonalities may significantly enhance the performance of GP methods. This work aims to provide a systematic comparison of several disease similarity measures, and to embed disease similarities and heterogeneous data into a flexible framework for gene prioritization which specifically handles the lack of known disease-genes. METHODS: We present a novel network-based method, Gene2DisCo, based on generalized linear models (GLMs) to effectively prioritize genes by exploiting data regarding disease-genes, gene interaction networks and disease similarities. The scarcity of disease-genes is addressed by applying an efficient negative selection procedure, together with imbalance-aware GLMs. Gene2DisCo is a flexible framework, in the sense it is not dependent upon specific types of data, and/or upon specific disease ontologies. RESULTS: On a benchmark dataset composed of nine human networks and 708 medical subject headings (MeSH) diseases, Gene2DisCo largely outperformed the best benchmark algorithm, kernelized score functions, in terms of both area under the ROC curve (0.94 against 0.86) and precision at given recall levels (for recall levels from 0.1 to 1 with steps 0.1). Furthermore, we enriched and extended the benchmark data to the whole human genome and provided the top-ranked unannotated candidate genes even for MeSH disease terms without known annotations

    Cross‐species systems analysis of evolutionary toolkits of neurogenomic response to social challenge

    Full text link
    Peer Reviewedhttps://deepblue.lib.umich.edu/bitstream/2027.42/147855/1/gbb12502.pdfhttps://deepblue.lib.umich.edu/bitstream/2027.42/147855/2/gbb12502-sup-0002-TableS1.pdfhttps://deepblue.lib.umich.edu/bitstream/2027.42/147855/3/gbb12502_am.pd

    Understanding co-expressed gene sets by identifying regulators and modeling genomic elements

    Get PDF
    Genomic researchers commonly study complex phenotypes by identifying experimentally derived sets of functionally related genes with similar transcriptional profiles. These gene sets are then frequently subjected to statistical tests of association relating them to previously characterized gene sets from literature and public databases. However, few tools exist examining the non-coding, regulatory sequence of gene sets for evidence of a shared regulatory signature that may signal the involvement of important DNA-binding proteins called transcription factors (TFs). Here, we proposed and developed new computational methods for identifying major regulatory features of co-expressed gene sets that incorporate TF-DNA binding specificities (“motifs”) with other important features such as sequence conservation and chromatin structure. We additionally demonstrated a novel approach for discovering regulatory signatures that are shared across gene sets from multiple experimental conditions or tissues. Given the co-expressed genes of a particular cell type, we also attempted to annotate their specific regulatory sequences (“enhancers”) by constructing models of enhancer activity that incorporate the expression and binding specificities of the relevant transcription factors. We first developed and tested these models in well-characterized cell types, and then evaluated the extent to which these models were applicable using only minimal experimental evidence to poorly characterized systems without known transcriptional regulators and functional enhancers. Finally, we developed a network-based algorithm for examining novel gene sets that integrates many diverse types of biological evidences and relationships to better discover functionally related genes. This novel approach processed a comprehensive, heterogeneous network of biological knowledge and ranked genes and molecular properties represented in the network for their relevance to the given set of co-expressed genes

    큰 그래프 상에서의 개인화된 페이지 랭크에 대한 빠른 계산 기법

    Get PDF
    학위논문 (박사) -- 서울대학교 대학원 : 공과대학 전기·컴퓨터공학부, 2020. 8. 이상구.Computation of Personalized PageRank (PPR) in graphs is an important function that is widely utilized in myriad application domains such as search, recommendation, and knowledge discovery. Because the computation of PPR is an expensive process, a good number of innovative and efficient algorithms for computing PPR have been developed. However, efficient computation of PPR within very large graphs with over millions of nodes is still an open problem. Moreover, previously proposed algorithms cannot handle updates efficiently, thus, severely limiting their capability of handling dynamic graphs. In this paper, we present a fast converging algorithm that guarantees high and controlled precision. We improve the convergence rate of traditional Power Iteration method by adopting successive over-relaxation, and initial guess revision, a vector reuse strategy. The proposed method vastly improves on the traditional Power Iteration in terms of convergence rate and computation time, while retaining its simplicity and strictness. Since it can reuse the previously computed vectors for refreshing PPR vectors, its update performance is also greatly enhanced. Also, since the algorithm halts as soon as it reaches a given error threshold, we can flexibly control the trade-off between accuracy and time, a feature lacking in both sampling-based approximation methods and fully exact methods. Experiments show that the proposed algorithm is at least 20 times faster than the Power Iteration and outperforms other state-of-the-art algorithms.그래프 내에서 개인화된 페이지랭크 (P ersonalized P age R ank, PPR 를 계산하는 것은 검색 , 추천 , 지식발견 등 여러 분야에서 광범위하게 활용되는 중요한 작업 이다 . 개인화된 페이지랭크를 계산하는 것은 고비용의 과정이 필요하므로 , 개인화된 페이지랭크를 계산하는 효율적이고 혁신적인 방법들이 다수 개발되어왔다 . 그러나 수백만 이상의 노드를 가진 대용량 그래프에 대한 효율적인 계산은 여전히 해결되지 않은 문제이다 . 그에 더하여 , 기존 제시된 알고리듬들은 그래프 갱신을 효율적으로 다루지 못하여 동적으로 변화하는 그래프를 다루는 데에 한계점이 크다 . 본 연구에서는 높은 정밀도를 보장하고 정밀도를 통제 가능한 , 빠르게 수렴하는 개인화된 페이지랭크 계산 알고리듬을 제시한다 . 전통적인 거듭제곱법 (Power 에 축차가속완화법 (Successive Over Relaxation) 과 초기 추측 값 보정법 (Initial Guess 을 활용한 벡터 재사용 전략을 적용하여 수렴 속도를 개선하였다 . 제시된 방법은 기존 거듭제곱법의 장점인 단순성과 엄밀성을 유지 하면서 도 수렴율과 계산속도를 크게 개선 한다 . 또한 개인화된 페이지랭크 벡터의 갱신을 위하여 이전에 계산 되어 저장된 벡터를 재사용하 여 , 갱신 에 드는 시간이 크게 단축된다 . 본 방법은 주어진 오차 한계에 도달하는 즉시 결과값을 산출하므로 정확도와 계산시간을 유연하게 조절할 수 있으며 이는 표본 기반 추정방법이나 정확한 값을 산출하는 역행렬 기반 방법 이 가지지 못한 특성이다 . 실험 결과 , 본 방법은 거듭제곱법에 비하여 20 배 이상 빠르게 수렴한다는 것이 확인되었으며 , 기 제시된 최고 성능 의 알고리 듬 보다 우수한 성능을 보이는 것 또한 확인되었다1 Introduction 1 2 Preliminaries: Personalized PageRank 4 2.1 Random Walk, PageRank, and Personalized PageRank. 5 2.1.1 Basics on Random Walk 5 2.1.2 PageRank. 6 2.1.3 Personalized PageRank 8 2.2 Characteristics of Personalized PageRank. 9 2.3 Applications of Personalized PageRank. 12 2.4 Previous Work on Personalized PageRank Computation. 17 2.4.1 Basic Algorithms 17 2.4.2 Enhanced Power Iteration 18 2.4.3 Bookmark Coloring Algorithm. 20 2.4.4 Dynamic Programming 21 2.4.5 Monte-Carlo Sampling. 22 2.4.6 Enhanced Direct Solving 24 2.5 Summary 26 3 Personalized PageRank Computation with Initial Guess Revision 30 3.1 Initial Guess Revision and Relaxation 30 3.2 Finding Optimal Weight of Successive Over Relaxation for PPR. 34 3.3 Initial Guess Construction Algorithm for Personalized PageRank. 36 4 Fully Personalized PageRank Algorithm with Initial Guess Revision 42 4.1 FPPR with IGR. 42 4.2 Optimization. 49 4.3 Experiments. 52 5 Personalized PageRank Query Processing with Initial Guess Revision 56 5.1 PPR Query Processing with IGR 56 5.2 Optimization. 64 5.3 Experiments. 67 6 Conclusion 74 Bibliography 77 Appendix 88 Abstract (In Korean) 90Docto

    Identifying disease-associated genes based on artificial intelligence

    Get PDF
    Identifying disease-gene associations can help improve the understanding of disease mechanisms, which has a variety of applications, such as early diagnosis and drug development. Although experimental techniques, such as linkage analysis, genome-wide association studies (GWAS), have identified a large number of associations, identifying disease genes is still challenging since experimental methods are usually time-consuming and expensive. To solve these issues, computational methods are proposed to predict disease-gene associations. Based on the characteristics of existing computational algorithms in the literature, we can roughly divide them into three categories: network-based methods, machine learning-based methods, and other methods. No matter what models are used to predict disease genes, the proper integration of multi-level biological data is the key to improving prediction accuracy. This thesis addresses some limitations of the existing computational algorithms, and integrates multi-level data via artificial intelligence techniques. The thesis starts with a comprehensive review of computational methods, databases, and evaluation methods used in predicting disease-gene associations, followed by one network-based method and four machine learning-based methods. The first chapter introduces the background information, objectives of the studies and structure of the thesis. After that, a comprehensive review is provided in the second chapter to discuss the existing algorithms as well as the databases and evaluation methods used in existing studies. Having the objectives and future directions, the thesis then presents five computational methods for predicting disease-gene associations. The first method proposed in Chapter 3 considers the issue of non-disease gene selection. A shortest path-based strategy is used to select reliable non-disease genes from a disease gene network and a differential network. The selected genes are then used by a network-energy model to improve its performance. The second method proposed in Chapter 4 constructs sample-based networks for case samples and uses them to predict disease genes. This strategy improves the quality of protein-protein interaction (PPI) networks, which further improves the prediction accuracy. Chapter 5 presents a generic model which applies multimodal deep belief nets (DBN) to fuse different types of data. Network embeddings extracted from PPI networks and gene ontology (GO) data are fused with the multimodal DBN to obtain cross-modality representations. Chapter 6 presents another deep learning model which uses a convolutional neural network (CNN) to integrate gene similarities with other types of data. Finally, the fifth method proposed in Chapter 7 is a nonnegative matrix factorization (NMF)-based method. This method maps diseases and genes onto a lower-dimensional manifold, and the geodesic distance between diseases and genes are used to predict their associations. The method can predict disease genes even if the disease under consideration has no known associated genes. In summary, this thesis has proposed several artificial intelligence-based computational algorithms to address the typical issues existing in computational algorithms. Experimental results have shown that the proposed methods can improve the accuracy of disease-gene prediction

    GeneSet MAPR: Characterization of gene sets through heterogeneous network patterns

    Get PDF
    Often, machine learning and big data concepts are applied to problems without a proper appreciation of their limitations or domain context. At the same time there is a growing appreciation for the ability of networks to represent more complex connections between data points than previous structures. However, established machine learning approaches rarely take advantage of such structures and must be adapted. We present here a method that utilizes patterns of connections within heterogeneous networks to score items by their similarity to an input set. We apply the idea of meta-paths as an abstraction to counteract typical big data problems of noise and overfitting. We also aim to demystify the black-box nature of machine learning by providing intuitive feedback about why items are considered similar. While the method presented here is generalizable to any domain, the specific examples explored are within the genomics domain. The final tool, GeneSet MAPR, is especially useful in a domain with little ground truth and a huge volume of noisy, uncertain data. We show that GeneSet MAPR performs better at discovering related but concealed data points than an approach using the same data without abstraction, as well as a an established state-of-the-art approach that works on a network but ignores the heterogeneous patterns. It does this while providing details the other methods cannot

    Role of network topology based methods in discovering novel gene-phenotype associations

    Get PDF
    The cell is governed by the complex interactions among various types of biomolecules. Coupled with environmental factors, variations in DNA can cause alterations in normal gene function and lead to a disease condition. Often, such disease phenotypes involve coordinated dysregulation of multiple genes that implicate inter-connected pathways. Towards a better understanding and characterization of mechanisms underlying human diseases, here, I present GUILD, a network-based disease-gene prioritization framework. GUILD associates genes with diseases using the global topology of the protein-protein interaction network and an initial set of genes known to be implicated in the disease. Furthermore, I investigate the mechanistic relationships between disease-genes and explain the robustness emerging from these relationships. I also introduce GUILDify, an online and user-friendly tool which prioritizes genes for their association to any user-provided phenotype. Finally, I describe current state-of-the-art systems-biology approaches where network modeling has helped extending our view on diseases such as cancer.La cèl•lula es regeix per interaccions complexes entre diferents tipus de biomolècules. Juntament amb factors ambientals, variacions en el DNA poden causar alteracions en la funció normal dels gens i provocar malalties. Sovint, aquests fenotips de malaltia involucren una desregulació coordinada de múltiples gens implicats en vies interconnectades. Per tal de comprendre i caracteritzar millor els mecanismes subjacents en malalties humanes, en aquesta tesis presento el programa GUILD, una plataforma que prioritza gens relacionats amb una malaltia en concret fent us de la topologia de xarxe. A partir d’un conjunt conegut de gens implicats en una malaltia, GUILD associa altres gens amb la malaltia mitjancant la topologia global de la xarxa d’interaccions de proteïnes. A més a més, analitzo les relacions mecanístiques entre gens associats a malalties i explico la robustesa es desprèn d’aquesta anàlisi. També presento GUILDify, un servidor web de fácil ús per la priorització de gens i la seva associació a un determinat fenotip. Finalment, descric els mètodes més recents en què el model•latge de xarxes ha ajudat extendre el coneixement sobre malalties complexes, com per exemple a càncer

    Computational Labeling, Partitioning, and Balancing of Molecular Networks

    Get PDF
    Recent advances in high throughput techniques enable large-scale molecular quantification with high accuracy, including mRNAs, proteins and metabolites. Differential expression of these molecules in case and control samples provides a way to select phenotype-associated molecules with statistically significant changes. However, given the significance ranking list of molecular changes, how those molecules work together to drive phenotype formation is still unclear. In particular, the changes in molecular quantities are insufficient to interpret the changes in their functional behavior. My study is aimed at answering this question by integrating molecular network data to systematically model and estimate the changes of molecular functional behaviors. We build three computational models to label, partition, and balance molecular networks using modern machine learning techniques. (1) Due to the incompleteness of protein functional annotation, we develop AptRank, an adaptive PageRank model for protein function prediction on bilayer networks. By integrating Gene Ontology (GO) hierarchy with protein-protein interaction network, our AptRank outperforms four state-of-the-art methods in a comprehensive evaluation using benchmark datasets. (2) We next extend our AptRank into a network partitioning method, BioSweeper, to identify functional network modules in which molecules share similar functions and also densely connect to each other. Compared to traditional network partitioning methods using only network connections, BioSweeper, which integrates the GO hierarchy, can automatically identify functionally enriched network modules. (3) Finally, we conduct a differential interaction analysis, namely difFBA, on protein-protein interaction networks by simulating protein fluxes using flux balance analysis (FBA). We test difFBA using quantitative proteomic data from colon cancer, and demonstrate that difFBA offers more insights into functional changes in molecular behavior than does protein quantity changes alone. We conclude that our integrative network model increases the observational dimensions of complex biological systems, and enables us to more deeply understand the causal relationships between genotypes and phenotypes
    corecore