2,947 research outputs found

    Laplacian normalization and random walk on heterogeneous networks for disease-gene prioritization

    © 2015 Elsevier Ltd. All rights reserved. Random walk on heterogeneous networks is a recently emerging approach to effective disease gene prioritization. Laplacian normalization is a technique for normalizing the weights of the edges in a network. We use this technique to normalize the gene matrix and the phenotype matrix before constructing the heterogeneous network, and also use this idea to define the transition matrices of the heterogeneous network. Our method performs remarkably better than the existing methods at recovering known gene-phenotype relationships. The Shannon information entropy of the distribution of the transition probabilities in our networks is found to be smaller than in the networks constructed by the existing methods, implying that a higher number of top-ranked genes can be verified as disease genes. In fact, in many cases the most probable gene-phenotype relationships ranked within the top 3 or top 5 in our gene lists can be confirmed by the OMIM database. Our algorithms have shown remarkably superior performance over the state-of-the-art algorithms for recovering gene-phenotype relationships. All Matlab code is available upon email request.
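As a rough illustration of the two ingredients named in this abstract, the sketch below applies one common form of Laplacian normalization (dividing each edge weight by the square root of the product of the endpoint degrees) and then runs a random walk with restart. It is not the authors' Matlab code: the function names, the restart value of 0.7, and the 4-node toy network are assumptions made here, and it uses a single homogeneous network rather than the heterogeneous gene-phenotype network of the paper.

```python
import math


def laplacian_normalize(adj):
    """Symmetric normalization: W[i][j] = A[i][j] / sqrt(d_i * d_j)."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [
        [
            adj[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def random_walk_with_restart(w, seed, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until the L1 change is below tol.

    With a symmetrically normalized W the result is a vector of relevance
    scores; it is not guaranteed to sum to 1.
    """
    n = len(w)
    p0 = list(seed)
    p = list(seed)
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy network: a path 0 - 1 - 2 - 3, with node 0 as the known disease gene.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
w = laplacian_normalize(adj)
scores = random_walk_with_restart(w, seed=[1.0, 0.0, 0.0, 0.0])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Candidates are ranked by their final score, so nodes closer to the seed in the toy network receive higher scores than nodes further away.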

    Integration of multiple data sources to prioritize candidate genes using discounted rating system

    Background: Identifying disease genes from a list of candidate genes is an important task in bioinformatics. The main strategy is to prioritize candidate genes based on their similarity to known disease genes. Most existing gene prioritization methods access only one genomic data source, which is noisy and incomplete. Thus, there is a need to integrate multiple data sources containing different information. Results: In this paper, we propose a combination strategy called the discounted rating system (DRS). We performed leave-one-out cross-validation to compare it with the N-dimensional order statistics (NDOS) used in Endeavour. The AUC (Area Under the Curve) values achieved by DRS were comparable with those of NDOS on most of the disease families, but DRS ran much faster than NDOS, especially as the number of data sources increased. With 100 candidate genes and 20 data sources, DRS runs more than 180 times faster than NDOS. Within the DRS framework, we assign different weights to different data sources. The weighted DRS achieved significantly higher AUC values than NDOS. Conclusion: The proposed DRS algorithm is a powerful and effective framework for candidate gene prioritization. If the weights of the different data sources are chosen properly, the DRS algorithm will perform better.
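The abstract compares methods by AUC but does not spell out the DRS formulas, so the sketch below illustrates only the evaluation metric, not DRS or NDOS. It computes AUC as the probability that a known disease gene is scored above a non-disease candidate, with ties counted as one half. The function name, gene identifiers, and scores are hypothetical.

```python
def rank_auc(scores, positives):
    """AUC = probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for gene, s in scores.items() if gene in positives]
    neg = [s for gene, s in scores.items() if gene not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical prioritization scores for four candidate genes.
scores = {"g1": 0.9, "g2": 0.4, "g3": 0.7, "g4": 0.1}
auc_two_hits = rank_auc(scores, positives={"g1", "g3"})
auc_one_miss = rank_auc(scores, positives={"g2"})
print(auc_two_hits, auc_one_miss)
```

In a leave-one-out setting, the held-out disease gene would be the single positive among the candidates, and the per-gene values would be averaged over all held-out genes.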

    Gene prioritization and clustering by multi-view text mining

    Background: Text mining has become a useful tool for biologists trying to understand the genetics of diseases. In particular, it can help identify the most interesting candidate genes for a disease for further experimental analysis. Many text mining approaches have been introduced, but the effectiveness of disease-gene identification varies across text mining models. Thus, incorporating more text mining models may be beneficial for obtaining more refined and accurate knowledge. However, how to combine these models effectively remains a challenging question in machine learning. In particular, it is non-trivial to guarantee that the integrated model performs better than the best individual model. Results: We present a multi-view approach to retrieve biomedical knowledge using different controlled vocabularies. These controlled vocabularies are selected on the basis of nine well-known bio-ontologies and are applied to index the vast amounts of gene-based free-text information available in the MEDLINE repository. The text mining result specified by a vocabulary is considered as a view, and the multiple views obtained are integrated by multi-source learning algorithms. We investigate the effect of integration in two fundamental computational disease gene identification tasks: gene prioritization and gene clustering. The performance of the proposed approach is systematically evaluated and compared on real benchmark data sets. In both tasks, the multi-view approach demonstrates significantly better performance than the other methods compared. Conclusions: In practical research, the relevance of a specific vocabulary to the task is usually unknown. In such cases, multi-view text mining is a superior and promising strategy for text-based disease gene identification.

    MetaRanker 2.0: a web server for prioritization of genetic variation data

    MetaRanker 2.0 is a web server for prioritization of common and rare frequency genetic variation data. Based on heterogeneous data sets including genetic association data, protein–protein interactions, large-scale text-mining data, copy number variation data and gene expression experiments, MetaRanker 2.0 prioritizes the protein-coding part of the human genome to shortlist candidate genes for targeted follow-up studies. MetaRanker 2.0 is made freely available at www.cbs.dtu.dk/services/MetaRanker-2.0

    An informatics system using a network approach for ranking genes in RNA-Seq data

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2019. 8. Kim Sun (김선).
RNA-seq technology has made it possible to analyze the genome-scale transcriptome at high resolution, but because the number of genes appearing in transcriptome data is usually large, it is difficult to identify the genes relevant to a research goal without additional analysis. Transcriptome data are therefore often analyzed using different resources such as biological networks, gene information databases, and the literature. However, the relationships among these resources are heterogeneous, which makes it hard to connect and interpret them directly and to understand concretely which genes are related to the experimental goal. Powerful computational methods that effectively integrate these heterogeneous resources are therefore needed to determine and explain the key genes related to a specific research goal. In this thesis, three bioinformatics systems were developed that use network-based approaches to analyze transcriptome data and find genes related to the experimental goal. The first study developed an informatics system that exploits the characteristics of RNA-Seq data to find important genes in gene knockout (KO) mouse experiments with a small number of samples. The system uses a gene regulatory network (GRN) and pathway information to remove less significant differentially expressed genes (DEGs), and uses single nucleotide variant (SNV) information to remove genes that may differ because of genetic differences between samples. This study showed that integrating network and SNV information can significantly reduce the number of candidate genes.
The second study developed CLIP-GENE, a gene ranking system that can reflect the user's experimental goal. CLIP-GENE is an integrated analysis web service for ranking genes in mouse transcription factor KO experiments. To rank candidate genes, CLIP-GENE uses GRN and SNV information to remove candidate genes that are less significant or that differ between individual samples, and then uses text mining and protein-protein interaction network information to rank the genes related to the user's experimental goal. The last study developed an information system that can compare multiple RNA-Seq experiments using Venn diagrams. RNA-Seq experiments generally produce DEGs by comparing treated and control samples, and the differences between samples are then analyzed with a Venn diagram. However, each region of the Venn diagram contains a varying proportion of the DEGs, and the DEGs in a given region may come from different treatment (or control) groups, so simply comparing gene lists is not appropriate. To solve this problem, Venn-diaNet, an integrated analysis framework that uses Venn diagrams and network propagation, was developed. Venn-diaNet is an information system that can rank genes for experiments with multiple DEG lists. We showed that Venn-diaNet can reproduce the findings reported in the original papers by comparing biological experiments under different conditions.
In summary, this thesis combined network-based analysis with a variety of resources to develop information systems that can rank genes from transcriptome data, and the systems were built and distributed as web tools or software packages with user-friendly UIs for the convenience of other researchers.
Transcriptomic analysis, the measurement of transcripts on the genome scale, is now routinely performed at high resolution. Since the number of genes obtained in the transcriptome data is usually large, it is difficult for researchers to identify genes that are relevant to their research goals without additional analysis. Analysis of transcriptome data is often performed utilizing heterogeneous resources such as biological networks, annotated gene information, and published literature. However, the relationship among heterogeneous resources is often too complicated to decipher which genes are relevant to the experimental design. Therefore, powerful computational methods should be coupled with these heterogeneous resources in order to effectively determine and illustrate key genes that are relevant to specific research goals. In my doctoral study, I have developed three bioinformatics systems that use network approaches to analyze transcriptome data and rank genes that are relevant to the experimental design. The first study was conducted to develop a bioinformatics system that could be used to analyze RNA-Seq data of gene knockout (KO) mice, where the sample number is small. In this case, the main objectives were to investigate how the KO gene affects the expression of other genes and to identify the key genes that contribute significantly to the phenotypic difference. To address these questions, I developed a gene prioritization system that utilizes the characteristics of RNA-Seq data. 
The system prioritizes genes by removing the less informative differentially expressed genes (DEGs) using a gene regulatory network (GRN) and biological pathways. Next, it filters out genes that might be different due to genetic differences between samples using single nucleotide variant (SNV) information. This study demonstrated that the integration of networks and SNV information was able to increase the performance of gene prioritization. The second study was conducted to develop a gene prioritization system that allows the user to specify the context of the experiment. This study was inspired by the fact that the currently available analysis methods for transcriptome data do not fully consider the experimental design of gene KO studies. I envisaged that users would prefer an analysis method that took into consideration the characteristics of the KO experiments and could be guided by the context of the researcher who has designed and performed the biological experiment. Therefore, I developed CLIP-GENE, a web service of the condition-specific context-laid integrative analysis for prioritizing genes in mouse transcription factor (TF) KO experiments. CLIP-GENE prioritizes genes of KO experiments by removing the less informative DEGs using the GRN, discarding genes that might vary between samples using SNV information, and ranking genes that are related to the user's context using text mining together with the shortest path of protein-protein interaction (PPI) from the KO gene to the target genes. The last study was conducted to develop an informatics system that could be used to compare multiple RNA-Seq experiments using Venn diagrams. In general, RNA-Seq experiments are performed to compare samples between control and treated groups, producing a set of DEGs. Each region in a Venn diagram (a subset of DEGs) generally contains a large number of genes, which can complicate the determination of the important and relevant genes. 
Moreover, simply comparing the lists of DEGs from different experiments could be misleading because some of the DEG lists may have been measured using different controls. To address these issues, Venn-diaNet was developed: an analysis framework that integrates Venn diagrams and network propagation to prioritize genes for experiments that have multiple DEG lists. We demonstrated that Venn-diaNet was able to reproduce research findings reported in the original papers by comparing two, three, and eight biological experiments measured in different conditions. I believe that Venn-diaNet can be very useful for researchers to determine genes for their follow-up studies. In summary, my doctoral study aimed to develop computational tools that can prioritize genes from transcriptome data. To achieve this goal, I combined network approaches with multiple heterogeneous resources in a single computational environment. All three informatics systems are deployed as software packages or web tools to give researchers convenient access, eliminating the need to install or learn any additional software packages.
Abstract
Chapter 1 Introduction
    1.1 Challenges of analyzing RNA-seq data
        1.1.1 Excessive amount of databases and analysis methods
        1.1.2 Knowledge bias that prioritizes less relevant genes
        1.1.3 Complicated experiment designs
    1.2 My approach to address the challenges for the analysis of RNA-seq data
    1.3 Background
        1.3.1 Differentially expressed genes
        1.3.2 Gene prioritization
    1.4 Outline of the thesis
Chapter 2 A filtering strategy that combines GRN, SNV information to enhances the gene prioritization in mouse KO studies with small number of samples
    2.1 Background
    2.2 Methods
        2.2.1 First filter: DEG
        2.2.2 Second filter: GRN
        2.2.3 Third filter: Biological Pathway
        2.2.4 Final filter: SNV
    2.3 Results and Discussion
    2.4 Discussion
Chapter 3 An integration of data-fusion and text-mining strategy to prioritize context-laid genes in mouse TF KO experiments
    3.1 Background
    3.2 Methods
        3.2.1 Selection of initial candidate genes
        3.2.2 Prioritizing genes with the user context and PPI
    3.3 Results and Discussion
        3.3.1 Performance with the best context
        3.3.2 Performance with the worst context
    3.4 Discussion
Chapter 4 Integrating Venn diagram to the network-based strategy for comparing multiple biological experiments
    4.1 Background
    4.2 Methods
        4.2.1 Taking input data
        4.2.2 Generating Venn diagram of DEG sets
        4.2.3 Network propagation and gene ranking
    4.3 Results and Discussion
        4.3.1 Venn-diaNet for two experiments
        4.3.2 Venn-diaNet for three experiments
        4.3.3 Venn-diaNet for eight experiments
        4.3.4 Venn-diaNet performance with different PPI network
    4.4 Discussion
Chapter 5 Conclusion
Bibliography
Abstract (in Korean)
Docto
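The Venn diagram step described in the thesis abstract above (grouping DEGs by the set of experiments in which they appear, before any network propagation) might be sketched as follows. This is a minimal illustration and not the Venn-diaNet implementation; the function name, experiment labels, and gene symbols are hypothetical.

```python
def venn_regions(deg_sets):
    """Group genes by the exact set of experiments in which they are DEGs.

    deg_sets maps an experiment name to its set of DEGs. The result maps a
    frozenset of experiment names (one Venn region) to the genes that are
    DEGs in exactly those experiments and in no others.
    """
    regions = {}
    all_genes = set().union(*deg_sets.values())
    for gene in all_genes:
        members = frozenset(name for name, genes in deg_sets.items() if gene in genes)
        regions.setdefault(members, set()).add(gene)
    return regions


# Hypothetical DEG lists from three experiments.
deg_sets = {
    "exp1": {"A", "B", "C"},
    "exp2": {"B", "C", "D"},
    "exp3": {"C", "E"},
}
regions = venn_regions(deg_sets)
for members, genes in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(members), sorted(genes))
```

A user would then pick one region, for example the genes shared by all experiments, as the candidate set to rank.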