2,067 research outputs found

    New approaches to SNV/InDel calling and haplotyping for next-generation sequencing data

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Kunsoo Park.
    Several tools have been developed for calling variants from next-generation sequencing (NGS) data. Although they are generally accurate and reliable, most of them have room for improvement, especially in calling variants in datasets with low read depth coverage. In addition, the somatic variants predicted by several somatic variant callers tend to have very low concordance rates. First, we propose a new tool (RDscan) for improving germline and somatic variant calling in NGS data. RDscan removes misaligned reads, repositions reads, and calculates a confidence score (RDscore) based on the read depth distribution. With RDscore, RDscan improves the precision of variant callers by removing false-positive variants. When we tested RDscan with the latest variant calling algorithms on datasets from the 1000 Genomes Project and Illumina, accuracy improved for most of the algorithms. After screening variants with RDscan, calling accuracy increased for germline variants in 11 out of 12 cases and for somatic variants in 21 out of 24 cases. For the best set of variants produced by optimizing the parameters of each algorithm against the known truth sets, RDscan increased calling accuracy for germline variants in 5 out of 12 cases and for somatic variants in 21 out of 24 cases.
    Some applications require information on the set of variants present on a single genome strand (haplotyping). In particular, precise haplotyping of human leukocyte antigen (HLA) genes is of great clinical importance. Several recent studies have shown that NGS-based methods are feasible and promising for haplotyping HLA genes. To date, however, no method has completely solved the allele phasing issue, even with sufficient read depth. Second, we developed a new method (HLAscan) for HLA haplotyping using NGS data. HLAscan aligns reads to HLA sequences from the IMGT/HLA database of the international ImMunoGeneTics project. The distribution of aligned reads is used to calculate a score function that determines correctly phased alleles by progressively removing false-positive alleles. Comparative HLA typing tests using public datasets from the 1000 Genomes Project and the International HapMap Project demonstrated that HLAscan performs HLA typing more accurately than previously reported NGS-based methods. We also applied HLAscan to a family dataset with various coverage depths generated on the Illumina HiSeq X-TEN platform. HLAscan identified allele types of HLA-A, -B, -C, -DQB1, and -DRB1 with 100% accuracy for sequences at ≥ 90× depth, and the overall accuracy was 96.9%.
    Contents: Abstract; Contents; List of Figures; List of Tables; Chapter 1 Introduction (1.1 Background, 1.2 Problem Statement, 1.3 Previous Works and New Results, 1.4 Organization); Chapter 2 SNV and InDel Calling (2.1 Preliminaries, 2.2 Germline Variant Calling Algorithm, 2.3 Somatic Variant Calling Algorithm, 2.4 Results, 2.5 Discussions); Chapter 3 Haplotyping for MHC region (3.1 Preliminaries, 3.2 Haplotyping Algorithm, 3.3 Results, 3.4 Discussions); Chapter 4 Conclusion (4.1 Summary, 4.2 Future Directions); Bibliography

    Introducing deep learning-based methods into the variant calling analysis pipeline

    Biological interpretation of genetic variation enhances our understanding of normal and pathological phenotypes, and may lead to the development of new therapeutics. However, it depends heavily on genomic data analysis, which may be inaccurate because of various sequencing errors and the inconsistencies they cause. Modern analysis pipelines already use heuristic and statistical techniques, but the rate of falsely identified mutations remains high and varies with the particular sequencing technology, settings and variant type. Recently, several tools based on deep neural networks have been published. The neural networks are supposed to find motifs in the data that were not previously seen. The performance of these novel tools is assessed in terms of precision and recall, as well as computational efficiency. Following the established best practices in both variant detection and benchmarking, the discussed tools demonstrate accuracy metrics and computational efficiency that spur further discussion.
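As a rough illustration of the precision and recall assessment mentioned in this abstract, the sketch below compares a call set against a truth set. It is not taken from the listed work: the `Variant` tuple, the function name, and the exact-match comparison are all assumptions made for the example, and established benchmarking tools additionally normalize variant representations before comparing.

```python
from typing import NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a real library type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def precision_recall_f1(called: Set[Variant], truth: Set[Variant]) -> Tuple[float, float, float]:
    """Score a call set against a truth set using exact record matching."""
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

A filter that removes false-positive calls lowers `fp` and so raises precision; recall is unchanged only if no true variants are removed along the way.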

    Parsimony-based genetic algorithm for haplotype resolution and block partitioning

    This dissertation proposes a new algorithm for performing simultaneous haplotype resolution and block partitioning. The algorithm is based on a genetic algorithm approach and the parsimony principle. A multilocus LD measure (Normalized Entropy Difference) is used as the block identification criterion. The proposed algorithm incorporates missing data as a part of the model and allows blocks of arbitrary length. In addition, the algorithm provides scores for the block boundaries, which measure the strength of the boundaries at specific positions. The performance of the proposed algorithm was validated by running it on several publicly available data sets, including the HapMap data, and comparing the results to those of existing state-of-the-art algorithms. The results show that the accuracy of haplotype decomposition achieved by the proposed genetic algorithm falls within the range achieved by the other algorithms. The block structure output by our algorithm generally agrees with the block structure provided for the same data by the other algorithms. Thus, the proposed algorithm can be successfully used for block partitioning and haplotype phasing while providing some valuable new features, such as scores for block boundaries and fully incorporated treatment of missing data. In addition, the proposed algorithm for haplotyping and block partitioning is used in the development of a new clustering algorithm for two-population mixed genotype samples. The proposed clustering algorithm extracts from the given genotype sample two clusters with substantially different block structures and finds the haplotype resolution and block partitioning for each cluster.

    A machine learning pipeline for quantitative phenotype prediction from genotype data

    Background: Quantitative phenotypes emerge everywhere in systems biology and biomedicine, either because of direct interest in quantitative traits or because of high individual variability that makes it hard or impossible to classify samples into distinct categories, as is often the case with complex common diseases. Machine learning approaches to genotype-phenotype mapping may significantly improve Genome-Wide Association Studies (GWAS) results by explicitly focusing on predictivity and optimal feature selection in a multivariate setting. It is, however, essential that stringent and well-documented Data Analysis Protocols (DAP) are used to control sources of variability and ensure reproducibility of results. We present a genome-to-phenotype pipeline of machine learning modules for quantitative phenotype prediction. The pipeline can be applied for the direct use of whole-genome information in functional studies. As a realistic example, the problem of fitting complex phenotypic traits in heterogeneous stock mice from single nucleotide polymorphisms (SNPs) is considered here.
    Methods: The core element in the pipeline is the L1L2 regularization method based on the naïve elastic net. The method gives at the same time a regression model and a dimensionality reduction procedure suitable for correlated features. Model and SNP markers are selected through a DAP originally developed in the MAQC-II collaborative initiative of the U.S. FDA for the identification of clinical biomarkers from microarray data. The L1L2 approach is compared with standard Support Vector Regression (SVR) and with Recursive Jump Monte Carlo Markov Chain (MCMC). Algebraic indicators of stability of partial lists are used for model selection; the final panel of markers is obtained by a procedure at the chromosome scale, termed 'saturation', to recover SNPs in Linkage Disequilibrium with those selected.
    Results: With respect to both MCMC and SVR, comparable accuracies are obtained by the L1L2 pipeline. Good agreement is also found between SNPs selected by the L1L2 algorithms and candidate loci previously identified by a standard GWAS. The combination of L1L2-based feature selection with a saturation procedure tackles the issue of neglecting highly correlated features that affects many feature selection algorithms.
    Conclusions: The L1L2 pipeline has proven effective in terms of marker selection and prediction accuracy. This study indicates that machine learning techniques may support quantitative phenotype prediction, provided that adequate DAPs are employed to control bias in model selection.
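To make the combined L1 and L2 penalty concrete, here is a minimal sketch of the textbook naïve elastic net objective: residual sum of squares plus an L1 term and a squared L2 term. The abstract does not give the paper's exact formulation, so the function name, the unscaled loss, and the parameter names `l1` and `l2` are assumptions for illustration only.

```python
from typing import Sequence


def naive_elastic_net_objective(
    X: Sequence[Sequence[float]],  # one row of feature values (e.g. SNP codes) per sample
    y: Sequence[float],            # quantitative phenotype per sample
    beta: Sequence[float],         # regression coefficients, one per feature
    l1: float,                     # weight of the L1 (sparsity) penalty
    l2: float,                     # weight of the squared L2 (grouping) penalty
) -> float:
    """Return RSS + l1 * sum|beta_j| + l2 * sum(beta_j ** 2)."""
    residual_ss = 0.0
    for row, target in zip(X, y):
        prediction = sum(x * b for x, b in zip(row, beta))
        residual_ss += (target - prediction) ** 2
    l1_penalty = l1 * sum(abs(b) for b in beta)
    l2_penalty = l2 * sum(b * b for b in beta)
    return residual_ss + l1_penalty + l2_penalty
```

The L1 term drives some coefficients to exactly zero, which is what yields feature selection, while the L2 term tends to keep correlated features together rather than arbitrarily picking one of them.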

    Analysis of NGS Data from Immune Response and Viral Samples

    This thesis is devoted to designing and applying advanced algorithmic and statistical tools for the analysis of NGS data related to cancer and infectious diseases. The NGS data under investigation are obtained either from host samples or from viral variants. Recently, random peptide phage display libraries (RPPDL) have been applied to studies of the host's antibody response to different diseases. We study the human antibody response to breast cancer and the mouse antibody response to Lyme disease by sequencing whole antibody repertoire profiles, which are represented by RPPDL. Alternatively, instead of sequencing the immune response, NGS can be applied directly to a viral population within an infected host. Specifically, we analyze the following RNA viruses: the human immunodeficiency virus (HIV) and the infectious bronchitis virus (IBV). Sequencing RNA viruses is challenging because the high mutation rate produces many variants within a population. Our results show that NGS helps to understand RNA viruses and to explore their interaction with infected hosts. NGS also helps to analyze the immune response to different diseases and to trace changes in the immune response at different disease stages.