2,727 research outputs found

    Informative SNP Selection and Validation

    The search for genetic regions associated with complex diseases, such as cancer or Alzheimer's disease, is an important challenge that may lead to better diagnosis and treatment. The existence of millions of DNA variations, primarily single nucleotide polymorphisms (SNPs), may allow the fine dissection of such associations. However, studies seeking disease association are limited by the cost of genotyping SNPs. Therefore, it is essential to find a small subset of informative SNPs (tag SNPs) that can serve as good representatives of the remaining SNPs. Several informative SNP selection methods have been developed. In our experiments, our methods compare favorably with all the prediction and statistical methods, selecting the fewest informative SNPs. We proposed algorithms for faster prediction, which yielded an acceptable trade-off. We validated our results using the k-fold test and its many variations.

    Genotype/Haplotype Tagging Methods and their Validation

    This study focuses on MLR-tagging for statistical covering, i.e., either maximizing the average R2 for a given number of requested tags or minimizing the number of tags such that for any non-tag SNP there exists a highly correlated (squared correlation R2 > 0.8) tag SNP. We compare it with Tagger, a software tool for selecting tags in the HapMap project. MLR-tagging needs fewer tags than Tagger in all but 2 of the 6 cases in the given test sets. Meanwhile, biologists can detect or collect data only from a small sample, which makes it difficult to estimate the accuracy of tag SNPs when constructing the complete human haplotype map. This study therefore investigates how MLR-tagging for statistical covering performs in an unbiased study. The experimental results show that MLR-tagging still selects a small number of SNPs very well even without observing all SNPs in the sample.
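The covering criterion in this abstract (every non-tag SNP must have a tag SNP with squared correlation R2 > 0.8) can be illustrated with a minimal sketch. This is not MLR-tagging and not Tagger. It is a plain greedy set cover over pairwise r2, shown only to make the criterion concrete. The function names `pairwise_r2` and `greedy_tags`, and the encoding of genotypes as a samples-by-SNPs matrix of 0/1/2 values, are assumptions made for illustration.

```python
import numpy as np

def pairwise_r2(genotypes):
    """Squared Pearson correlation between SNP columns (samples x SNPs)."""
    # Constant (monomorphic) columns produce NaN correlations; treat them as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.atleast_2d(np.corrcoef(genotypes, rowvar=False))
    r2 = np.nan_to_num(r) ** 2
    np.fill_diagonal(r2, 1.0)  # every SNP is perfectly correlated with itself
    return r2

def greedy_tags(genotypes, threshold=0.8):
    """Greedily pick tag SNPs until every SNP has r2 > threshold with a tag.

    Illustrative pairwise greedy cover only (hypothetical helper).
    """
    r2 = pairwise_r2(np.asarray(genotypes, dtype=float))
    covers = r2 > threshold            # covers[i, j]: SNP i as a tag covers SNP j
    uncovered = np.ones(r2.shape[0], dtype=bool)
    tags = []
    while uncovered.any():
        # Number of still-uncovered SNPs each candidate tag would cover.
        gains = (covers & uncovered).sum(axis=1)
        best = int(np.argmax(gains))
        tags.append(best)
        uncovered &= ~covers[best]
    return tags

# Toy data: SNP 0 and SNP 1 are identical, SNP 2 is only weakly correlated.
G = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [2, 2, 1],
    [0, 0, 1],
    [1, 1, 2],
    [2, 2, 2],
])
print(greedy_tags(G))
```

On the toy matrix, one of the two identical SNPs is enough to cover both, while the third SNP has to be tagged by itself. A greedy cover gives no optimality guarantee, so the number of tags it returns should be read as an upper bound on the minimum.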

    Algorithms for Computational Genetics Epidemiology

    The most intriguing problems in genetic epidemiology are to predict genetic disease susceptibility and to associate single nucleotide polymorphisms (SNPs) with diseases. In such studies, it is necessary to resolve the ambiguities in genetic data. The primary obstacle to ambiguity resolution is that the physical methods for separating two haplotypes from an individual genotype (phasing) are too expensive. Although computational haplotype inference is a well-explored problem, high error rates continue to degrade association accuracy. Secondly, it is essential to use a small subset of informative SNPs (tag SNPs) that accurately represent the rest of the SNPs (tagging). Tagging can achieve budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs. Recent successes in high-throughput genotyping technologies have drastically increased the length of available SNP sequences. This raises the importance of informative SNP selection for compacting huge genetic data sets to make fine genotype analysis feasible. Finally, even if complete and accurate data are available, it is unclear whether common statistical methods can determine the susceptibility to complex diseases. The dissertation explores the above computational problems with a variety of methods, including linear algebra, graph theory, linear programming, and greedy methods. The contributions include (1) a significant speed-up of popular phasing tools without compromising their quality, (2) state-of-the-art tagging tools applied to disease association, and (3) a graph-based method for disease tagging and predicting disease susceptibility.

    Tag SNP selection for prediction of tick resistance in Brazilian Braford and Hereford cattle breeds using Bayesian methods.

    Cattle resistance to ticks is known to be under genetic control with a complex biological mechanism within and among breeds. Our aim was to identify genomic segments and tag single nucleotide polymorphisms (SNPs) associated with tick-resistance in Hereford and Braford cattle. The predictive performance of a very low-density tag SNP panel was estimated and compared with results obtained with a 50 K SNP dataset. Article 49

    Inferring the Structure of Signal Transduction Networks from Interactions between Cellular Components and Inferring Haplotypes from Informative SNPS

    Many problems in bioinformatics are inference problems, that is, the objective is to infer something based upon a limited amount of information. In this work we explore two different inference problems in bioinformatics. The first problem is inferring the structure of signal transduction networks from interactions between pairs of cellular components. We present two contributions towards the solution to this problem: a mixed integer program that produces an exact solution, and a Java implementation of an approximation algorithm originally described by DasGupta et al. An exact solution is obtained for a problem instance consisting of real data. The second problem this thesis examines is inferring complete haplotypes from informative SNPs. We describe and implement two variants of the linear algebraic method for haplotype prediction and tag SNP selection, and summarize the results.

    Discrete Algorithms for Analysis of Genotype Data

    The accessibility of high-throughput genotyping technology makes genome-wide association studies possible for common complex diseases. When dealing with common diseases, it is necessary to search for and analyze multiple independent causes resulting from interactions of multiple genes scattered over the entire genome. Optimization formulations for searching for disease-associated risk/resistance factors and predicting disease susceptibility for a given case-control study have been introduced. Several discrete methods for disease association search exploiting a greedy strategy and topological properties of case-control studies have been developed. New disease susceptibility prediction methods based on the developed search methods have been validated on datasets from case-control studies for several common diseases. In our experiments, the proposed algorithms compare favorably with the existing association search and susceptibility prediction methods.

    Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

    Recent advances of information technology in biomedical sciences and other applied areas have created numerous large, diverse data sets with a high dimensional feature space, which provide a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, the continuous arrival of new data also creates great challenges, requiring researchers to convert these raw data into scientific knowledge in order to benefit from them. Association studies of complex diseases using SNP data have become more and more popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases. The review includes both general feature reduction approaches for high dimensional correlated data and more specific approaches for SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling and machine learning methods, with an emphasis on how to identify interacting loci. Comment: Published at http://dx.doi.org/10.1214/07-SS026 in Statistics Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Hierarchical bayesian models for genome-wide association studies

    I consider a well-known problem in the field of statistical genetics called a genome-wide association study (GWAS), where the goal is to identify a set of genetic markers that are associated with a disease. A typical GWAS data set contains, for thousands of unrelated individuals, a set of hundreds of thousands of markers, a set of other covariates such as age, gender, smoking status and other risk factors, and a response variable that indicates the presence or absence of a particular disease. Due to biological phenomena such as the recombination of DNA and linkage disequilibrium, parents are more likely to pass parts of DNA that lie close to each other on a chromosome together to their offspring; this non-random association between adjacent markers leads to strong correlation between markers in GWAS data sets. As a statistician, I reduce the complex problem of GWAS to its essentials, i.e., variable selection on a large-p-small-n data set that exhibits multicollinearity, and develop solutions that complement and advance the current state-of-the-art methods. Before outlining and explaining my contributions to the field in detail, I present a literature review that summarizes the history of GWAS and the relevant tools and techniques that researchers have developed over the years for this problem.

    Early selection enabled by the implementation of genomic selection in Coffea arabica breeding.

    Genomic Selection (GS) has allowed the maximization of genetic gains per unit time in several annual and perennial plant species. However, no GS studies have addressed Coffea arabica, the most economically important species of the genus Coffea. Therefore, this study aimed (i) to evaluate the applicability and accuracy of GS in the prediction of the genomic estimated breeding value (GEBV); (ii) to estimate the genetic parameters; and (iii) to evaluate the time reduction of the selection cycle by GS in Arabica coffee breeding. A total of 195 Arabica coffee individuals, belonging to 13 families from the F2, susceptible backcross, and resistant backcross generations, were phenotyped for 18 agronomic traits and genotyped with 21,211 SNP molecular markers. Phenotypic data, measured in 2014, 2015, and 2016, were analyzed by mixed models. GS analyses were performed by the G-BLUP method, using the RKHS (Reproducing Kernel Hilbert Spaces) procedure, with a Bayesian algorithm. Heritabilities and selective accuracies were estimated, revealing moderate to high magnitude for most of the traits evaluated. Results of GS analyses showed the possibility of reducing the cycle time by 50%, maximizing selection gains per unit time. The effect of marker density on GS analyses was evaluated. Genomic selection proved to be promising for C. arabica breeding. The agronomic traits were highly complex, as they are controlled by several QTL, and showed low genomic heritabilities, evidencing the need to incorporate genomic selection methodologies into the breeding programs of this species.