260 research outputs found

    Assessing the Impact of Non-Differential Genotyping Errors on Rare Variant Tests of Association

    Background/Aims: We aim to quantify the effect of non-differential genotyping errors on the power of rare variant tests and identify those situations when genotyping errors are most harmful. Methods: We simulated genotype and phenotype data for a range of sample sizes, minor allele frequencies, disease relative risks and numbers of rare variants. Genotype errors were then simulated using five different error models covering a wide range of error rates. Results: Even at very low error rates, misclassifying a common homozygote as a heterozygote translates into a substantial loss of power, a result that is exacerbated even further as the minor allele frequency decreases. While the power loss from heterozygote to common homozygote errors tends to be smaller for a given error rate, in practice heterozygote to homozygote errors are more frequent and, thus, will have measurable impact on power. Conclusion: Error rates from genotype-calling technology for next-generation sequencing data suggest that substantial power loss may be seen when applying current rare variant tests of association to called genotypes
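The asymmetry described in this abstract can be illustrated with a minimal sketch. It is not the paper's simulation and does not reproduce its five error models. The function names, the Hardy-Weinberg sampling, and the two error rates are all assumptions made for illustration. With 10,000 individuals, a minor allele frequency of 0.005 and a 1% rate of common homozygote to heterozygote errors, about 100 true carriers are expected alongside about 99 false ones, so roughly half of the observed carriers would be spurious.

```python
import random

def simulate_genotypes(n, maf, rng):
    """Draw n diploid genotypes (0, 1 or 2 minor-allele copies), assuming Hardy-Weinberg."""
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]

def add_errors(genotypes, rate_hom_to_het, rate_het_to_hom, rng):
    """Apply the same error rates to every individual (non-differential errors).

    rate_hom_to_het: chance a common homozygote (0) is called a heterozygote (1).
    rate_het_to_hom: chance a heterozygote (1) is called a common homozygote (0).
    """
    called = []
    for g in genotypes:
        if g == 0 and rng.random() < rate_hom_to_het:
            called.append(1)
        elif g == 1 and rng.random() < rate_het_to_hom:
            called.append(0)
        else:
            called.append(g)
    return called

def carrier_count(genotypes):
    """Number of individuals carrying at least one minor allele."""
    return sum(1 for g in genotypes if g > 0)

# Illustrative run: compare true and called carrier counts at a rare variant.
rng = random.Random(1)
true_g = simulate_genotypes(10_000, 0.005, rng)
called_g = add_errors(true_g, 0.01, 0.0, rng)
print("true carriers:", carrier_count(true_g))
print("called carriers:", carrier_count(called_g))
```

Because the errors are applied without regard to phenotype, cases and controls gain false carriers at the same rate, which dilutes any real difference in carrier frequency between them.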

    Methods for statistical and population genetics analyses.

    Genetics studies have advanced rapidly, from candidate region studies to genome wide association studies (GWAS) and next generation sequencing projects. The emergence of new technologies has brought with it an array of statistical challenges. In this thesis, we propose methods for statistical and population genetics in our effort to better understand the underlying architecture of our genomes. GWAS rely on indirect association, testing a reduced set of representative markers (tagSNPs) instead of all variants present in the genome. In the first chapter, we propose a graph-based method to select the optimal set of tagSNPs. We apply our method to chromosome-wide data and show that it outperforms the widely used greedy approach, selecting fewer tagSNPs while maintaining high correlation with non-tagSNP variants. Alignment to a reference sequence is an integral step in many sequencing studies. Multiply mapped reads, reads that align to multiple locations in the reference, are discarded from downstream analyses, resulting in a loss of information. We develop a Gibbs sampling approach to identify the true location of multiply mapped reads obtained from the alignment step. We validate our method using simulation studies. We use the improvement in variant discovery to quantify the effect of including multiply mapped reads in downstream analyses. In the third chapter, we explore the feasibility of admixture mapping, a population genetics tool, in identifying regions harboring rare susceptibility variants. We compare the power of admixture mapping to single marker association studies in detecting causal regions. We find that admixture mapping performs better over a wide range of risk allele frequencies. The site frequency spectrum (SFS) is an important summary statistic in population genetics, encompassing information on selection and demographic history.
We show that estimates of the SFS obtained from genotype calling methods underestimate the number of rare variants, especially singletons and doubletons. We derive a maximum likelihood estimate for the SFS. We demonstrate that our method performs better than SFS obtained from genotype calling algorithms using both simulated and real data examples. Ph.D., Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/89609/1/gopalakr_1.pd
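For readers unfamiliar with the SFS, the sketch below shows how it is tabulated from called genotypes, which is the baseline this abstract says undercounts singletons and doubletons. It is not the thesis's maximum likelihood estimator. The function name and the genotype-matrix layout are assumptions for illustration.

```python
from collections import Counter

def site_frequency_spectrum(genotype_matrix):
    """Tabulate the SFS from called genotypes.

    genotype_matrix[i][j] is the number of minor-allele copies (0, 1 or 2)
    that individual i carries at site j. Returns a dict mapping an allele
    count k >= 1 to the number of sites whose minor allele is seen k times.
    """
    n_sites = len(genotype_matrix[0])
    allele_counts = [sum(row[j] for row in genotype_matrix) for j in range(n_sites)]
    return dict(Counter(c for c in allele_counts if c > 0))

# Three individuals, four sites: one monomorphic site, one singleton, two doubletons.
calls = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 2],
]
print(site_frequency_spectrum(calls))
```

A heterozygote miscalled as a common homozygote removes a site from the singleton class entirely, which is one way genotype calling can deflate the rare end of the spectrum.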

    Application of an efficient Bayesian discretization method to biomedical data

    Background: Several data mining methods require data that are discrete, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD) method for optimal discretization of variables that runs efficiently on high-dimensional biomedical datasets. The EBD method consists of two components, namely, a Bayesian score to evaluate discretizations and a dynamic programming search procedure to efficiently search the space of possible discretizations. We compared the performance of EBD to Fayyad and Irani's (FI) discretization method, which is commonly used for discretization. Results: On 24 biomedical datasets obtained from high-throughput transcriptomic and proteomic studies, the classification performances of the C4.5 classifier and the naïve Bayes classifier were statistically significantly better when the predictor variables were discretized using EBD over FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust, though not statistically significantly so, than FI and produced slightly more complex discretizations than FI. Conclusions: On a range of biomedical datasets, a Bayesian discretization method (EBD) yielded better classification performance and stability but was less robust than the widely used FI discretization method. The EBD discretization method is easy to implement, permits the incorporation of prior knowledge and belief, and is sufficiently fast for application to high-dimensional data.
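The two components named in this abstract, a score for candidate discretizations and a dynamic programming search, can be sketched as follows. This is not the EBD score or the authors' implementation. The score here is an assumed stand-in: the log marginal likelihood of binary labels under a uniform Beta(1,1) prior, minus a hypothetical per-interval penalty. The sketch also assumes binary labels already sorted by the predictor value, with no tied values.

```python
from math import lgamma

def interval_score(n0, n1, penalty=1.0):
    """Assumed score for one interval holding n0 zeros and n1 ones:
    Beta(1,1) log marginal likelihood minus a fixed complexity penalty."""
    return lgamma(n0 + 1) + lgamma(n1 + 1) - lgamma(n0 + n1 + 2) - penalty

def best_discretization(labels, penalty=1.0):
    """Find the highest-scoring split of the label sequence into contiguous intervals.

    labels: binary class labels ordered by the predictor value.
    Returns (total score, list of exclusive interval end indices).
    """
    n = len(labels)
    ones = [0]  # prefix sums of labels, so interval counts cost O(1)
    for y in labels:
        ones.append(ones[-1] + y)
    best = [0.0] + [float("-inf")] * n  # best[e]: top score for labels[:e]
    back = [0] * (n + 1)                # back[e]: start of the last interval
    for end in range(1, n + 1):
        for start in range(end):
            n1 = ones[end] - ones[start]
            n0 = (end - start) - n1
            score = best[start] + interval_score(n0, n1, penalty)
            if score > best[end]:
                best[end], back[end] = score, start
    cuts, end = [], n
    while end > 0:  # walk the back-pointers to recover the intervals
        cuts.append(end)
        end = back[end]
    return best[n], cuts[::-1]

score, cuts = best_discretization([0, 0, 0, 0, 1, 1, 1, 1])
print(cuts)
```

The search works because the score is a sum over intervals, so the best discretization of a prefix can be reused when extending it. That reduces an exponential number of candidate cut sets to a quadratic number of interval evaluations.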

    Economic impact of fluctuations in oilsardine landings in India

    The Indian oilsardine Sardinella longiceps (Valenciennes, 1847) is a significant contributor to the marine fisheries economy of India. The species showed wide fluctuations in landings in the past, and in recent years the decline in landings has become a concern for sustainable harvest of the resource, especially in the context of the climate change regime. The paper analysed the economic impacts of fluctuations in oilsardine landings in terms of gross earnings realised at landing centre and retail levels, inflation in the domestic markets, and external trade during the 2000-2018 period. The analysis indicated that the decline in landings was greater in the state of Kerala than in all-India landings. Inflation at the point of first sales (landing centre level) was higher during the 2000-09 period, whereas at retail market level inflation was highest during 2010-18. The inflationary pressure on domestic consumers in Kerala was greater during the 2010-18 period. The growth in external trade of sardines was in tune with the Indian oilsardine landings in the country.

    Knowledge-based variable selection for learning rules from proteomic data

    Background: The incorporation of biological knowledge can enhance the analysis of biomedical data. We present a novel method that uses a proteomic knowledge base to enhance the performance of a rule-learning algorithm in identifying putative biomarkers of disease from high-dimensional proteomic mass spectral data. In particular, we use the Empirical Proteomics Ontology Knowledge Base (EPO-KB) that contains previously identified and validated proteomic biomarkers to select m/zs in a proteomic dataset prior to analysis to increase performance. Results: We show that using EPO-KB as a pre-processing method, specifically selecting all biomarkers found only in the biofluid of the proteomic dataset, reduces the dimensionality by 95% and provides a statistically significantly greater increase in performance over no variable selection and random variable selection. Conclusion: Knowledge-based variable selection even with a sparsely-populated resource such as the EPO-KB increases overall performance of rule-learning for disease classification from high-dimensional proteomic mass spectra.

    Learning Parsimonious Classification Rules from Gene Expression Data Using Bayesian Networks with Local Structure

    The comprehensibility of good predictive models learned from high-dimensional gene expression data is attractive because it can lead to biomarker discovery. Several good classifiers provide comparable predictive performance but differ in their abilities to summarize the observed data. We extend a Bayesian Rule Learning (BRL-GSS) algorithm, previously shown to be a significantly better predictor than other classical approaches in this domain. It searches a space of Bayesian networks using a decision tree representation of its parameters with global constraints, and infers a set of IF-THEN rules. The number of parameters and therefore the number of rules are combinatorial in the number of predictor variables in the model. We relax these global constraints to learn a more expressive local structure with BRL-LSS. BRL-LSS entails a more parsimonious set of rules because it does not have to generate all combinatorial rules. The search space of local structures is much richer than the space of global structures. We design the BRL-LSS with the same worst-case time-complexity as BRL-GSS while exploring a richer and more complex model space. We measure predictive performance using Area Under the ROC curve (AUC) and Accuracy. We measure model parsimony performance by noting the average number of rules and variables needed to describe the observed data. We evaluate the predictive and parsimony performance of BRL-GSS, BRL-LSS and the state-of-the-art C4.5 decision tree algorithm, across 10-fold cross-validation using ten microarray gene-expression diagnostic datasets. In these experiments, we observe that BRL-LSS is similar to BRL-GSS in terms of predictive performance, while generating a much more parsimonious set of rules to explain the same observed data. BRL-LSS also needs fewer variables than C4.5 to explain the data with similar predictive performance. 
We also conduct a feasibility study to demonstrate the general applicability of our BRL methods on the newer RNA sequencing gene-expression data.

    Using DNA metabarcoding for simultaneous inference of common vampire bat diet and population structure

    Metabarcoding diet analysis has become a valuable tool in animal ecology; however, co-amplified predator sequences are not generally used for anything other than to validate predator identity. Exemplified by the common vampire bat, we demonstrate the use of metabarcoding to infer predator population structure alongside diet assessments. Growing populations of common vampire bats impact human, livestock and wildlife health in Latin America through transmission of pathogens, such as lethal rabies viruses. Techniques to determine large scale variation in vampire bat diet and bat population structure would empower locality- and species-specific projections of disease transmission risks. However, previously used methods are not cost-effective and efficient for large scale applications. Using blood meal and faecal samples from common vampire bats from coastal, Andean and Amazonian regions of Peru, we showcase metabarcoding as a scalable tool to assess vampire bat population structure and feeding preferences. Dietary metabarcoding was highly effective, detecting vertebrate prey in 93.2% of the samples. Bats predominantly preyed on domestic animals, but fed on tapirs at one Amazonian site. In addition, we identified arthropods in 9.3% of samples, likely reflecting consumption of ectoparasites. Using the same data, we document mitochondrial geographic population structure in the common vampire bat in Peru. Such simultaneous inference of vampire bat diet and population structure can enable new insights into the interplay between vampire bat ecology and disease transmission risks. Importantly, the methodology can be incorporated into metabarcoding diet studies of other animals to couple information on diet and population structure

    Analysis of independent cohorts of outbred CFW mice reveals novel loci for behavioral and physiological traits and identifies factors determining reproducibility

    Funding: This work was partially supported by National Institutes of Health grants [R01MH115979 (J.F.), R01GM097737 and P50DA037844 (A.A.P.)]. J.Z. is supported by a National Science Foundation Graduate Research Fellowship under Grant DGE1650604. Publication charges for this article have been funded by 1R01MH115979. J.F., A.A.P., and R.M. conceived the study. J.Z., J.F., and S.G. performed the bioinformatics analysis. C.P. and J.N. prepared the phenotypes. R.W.D. generated the genotypes. J.Z., C.P., S.G., N.C., A.L., A.A.P., and J.F. wrote the manuscript. All authors read and approved the final manuscript. Peer reviewed. Publisher PD