
    Haplotype-Based Association Studies: Approaches to Current Challenges

    Haplotype-based association studies have greatly aided researchers in their attempts to map genes. However, current designs of haplotype-based association studies raise several statistical challenges.

    To reduce the number of variants, some researchers have employed hierarchical clustering. This thesis starts by addressing the multiple testing problem that results from applying a hierarchical clustering procedure to haplotypes and then performing a statistical test for association at each step in the resulting hierarchy. Applying our method to a haplotype case-control dataset, we obtain a global p-value that is markedly inflated relative to the minimum p-value over all steps in the hierarchy.

    The second challenge involves the errors inherent in the prediction programs used to assign a haplotype pair to each individual in a haplotype-based association study. We examined the effect of these misclassification errors on the false positive rate and power of two association tests: the standard likelihood ratio test (LRTstd) and a likelihood ratio test that allows for the misclassification inherent in the haplotype inference procedure (LRTae). Our simulations indicate that (1) permutation methods maintain the correct type I error for each statistic; (2) specific multilocus genotypes that are misclassified as the incorrect haplotype pair are consistently misclassified throughout each dataset; and (3) the LRTae offers a significant power gain over the LRTstd for a subset of the parameter settings. The LRTae showed the greatest benefit over the LRTstd when the cost of phenotyping was very high relative to the cost of genotyping, a situation more likely to arise in a replication study than in a whole-genome association study.

    The third challenge involves the uncertainty regarding the exact distribution of the likelihood ratio test (LRT) statistic for haplotype-based association tests in which many of the haplotype frequency estimates are zero or very small. By simulating datasets with known haplotype frequencies and comparing the empirical distribution with various theoretical distributions, we characterized the distribution of the LRT statistic as a χ² distribution whose degrees of freedom are related to the number of haplotypes with nonzero frequency estimates. Awareness of these potential pitfalls and of strategies to address them will increase the effectiveness of haplotype-based association as a gene-mapping tool.
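    The permutation correction for testing at every step of the hierarchy can be made concrete with a short sketch. This is a minimal illustration, not the thesis's actual procedure: it assumes a chi-square test of case-control counts across haplotype clusters at each step of a fixed hierarchy, and estimates a global p-value by permuting case-control labels and recomputing the minimum p-value over all steps. All function and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

def min_p_over_hierarchy(haplotypes, labels, partitions):
    """Smallest chi-square p-value over all steps of a haplotype hierarchy.

    haplotypes : haplotype index per chromosome
    labels     : 0/1 case-control status per chromosome
    partitions : one dict per hierarchy step mapping haplotype index to a
                 0-based consecutive cluster id
    """
    p_values = []
    for partition in partitions:
        n_clusters = max(partition.values()) + 1
        table = np.zeros((2, n_clusters))
        for h, y in zip(haplotypes, labels):
            table[y, partition[h]] += 1
        if (table.sum(axis=0) == 0).any():  # skip steps with an empty cluster
            continue
        p_values.append(chi2_contingency(table)[1])
    return min(p_values)

def global_p_value(haplotypes, labels, partitions, n_perm=1000):
    """Permutation (min-p) global p-value across all hierarchy steps."""
    observed = min_p_over_hierarchy(haplotypes, labels, partitions)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(labels)
        if min_p_over_hierarchy(haplotypes, perm, partitions) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one to avoid a zero p-value

# Toy usage: three haplotypes tested unmerged, then with 0 and 1 merged.
haps   = np.array([0, 1, 2, 0, 1, 2, 0, 0, 1, 2])
status = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
steps  = [{0: 0, 1: 1, 2: 2},
          {0: 0, 1: 0, 2: 1}]
print(global_p_value(haps, status, steps, n_perm=500))
```

    Because the same hierarchy is rescanned in every permutation, the resulting global p-value accounts for the multiple, correlated tests, which is why it exceeds the naive minimum p-value over the steps.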

    The estimation of cell probabilities in large sparse discrete spaces


    A Methodology to Develop a Decision Model Using a Large Categorical Database with Application to Identifying Critical Variables during a Transport-Related Hazardous Materials Release

    An important problem in the use of large categorical databases is extracting information to make decisions, including the identification of critical variables. Because a dataset with many records, variables, and categories is complex, a methodology for simplifying the variables and measuring their associations is needed to build a decision model. To this end, the proposed methodology uses existing methods for categorical exploratory analysis: latent class analysis to simplify the variables and loglinear modeling to measure their associations, which together constitute a three-step, non-simultaneous approach. This methodology has not previously been used to extract data-driven decision models from large categorical databases.

    A case in point is a large categorical database at the U.S. Department of Transportation (DOT) recording hazardous materials releases during transportation. This dataset is important because of the risk posed by an unintentional release. However, lacking a data-congruent decision model of a hazmat release, current decision making, including critical variable identification, is limited at the Office of Hazardous Materials within the DOT. This modeling gap is paralleled by a similar gap in the hazmat transportation literature, which has an operations research and quantitative risk assessment focus, with models consisting of simple risk equations or more complex, theoretical equations. Based on these opportunities at the DOT and gaps in the literature, the proposed methodology was demonstrated using the hazmat release database. The methodology can also be applied to other categorical databases for extracting decision models, such as those at the National Center for Health Statistics.

    A key goal of the decision model, a Bayesian network, was identification of the variables most influential on two consequences, or measures of risk, in a hazmat release: dollar loss and release quantity. The most influential variables for dollar loss were those related to container failure, specifically the causing object and the item-area of failure on the container. Similarly, for release quantity, the container failure variables were again most influential, specifically the contributing action and the failure mode. In addition, potential changes in these variables for reducing consequences were identified.
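    A minimal sketch of the loglinear-modeling step may help. It tests a single pairwise association in a flattened contingency table by comparing an independence model against the saturated model with a likelihood-ratio (deviance) test. The variable names and counts below are hypothetical, and the thesis's full three-step procedure (latent class analysis, loglinear modeling, Bayesian network construction) is not reproduced here.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical flattened contingency table: release counts by the object
# causing container failure and a binned dollar-loss category.
table = pd.DataFrame({
    "cause": ["forklift", "forklift", "valve", "valve", "corrosion", "corrosion"],
    "loss":  ["low", "high", "low", "high", "low", "high"],
    "count": [120, 35, 80, 12, 40, 28],
})

# Independence model: main effects only, no cause x loss interaction term.
indep = smf.glm("count ~ C(cause) + C(loss)", data=table,
                family=sm.families.Poisson()).fit()

# The residual deviance of the independence model is a likelihood-ratio
# statistic against the saturated model; large values indicate association.
lr_stat = indep.deviance
df = int(indep.df_resid)
p_value = chi2.sf(lr_stat, df)
print(f"G^2 = {lr_stat:.2f}, df = {df}, p = {p_value:.4f}")
```

    A small residual deviance relative to its degrees of freedom would support dropping the cause-by-loss interaction, i.e., treating the two variables as conditionally unrelated when assembling the decision model.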