49 research outputs found

    Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering

    We propose two probability-like measures of individual cluster-membership certainty which can be applied to a hard partition of the sample such as that obtained from the Partitioning Around Medoids (PAM) algorithm, hierarchical clustering or k-means clustering. One measure extends the individual silhouette widths and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, however, the measures behave like probabilities and can be used to investigate an individual's tendency to belong to a cluster. We also suggest two possible ways to evaluate the hard partition. We evaluate the performance of both measures in individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm or continuous datasets that have been partitioned by hierarchical clustering and k-means clustering. For comparison, we also present results from soft clustering algorithms such as fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior-probability estimators from either FANNY or the model-based clustering methods. We also illustrate the proposed measures by applying them to Fisher's classic iris data set.

    hapassoc: Software for Likelihood Inference of Trait Associations with SNP Haplotypes and Other Attributes

    Complex medical disorders, such as heart disease and diabetes, are thought to involve a number of genes which act in conjunction with lifestyle and environmental factors to increase disease susceptibility. Associations between complex traits and single nucleotide polymorphisms (SNPs) in candidate genomic regions can provide a useful tool for identifying genetic risk factors. However, analysis of trait associations with single SNPs ignores the potential for extra information from haplotypes, combinations of variants at multiple SNPs along a chromosome inherited from a parent. When haplotype-trait associations are of interest and haplotypes of individuals can be determined, generalized linear models (GLMs) may be used to investigate haplotype associations while adjusting for the effects of non-genetic cofactors or attributes. Unfortunately, haplotypes cannot always be determined cost-effectively when data is collected on unrelated subjects. Uncertain haplotypes may be inferred on the basis of data from single SNPs. However, subsequent analyses of risk factors must account for the resulting uncertainty in haplotype assignment in order to avoid potential errors in interpretation. To account for such uncertainty, we have developed hapassoc, software for R implementing a likelihood approach to inference of haplotype and non-genetic effects in GLMs of trait associations. We provide a description of the underlying statistical method and illustrate the use of hapassoc with examples that highlight the flexibility to specify dominant and recessive effects of genetic risk factors, a feature not shared by other software that restricts users to additive effects only. Additionally, hapassoc can accommodate missing SNP genotypes for limited numbers of subjects.

    elrm: Software Implementing Exact-Like Inference for Logistic Regression Models

    Exact inference is based on the conditional distribution of the sufficient statistics for the parameters of interest given the observed values for the remaining sufficient statistics. Exact inference for logistic regression can be problematic when data sets are large and the support of the conditional distribution cannot be represented in memory. Additionally, these methods are not widely implemented except in commercial software packages such as LogXact and SAS. Therefore, we have developed elrm, software for R implementing (approximate) exact inference for binomial regression models from large data sets. We provide a description of the underlying statistical methods and illustrate the use of elrm with examples. We also evaluate elrm by comparing results with those obtained using other methods.

    A Comparison of Five Methods for Selecting Tagging Single-Nucleotide Polymorphisms

    Our goal was to compare methods for tagging single-nucleotide polymorphisms (tagSNPs) with respect to the power to detect disease association under differing haplotype-disease association models. We were also interested in the effect that SNP selection samples, consisting of either cases, controls, or a mixture, would have on power. We investigated five previously described algorithms for choosing tagSNPs: two that picked SNPs based on haplotype structure (Chapman-haplotypic and Stram), two that picked SNPs based on pair-wise allelic association (Chapman-allelic and Cousin), and one control method that chose equally spaced SNPs (Zhai). In two disease-associated regions from the Genetic Analysis Workshop 14 simulated data, we tested the association between tagSNP genotype and disease over the tagSNP sets chosen by each method for each sampling scheme. This was repeated for 100 replicates to estimate power. The two allelic methods chose essentially all SNPs in the region and had nearly optimal power. The two haplotypic methods chose about half as many SNPs. The haplotypic methods had poor performance compared to the allelic methods in both regions. We expected an improvement in power when the selection sample contained cases; however, there was only moderate variation in power between the sampling approaches for each method. Finally, when compared to the haplotypic methods, the reference method performed as well or worse in the region with ancestral disease haplotype structure.

    CrypticIBDcheck: An R Package For Checking Cryptic Relatedness In Nominally Unrelated Individuals

    Background: In population association studies, standard methods of statistical inference assume that study subjects are independent samples. In genetic association studies, it is therefore of interest to diagnose undocumented close relationships in nominally unrelated study samples. Results: We describe the R package CrypticIBDcheck to identify pairs of closely-related subjects based on genetic marker data from single-nucleotide polymorphisms (SNPs). The package is able to accommodate SNPs in linkage disequilibrium (LD), without the need to thin the markers so that they are approximately independent in the population. Sample pairs are identified by superposing their estimated identity-by-descent (IBD) coefficients on plots of IBD coefficients for pairs of simulated subjects from one of several common close relationships. Conclusions: The methods implemented in CrypticIBDcheck are particularly relevant to candidate-gene association studies, in which dependent SNPs cluster in a relatively small number of genes spread throughout the genome. The accommodation of LD allows the use of all available genetic data, a desirable property when working with a modest number of dependent SNPs within candidate genes. CrypticIBDcheck is available from the Comprehensive R Archive Network (CRAN).

    Genetic Variation in Cell Death Genes and Risk of Non-Hodgkin Lymphoma

    Background: Non-Hodgkin lymphomas are a heterogeneous group of solid tumours that constitute the 5th highest cause of cancer mortality in the United States and Canada. Poor control of cell death in lymphocytes can lead to autoimmune disease or cancer, making genes involved in programmed cell death of lymphocytes logical candidate genes for lymphoma susceptibility. Materials and Methods: We tested for genetic association with NHL and NHL subtypes, of SNPs in lymphocyte cell death genes using an established population-based study. 17 candidate genes were chosen based on biological function, with 123 SNPs tested. These included tagSNPs from HapMap and novel SNPs discovered by re-sequencing 47 cases in genes for which SNP representation was judged to be low. The main analysis, which estimated odds ratios by fitting data to an additive logistic regression model, used European ancestry samples that passed quality control measures (569 cases and 547 controls). A two-tiered approach for multiple testing correction was used: correction for number of tests within each gene by permutation-based methodology, followed by correction for the number of genes tested using the false discovery rate. Results: Variant rs928883, near miR-155, showed an association (OR per A-allele: 2.80 [95% CI: 1.63–4.82]; pF = 0.027) with marginal zone lymphoma that is significant after correction for multiple testing. Conclusions: This is the first reported association between a germline polymorphism at a miRNA locus and lymphoma.