49 research outputs found
Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering
We propose two probability-like measures of individual cluster-membership
certainty which can be applied to a hard partition of the sample such as that
obtained from the Partitioning Around Medoids (PAM) algorithm, hierarchical
clustering or k-means clustering. One measure extends the individual silhouette
widths and the other is obtained directly from the pairwise dissimilarities in
the sample. Unlike the classic silhouette, however, the measures behave like
probabilities and can be used to investigate an individual's tendency to belong
to a cluster. We also suggest two possible ways to evaluate the hard partition.
We evaluate the performance of both measures in individuals with ambiguous
cluster membership, using simulated binary datasets that have been partitioned
by the PAM algorithm or continuous datasets that have been partitioned by
hierarchical clustering and k-means clustering. For comparison, we also present
results from soft clustering algorithms such as fuzzy analysis clustering
(FANNY) and two model-based clustering methods. Our proposed measures perform
comparably to the posterior-probability estimators from either FANNY or the
model-based clustering methods. We also illustrate the proposed measures by
applying them to Fisher's classic iris data set.
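For orientation, the classic individual silhouette width that one of the proposed measures extends can be sketched as follows. This is a minimal illustration of the standard silhouette only, not the authors' probability-like measures, which the abstract does not define. The toy points, labels, and the function name `silhouette_widths` are assumptions made for the example, and the sketch assumes the partition has at least two clusters.

```python
def silhouette_widths(dist, labels):
    """Classic silhouette width s(i) for each observation in a hard partition.

    dist   -- square matrix (list of lists) of pairwise dissimilarities
    labels -- cluster label of each observation (at least two clusters assumed)
    """
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Mean dissimilarity from observation i to each cluster, excluding i itself.
        mean_to = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist[i][j] for j in members) / len(members)
        own = labels[i]
        if own not in mean_to:
            # Observation is alone in its cluster: s(i) = 0 by convention.
            widths.append(0.0)
            continue
        a = mean_to[own]                                     # within-cluster mean
        b = min(v for c, v in mean_to.items() if c != own)   # nearest other cluster
        denom = max(a, b)
        widths.append(0.0 if denom == 0 else (b - a) / denom)
    return widths


# Hypothetical toy data: five points on a line; the point at 3.0 sits
# exactly halfway between the two groups, so its membership is ambiguous.
points = [0.0, 1.0, 5.0, 6.0, 3.0]
labels = [0, 0, 1, 1, 0]
dist = [[abs(p - q) for q in points] for p in points]
widths = silhouette_widths(dist, labels)
```

In this toy partition the ambiguous point receives a silhouette width of zero, while the clearly placed points receive positive widths. Silhouette widths range over [-1, 1], which is why they cannot be read directly as probabilities.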
hapassoc: Software for Likelihood Inference of Trait Associations with SNP Haplotypes and Other Attributes
Complex medical disorders, such as heart disease and diabetes, are thought to involve a number of genes which act in conjunction with lifestyle and environmental factors to increase disease susceptibility. Associations between complex traits and single nucleotide polymorphisms (SNPs) in candidate genomic regions can provide a useful tool for identifying genetic risk factors. However, analysis of trait associations with single SNPs ignores the potential for extra information from haplotypes, combinations of variants at multiple SNPs along a chromosome inherited from a parent. When haplotype-trait associations are of interest and haplotypes of individuals can be determined, generalized linear models (GLMs) may be used to investigate haplotype associations while adjusting for the effects of non-genetic cofactors or attributes. Unfortunately, haplotypes cannot always be determined cost-effectively when data is collected on unrelated subjects. Uncertain haplotypes may be inferred on the basis of data from single SNPs. However, subsequent analyses of risk factors must account for the resulting uncertainty in haplotype assignment in order to avoid potential errors in interpretation. To account for such uncertainty, we have developed hapassoc, software for R implementing a likelihood approach to inference of haplotype and non-genetic effects in GLMs of trait associations. We provide a description of the underlying statistical method and illustrate the use of hapassoc with examples that highlight the flexibility to specify dominant and recessive effects of genetic risk factors, a feature not shared by other software that restricts users to additive effects only. Additionally, hapassoc can accommodate missing SNP genotypes for limited numbers of subjects.
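The distinction between additive, dominant, and recessive effects can be illustrated with a small sketch of how the number of copies of a risk haplotype (0, 1, or 2) might be coded as a model covariate. This is an illustration of the general coding idea only. It is not the hapassoc interface, and the function name `code_haplotype` is hypothetical.

```python
def code_haplotype(copies, model="additive"):
    """Code the number of copies of a risk haplotype as a regression covariate.

    copies -- 0, 1, or 2 copies carried by a subject
    model  -- "additive", "dominant", or "recessive"
    """
    if copies not in (0, 1, 2):
        raise ValueError("copies must be 0, 1, or 2")
    if model == "additive":
        return copies                     # effect scales with the copy count
    if model == "dominant":
        return 1 if copies >= 1 else 0    # one copy is enough for the effect
    if model == "recessive":
        return 1 if copies == 2 else 0    # effect requires two copies
    raise ValueError("unknown model: " + model)


# Covariate values for subjects carrying 0, 1, and 2 copies under each model.
codings = {m: [code_haplotype(c, m) for c in (0, 1, 2)]
           for m in ("additive", "dominant", "recessive")}
```

When haplotypes are uncertain, the copy count itself is not observed, which is the situation the likelihood approach described above is designed to handle.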
elrm: Software Implementing Exact-Like Inference for Logistic Regression Models
Exact inference is based on the conditional distribution of the sufficient statistics for the parameters of interest given the observed values for the remaining sufficient statistics. Exact inference for logistic regression can be problematic when data sets are large and the support of the conditional distribution cannot be represented in memory. Additionally, these methods are not widely implemented except in commercial software packages such as LogXact and SAS. Therefore, we have developed elrm, software for R implementing (approximate) exact inference for binomial regression models from large data sets. We provide a description of the underlying statistical methods and illustrate the use of elrm with examples. We also evaluate elrm by comparing results with those obtained using other methods.
A Comparison of Five Methods for Selecting Tagging Single-Nucleotide Polymorphisms
Our goal was to compare methods for tagging single-nucleotide polymorphisms (tagSNPs) with respect to the power to detect disease association under differing haplotype-disease association models. We were also interested in the effect that SNP selection samples, consisting of either cases, controls, or a mixture, would have on power. We investigated five previously described algorithms for choosing tagSNPs: two that picked SNPs based on haplotype structure (Chapman-haplotypic and Stram), two that picked SNPs based on pair-wise allelic association (Chapman-allelic and Cousin), and one control method that chose equally spaced SNPs (Zhai). In two disease-associated regions from the Genetic Analysis Workshop 14 simulated data, we tested the association between tagSNP genotype and disease over the tagSNP sets chosen by each method for each sampling scheme. This was repeated for 100 replicates to estimate power. The two allelic methods chose essentially all SNPs in the region and had nearly optimal power. The two haplotypic methods chose about half as many SNPs. The haplotypic methods had poor performance compared to the allelic methods in both regions. We expected an improvement in power when the selection sample contained cases; however, there was only moderate variation in power between the sampling approaches for each method. Finally, when compared to the haplotypic methods, the reference method performed as well or worse in the region with ancestral disease haplotype structure.
CrypticIBDcheck: An R Package For Checking Cryptic Relatedness In Nominally Unrelated Individuals
Background
In population association studies, standard methods of statistical inference assume that study subjects are independent samples. In genetic association studies, it is therefore of interest to diagnose undocumented close relationships in nominally unrelated study samples.
Results
We describe the R package CrypticIBDcheck to identify pairs of closely-related subjects based on genetic marker data from single-nucleotide polymorphisms (SNPs). The package is able to accommodate SNPs in linkage disequilibrium (LD), without the need to thin the markers so that they are approximately independent in the population. Sample pairs are identified by superposing their estimated identity-by-descent (IBD) coefficients on plots of IBD coefficients for pairs of simulated subjects from one of several common close relationships.
Conclusions
The methods implemented in CrypticIBDcheck are particularly relevant to candidate-gene association studies, in which dependent SNPs cluster in a relatively small number of genes spread throughout the genome. The accommodation of LD allows the use of all available genetic data, a desirable property when working with a modest number of dependent SNPs within candidate genes. CrypticIBDcheck is available from the Comprehensive R Archive Network (CRAN).
Genetic Variation in Cell Death Genes and Risk of Non-Hodgkin Lymphoma
Background
Non-Hodgkin lymphomas are a heterogeneous group of solid tumours that constitute the 5th highest cause of cancer mortality in the United States and Canada. Poor control of cell death in lymphocytes can lead to autoimmune disease or cancer, making genes involved in programmed cell death of lymphocytes logical candidate genes for lymphoma susceptibility.
Materials and Methods
Using an established population-based study, we tested SNPs in lymphocyte cell death genes for genetic association with NHL and NHL subtypes. Seventeen candidate genes were chosen based on biological function, and 123 SNPs were tested. These included tagSNPs from HapMap and novel SNPs discovered by re-sequencing 47 cases in genes for which SNP representation was judged to be low. The main analysis, which estimated odds ratios by fitting data to an additive logistic regression model, used European ancestry samples that passed quality control measures (569 cases and 547 controls). A two-tiered approach for multiple testing correction was used: correction for the number of tests within each gene by permutation-based methodology, followed by correction for the number of genes tested using the false discovery rate.
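The second tier of such a correction can be sketched with the Benjamini-Hochberg procedure, one common way to control the false discovery rate. The abstract does not state which FDR procedure was used, so this choice is an assumption, and the gene-level p-values below are hypothetical values chosen purely for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values, each assumed to be already corrected
# for the number of SNPs tested within that gene (the first tier).
gene_pvalues = [0.01, 0.04, 0.03, 0.20]
gene_adjusted = benjamini_hochberg(gene_pvalues)
```

Applying the gene-level step to within-gene corrected p-values keeps the number of second-tier tests equal to the number of genes rather than the number of SNPs.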
Results
Variant rs928883, near miR-155, showed an association (OR per A-allele: 2.80 [95% CI: 1.63–4.82]; pF = 0.027) with marginal zone lymphoma that is significant after correction for multiple testing.
Conclusions
This is the first reported association between a germline polymorphism at a miRNA locus and lymphoma.