8 research outputs found
A permutation test for determining significance of clusters with applications to spatial and gene expression data
Hierarchical clustering is a common procedure for identifying structure in a dataset, and it is frequently used for organizing genomic data. Although more advanced clustering algorithms are available, the simplicity and visual appeal of hierarchical clustering have made it ubiquitous in gene expression data analysis. Hence, even minor improvements in this framework would have significant impact. There is currently no simple and systematic way of assessing and displaying the significance of various clusters in a resulting dendrogram without making certain distributional assumptions or ignoring gene-specific variances. In this work, we introduce a permutation test based on comparing the within-cluster structure of the observed data with that of sample datasets obtained by permuting the cluster membership. We carry out this test at each node of the dendrogram using a statistic derived from the singular value decomposition of variance matrices. The p-values thus obtained provide insight into the significance of each cluster division. Given these values, one can also modify the dendrogram by combining non-significant branches. By adjusting the cut-off level of significance for branches, one can produce dendrograms with a desired level of detail for ease of interpretation. We demonstrate the usefulness of this approach by applying it to illustrative datasets.
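The idea of testing a single dendrogram split by permuting cluster membership can be sketched as follows. This is a minimal illustration only, not the authors' method: it swaps in the within-cluster sum of squares as the test statistic instead of the SVD-based statistic described in the abstract, and the names `within_cluster_ss` and `split_p_value` are hypothetical.

```python
import random


def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


def split_p_value(points, labels, n_perm=999, seed=0):
    """Permutation p-value for one split (assumed statistic: within-cluster SS).

    Cluster labels are shuffled across points; a split counts as significant
    when few shuffled labelings are as compact as the observed one.
    """
    rng = random.Random(seed)
    observed = within_cluster_ss(points, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if within_cluster_ss(points, shuffled) <= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Two well-separated groups of four points each (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(split_p_value(points, labels))
```

Applied at every node of a dendrogram, p-values of this kind could then be compared against a chosen cut-off to decide which branches to merge.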
Treatment Outcomes for Adolescents With Multidrug-Resistant Tuberculosis in Lima, Peru
Treatment outcomes for adolescents with multidrug-resistant tuberculosis are rarely reported and, to date, have been poor. Among 90 adolescents from Lima, Peru, 68 (75.6%) achieved cure or completion of treatment. Unsuccessful treatment was less common in the Peru cohort than previously described in the literature.
Risk Adjustment for Lumbar Dysfunction: Comparison of Linear Mixed Models With and Without Inclusion of Between-Clinic Variation as a Random Effect
Background: Valid comparison of patient outcomes of physical therapy care requires risk adjustment for patient characteristics using statistical models. Because patients are clustered within clinics, results of risk adjustment models are likely to be biased by random, unobserved between-clinic differences. Such bias could lead to inaccurate prediction and interpretation of outcomes. Purpose: The purpose of this study was to determine if including between-clinic variation as a random effect would improve the performance of a risk adjustment model for patient outcomes following physical therapy for low back dysfunction. Design: This was a secondary analysis of data from a longitudinal cohort of 147,623 patients with lumbar dysfunction receiving physical therapy in 1,470 clinics in 48 states of the United States. Methods: Three linear mixed models predicting patients' functional status (FS) at discharge, controlling for FS at intake, age, sex, number of comorbidities, surgical history, and health care payer, were developed. Models were: (1) a fixed-effect model, (2) a random-intercept model that allowed clinics to have different intercepts, and (3) a random-slope model that allowed different intercepts and slopes for each clinic. Goodness of fit, residual error, and coefficient estimates were compared across the models. Results: The random-effect model fit the data better and explained an additional 11% to 12% of the between-patient differences compared with the fixed-effect model. Effects of payer, acuity, and number of comorbidities were confounded by random clinic effects. Limitations: Models may not have included some variables associated with FS at discharge. The clinics studied may not be representative of all US physical therapy clinics. Conclusions: Risk adjustment models for functional outcome of patients with lumbar dysfunction that control for between-clinic variation performed better than a model that does not.
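The three model types named in the Methods can be written out in generic form. The notation below is an assumption made for illustration (the abstract gives no equations): i indexes patients, j indexes clinics, y is FS at discharge, and x stands in for a single covariate, since the abstract does not state which covariates carried random slopes.

```latex
% Assumed notation: u_{0j}, u_{1j} are clinic-level random effects,
% \varepsilon_{ij} is the patient-level residual.
\begin{align}
\text{(1) fixed effects only:} \quad
  & y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij} \\
\text{(2) random intercept:} \quad
  & y_{ij} = (\beta_0 + u_{0j}) + \beta_1 x_{ij} + \varepsilon_{ij} \\
\text{(3) random intercept and slope:} \quad
  & y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, x_{ij} + \varepsilon_{ij}
\end{align}
% with u_{0j} \sim N(0, \tau_0^2), \; u_{1j} \sim N(0, \tau_1^2), \;
% \varepsilon_{ij} \sim N(0, \sigma^2).
```

In this form, the between-clinic variation the study refers to corresponds to the variance of the clinic-level terms, which model (1) forces to zero.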
Identifying multidrug resistant tuberculosis transmission hotspots using routinely collected data
In most countries with large drug-resistant tuberculosis epidemics, only those cases at highest risk of multidrug-resistant tuberculosis (MDRTB) receive a drug sensitivity test (DST) at the time of diagnosis. Because of this prioritized testing, identification of MDRTB transmission hotspots in communities where TB cases do not receive DST is challenging, as any observed aggregation of MDRTB may reflect systematic differences in how testing is distributed in communities. We introduce a new disease mapping method, which estimates this missing information through probability-weighted locations, to identify geographic areas of increased risk of MDRTB transmission. We apply this method to routinely collected data from two districts in Lima, Peru over three consecutive years. This method identifies an area in the eastern part of Lima where previously untreated cases have increased risk of MDRTB. This may indicate an area of increased transmission of drug-resistant disease, a finding that may otherwise have been missed by routine analysis of programmatic data. The risk of MDR among retreatment cases is also highest in these probable transmission hotspots, though a high level of MDR among retreatment cases is present throughout the study area. Identifying potential MDRTB transmission hotspots may allow for targeted investigation and deployment of resources.
Metabolomic data presents challenges for epidemiological meta-analysis: a case study of childhood body mass index from the ECHO consortium
Introduction: Meta-analyses across diverse independent studies provide improved confidence in results. However, within the context of metabolomic epidemiology, meta-analysis investigations are complicated by differences in study design, data acquisition, and other factors that may impact reproducibility. Objective: The objective of this study was to identify maternal blood metabolites during pregnancy (> 24 gestational weeks) related to offspring body mass index (BMI) at age two years through a meta-analysis framework. Methods: We used adjusted linear regression summary statistics from three cohorts (total N = 1012 mother-child pairs) participating in the NIH Environmental influences on Child Health Outcomes (ECHO) Program. We applied a random-effects meta-analysis framework to regression results and adjusted by false discovery rate (FDR) using the Benjamini-Hochberg procedure. Results: Only 20 metabolites were detected in all three cohorts, with an additional 127 metabolites detected in two of three cohorts. Of these 147, 6 maternal metabolites were nominally associated (P < 0.05) with offspring BMI z-scores at age 2 years in a meta-analytic framework including at least two studies: arabinose (Coef_meta = 0.40 [95% CI 0.10, 0.70], P_meta = 9.7 × 10^-3), guanidinoacetate (Coef_meta = -0.28 [-0.54, -0.02], P_meta = 0.033), 3-ureidopropionate (Coef_meta = 0.22 [0.017, 0.41], P_meta = 0.033), 1-methylhistidine (Coef_meta = -0.18 [-0.33, -0.04], P_meta = 0.011), serine (Coef_meta = -0.18 [-0.36, -0.01], P_meta = 0.034), and lysine (Coef_meta = -0.16 [-0.32, -0.01], P_meta = 0.044). No associations were robust to multiple testing correction. Conclusions: Despite including three cohorts with large sample sizes (N > 100), we failed to identify significant metabolite associations after FDR correction. Our investigation demonstrates difficulties in applying epidemiological meta-analysis to clinical metabolomics, emphasizes challenges to reproducibility, and highlights the need for standardized best practices in metabolomic epidemiology.
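The Benjamini-Hochberg adjustment mentioned in the Methods can be sketched in a few lines. This is a generic illustration of the procedure, not the study's code; the function name `benjamini_hochberg` and the example p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four tests (illustrative only).
example = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(example))
```

Because each raw p-value is multiplied by the number of tests divided by its rank, nominal associations such as P = 0.03 can lose significance once many metabolites are tested at the same time, which is the pattern the abstract reports.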