13 research outputs found
A permutation-based correction for Pearson's chi-square test on data with an imputed complex outcome / A modified EM algorithm for contingency table analysis with missing data
Studies on human subjects often yield missing data, making progress in this field of inherent public health relevance. Here, two statistical methods are proposed for the analysis of discrete data with missing values. First, when one variable is subject to missingness, it was noted the application of Pearson’s chi-square test to singly-imputed data undermines the variability due to imputation, leading to a type-I error rate larger than the nominal level. This research concerns Pearson’s test on data with an imputed complex outcome, where one of its components suffers from missing values. Imputation in this context may be performed either directly through conditional imputation of the complex outcome given covariates, or indirectly through conditional imputation of its missing component given the covariates and the other, observed component. Although the latter imputation scheme is shown to be more efficient, an existing adjustment method cannot be extended to this scenario due to the lack of independence amongst the variables constituting the complex outcome. As a result, a novel permutation-based correction method for Pearson’s test is proposed. Simulation studies indicate it provides the nominal rejection rate under the null. Second, a modification of the expectation maximization (EM) algorithm for the analysis of discrete data with missing values is presented. In general, the update in the M-step requires either knowing or modeling the missing-data mechanism. However, misspecification of this mechanism may lead to biased estimates of model parameters. Given consistent initial estimates of the parameters (which may be obtained from an external, complete data set, or by recalling a random sample of subjects), the target function is approximated in the M-step with empirical estimates, allowing for unbiased estimation without specification or modeling of the often intangible missing-data mechanism. 
Simulation studies show this modified algorithm yields consistent estimates that are potentially more efficient than the initial estimates, even under non-ignorable missingness.
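To make the idea of a permutation reference distribution for Pearson's statistic concrete, here is a minimal sketch of the generic permutation test of independence between two categorical variables. It is not the imputation-aware correction proposed in the abstract, whose details are not given here; the function names, the number of permutations, and the seed are illustrative assumptions.

```python
import random
from collections import Counter


def chi_square(xs, ys):
    """Pearson's chi-square statistic for two paired categorical sequences."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    observed = Counter(zip(xs, ys))
    stat = 0.0
    for r, n_r in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_r * n_c / n
            stat += (observed.get((r, c), 0) - expected) ** 2 / expected
    return stat


def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Generic permutation p-value: shuffle ys to break any association with xs.

    Assumption: plain label shuffling. This does NOT account for imputation
    and is not the correction method described in the abstract.
    """
    rng = random.Random(seed)
    observed_stat = chi_square(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if chi_square(xs, shuffled) >= observed_stat - 1e-12:
            hits += 1
    # Add-one adjustment keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

The p-value is the share of shuffled data sets whose statistic is at least as large as the observed one, so it does not rely on the asymptotic chi-square distribution.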
A variant of sparse partial least squares for variable selection and data exploration.
When data are sparse and/or predictors are multicollinear, the current implementation of sparse partial least squares (SPLS) neither gives estimates for non-selected predictors nor provides a measure of inference. In response, an approach termed all-possible SPLS is proposed, which fits an SPLS model for all tuning parameter values across a set grid. The percentage of time a given predictor is chosen is noted, as well as the average non-zero parameter estimate. Using a large number of multicollinear predictors, simulation confirmed that variables not associated with the outcome were least likely to be chosen as sparsity increased across the grid of tuning parameters, while the opposite was true for those strongly associated. Variables with a weak association were chosen more often than those with no association, but less often than those with a strong relationship to the outcome. Similarly, predictors most strongly related to the outcome had the largest average parameter estimate magnitude, followed by those with a weak relationship, followed by those with no relationship. Across two independent studies regarding the relationship between volumetric MRI measures and a cognitive test score, this method confirmed a priori hypotheses about which brain regions would be selected most often and have the largest average parameter estimates. In conclusion, the percentage of time a predictor is chosen is a useful measure for ordering the strength of the relationship between the independent and dependent variables, serving as a form of inference. The average parameter estimates give further insight regarding the direction and strength of association. As a result, all-possible SPLS gives more information than the dichotomous output of traditional SPLS, making it useful when undertaking data exploration and hypothesis generation for a large number of potential predictors.
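The two summaries described above (selection percentage and average non-zero estimate) can be sketched as a small aggregation step. This is only an illustration under stated assumptions: the SPLS fits themselves are taken as given, one coefficient vector per tuning-parameter value, and the function name, predictor names, and coefficient values are hypothetical.

```python
def summarize_selection(coef_paths, names):
    """Summarize coefficient vectors fitted across a tuning-parameter grid.

    coef_paths: list of coefficient lists, one per grid value (assumed to come
                from some sparse fitting routine; not computed here).
    names:      predictor names, in the same order as the coefficients.
    """
    n_fits = len(coef_paths)
    summary = {}
    for j, name in enumerate(names):
        # A predictor counts as "chosen" in a fit when its estimate is non-zero.
        nonzero = [fit[j] for fit in coef_paths if fit[j] != 0.0]
        summary[name] = {
            "selected_pct": 100.0 * len(nonzero) / n_fits,
            "mean_nonzero": sum(nonzero) / len(nonzero) if nonzero else None,
        }
    return summary


# Made-up coefficients for three hypothetical predictors over four grid values,
# ordered from least to most sparse.
example_paths = [
    [0.5, 0.1, 0.0],
    [0.4, 0.0, 0.0],
    [0.3, 0.0, 0.0],
    [0.2, 0.0, 0.0],
]
example = summarize_selection(example_paths, ["strong", "weak", "none"])
```

In this toy input the hypothetical "strong" predictor is chosen in every fit, "weak" in one of four, and "none" never, which mirrors the ordering the abstract reports.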
Finishing the euchromatic sequence of the human genome
The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ∼99% of the euchromatic genome and is accurate to an error rate of ∼1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.
Geochemical Characterization of Trace MVT Mineralization in Paleozoic Sedimentary Rocks of Northeastern Wisconsin, USA
Disseminated Mississippi Valley-type (MVT) mineralization occurs throughout northeastern Wisconsin, USA, and is recognized as the source of regionally extensive natural groundwater contamination in the form of dissolved arsenic, nickel, and other related metals. Although considerable attention has been given to arsenic contamination of groundwater in the region, limited attention has been focused on characterizing the bedrock sources of these and other metals. A better understanding of the potential sources of groundwater contamination is needed, especially in areas where groundwater is the dominant source of drinking water. This article describes the regional, stratigraphic, and petrographic distribution of MVT mineralization in Paleozoic rocks of northeastern Wisconsin, with a focus on sulfide minerals. Whole-rock geochemical analysis performed on 310 samples of dolomite, sandstone, and shale shows detectable levels of arsenic, nickel, cobalt, copper, lead, zinc, and other metals related to various sulfide mineral phases identified using scanning electron microscopy. MVT minerals include pyrite, marcasite, sphalerite, galena, chalcopyrite, fluorite, celestine, barite, and others. We describe the first nickel- and cobalt-bearing sulfide mineral phases known from Paleozoic strata in the region. Arsenic, nickel, and cobalt are sometimes present as isomorphous substitutions in pyrite and marcasite, but discrete mineral phases containing nickel and cobalt are also observed, including bravoite and vaesite. Locally abundant stratigraphic zones of sulfide minerals occur across the region, especially in the highly enriched Sulfide Cement Horizon at the top of the Ordovician St. Peter Sandstone. Abundant quantities of sulfides also appear near the contact between the Silurian Mayville Formation and the underlying Maquoketa and Neda formations in certain areas along and east of the Niagara escarpment.
This article illustrates how a detailed geochemical and mineralogical investigation can yield a better understanding of groundwater quality problems.
Higher step length variability indicates lower gray matter integrity of selected regions in older adults
Step length variability (SLV) increases with age in those without overt neurologic disease, is higher in neurologic patients, is associated with falls, and predicts dementia. Whether higher SLV in older adults without neurologic disease indicates the presence of neurologic abnormalities is unknown. Our objective was to identify whether SLV in older adults without overt disease is associated with findings from multimodal neuroimaging. A well-characterized cohort of 265 adults (79-90 years) was concurrently assessed by gait mat, magnetic resonance imaging with diffusion tensor, and neurological exam. Linear regression models adjusted for gait speed, demographic, health, and functional covariates assessed associations of MRI measures (gray matter volume, white matter hyperintensity volume, mean diffusivity, fractional anisotropy) with SLV. Regional distribution of associations was assessed by sparse partial least squares analyses. Higher SLV (mean: 8.4, SD: 3.3) was significantly associated with older age, slower gait speed, and poorer executive function and also with lower gray matter integrity measured by mean diffusivity (standardized β=0.16; p=0.02). Associations between SLV and gray matter integrity were strongest for the hippocampus and anterior cingulate gyrus (both β=0.18) as compared to other regions. Associations of SLV with other neuroimaging markers were not significant. Lower integrity of normal-appearing gray matter may underlie higher SLV in older adults. Our results highlighted the hippocampus and anterior cingulate gyrus, regions involved in memory and executive function. These findings support previous research indicating a role for cognitive function in motor control. Higher SLV may indicate focal neuropathology in those without diagnosed neurologic disease.
Traffic-related air pollution exposures and changes in heart rate variability in Mexico City: A panel study
Background: While air pollution exposures have been linked to cardiovascular outcomes, the contribution from acute gas and particle traffic-related pollutants remains unclear. Using a panel study design with repeated measures, we examined associations between personal exposures to traffic-related air pollutants in Mexico City and changes in heart rate variability (HRV) in a population of researchers aged 22 to 56 years. Methods: Participants were monitored for approximately 9.5 hours for eight days while operating a mobile laboratory van designed to characterize traffic pollutants while driving in traffic and “chasing” diesel buses. We examined the association between HRV parameters (standard deviation of normal-to-normal intervals (SDNN), power in high frequency (HF) and low frequency (LF), and the LF/HF ratio) and the 5-minute maximum (or average in the case of PM2.5) and 30-, 60-, and 90-minute moving averages of air pollutants (PM2.5, O3, CO, CO2, NO2, NOx, and formaldehyde) using single- and two-pollutant linear mixed-effects models. Results: Short-term exposure to traffic-related emissions was associated with statistically significant acute changes in HRV. Gaseous pollutants – particularly ozone – were associated with reductions in time and frequency domain components (α = 0.05), while significant positive associations were observed between PM2.5 and SDNN, HF, and LF. For ozone and formaldehyde, negative associations typically increased in magnitude and significance with increasing averaging periods. The associations for CO, CO2, NO2, and NOx were similar with statistically significant associations observed for SDNN, but not HF or LF. In contrast, PM2.5 increased these HRV parameters. Conclusions: Results revealed an association between traffic-related PM exposures and acute changes in HRV in a middle-aged population when PM exposures were relatively low (14 μg/m3) and demonstrate heterogeneity in the effects of different pollutants, with declines in HRV – especially HF – with ozone and formaldehyde exposures, and increases in HRV with PM2.5 exposure. Given that exposure to traffic-related emissions is associated with increased risk of cardiovascular morbidity and mortality, understanding the mechanisms by which traffic-related emissions can cause cardiovascular disease has significant public health relevance.
Massachusetts Institute of Technology. Mexico City Projec
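Two quantities named in the methods above, SDNN and the moving averages of pollutant concentrations, can be illustrated with a short sketch. It rests on assumptions not stated in the abstract: SDNN is computed as the sample standard deviation of normal-to-normal intervals in milliseconds, the moving average is a trailing window over equally spaced readings, and the function names and input values are made up.

```python
from statistics import stdev


def sdnn(nn_intervals_ms):
    """Standard deviation of normal-to-normal intervals (sample SD, in ms)."""
    return stdev(nn_intervals_ms)


def trailing_mean(values, window):
    """Trailing moving average over `window` consecutive readings.

    Positions with fewer than `window` readings available are returned as None.
    """
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the reading that left the window
        out.append(total / window if i >= window - 1 else None)
    return out


# Illustrative inputs only, not study data.
example_sdnn = sdnn([800, 810, 790, 800])
example_avg = trailing_mean([1, 2, 3, 4], window=2)
```

With readings taken once per minute, a 30-minute moving average would correspond to `window=30`. The sampling interval actually used in the study is not given in the abstract.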