
    Why odds ratio estimates of GWAS are almost always close to 1.0

    “Missing heritability” in genome-wide association studies (GWAS) refers to the seeming inability of GWAS data to capture the great majority of genetic causes of a disease, relative to the disease’s known degree of heritability, despite GWAS’ genome-wide measurement of genetic variation. This paper presents a simple mathematical explanation for this phenomenon, assuming that the heritability information exists in GWAS data. Specifically, it focuses on the fact that the great majority of association measures (in the form of odds ratios) from GWAS are consistently close to the value that indicates no association, explains why this occurs, and deduces two specific forms of epistasis/interaction as its cause. The implication is that GWAS may be able to recover “missing heritability” if the two specific forms of epistasis and gene-environment interaction are fully explored.
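The attenuation the abstract describes can be illustrated with a small numeric sketch. Here, elevated disease risk requires both a SNP-level factor A and an interacting factor B, so the single-SNP (marginal) odds ratio that GWAS would report shrinks toward 1 as B becomes rarer. All frequencies and penetrances below are hypothetical illustration values, not figures from the paper.

```python
def odds(p):
    return p / (1.0 - p)

def marginal_or(f_b, base, joint):
    """Marginal odds ratio for risk factor A when the elevated risk
    `joint` requires BOTH A and an interacting factor B (population
    frequency f_b); otherwise the risk is `base`."""
    p_d_a = f_b * joint + (1.0 - f_b) * base  # risk among A-carriers
    p_d_not_a = base                          # B alone confers no extra risk
    return odds(p_d_a) / odds(p_d_not_a)

# Within the B-carrier stratum the odds ratio for A is large ...
or_within_b = odds(0.10) / odds(0.01)

# ... but the marginal, single-SNP odds ratio shrinks toward 1
# as the interacting partner B becomes rarer.
ors = [marginal_or(f_b, base=0.01, joint=0.10) for f_b in (0.5, 0.1, 0.01)]
```

The stratum-specific odds ratio stays large while the marginal one approaches 1, which is the mechanism the abstract deduces for near-null GWAS odds ratios.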

    The magnitude of Black/Hispanic disparity in COVID-19 mortality across United States counties during the first waves of the COVID-19 pandemic

    Objectives: To quantify the Black/Hispanic disparity in COVID-19 mortality in the United States (US). Methods: COVID-19 deaths in all US counties nationwide were analyzed to estimate COVID-19 mortality rate ratios by county-level proportions of Black/Hispanic residents, using mixed-effects Poisson regression. Excess COVID-19 mortality counts, relative to those predicted under a counterfactual scenario of no racial/ethnic disparity gradient, were estimated. Results: County-level COVID-19 mortality rates increased monotonically with county-level proportions of Black and Hispanic residents, up to 5.4-fold (≥43% Black) and 11.6-fold (≥55% Hispanic) higher compared to counties with <5% Black and <15% Hispanic residents, respectively, controlling for county-level poverty, age, and urbanization level. Had this disparity gradient not existed, the US COVID-19 death count would have been 92.1% lower (177,672 fewer deaths), making the rate comparable to those of other high-income countries with substantially lower COVID-19 death counts. Conclusion: During the first 8 months of the SARS-CoV-2 pandemic, the US experienced the highest number of COVID-19 deaths. This COVID-19 mortality burden is strongly associated with county-level racial/ethnic diversity, which explains most US COVID-19 deaths.
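The two quantities at the core of this analysis, a mortality rate ratio between county strata and an excess-death count under a no-disparity counterfactual, can be sketched as follows. The counts and person-time below are invented for illustration only (chosen so the rate ratio matches the abstract's 5.4-fold figure); the paper itself fits a mixed-effects Poisson regression, which this sketch does not reproduce.

```python
def rate(deaths, person_time):
    """Crude mortality rate: deaths per unit of person-time."""
    return deaths / person_time

def excess_deaths(observed, reference_rate, person_time):
    """Deaths above what the reference stratum's rate would predict
    for the same amount of person-time (the counterfactual)."""
    return observed - reference_rate * person_time

# Hypothetical stratum-level counts (deaths, person-years):
strata = {"<5% Black": (100, 1_000_000),
          ">=43% Black": (540, 1_000_000)}

ref = rate(*strata["<5% Black"])
rr = rate(*strata[">=43% Black"]) / ref      # rate ratio vs. reference
ex = excess_deaths(540, ref, 1_000_000)      # excess under no-disparity scenario
```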

    Relative performance of different exposure modeling approaches for sulfur dioxide concentrations in the air in rural western Canada

    Background: The main objective of this paper is to compare different methods for predicting levels of SO2 air pollution in an oil- and gas-producing area of rural western Canada. Month-long average air quality measurements were collected over a two-year period (2001–2002) at multiple locations, with some side-by-side measurements and repeated time series at selected locations. Methods: We explored how accurately location-specific mean concentrations of SO2 can be predicted for 2002 at 666 locations with multiple measurements. Means of repeated measurements at the 666 locations in 2002 were used as the alloyed gold standard (AGS). First, we considered two approaches: one that uses one measurement from each location of interest, and another that uses context data on the proximity of monitoring sites to putative emission sources in 2002. Second, we imagined that all of the previous year's (2001's) data were also available to exposure assessors: 9,464 measurements and their context (month, proximity to sources). Exposure prediction approaches we explored with the 2001 data included regression modeling using either mixed- or fixed-effects models. Third, we used Bayesian methods to combine single measurements from locations in 2002 (not used to calculate the AGS) with different priors. Results: The regression method that included both fixed and random effects for prediction (Best Linear Unbiased Predictor) had the best agreement with the AGS (Pearson correlation 0.77) and the smallest mean squared error (MSE: 0.03). The second-best method in terms of correlation with the AGS (0.74) and MSE (0.09) was the Bayesian method that uses a normal mixture prior derived from predictions of the 2001 mixed-effects model applied in the 2002 context. Conclusion: Either collecting some measurements from the desired locations and time periods or using predictions from a reasonable empirical mixed-effects model is likely sufficient for most epidemiological applications. The method to be used in any specific investigation will depend on how much uncertainty can be tolerated in exposure assessment and how closely the available data match the circumstances for which estimates/predictions are required.
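The two agreement metrics used to rank the prediction methods, Pearson correlation with the AGS and mean squared error, are straightforward to compute. A minimal sketch (the data here are placeholders; the paper's values of 0.77 and 0.03 come from its own predictions):

```python
import math

def pearson(x, y):
    """Pearson correlation between predictions x and AGS means y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mse(pred, truth):
    """Mean squared error of predictions against the AGS."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)
```

A method would be preferred when its correlation is higher and its MSE lower, the joint criterion used in the Results above.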

    Reliability, Effect Size, and Responsiveness and Intraclass Correlation of Health Status Measures Used in Randomized and Cluster-Randomized Trials

    Background: New health status instruments are described by psychometric properties, such as Reliability, Effect Size, and Responsiveness. For cluster-randomized trials, another important statistic is the Intraclass Correlation for the instrument within clusters. Studies using better instruments can be performed with smaller sample sizes, but better instruments may be more expensive in terms of dollars, lost opportunities, or poorer data quality due to the response burden of longer instruments. Investigators often need to estimate the psychometric properties of a new instrument, or of an established instrument in a new setting. Optimal sample sizes for estimating these properties have not been studied in detail. Methods: We examined the power of a two-sample test as a function of the Reliability, Effect Size, Responsiveness, and Intraclass Correlation of the instrument. We calculated the “cost-effectiveness” of using a 1-item versus a 5-item measure of mental health status. We also used simulation to determine formulas for the sample size needed to estimate the psychometric statistics accurately. Findings: Under the usual model for measurement error, the psychometric statistics are all functions of the same error term. In randomized trials, a poorer instrument can achieve the desired power if the number of persons per treatment group is increased. In cluster-randomized trials, adequate power may be obtained by increasing the number of clusters per treatment group (and often the number of persons per cluster), as well as by choosing a better instrument. The 1-item measure of mental health status may be more cost-effective than the 5-item measure in some settings. Most published psychometric values are situation-specific. Very large samples are required to estimate Responsiveness and the Intraclass Correlation accurately. Conclusion: If the goal is to diagnose or refer individual patients, an instrument with high Validity and Reliability is needed. 
    In settings where sample sizes can be increased easily, less reliable instruments may be cost-effective. It is likely that many published psychometric statistics were derived from samples too small to provide accurate values, or are importantly specific to the setting in which they were derived. Note: A paper based on some of the material in this technical report has been published (Diehr P, Chen L, Patrick D, Feng Z, Yasui Y. Reliability, effect size, and responsiveness of health status measures in the design of randomized and cluster-randomized trials. Contemporary Clinical Trials. 2005;26:45-58). That paper does not include the material on estimating the sample size required to provide an accurate estimate of the reliability of a new instrument; that material is included in this technical report.
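The trade-off between instrument quality and sample size in cluster-randomized designs runs through the standard design effect, 1 + (m − 1) × ICC. A minimal sketch of that calculation (the function names and the example numbers are illustrative, not from the report):

```python
import math

def design_effect(m, icc):
    """Variance inflation for a cluster-randomized design with m
    subjects per cluster and intraclass correlation icc."""
    return 1 + (m - 1) * icc

def clusters_needed(n_individual, m, icc):
    """Clusters per arm required to match the power of an individually
    randomized trial that needs n_individual subjects per arm."""
    return math.ceil(n_individual * design_effect(m, icc) / m)
```

With ICC = 0, clustering costs nothing; as ICC or cluster size grows, either more clusters or a better (less noisy) instrument is needed, which is the choice the abstract describes.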

    Comparative evaluation of gene-set analysis methods

    Background: Multiple data-analytic methods have been proposed for evaluating gene-expression levels in specific biological pathways, assessing differential expression associated with a binary phenotype. Following Goeman and Bühlmann's recent review, we compared the statistical performance of three methods, namely Global Test, ANCOVA Global Test, and SAM-GS, that test "self-contained null hypotheses" via subject sampling. The three methods were compared based on a simulation experiment and analyses of three real-world microarray datasets. Results: In the simulation experiment, we found that the use of the asymptotic distribution in the two Global Tests leads to a statistical test with an incorrect size. Specifically, p-values calculated by the scaled χ² distribution of Global Test and the asymptotic distribution of ANCOVA Global Test are too liberal, while the asymptotic distribution with a quadratic form of the Global Test results in p-values that are too conservative. The two Global Tests with permutation-based inference, however, gave a correct size. While the three methods showed similar power using permutation inference after a proper standardization of gene-expression data, SAM-GS showed slightly higher power than the Global Tests. In the analysis of a real-world microarray dataset, the two Global Tests gave markedly different results from SAM-GS in identifying pathways whose gene expressions are associated with p53 mutation in cancer cell lines. A proper standardization of gene-expression variances is necessary for the two Global Tests to produce biologically sensible results. After the standardization, the three methods gave very similar, biologically sensible results, with slightly higher statistical significance given by SAM-GS. The three methods gave similar patterns of results in the analysis of the other two microarray datasets. Conclusion: An appropriate standardization makes the performance of all three methods similar, given the use of permutation-based inference. SAM-GS tends to have slightly higher power in the lower α-level region (i.e., gene sets that are of the greatest interest). Global Test and ANCOVA Global Test have the important advantage of being able to analyze continuous and survival phenotypes and to adjust for covariates. A free Microsoft Excel add-in to perform SAM-GS is available from http://www.ualberta.ca/~yyasui/homepage.html.
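The permutation-based inference that gave correct test size works by shuffling phenotype labels across subjects, which preserves the correlation among genes in the set. A minimal sketch with a toy set-level statistic (the statistic and data are hypothetical; the actual methods use their own test statistics):

```python
import random

def mean_diff(values, labels):
    """Toy set-level statistic: |mean(cases) - mean(controls)|."""
    g1 = [v for v, l in zip(values, labels) if l == 1]
    g0 = [v for v, l in zip(values, labels) if l == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

def perm_pvalue(stat, values, labels, n_perm=1000, seed=0):
    """Permutation p-value via subject-label shuffling: the observed
    statistic is compared to its null distribution under relabeling."""
    rng = random.Random(seed)
    observed = stat(values, labels)
    hits = sum(1 for _ in range(n_perm)
               if stat(values, rng.sample(labels, len(labels))) >= observed)
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0
```

Because the reference distribution is built from the data themselves, the test attains the correct size that the asymptotic approximations in the two Global Tests missed.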

    An Automated Peak Identification/Calibration Procedure for High-Dimensional Protein Measures From Mass Spectrometers

    Discovery of “signature” protein profiles that distinguish disease states (e.g., malignant, benign, and normal) is a key step towards translating recent advancements in proteomic technologies into clinical utilities. Protein data generated from mass spectrometers are, however, large in size and have complex features due to complexities in both biological specimens and interfering biochemical/physical processes of the measurement procedure. Making sense of such high-dimensional complex data is challenging and necessitates the use of a systematic data-analytic strategy. We propose here a data-processing strategy for two major issues in the analysis of such mass-spectrometry-generated proteomic data: (1) separation of protein “signals” from background “noise” in protein intensity measurements and (2) calibration of protein mass/charge measurements across samples. We illustrate the two issues and the utility of the proposed strategy using data from a prostate cancer biomarker discovery project as an example.
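The first issue, separating "signal" peaks from background "noise" in an intensity trace, can be sketched with a crude baseline estimate and local-maximum thresholding. This is an illustrative heuristic only, not the authors' actual procedure; window width and signal-to-noise threshold are assumed parameters.

```python
def moving_baseline(y, w):
    """Rolling-window minimum as a crude baseline ("noise floor")."""
    n = len(y)
    return [min(y[max(0, i - w):min(n, i + w + 1)]) for i in range(n)]

def find_peaks(y, w=2, snr=3.0):
    """Indices where the baseline-subtracted intensity is a local
    maximum and exceeds snr times the median residual."""
    base = moving_baseline(y, w)
    resid = [a - b for a, b in zip(y, base)]
    med = sorted(resid)[len(resid) // 2] or 1e-9  # guard against 0
    return [i for i in range(1, len(y) - 1)
            if resid[i] > snr * med
            and y[i] >= y[i - 1] and y[i] >= y[i + 1]]
```

The second issue, mass/charge calibration across samples, would then align the detected peak positions between spectra, which this sketch does not attempt.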

    Parallelization of logic regression analysis on SNP-SNP interactions of a Crohn’s disease dataset model

    SNP-SNP interactions have been recognized as fundamentally important for understanding genetic causes of complex disease traits. Logic regression is an effective method for identifying SNP-SNP interactions associated with the risk of complex disease. However, identifying SNP-SNP interactions is computationally challenging and may take hours, weeks, or even months to complete. Although parallel computing is a powerful way to reduce computing time, it is arduous for users to apply it to logic regression analyses of SNP-SNP interactions because it requires advanced programming skills to correctly partition and distribute data, control and monitor tasks across multi-core CPUs or several computers, and merge output files. In this paper, we present a novel R library called SNPInt that automatically speeds up analyses of SNP-SNP interactions in genome-wide association (GWA) studies using parallel computing, without requiring advanced programming skills. The Crohn’s disease GWA dataset from the Wellcome Trust Case Control Consortium (WTCCC), which includes 4,680 individuals genotyped at 500,000 SNPs, was analyzed using logic regression on a computer cluster to evaluate SNPInt's performance. The results from SNPInt with any number of CPUs are identical to those from the non-parallel approach, and the SNPInt library substantially accelerated the logic regression analysis. For instance, with two hundred genes and twenty permutation rounds, the computing time decreased from 7.3 days to only 0.9 days when SNPInt used eight CPUs. Executing analyses of SNP-SNP interactions using the SNPInt library is an effective way to boost performance and simplify the parallelization of such analyses.
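The workload here is embarrassingly parallel: the grid of SNP pairs can be partitioned into chunks, scored independently by workers, and merged in order. A Python sketch of that pattern (SNPInt itself is an R library; `score_pair` below is a hypothetical stand-in for a per-pair logic-regression fit):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def score_pair(pair):
    """Hypothetical placeholder for fitting one SNP pair; a real
    analysis would run a logic-regression model here."""
    i, j = pair
    return (i, j, (i * 31 + j) % 7)  # deterministic stand-in score

def all_pair_scores(n_snps, workers=4, chunk=64):
    """Partition the SNP-pair grid and score chunks concurrently.
    Executor.map preserves input order, so the merged output is
    identical to a serial run, mirroring SNPInt's guarantee."""
    pairs = list(combinations(range(n_snps), 2))
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(score_pair, pairs, chunksize=chunk))
```

Because each pair is scored independently and results are merged deterministically, the output does not depend on the number of workers, the same property the abstract reports for SNPInt.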

    Adaptation of an evidence-based intervention to promote colorectal cancer screening: a quasi-experimental study

    Background: To accelerate the translation of research findings into practice for underserved populations, we investigated the adaptation of an evidence-based intervention (EBI), designed to increase colorectal cancer (CRC) screening in one limited English-proficient (LEP) population (Chinese), for another LEP group (Vietnamese) with overlapping cultural and health beliefs. Methods: Guided by Diffusion of Innovations Theory, we adapted the EBI to achieve greater reach. Core elements of the adapted intervention included: small media (a DVD and pamphlet) translated into Vietnamese from Chinese; medical assistants distributing the small media instead of a health educator; and presentations on CRC screening to the medical assistants. A quasi-experimental study examined CRC screening adherence among eligible Vietnamese patients at the intervention and control clinics, before and after the 24-month intervention. The proportion adherent was assessed using generalized linear mixed models that account for clustering under primary care providers and for within-patient correlation between baseline and follow-up. Results: Our study included two cross-sectional samples: 1,016 patients at baseline (604 in the intervention clinic and 412 in the control clinic) and 1,260 post-intervention (746 in the intervention clinic and 514 in the control clinic), with appreciable overlap between the two time points. The pre-post change in CRC screening over time, expressed as an odds ratio (OR) of CRC screening adherence by time, showed a marginally significant greater increase in CRC screening adherence at the intervention clinic compared to the control clinic (ratio of the two ORs = 1.42; 95% CI 0.95, 2.15).
    In the sample of patients who were non-adherent to CRC screening at baseline, the intervention clinic had a marginally significant greater increase in fecal occult blood testing (FOBT) (adjusted OR = 1.77; 95% CI 0.98, 3.18) and a statistically significant greater increase in CRC screening adherence (adjusted OR = 1.70; 95% CI 1.05, 2.75) compared to the control clinic. Conclusions: Theoretically guided adaptations of EBIs may accelerate the translation of research into practice. Adaptation has the potential to mitigate health disparities for hard-to-reach populations in a timely manner.
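The "ratio of the two ORs" effect measure is a difference-in-differences on the odds scale: each clinic's pre-to-post odds ratio of adherence, with the intervention clinic's divided by the control clinic's. A minimal sketch of the crude (unadjusted) version; the paper's estimate comes from a generalized linear mixed model, and the count pairs used in the test are invented:

```python
def pre_post_or(pre, post):
    """Pre-to-post odds ratio of adherence for one clinic, from
    (adherent, non-adherent) count pairs."""
    return (post[0] / post[1]) / (pre[0] / pre[1])

def ratio_of_ors(interv_pre, interv_post, control_pre, control_post):
    """Difference-in-differences on the odds scale: the intervention
    clinic's time OR divided by the control clinic's time OR."""
    return (pre_post_or(interv_pre, interv_post)
            / pre_post_or(control_pre, control_post))
```

A ratio above 1 indicates that adherence odds grew faster at the intervention clinic than at the control clinic, the quantity estimated as 1.42 above.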