8 research outputs found

    Multi-study factor regression model: an application in nutritional epidemiology

    Diet is a risk factor for many diseases. In nutritional epidemiology, studying reproducible dietary patterns is critical to reveal important associations with health. However, this is challenging: diverse cultural and ethnic backgrounds can strongly shape eating patterns, and the resulting heterogeneity can lead to incorrect dietary patterns and obscure the components shared across groups or populations. Moreover, covariate effects generated from observed variables, such as demographics and other confounders, can further bias these dietary patterns. Identifying the shared and group-specific dietary components and covariate effects is essential to draw accurate conclusions. To address these issues, we introduce a new factor regression model, the Multi-Study Factor Regression (MSFR) model. The MSFR model analyzes different populations simultaneously, achieving three goals: capturing shared component(s) across populations, identifying group-specific structures, and correcting for covariate effects. We use this novel method to derive common and ethnic-specific dietary patterns in a multi-center epidemiological study in the Hispanic/Latino community. Our model improves the accuracy of common and group dietary signals and yields better prediction than other techniques, revealing significant associations with health. In summary, we provide a tool to integrate different groups, giving accurate dietary signals crucial to inform public health policy.

    Factor regression for dimensionality reduction and data integration techniques with applications to cancer data

    Two key challenges in modern statistical applications are the large amount of information recorded per individual, and the fact that such data are often not collected all at once but in batches, often causing distortions in both mean and variance. We address both issues by introducing a novel sparse latent factor regression model to integrate heterogeneous data. The model provides a tool that addresses data exploration via dimensionality reduction and corrects the so-called batch effects, and provides sparse low-rank covariance matrix estimates. We study the use of several sparse priors, both local and non-local, to learn the dimension of the latent factors. Our model is fitted in a deterministic fashion by means of an EM algorithm for which we derive closed-form updates; this contributes a novel scalable algorithm for non-local priors, which is of interest beyond the immediate scope of this thesis. We also present several examples, with a focus on bioinformatics applications. Our results mainly show an increase in the accuracy of low-dimensional data reconstructions, with non-local priors substantially improving the inference on factor cardinality and non-zero factor loadings. Moreover, thanks to our batch effect correction, we achieve a considerable improvement in recovering the latent factors. Altogether, this thesis provides a novel approach to latent factor regression that balances sparsity with sensitivity, as well as being highly computationally efficient, and opens new avenues for future research on dimension-reduction-based data integration. The methodology developed in this thesis is available in an R package at https://github.com/AleAviP/BFR.BE.

    Heterogeneous large datasets integration using Bayesian factor regression

    Two key challenges in modern statistical applications are the large amount of information recorded per individual, and that such data are often not collected all at once but in batches. These batch effects can be complex, causing distortions in both mean and variance. We propose a novel sparse latent factor regression model to integrate such heterogeneous data. The model provides a tool for data exploration via dimensionality reduction and sparse low-rank covariance estimation while correcting for a range of batch effects. We study the use of several sparse priors (local and non-local) to learn the dimension of the latent factors. We provide a flexible methodology for sparse factor regression which is not limited to data with batch effects. Our model is fitted in a deterministic fashion by means of an EM algorithm for which we derive closed-form updates, contributing a novel scalable algorithm for non-local priors of interest beyond the immediate scope of this paper. We present several examples, with a focus on bioinformatics applications. Our results show an increase in the accuracy of the dimensionality reduction, with non-local priors substantially improving the reconstruction of factor cardinality. The results of our analyses illustrate how failing to properly account for batch effects can result in unreliable inference. Our model provides a novel approach to latent factor regression that balances sparsity with sensitivity in scenarios both with and without batch effects and is highly computationally efficient.

    DifferentialRegulation: a Bayesian hierarchical approach to identify differentially regulated genes

    MOTIVATION: Although transcriptomics data is typically used to analyse mature spliced mRNA, recent attention has focused on jointly investigating spliced and unspliced (or precursor-) mRNA, which can be used to study gene regulation and changes in gene expression production. Nonetheless, most methods for spliced/unspliced inference (such as RNA velocity tools) focus on individual samples, and rarely allow comparisons between groups of samples (e.g., healthy vs. diseased). Furthermore, this kind of inference is challenging, because spliced and unspliced mRNA abundance is characterized by a high degree of quantification uncertainty, due to the prevalence of multi-mapping reads, i.e., reads compatible with multiple transcripts (or genes), and/or with both their spliced and unspliced versions. RESULTS: Here, we present DifferentialRegulation, a Bayesian hierarchical method to discover changes between experimental conditions with respect to the relative abundance of unspliced mRNA (over the total mRNA). We model the quantification uncertainty via a latent variable approach, where reads are allocated to their gene/transcript of origin, and to the respective splice version. We designed several benchmarks where our approach shows good performance, in terms of sensitivity and error control, versus state-of-the-art competitors. Importantly, our tool is flexible, and works with both bulk and single-cell RNA-sequencing data. AVAILABILITY AND IMPLEMENTATION: DifferentialRegulation is distributed as a Bioconductor R package.

    Data sources and applied methods for paclitaxel safety signal discernment

    Background: Following the identification of a late mortality signal, the Food and Drug Administration (FDA) convened an advisory panel that concluded that additional clinical study data are needed to comprehensively evaluate the late mortality signal observed with the use of drug-coated balloons (DCBs) and drug-eluting stents (DESs). The objective of this review is to (1) identify and summarize the existing clinical and cohort studies assessing paclitaxel-coated DCBs and DESs, (2) describe and determine the quality of the available data sources for the evaluation of these devices, and (3) present methodologies that can be leveraged for proper signal discernment within available data sources. Methods: Studies and data sources were identified through comprehensive searches. Original research studies, clinical trials, comparative studies, multicenter studies, and observational cohort studies written in the English language and published from January 2007 to November 2021, with a follow-up longer than 36 months, were included in the review. Data quality of available data sources identified was assessed in three groupings. Moreover, accepted data-driven methodologies that may help circumvent the limitations of the extracted studies and data sources were extracted and described. Results: There were 39 studies and data sources identified. This included 19 randomized clinical trials, nine single-arm studies, eight registries, three administrative claims, and electronic health records. Methodologies focusing on the use of existing premarket clinical data, the incorporation of all contributed patient time, the use of aggregated data, approaches for individual-level data, machine learning and artificial intelligence approaches, Bayesian approaches, and the combination of various datasets were summarized. Conclusion: Despite the multitude of available studies over the course of eleven years following the first clinical trial, the FDA-convened advisory panel found them insufficient for comprehensively assessing the late-mortality signal. High-quality data sources with the capabilities of employing advanced statistical methodologies are needed to detect potential safety signals in a timely manner and allow regulatory bodies to act quickly when a safety signal is detected.

    International Nosocomial Infection Control Consortium report, data summary of 50 countries for 2010-2015: Device-associated module

    • We report INICC device-associated module data of 50 countries from 2010-2015.
    • We collected prospective data from 861,284 patients in 703 ICUs for 3,506,562 days.
    • DA-HAI rates and bacterial resistance were higher in the INICC ICUs than in CDC-NHSN's.
    • Device utilization ratio in the INICC ICUs was similar to CDC-NHSN's.

    Background: We report the results of the International Nosocomial Infection Control Consortium (INICC) surveillance study from January 2010-December 2015 in 703 intensive care units (ICUs) in Latin America, Europe, Eastern Mediterranean, Southeast Asia, and Western Pacific. Methods: During the 6-year study period, using Centers for Disease Control and Prevention National Healthcare Safety Network (CDC-NHSN) definitions for device-associated health care-associated infection (DA-HAI), we collected prospective data from 861,284 patients hospitalized in INICC hospital ICUs for an aggregate of 3,506,562 days. Results: Although device use in INICC ICUs was similar to that reported from CDC-NHSN ICUs, DA-HAI rates were higher in the INICC ICUs. In the INICC medical-surgical ICUs, the pooled rate of central line-associated bloodstream infection, 4.1 per 1,000 central line-days, was nearly 5-fold higher than the 0.8 per 1,000 central line-days reported from comparable US ICUs; the overall rate of ventilator-associated pneumonia was also higher, 13.1 versus 0.9 per 1,000 ventilator-days, as was the rate of catheter-associated urinary tract infection, 5.07 versus 1.7 per 1,000 catheter-days. In blood culture samples, frequencies of resistance of Pseudomonas isolates to amikacin (29.87% vs 10%) and to imipenem (44.3% vs 26.1%), and of Klebsiella pneumoniae isolates to ceftazidime (73.2% vs 28.8%) and to imipenem (43.27% vs 12.8%), were also higher in the INICC ICUs compared with CDC-NHSN ICUs. Conclusions: Although DA-HAI rates in INICC ICU patients continue to be higher than the rates reported in CDC-NHSN ICUs representing the developed world, we have observed a significant trend toward the reduction of DA-HAI rates in INICC ICUs as shown in each international report. It is INICC's main goal to continue facilitating education, training, and basic and cost-effective tools and resources, such as standardized forms and an online platform, to tackle this problem effectively and systematically.