
    Novel statistical approaches for missing values in truncated high-dimensional metabolomics data with a detection threshold.

    Despite considerable advances in high-throughput technology over the last decade, new challenges have emerged in the analysis, interpretation, and integration of high-dimensional data. The arrival of omics datasets has accelerated the growth of systems biology, which seeks to understand complex biological systems. Metabolomics is an emerging omics field in which mass spectrometry technologies generate high-dimensional datasets, and as the field progresses, better analysis methods are required to provide correct and adequate results. While other omics sectors such as genomics and proteomics have devoted, and continue to devote, considerable attention to developing appropriate methods for handling missing values, missing-value handling in metabolomics has been an undervalued step. Missing data are a common issue in all types of medical research, and handling them has always been a challenge. Since many downstream analyses, such as classification, clustering, and dimension-reduction methods, require complete datasets, imputation of missing data is a crucial step. The standard approach is to remove features with one or more missing values or to substitute them with a value such as the mean or half the minimum. A major source of missing data in metabolomics is the limit of detection, so sophisticated methods are needed to incorporate the different origins of missingness. This dissertation contributes to the knowledge of missing-value imputation methods with three separate but related research projects. The first project develops a novel missing-value imputation method based on a modification of the k-nearest-neighbor method that accounts for truncation at the minimum value/limit of detection. The approach assumes that the data follow a truncated normal distribution with the truncation point at the detection limit.
The aim of the second project arises from a limitation of the first: while the novel approach is useful, estimation of the truncated mean and standard deviation is problematic in small sample sizes (N < 10). In this project, we develop a Bayesian model for imputing missing values with small sample sizes. The Bayesian paradigm has generally been utilized in the omics field because it borrows information from related components to stabilize parameter estimation. The third project is motivated by the need to determine the impact of missing-value imputation on downstream analyses and whether rankings of imputation methods correlate well with the biological implications of the imputation.
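    As a rough illustration of the truncation idea in the first project, the mean and standard deviation of a left-truncated normal can be recovered by maximizing the truncated likelihood. This is a minimal sketch under the stated distributional assumption, not the dissertation's actual implementation; the function name and optimizer choice are mine.

```python
import numpy as np
from scipy import stats, optimize

def truncated_normal_mle(observed, lod):
    """MLE of (mu, sigma) assuming values below `lod` are unobserved,
    i.e. the data follow a normal distribution left-truncated at `lod`
    (density renormalized by the probability mass above the limit)."""
    def nll(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)  # log-parametrization keeps sigma > 0
        logpdf = stats.norm.logpdf(observed, mu, sigma)
        logtail = stats.norm.logsf(lod, mu, sigma)  # log P(X > lod)
        return -(logpdf - logtail).sum()
    start = [observed.mean(), np.log(observed.std())]
    res = optimize.minimize(nll, start, method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])
```

Naive sample moments of the observed values would overestimate the mean and underestimate the spread; the truncated likelihood corrects both.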

    Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies.

    BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, a systematic assessment of the various sources of missing values and of strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD), or randomly, for instance as a consequence of sample preparation. METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined each method's ability to reconstruct biochemical pathways from data-driven correlation networks and to increase statistical power while preserving the strength of established metabolic quantitative trait loci. RESULTS: Run day-dependent, LOD-based missing data account for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable. CONCLUSION: Missing data in untargeted MS-based metabolomics occur for various reasons. Based on our results, we recommend that KNN-based imputation be performed on observations with variable pre-selection, since it showed robust results in all evaluation schemes. This work was supported by grants from the German Federal Ministry of Education and Research (BMBF), by BMBF Grant No. 01ZX1313C (project e:Athero-MED) and Grant No. 03IS2061B (project Gani_Med).
Moreover, the research leading to these results has received funding from the European Union’s Seventh Framework Programme [FP7-Health-F5-2012] under grant agreement No. 305280 (MIMOmics) and from the European Research Council (starting grant “LatentCauses”). KS is supported by Biomedical Research Program funds at Weill Cornell Medical College in Qatar, a program funded by the Qatar Foundation. The KORA Augsburg studies were financed by the Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany, and supported by grants from the German Federal Ministry of Education and Research (BMBF). Analyses in the EPIC-Norfolk study were supported by funding from the Medical Research Council (MC_PC_13048 and MC_UU_12015/1).
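    The recommended strategy, KNN imputation across observations with variable pre-selection, can be sketched roughly as follows. This is a simplified toy version: the correlation threshold, the NaN handling in the distances, and the function name are all illustrative, not the paper's exact procedure.

```python
import numpy as np

def knn_impute_obs(X, k=5, min_corr=0.2):
    """KNN imputation across observations (rows = samples, cols = metabolites)
    with variable pre-selection: distances for a target metabolite are computed
    only over the metabolites correlated with it (|r| >= min_corr)."""
    X = X.copy()
    mask = np.isnan(X)
    # correlations estimated while ignoring missing entries
    C = np.ma.corrcoef(np.ma.masked_invalid(X), rowvar=False).filled(0.0)
    n, p = X.shape
    for j in range(p):
        sel = np.where((np.abs(C[j]) >= min_corr) & (np.arange(p) != j))[0]
        if sel.size == 0:  # fall back to all other variables
            sel = np.delete(np.arange(p), j)
        for i in np.where(mask[:, j])[0]:
            # distance from sample i to every candidate donor over selected vars
            diffs = X[:, sel] - X[i, sel]
            d = np.sqrt(np.nanmean(diffs ** 2, axis=1))
            order = np.argsort(d)
            donors = [t for t in order if t != i and not mask[t, j]][:k]
            X[i, j] = np.mean(X[donors, j])  # average of k nearest donors
    return X
```

Restricting the distance computation to correlated variables keeps irrelevant metabolites from drowning out the signal that actually predicts the missing value.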

    BayesMetab: treatment of missing values in metabolomic studies using a Bayesian modeling approach

    Background: With the rise of metabolomics, the development of methods to address analytical challenges in the analysis of metabolomics data is of great importance. Missing values (MVs) are pervasive, yet their treatment can have a substantial impact on downstream statistical analyses. The MVs problem in metabolomics is quite challenging and can arise because a metabolite is not biologically present in the sample, is present but at a concentration below the lower limit of detection (LOD), or is present but undetected due to technical issues related to sample pre-processing steps. The first two causes are considered missing not at random (MNAR), while the third is an example of missing at random (MAR). Typically, such MVs are substituted by a minimum value, which may lead to severely biased results in downstream analyses. Results: We develop a Bayesian model, called BayesMetab, that systematically accounts for missing values based on a Markov chain Monte Carlo (MCMC) algorithm incorporating data augmentation, by allowing an MV to be due either to truncation below the LOD or to other technical reasons unrelated to its abundance. Based on a variety of performance metrics (power for detecting differential abundance, area under the curve, and bias and MSE for parameter estimates), our simulation results indicate that BayesMetab outperformed other imputation algorithms when there is a mixture of missingness due to MAR and MNAR. Further, our approach was competitive with methods tailored specifically to MNAR in situations where missing data were completely MNAR. Applying our approach to metabolomics data from a mouse model of myocardial infarction revealed several statistically significant metabolites, not previously identified, that were of direct biological relevance to the study.
Conclusions: Our findings demonstrate that BayesMetab improves imputation of missing values and statistical inference compared to other current methods when missing values are due to a mixture of MNAR and MAR. Analysis of real metabolomics data strongly suggests this mixture is likely to occur in practice, and thus it is important to consider an imputation model that accounts for a mixture of missing data types.
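    The data-augmentation idea, letting each missing value arise either from truncation below the LOD (MNAR) or from a technical failure unrelated to abundance (MAR), can be illustrated with a toy single-metabolite Gibbs sampler. This is a deliberately simplified sketch for intuition (flat priors, fixed mixing probability), not the BayesMetab model itself; all names and settings are mine.

```python
import numpy as np
from scipy import stats

def bayesmetab_sketch(y_obs, n_miss, lod, n_iter=1000, pi_mnar=0.7, seed=0):
    """Toy data-augmentation Gibbs sampler for one metabolite: each of the
    n_miss missing values is imputed either from the normal truncated above
    by the LOD (MNAR, probability pi_mnar) or from the untruncated normal
    (MAR); mu and sigma are then updated from their full conditionals."""
    rng = np.random.default_rng(seed)
    mu, sigma = y_obs.mean(), y_obs.std()
    draws = []
    for _ in range(n_iter):
        # augmentation step: fill in missing values according to their cause
        is_mnar = rng.random(n_miss) < pi_mnar
        b = (lod - mu) / sigma  # standardized upper truncation bound
        y_mnar = stats.truncnorm.rvs(-np.inf, b, loc=mu, scale=sigma,
                                     size=is_mnar.sum(), random_state=rng)
        y_mar = rng.normal(mu, sigma, size=n_miss - is_mnar.sum())
        y = np.concatenate([y_obs, y_mnar, y_mar])
        # parameter step: conjugate-style updates under flat priors
        sigma = np.sqrt(stats.invgamma.rvs(a=len(y) / 2,
                                           scale=((y - mu) ** 2).sum() / 2,
                                           random_state=rng))
        mu = rng.normal(y.mean(), sigma / np.sqrt(len(y)))
        draws.append((mu, sigma))
    return np.array(draws)
```

Alternating between imputing the missing values and updating the parameters is the essence of data augmentation; the full model additionally infers the mixing probability and handles many metabolites jointly.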

    Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

    This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data are invariably problematic: noisy, with missing entries, and with imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques for data pre-processing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework combining the cost-sensitive SVM with the expectation-maximization (EM) imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data in health applications, and show that our multilevel SVM-based method produces faster, more accurate, and more robust classification results. Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
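    A minimal sketch of this kind of pipeline, using scikit-learn's iterative regression-based imputer as a stand-in for the EM imputation step and class weights for the cost-sensitive SVM; the dataset and all parameters below are illustrative, not those of the paper.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic imbalanced data with missing entries
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)  # minority positive class
X[rng.random(X.shape) < 0.1] = np.nan            # ~10% missing entries

clf = make_pipeline(
    IterativeImputer(max_iter=10, random_state=0),  # iterated regression imputation
    StandardScaler(),
    SVC(class_weight="balanced"),  # cost-sensitive: upweight minority errors
)
clf.fit(X, y)
print(round(clf.score(X, y), 2))
```

The `class_weight="balanced"` option scales the SVM misclassification penalty inversely to class frequency, which is the simplest form of the cost-sensitivity the abstract describes; the multilevel coarsening of the training set is omitted here.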

    A modeling platform to predict cancer survival and therapy outcomes using tumor tissue derived metabolomics data.

    Cancer is a complex and broad disease that is challenging to treat, partially due to the vast molecular heterogeneity among patients even within the same subtype. Currently, no reliable method exists to determine which potential first-line therapy would be most effective for a specific patient, as randomized clinical trials have concluded that no single regimen may be significantly more effective than others. One ongoing challenge in the field of oncology is the search for personalization of cancer treatment based on patient data. With an interdisciplinary approach, we show that tumor-tissue-derived metabolomics data can predict clinical response to systemic therapy, classified as disease control vs. progressive disease, and pathological stage, classified as stage I/II/III vs. stage IV, via machine-learning techniques (AUROC = 0.970 and 0.902, respectively). Patient survival was also analyzed via statistical methods and machine learning, both of which show that tumor-tissue-derived metabolomics data can risk-stratify patients in terms of long vs. short survival (test-set AUROC = 0.940 for OS and 0.875 for PFS). A set of key metabolites as potential biomarkers and associated metabolic pathways were also found for each outcome, which may lead to insight into biological mechanisms. Additionally, we developed a methodology to calibrate tumor-growth-related parameters in a well-established mathematical model of cancer to help predict the potential nuances of chemotherapeutic response. The proposed methodology shows results consistent with clinical observations in predicting individual patient response to systemic therapy and helps lay the foundation for further investigation into the calibration of mathematical models of cancer with patient-tissue-derived molecular data. Chapters 6 and 8 were published in the Annals of Biomedical Engineering.
Chapters 2, 3, and 7 were published in Metabolomics, Lung Cancer, and Pharmaceutical Research, respectively. Chapter 4 has been accepted for publication at the journal Metabolomics (in press) and Chapter 5 is in review at the journal Metabolomics. Chapter 9 is currently undergoing preparation for submission.

    Metabolic effects of bezafibrate in mitochondrial disease

    Mitochondrial disorders affect 1 in 5,000 people and have no cure. Inducing mitochondrial biogenesis with bezafibrate improves mitochondrial function in animal models, but there are no comparable human studies. We performed an open-label observational experimental medicine study of six patients with mitochondrial myopathy caused by the m.3243A>G MTTL1 mutation. Our primary aim was to determine the effects of bezafibrate on mitochondrial metabolism, whilst providing preliminary evidence of safety and efficacy using biomarkers. The participants received 600-1,200 mg bezafibrate daily for 12 weeks. There were no clinically significant adverse events, and liver function was not affected. We detected a reduction in the number of complex IV-immunodeficient muscle fibres and improved cardiac function. However, this was accompanied by an increase in serum biomarkers of mitochondrial disease, including fibroblast growth factor 21 (FGF-21) and growth and differentiation factor 15 (GDF-15), plus dysregulation of fatty acid and amino acid metabolism. Thus, although potentially beneficial in the short term, inducing mitochondrial biogenesis with bezafibrate altered the metabolomic signature of mitochondrial disease, raising concerns about long-term sequelae.

    Dissecting the multi-phenotype effects for cardiometabolic traits in highly dimensional whole genome and omics data through usage of multivariate analytical methods

    For over a decade, single-phenotype genome-wide association studies (SP-GWAS) have been used to identify associations between variants and cardiometabolic traits. Initially, our team performed an SP-GWAS meta-analysis of fasting insulin (FI) and fasting glucose (FG) in European and trans-ethnic ancestries within the Meta-Analysis of Glucose and Insulin-related traits Consortium (MAGIC). However, after calculating the variance explained, I found that for FG it increased only slightly over MAGIC's previous analysis, from 1.5% to 4.3%. We proposed multi-phenotype GWAS (MP-GWAS) to boost statistical power and performed an MP-GWAS of fatty acids (FAs) in NFBC1966 (N = 4,955), with replication in NFBC1986 (N = 2,687), to investigate fatty-acid metabolism. The meta-analysis conducted by our team detected 10 signals associated with FAs (P < 5×10⁻⁸) at PCSK9, GCKR, LPXN, FADS1, GPR137, ZNF259, LIPC, PDXDC1, PBX4, and APOE. For subsequent analysis, I proposed a new direct conditional analysis method within MP-GWAS, which detected multiple distinct signals within these loci. While MP-GWAS is a powerful method for locus discovery, it can drastically increase the amount of missing phenotype data. I therefore investigated the properties of seven imputation methods within the MP-GWAS framework via an extensive simulation study and found that random forest (RF) performs best under various scenarios. However, as no RF software designed for high-dimensional data was available, I developed imputeSCOPA, the fastest RF imputation software to date. I applied imputeSCOPA to the NFBC data and performed an MP-GWAS of 31 metabolites on both the imputed data and the complete cases (CC). The analysis using imputed data boosted the power of MP-GWAS, identifying two novel signals at rs61803025 within FCGR3B (P_CC = 5.68×10⁻⁷ vs P_imp = 5.49×10⁻⁹) and rs181847072 within ADAMTS3 (P_CC = 5.67×10⁻⁷ vs P_imp = 9.27×10⁻¹¹). These results demonstrate the increased power of MP-GWAS compared to the traditional SP-GWAS.
This work further highlights the importance of handling missing data correctly and introduces imputeSCOPA, a fast RF-based imputation software.
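    RF-based imputation of the kind imputeSCOPA accelerates can be sketched as an iterative scheme in which each incomplete variable is regressed on the others, in the spirit of missForest. This is a simplified illustration using scikit-learn, not imputeSCOPA's implementation; the function name and parameters are mine.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute(X, n_iter=5, seed=0):
    """Iterative random-forest imputation: missing cells start at column
    means, then each variable with missing entries is repeatedly regressed
    on the remaining variables and its missing cells are re-predicted."""
    X = X.copy()
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])  # mean starting values
    for _ in range(n_iter):
        for j in np.where(mask.any(axis=0))[0]:
            obs, mis = ~mask[:, j], mask[:, j]
            other = np.delete(np.arange(X.shape[1]), j)
            rf = RandomForestRegressor(n_estimators=50, random_state=seed)
            rf.fit(X[np.ix_(obs, other)], X[obs, j])     # learn from observed rows
            X[mis, j] = rf.predict(X[np.ix_(mis, other)])  # refine missing cells
    return X
```

Because the forests capture non-linear relationships between phenotypes, this tends to outperform mean or single-regression imputation when phenotypes are correlated, which matches the simulation finding described above.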

    Methods for Adapting Global Mass Spectrometry Based Metabolomics to the Clinical Environment

    Metabolomics is a maturing field with successful applications in research areas such as biomarker discovery and mechanisms of disease. With the ability to profile hundreds or even thousands of biochemicals simultaneously, many of which are also used in various laboratory diagnostics, the technology has the potential to replace a battery of clinical tests with a single test. However, the current state of global analysis presents several challenges for the clinical environment. This dissertation addresses two of them: first, the handling of missing values when comparing an individual sample against a reference population; second, the semi-quantitative nature of liquid chromatography-mass spectrometry. The first paper explores basic properties of metabolites, specifically the statistical distribution of metabolite concentrations and the correlation between them. In human sample sets covering three different sample materials appropriate for clinical testing, raw ion counts are shown to be far from normally distributed, consistently exhibiting a heavy right skew. Natural log-transformation is effective at removing this skewness and inducing Gaussian behavior, though departures from normality may persist in the tails of the distributions. Correlation between library-matched metabolites after removing artifact-related features is also shown to be only moderate in most cases. In the second paper, the log transformation is used to account for missing values when estimating population parameters of a reference cohort. Missing values are largely attributed to the true level falling below the detection limit of the instrument. Combining this assumption with the Gaussian model leads to two parametric approaches for the estimation of population parameters. These methods are shown to outperform standard imputation approaches in the field using a combination of simulations and real metabolomic datasets.
The third paper addresses merging multiple global LC-MS metabolomic datasets of the same biological sample type. Typical normalization methods meant to account for sample-to-sample variation are presented and compared to alternative approaches using technical replicates and within-batch scaling. Concentrations from targeted analysis of eight clinical biomarkers are used to show the superiority of these alternative approaches.
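    The censored-likelihood idea described above, where a non-detect contributes the probability mass below the detection limit on the log scale rather than an imputed value, can be sketched via maximum likelihood. This is an illustrative version under the stated log-normal assumption, not the papers' exact estimators; the function name and the NaN-for-non-detect convention are mine.

```python
import numpy as np
from scipy import stats, optimize

def censored_lognormal_mle(values, lod):
    """Estimate (mu, sigma) of log-concentrations when non-detects (NaN)
    are treated as left-censored at the detection limit: observed points
    contribute the normal log-density on the log scale, non-detects
    contribute log P(X < log(lod))."""
    logs = np.log(values[~np.isnan(values)])
    n_cens = int(np.isnan(values).sum())
    log_lod = np.log(lod)
    def nll(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        ll = stats.norm.logpdf(logs, mu, sigma).sum()
        ll += n_cens * stats.norm.logcdf(log_lod, mu, sigma)
        return -ll
    res = optimize.minimize(nll, [logs.mean(), np.log(logs.std())],
                            method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])
```

Unlike half-minimum substitution, the censored likelihood uses only the fact that a non-detect lies below the limit, so the reference-population parameters are not dragged toward the substitution value.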
