353 research outputs found
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
MissForest - nonparametric missing value imputation for mixed-type data
Modern data acquisition based on high-throughput technology often faces the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete set. Missing-value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only, continuous or categorical; for mixed-type data, the different types are usually handled separately, so these methods ignore possible relations between variable types. We propose a nonparametric method that can cope with different types of variables simultaneously. We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need for a test set. Evaluation is performed on multiple data sets from a diverse selection of biological fields, with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in data sets including different types of variables. In our comparative study, missForest outperforms other methods of imputation, especially in data settings where complex interactions and nonlinear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data.
Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201
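The iterative idea behind missForest — regress each variable on all the others with a random forest, and repeat until the imputations stabilize — can be sketched with scikit-learn's iterative imputer. This is an illustration of the general scheme, not the authors' R implementation; the synthetic data and estimator settings are assumptions.

```python
# Sketch of missForest-style iterative imputation (not the original R package).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Introduce ~10% missing values completely at random, as in the evaluation above.
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

# Each variable with missing entries is repeatedly regressed on the remaining
# variables using a random forest until the imputed values stabilize.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
print(X_imputed.shape)  # (200, 4)
```

Because every tree in the forest is fit on a bootstrap sample, the out-of-bag error described in the abstract comes for free from the same fitted forests, with no held-out test set.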
Solving the "many variables" problem in MICE with principal component regression
Multiple Imputation (MI) is one of the most popular approaches to addressing missing values in questionnaires and surveys. MI with multivariate imputation by chained equations (MICE) allows flexible imputation of many types of data. In MICE, for each variable under imputation, the imputer needs to specify which variables should act as predictors in the imputation model. The selection of these predictors is a difficult, but fundamental, step in the MI procedure, especially when there are many variables in a data set. In this project, we explore the use of principal component regression (PCR) as a univariate imputation method in the MICE algorithm to automatically address the "many variables" problem that arises when imputing large social science data. We compare different implementations of PCR-based MICE with a correlation-thresholding strategy by means of a Monte Carlo simulation study and a case study. We find the use of PCR on a variable-by-variable basis to perform best, and that it can perform closely to expertly designed imputation procedures.
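A single univariate step of this PCR-based approach can be sketched as follows: instead of hand-picking predictors for the variable under imputation, compress all other variables into a few principal components and regress on those. This is an illustrative sketch of the idea, not the authors' implementation; the data, component count, and missingness rate are assumptions, and a full MICE run would add noise to the predictions and cycle over all incomplete variables.

```python
# One PCR-based univariate imputation step, sketched with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n, p = 300, 20
Z = rng.normal(size=(n, p))                     # many candidate predictors
y = Z[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=n)
missing = rng.random(n) < 0.2                   # 20% of y to be imputed

# PCR: a few components summarize all predictors, so the imputer never
# has to specify a predictor set for this variable by hand.
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(Z[~missing], y[~missing])

y_imp = y.copy()
y_imp[missing] = pcr.predict(Z[missing])
```

The "variable-by-variable" finding above corresponds to refitting the PCA separately for each variable under imputation, rather than computing one shared component basis.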
Different Routes or Methods of Application for Dimensionality Reduction in Multicenter Studies Databases
Technological progress and digital transformation, which began with Big Data and Artificial Intelligence (AI), are currently transforming ways of working in all fields to support decision-making, particularly in multicenter research. This study analyzed a sample of 5178 hospital patients suffering from exacerbation of chronic obstructive pulmonary disease (eCOPD). Because of differences in disease stages and progression, the clinical pathologies and characteristics of the patients were extremely diverse. Our objective was thus to reduce dimensionality by projecting the data onto a lower-dimensional subspace. The results obtained show that principal component analysis (PCA) is the most effective linear technique for dimensionality reduction. Four patient profile groups are generated with similar affinity and characteristics. In conclusion, dimensionality reduction is found to be an effective technique that permits the visualization of early indications of clinical patterns with similar characteristics. This is valuable since the development of other pathologies (chronic diseases) over any given time period influences clinical parameters. If healthcare professionals can have access to such information beforehand, this can significantly improve the quality of patient care, since this type of study is based on a multitude of data variables that can be used to evaluate and monitor the clinical status of the patient.
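The projection-then-grouping workflow described above can be sketched in a few lines: standardize the clinical variables, project them onto a low-dimensional PCA subspace, and partition the projected patients into four profile groups. The synthetic data, dimensions, and the use of k-means for grouping are illustrative assumptions, not the study's exact pipeline.

```python
# Sketch: PCA projection of patient records followed by four profile groups.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 30))              # 500 patients, 30 clinical variables

X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to variable scale
X_low = PCA(n_components=2).fit_transform(X_std)  # 2-D subspace for visualization

# Partition the projected patients into four profile groups.
groups = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_low)
```

Plotting `X_low` colored by `groups` is what makes the "early indications of clinical patterns" visible to a clinician.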
Income and Longevity Revisited: Do High-Earning Women Live Longer?
The empirical relationship between income and longevity has been addressed by a large number of studies, but most were confined to men. In particular, administrative data from public pension systems are less reliable for women because of the loose relationship between own earnings and household income. Following the procedure first used by Hupfeld (2010), we analyze a large data set from the German public pension scheme on women who died between 1994 and 2005, employing both non-parametric and parametric methods. To overcome the problem mentioned above, we concentrate on women with relatively long earnings histories. We find that the relationship between earnings and life expectancy is very similar for women as it is for men: among women who contributed for at least 25 years, a woman at the 90th percentile of the income distribution can expect to live 3 years longer than a woman at the 10th percentile.
Keywords: life expectancy and income, women, public pensions, Germany
Improving Country Conflict and Peace Modeling: Datasets, Imputations, and Hierarchical Clustering
Many disparate datasets exist that provide country attributes covering political, economic, and social aspects. Unfortunately, this data often does not include all countries, nor is the data complete for those countries included, as measured by the dataset's missingness. This research addresses these dataset shortfalls in predicting country instability by considering country attributes in all aspects as well as at greater thresholds of missingness. First, a structured summary of past research is presented, framed by a developed causal taxonomy and functional ontology. Additionally, a novel imputation technique for very large datasets is presented to account for moderate missingness in the expanded dataset. This method is further extended to establish the MASS-impute algorithm, a multicollinearity applied stepwise stochastic imputation method that overcomes numerical problems present in preferred commercial packages. Finally, the imputed datasets with 932 variables are used to develop a hierarchical clustering approach that accounts for geographic and cultural influences that are desired in the practical use of modeling country conflict. These additional insights and tools provide a basis for improving future country conflict and peace research.
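The final step described above — hierarchical clustering of countries on their imputed attribute vectors — can be sketched with SciPy. The data, linkage method, and cluster count below are illustrative assumptions, not the paper's configuration.

```python
# Sketch: hierarchical clustering of countries on imputed attribute vectors.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
countries = [f"country_{i}" for i in range(40)]
attrs = rng.normal(size=(40, 12))           # imputed attribute vectors

# Ward linkage merges countries with similar political/economic/social
# profiles; cutting the resulting tree yields a chosen number of clusters.
Z = linkage(attrs, method="ward")
labels = fcluster(Z, t=5, criterion="maxclust")
```

Unlike k-means, the linkage tree itself is informative here: cutting it at different heights exposes the nested regional and cultural groupings the abstract aims to capture.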
- …