353 research outputs found
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
MissForest - nonparametric missing value imputation for mixed-type data
Modern data acquisition based on high-throughput technology often faces the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete set. Missing-value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only, continuous or categorical; for mixed-type data, the different types are usually handled separately, so these methods ignore possible relations between variable types. We propose a nonparametric method that can cope with different types of variables simultaneously. We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need for a test set. Evaluation is performed on multiple data sets from a diverse selection of biological fields, with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in data sets including different types of variables. In our comparative study, missForest outperforms other methods of imputation, especially in data settings where complex interactions and nonlinear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data.
Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201
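The iterative idea behind missForest — regress each variable on all the others with a random forest, and repeat until the imputations stabilize — can be sketched with scikit-learn's iterative imputer. This is an illustration of the general scheme, not the authors' R implementation; the synthetic data and estimator settings are assumptions.

```python
# Sketch of missForest-style iterative imputation (not the original R package).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Introduce ~10% missing values completely at random, as in the evaluation above.
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

# Each variable with missing entries is repeatedly regressed on the remaining
# variables using a random forest until the imputed values stabilize.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
print(X_imputed.shape)  # (200, 4)
```

Because every tree in the forest is fit on a bootstrap sample, the out-of-bag error described in the abstract comes for free from the same fitted forests, with no held-out test set.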
Solving the "many variables" problem in MICE with principal component regression
Multiple Imputation (MI) is one of the most popular approaches to addressing missing values in questionnaires and surveys. MI with multivariate imputation by chained equations (MICE) allows flexible imputation of many types of data. In MICE, for each variable under imputation, the imputer needs to specify which variables should act as predictors in the imputation model. The selection of these predictors is a difficult, but fundamental, step in the MI procedure, especially when there are many variables in a data set. In this project, we explore the use of principal component regression (PCR) as a univariate imputation method in the MICE algorithm to automatically address the "many variables" problem that arises when imputing large social science data. We compare different implementations of PCR-based MICE with a correlation-thresholding strategy by means of a Monte Carlo simulation study and a case study. We find the use of PCR on a variable-by-variable basis to perform best, and that it can perform closely to expertly designed imputation procedures.
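A single univariate step of this PCR-based approach can be sketched as follows: instead of hand-picking predictors for the variable under imputation, compress all other variables into a few principal components and regress on those. This is an illustrative sketch of the idea, not the authors' implementation; the data, component count, and missingness rate are assumptions, and a full MICE run would add noise to the predictions and cycle over all incomplete variables.

```python
# One PCR-based univariate imputation step, sketched with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n, p = 300, 20
Z = rng.normal(size=(n, p))                     # many candidate predictors
y = Z[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=n)
missing = rng.random(n) < 0.2                   # 20% of y to be imputed

# PCR: a few components summarize all predictors, so the imputer never
# has to specify a predictor set for this variable by hand.
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(Z[~missing], y[~missing])

y_imp = y.copy()
y_imp[missing] = pcr.predict(Z[missing])
```

The "variable-by-variable" finding above corresponds to refitting the PCA separately for each variable under imputation, rather than computing one shared component basis.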
Different Routes or Methods of Application for Dimensionality Reduction in Multicenter Studies Databases
Technological progress and digital transformation, which began with Big Data and Artificial Intelligence (AI), are currently transforming ways of working in all fields to support decision-making, particularly in multicenter research. This study analyzed a sample of 5178 hospital patients suffering from exacerbation of chronic obstructive pulmonary disease (eCOPD). Because of differences in disease stages and progression, the clinical pathologies and characteristics of the patients were extremely diverse. Our objective was thus to reduce dimensionality by projecting the data onto a lower-dimensional subspace. The results obtained show that principal component analysis (PCA) is the most effective linear technique for dimensionality reduction. Four patient profile groups are generated with similar affinity and characteristics. In conclusion, dimensionality reduction is found to be an effective technique that permits the visualization of early indications of clinical patterns with similar characteristics. This is valuable since the development of other pathologies (chronic diseases) over any given time period influences clinical parameters. If healthcare professionals can have access to such information beforehand, this can significantly improve the quality of patient care, since this type of study is based on a multitude of data variables that can be used to evaluate and monitor the clinical status of the patient.
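The projection-then-grouping workflow described above can be sketched in a few lines: standardize the clinical variables, project them onto a low-dimensional PCA subspace, and partition the projected patients into four profile groups. The synthetic data, dimensions, and the use of k-means for grouping are illustrative assumptions, not the study's exact pipeline.

```python
# Sketch: PCA projection of patient records followed by four profile groups.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 30))              # 500 patients, 30 clinical variables

X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to variable scale
X_low = PCA(n_components=2).fit_transform(X_std)  # 2-D subspace for visualization

# Partition the projected patients into four profile groups.
groups = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_low)
```

Plotting `X_low` colored by `groups` is what makes the "early indications of clinical patterns" visible to a clinician.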
Income and Longevity Revisited: Do High-Earning Women Live Longer?
The empirical relationship between income and longevity has been addressed by a large number of studies, but most were confined to men. In particular, administrative data from public pension systems are less reliable for women because of the loose relationship between own earnings and household income. Following the procedure first used by Hupfeld (2010), we analyze a large data set from the German public pension scheme on women who died between 1994 and 2005, employing both non-parametric and parametric methods. To overcome the problem mentioned above, we concentrate on women with relatively long earnings histories. We find that the relationship between earnings and life expectancy is very similar for women as it is for men: among women who contributed for at least 25 years, a woman at the 90th percentile of the income distribution can expect to live 3 years longer than a woman at the 10th percentile.
Keywords: life expectancy and income, women, public pensions, Germany
Improving Country Conflict and Peace Modeling: Datasets, Imputations, and Hierarchical Clustering
Many disparate datasets exist that provide country attributes covering political, economic, and social aspects. Unfortunately, this data often does not include all countries, nor is the data complete for those countries included, as measured by the dataset's missingness. This research addresses these dataset shortfalls in predicting country instability by considering country attributes in all aspects as well as at greater thresholds of missingness. First, a structured summary of past research is presented, framed by a developed causal taxonomy and functional ontology. Additionally, a novel imputation technique for very large datasets is presented to account for moderate missingness in the expanded dataset. This method is further extended to establish the MASS-impute algorithm, a multicollinearity applied stepwise stochastic imputation method that overcomes numerical problems present in preferred commercial packages. Finally, the imputed datasets with 932 variables are used to develop a hierarchical clustering approach that accounts for geographic and cultural influences that are desired in the practical use of modeling country conflict. These additional insights and tools provide a basis for improving future country conflict and peace research.
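The final step described above — hierarchical clustering of countries on their imputed attribute vectors — can be sketched with SciPy. The data, linkage method, and cluster count below are illustrative assumptions, not the paper's configuration.

```python
# Sketch: hierarchical clustering of countries on imputed attribute vectors.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
countries = [f"country_{i}" for i in range(40)]
attrs = rng.normal(size=(40, 12))           # imputed attribute vectors

# Ward linkage merges countries with similar political/economic/social
# profiles; cutting the resulting tree yields a chosen number of clusters.
Z = linkage(attrs, method="ward")
labels = fcluster(Z, t=5, criterion="maxclust")
```

Unlike k-means, the linkage tree itself is informative here: cutting it at different heights exposes the nested regional and cultural groupings the abstract aims to capture.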
- …