10,362 research outputs found
Imputation of missing data using multivariate Gaussian Linear Cluster-Weighted Modeling
Missing data arises when certain values are not recorded or observed for
variables of interest. However, most of the statistical theory assume complete
data availability. To address incomplete databases, one approach is to fill the
gaps corresponding to the missing information based on specific criteria, known
as imputation. In this study, we propose a novel imputation methodology for
databases with non-response units by leveraging additional information from
fully observed auxiliary variables. We assume that the variables included in
the database are continuous and that the auxiliary variables, which are fully
observed, help to improve the imputation capacity of the model. Within a fully
Bayesian framework, our method utilizes a flexible mixture of multivariate
normal distributions to jointly model the response and auxiliary variables. By
employing the principles of Gaussian Cluster-Weighted modeling, we construct a
predictive model to impute the missing values by leveraging information from
the covariates. We present simulation studies and a real data illustration to
demonstrate the imputation capacity of our method across various scenarios,
comparing it to other methods in the literatureComment: 23 pages, 9 figure
A framework for exploration and cleaning of environmental data : Tehran air quality data experience
Management and cleaning of large environmental monitored data sets is a specific challenge. In this article, the authors present a novel framework for exploring and cleaning large datasets. As a case study, we applied the method on air quality data of Tehran, Iran from 1996 to 2013. ; The framework consists of data acquisition [here, data of particulate matter with aerodynamic diameter ≤10 µm (PM10)], development of databases, initial descriptive analyses, removing inconsistent data with plausibility range, and detection of missing pattern. Additionally, we developed a novel tool entitled spatiotemporal screening tool (SST), which considers both spatial and temporal nature of data in process of outlier detection. We also evaluated the effect of dust storm in outlier detection phase.; The raw mean concentration of PM10 before implementation of algorithms was 88.96 µg/m3 for 1996-2013 in Tehran. After implementing the algorithms, in total, 5.7% of data points were recognized as unacceptable outliers, from which 69% data points were detected by SST and 1% data points were detected via dust storm algorithm. In addition, 29% of unacceptable outlier values were not in the PR. The mean concentration of PM10 after implementation of algorithms was 88.41 µg/m3. However, the standard deviation was significantly decreased from 90.86 µg/m3 to 61.64 µg/m3 after implementation of the algorithms. There was no distinguishable significant pattern according to hour, day, month, and year in missing data.; We developed a novel framework for cleaning of large environmental monitored data, which can identify hidden patterns. We also presented a complete picture of PM10 from 1996 to 2013 in Tehran. Finally, we propose implementation of our framework on large spatiotemporal databases, especially in developing countries
Machine learning applied to enzyme turnover numbers reveals protein structural correlates and improves metabolic models.
Knowing the catalytic turnover numbers of enzymes is essential for understanding the growth rate, proteome composition, and physiology of organisms, but experimental data on enzyme turnover numbers is sparse and noisy. Here, we demonstrate that machine learning can successfully predict catalytic turnover numbers in Escherichia coli based on integrated data on enzyme biochemistry, protein structure, and network context. We identify a diverse set of features that are consistently predictive for both in vivo and in vitro enzyme turnover rates, revealing novel protein structural correlates of catalytic turnover. We use our predictions to parameterize two mechanistic genome-scale modelling frameworks for proteome-limited metabolism, leading to significantly higher accuracy in the prediction of quantitative proteome data than previous approaches. The presented machine learning models thus provide a valuable tool for understanding metabolism and the proteome at the genome scale, and elucidate structural, biochemical, and network properties that underlie enzyme kinetics
Dealing with missing standard deviation and mean values in meta-analysis of continuous outcomes: a systematic review
Background: Rigorous, informative meta-analyses rely on availability of appropriate summary statistics or individual
participant data. For continuous outcomes, especially those with naturally skewed distributions, summary
information on the mean or variability often goes unreported. While full reporting of original trial data is the ideal,
we sought to identify methods for handling unreported mean or variability summary statistics in meta-analysis.
Methods: We undertook two systematic literature reviews to identify methodological approaches used to deal with
missing mean or variability summary statistics. Five electronic databases were searched, in addition to the Cochrane
Colloquium abstract books and the Cochrane Statistics Methods Group mailing list archive. We also conducted cited
reference searching and emailed topic experts to identify recent methodological developments. Details recorded
included the description of the method, the information required to implement the method, any underlying
assumptions and whether the method could be readily applied in standard statistical software. We provided a
summary description of the methods identified, illustrating selected methods in example meta-analysis scenarios.
Results: For missing standard deviations (SDs), following screening of 503 articles, fifteen methods were identified in
addition to those reported in a previous review. These included Bayesian hierarchical modelling at the meta-analysis
level; summary statistic level imputation based on observed SD values from other trials in the meta-analysis; a practical
approximation based on the range; and algebraic estimation of the SD based on other summary statistics. Following
screening of 1124 articles for methods estimating the mean, one approximate Bayesian computation approach and
three papers based on alternative summary statistics were identified. Illustrative meta-analyses showed that when
replacing a missing SD the approximation using the range minimised loss of precision and generally performed better
than omitting trials. When estimating missing means, a formula using the median, lower quartile and upper quartile
performed best in preserving the precision of the meta-analysis findings, although in some scenarios, omitting trials
gave superior results.
Conclusions: Methods based on summary statistics (minimum, maximum, lower quartile, upper quartile, median)
reported in the literature facilitate more comprehensive inclusion of randomised controlled trials with missing mean or
variability summary statistics within meta-analyses
- …