Stability metrics for multi-source biomedical data based on simplicial projections from probability distribution distances
Biomedical data may be composed of individuals generated from distinct, meaningful
sources. Due to possible contextual biases in the processes that generate data,
there may exist an undesirable and unexpected variability among the probability
distribution functions (PDFs) of the source subsamples, which, when uncontrolled,
may lead to inaccurate or unreproducible research results. Classical statistical
methods may have difficulty uncovering such variability when dealing with
multi-modal, multi-type, multi-variate data. This work proposes two metrics for
the analysis of stability among multiple data sources, robust to the aforementioned
conditions, and defined in the context of data quality assessment. Specifically,
two metrics are proposed: a global probabilistic deviation (GPD) and a source
probabilistic outlyingness (SPO). The first provides a bounded degree of the global multi-source
variability, designed as an estimator equivalent to the notion of normalized standard
deviation of PDFs. The second provides a bounded degree of the dissimilarity of
each source to a latent central distribution. The metrics are based on the projection
of a simplex geometrical structure constructed from the Jensen-Shannon distances
among the sources' PDFs. The metrics were evaluated and demonstrated
correct behaviour on a simulated benchmark and on real multi-source biomedical
data using the UCI Heart Disease dataset. The biomedical data quality assessment
based on the proposed stability metrics may improve the efficiency and effectiveness
of biomedical data exploitation and research.

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by own IBIME funds under the UPV project Servicio de evaluacion y rating de la calidad de repositorios de datos biomedicos [UPV-2014-872] and the EU FP7 Project Help4Mood - A Computational Distributed System to Support the Treatment of Patients with Major Depression [ICT-248765].

Sáez Silvestre, C.; Robles Viejo, M.; García Gómez, JM. (2014). Stability metrics for multi-source biomedical data based on simplicial projections from probability distribution distances. Statistical Methods in Medical Research. 1-25. https://doi.org/10.1177/0962280214545122
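To make the building block of these metrics concrete, the following is a minimal sketch of computing pairwise Jensen-Shannon distances among source PDFs, the matrix from which the simplex structure is constructed. This is an illustration in plain Python over discrete distributions given as equal-length probability vectors, not the authors' implementation:

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, base-2 logs)
    between two discrete probability distributions; bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return math.sqrt((kl(p, m) + kl(q, m)) / 2)

def pairwise_js(sources):
    """Symmetric matrix of Jensen-Shannon distances among source PDFs;
    such a matrix is the starting point for embedding the sources as
    vertices of a simplex."""
    n = len(sources)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = js_distance(sources[i], sources[j])
    return d
```

Because the JS distance is bounded and symmetric, the resulting matrix supports bounded variability summaries of the kind the GPD and SPO metrics formalize.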
Modeling large scale species abundance with latent spatial processes
Modeling species abundance patterns using local environmental features is an
important, current problem in ecology. The Cape Floristic Region (CFR) in South
Africa is a global hot spot of diversity and endemism, and provides a rich
class of species abundance data for such modeling. Here, we propose a
multi-stage Bayesian hierarchical model for explaining species abundance over
this region. Our model is specified at the areal level, where the CFR is divided
into one-minute grid cells; species abundance is observed at
some locations within some cells. The abundance values are ordinally
categorized. Environmental and soil-type factors, likely to influence the
abundance pattern, are included in the model. We formulate the empirical
abundance pattern as a degraded version of the potential pattern, with the
degradation effect accomplished in two stages. First, we adjust for land use
transformation and then we adjust for measurement error, hence
misclassification error, to yield the observed abundance classifications. An
important point in this analysis is that only a fraction of the grid cells have been
sampled and that, for sampled grid cells, the number of sampled locations
ranges from one to more than one hundred. Still, we are able to develop
potential and transformed abundance surfaces over the entire region. In the
hierarchical framework, categorical abundance classifications are induced by
continuous latent surfaces. The degradation model above is built on the latent
scale. On this scale, an areal-level spatial regression model is used for
modeling the dependence of species abundance on the environmental factors.

Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/10-AOAS335
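The mechanism of inducing ordinal abundance classes from a continuous latent surface can be sketched with cut-points: each latent value falls into the interval between two consecutive thresholds, and the interval index is the observed category. The thresholds below are illustrative, not the paper's fitted values:

```python
def ordinal_class(latent, cutpoints):
    """Map a continuous latent abundance value to an ordinal class.
    cutpoints is a sorted list of thresholds; class k corresponds to
    cutpoints[k-1] <= latent < cutpoints[k], with the open intervals
    below the first and above the last cut-point as classes 0 and len(cutpoints)."""
    k = 0
    for c in cutpoints:
        if latent < c:
            break
        k += 1
    return k
```

In the hierarchical model this mapping sits on top of a latent spatial surface, so spatial dependence and covariate effects are modeled on the continuous scale while only the categorical classifications are observed.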
Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges
Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables, such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions.

Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD.

Results: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided.
Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
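Among the subtopics listed, multiple testing is the one most directly expressible in a few lines: with thousands of variables tested at once, per-test significance thresholds must be adjusted. A minimal sketch of the widely used Benjamini-Hochberg step-up procedure, which controls the false discovery rate, follows; it is a generic illustration, not a specific STRATOS recommendation:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return a boolean rejection
    decision per p-value, controlling the false discovery rate at alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank whose p-value clears its BH threshold alpha*rank/m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    # Reject all hypotheses at or below that rank (step-up rule).
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject
```

Note the step-up character: a p-value above its own threshold can still be rejected if a larger-ranked p-value clears its threshold, which is what distinguishes FDR control from simple per-test cut-offs such as Bonferroni.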
On the Nature and Types of Anomalies: A Review
Anomalies are occurrences in a dataset that are in some way unusual and do
not fit the general patterns. The concept of the anomaly is generally
ill-defined and perceived as vague and domain-dependent. Moreover, despite some
250 years of publications on the topic, no comprehensive and concrete overviews
of the different types of anomalies have hitherto been published. By means of
an extensive literature review this study therefore offers the first
theoretically principled and domain-independent typology of data anomalies, and
presents a full overview of anomaly types and subtypes. To concretely define
the concept of the anomaly and its different manifestations, the typology
employs five dimensions: data type, cardinality of relationship, anomaly level,
data structure and data distribution. These fundamental and data-centric
dimensions naturally yield 3 broad groups, 9 basic types and 61 subtypes of
anomalies. The typology facilitates the evaluation of the functional
capabilities of anomaly detection algorithms, contributes to explainable data
science, and provides insights into relevant topics such as local versus global
anomalies.

Comment: 38 pages (30 pages content), 10 figures, 3 tables. Preprint; review comments will be appreciated. Improvements in version 2: explicit mention of fifth anomaly dimension; added section on explainable anomaly detection; added section on variations on the anomaly concept; various minor additions and improvements.
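The local-versus-global distinction mentioned above can be illustrated with standard scores: a value may look ordinary against the full dataset yet stand out sharply within its own group. The numbers below are hypothetical, chosen only to make the contrast visible:

```python
def zscores(values):
    """Standard scores of each value relative to the mean and (population)
    standard deviation of `values`."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

# Hypothetical data: two groups on very different scales.
group_a = [10, 11, 9, 10, 30]    # 30 is unusual *within* group A
group_b = [100, 101, 99, 100]

global_z = zscores(group_a + group_b)  # pooled view: 30 looks unremarkable
local_z = zscores(group_a)             # within-group view: 30 stands out
```

Here the value 30 has a small absolute z-score in the pooled data (it sits between the two groups) but a large one within group A, i.e. it is a local anomaly that a purely global detector would miss.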