Analyzing the Fine Structure of Distributions
One aim of data mining is the identification of interesting structures in
data. For better analytical results, the basic properties of an empirical
distribution, such as skewness and possible clipping, i.e. hard limits in value
ranges, need to be assessed. Of particular interest is the question of whether
the data originate from one process or contain subsets related to different
states of the data producing process. Data visualization tools should deliver a
clear picture of the univariate probability density function (PDF) of each
feature. Visualization tools for PDFs typically rely on kernel density estimates
and include the classical histogram as well as modern tools such as
ridgeline plots, bean plots and violin plots. If density estimation parameters
remain at their default settings, conventional methods pose several problems when
visualizing the PDFs of uniform, multimodal and skewed distributions and of
distributions with clipped data. For that reason, a new visualization tool
called the mirrored density plot (MD plot), which is specifically designed to
discover interesting structures in continuous features, is proposed. The MD
plot does not require adjusting any density estimation parameters, which
may make it particularly compelling to non-experts. The
visualization tools in question are evaluated against statistical tests with
regard to typical challenges of exploratory distribution analysis. The results
of the evaluation are presented using bimodal Gaussian distributions, skewed distributions,
and several features with already published PDFs. In an exploratory data
analysis of 12 features describing quarterly financial statements, where
statistical testing poses great difficulty, only the MD plots can identify
the structure of their PDFs. In sum, the MD plot outperforms the
above-mentioned methods.
Comment: 66 pages, 81 figures, accepted in PLOS ON
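
The problem with clipped data under default density estimation settings can be illustrated with a small sketch. This is not the MD plot and not the authors' implementation. It assumes a plain Gaussian kernel density estimate with Silverman's rule-of-thumb bandwidth as the "default setting", and it models clipping by discarding values outside hypothetical hard limits of -1 and 1. The function name `gaussian_kde_default` and all data are illustrative.

```python
import numpy as np


def gaussian_kde_default(sample, grid):
    """Gaussian KDE with Silverman's rule-of-thumb bandwidth (assumed default)."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = sample.size
    # Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)
    sd = sample.std(ddof=1)
    q75, q25 = np.percentile(sample, [75, 25])
    spread = min(sd, (q75 - q25) / 1.34)
    bandwidth = 0.9 * spread * n ** (-0.2)
    # Average of Gaussian kernels centred on each observation
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (n * bandwidth)


rng = np.random.default_rng(0)
raw = rng.normal(loc=0.0, scale=1.0, size=2000)

# Hypothetical hard limits: keep only values inside [-1, 1]
lower, upper = -1.0, 1.0
clipped = raw[(raw >= lower) & (raw <= upper)]

grid = np.linspace(-2.0, 2.0, 401)
density = gaussian_kde_default(clipped, grid)
step = grid[1] - grid[0]

# Probability mass the estimate assigns to values that cannot occur
outside = (grid < lower) | (grid > upper)
mass_outside = density[outside].sum() * step
total_mass = density.sum() * step

print(f"estimated mass outside the hard limits: {mass_outside:.3f}")
```

Although no observation lies outside the limits, the smoothed estimate assigns a visible share of probability mass there, so the hard edges of the value range are blurred in a plot built on this estimate.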