Testing for Multivariate Normality in Mass Spectrometry
Imaging Data: A Robust Statistical Approach for Clustering Evaluation
and the Generation of Synthetic Mass Spectrometry Imaging Data Sets

Spatial clustering is a powerful tool in mass spectrometry imaging
(MSI) and has been demonstrated to be capable of differentiating tumor
types, visualizing intratumor heterogeneity, and segmenting anatomical
structures. Several clustering methods have been applied to mass spectrometry
imaging data, but a principled comparison and evaluation of different
clustering techniques presents a significant challenge. We propose
that testing whether the data within each cluster follows a multivariate
normal distribution can be used to evaluate clustering performance for
algorithms that assume normality in the data, such as <i>k</i>-means clustering. When clustering has been performed using
the cosine distance, the data should be converted to polar coordinates
before normality testing so that normality is assessed
in the appropriate coordinate system. In addition to these evaluations
of internal consistency, we demonstrate that the multivariate normal
distribution can then be used as a basis for statistical modeling
of MSI data. This allows the generation of synthetic MSI data sets
with known ground truth, providing a means of external clustering
evaluation. To demonstrate this, reference data from seven anatomical
regions of an MSI image of a coronal section of mouse brain were modeled.
A synthetic data set was then generated from this model.
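
The modeling and generation step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes each region is summarized by a sample mean vector and covariance matrix, and the function names `fit_region_model` and `generate_synthetic`, as well as the random stand-in "reference" data, are hypothetical.

```python
import numpy as np


def fit_region_model(spectra):
    """Estimate the mean vector and covariance matrix of one region.

    spectra: array of shape (n_pixels, n_features).
    """
    spectra = np.asarray(spectra, dtype=float)
    return spectra.mean(axis=0), np.cov(spectra, rowvar=False)


def generate_synthetic(models, n_per_region, seed=0):
    """Draw labeled samples from one multivariate normal per region.

    Returns the stacked samples and the ground-truth region label
    of every sample.
    """
    rng = np.random.default_rng(seed)
    data, labels = [], []
    for label, (mean, cov) in enumerate(models):
        data.append(rng.multivariate_normal(mean, cov, size=n_per_region))
        labels.append(np.full(n_per_region, label))
    return np.vstack(data), np.concatenate(labels)


# Stand-in reference data (assumption): three regions, five features each.
rng = np.random.default_rng(1)
reference = [rng.normal(loc=k, scale=1.0, size=(200, 5)) for k in range(3)]

models = [fit_region_model(region) for region in reference]
synthetic, truth = generate_synthetic(models, n_per_region=100)
```

Because `truth` records which model produced each synthetic sample, the output of any clustering algorithm run on `synthetic` can be scored against a known ground truth.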
For all seven anatomical regions, <i>r</i><sup>2</sup> values from fits to the chi-squared
quantile–quantile plots confirmed
that the data acquired from each region was closer
to normally distributed in polar space than in Euclidean space. Finally,
principal component analysis was applied to a single data set that
included synthetic and real data. No significant differences were
found between the two data types, indicating the suitability of these
methods for generating realistic synthetic data.
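
The normality check and the polar conversion mentioned in the abstract can be illustrated with a short sketch. It rests on several assumptions not stated here: the chi-squared quantile–quantile plot is taken to compare sorted squared Mahalanobis distances against chi-squared quantiles, <i>r</i><sup>2</sup> is computed as the squared correlation between the two, and the polar conversion uses the standard hyperspherical convention. The names `chi2_qq_r2` and `to_polar` and all demo data are hypothetical.

```python
import numpy as np
from scipy import stats


def chi2_qq_r2(X):
    """r^2 of the chi-squared Q-Q plot of squared Mahalanobis distances."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Squared Mahalanobis distance of each observation from the sample mean.
    d2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(cov), centered)
    observed = np.sort(d2)
    # Chi-squared quantiles at plotting positions (i - 0.5) / n.
    probs = (np.arange(1, n + 1) - 0.5) / n
    expected = stats.chi2.ppf(probs, df=d)
    return np.corrcoef(expected, observed)[0, 1] ** 2


def to_polar(X):
    """Convert rows of X (n, d), d >= 2, to hyperspherical coordinates.

    Output columns: radius, d - 2 angles in [0, pi], one angle in (-pi, pi].
    Assumes no row has a zero tail norm (no division-by-zero handling).
    """
    X = np.asarray(X, dtype=float)
    # tail_norm[:, k] = sqrt(x_k^2 + ... + x_{d-1}^2)
    tail_norm = np.sqrt(np.cumsum(X[:, ::-1] ** 2, axis=1))[:, ::-1]
    radius = tail_norm[:, :1]
    angles = np.arccos(np.clip(X[:, :-2] / tail_norm[:, :-2], -1.0, 1.0))
    last = np.arctan2(X[:, -1], X[:, -2])
    return np.hstack([radius, angles, last[:, None]])


rng = np.random.default_rng(0)
n = 2000

# Sanity check of the r^2 score on normal versus heavily skewed data.
r2_gaussian = chi2_qq_r2(rng.normal(size=(n, 3)))
r2_skewed = chi2_qq_r2(rng.lognormal(sigma=1.5, size=(n, 3)))

# Demo cluster that is normal in polar space: magnitude varies, direction
# is nearly fixed (a rough stand-in for a cosine-distance cluster).
polar = np.column_stack([
    rng.normal(10.0, 1.0, n),
    rng.normal(0.8, 0.05, n),
    rng.normal(0.6, 0.05, n),
])
r, a1, a2 = polar.T
cartesian = np.column_stack([
    r * np.cos(a1),
    r * np.sin(a1) * np.cos(a2),
    r * np.sin(a1) * np.sin(a2),
])

r2_euclidean = chi2_qq_r2(cartesian)
r2_polar = chi2_qq_r2(to_polar(cartesian))
```

Comparing `r2_euclidean` with `r2_polar` for each cluster mirrors the comparison reported above; a value closer to 1 indicates that the cluster is closer to multivariate normal in that coordinate system.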