Systematic construction of anomaly detection benchmarks from real data
Research in anomaly detection suffers from a lack of realistic and publicly available problem sets. This paper discusses what properties such problem sets should possess. It then introduces a methodology for transforming existing classification data sets into ground-truthed benchmark data sets for anomaly detection. The methodology produces data sets that vary along three important dimensions: (a) point difficulty, (b) relative frequency of anomalies, and (c) clusteredness. We apply our generated datasets to benchmark several popular anomaly detection algorithms under a range of different conditions.
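As a rough illustration of the idea in this abstract, the sketch below derives an anomaly detection benchmark from a labeled classification data set and controls only one of the three dimensions, the relative frequency of anomalies. It is not the paper's procedure: point difficulty and clusteredness are left out, and the function name, its parameters, and the choice of which classes count as anomalous are all assumptions made for the example.

```python
import random


def make_anomaly_benchmark(samples, labels, anomaly_classes, anomaly_rate, seed=0):
    """Build a ground-truthed benchmark as a list of (sample, is_anomaly) pairs.

    samples, labels : parallel sequences from a classification data set
    anomaly_classes : labels treated as anomalous (an assumption of this sketch)
    anomaly_rate    : target fraction of anomalies in the output, in (0, 1)
    """
    if not 0.0 < anomaly_rate < 1.0:
        raise ValueError("anomaly_rate must lie strictly between 0 and 1")

    rng = random.Random(seed)
    normal = [x for x, y in zip(samples, labels) if y not in anomaly_classes]
    candidates = [x for x, y in zip(samples, labels) if y in anomaly_classes]

    # Solve k / (len(normal) + k) = anomaly_rate for k, the number of anomalies.
    wanted = round(anomaly_rate * len(normal) / (1.0 - anomaly_rate))
    # Never ask for more anomalies than the source data set contains.
    n_anomalies = min(wanted, len(candidates))

    chosen = rng.sample(candidates, n_anomalies)
    benchmark = [(x, False) for x in normal] + [(x, True) for x in chosen]
    rng.shuffle(benchmark)
    return benchmark
```

All normal points are kept and anomalies are subsampled, so a range of anomaly rates can be generated from one source data set by calling the function repeatedly with different `anomaly_rate` values.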
Automatic classification of Candida species using Raman spectroscopy and machine learning
One of the problems that most affect hospitals is infections by pathogenic microorganisms. Rapid identification and adequate, timely treatment can avoid fatal consequences and the development of antibiotic resistance, so it is crucial to use fast, reliable, and not too laborious techniques to obtain quick results. Raman spectroscopy has proven to be a powerful tool for molecular analysis, meeting these requirements better than traditional techniques. In this work, we have used Raman spectroscopy combined with machine learning algorithms to explore the automatic identification of eleven species of the genus Candida, the most common cause of fungal infections worldwide. The Raman spectra were obtained from more than 220 different measurements of dried drops from pure cultures of each Candida species using a Raman Confocal Microscope with a 532 nm laser excitation source. After developing a spectral preprocessing methodology, a study of the quality and variability of the measured spectra at the isolate and species level, and the spectral features contributing to inter-class variations, showed the potential to discriminate between those pathogenic yeasts. Several machine learning and deep learning algorithms were trained using hyperparameter optimization techniques to find the best possible classifier for this spectral data, in terms of accuracy and lowest possible overfitting. 
We found that a one-dimensional Convolutional Neural Network (1-D CNN) could achieve above 80% overall accuracy on the eleven-class spectral dataset, with good generalization capabilities.
This work was supported by the R + D projects INNVAL19/17 (funded by Instituto de Investigación Valdecilla-IDIVAL), PID2019-107270RB-C21 (funded by MCIN/AEI/10.13039/501100011033) and by Plan Nacional de I + D + and Instituto de Salud Carlos III (ISCIII), Subdirección General de Redes y Centros de Investigación Cooperativa, Ministerio de Ciencia, Innovación y Universidades, Spanish Network for Research in Infectious Diseases (REIPI RD16/0016/0007), CIBERINFEC (CB21/13/00068), CIBER-BBN (BBNGC1601), cofinanced by European Development Regional Fund “A way to achieve Europe”. A. A. O.-S was financially supported by the Miguel Servet II program (ISCIII-CPII17-00011).
Copula-based anomaly scoring and localization for large-scale, high-dimensional continuous data
The anomaly detection method presented in this paper has a special feature:
it not only indicates whether an observation is anomalous but also
identifies what exactly makes an anomalous observation unusual. Hence, it
helps to localize the cause of the anomaly.
The proposed approach is model-based; it relies on the multivariate
probability distribution associated with the observations. Since rare
events lie in the tails of probability distributions, we use copula
functions, which can model fat-tailed distributions well. The
presented procedure scales well; it can cope with a large number of
high-dimensional samples. Furthermore, our procedure can cope with missing
values, too, which occur frequently in high-dimensional data sets.
In the second part of the paper, we demonstrate the usability of the method
through a case study, where we analyze a large data set consisting of the
performance counters of a real mobile telecommunication network. Since such
networks are complex systems, the signs of sub-optimal operation can remain
hidden for a potentially long time. With the proposed procedure, many such
hidden issues can be isolated and indicated to the network operator.
Comment: 27 pages, 12 figures, accepted at ACM Transactions on Intelligent
Systems and Technology
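The abstract does not spell out the scoring procedure, so the following is only a minimal sketch of one ingredient common to copula modeling: mapping each feature to pseudo-observations in (0, 1) through its empirical CDF, then reporting a per-feature tail score that hints at which feature makes an observation unusual. The dependence model between features (the copula itself) is omitted, missing values are represented as `None` by assumption, and the function names are hypothetical.

```python
import bisect
import math


def pseudo_observations(column):
    """Map one feature column to (0, 1) using rank / (n + 1).

    Missing values (None) are skipped when ranking and stay None in the output.
    """
    present = sorted(v for v in column if v is not None)
    n = len(present)
    result = []
    for v in column:
        if v is None:
            result.append(None)
        else:
            # Number of observed values <= v; ties receive the upper rank.
            rank = bisect.bisect_right(present, v)
            result.append(rank / (n + 1))
    return result


def tail_scores(rows):
    """Return, for every row, one score per feature: -log(2 * min(u, 1 - u)).

    The score is 0 at the median and grows as the value moves into either
    tail. A missing value contributes a score of 0.0.
    """
    columns = list(zip(*rows))
    u_columns = [pseudo_observations(col) for col in columns]
    scores = []
    for i in range(len(rows)):
        row_scores = []
        for u_col in u_columns:
            u = u_col[i]
            if u is None:
                row_scores.append(0.0)
            else:
                row_scores.append(-math.log(2.0 * min(u, 1.0 - u)))
        scores.append(row_scores)
    return scores
```

Taking the index of the largest score in a row gives a crude pointer to the most unusual feature of that observation. Because the scores are rank-based, they ignore how far a value lies from the rest, which a fitted copula model with suitable marginals would capture.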
An overview of clustering methods with guidelines for application in mental health research
Cluster analyses have been widely used in mental health research to decompose inter-individual heterogeneity
by identifying more homogeneous subgroups of individuals. However, despite advances in new algorithms and
increasing popularity, there is little guidance on model choice, analytical framework and reporting requirements.
In this paper, we aimed to address this gap by introducing the philosophy, design, advantages/disadvantages and
implementation of major algorithms that are particularly relevant in mental health research. Extensions of basic
models, such as kernel methods, deep learning, semi-supervised clustering, and clustering ensembles are subsequently
introduced. How to choose algorithms to address common issues as well as methods for pre-clustering
data processing, clustering evaluation and validation are then discussed. Importantly, we also provide general
guidance on clustering workflow and reporting requirements. To facilitate the implementation of different algorithms,
we provide information on R functions and librarie