12 research outputs found

    Systematic construction of anomaly detection benchmarks from real data

    Research in anomaly detection suffers from a lack of realistic and publicly available problem sets. This paper discusses what properties such problem sets should possess. It then introduces a methodology for transforming existing classification data sets into ground-truthed benchmark data sets for anomaly detection. The methodology produces data sets that vary along three important dimensions: (a) point difficulty, (b) relative frequency of anomalies, and (c) clusteredness. We apply our generated datasets to benchmark several popular anomaly detection algorithms under a range of different conditions.
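    As a rough illustration of one of the three dimensions above, relative frequency of anomalies, the following sketch relabels a classification data set as normal versus anomalous and subsamples the anomalous class to a target rate. It is not the paper's procedure: the function name `make_anomaly_benchmark`, its parameters, and the choice to keep every normal point are assumptions made for this example, and point difficulty and clusteredness are not modelled.

```python
import random


def make_anomaly_benchmark(rows, labels, anomaly_classes, anomaly_rate, seed=0):
    """Hypothetical helper: turn labelled classification data into a
    ground-truthed anomaly detection set with a chosen anomaly frequency.

    Returns a shuffled list of (row, is_anomaly) pairs, is_anomaly in {0, 1}.
    """
    if not 0.0 < anomaly_rate < 1.0:
        raise ValueError("anomaly_rate must be strictly between 0 and 1")

    # Classes listed in anomaly_classes become anomaly candidates;
    # every other class is treated as normal and kept in full (an assumption).
    normal = [r for r, c in zip(rows, labels) if c not in anomaly_classes]
    candidates = [r for r, c in zip(rows, labels) if c in anomaly_classes]

    # Pick k so that k / (k + len(normal)) is close to anomaly_rate,
    # capped by the number of candidates actually available.
    k = round(anomaly_rate * len(normal) / (1.0 - anomaly_rate))
    k = min(k, len(candidates))

    rng = random.Random(seed)  # fixed seed keeps the benchmark reproducible
    anomalies = rng.sample(candidates, k)

    data = [(r, 0) for r in normal] + [(r, 1) for r in anomalies]
    rng.shuffle(data)
    return data
```

    Calling the helper with several values of `anomaly_rate` on the same source data yields a family of benchmarks that differ only in how rare the anomalies are, with the original class labels serving as ground truth.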

    Automatic classification of Candida species using Raman spectroscopy and machine learning

    One of the problems that most affect hospitals is infections by pathogenic microorganisms. Rapid identification and adequate, timely treatment can avoid fatal consequences and the development of antibiotic resistance, so it is crucial to use fast, reliable, and not too laborious techniques to obtain quick results. Raman spectroscopy has proven to be a powerful tool for molecular analysis, meeting these requirements better than traditional techniques. In this work, we have used Raman spectroscopy combined with machine learning algorithms to explore the automatic identification of eleven species of the genus Candida, the most common cause of fungal infections worldwide. The Raman spectra were obtained from more than 220 different measurements of dried drops from pure cultures of each Candida species using a Raman Confocal Microscope with a 532 nm laser excitation source. After developing a spectral preprocessing methodology, a study of the quality and variability of the measured spectra at the isolate and species level, and the spectral features contributing to inter-class variations, showed the potential to discriminate between those pathogenic yeasts. Several machine learning and deep learning algorithms were trained using hyperparameter optimization techniques to find the best possible classifier for this spectral data, in terms of accuracy and lowest possible overfitting. 
    We found that a one-dimensional Convolutional Neural Network (1-D CNN) could achieve above 80 % overall accuracy on the eleven-class spectral dataset, with good generalization capabilities. This work was supported by the R + D projects INNVAL19/17 (funded by Instituto de Investigación Valdecilla-IDIVAL), PID2019-107270RB-C21 (funded by MCIN/ AEI /10.13039/501100011033) and by Plan Nacional de I + D + and Instituto de Salud Carlos III (ISCIII), Subdirección General de Redes y Centros de Investigación Cooperativa, Ministerio de Ciencia, Innovación y Universidades, Spanish Network for Research in Infectious Diseases (REIPI RD16/0016/0007), CIBERINFEC (CB21/13/00068), CIBER-BBN (BBNGC1601), cofinanced by European Development Regional Fund “A way to achieve Europe”. A. A. O.-S. was financially supported by the Miguel Servet II program (ISCIII-CPII17-00011).

    Copula-based anomaly scoring and localization for large-scale, high-dimensional continuous data

    The anomaly detection method presented in this paper has a special feature: it not only indicates whether an observation is anomalous but also identifies what exactly makes an anomalous observation unusual. Hence, it helps localize the cause of the anomaly. The proposed approach is model-based; it relies on the multivariate probability distribution associated with the observations. Since rare events lie in the tails of probability distributions, we use copula functions, which can model fat-tailed distributions well. The presented procedure scales well; it can cope with a large number of high-dimensional samples. Furthermore, our procedure can cope with missing values, which occur frequently in high-dimensional data sets. In the second part of the paper, we demonstrate the usability of the method through a case study in which we analyze a large data set consisting of the performance counters of a real mobile telecommunication network. Since such networks are complex systems, the signs of sub-optimal operation can remain hidden for a potentially long time. With the proposed procedure, many such hidden issues can be isolated and indicated to the network operator. Comment: 27 pages, 12 figures, accepted at ACM Transactions on Intelligent Systems and Technology

    An overview of clustering methods with guidelines for application in mental health research

    Cluster analyses have been widely used in mental health research to decompose inter-individual heterogeneity by identifying more homogeneous subgroups of individuals. However, despite advances in new algorithms and increasing popularity, there is little guidance on model choice, analytical framework, and reporting requirements. In this paper, we aim to address this gap by introducing the philosophy, design, advantages and disadvantages, and implementation of major algorithms that are particularly relevant in mental health research. Extensions of basic models, such as kernel methods, deep learning, semi-supervised clustering, and clustering ensembles, are subsequently introduced. We then discuss how to choose algorithms to address common issues, as well as methods for pre-clustering data processing, clustering evaluation, and validation. Importantly, we also provide general guidance on clustering workflow and reporting requirements. To facilitate the implementation of different algorithms, we provide information on R functions and libraries