
    Hybrid self-organizing feature map (SOM) for anomaly detection in cloud infrastructures using granular clustering based upon value-difference metrics

    We have witnessed an increase in the availability of data from diverse sources over the past few years. Cloud computing, big data and the Internet-of-Things (IoT) are distinctive cases of such an increase that demand novel approaches to data analytics in order to process and analyze huge volumes of data for security and business use. Cloud computing has become popular for critical IT infrastructure mainly due to cost savings and dynamic scalability. Current offerings, however, are not mature enough with respect to stringent security and resilience requirements. Mechanisms such as hybrid anomaly-detection systems are required to protect against various challenges, including network-based attacks, performance issues and operational anomalies. Such hybrid AI systems include neural networks, blackboard systems, belief (Bayesian) networks, case-based reasoning and rule-based systems, and can be implemented in a variety of ways. Traffic in the cloud comes from multiple heterogeneous domains and changes rapidly due to the variety of operational characteristics of the tenants using the cloud and the elasticity of the provided services. The underlying detection mechanisms rely upon measurements drawn from multiple sources; however, the characteristics of the distribution of measurements within specific subspaces might be unknown. We argue in this paper that there is a need to cluster the observed data during normal network operation into multiple subspaces, each featuring specific local attributes, i.e. granules of information. Clustering is implemented by the inference engine of a model hybrid NN system. Several variations of the so-called value-difference metric (VDM) are investigated, such as local histograms and the Canberra distance for scalar attributes, the Jaccard distance for binary word attributes, rough sets, as well as local histograms over an aggregate ordering distance and the Canberra measure for vectorial attributes. Low-dimensional subspace representations of each group of points (measurements) in the context of anomaly detection in critical cloud implementations are based upon VD metrics and can be either parametric or non-parametric. A novel application of a Self-Organizing Feature Map (SOFM) of reduced/aggregate ordered sets of objects featuring VD metrics (as obtained from distributed network measurements) is proposed. Each node of the SOFM stands for a structured local distribution of such objects within the input space. The so-called Neighborhood-based Outlier Factor (NOOF) is defined for such reduced/aggregate ordered sets of objects as a value-difference metric of histograms. Measurements that do not belong to local distributions are detected as anomalies, i.e. outliers of the trained SOFM. Several methods of subspace clustering using Expectation-Maximization Gaussian Mixture Models (a parametric approach) as well as local data densities (a non-parametric approach) are outlined and compared against the proposed method, using data obtained from our cloud testbed under emulated anomalous traffic conditions. The results, which are obtained from a model NN system, indicate that the proposed method performs well in comparison with conventional techniques.
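    The following is a minimal sketch, not the paper's implementation, of the core mechanism the abstract describes: a SOFM is trained on histogram-valued features with the Canberra distance standing in for the value-difference metric, and test measurements whose quantization error against the trained map is large are flagged as outliers. The grid size, learning-rate and neighbourhood schedules, and the Dirichlet toy data are all illustrative assumptions.

```python
# SOFM anomaly detection sketch: Canberra distance over histogram features.
import numpy as np

def canberra(a, b):
    """Canberra distance between two non-negative histograms."""
    denom = np.abs(a) + np.abs(b)
    mask = denom > 0                      # skip 0/0 terms
    return np.sum(np.abs(a - b)[mask] / denom[mask])

def train_som(X, grid=(6, 6), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    n_nodes = grid[0] * grid[1]
    W = X[rng.choice(len(X), n_nodes)].copy()          # init nodes from data
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                # decaying learning rate
        sigma = sigma0 * (1 - epoch / epochs) + 0.5    # shrinking neighbourhood
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin([canberra(x, w) for w in W])   # best-matching unit
            d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))             # neighbourhood kernel
            W += lr * h[:, None] * (x - W)
    return W

def anomaly_scores(X, W):
    """Quantization error of each sample w.r.t. the trained map."""
    return np.array([min(canberra(x, w) for w in W) for x in X])

# Toy usage: smooth traffic histograms vs. one spiky (anomalous) histogram.
rng = np.random.default_rng(1)
normal = rng.dirichlet(np.ones(16) * 5, size=500)
W = train_som(normal)
test = np.vstack([normal[:5], rng.dirichlet(np.ones(16) * 0.2, size=1)])
print(anomaly_scores(test, W))   # the last score should stand out
```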

    Data segmentation based on the local intrinsic dimension

    One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases in which the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be used proficiently even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular-dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
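    As a rough illustration of local-ID segmentation (a sketch under assumptions, not the authors' procedure), the snippet below estimates a local ID around each point by applying the standard TWO-NN maximum-likelihood estimator to its k-nearest-neighbour patch, then segments points by thresholding the estimates. The neighbourhood size k and the threshold are illustrative choices.

```python
# Local intrinsic-dimension segmentation sketch using the TWO-NN estimator.
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(points):
    """TWO-NN ML estimate: ID = N / sum(log(r2 / r1))."""
    tree = cKDTree(points)
    d, _ = tree.query(points, k=3)   # column 0 is the point itself
    mu = d[:, 2] / d[:, 1]           # ratio of 2nd to 1st neighbour distance
    return len(points) / np.sum(np.log(mu))

def local_id(X, k=40):
    """Estimate an ID for each point from its k-nearest-neighbour patch."""
    tree = cKDTree(X)
    _, nbrs = tree.query(X, k=k)
    return np.array([twonn_id(X[idx]) for idx in nbrs])

# Toy usage: a 2-D plane and a 5-D blob, both embedded in R^5.
rng = np.random.default_rng(0)
plane = np.hstack([rng.normal(size=(500, 2)), np.zeros((500, 3))])
blob = rng.normal(size=(500, 5)) + 5.0        # shifted so the regions separate
X = np.vstack([plane, blob])
ids = local_id(X)
labels = (ids > 3.5).astype(int)              # simple two-region segmentation
print(ids[:500].mean(), ids[500:].mean())     # roughly 2 vs. roughly 5
```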

    Riemannian Multi-Manifold Modeling

    This paper advocates a novel framework for segmenting a dataset in a Riemannian manifold M into clusters lying around low-dimensional submanifolds of M. Important examples of M, for which the proposed clustering algorithm is computationally efficient, are the sphere, the set of positive definite matrices, and the Grassmannian. The clustering problem with these examples of M is already useful for numerous application domains such as action identification in video sequences, dynamic texture clustering, brain fiber segmentation in medical imaging, and clustering of deformed images. The proposed clustering algorithm constructs a data-affinity matrix by thoroughly exploiting the intrinsic geometry and then applies spectral clustering. The intrinsic local geometry is encoded by local sparse coding and, more importantly, by directional information of local tangent spaces and geodesics. Theoretical guarantees are established for a simplified variant of the algorithm even when the clusters intersect. To avoid complication, these guarantees assume that the underlying submanifolds are geodesic. Extensive validation on synthetic and real data demonstrates the resiliency of the proposed method against deviations from the theoretical model as well as its superior performance over state-of-the-art techniques.
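    The sketch below illustrates the flavour of this approach on the simplest example manifold, the sphere: an affinity matrix combines geodesic proximity with the alignment of locally estimated tangent directions, and spectral clustering is applied to it. This is a heavily simplified stand-in, not the paper's algorithm: the local sparse-coding step is omitted, tangent directions are compared in ambient coordinates without parallel transport, and the parameters sigma and k and the use of sklearn's KMeans are illustrative assumptions.

```python
# Sphere-valued multi-manifold clustering sketch: geodesic + tangent affinity.
import numpy as np
from sklearn.cluster import KMeans

def log_map(p, q):
    """Riemannian log map on the unit sphere: tangent vector at p toward q."""
    c = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(c)
    v = q - c * p
    n = np.linalg.norm(v)
    return theta * v / n if n > 1e-12 else np.zeros_like(p)

def tangent_direction(X, i, k=10):
    """Dominant tangent direction at X[i] from its k log-mapped neighbours."""
    d = np.arccos(np.clip(X @ X[i], -1, 1))
    nbrs = np.argsort(d)[1:k + 1]
    V = np.array([log_map(X[i], X[j]) for j in nbrs])
    _, _, vt = np.linalg.svd(V, full_matrices=False)
    return vt[0]

def cluster_sphere(X, n_clusters=2, sigma=0.3):
    n = len(X)
    T = np.array([tangent_direction(X, i) for i in range(n)])
    geo = np.arccos(np.clip(X @ X.T, -1, 1))     # geodesic distances
    align = np.abs(T @ T.T)                      # |cos| of tangent angles
    A = np.exp(-geo ** 2 / sigma ** 2) * align   # combined affinity
    D = np.diag(1 / np.sqrt(A.sum(1)))
    L = D @ A @ D                                # normalized affinity
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, -n_clusters:]                    # top eigenvectors
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters, n_init=10).fit_predict(U)

# Toy usage: two intersecting great circles on the sphere.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
c1 = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)
c2 = np.stack([np.cos(t), np.zeros_like(t), np.sin(t)], axis=1)
X = np.vstack([c1, c2]) + np.random.default_rng(0).normal(scale=0.01, size=(400, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(cluster_sphere(X))
```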

    Advances in Mining Binary Data: Itemsets as Summaries

    Mining frequent itemsets is one of the most popular topics in data mining. Itemsets are local patterns, representing frequently co-occurring sets of variables. This thesis studies the use of itemsets to give information about the whole dataset. We show how to use itemsets for answering queries, that is, finding out the number of transactions satisfying some given formula. While this is a simple procedure given the original data, the task becomes computationally infeasible if we seek the solution using only the itemsets. By making some assumptions about the structure of the itemsets and applying techniques from the theory of Markov random fields, we are able to reduce the computational burden of query answering. We can also use the known itemsets to predict the unknown itemsets. The difference between the prediction and the actual value can be used for ranking itemsets. In fact, this method can be seen as a generalisation of ranking itemsets based on their deviation from the independence model, an approach commonly used in the data-mining literature. The next contribution is to use itemsets to define a distance between datasets. We achieve this by computing the difference between the frequencies of the itemsets. We take into account the fact that the itemset frequencies may be correlated, and by removing the correlation we show that our distance transforms into the Euclidean distance between the frequencies of parity formulae. The last contribution concerns calculating the effective dimension of binary data. We apply the fractal dimension, a known concept that works well with real-valued data. Applying the fractal dimension directly is problematic because of the unique nature of binary data. We propose a solution to this problem by introducing a new concept called the normalised correlation dimension. We study our approach theoretically and empirically by comparing it against other methods.
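    A minimal sketch, under assumptions, of the itemset-based distance between two binary datasets: compute the frequency of each itemset in a chosen family in both datasets and take the Euclidean distance between the two frequency vectors. The thesis shows that removing the correlation between itemset frequencies turns this into a Euclidean distance between parity-formula frequencies; the sketch simply computes both frequency vectors side by side rather than deriving the decorrelating transform, and the itemset family and toy data are illustrative choices.

```python
# Itemset-frequency distance sketch between two 0/1 transaction matrices.
import numpy as np
from itertools import combinations

def itemset_freq(D, itemset):
    """Fraction of transactions (rows of D) containing every item in itemset."""
    return np.mean(np.all(D[:, itemset] == 1, axis=1))

def parity_freq(D, itemset):
    """Fraction of transactions in which an odd number of the items occur."""
    return np.mean(D[:, itemset].sum(axis=1) % 2 == 1)

def freq_vector(D, family, freq=itemset_freq):
    return np.array([freq(D, list(s)) for s in family])

# Family: all non-empty itemsets over the first 4 items.
family = [s for r in range(1, 5) for s in combinations(range(4), r)]

rng = np.random.default_rng(0)
D1 = (rng.random((1000, 4)) < 0.4).astype(int)
D2 = (rng.random((1000, 4)) < 0.6).astype(int)

d_itemset = np.linalg.norm(freq_vector(D1, family) - freq_vector(D2, family))
d_parity = np.linalg.norm(freq_vector(D1, family, parity_freq)
                          - freq_vector(D2, family, parity_freq))
print(d_itemset, d_parity)
```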