5 research outputs found

    Integration of Expectation Maximization using Gaussian Mixture Models and Naïve Bayes for Intrusion Detection

    Intrusion detection is the process of examining information about system activities or data in order to detect malicious behavior or unauthorized activity. Most IDS implementations use the K-means clustering technique because of its linear complexity and fast computation. Nonetheless, its naïve use of the mean data value as the cluster core presents a major drawback: two circular clusters with different radii can be centered at the same mean, a situation K-means cannot resolve because the mean values of the clusters are nearly identical. K-means also fails when the clusters are not spherical. To overcome this issue, a new hybrid model integrating expectation maximization (EM) clustering using a Gaussian mixture model (GMM) with a naïve Bayes classifier has been proposed. In this model, the GMM gives more flexibility than K-means in terms of cluster covariance. GMMs also use probability functions and soft clustering, so a single data point can belong to multiple clusters. In a GMM, the shape of a cluster is defined by two parameters, the mean and the standard deviation, so a cluster can take any kind of elliptical shape. EM-GMM is used to cluster the data into the corresponding category based on its activity.
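
    To make the contrast concrete, the following minimal sketch (not taken from the paper; the synthetic data, the parameters, and the use of scikit-learn are illustrative assumptions) shows how K-means makes hard, mean-based assignments while an EM-fitted GMM models each cluster's covariance and yields soft membership probabilities:

        # Minimal sketch, NOT the paper's implementation: K-means hard
        # assignments vs. EM-GMM soft clustering on elliptical clusters.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        # Two elongated (elliptical) clusters with nearby mean regions.
        a = rng.multivariate_normal([0, 0], [[4.0, 0.0], [0.0, 0.2]], 300)
        b = rng.multivariate_normal([0, 3], [[0.2, 0.0], [0.0, 4.0]], 300)
        X = np.vstack([a, b])

        # K-means: hard labels based only on distance to the cluster mean.
        km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

        # EM-GMM: each component has its own mean AND covariance, so
        # elliptical shapes are modelled; predict_proba returns soft
        # memberships, i.e. one point can belong to several clusters.
        gmm = GaussianMixture(n_components=2, covariance_type="full",
                              random_state=0).fit(X)
        soft = gmm.predict_proba(X)   # shape (600, 2), rows sum to 1
        hard = gmm.predict(X)         # argmax of the soft memberships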

    Hybrid intelligent approach for network intrusion detection

    In recent years, computer networks have become widely used and highly complex. A great deal of sensitive information passes through many kinds of computing devices, from minicomputers to servers and mobile devices. These changes have led to the conclusion that the number of attacks on important information in network systems is increasing every year. Intrusion is the main threat to a network. It is defined as a series of activities aimed at compromising the security of network systems in terms of confidentiality, integrity, and availability; as a result, intrusion detection is an extremely important part of the defense. Hence, there must be substantial improvement in network intrusion detection techniques and systems. Owing to the limitations of previous intrusion detection approaches in finding novel attacks, their high false detection rates, and their limited accuracy, this study proposes a hybrid intelligent approach for network intrusion detection based on the k-means clustering algorithm and the support vector machine classification algorithm. The aim of this study is to reduce the false alarm rate and improve the detection rate compared with existing intrusion detection approaches. The NSL-KDD intrusion dataset has been used for training and testing the proposed approach. To improve classification performance, several preprocessing steps are taken. The first unifies the types and filters the dataset by data transformation. Then a feature selection algorithm is applied to remove irrelevant and noisy features for the purpose of intrusion detection; feature selection reduces the features from 41 to 21. A normalization method is then employed to reduce the differences among the data. Clustering with the k-means algorithm is the last processing step before classification, for which a support vector machine is used. After training and testing the proposed hybrid intelligent approach, performance evaluation shows that the proposed network intrusion detection achieves high accuracy and a low false detection rate: the accuracy is 96.025 percent and the false alarm rate is 3.715 percent.
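
    As a rough illustration of the described pipeline, the sketch below strings together feature selection (41 to 21 features), normalization, k-means clustering, and an SVM with scikit-learn. The concrete selector, the way cluster labels are passed to the SVM, and all parameter values are assumptions, since the abstract does not specify them:

        # Rough sketch of the described pipeline using scikit-learn; the
        # feature selector, the coupling of cluster labels to the SVM,
        # and all parameter values are assumptions, not the paper's.
        import numpy as np
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.preprocessing import MinMaxScaler
        from sklearn.cluster import KMeans
        from sklearn.svm import SVC

        def train_hybrid(X_train, y_train, n_features=21, n_clusters=5):
            # 1) Reduce the 41 NSL-KDD features to 21 (selector assumed).
            selector = SelectKBest(f_classif, k=n_features).fit(X_train, y_train)
            X_sel = selector.transform(X_train)
            # 2) Normalize to reduce differences among the data.
            scaler = MinMaxScaler().fit(X_sel)
            X_norm = scaler.transform(X_sel)
            # 3) Cluster with k-means, then append each record's cluster
            #    id as an extra feature (one plausible coupling).
            km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_norm)
            X_aug = np.hstack([X_norm, km.labels_[:, None]])
            # 4) Train the SVM classifier on the clustered data.
            svm = SVC(kernel="rbf").fit(X_aug, y_train)
            return selector, scaler, km, svm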

    complexFuzzy: A novel clustering method for selecting training instances of cross-project defect prediction

    Over the last decade, researchers have investigated to what extent cross-project defect prediction (CPDP) offers advantages over traditional defect prediction settings. In these works, the training and testing data for defect prediction are not taken from the same project; instead, dissimilar projects are employed. Selecting proper training data plays an important role in the success of CPDP. In this study, a novel clustering method named complexFuzzy is presented for selecting the training data of CPDP. The method determines membership values with the help of metrics that can be considered indicators of complexity. First, CPDP combinations are created on 29 different data sets. Subsequently, complexFuzzy is evaluated by considering the cluster centers of the data sets and comparing performance measures including area under the curve (AUC) and F-measure. The method is superior to five other comparison algorithms in terms of the distance of cluster centers and prediction performance.
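
    The abstract does not give complexFuzzy's exact formulation, so the following sketch only illustrates the general idea of deriving fuzzy membership values over complexity metrics, with the standard fuzzy c-means membership formula as a stand-in; the fuzzifier m and the cluster centers are assumptions:

        # Hedged sketch only: standard fuzzy c-means memberships over
        # complexity metrics stand in for complexFuzzy's (unpublished
        # here) formulation; m and the centers are assumptions.
        import numpy as np

        def fuzzy_memberships(X, centers, m=2.0, eps=1e-12):
            """X: (n, d) complexity metrics per module; centers: (c, d)."""
            # Distance of every instance to every cluster center.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
            # Fuzzy c-means membership: u_ik = 1 / sum_j (d_ik/d_ij)^(2/(m-1)).
            ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
            return 1.0 / ratio.sum(axis=2)   # rows sum to 1 across clusters

        # Instances with high membership to the cluster nearest the test
        # project would then be selected as cross-project training data.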

    Optimised meta-clustering approach for clustering Time Series Matrices

    The prognostics (health state) of multiple components, represented as time series data stored in vectors and matrices, were processed and clustered more effectively and efficiently using the newly devised 'Meta-Clustering' approach. The time series data were gathered from large applications and systems in diverse fields such as communication, medicine, data mining, audio and visual applications, and sensors. Time series data were chosen as the domain of this research because meaningful information can be extracted about the characteristics of systems and components found in large applications; for clustering in particular, only time series data allow these components to be grouped according to their life cycle, i.e. from the time they are healthy until the time they start to develop faults and ultimately fail. A technique that processes extracted time series data more effectively therefore significantly cuts down on space and time consumption, both crucial factors in data mining. This approach consequently improves on current state-of-the-art pattern recognition algorithms such as K-NM, as clusters are identified faster while consuming less space. The project also has practical implications: calculating the distance between similar components faster and with less space means that the prognostics of the clustered components can be realised and understood more efficiently. This was achieved by using the Meta-Clustering approach to process and cluster the time series data, first extracting and storing the time series data as a two-dimensional matrix, and then implementing an enhanced K-NM clustering algorithm based on the notion of Meta-Clustering, using Euclidean distance to measure the similarity between different sets of failure patterns. The approach initially classifies and organises each component within its own refined individual cluster, which yields the most relevant set of failure patterns showing the highest level of similarity and discards any unnecessary data that adds no value towards better understanding the failure/health state of the component. In the second stage, once these inner clusters have been obtained, they are grouped into one general cluster that represents the prognostics of all the processed components. The approach was tested on multivariate time series data extracted from IGBT components within Matlab, and the results show that the optimised Meta-Clustering approach does indeed consume less time and space to cluster the prognostics of IGBT components than existing data mining techniques.
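
    A minimal sketch of the two-stage idea follows. It treats the 'K-NM' step as a k-means-style clustering (an assumption, since the algorithm's details are not given here): inner clusters are formed per component, and their centers are then grouped into general clusters:

        # Hedged sketch of the two-stage Meta-Clustering idea; "K-NM" is
        # treated as a k-means-style step, and the k values, like the
        # equal series length t across components, are assumptions.
        import numpy as np
        from sklearn.cluster import KMeans

        def meta_cluster(components, k_inner=3, k_outer=2):
            """components: list of (n_i, t) matrices, one per component."""
            centers = []
            for ts in components:
                # Stage 1: refine each component into its own inner
                # clusters of similar failure patterns (Euclidean distance).
                km = KMeans(n_clusters=k_inner, n_init=10, random_state=0).fit(ts)
                centers.append(km.cluster_centers_)
            # Stage 2: group all inner-cluster centers into general
            # clusters representing the prognostics of all components.
            all_centers = np.vstack(centers)
            outer = KMeans(n_clusters=k_outer, n_init=10, random_state=0).fit(all_centers)
            return outer.labels_, all_centers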

    Clustering of Large, High-Dimensional, and Uncertain Data Sets in Astronomy

    Data volumes are growing continuously in many IT-related fields. Scientific, and in particular astronomical, data sets exhibit complex properties such as uncertainties, a high number of dimensions, and an enormous number of data instances. For example, astronomical data sets contain several million data instances, each with several thousand dimensions reflecting the number of independent properties or components. These orders of magnitude in dimensions and data volume, combined with uncertainties, show that automated analysis of such data sets within acceptable analysis time, and thus acceptable computational complexity, is necessary. Clustering methods offer one possible analysis technique for examining similarities within a data set. Current methods, however, integrate only individual aspects of these complex data sets, in some cases with non-linear computational complexity with respect to a growing number of data instances and dimensions. This dissertation outlines the individual challenges of processing complex data in a clustering method. It furthermore presents a novel parameterizable approach for processing large and complex data sets, called Fractal Similarity Measures, which processes the data in log-linear analysis time. The likewise presented uncertain sorting method for high-dimensional data provides the grids required by the initialization procedure. Using the new concept of the fractal similarity measure, or fractal information value, the method analyzes candidate clusters and data instances for similarities. To demonstrate the functionality and efficiency of the algorithm, this work evaluates the method on a synthetic data set and a real data set from astronomy. Processing the real data set requires the given spectral data to be comparable, which is why a further method for preprocessing spectral data based on the Hadoop framework is presented. The dissertation also presents results of clustering the real data set that are qualitatively comparable to results created manually by domain experts.
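
    Since the abstract does not define the fractal similarity measure itself, the sketch below shows only one standard way a fractal information value can be estimated for a point set, via box counting; the scales and the normalization are illustrative assumptions:

        # Illustration only: the dissertation's Fractal Similarity
        # Measures are not specified in the abstract; this shows a
        # standard box-counting estimate of fractal dimension, one common
        # way to derive a fractal information value for a point set.
        import numpy as np

        def box_counting_dimension(X, scales=(2, 4, 8, 16, 32)):
            """X: (n, d) points; returns an estimated fractal dimension."""
            X = np.asarray(X, dtype=float)
            X = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)  # unit cube
            counts = []
            for s in scales:
                # Count occupied grid cells at resolution s per axis.
                cells = np.clip(np.floor(X * s).astype(int), 0, s - 1)
                counts.append(len(np.unique(cells, axis=0)))
            # Slope of log(count) vs. log(scale) estimates the dimension.
            slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
            return slope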