
    J48Consolidated: an implementation of CTC algorithm for WEKA

    The CTC algorithm (Consolidated Tree Construction algorithm) is a machine learning paradigm that was designed to solve a class imbalance problem, specifically a fraud detection problem in the area of car insurance [1] where, in addition, an explanation of the classification made was required. The algorithm is based on a decision tree construction algorithm, in this case the well-known C4.5, but it extracts knowledge from the data using a set of samples instead of a single one as C4.5 does. In contrast to other methodologies that build a classifier from several samples, such as bagging, CTC builds a single tree and, as a consequence, obtains comprehensible classifiers. The main motivation of this implementation is to make a public implementation of the CTC algorithm available. With this purpose we have implemented the algorithm within the well-known WEKA data mining environment (http://www.cs.waikato.ac.nz/ml/weka/). WEKA is an open source project that contains a collection of machine learning algorithms written in Java for data mining tasks. J48 is the implementation of the C4.5 algorithm within the WEKA package. Our implementation of the CTC algorithm, based on the J48 Java class, is called J48Consolidated.
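    As a rough sketch of how such a classifier would be driven through WEKA's standard Java API, assuming the class is packaged as weka.classifiers.trees.J48Consolidated and that dataset.arff is a placeholder for the user's own data:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48Consolidated; // assumed package path for the class
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CtcDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset in WEKA's ARFF format (placeholder file name)
        Instances data = new DataSource("dataset.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Build ONE consolidated tree from a set of subsamples
        J48Consolidated ctc = new J48Consolidated();
        ctc.buildClassifier(data);
        System.out.println(ctc); // readable output, like any J48 tree

        // Standard 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(ctc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```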

    An update of the J48Consolidated WEKA’s class: CTC algorithm enhanced with the notion of coverage

    This document describes an update of the implementation of the J48Consolidated class within the WEKA platform. The J48Consolidated class implements the CTC algorithm [2][3], which builds a unique decision tree based on a set of samples. The J48Consolidated class extends WEKA's J48 class, which implements the well-known C4.5 algorithm. This implementation was described in the technical report "J48Consolidated: An implementation of CTC algorithm for WEKA". The main, but not the only, change in this update is the integration of the notion of coverage in order to determine the number of samples to be generated to build a consolidated tree. We define coverage as the percentage of examples of the training sample present in, or covered by, the set of generated subsamples. Thus, depending on the type of samples we use, we will need more or fewer samples in order to achieve a specific value of coverage.
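    A back-of-the-envelope model makes the dependence on sample type concrete: if each training example is absent from a single subsample with probability pAbsent, and subsamples are generated independently, the expected coverage after N subsamples is 1 - pAbsent^N. The sketch below is our illustration of that formula, not code from the class; for 75% subsamples pAbsent = 0.25, while for a bootstrap sample of training-set size pAbsent is roughly e^-1 (about 0.368).

```java
// Sketch: how many subsamples are needed to reach a target coverage,
// assuming subsamples are drawn independently of each other.
public class CoverageEstimate {
    // pAbsent: probability a given training example is missing from ONE subsample.
    // Coverage after N subsamples is 1 - pAbsent^N, so we solve for N.
    static int samplesNeeded(double pAbsent, double targetCoverage) {
        return (int) Math.ceil(Math.log(1 - targetCoverage) / Math.log(pAbsent));
    }

    public static void main(String[] args) {
        System.out.println(samplesNeeded(0.25, 0.99));          // 75% subsamples -> 4
        System.out.println(samplesNeeded(Math.exp(-1), 0.99));  // bootstrap -> 5
    }
}
```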

    Generation of the database gureKDDCup

    The gureKDDCup database has been generated within the UADI project (Unsupervised Anomaly Detection for Intrusion detection system), in which a classifier that detects intrusions or attacks in network-based systems was developed. To develop this classifier we use unsupervised classification techniques. The main distinctive feature of this project is that it uses the payload (the body of network packets) to detect attacks in network connections. The analysis of the payload to classify connections is not a deeply analysed field; however, it seems essential for detecting attacks such as R2L (Remote to Local, whose goal is to access a resource without permission) and U2R (User to Root, whose goal is to obtain root or administrative privileges without having them). In the classification process we have to handle a huge number of connections and discover useful patterns among them, which leads us to the Data Mining field. Moreover, we want our UADI system to be able to discover patterns or generate the model of network traffic automatically, that is, we want the learning process to be automatic, and to make this possible we use Machine Learning techniques. But first it is essential to generate an appropriate database to work on. The aim of this report is therefore to explain the process we followed to generate the database used in the UADI project.
    The objective is to generate a database with characteristics similar to KDDCup99, a database broadly used in the scientific community, taking the Darpa98 data (DARPA Intrusion Detection Data Sets) as the starting point. The generated database is called gureKDDCup and has features similar to those of KDDCup99, but with payload information and other connection features (such as IP addresses and port numbers) added; a sketch of the intended record layout follows below. The report then explains the steps followed to generate the KDDCup99 database, because our aim is to repeat those steps as accurately as possible to create the database we need in the UADI project, in other words, a new extension of KDDCup99 (KDDCup99+payload) that we call gureKDDCup.

    Funding: The University of the Basque Country UPV/EHU (BAILab, grant UFI11/45); the Department of Education, Universities and Research of the Basque Government (grant IT-395-10); the Ministry of Economy and Competitiveness of the Spanish Government and the European Regional Development Fund - ERDF (eGovernAbility, grant TIN2014-52665-C2-1-R).
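    To make that record layout concrete, one gureKDDCup connection could be modelled as the plain Java class below; the field names are an illustrative guess at a KDDCup99-style schema plus the extensions the report names (payload, IP addresses, port numbers), not the actual column list:

```java
// Illustrative sketch of one connection record: a few KDDCup99-style
// features plus the gureKDDCup extensions. Field names are hypothetical.
public class ConnectionRecord {
    // KDDCup99-style basic features
    long duration;          // connection length in seconds
    String protocolType;    // e.g. "tcp", "udp", "icmp"
    String service;         // e.g. "http", "ftp"
    long srcBytes;          // bytes from source to destination
    long dstBytes;          // bytes from destination to source

    // Extensions added in gureKDDCup
    String srcIp;           // source IP address
    String dstIp;           // destination IP address
    int srcPort;            // source port number
    int dstPort;            // destination port number
    byte[] payload;         // raw packet payload (body of the packets)

    String label;           // "normal" or the attack name
}
```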

    Diagnostic classification of Parkinson’s disease based on non-motor manifestations and machine learning strategies

    Non-motor manifestations of Parkinson’s disease (PD) appear early and have a significant impact on the quality of life of patients, but few studies have evaluated their predictive potential with machine learning algorithms. We evaluated 9 algorithms for discriminating PD patients from controls using a wide collection of non-motor clinical PD features from two databases: Biocruces (96 subjects) and PPMI (687 subjects). In addition, we evaluated whether the combination of both databases could improve the individual results. For each database, 2 versions with different granularity were created and a feature selection process was performed. We observed that most of the algorithms were able to detect PD patients with high accuracy (>80%). Support Vector Machine and Multi-Layer Perceptron obtained the best performance, with accuracies of 86.3% and 84.7%, respectively. Likewise, feature selection led to a significant reduction in the number of variables and to better performance. The enrichment of the Biocruces database with data from PPMI moderately benefited the performance of the classification algorithms, especially the recall and, to a lesser extent, the accuracy, while the precision worsened slightly. The use of interpretable rules obtained by the RIPPER algorithm showed that, using just two variables (autonomic manifestations and olfactory dysfunction), it was possible to achieve an accuracy of 84.4%. Our study demonstrates that the analysis of non-motor parameters of PD through machine learning techniques can detect PD patients with high accuracy and recall, and allows us to select the most discriminative non-motor variables to create potential tools for PD screening.

    Funding: Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was partially funded by the Department of Education, Universities and Research of the Basque Government (ADIAN, IT-980-16); by the Spanish Ministry of Science, Innovation and Universities - National Research Agency and the European Regional Development Fund - ERDF (PhysComp, TIN2017-85409-P), and by the State Research Agency (AEI, Spain) under grant agreement No RED2018-102312-T (IA-Biomed); by the Michael J. Fox Foundation [RRIA 2014 (Rapid Response Innovation Awards) Program (Grant ID: 10189)]; by the Instituto de Salud Carlos III through the projects “PI14/00679” and “PI16/00005” and the Juan Rodes grant “JR15/00008” (IG) (co-funded by the European Regional Development Fund/European Social Fund - “Investing in your future”); and by the Department of Health of the Basque Government through the projects “2016111009” and “2019111100”.
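    RIPPER is available in WEKA as weka.classifiers.rules.JRip, so a compact two-variable rule model like the one described could be reproduced along the following lines; the ARFF file name and attribute positions are placeholders, not the study's actual data:

```java
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RipperDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder file; assumes the class attribute is last
        Instances data = new DataSource("pd_nonmotor.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Keep only two hypothetical predictors plus the class attribute,
        // mirroring the two-variable rule set reported in the abstract.
        Remove keep = new Remove();
        keep.setAttributeIndices("first-2,last"); // assumes the two variables come first
        keep.setInvertSelection(true);            // invert: keep the listed indices
        keep.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, keep);

        JRip ripper = new JRip();                 // WEKA's RIPPER implementation
        ripper.buildClassifier(reduced);
        System.out.println(ripper);               // prints the learned IF-THEN rules
    }
}
```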

    Contributions to unsupervised classification and its validation. Application to computer security

    Given the number and nature of the transactions that can be carried out nowadays over computer networks, computer security is an increasingly important area. However, given the large amount of data involved, a manual analysis is infeasible. This work applies machine learning techniques, more specifically unsupervised classification (clustering), to two computer security problems. In the first, malicious code is grouped according to its behaviour so that it can be catalogued efficiently. In the second, network traffic is analysed in order to detect intrusions. The study of unsupervised classification techniques has led to three contributions in this area, which are also reflected in this work. The first contribution is an incremental hierarchical clustering algorithm that guarantees the stability of the updated structures. The second contribution proposes a new method for extracting partitions from a cluster hierarchy, since the traditional method is shown to have problems in certain contexts. Finally, the last contribution defines a new methodology for evaluating cluster validation indices. It is shown that the traditional methodology relies on an assumption that often does not hold, and a variation that avoids this problem is proposed.
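    For context, the traditional partition-extraction method that the second contribution argues against is typically a cut of the dendrogram at a fixed merge-distance threshold. The minimal sketch below illustrates only that baseline method, with names of our own choosing rather than anything from the thesis:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the traditional partition extraction: cut the
// dendrogram at a fixed merge-distance threshold. Illustrative only.
class Node {
    Node left, right;      // null for leaves
    double mergeDistance;  // distance at which the children were merged
    int item = -1;         // leaf payload (an example index)

    Node(int item) { this.item = item; }
    Node(Node l, Node r, double d) { left = l; right = r; mergeDistance = d; }

    // Collect the clusters obtained by cutting the hierarchy at `threshold`:
    // every maximal subtree merged below the threshold becomes one cluster.
    void cut(double threshold, List<List<Integer>> out) {
        if (left == null || mergeDistance <= threshold) {
            List<Integer> cluster = new ArrayList<>();
            collectLeaves(cluster);
            out.add(cluster);
        } else {
            left.cut(threshold, out);
            right.cut(threshold, out);
        }
    }

    private void collectLeaves(List<Integer> out) {
        if (left == null) { out.add(item); return; }
        left.collectLeaves(out);
        right.collectLeaves(out);
    }
}
```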

    The effect of the used resampling technique and number of samples in Consolidated Trees’ Construction algorithm

    In many pattern recognition problems, the explanation of the classification made becomes as important as the good performance of the classifier in terms of its discriminating capacity. For this kind of problem we can use the Consolidated Trees’ Construction (CTC) algorithm, which uses several subsamples to build a single tree. This paper presents a wide analysis of the behaviour of the CTC algorithm over 20 databases. The effect of two parameters of the algorithm has been analysed: the number of samples and the way the subsamples are built. The results obtained with Consolidated Trees have been compared to C4.5 trees by executing a 10-fold cross-validation 5 times. The comparison has been done from two points of view: error rate (accuracy) and complexity (explanation). Results show that, for subsamples of 75% of the training sample, Consolidated Trees achieve, on average, smaller error rates than C4.5 trees when they are built with 10 or more subsamples and with similar complexity, so they are better situated in the learning curve. On the other hand, the method used to build the subsamples clearly affects the quality of the results achieved with Consolidated Trees. If bootstrap samples are used to build the trees, the results obtained are worse than those obtained with subsamples of 75%, from both points of view: error and complexity.
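    A 5x10-fold cross-validation of this kind can be run with WEKA's Evaluation class; the sketch below is our illustration, using J48 for C4.5 (and, where installed, a J48Consolidated class for CTC), not the paper's experimental code:

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FiveByTenCv {
    // Average error rate over 5 repetitions of 10-fold cross-validation.
    static double avgErrorRate(Classifier c, Instances data) throws Exception {
        double sum = 0;
        for (int run = 0; run < 5; run++) {                        // 5 repetitions...
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(run)); // ...of 10-fold CV
            sum += eval.errorRate();
        }
        return sum / 5;
    }

    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("dataset.arff").getDataSet(); // placeholder
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("C4.5 (J48): " + avgErrorRate(new J48(), data));
        // new J48Consolidated() would be compared the same way, if installed
    }
}
```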
