159 research outputs found

    ur-CAIM: Improved CAIM Discretization for Unbalanced and Balanced Data

    Supervised discretization is one of the basic data preprocessing techniques used in data mining. CAIM (Class-Attribute Interdependence Maximization) is a discretization algorithm for data whose classes are known. However, new challenges, such as the presence of unbalanced data sets, call for new algorithms capable of handling them in addition to balanced data. This paper presents a new discretization algorithm named ur-CAIM, which improves on the CAIM algorithm in three important ways. First, it generates more flexible discretization schemes while producing a small number of intervals. Second, the quality of the intervals is improved based on the data class distribution, which leads to better classification performance on balanced and, especially, unbalanced data. Third, its runtime is lower than CAIM's. The algorithm is parameter-free and self-adapts to the problem complexity and the data class distribution. ur-CAIM was compared with 9 well-known discretization methods on 28 balanced and 70 unbalanced data sets. The results were contrasted through non-parametric statistical tests, which show that our proposal outperforms CAIM and many of the other methods on both types of data, but especially on unbalanced data, which is its significant advantage.
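
    For reference, the CAIM criterion that ur-CAIM builds on is commonly stated as CAIM = (1/n) * sum_r (max_r^2 / M_r), computed over a quanta matrix of class-by-interval counts. The following is a minimal Python sketch of that criterion; the function name and the quanta-matrix layout (rows = classes, columns = intervals) are illustrative choices, not code from the paper.

        import numpy as np

        def caim_criterion(quanta):
            """CAIM value for a quanta matrix (rows = classes, columns = intervals).

            CAIM = (1/n) * sum_r (max_r ** 2 / M_r), where n is the number of
            intervals, max_r is the largest class count in interval r and M_r is
            the total number of samples falling into interval r.
            """
            quanta = np.asarray(quanta, dtype=float)
            col_totals = quanta.sum(axis=0)   # M_r for each interval
            col_maxima = quanta.max(axis=0)   # max_r for each interval
            n_intervals = quanta.shape[1]
            return float(np.sum(col_maxima ** 2 / col_totals) / n_intervals)

        # Example: 3 classes, 2 candidate intervals
        print(caim_criterion([[10, 1],
                              [2, 12],
                              [1, 3]]))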

    Scalable CAIM Discretization on Multiple GPUs Using Concurrent Kernels

    CAIM (Class-Attribute Interdependence Maximization) is one of the state-of-the-art algorithms for discretizing data for which classes are known. However, it may take a long time when run on high-dimensional, large-scale data with a large number of attributes and/or instances. This paper presents a solution to this problem by introducing a GPU-based implementation of the CAIM algorithm that significantly speeds up the discretization process on big, complex data sets. The GPU-based implementation scales to multiple GPU devices and exploits the concurrent kernel execution capabilities of modern GPUs. The GPU-based CAIM model is evaluated and compared with the original CAIM using single- and multi-threaded parallel configurations on 40 data sets with different characteristics. The results show great speedup, up to 139 times faster using 4 GPUs, which makes discretization of big data efficient and manageable. For example, the discretization time of one big data set is reduced from 2 hours to less than 2 minutes.
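
    The paper's implementation is CUDA-based and is not reproduced here; the sketch below only illustrates the attribute-level parallelism idea (one independent discretization task per attribute, analogous to concurrent kernels) using Python multiprocessing, with a simple equal-frequency binner standing in for the per-attribute CAIM kernel. All names and the binning strategy are assumptions made for illustration.

        from concurrent.futures import ProcessPoolExecutor
        import numpy as np

        def discretize_attribute(args):
            """Placeholder per-attribute discretizer (equal-frequency bins);
            in the paper each attribute is handled by a CAIM GPU kernel."""
            values, n_bins = args
            edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
            return np.digitize(values, edges)

        def discretize_all(X, n_bins=4, workers=4):
            # One task per attribute, executed concurrently -- the same idea as
            # launching independent kernels for different attributes on the GPU(s).
            tasks = [(X[:, j], n_bins) for j in range(X.shape[1])]
            with ProcessPoolExecutor(max_workers=workers) as pool:
                columns = list(pool.map(discretize_attribute, tasks))
            return np.column_stack(columns)

        if __name__ == "__main__":
            X = np.random.rand(1000, 8)
            print(discretize_all(X)[:3])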

    ANALISA CLASS-ATTRIBUTE INTERDEPENDENCE MAXIMIZATION (CAIM) UNTUK DISKRETISASI PADA SUPERVISED LEARNING Analysis of Class-Attribute Interdependence Maximization (CAIM) for Supervised Learning Discretization

    ABSTRACT: The task of extracting knowledge from databases is often performed by machine learning algorithms. Most of these algorithms can be applied only to data described by discrete numerical or nominal attributes (features); continuous attributes must first be discretized. Discretization is the process of transforming a continuous attribute's values into a finite number of intervals and associating a discrete, numerical value with each interval. For mixed-mode (continuous and discrete) data, discretization is usually performed prior to the learning process as a preprocessing step. CAIM (Class-Attribute Interdependence Maximization) is a discretization algorithm designed for supervised learning. It maximizes the class-attribute interdependence while generating a possibly minimal number of discrete intervals, and it does not require the user to predefine the number of intervals, which is considered its main advantage over other supervised discretization algorithms. This final project implements the CAIM discretization method for supervised learning on several datasets. The C5.0 algorithm is then used to generate classification rules from the data discretized by CAIM. Tests performed with CAIM and six other state-of-the-art discretization algorithms show that the accuracy of the generated rules is, on average, higher and the number of rules lower for data discretized by CAIM than for data discretized by the six other algorithms. Keywords: CAIM, Class-Attribute Interdependence Maximization, discretization
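
    As a rough illustration of the discretize-then-learn workflow described above (assuming scikit-learn is available): a generic quantile discretizer stands in for CAIM, and a CART decision tree stands in for C5.0, which has no scikit-learn implementation. This is only a sketch of the pipeline, not the project's code.

        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import KBinsDiscretizer
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # Discretize first (pre-processing), then learn rules from the discrete data.
        # KBinsDiscretizer is a generic stand-in for CAIM; DecisionTreeClassifier (CART)
        # stands in for C5.0.
        disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
        X_tr_d = disc.fit_transform(X_tr)
        X_te_d = disc.transform(X_te)

        tree = DecisionTreeClassifier(random_state=0).fit(X_tr_d, y_tr)
        print("accuracy:", tree.score(X_te_d, y_te), "leaves:", tree.get_n_leaves())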

    Using entropy-based local weighting to improve similarity assessment

    This paper enhances and analyses the power of local weighted similarity measures. It proposes a new entropy-based local weighting algorithm to be used in similarity assessment to improve the performance of the CBR retrieval task. A comparative analysis has been carried out of the performance of unweighted similarity measures, global weighted similarity measures, and local weighted similarity measures. The testing was done using several similarity measures and data sets from the UCI Machine Learning Repository as well as other environmental databases.
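
    The abstract does not spell out the weighting scheme, so the following is only a generic illustration of entropy-based local weighting, not the paper's algorithm: each attribute is binned, the class entropy inside a bin yields a local weight (purer bins weigh more), and the weight applied to an attribute depends on the bin the query value falls into. The names, the binning, and the "1 minus normalized entropy" weight are all assumptions.

        import numpy as np

        def local_weights(values, labels, edges):
            """Per-bin weights: 1 - normalized class entropy inside each bin.
            labels are integer class codes 0..n_classes-1 (at least 2 classes).
            Bins where one class dominates (low entropy) get weights near 1."""
            bins = np.digitize(values, edges)
            n_classes = len(np.unique(labels))
            weights = np.zeros(len(edges) + 1)
            for b in range(len(edges) + 1):
                cls = labels[bins == b]
                if cls.size == 0:
                    continue
                p = np.bincount(cls, minlength=n_classes) / cls.size
                p = p[p > 0]
                entropy = -(p * np.log2(p)).sum()
                weights[b] = 1.0 - entropy / np.log2(n_classes)
            return weights

        def weighted_distance(query, case, edges_per_attr, weights_per_attr):
            # Each attribute contributes according to the weight of the local
            # region (bin) in which the query value falls.
            dist = 0.0
            for j, (q, c) in enumerate(zip(query, case)):
                b = int(np.digitize(q, edges_per_attr[j]))
                dist += weights_per_attr[j][b] * (q - c) ** 2
            return np.sqrt(dist)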

    Global Entropy Based Greedy Algorithm for discretization

    Discretization is a crucial step, not only to achieve summarization of continuous attributes but also to obtain better performance from classifiers that require discrete values as input. In this thesis, I propose a supervised discretization method, the Global Entropy Based Greedy algorithm, which is based on information entropy minimization. Experimental results show that the proposed method outperforms state-of-the-art methods on well-known benchmark datasets. To further improve the proposed method, a new approach to the stopping criterion, based on the rate of change of entropy, was also explored. The experimental analysis indicates that a threshold based on the decreasing rate of entropy can be more effective for classification (e.g., with C5.0) than a fixed number of intervals.
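
    A minimal sketch, in the spirit of the approach described above, of greedy entropy-minimization discretization with a rate-of-change stopping rule; the candidate-cut enumeration, the relative-drop threshold, and all names are illustrative assumptions rather than the thesis code.

        import numpy as np

        def class_entropy(labels):
            # labels are integer class codes
            if labels.size == 0:
                return 0.0
            p = np.bincount(labels) / labels.size
            p = p[p > 0]
            return float(-(p * np.log2(p)).sum())

        def weighted_entropy(values, labels, cuts):
            """Average class entropy over the intervals induced by the cut points."""
            bins = np.digitize(values, sorted(cuts)) if cuts else np.zeros(values.size, int)
            total = 0.0
            for b in np.unique(bins):
                mask = bins == b
                total += mask.mean() * class_entropy(labels[mask])
            return total

        def greedy_discretize(values, labels, min_rel_drop=0.01):
            """Greedily add the cut giving the largest entropy drop; stop when the
            relative drop falls below min_rel_drop (a rate-of-change criterion)."""
            candidates = np.unique(values)[:-1]   # candidate cut points
            cuts, current = [], weighted_entropy(values, labels, [])
            while True:
                best_cut, best_e = None, current
                for c in candidates:
                    if c in cuts:
                        continue
                    e = weighted_entropy(values, labels, cuts + [c])
                    if e < best_e:
                        best_cut, best_e = c, e
                if best_cut is None or (current - best_e) < min_rel_drop * max(current, 1e-12):
                    return sorted(cuts)
                cuts.append(best_cut)
                current = best_e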

    Ameva: An autonomous discretization algorithm

    This paper describes a new discretization algorithm, called Ameva, which is designed to work with supervised learning algorithms. Ameva maximizes a contingency coefficient based on the chi-square statistic and generates a potentially minimal number of discrete intervals. Its most important advantage, in contrast with several existing discretization algorithms, is that it does not require the user to indicate the number of intervals. We have compared Ameva with one of the most relevant discretization algorithms, CAIM. Tests comparing these two algorithms show that the discrete attributes generated by Ameva always have the lowest number of intervals, and even when the number of classes is high, the same computational complexity is maintained. A comparison between the Ameva and genetic algorithm approaches has also been carried out; the differences between these iterative and combinatorial approaches are very small, except with respect to execution time. Ministerio de Educación y Ciencia TSI2006-13390-C02-02; Junta de Andalucía P06-TIC-0214
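
    The Ameva coefficient is commonly given as chi2 / (k * (l - 1)) for a quanta matrix with l classes and k intervals; below is a small Python sketch of that formula. The variable names and contingency-table layout are illustrative, and empty intervals or classes are assumed away.

        import numpy as np

        def ameva_coefficient(quanta):
            """Ameva value for a quanta matrix (rows = classes, columns = intervals):
            chi2 / (k * (l - 1)) with k intervals and l classes."""
            q = np.asarray(quanta, dtype=float)
            l, k = q.shape
            n = q.sum()
            expected = np.outer(q.sum(axis=1), q.sum(axis=0)) / n
            chi2 = ((q - expected) ** 2 / expected).sum()
            return chi2 / (k * (l - 1))

        # Example: 2 classes, 3 candidate intervals
        print(ameva_coefficient([[20, 3, 1],
                                 [2, 15, 9]]))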

    MULTIVALUED SUBSETS UNDER INFORMATION THEORY

    In the fields of finance, engineering, and the sciences, data mining and machine learning hold an eminent position in predictive analysis. Complex algorithms and adaptive decision models have contributed towards streamlining directed research as well as improving forecasting accuracy. Researchers in mathematics and computer science have made significant contributions to the development of this field. Classification-based modeling, which holds a significant position among rule-based algorithms, is one of the most widely used decision-making tools, and the decision tree occupies a place of profound significance within it. A number of heuristics have been developed over the years to prune the decision-making process. Some key benchmarks in the evolution of the decision tree can be attributed to researchers such as Quinlan (ID3 and C4.5) and Fayyad (GID3/3*, continuous value discretization). The most common heuristic applied in these trees is the entropy discussed under information theory by Shannon. The current application of entropy, covered under the term 'information gain', is directed towards individual assessment of attribute-value sets. The proposed study examines the effects of combining attribute-value sets, aimed at improving the information gain. Two key applications have been tested and presented with statistical conclusions: the first is directed towards the feature selection process, a key step in data mining, while the second is targeted towards the discretization of data. A search-based heuristic is applied towards identifying the subsets sharing a better gain value than the ones presented in the GID approach.
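
    As a small illustration of gain computed over combined attribute-value subsets (not the study's actual heuristic): information gain is evaluated for a grouping of an attribute's values, and a toy search tries every pairwise merge to see whether any grouping beats the one-value-per-branch split. Labels are assumed to be integer class codes.

        import numpy as np
        from itertools import combinations

        def entropy(labels):
            p = np.bincount(labels) / len(labels)
            p = p[p > 0]
            return float(-(p * np.log2(p)).sum())

        def gain_for_partition(values, labels, groups):
            """Information gain when the attribute values are grouped into the
            given subsets (each subset becomes one branch of the split)."""
            base = entropy(labels)
            remainder = 0.0
            for group in groups:
                mask = np.isin(values, list(group))
                if mask.any():
                    remainder += mask.mean() * entropy(labels[mask])
            return base - remainder

        def best_pairwise_merge(values, labels):
            """Search all merges of two attribute values and return the grouping
            with the highest gain (a tiny stand-in for a fuller subset search)."""
            vals = list(np.unique(values))
            best = (gain_for_partition(values, labels, [[v] for v in vals]), None)
            for a, b in combinations(vals, 2):
                groups = [[a, b]] + [[v] for v in vals if v not in (a, b)]
                g = gain_for_partition(values, labels, groups)
                if g > best[0]:
                    best = (g, (a, b))
            return best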

    A fast supervised density-based discretization algorithm for classification tasks in the medical domain

    Discretization is a preprocessing technique used for converting continuous features into categorical ones. This step is essential for algorithms that cannot handle continuous data as input. In addition, in the big data era, it is important for a discretizer to be able to discretize data efficiently. In this paper, a new supervised density-based discretization (DBAD) algorithm is proposed which satisfies these requirements. For the evaluation of the algorithm, 11 datasets covering a wide range of data in the medical domain were used. The proposed algorithm was tested against three state-of-the-art discretizers using three classifiers with different characteristics. A parallel version of the algorithm was evaluated using two synthetic big datasets. In the majority of the tests performed, the algorithm performed statistically similarly to or better than the three discretization algorithms it was compared to. Additionally, the algorithm was faster than the other discretizers in all of the tests. Finally, the parallel version of DBAD shows almost linear speedup for a Message Passing Interface (MPI) implementation (9.64× for 10 nodes), while a hybrid MPI/OpenMP implementation improves execution time by 35.3× for 10 nodes and 6 threads per node.
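
    The abstract does not describe DBAD's internals, so the sketch below shows only a generic density-based idea (class-conditional histograms with cuts placed wherever the dominant class changes); it is explicitly not the DBAD algorithm, and the grid size and smoothing window are arbitrary illustrative choices.

        import numpy as np

        def density_cuts(values, labels, grid_size=100):
            """Generic density-based cut points: estimate a smoothed class-conditional
            histogram on a common grid and place a cut wherever the dominant class
            changes. A generic illustration only, not the DBAD algorithm."""
            classes = np.unique(labels)
            densities = []
            for c in classes:
                hist, edges = np.histogram(values[labels == c], bins=grid_size,
                                           range=(values.min(), values.max()),
                                           density=True)
                densities.append(np.convolve(hist, np.ones(5) / 5, mode="same"))
            dominant = np.argmax(np.vstack(densities), axis=0)
            centers = (edges[:-1] + edges[1:]) / 2
            return [float(centers[i]) for i in range(1, grid_size)
                    if dominant[i] != dominant[i - 1]]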

    A Max-relevance-min-divergence Criterion for Data Discretization with Applications on Naive Bayes

    In many classification models, data is discretized to better estimate its distribution. Existing discretization methods often aim to maximize the discriminant power of the discretized data, while overlooking the fact that the primary target of data discretization in classification is to improve generalization performance. As a result, the data tend to be over-split into many small bins, since the data without discretization retain the maximal discriminant information. We therefore propose a Max-Dependency-Min-Divergence (MDmD) criterion that maximizes both the discriminant information and the generalization ability of the discretized data. More specifically, the Max-Dependency criterion maximizes the statistical dependency between the discretized data and the classification variable, while the Min-Divergence criterion explicitly minimizes the JS-divergence between the training data and the validation data for a given discretization scheme. The proposed MDmD criterion is technically appealing, but it is difficult to reliably estimate the high-order joint distributions of attributes and the classification variable. We hence further propose a more practical solution, the Max-Relevance-Min-Divergence (MRmD) discretization scheme, in which each attribute is discretized separately by simultaneously maximizing the discriminant information and the generalization ability of the discretized data. The proposed MRmD is compared with state-of-the-art discretization algorithms under the naive Bayes classification framework on 45 machine-learning benchmark datasets, and it significantly outperforms all the compared methods on most of them. Comment: Under major revision at Pattern Recognition
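
    A rough sketch of the two ingredients named above, for a single attribute: relevance as the mutual information between the discretized attribute and the class, and divergence as the Jensen-Shannon divergence between the training and validation bin distributions. How the paper combines and optimizes these terms is not reproduced here; the simple "relevance minus trade_off times divergence" score, the smoothing constant, and all names are assumptions.

        import numpy as np

        def mutual_information(x_bins, y):
            """I(X; Y) for two non-negative integer code arrays, in bits."""
            joint = np.zeros((x_bins.max() + 1, y.max() + 1))
            for xb, yb in zip(x_bins, y):
                joint[xb, yb] += 1
            joint /= joint.sum()
            px, py = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
            nz = joint > 0
            return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

        def js_divergence(p, q):
            """Jensen-Shannon divergence between two discrete distributions, in bits."""
            p, q = np.asarray(p, float), np.asarray(q, float)
            p, q = p / p.sum(), q / q.sum()
            m = (p + q) / 2
            kl = lambda a, b: float((a[a > 0] * np.log2(a[a > 0] / b[a > 0])).sum())
            return 0.5 * kl(p, m) + 0.5 * kl(q, m)

        def mrmd_style_score(train_x, train_y, valid_x, cuts, trade_off=1.0):
            """Relevance (MI with the class on training data) minus divergence
            between training and validation bin distributions, for one attribute."""
            tb, vb = np.digitize(train_x, cuts), np.digitize(valid_x, cuts)
            n_bins = len(cuts) + 1
            p = np.bincount(tb, minlength=n_bins) + 1e-12   # smoothed bin frequencies
            q = np.bincount(vb, minlength=n_bins) + 1e-12
            return mutual_information(tb, train_y) - trade_off * js_divergence(p, q)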