ur-CAIM: Improved CAIM Discretization for Unbalanced and Balanced Data
Supervised discretization is one of the basic data preprocessing techniques used in data mining. CAIM (Class-Attribute Interdependence Maximization) is a discretization algorithm for data whose classes are known. However, newly arising challenges, such as the presence of unbalanced data sets, call for new algorithms capable of handling them in addition to balanced data. This paper presents a new discretization algorithm named ur-CAIM, which improves on the CAIM algorithm in three important ways. First, it generates more flexible discretization schemes while producing a small number of intervals. Second, the quality of the intervals is improved based on the class distribution of the data, which leads to better classification performance on balanced and, especially, unbalanced data. Third, the runtime of the algorithm is lower than CAIM's. The algorithm is parameter-free and self-adapts to the problem complexity and the data class distribution. ur-CAIM was compared with 9 well-known discretization methods on 28 balanced and 70 unbalanced data sets. The results were contrasted through non-parametric statistical tests, which show that our proposal outperforms CAIM and many of the other methods on both types of data, but especially on unbalanced data, which is its most significant advantage.
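Several of the entries below build on the CAIM criterion, which scores a candidate scheme by how strongly each interval is dominated by a single class. As a rough sketch (a minimal NumPy illustration of the published formula; the function name and layout are ours, not any of the authors' code), the CAIM value of a candidate scheme can be computed as:

```python
import numpy as np

def caim_value(labels, interval_ids):
    """CAIM criterion: (1/n) * sum over intervals r of max_r**2 / M_r,
    where n is the number of intervals, max_r is the largest class count
    in interval r, and M_r is the total count in interval r."""
    labels = np.asarray(labels)
    interval_ids = np.asarray(interval_ids)
    intervals = np.unique(interval_ids)
    total = 0.0
    for r in intervals:
        counts = np.bincount(labels[interval_ids == r])  # class counts in interval r
        total += counts.max() ** 2 / counts.sum()
    return total / len(intervals)
```

Higher values indicate intervals that are purer with respect to the class variable; CAIM greedily adds the candidate boundary that maximizes this score.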
Scalable CAIM Discretization on Multiple GPUs Using Concurrent Kernels
CAIM (Class-Attribute Interdependence Maximization) is one of the state-of-the-art algorithms for discretizing data for which classes are known. However, it may take a long time when run on high-dimensional, large-scale data with a large number of attributes and/or instances. This paper presents a solution to this problem by introducing a GPU-based implementation of the CAIM algorithm that significantly speeds up the discretization process on big, complex data sets. The GPU-based implementation scales to multiple GPU devices and exploits the concurrent kernel execution capabilities of modern GPUs. The GPU-based CAIM model is evaluated and compared with the original CAIM using single- and multi-threaded parallel configurations on 40 data sets with different characteristics. The results show great speedup, up to 139 times faster using 4 GPUs, which makes discretization of big data efficient and manageable. For example, the discretization time of one big data set is reduced from 2 hours to less than 2 minutes.
Analysis of Class-Attribute Interdependence Maximization (CAIM) for Supervised Learning Discretization
The task of extracting knowledge from databases is quite often performed by machine learning algorithms. The majority of these algorithms can be applied only to data described by discrete numerical or nominal attributes (features). Discretization is the process of transforming a continuous attribute's values into a finite number of intervals and associating a discrete numerical value with each interval. For mixed-mode (continuous and discrete) data, discretization is usually performed prior to the learning process, as a pre-processing step. CAIM (Class-Attribute Interdependence Maximization) is a discretization algorithm designed for supervised learning. It maximizes the class-attribute interdependence while generating a possibly minimal number of discrete intervals, and it does not require the user to predefine the number of intervals, which is considered CAIM's main advantage over other supervised discretization algorithms. This final project applies the CAIM discretization method to several datasets. The C5.0 algorithm is then used to generate classification rules from the data discretized by CAIM. The resulting accuracy and number of rules are compared with those obtained from six other state-of-the-art discretization algorithms. The tests show that, on average, the rules generated from CAIM-discretized data are more accurate and fewer in number than those generated from data discretized by the six other methods. Keywords: CAIM, class-attribute interdependence maximization, discretization.
Using entropy-based local weighting to improve similarity assessment
This paper enhances and analyses the power of local weighted similarity measures. It proposes a new entropy-based local weighting algorithm to be used in similarity assessment to improve the performance of the CBR retrieval task. A comparative analysis has been carried out of the performance of unweighted similarity measures, global weighted similarity measures, and local weighting similarity measures. The testing was done using several similarity measures and data sets from the UCI Machine Learning Repository and other environmental databases.
Global Entropy Based Greedy Algorithm for discretization
Discretization is a crucial step, not only to summarize continuous attributes but also to improve performance in classification tasks that require discrete values as input. In this thesis, I propose a supervised discretization method, the Global Entropy Based Greedy algorithm, which is based on Information Entropy Minimization. Experimental results show that the proposed method outperforms state-of-the-art methods on well-known benchmark datasets. To further improve the proposed method, a new stopping criterion based on the rate of change of entropy was also explored. The experimental analysis indicates that a threshold based on the decreasing rate of entropy can be more effective than a constant number of intervals for classifiers such as C5.0.
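The Information Entropy Minimization idea underlying this thesis can be illustrated with a single greedy step: choose the cut point that minimizes the weighted class entropy of the two resulting bins. A minimal NumPy sketch (function names are ours, not the thesis code):

```python
import numpy as np

def class_entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut(values, labels):
    """One greedy step of entropy-minimization discretization:
    return (weighted entropy, cut point) for the cut that minimizes
    the size-weighted class entropy of the two resulting bins."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    n = len(values)
    best = (np.inf, None)
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # only cut between distinct attribute values
        e = (i / n) * class_entropy(labels[:i]) \
            + ((n - i) / n) * class_entropy(labels[i:])
        if e < best[0]:
            best = (e, (values[i - 1] + values[i]) / 2.0)
    return best
```

A full discretizer would apply this step recursively or repeatedly until a stopping criterion (such as the entropy-change-rate threshold the thesis explores) is met.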
Ameva: An autonomous discretization algorithm
This paper describes a new discretization algorithm, called Ameva, which is designed to work with supervised learning algorithms. Ameva maximizes a contingency coefficient based on the Chi-square statistic and generates a potentially minimal number of discrete intervals. Its most important advantage, in contrast with several existing discretization algorithms, is that it does not need the user to indicate the number of intervals. We have compared Ameva with one of the most relevant discretization algorithms, CAIM. Tests comparing these two algorithms show that the discrete attributes generated by the Ameva algorithm always have the lowest number of intervals, and even when the number of classes is high, the same computational complexity is maintained. A comparison between the Ameva and genetic-algorithm approaches has also been carried out; the differences between these iterative and combinatorial approaches are very small, except when considering the execution time. Ministerio de Educación y Ciencia TSI2006-13390-C02-02; Junta de Andalucía P06-TIC-0214.
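The contingency coefficient Ameva maximizes can be sketched as follows: the Chi-square statistic of the class-interval quanta matrix, normalized by the number of intervals k times the number of classes l minus one. This is our reading of the published criterion, expressed as a minimal NumPy illustration rather than the authors' implementation:

```python
import numpy as np

def ameva_value(quanta):
    """Ameva criterion for a quanta matrix (rows: classes, cols: intervals):
    chi2(quanta) / (k * (l - 1)), with l classes and k intervals."""
    quanta = np.asarray(quanta, dtype=float)
    n = quanta.sum()
    # expected counts under class/interval independence
    expected = np.outer(quanta.sum(axis=1), quanta.sum(axis=0)) / n
    chi2 = ((quanta - expected) ** 2 / expected).sum()
    l, k = quanta.shape
    return chi2 / (k * (l - 1))
```

Like CAIM, Ameva greedily adds the boundary that maximizes this score, so no target number of intervals has to be supplied by the user.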
MULTIVALUED SUBSETS UNDER INFORMATION THEORY
In the fields of finance, engineering, and the varied sciences, Data Mining/Machine Learning has held an eminent position in predictive analysis. Complex algorithms and adaptive decision models have contributed towards streamlining directed research as well as improving forecasting accuracy. Researchers in mathematics and computer science have made significant contributions to the development of this field. Classification-based modeling, which holds a significant position amongst the different rule-based algorithms, is one of the most widely used decision-making tools. The decision tree has a place of profound significance in classification-based modeling. A number of heuristics have been developed over the years to prune the decision-making process. Some key benchmarks in the evolution of the decision tree can be attributed to researchers like Quinlan (ID3 and C4.5) and Fayyad (GID3/3*, continuous value discretization). The most common heuristic applied in these trees is the entropy measure discussed under information theory by Shannon. The current application of entropy, covered under the term 'Information Gain', is directed towards individual assessment of the attribute-value sets. The proposed study looks at the effects of combining the attribute-value sets, aimed at improving the information gain. A couple of key applications have been tested and presented with statistical conclusions. The first is the application to the feature selection process, a key step in data mining, while the second is targeted at the discretization of data. A search-based heuristic tool is applied to identify the subsets sharing a better gain value than the ones presented in the GID approach.
A fast supervised density-based discretization algorithm for classification tasks in the medical domain
Discretization is a preprocessing technique for converting continuous features into categorical ones. This step is essential for algorithms that cannot handle continuous data as input. In addition, in the big data era, it is important for a discretizer to be able to discretize data efficiently. In this paper, a new supervised density-based discretization (DBAD) algorithm is proposed, which satisfies these requirements. For the evaluation of the algorithm, 11 datasets that cover a wide range of datasets in the medical domain were used. The proposed algorithm was tested against three state-of-the-art discretizers using three classifiers with different characteristics. A parallel version of the algorithm was evaluated using two synthetic big datasets. In the majority of the performed tests, the algorithm performed statistically similar to or better than the three discretization algorithms it was compared with. Additionally, the algorithm was faster than the other discretizers in all of the performed tests. Finally, the parallel version of DBAD shows almost linear speedup for a Message Passing Interface (MPI) implementation (9.64× for 10 nodes), while a hybrid MPI/OpenMP implementation improves execution time by 35.3× for 10 nodes and 6 threads per node.
A Max-relevance-min-divergence Criterion for Data Discretization with Applications on Naive Bayes
In many classification models, data is discretized to better estimate its distribution. Existing discretization methods often aim to maximize the discriminant power of the discretized data, while overlooking the fact that the primary goal of discretization in classification is to improve generalization performance. As a result, the data tend to be over-split into many small bins, since undiscretized data retain the maximal discriminant information. Thus, we propose a Max-Dependency-Min-Divergence (MDmD) criterion that maximizes both the discriminant information and the generalization ability of the discretized data. More specifically, the Max-Dependency criterion maximizes the statistical dependency between the discretized data and the classification variable, while the Min-Divergence criterion explicitly minimizes the JS-divergence between the training data and the validation data for a given discretization scheme. The proposed MDmD criterion is technically appealing, but it is difficult to reliably estimate the high-order joint distributions of attributes and the classification variable. We hence further propose a more practical solution, the Max-Relevance-Min-Divergence (MRmD) discretization scheme, where each attribute is discretized separately by simultaneously maximizing the discriminant information and the generalization ability of the discretized data. The proposed MRmD is compared with state-of-the-art discretization algorithms under the naive Bayes classification framework on 45 machine-learning benchmark datasets, and it significantly outperforms all the compared methods on most of them. Comment: under major revision at Pattern Recognition.
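The Min-Divergence term in this abstract relies on the Jensen-Shannon divergence between the training and validation distributions induced by a discretization scheme. A minimal sketch of that quantity for two bin-frequency histograms (our own implementation, not the paper's code):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits) between two discrete
    distributions, e.g. bin-frequency histograms of training vs.
    validation data under one discretization scheme."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()  # normalize to probabilities
    m = 0.5 * (p + q)                # mixture distribution

    def kl(a, b):
        mask = a > 0                 # 0 * log(0) treated as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The JS-divergence is symmetric and bounded in [0, 1] bits, which makes it a convenient penalty: a scheme whose bins look the same on training and validation data scores near 0, while over-split bins that do not generalize score higher.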