A Decision tree-based attribute weighting filter for naive Bayes
The naive Bayes classifier continues to be a popular learning algorithm for data mining applications due to its simplicity and linear run-time. Many enhancements to the basic algorithm have been proposed to help mitigate its primary weakness: the assumption that attributes are independent given the class. All of them improve the performance of naive Bayes at the expense (to a greater or lesser degree) of execution time and/or simplicity of the final model. In this paper we present a simple filter method for setting attribute weights for use with naive Bayes. Experimental results show that naive Bayes with attribute weights rarely degrades the quality of the model compared to standard naive Bayes and, in many cases, improves it dramatically. The main advantages of this method compared to other approaches for improving naive Bayes are its run-time complexity and the fact that it maintains the simplicity of the final model.
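As a sketch of the general idea (not the paper's exact filter), attribute weighting enters naive Bayes by raising each conditional probability P(x_i|c) to a per-attribute power w_i, so that w_i = 1 for every attribute recovers the standard model. The toy weather data and the weight values below are invented for illustration; how the weights are actually derived (the paper uses a decision-tree-based filter) is not reproduced here.

```python
import math
from collections import Counter, defaultdict

def train(X, y):
    """Count class priors and per-attribute conditional value counts."""
    classes = Counter(y)
    cond = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return classes, cond

def predict(classes, cond, row, weights):
    """Weighted naive Bayes: score(c) = log P(c) + sum_i w_i * log P(x_i|c)."""
    n = sum(classes.values())
    best, best_score = None, float("-inf")
    for c, cc in classes.items():
        score = math.log(cc / n)
        for i, v in enumerate(row):
            counts = cond[(i, c)]
            # Add-one smoothing over the values seen for this attribute/class.
            p = (counts[v] + 1) / (cc + len(counts) + 1)
            score += weights[i] * math.log(p)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data: attributes are (outlook, temperature).
X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
y = ["no", "no", "yes", "yes"]
classes, cond = train(X, y)
# Down-weighting the second attribute halves its influence on the decision.
print(predict(classes, cond, ("rain", "mild"), weights=[1.0, 0.5]))  # -> yes
```

Because the weights only scale log-likelihood terms, the filter adds nothing to prediction cost, which is the linear run-time advantage the abstract emphasizes.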
Boosting the Discriminant Power of Naive Bayes
Naive Bayes has been widely used in many applications because of its
simplicity and ability in handling both numerical data and categorical data.
However, lack of modeling of correlations between features limits its
performance. In addition, noise and outliers in the real-world dataset also
greatly degrade the classification performance. In this paper, we propose a
feature augmentation method employing a stack auto-encoder to reduce the noise
in the data and boost the discriminant power of naive Bayes. The proposed stack
auto-encoder consists of two auto-encoders for different purposes. The first
encoder shrinks the initial features to derive a compact feature representation
in order to remove the noise and redundant information. The second encoder
boosts the discriminant power of the features by expanding them into a
higher-dimensional space so that different classes of samples could be better
separated in the higher-dimensional space. By integrating the proposed feature
augmentation method with the regularized naive Bayes, the discrimination power
of the model is greatly enhanced. The proposed method is evaluated on a set of
machine-learning benchmark datasets. The experimental results show that the
proposed method significantly and consistently outperforms the state-of-the-art
naive Bayes classifiers.
Comment: Accepted by the 2022 International Conference on Pattern Recognition.
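At the level of array shapes, the shrink-then-expand structure described above can be sketched as follows. The weights here are untrained random matrices and all dimensions are invented, purely to illustrate the two-stage pipeline, not the paper's trained stack auto-encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
d_raw, d_compact, d_expanded = 20, 8, 64  # invented dimensions

W1 = rng.standard_normal((d_raw, d_compact))       # encoder 1: compress/denoise
W2 = rng.standard_normal((d_compact, d_expanded))  # encoder 2: expand

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal(d_raw)      # one raw sample
compact = sigmoid(x @ W1)           # compact, noise-reduced code
augmented = sigmoid(compact @ W2)   # higher-dimensional features for naive Bayes
print(compact.shape, augmented.shape)  # -> (8,) (64,)
```

The expanded `augmented` vector is what would be handed to the regularized naive Bayes classifier in place of the raw features.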
A Max-relevance-min-divergence Criterion for Data Discretization with Applications on Naive Bayes
In many classification models, data is discretized to better estimate its
distribution. Existing discretization methods often target at maximizing the
discriminant power of discretized data, while overlooking the fact that the
primary target of data discretization in classification is to improve the
generalization performance. As a result, the data tend to be over-split into
many small bins since the data without discretization retain the maximal
discriminant information. Thus, we propose a Max-Dependency-Min-Divergence
(MDmD) criterion that maximizes both the discriminant information and
generalization ability of the discretized data. More specifically, the
Max-Dependency criterion maximizes the statistical dependency between the
discretized data and the classification variable while the Min-Divergence
criterion explicitly minimizes the JS-divergence between the training data and
the validation data for a given discretization scheme. The proposed MDmD
criterion is technically appealing, but it is difficult to reliably estimate
the high-order joint distributions of attributes and the classification
variable. We hence further propose a more practical solution,
Max-Relevance-Min-Divergence (MRmD) discretization scheme, where each attribute
is discretized separately, by simultaneously maximizing the discriminant
information and the generalization ability of the discretized data. The
proposed MRmD is compared with the state-of-the-art discretization algorithms
under the naive Bayes classification framework on 45 machine-learning benchmark
datasets. It significantly outperforms all the compared methods on most of the
datasets.
Comment: Under major revision at Pattern Recognition.
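A minimal single-attribute illustration of the MRmD trade-off might look like the following: among candidate equal-width binnings, pick the one maximizing mutual information with the class minus the JS divergence between training and validation bin histograms. The trade-off weight `lam`, the equal-width candidates, and the synthetic Gaussian data are assumptions of this sketch; the paper's actual discretization scheme differs in detail.

```python
import numpy as np

def mutual_info(a, b):
    """Mutual information (in nats) between two discrete arrays."""
    mi = 0.0
    for va in set(a.tolist()):
        for vb in set(b.tolist()):
            pab = np.mean((a == va) & (b == vb))
            pa, pb = np.mean(a == va), np.mean(b == vb)
            if pab > 0:
                mi += pab * np.log(pab / (pa * pb))
    return mi

def js_divergence(p, q):
    """Jensen-Shannon divergence between two histograms."""
    m = 0.5 * (p + q)
    def kl(u, v):
        mask = u > 0
        return float(np.sum(u[mask] * np.log(u[mask] / v[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def bin_hist(x, edges):
    idx = np.digitize(x, edges)
    return np.bincount(idx, minlength=len(edges) + 1) / len(x), idx

# Synthetic two-class data: class 1 is shifted, so few bins already separate it.
rng = np.random.default_rng(1)
x_tr = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
y_tr = np.array([0] * 100 + [1] * 100)
x_va = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])

lam = 1.0  # relevance/divergence trade-off (an assumption of this sketch)
best_k, best_score = None, -np.inf
for k in range(2, 11):  # candidate numbers of equal-width bins
    edges = np.linspace(x_tr.min(), x_tr.max(), k + 1)[1:-1]
    p_tr, idx_tr = bin_hist(x_tr, edges)
    p_va, _ = bin_hist(x_va, edges)
    score = mutual_info(idx_tr, y_tr) - lam * js_divergence(p_tr, p_va)
    if score > best_score:
        best_k, best_score = k, score
print(best_k)
```

The divergence term is what penalizes over-splitting: many small bins raise the MI term but make the train and validation histograms disagree.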
SODE: Self-Adaptive One-Dependence Estimators for classification
© 2015 Elsevier Ltd. SuperParent-One-Dependence Estimators (SPODEs) represent a family of semi-naive Bayesian classifiers which relax the attribute independence assumption of Naive Bayes (NB) to allow each attribute to depend on a common single attribute (superparent). SPODEs can effectively handle data with attribute dependency but still inherit NB's key advantages such as computational efficiency and robustness for high dimensional data. In reality, determining an optimal superparent for SPODEs is difficult. One common approach is to use weighted combinations of multiple SPODEs, each having a different superparent with a properly assigned weight value (i.e., a weight value is assigned to each attribute). In this paper, we propose a self-adaptive SPODE, named SODE, which uses immunity theory in artificial immune systems to automatically and self-adaptively select the weight for each single SPODE. SODE does not need to know the importance of individual SPODEs nor the relevance among SPODEs, and can flexibly and efficiently search optimal weight values for each SPODE during the learning process. Extensive experiments and comparisons on 56 benchmark data sets, and validations on image and text classification, demonstrate that SODE outperforms state-of-the-art weighted SPODE algorithms and is suitable for a wide range of learning tasks. Results also confirm that SODE provides an appropriate balance between runtime efficiency and accuracy.
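The combination rule common to weighted-SPODE ensembles (SODE included) is a weighted mixture of per-SPODE class posteriors: predict argmax_c sum_k w_k P_k(c|x). The posterior and weight values below are placeholders, and SODE's immune-inspired weight search is not reproduced.

```python
import numpy as np

posteriors = np.array([   # rows: SPODEs (one per superparent), cols: classes
    [0.7, 0.3],
    [0.4, 0.6],
    [0.2, 0.8],
])
w = np.array([0.5, 0.3, 0.2])  # one weight per SPODE (placeholder values)
combined = w @ posteriors      # weighted mixture of the SPODE posteriors
print(combined.argmax())       # -> 0
```

Note that even though two of the three SPODEs favour class 1, the heavily weighted first SPODE tips the mixture toward class 0; choosing the weights well is exactly the problem SODE addresses.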
Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection
Software quality ensures that applications that are developed are failure free. Some modern systems are intricate, due to the complexity of their information processes. Software fault prediction is an important quality assurance activity, since it is a mechanism that correctly predicts the defect proneness of modules and classifies modules, which saves resources, time and developers' efforts. In this study, a model that selects relevant features that can be used in defect prediction was proposed. The literature was reviewed and it revealed that process metrics are better predictors of defects in version systems and are based on historic source code over time. These metrics are extracted from the source-code module and include, for example, the number of additions and deletions from the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted using open source software (OSS) of software product lines (SPL), hence process metrics were chosen. Data sets that are used in defect prediction may contain non-significant and redundant attributes that may affect the accuracy of machine-learning algorithms. In order to improve the prediction accuracy of classification models, features that are significant in the defect prediction process are utilised. In machine learning, feature selection techniques are applied in the identification of the relevant data. Feature selection is a pre-processing step that helps to reduce the dimensionality of data in machine learning. Feature selection techniques include information-theoretic methods that are based on the entropy concept. This study examined the efficiency of the feature selection techniques. It was realised that software defect prediction using significant attributes improves the prediction accuracy.
A novel MICFastCR model, based on the Maximal Information Coefficient (MIC), was developed to select significant attributes, with the Fast Correlation-Based Filter (FCBF) used to eliminate redundant attributes. Machine-learning algorithms were then run to predict software defects. The MICFastCR model achieved the highest prediction accuracy as reported by various performance measures.
School of Computing
Ph. D. (Computer Science)
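The two-stage selection logic (a relevance filter, then redundancy removal in the spirit of FCBF) can be sketched as below. The feature names, scores, and threshold are invented stand-ins; in the thesis the relevance measure is MIC and the redundancy test is FCBF's symmetrical-uncertainty comparison.

```python
def select(features, relevance, redundancy, threshold=0.1):
    """Keep relevant features, then drop any feature more associated with an
    already-kept feature than with the class (FCBF-style heuristic)."""
    ranked = sorted((f for f in features if relevance[f] >= threshold),
                    key=lambda f: relevance[f], reverse=True)
    kept = []
    for f in ranked:
        if all(redundancy.get((k, f), 0.0) < relevance[f] for k in kept):
            kept.append(f)
    return kept

# Hypothetical process-metric scores (invented for illustration).
rel = {"loc_added": 0.6, "loc_deleted": 0.55, "committers": 0.3, "age": 0.05}
red = {("loc_added", "loc_deleted"): 0.7, ("loc_added", "committers"): 0.1,
       ("loc_deleted", "committers"): 0.1}
print(select(["loc_added", "loc_deleted", "committers", "age"], rel, red))
# -> ['loc_added', 'committers']: loc_deleted is redundant, age irrelevant.
```

The example shows both halves of the pipeline at work: `age` falls to the relevance threshold, while `loc_deleted` survives the filter but is discarded because it tracks `loc_added` more closely than it tracks the class.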
The effect of locality based learning on software defect prediction
Software defect prediction poses many problems during classification. A common solution used to improve software defect prediction is to train on data similar, or local, to the testing data. Prior work [12, 64] shows that locality improves the performance of classifiers. This approach has been commonly applied to the field of software defect prediction. In this thesis, we compare the performance of many classifiers, both locality based and non-locality based. We propose a novel classifier called Clump, with the goals of improving classification while providing an explanation as to how the decisions were reached. We also explore the effects of standard clustering and relevancy filtering algorithms.
Through experimentation, we show that locality does not improve classification performance when applied to software defect prediction. The performance of the algorithms is impacted more by the datasets used than by the algorithmic choices made. More research is needed to explore locality based learning and the impact of the datasets chosen.
New techniques for Arabic document classification
Text classification (TC) concerns automatically assigning a class (category) label to a text document, and has increasingly many applications, particularly in organizing and browsing large document collections. It is typically achieved via machine learning, where a model is built on the basis of a typically large collection of document features. Feature selection is critical in this process, since there are typically several thousand potential features (distinct words or terms). In text classification, feature selection aims to improve the computational efficiency and classification accuracy by removing irrelevant and redundant terms (features), while retaining features (words) that contain sufficient information to help with the classification task.
This thesis proposes binary particle swarm optimization (BPSO) hybridized with either K Nearest Neighbour (KNN) or Support Vector Machines (SVM) for feature selection in Arabic text classification tasks. Comparison between feature selection approaches is done on the basis of using the selected features in conjunction with SVM, Decision Trees (C4.5), and Naive Bayes (NB) to classify a held-out test set. Using publicly available Arabic datasets, results show that BPSO/KNN and BPSO/SVM techniques are promising in this domain. The sets of selected features (words) are also analyzed to consider the differences between the types of features that BPSO/KNN and BPSO/SVM tend to choose. This leads to speculation concerning the appropriate feature selection strategy, based on the relationship between the classes in the document categorization task at hand.
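A toy rendering of BPSO for feature selection: each particle is a 0/1 mask over features, velocities follow the usual PSO update, and a sigmoid turns each velocity into a bit-flip probability. The thesis's KNN/SVM wrapper fitness is replaced here by an invented stand-in that rewards a hidden "useful" subset, and all constants are illustrative.

```python
import math
import random

random.seed(0)
N_FEATURES, N_PARTICLES, ITERS = 10, 8, 30
USEFUL = {0, 3, 7}  # hypothetical informative features for the toy fitness

def fitness(mask):
    """Stand-in for classifier accuracy: reward useful picks, penalize noise."""
    chosen = {i for i, b in enumerate(mask) if b}
    return len(chosen & USEFUL) - 0.2 * len(chosen - USEFUL)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

particles = [[random.randint(0, 1) for _ in range(N_FEATURES)]
             for _ in range(N_PARTICLES)]
velocity = [[0.0] * N_FEATURES for _ in range(N_PARTICLES)]
pbest = [p[:] for p in particles]
gbest = max(pbest, key=fitness)[:]

for _ in range(ITERS):
    for p in range(N_PARTICLES):
        for d in range(N_FEATURES):
            r1, r2 = random.random(), random.random()
            # Standard PSO velocity update pulling toward personal/global bests.
            velocity[p][d] += 2 * r1 * (pbest[p][d] - particles[p][d]) \
                            + 2 * r2 * (gbest[d] - particles[p][d])
            # Binary PSO: resample each bit with probability sigmoid(velocity).
            particles[p][d] = 1 if random.random() < sigmoid(velocity[p][d]) else 0
        if fitness(particles[p]) > fitness(pbest[p]):
            pbest[p] = particles[p][:]
    gbest = max(pbest + [gbest], key=fitness)[:]

print(sorted(i for i, b in enumerate(gbest) if b))
```

In the hybrid BPSO/KNN and BPSO/SVM schemes, the only change is that `fitness` evaluates the wrapped classifier on the features the mask selects.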
The thesis also investigates the use of statistically extracted phrases of length two as terms in Arabic text classification. In comparison with the bag-of-words text representation, results show that using phrases alone as terms in the Arabic TC task decreases the classification accuracy of Arabic TC classifiers significantly, while combining bag-of-words and phrase-based representations may increase the classification accuracy of the SVM classifier slightly.
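Statistical extraction of length-two phrases can be illustrated with bigram scoring. Pointwise mutual information is used below as one plausible association statistic (the thesis does not necessarily use this exact one), on an invented English toy corpus.

```python
import math
from collections import Counter

tokens = "the cat sat on the mat the cat ran on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(bigram):
    """Pointwise mutual information of a bigram vs. independent unigrams."""
    w1, w2 = bigram
    p12 = bigrams[bigram] / (n - 1)
    return math.log(p12 / ((unigrams[w1] / n) * (unigrams[w2] / n)))

top = max(bigrams, key=pmi)
print(top, round(pmi(top), 3))  # -> ('cat', 'sat') 1.879
```

Bigrams scoring above a threshold would then be kept as phrase terms alongside (or instead of) the bag-of-words vocabulary, which is the representation choice the experiments above compare.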