    Global Entropy Based Greedy Algorithm for discretization

    Discretization is a crucial step, not only for summarizing continuous attributes but also for improving performance in classifiers that require discrete values as input. In this thesis, I propose a supervised discretization method, the Global Entropy Based Greedy algorithm, which is based on Information Entropy Minimization. Experimental results show that the proposed method outperforms state-of-the-art methods on well-known benchmark datasets. To further improve the proposed method, a new stopping criterion based on the rate of change of entropy was also explored. The experimental analysis indicates that a threshold based on the decreasing rate of entropy can be more effective than a fixed number of intervals for classifiers such as C5.0.
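
    The core loop described above can be illustrated with a short, hedged sketch: greedily insert the cut point that most reduces the class-weighted entropy of the partition, and stop when the relative decrease in entropy falls below a threshold (a rough stand-in for the rate-of-change criterion). The function names and the `min_rate` threshold are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def partition_entropy(x, y, cuts):
    """Class entropy of each interval induced by the cut points, weighted by interval size."""
    edges = [-np.inf] + sorted(cuts) + [np.inf]
    n, total = len(y), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x > lo) & (x <= hi)
        if not mask.any():
            continue
        _, counts = np.unique(y[mask], return_counts=True)
        p = counts / counts.sum()
        total += mask.sum() / n * -(p * np.log2(p)).sum()
    return total

def greedy_entropy_discretize(x, y, min_rate=0.01):
    """Greedily add the boundary that lowers entropy the most; stop once the relative
    entropy decrease drops below min_rate (illustrative stopping threshold)."""
    uniq = np.unique(x)
    candidates = (uniq[:-1] + uniq[1:]) / 2          # midpoints between adjacent values
    cuts, current = [], partition_entropy(x, y, [])
    while True:
        best_cut, best_ent = None, current
        for c in candidates:
            if c in cuts:
                continue
            e = partition_entropy(x, y, cuts + [c])
            if e < best_ent:
                best_cut, best_ent = c, e
        if best_cut is None or (current - best_ent) / max(current, 1e-12) < min_rate:
            break
        cuts.append(best_cut)
        current = best_ent
    return sorted(cuts)
```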

    Scalable CAIM Discretization on Multiple GPUs Using Concurrent Kernels

    CAIM (Class-Attribute Interdependence Maximization) is one of the state-of-the-art algorithms for discretizing data for which classes are known. However, it may take a long time when run on high-dimensional, large-scale data with a large number of attributes and/or instances. This paper presents a solution to this problem by introducing a GPU-based implementation of the CAIM algorithm that significantly speeds up the discretization process on big, complex data sets. The GPU-based implementation is scalable to multiple GPU devices and exploits the concurrent kernel execution capabilities of modern GPUs. The GPU-based CAIM model is evaluated and compared with the original CAIM using single- and multi-threaded parallel configurations on 40 data sets with different characteristics. The results show great speedups, up to 139 times faster using 4 GPUs, which makes discretization of big data efficient and manageable. For example, the discretization time of one big data set is reduced from 2 hours to less than 2 minutes.
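
    The CAIM criterion itself is compact: it averages, over the current intervals, the squared count of the dominant class divided by the interval size, and the algorithm greedily adds the boundary that raises this value most. The sketch below is a plain CPU/NumPy illustration of that criterion with a simplified greedy loop; it is not the paper's GPU implementation, whose speedup comes from evaluating many candidate boundaries concurrently across kernels and devices.

```python
import numpy as np

def caim(x, y, cuts):
    """CAIM criterion: mean over intervals of (max class count)^2 / interval size.
    Higher values indicate stronger class-attribute interdependence."""
    edges = [-np.inf] + sorted(cuts) + [np.inf]
    score, n_intervals = 0.0, 0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x > lo) & (x <= hi)
        if not mask.any():
            continue
        _, counts = np.unique(y[mask], return_counts=True)
        score += counts.max() ** 2 / mask.sum()
        n_intervals += 1
    return score / n_intervals

def caim_discretize(x, y):
    """Simplified greedy search: keep adding the candidate boundary that improves CAIM."""
    uniq = np.unique(x)
    candidates = list((uniq[:-1] + uniq[1:]) / 2)     # midpoints between adjacent values
    cuts, best = [], caim(x, y, [])
    improved = True
    while improved and candidates:
        improved = False
        for c in candidates:
            s = caim(x, y, cuts + [c])
            if s > best:
                best, pick, improved = s, c, True
        if improved:
            cuts.append(pick)
            candidates.remove(pick)
    return sorted(cuts)
```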

    A Max-relevance-min-divergence Criterion for Data Discretization with Applications on Naive Bayes

    In many classification models, data is discretized to better estimate its distribution. Existing discretization methods often aim to maximize the discriminant power of the discretized data, while overlooking the fact that the primary goal of data discretization in classification is to improve generalization performance. As a result, the data tend to be over-split into many small bins, since the undiscretized data retain the maximal discriminant information. We therefore propose a Max-Dependency-Min-Divergence (MDmD) criterion that maximizes both the discriminant information and the generalization ability of the discretized data. More specifically, the Max-Dependency criterion maximizes the statistical dependency between the discretized data and the classification variable, while the Min-Divergence criterion explicitly minimizes the JS-divergence between the training data and the validation data for a given discretization scheme. The proposed MDmD criterion is technically appealing, but it is difficult to reliably estimate the high-order joint distributions of attributes and the classification variable. We hence further propose a more practical solution, the Max-Relevance-Min-Divergence (MRmD) discretization scheme, in which each attribute is discretized separately while simultaneously maximizing the discriminant information and the generalization ability of the discretized data. The proposed MRmD is compared with state-of-the-art discretization algorithms under the naive Bayes classification framework on 45 machine-learning benchmark datasets. It significantly outperforms all the compared methods on most of the datasets. Comment: Under major revision at Pattern Recognition.
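
    A hedged sketch of how the MRmD objective for a single attribute could be assembled follows: relevance is taken as the mutual information between the binned attribute and the class, and generalization is enforced by subtracting a trade-off weight times the JS divergence between the training and validation bin distributions. The `lam` trade-off and the function names are illustrative assumptions; the paper's exact formulation and weighting may differ.

```python
import numpy as np
from scipy.stats import entropy   # entropy(p, q) gives the KL divergence

def discretize(x, cuts):
    return np.digitize(x, sorted(cuts))               # bin index per sample

def mutual_information(d, y):
    """I(D; Y) from the empirical joint distribution of bin index and class label."""
    classes = {c: i for i, c in enumerate(np.unique(y))}
    joint = np.zeros((d.max() + 1, len(classes)))
    for b, c in zip(d, y):
        joint[b, classes[c]] += 1
    joint /= joint.sum()
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum()

def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m, base=2) + 0.5 * entropy(q, m, base=2)

def mrmd_objective(x_train, y_train, x_val, cuts, lam=1.0):
    """Relevance (MI with the class) minus lam * JS divergence between the train and
    validation bin distributions, for one attribute and one candidate cut set."""
    d_tr, d_va = discretize(x_train, cuts), discretize(x_val, cuts)
    n_bins = max(d_tr.max(), d_va.max()) + 1
    p = np.bincount(d_tr, minlength=n_bins) / len(d_tr)
    q = np.bincount(d_va, minlength=n_bins) / len(d_va)
    return mutual_information(d_tr, y_train) - lam * js_divergence(p, q)
```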

    The classification performance of Bayesian Networks Classifiers: a case study of detecting Denial of Service (DoS) attacks in cloud computing environments

    In this research we propose a Bayesian networks approach as a promising classification technique for detecting malicious traffic due to Denial of Service (DoS) attacks. Bayesian networks have been applied in numerous fields fraught with uncertainty and have proved successful, excelling in classification tasks such as text analysis, medical diagnosis and environmental modeling and management. The detection of DoS attacks has received tremendous attention in the field of network security. DoS attacks have proved detrimental and are the bane of cloud computing environments. Large business enterprises have been, or are still, unwilling to outsource their businesses to the cloud due to the intrusive tendencies that cloud platforms are prone to. To make use of Bayesian networks it is imperative to understand the "ecosystem" of factors external to modeling the Bayesian algorithm itself. Understanding these factors has been shown to yield improvements in classification performance comparable to augmenting the existing algorithms. The literature discusses factors that impact classification capability; however, the effects of these factors are not universal and tend to be unique to each domain problem. This study investigates the effects of modeling parameters on the classification performance of Bayesian network classifiers in detecting DoS attacks on cloud platforms. We analyzed how structural complexity, training sample size, the choice of discretization method and the scoring function, both individually and collectively, impact the performance of classifying between normal traffic and DoS attacks on the cloud. To study these factors, we conducted a series of experiments detecting live DoS attacks launched against a deployed cloud and then examined the classification accuracy of different classes of Bayesian networks. The NSL-KDD dataset was used as our training set. We used ownCloud software to deploy our cloud platform, and the hping3 utility to launch DoS attacks. A live packet capture was used as our test set. WEKA version 3.7.12 was used for our experiments. Our results show that increasing model complexity improves classification performance, which is attributed to the increase in the number of attribute correlations captured. Increasing the training sample size also improved classification ability. Our findings note that the choice of discretization algorithm does matter in the quest for optimal classification performance. Furthermore, our results indicate that the choice of scoring function does not affect the classification performance of Bayesian networks. Conclusions drawn from this research are prescriptive, particularly for novice machine learning researchers, offering valuable recommendations for achieving optimal classification performance with Bayesian network classifiers.
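
    To make the finding that the choice of discretization matters concrete, the following sketch (using scikit-learn as an assumed stand-in for the study's WEKA setup) trains naive Bayes, the simplest Bayesian network classifier, on features discretized with different unsupervised strategies and compares held-out accuracy; the dataset, bin count and strategies are illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import accuracy_score

def compare_discretizers(X, y, strategies=("uniform", "quantile", "kmeans"), n_bins=10):
    """Train naive Bayes on features binned with different unsupervised discretizers
    and report held-out accuracy for each strategy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    results = {}
    for strategy in strategies:
        disc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy=strategy)
        clf = CategoricalNB(min_categories=n_bins)   # avoid unseen-bin issues at test time
        clf.fit(disc.fit_transform(X_tr), y_tr)
        results[strategy] = accuracy_score(y_te, clf.predict(disc.transform(X_te)))
    return results
```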

    A New Swarm-Based Framework for Handwritten Authorship Identification in Forensic Document Analysis

    Feature selection has long been a focus of research due to the immense growth of high-dimensional data. Originally, the purpose of feature selection is to select the minimally sized subset of features whose class distribution is as close as possible to the original class distribution. In this chapter, however, feature selection is used to obtain the unique individual significant features which have proven very important in handwriting analysis for the Writer Identification domain. Writer Identification is one of the areas in pattern recognition that has attracted the attention of many researchers due to the extensive exchange of paper documents. Its principal applications are in forensics and biometrics, where writing style can be used as a biometric feature for authenticating the identity of a writer. Handwriting style is personal to an individual and is implicitly represented by unique individual significant features hidden in that individual's handwriting. These unique features can be used to identify the handwritten authorship accordingly. The use of feature selection, an important machine learning task, is often disregarded in the Writer Identification domain, with only a handful of studies implementing a feature selection phase. The key concern in Writer Identification is acquiring the features that reflect the author of the handwriting. Thus, it is an open question whether the extracted features are optimal or near-optimal for identifying the author. Therefore, feature extraction and selection of the unique individual significant features are very important in order to identify the writer and to improve classification accuracy. This relates to the invarianceness of authorship, where the invarianceness between features for intra-class samples (same writer) is lower than for inter-class samples (different writers). Much research has been done to develop algorithms for extracting good features that reflect authorship with good performance. This chapter instead focuses on identifying the unique individual significant features of word shape by using a feature selection method prior to the identification task. In this chapter, feature selection is explored in order to find the most unique individual significant features of an individual's writing. This chapter focuses on the integration of the Swarm Optimized and Computationally Inexpensive Floating Selection (SOCIFS) feature selection technique into the proposed hybrid Writer Identification and feature selection framework, namely Cheap Computational Cost Class-Specific Swarm Sequential Selection (C4S4). Experiments conducted to prove the validity and feasibility of the proposed framework, using a dataset from the IAM Database and comparing the proposed framework to the existing Writer Identification framework and various feature selection techniques and frameworks, yield satisfactory results. The results show the proposed framework produces the best result, with 99.35% classification accuracy. The promising outcomes open the gate to future explorations in the Writer Identification domain specifically and in other domains generally.
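
    As a generic illustration of the floating-selection idea such frameworks build on (not an implementation of SOCIFS or C4S4, and without the swarm component), the sketch below runs sequential floating forward selection with cross-validated accuracy as the subset score; the estimator, the scoring choice and the size limit `k` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sffs(X, y, k, estimator=None, cv=5):
    """Sequential floating forward selection on a NumPy feature matrix X (k <= n_features):
    add the best single feature, then drop features while that improves the best score
    seen for the smaller subset size."""
    estimator = estimator or KNeighborsClassifier()
    score = lambda idx: cross_val_score(estimator, X[:, idx], y, cv=cv).mean()
    selected, best_by_size = [], {}
    while len(selected) < k:
        # forward step: add the feature with the highest cross-validated score
        gains = {j: score(selected + [j]) for j in range(X.shape[1]) if j not in selected}
        j_add = max(gains, key=gains.get)
        selected.append(j_add)
        best_by_size[len(selected)] = gains[j_add]
        # floating step: drop a feature if the reduced subset beats the best of its size
        while len(selected) > 2:
            drops = {j: score([f for f in selected if f != j]) for j in selected}
            j_drop = max(drops, key=drops.get)
            if drops[j_drop] > best_by_size.get(len(selected) - 1, -np.inf):
                selected.remove(j_drop)
                best_by_size[len(selected)] = drops[j_drop]
            else:
                break
    return selected, best_by_size[len(selected)]
```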

    SODE: Self-Adaptive One-Dependence Estimators for classification

    SuperParent-One-Dependence Estimators (SPODEs) represent a family of semi-naive Bayesian classifiers which relax the attribute independence assumption of Naive Bayes (NB) to allow each attribute to depend on a common single attribute (the superparent). SPODEs can effectively handle data with attribute dependency while still inheriting NB's key advantages, such as computational efficiency and robustness for high-dimensional data. In reality, determining an optimal superparent for SPODEs is difficult. One common approach is to use weighted combinations of multiple SPODEs, each having a different superparent with a properly assigned weight value (i.e., a weight value is assigned to each attribute). In this paper, we propose a self-adaptive SPODE, namely SODE, which uses immunity theory from artificial immune systems to automatically and self-adaptively select the weight for each single SPODE. SODE does not need to know the importance of individual SPODEs or the relevance among SPODEs, and can flexibly and efficiently search for optimal weight values for each SPODE during the learning process. Extensive experiments and comparisons on 56 benchmark data sets, and validations on image and text classification, demonstrate that SODE outperforms state-of-the-art weighted SPODE algorithms and is suitable for a wide range of learning tasks. Results also confirm that SODE provides an appropriate balance between runtime efficiency and accuracy.
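
    A minimal sketch of the weighted-SPODE combination that SODE tunes, assuming discretized attributes and precomputed count tables; the immune-system weight adaptation itself is not shown, so the per-SPODE weights are plain inputs and all names and table layouts are illustrative assumptions.

```python
import numpy as np

def spode_joint(joint_sp, cond, x, y, sp):
    """Unnormalised P(y, x) under one SPODE with superparent attribute sp:
    P(y, x_sp) * prod_{i != sp} P(x_i | y, x_sp), with Laplace smoothing.
    joint_sp[c, v] counts (class c, superparent value v); cond[i][c, v, u] counts
    (class c, superparent value v, attribute-i value u); both come from training data."""
    p = (joint_sp[y, x[sp]] + 1.0) / (joint_sp.sum() + joint_sp.size)
    for i, xi in enumerate(x):
        if i == sp:
            continue
        table = cond[i]
        p *= (table[y, x[sp], xi] + 1.0) / (table[y, x[sp], :].sum() + table.shape[2])
    return p

def weighted_spode_predict(x, weights, joint_sps, conds, n_classes):
    """Weighted combination of SPODEs, one per candidate superparent; SODE would learn
    the weights self-adaptively, here they are simply supplied by the caller."""
    scores = np.array([
        sum(w * spode_joint(joint_sps[sp], conds[sp], x, c, sp)
            for sp, w in enumerate(weights))
        for c in range(n_classes)
    ])
    return int(scores.argmax())
```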

    Failure Analysis of Soil Slopes with Advanced Bayesian Networks

    To prevent catastrophic consequences of slope failure, it is helpful to understand in advance the effect of both internal and external triggering factors on slope stability. Herein we present an application of advanced Bayesian networks to solving geotechnical problems. A model of soil slopes is constructed to predict the probability of slope failure and to analyze the influence of the triggering factors on the results. The paper explains the theoretical background of enhanced Bayesian networks, which can cope with continuous input parameters, and of Credal networks, which are particularly suited to incomplete input information. Two geotechnical examples are implemented to demonstrate the feasibility and predictive effectiveness of advanced Bayesian networks. The ability of BNs to deal with the prediction of slope failure is discussed as well. The paper also evaluates the influence of several geotechnical parameters and discusses how the different types of BNs contribute to assessing the stability of real slopes, and how new information can be introduced and updated in the analysis.

    Data trend mining for predictive systems design

    The goal of this research is to propose a data mining based design framework that can be used to solve complex systems design problems in a timely and efficient manner, with the main focus being product family design problems. Traditional data acquisition techniques employed in the product design community have relied primarily on customer survey data or focus group feedback as a means of integrating customer preference information into the product design process. Reliance on direct customer interaction can be costly and time consuming and may therefore limit the overall size and complexity of the customer preference data. Furthermore, since survey data typically represent stated customer preferences (customer responses to hypothetical product designs rather than actual purchasing decisions), design engineers may not know the true customer preferences for specific product attributes, a challenge that could ultimately result in misguided product designs. By analyzing large-scale time-series consumer data, new products can be designed that anticipate emerging product preference trends in the market space. The proposed data trend mining algorithm enables design engineers to determine how to characterize attributes based on their relevance to the overall product design. A cell phone case study is used to demonstrate product design problems involving new product concept generation, and an aerodynamic particle separator case study is presented for product design problems requiring attribute relevance characterization and product family clustering. Finally, it is shown that the proposed trend mining methodology can be extended beyond product design problems to systems-of-systems design problems such as military systems simulations.