
    A review of associative classification mining

    Associative classification mining is a promising approach in data mining that utilizes association rule discovery techniques to construct classification systems, also known as associative classifiers. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. These algorithms employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. This paper focuses on surveying and comparing the state-of-the-art associative classification techniques with regard to the above criteria. Finally, future directions in associative classification, such as incremental learning and mining low-quality data sets, are also highlighted in this paper.
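    As an illustration of the rule ranking and rule prediction steps surveyed above, here is a minimal Python sketch of a generic associative classifier's prediction stage: rules are ordered by confidence and then support, and a record is labeled by the first matching rule. The `Rule` class, the toy rule base and the ranking criterion are illustrative assumptions, not the exact procedures of CPAR, CMAR, MCAR or MMAC.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: frozenset   # set of attribute-value items, e.g. {"outlook=sunny"}
    label: str              # predicted class
    confidence: float       # estimated P(label | antecedent)
    support: float          # fraction of training records matching antecedent and label

def rank_rules(rules):
    # A common ranking criterion: higher confidence first, ties broken by support.
    return sorted(rules, key=lambda r: (r.confidence, r.support), reverse=True)

def predict(rules, record, default_label):
    # Classify with the first (highest-ranked) rule whose antecedent the record satisfies.
    for rule in rank_rules(rules):
        if rule.antecedent <= record:
            return rule.label
    return default_label

# Hypothetical toy rule base and record.
rules = [
    Rule(frozenset({"outlook=sunny", "humidity=high"}), "no", 0.90, 0.20),
    Rule(frozenset({"outlook=overcast"}), "yes", 0.95, 0.25),
]
print(predict(rules, {"outlook=overcast", "wind=weak"}, default_label="yes"))
```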

    Use of Data Mining for Intelligent Evaluation of Imputation Methods

    In real-world situations, researchers frequently face the difficulty of missing values (MV), i.e., values not observed in a data set. Data imputation techniques allow the estimation of MV using different algorithms, by means of which important data can be imputed for a particular instance. Most of the literature in this field deals with different imputation methods. However, few studies deal with a comparative evaluation of the different methods so as to provide more appropriate guidelines for selecting the method to be applied to impute data in specific situations. The objective of this work is to show a methodology for evaluating the performance of imputation methods by means of new metrics derived from data mining processes, using quality metrics of data mining models. We started from a complete dataset that was amputated with different amputation mechanisms to generate 63 datasets with MV; these were imputed using the Median, k-NN, k-Means and Hot-Deck imputation methods. The performance of the imputation methods was evaluated using new metrics derived from quality metrics of the data mining processes, performed with the original full file and with the imputed files. This evaluation is not based on measuring the error when imputing (the usual approach), but on the similarity between the values of the quality metrics of the data mining processes obtained with the original file and with the imputed files. The results show that, globally considered and according to the newly proposed metric, the imputation methods that showed the best performance were k-NN and k-Means. An additional advantage of the proposed methodology is that it provides predictive data mining models that can be used a posteriori.
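    To make the evaluation idea concrete, the sketch below (scikit-learn) amputates a public dataset, imputes it with Median and k-NN, and compares a quality metric of the data mining process (cross-validated accuracy of one fixed model) between the original and the imputed files. The dataset, missingness rate and choice of classifier are placeholder assumptions, not the paper's 63-dataset experimental design.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_full, y = load_iris(return_X_y=True)

# Amputation step (hypothetical MCAR mechanism): knock out 10% of values at random.
rng = np.random.default_rng(0)
X_mv = X_full.copy()
X_mv[rng.random(X_mv.shape) < 0.10] = np.nan

def model_quality(X):
    # Quality metric of the data mining process: cross-validated accuracy of one fixed model.
    return cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

baseline = model_quality(X_full)
for name, imputer in [("median", SimpleImputer(strategy="median")),
                      ("k-NN", KNNImputer(n_neighbors=5))]:
    X_imp = imputer.fit_transform(X_mv)
    # The evaluation compares how close the imputed-file metric is to the original-file
    # metric, rather than measuring the numeric error of each imputed value.
    print(name, abs(baseline - model_quality(X_imp)))
```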

    Agile mining : a novel data mining process for industry practice based on Agile Methods and visualization

    University of Technology Sydney, Faculty of Engineering and Information Technology. Current standard data mining processes like the CRoss-Industry Standard Process for Data Mining (CRISP-DM) are vulnerable to frequent changes in customer requirements. Meanwhile, stakeholders might not acquire sufficient understanding to generate business value from analytic results due to the lack of an intelligible explanatory stage. These two issues repeatedly arise in companies that are inexperienced in data mining practice. To address them, Agile Mining, a refined CRISP-DM-based data mining (DM) process, is proposed to resolve these two friction points between current data mining processes and inexperienced industry practitioners. By merging agile methods into CRISP-DM, the Agile Mining process achieves a data mining environment that is friendly to changing requirements for inexperienced companies. Moreover, Agile Mining transforms the traditional analytic-oriented evaluation into a business-oriented, visualization-based evaluation. In the case study, two industrial data mining projects are used to illustrate the application of this new data mining process and its advantages.

    BigFCM: Fast, Precise and Scalable FCM on Hadoop

    Clustering plays an important role in mining big data, both as a modeling technique and as a preprocessing step in many data mining process implementations. Fuzzy clustering provides more flexibility than non-fuzzy methods by allowing each data record to belong to more than one cluster to some degree. However, a serious challenge in fuzzy clustering is the lack of scalability. Massive datasets in emerging fields such as geosciences, biology and networking require parallel and distributed computation with high performance to solve real-world problems. Although some clustering methods have already been adapted to execute on big data platforms, their execution time increases sharply for large datasets. In this paper, a scalable Fuzzy C-Means (FCM) clustering method named BigFCM is proposed and designed for the Hadoop distributed data platform. Based on the MapReduce programming model, it exploits several mechanisms, including an efficient caching design, to achieve several orders of magnitude reduction in execution time. Extensive evaluation over multi-gigabyte datasets shows that BigFCM is scalable while preserving the quality of clustering.
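    For reference, the core Fuzzy C-Means updates that BigFCM distributes are the membership and centroid recomputations sketched below in plain NumPy. This is a minimal single-node sketch: the Hadoop MapReduce partitioning and caching mechanisms described in the paper are not reproduced, and the parameter values and toy data are illustrative.

```python
import numpy as np

def fcm(X, k, m=2.0, iters=100, tol=1e-5, seed=0):
    """Plain single-node Fuzzy C-Means with fuzzifier m; no distributed machinery."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Distances of every point to every center, floored to avoid division by zero.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)
        # Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1)).
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
        # Centroid update: weighted mean of the data with weights u^m.
        w = u ** m
        new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, u

# Toy two-cluster data.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5])
centers, memberships = fcm(X, k=2)
```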

    A Statistical Toolbox For Mining And Modeling Spatial Data

    Most data mining projects in spatial economics start with an evaluation of a set of attribute variables on a sample of spatial entities, looking for the existence and strength of spatial autocorrelation based on Moran's and Geary's coefficients, whose adequacy is rarely challenged, despite the fact that many users reporting on their properties seem likely to make mistakes and foster confusion. My paper begins with a critical appraisal of the classical definition and rationale of these indices. I argue that while intuitively founded, they are plagued by an inconsistency in their conception. I then propose a principled small change leading to corrected spatial autocorrelation coefficients, which strongly simplifies their relationship and opens the way to an augmented toolbox of statistical methods for dimension reduction and data visualization, also useful for modeling purposes. A second section presents a formal framework, adapted from recent work in statistical learning, which gives theoretical support to our definition of corrected spatial autocorrelation coefficients. More specifically, the multivariate data mining methods presented here are easily implementable in existing (free) software, are useful for exploiting the proposed corrections in spatial data analysis practice, and, from a mathematical point of view, have an asymptotic behavior, already studied in a series of papers by Belkin & Niyogi, that suggests robustness and a limited sensitivity to the Modifiable Areal Unit Problem (MAUP), valuable in exploratory spatial data analysis.
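    For readers unfamiliar with the classical indices under discussion, the sketch below computes the standard Moran's I and Geary's C coefficients for a toy contiguity structure. The corrected coefficients proposed in the paper are not reproduced; the weight matrix and data are hypothetical.

```python
import numpy as np

def morans_i(x, W):
    # Classical Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z = x - mean(x).
    z = x - x.mean()
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

def gearys_c(x, W):
    # Classical Geary's C: ((n - 1) / (2 S0)) * sum_ij w_ij (x_i - x_j)^2 / sum_i z_i^2.
    z = x - x.mean()
    diff2 = (x[:, None] - x[None, :]) ** 2
    return ((len(x) - 1) / (2 * W.sum())) * (W * diff2).sum() / (z @ z)

# Toy example: four regions on a line, rook-contiguity binary weights.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 4.0, 8.0])
print(morans_i(x, W), gearys_c(x, W))
```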

    Principles of Green Data Mining

    This paper develops a set of principles for green data mining, related to the key stages of business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The principles are grounded in a review of the Cross Industry Standard Process for Data Mining (CRISP-DM) model and relevant literature on data mining methods and Green IT. We describe how data scientists can contribute to designing environmentally friendly data mining processes, for instance, by using green energy, choosing between make-or-buy, exploiting approaches to data reduction based on business understanding or pure statistics, or choosing energy-friendly models.

    Comparison of Support Vector Machine and Back Propagation Neural Network in Evaluating the Enterprise Financial Distress

    Recently, applying novel data mining techniques for evaluating enterprise financial distress has received much research attention. Support Vector Machines (SVM) and back propagation neural (BPN) networks have been applied successfully in many areas with excellent generalization results, such as rule extraction, classification and evaluation. In this paper, a model based on SVM with a Gaussian RBF kernel is proposed for enterprise financial distress evaluation. The BPN network is considered one of the simplest and most general methods used for supervised training of multilayered neural networks. The comparative results show that, though the difference between the performance measures is marginal, SVM gives higher precision and lower error rates.
    Comment: 13 pages, 1 figure
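    As a rough illustration of such a comparison, the sketch below trains an SVM with a Gaussian RBF kernel and a back propagation network (multilayer perceptron) on synthetic stand-in features using scikit-learn. The dataset, hyperparameters and preprocessing are assumptions for illustration, not the paper's experimental setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for financial-ratio features; the paper's data are not reproduced here.
X, y = make_classification(n_samples=500, n_features=10, n_informative=6, random_state=0)

models = {
    "SVM (Gaussian RBF kernel)": make_pipeline(
        StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
    "BPN (multilayer perceptron)": make_pipeline(
        StandardScaler(), MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)),
}
for name, model in models.items():
    # Cross-validated accuracy as a simple stand-in for the comparative performance measures.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```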