    Distributed Kernel Regression: An Algorithm for Training Collaboratively

    Full text link
    This paper addresses the problem of distributed learning under communication constraints, motivated by distributed signal processing in wireless sensor networks and data mining with distributed databases. After formalizing a general model for distributed learning, an algorithm for collaboratively training regularized kernel least-squares regression estimators is derived. Noting that the algorithm can be viewed as an application of successive orthogonal projection algorithms, its convergence properties are investigated and the statistical behavior of the estimator is discussed in a simplified theoretical setting. Comment: To be presented at the 2006 IEEE Information Theory Workshop, Punta del Este, Uruguay, March 13-17, 2006
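    The abstract does not give the update equations, but the estimator class being trained is standard regularized kernel least-squares (kernel ridge) regression. A minimal centralized sketch in Python, assuming a Gaussian kernel; the paper's collaborative, successive-projection protocol, in which each node would refine a shared coefficient vector against its local data, is not reproduced here:

```python
# Centralized regularized kernel least-squares: solve (K + lam*n*I) alpha = y.
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_kernel_ridge(X, y, lam=1e-2, gamma=1.0):
    """Return the coefficient vector alpha of the regularized estimator."""
    n = len(X)
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict(X_train, alpha, X_new, gamma=1.0):
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

# Toy usage: fit a noisy sine on 50 points.
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(50, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_kernel_ridge(X, y)
print(predict(X, alpha, np.array([[1.5]])))
```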

    Granular Support Vector Machines Based on Granular Computing, Soft Computing and Statistical Learning

    Get PDF
    With the emergence of biomedical informatics, Web intelligence, and E-business, new challenges are arising for knowledge discovery and data mining. In this dissertation, a framework named Granular Support Vector Machines (GSVM) is proposed to systematically and formally combine statistical learning theory, granular computing theory and soft computing theory to address challenging predictive data modeling problems effectively and/or efficiently, with a specific focus on binary classification problems. In general, GSVM works in three steps. Step 1 is granulation, to build a sequence of information granules from the original dataset or from the original feature space. Step 2 is modeling Support Vector Machines (SVM) in some of these information granules where necessary. Finally, step 3 is aggregation, to consolidate the information in these granules at a suitable abstraction level. A good granulation method for finding suitable granules is crucial for modeling a good GSVM. Under this framework, different granulation algorithms, including the GSVM-CMW (cumulative margin width) algorithm, the GSVM-AR (association rule mining) algorithm, a family of GSVM-RFE (recursive feature elimination) algorithms, the GSVM-DC (data cleaning) algorithm and the GSVM-RU (repetitive undersampling) algorithm, are designed for binary classification problems with different characteristics. Empirical studies in the biomedical domain and many other application domains demonstrate that the framework is promising. As a preliminary step, this dissertation work will be extended in the future to build a Granular Computing based Predictive Data Modeling framework (GrC-PDM) with which hybrid adaptive intelligent data mining systems can be created for high quality prediction.
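    An illustrative sketch of the three GSVM steps (granulate, model, aggregate), with plain k-means clusters standing in for information granules and labels assumed to be in {0, 1}. None of the dissertation's specific granulation algorithms (GSVM-CMW, GSVM-AR, etc.) are reproduced; this only shows the skeleton of the framework:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def fit_gsvm(X, y, n_granules=4):
    # Step 1: granulation -- partition the input space into granules.
    km = KMeans(n_clusters=n_granules, n_init=10, random_state=0).fit(X)
    models, majority = {}, {}
    for g in range(n_granules):
        idx = km.labels_ == g
        majority[g] = int(np.bincount(y[idx]).argmax())  # fallback for pure granules
        if len(np.unique(y[idx])) == 2:
            # Step 2: train an SVM only where the granule mixes both classes.
            models[g] = SVC().fit(X[idx], y[idx])
    return km, models, majority

def predict_gsvm(km, models, majority, X):
    # Step 3: aggregation -- route each point to its granule's local model;
    # granules that were pure at training time return their majority label.
    granules = km.predict(X)
    return np.array([models[g].predict(x[None])[0] if g in models else majority[g]
                     for g, x in zip(granules, X)])
```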

    Discovery of the D-basis in binary tables based on hypergraph dualization

    Get PDF
    Discovery of (strong) association rules, or implications, is an important task in data management, and it finds application in artificial intelligence, data mining and the semantic web. We introduce a novel approach for the discovery of a specific set of implications, called the D-basis, that provides a representation for a reduced binary table, based on the structure of its Galois lattice. At the core of the method are the D-relation defined in the lattice theory framework, and the hypergraph dualization algorithm that allows us to effectively produce the set of transversals for a given Sperner hypergraph. The latter algorithm, first developed by specialists from the Rutgers Center for Operations Research, has already found numerous applications in solving optimization problems in database theory, artificial intelligence and game theory. One application of the method is the analysis of gene expression data related to a particular phenotypic variable, and some initial testing has been done on data provided by the University of Hawaii Cancer Center.
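    A naive sketch of the transversal computation at the core of the method: enumerate the inclusion-minimal transversals (hitting sets) of a Sperner hypergraph. The quasi-polynomial dualization algorithm referenced in the abstract is far more sophisticated; this brute-force version only fixes ideas:

```python
from itertools import combinations

def minimal_transversals(vertices, edges):
    """Return all inclusion-minimal vertex sets hitting every edge."""
    edges = [frozenset(e) for e in edges]
    found = []
    for k in range(1, len(vertices) + 1):
        for cand in combinations(vertices, k):
            s = frozenset(cand)
            # Keep s if it hits every edge and no smaller transversal is inside it.
            if all(s & e for e in edges) and not any(t <= s for t in found):
                found.append(s)
    return found

# Example: hypergraph {{a,b},{b,c}} has minimal transversals {b} and {a,c}.
print(minimal_transversals("abc", [{"a", "b"}, {"b", "c"}]))
```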

    Negative sequential pattern mining

    Full text link
    University of Technology, Sydney. Faculty of Engineering and Information Technology. Sequential pattern mining provides an important way to obtain special patterns from sequence data. It produces important insights on bioinformatics data, web logs, customer transaction data, and so on. Unlike traditional positive sequential pattern (PSP) mining, negative sequential pattern (NSP) mining takes negative itemsets into account besides positive ones. It is more interesting in applications where non-occurring itemsets need to be considered. This thesis reports our previous and latest research outcomes in this area. The contributions of the thesis are as follows.
    • A comprehensive literature review of negative frequent pattern mining.
    • A general framework for NSP mining, which describes the big picture of both the PSP and NSP mining problems.
    • Three innovative algorithms for mining NSP efficiently.
    • Extensive experiments with the three algorithms on synthetic and real-world datasets, showing that the proposed methods find NSP efficiently.
    • A case study describing a real-life application to customer claims analysis in the health insurance industry.
    The three NSP mining algorithms proposed in this thesis are as follows. (1) The first algorithm, Neg-GSP (Zheng, Zhao, Zuo & Cao 2009), is based on the PSP mining algorithm GSP (Srikant & Agrawal 1996). Neg-GSP handles negation by introducing new methods of joining and generating candidates, which borrow ideas from GSP; an effective pruning method to reduce the number of candidates is proposed as well. (2) The second, GA-NSP (Zheng, Zhao, Zuo & Cao 2010), is based on a Genetic Algorithm. It finds NSP with novel crossover and mutation operations that are efficient at passing good genes on to the next generations. An effective dynamic fitness function and a pruning method are also provided to improve performance. (3) The third algorithm, e-NSP (Dong, Zheng, Cao, Zhao, Zhang, Li, Wei & Ou 2011), is based on set theory. It mines NSP using only the identified PSP, without re-scanning the database, so mining NSP requires no additional database scans. It enables existing PSP mining algorithms to mine NSP and offers a new strategy for efficient NSP mining. The results of extensive experiments with the three algorithms show that they find NSP efficiently and perform well compared with existing NSP mining algorithms such as PNSP (Hsueh, Lin & Chen 2008). Comparing the problem statements of the three methods, Neg-GSP and GA-NSP share the same definitions, while e-NSP uses stronger constraints, since it requires clear boundaries to apply set theory. Comparing their performance, GA-NSP slightly outperforms Neg-GSP in terms of execution time, but it may miss some patterns in the complete result sets due to the limitations of the Genetic Algorithm. e-NSP is the most efficient and effective of the three, since it does not need to scan the datasets to calculate the support of NSP. Although the stronger constraints of e-NSP make the search space much smaller than it is under the general definitions, the method remains practical in real-life applications. Finally, NSP mining case studies from the health insurance industry are presented. Based on real-life customer claims datasets, we use the proposed NSP mining methods to find PSP and NSP for two business issues: over-service analysis in ancillary services, and fraudulent claim detection. Both case studies demonstrate the benefits gained from mining NSP.
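    A toy sketch of the set-theoretic idea behind e-NSP, restricted to a single negated element and simple subsequence containment: the support of a negative sequence is derived from already-mined positive supports, with no extra database scan. The full e-NSP formula for multiple negations is more involved than this:

```python
# For ns = <a, neg(b), c>:  sup(ns) = sup(<a, c>) - sup(<a, b, c>),
# i.e. sequences containing <a, c> minus those also containing <a, b, c>.

def contains(sequence, pattern):
    """True if `pattern` is a (non-contiguous) subsequence of `sequence`."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def positive_support(db, pattern):
    return sum(contains(s, pattern) for s in db)

def negative_support(db, prefix_suffix, negated_mid):
    """sup(<prefix, neg(mid), suffix>) as a difference of positive supports."""
    pre, suf = prefix_suffix
    return (positive_support(db, pre + suf)
            - positive_support(db, pre + [negated_mid] + suf))

db = [list("abc"), list("ac"), list("bc"), list("ab")]
# Support of <a, neg(b), c>: sup(<a,c>) = 2, sup(<a,b,c>) = 1, so the result is 1
# (only "ac" contains a..c without a b between them).
print(negative_support(db, (["a"], ["c"]), "b"))
```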

    Data mining algorithm for manufacturing process control

    Full text link
    In this paper, a new data mining algorithm based on rough set theory is presented for manufacturing process control. The algorithm extracts useful knowledge from large data sets obtained from manufacturing processes and represents this knowledge using "if/then" decision rules. Application of the data mining algorithm developed in this paper is illustrated with an industrial example of rapid tool making (RTM). RTM is a technology that adopts rapid prototyping (RP) techniques, such as spray forming, and applies them to tool and die making. A detailed discussion of how to control the output of the manufacturing process using the results obtained from the data mining algorithm is also presented. Compared to other data mining methods, such as decision trees and neural networks, the advantages of the proposed approach are its accuracy, computational efficiency, and ease of use. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/45889/1/170_2004_Article_2367.pd
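    A minimal sketch in the spirit of rough-set rule extraction, not the paper's exact algorithm: objects with identical condition-attribute values form elementary sets, and any set consistent with the decision attribute yields a certain "if/then" rule. The attribute names and data below are hypothetical:

```python
from collections import defaultdict

def extract_rules(rows, condition_attrs, decision_attr):
    # Group objects into elementary sets by their condition-attribute values.
    blocks = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in condition_attrs)
        blocks[key].append(row[decision_attr])
    rules = []
    for key, decisions in blocks.items():
        if len(set(decisions)) == 1:  # consistent block -> certain rule
            cond = " AND ".join(f"{a}={v}" for a, v in zip(condition_attrs, key))
            rules.append(f"IF {cond} THEN {decision_attr}={decisions[0]}")
    return rules

# Toy process-control data: spray-forming settings vs. tool quality.
data = [
    {"temp": "high", "speed": "slow", "quality": "good"},
    {"temp": "high", "speed": "fast", "quality": "bad"},
    {"temp": "low",  "speed": "slow", "quality": "good"},
    {"temp": "high", "speed": "slow", "quality": "good"},
]
for r in extract_rules(data, ["temp", "speed"], "quality"):
    print(r)
```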

    Advances on Concept Drift Detection in Regression Tasks using Social Networks Theory

    Full text link
    Mining data streams is one of the main research topics in machine learning due to its applications in many knowledge areas. One of the major challenges in mining data streams is concept drift, which requires the learner to discard the current concept and adapt to a new one. Ensemble-based drift detection algorithms have been applied successfully to the classification task, but they usually maintain a fixed-size ensemble of learners, running the risk of needlessly spending processing time and memory. In this paper we present improvements to the Scale-free Network Regressor (SFNR), a dynamic ensemble-based method for regression that employs social networks theory. To detect concept drifts, SFNR uses the Adaptive Windowing (ADWIN) algorithm. Results show improvements in accuracy, especially in concept drift situations, and better performance compared to other state-of-the-art algorithms on both real and synthetic data.
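    A simplified stand-in for the ADWIN detector that SFNR relies on: keep a window of recent errors and flag a drift when the means of the two window halves differ by more than a Hoeffding-style threshold. Real ADWIN adaptively checks every split point and shrinks the window; this fixed-split version only fixes ideas:

```python
import math
from collections import deque

class SimpleDriftDetector:
    def __init__(self, window=100, delta=0.002):
        self.buf = deque(maxlen=window)
        self.delta = delta

    def update(self, x):
        """Feed one observation (e.g. a per-example error); return True on drift."""
        self.buf.append(x)
        n = len(self.buf)
        if n < self.buf.maxlen:
            return False
        half = n // 2
        left, right = list(self.buf)[:half], list(self.buf)[half:]
        m = 1 / (1 / len(left) + 1 / len(right))          # harmonic sample size
        eps = math.sqrt(math.log(2 / self.delta) / (2 * m))  # Hoeffding bound
        if abs(sum(left) / len(left) - sum(right) / len(right)) > eps:
            self.buf.clear()  # drift detected: drop the stale window
            return True
        return False

# Usage: an abrupt mean shift at position 200 is flagged shortly after it occurs.
det = SimpleDriftDetector()
stream = [0.0] * 200 + [1.0] * 200
print([i for i, x in enumerate(stream) if det.update(x)])
```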

    On the Bayesian network based data mining framework for the choice of appropriate time scale for regional analysis of drought Hazard

    Get PDF
    Data mining has a significant role in hydrologic research. Among the many methods of data mining, Bayesian network theory is of great importance and has wide applications. Drought indices are useful tools for drought monitoring and forecasting; however, the multi-scaling nature of standardized-type drought indices creates several problems for data analysis and reanalysis at the regional level. This paper presents a novel data mining framework for hydrological research: the Bayesian Integrated Regional Drought Time Scale (BIRDts). The mechanism of BIRDts yields effective and sufficient time scales by considering dependency/interdependency probabilities from the Bayesian network algorithm. The resulting time scales are proposed for further investigation and research related to the hydrological process. The proposed method is applied to 46 meteorological stations in Pakistan. In this research, we employ the Standardized Precipitation Temperature Index (SPTI) drought index for 1-, 3-, 6-, 9-, 12-, 24-, and ()-month time scales. The outcomes of this research show that the proposed method provides a rationale for aggregating time scales at the regional level by using marginal posterior probabilities as weights in the selection of effective drought time scales.
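    A sketch of the multi-scaling behavior that motivates BIRDts: a standardized index is computed from totals accumulated over each time scale, and the scales are then combined using posterior-probability weights. Here a plain z-score stands in for the fitted-distribution standardization actually used by SPI/SPTI-type indices, and the weights are placeholders, not output of the actual BIRDts procedure:

```python
import numpy as np

def standardized_index(series, scale):
    """z-score of `scale`-month rolling totals of a monthly series."""
    totals = np.convolve(series, np.ones(scale), mode="valid")
    return (totals - totals.mean()) / totals.std()

rng = np.random.default_rng(1)
precip = rng.gamma(2.0, 30.0, size=240)  # 20 years of synthetic monthly data
scales = [1, 3, 6, 9, 12, 24]
indices = {k: standardized_index(precip, k) for k in scales}

# Placeholder marginal posterior probabilities used as aggregation weights.
posterior_w = {1: 0.05, 3: 0.10, 6: 0.30, 9: 0.25, 12: 0.20, 24: 0.10}
n = min(len(v) for v in indices.values())  # align series of different lengths
aggregate = sum(posterior_w[k] * indices[k][-n:] for k in scales)
print(aggregate[:5])
```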

    Intertemporal Choice of Fuzzy Soft Sets

    Get PDF
    This paper merges two noteworthy aspects of choice. On the one hand, soft sets and fuzzy soft sets are popular models that have been widely applied to decision making problems, such as real estate valuation, medical diagnosis (glaucoma, prostate cancer, etc.), data mining, or international trade. They provide crisp or fuzzy parameterized descriptions of the universe of alternatives. On the other hand, in many decisions, costs and benefits occur at different points in time. This brings about intertemporal choices, which may involve an indefinitely large number of periods. However, the literature does not provide a model, let alone a solution, for the intertemporal problem when the alternatives are described by (fuzzy) parameterizations. In this paper, we propose a novel soft set inspired model that applies to the intertemporal framework, hence filling an important gap in the development of fuzzy soft set theory. An algorithm allows the selection of the optimal option in intertemporal choice problems with an infinite time horizon. We illustrate its application with a numerical example involving alternative portfolios of projects that a public administration may undertake. This allows us to establish a pioneering intertemporal model of choice in the framework of extended fuzzy set theories.
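    A hedged sketch of the kind of model described: each alternative has a fuzzy membership degree per period, and an intertemporal score discounts periods geometrically, with the infinite horizon approximated by truncation. Scoring by a discounted membership sum is an illustrative choice, not necessarily the paper's exact aggregation rule, and the portfolios below are hypothetical:

```python
def intertemporal_score(memberships, discount=0.9, horizon=1000):
    """memberships: function t -> degree in [0, 1]. Truncated geometric sum;
    the neglected tail is bounded by discount**horizon / (1 - discount)."""
    return sum(discount ** t * memberships(t) for t in range(horizon))

# Two hypothetical project portfolios: A pays off early, B improves over time.
portfolio_a = lambda t: 0.9 if t < 5 else 0.2
portfolio_b = lambda t: min(1.0, 0.1 + 0.05 * t)
options = {"A": portfolio_a, "B": portfolio_b}
best = max(options, key=lambda name: intertemporal_score(options[name]))
print(best)
```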