
    HANDLING MISSING ATTRIBUTE VALUES IN DECISION TABLES USING VALUED TOLERANCE APPROACH

    Rule induction is one of the key areas in data mining, as it is applied to a large amount of real-life data. However, such real-life data are incompletely specified most of the time. To induce rules from incomplete data, more powerful algorithms are necessary. This research work focuses on a probabilistic approach based on the valued tolerance relation. The thesis is divided into two parts. The first part describes the implementation of the valued tolerance relation; the induced rules are then evaluated based on the error rate due to incorrectly classified and unclassified examples. The second part compares the rules induced by the previously implemented MLEM2 algorithm with the rules induced by the valued tolerance based approach implemented as part of this research. The error rates of the MLEM2 algorithm and the valued tolerance based approach are thereby compared and the results documented.
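
    The abstract above does not give the exact definition of the valued tolerance relation it implements. As a minimal sketch of one common probabilistic reading, the Python below assumes that each missing value is uniformly distributed over its attribute's domain and that attributes are independent, so the degree to which two objects tolerate each other is the product of per-attribute match probabilities. The names `MISSING`, `attribute_tolerance`, `valued_tolerance` and the sample data are hypothetical, and other formulations may treat the case where both values are missing differently.

```python
from math import prod

MISSING = None  # hypothetical marker for an unknown attribute value


def attribute_tolerance(a, b, domain_size):
    """Probability that two values on one attribute are equal.

    Assumption: a missing value is uniform over a domain of `domain_size`
    values, and two missing values are independent draws.
    """
    if a is MISSING or b is MISSING:
        # One or both unknown: chance of a match under the uniform assumption.
        return 1.0 / domain_size
    return 1.0 if a == b else 0.0


def valued_tolerance(x, y, domain_sizes):
    """Degree of tolerance between objects x and y, assuming independent attributes."""
    return prod(
        attribute_tolerance(a, b, n) for a, b, n in zip(x, y, domain_sizes)
    )


# Illustrative objects described by three attributes with 3, 2 and 2 possible values.
x = ("high", MISSING, "yes")
y = ("high", "low", "yes")
z = ("low", "low", "yes")
sizes = (3, 2, 2)

print(valued_tolerance(x, y, sizes))  # known values agree, one value is missing
print(valued_tolerance(x, z, sizes))  # a known mismatch gives zero tolerance
```

    A rule inducer built on such a relation would work with these graded degrees instead of a crisp yes/no indiscernibility test.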

    MRDTL: a multi-relational decision tree learning algorithm

    Many real-world data sets are organized in relational databases consisting of multiple tables and associations. Other types of data, such as those in bioinformatics and computational biology, and HTML and XML documents, require reasoning about the structure of the objects. However, most existing approaches to machine learning assume that the data are stored in a single table, and use a propositional (as opposed to relational) language for discovering predictive models. Hence, there is a need for data mining algorithms that discover a priori unknown relationships from multi-relational data. This thesis explores a new framework for multi-relational data mining. It describes experiments with an implementation of a Multi-Relational Decision Tree Learning (MRDTL) algorithm for induction of decision trees from relational databases, based on an approach suggested by Knobbe et al. (1999). Our experiments with widely used benchmark data sets (e.g., the carcinogenesis data) show that the performance of MRDTL is competitive with that of other algorithms for learning classifiers from multiple relations, including Progol (Muggleton, 1995), FOIL (Quinlan, 1993) and Tilde (Blockeel, 1998). Preliminary results indicate that MRDTL, when augmented with principled methods for handling missing attribute values, is likely to be competitive with the state-of-the-art algorithms for learning classifiers from multiple relations on real-world data sets drawn from bioinformatics applications (prediction of gene localization and gene function) used in the KDD Cup 2001 data mining competition (Cheng et al., 2002).

    Rough Fuzzy Subspace Clustering for Data with Missing Values

    The paper presents a rough fuzzy subspace clustering algorithm and experimental clustering results. The algorithm uses three approaches for handling missing values: marginalisation, imputation and rough sets. It also assigns weights to attributes in each cluster, which leads to subspace clustering. The cluster parameters are computed in an iterative procedure that minimises a criterion function. The crucial parameter of the proposed algorithm controls the sharpness of the resulting subspace clusters: lower values lead to selection of the most important attribute, whereas higher values create clusters in the global space rather than in subspaces. The paper is accompanied by clustering results on synthetic and real-life data sets.

    Ensemble missing data techniques for software effort prediction

    Constructing an accurate effort prediction model is a challenge in software engineering. The development and validation of models used for prediction tasks require good-quality data. Unfortunately, software engineering datasets tend to suffer from incompleteness, which can result in inaccurate decision making, project management and implementation. Recently, machine learning algorithms, including ensembles of (combined) classifiers, have proven to be of great practical value in solving a variety of software engineering problems, including software prediction. Research indicates that ensembles of individual classifiers significantly improve classification performance by having the classifiers vote for the most popular class. This paper proposes a method for improving the software effort prediction accuracy of a decision tree learning algorithm by generating an ensemble with two imputation methods as its elements. Benchmarking results on ten industrial datasets show that the proposed ensemble strategy has the potential to improve prediction accuracy compared to an individual imputation method, especially if multiple imputation is a component of the ensemble.

    Research on data mining technology and its application in teaching management in Colleges and Universities

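    In recent years, China's education sector has entered a period of rapid development. Universities are growing in scale, enrolment and teaching staff, and their modes of operation are becoming increasingly diversified and individualized, which makes teaching management more difficult. The traditional teaching management model can no longer meet the needs of university development, so there is an urgent need to improve the quality and efficiency of teaching management. With the continuous development and spread of information technology, the informatization of universities is also progressing steadily and has achieved notable results. As this informatization has deepened and spread, universities have accumulated large amounts of related data; only by fully mining and analysing the value contained in these massive data can the quality and efficiency of teaching management be further improved. Data mining is an effective method for this: it can fully mine and analyse the information hidden behind the data and provide decision support for teaching management. This thesis first ... Degree: Master of Engineering. Department and major: School of Software, Software Engineering. Student ID: X201223029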

    Comparison of Cart and Naive Bayesian Algorithm Performance to Diagnose Diabetes Mellitus

    Based on Indonesia's health profile for 2008, Diabetes Mellitus ranks sixth among causes of death for all ages in Indonesia, with a proportion of deaths of 5.7%, behind stroke, TB, hypertension, injury and perinatal causes. This is reinforced by WHO (2003): Diabetes Mellitus affected 194 million people, or 5.1 percent of the world's adult population, and is expected to increase to 333 million by 2025. In Indonesia in particular, the number of people with Diabetes Mellitus is increasing. In 2000, the number of Diabetes Mellitus sufferers had reached 8.4 million, and the prevalence in Indonesia is estimated to reach 21.3 million people by 2030. This prompts researchers and practitioners to focus on detecting and diagnosing diabetes mellitus and on preventing it, because the disease can cause complications. The research method comprised problem identification, data collection, pre-processing, classification, validation and evaluation, and conclusion. The algorithms used were CART and Naïve Bayes, applied to the Pima Indians diabetes dataset from the UCI repository, which consists of clinical data of patients who tested positive or negative for diabetes mellitus. Validation and evaluation used 10-fold cross-validation and a confusion matrix to assess precision, recall and F-Measure. The CART algorithm achieved an accuracy of 76.9337% with precision 0.764, recall 0.769 and F-Measure 0.765. With the Naïve Bayes algorithm, the same dataset gave an accuracy of 73.7569% with precision 0.732, recall 0.738 and F-Measure 0.734. From these results it is concluded that the CART algorithm is the suggested choice for diagnosing diabetes mellitus.
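
    The abstract above reports accuracy, precision, recall and F-Measure derived from a confusion matrix. As a rough illustration of how those four figures relate, the sketch below computes them for a single positive class from the four cells of a binary confusion matrix. The function name `binary_metrics` and the counts are hypothetical and are not taken from the study, whose reported values may be averages over both classes.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure for the positive class.

    tp/fp/fn/tn are the four cells of a binary confusion matrix.
    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Illustrative counts only (not the study's data).
metrics = binary_metrics(tp=80, fp=20, fn=20, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

    Precision, recall and F-measure computed this way are fractions between 0 and 1, whereas accuracy is often quoted as a percentage.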

    CLASSIFICATION MODEL FOR LEARNING DISABILITIES IN ELEMENTARY SCHOOL PUPILS

    Learning disability is a general term that describes specific kinds of learning problems. Although learning disability cannot be cured medically, several methods exist for detecting learning disabilities in a child. Existing methods classify learning disabilities in children in a binary fashion: a child is either normal or learning disabled. The focus of this paper is to extend the binary classification to multi-label classification of learning disabilities. The paper formulates and simulates a classification model for learning disabilities in primary school pupils. Information on the symptoms of learning disabilities in pupils was elicited by administering five hundred (500) questionnaires to teachers of Primary One to Four pupils in fifteen government-owned elementary schools within Ife Central Local Government Area, Ile-Ife, Osun State. The classification model was formulated using Principal Component Analysis, a rule-based system and the back-propagation algorithm. The formulated model was simulated using the Waikato Environment for Knowledge Analysis (WEKA) version 3.7.2. The performance of the model was evaluated using precision and accuracy. The classification models for primary one, primary two, primary three and primary four yielded precision rates of 95%, 91.18%, 93.10% and 93.60% respectively, while the accuracy results were 95.00%, 91.18%, 93.10% and 93.60% respectively. The results showed that the developed model is accurate and precise in classifying pupils with learning disabilities in primary schools. The model can be adopted for the management of pupils with learning disabilities.