233 research outputs found

    Good methods for coping with missing data in decision trees

    We propose a simple and effective method for dealing with missing data in decision trees used for classification. We call this approach 'missingness incorporated in attributes' (MIA). It is very closely related to the technique of treating 'missing' as a category in its own right, generalizing it for use with continuous as well as categorical variables. We show through a substantial data-based study of classification accuracy that MIA exhibits consistently good performance across a broad range of data types and of sources and amounts of missingness. It is competitive with the best of the rest (particularly, a multiple imputation EM algorithm method; EMMI) while being conceptually and computationally simpler. A simple combination of MIA and EMMI is slower but even more accurate.
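
    The abstract does not spell out the split rule, so the following is only a minimal sketch of how MIA is commonly described for a single continuous attribute: at each candidate threshold the missing values are tried on the left branch and on the right branch, and a third candidate separates missing from observed values. The function names, the use of Gini impurity, and the midpoint thresholds are assumptions made for illustration, not details taken from the paper.

```python
import math


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def is_missing(v):
    """Treat None and NaN as missing (an assumption of this sketch)."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def weighted_impurity(left, right):
    """Size-weighted average Gini impurity of the two child nodes."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_mia_split(values, labels):
    """Return (impurity, description) of the best MIA-style split, or None."""
    observed = sorted({v for v in values if not is_missing(v)})
    # Hypothetical choice: thresholds at midpoints between observed values.
    thresholds = [(a + b) / 2 for a, b in zip(observed, observed[1:])]
    candidates = []
    for t in thresholds:
        # Options 1 and 2: send the missing values left, then right.
        for missing_left in (True, False):
            left, right = [], []
            for v, y in zip(values, labels):
                go_left = missing_left if is_missing(v) else v <= t
                (left if go_left else right).append(y)
            side = "left" if missing_left else "right"
            candidates.append(
                (weighted_impurity(left, right),
                 f"x <= {t} (missing goes {side})")
            )
    # Option 3: split on missingness itself.
    miss = [y for v, y in zip(values, labels) if is_missing(v)]
    obs = [y for v, y in zip(values, labels) if not is_missing(v)]
    if miss and obs:
        candidates.append(
            (weighted_impurity(miss, obs), "x is missing vs. x is observed")
        )
    return min(candidates) if candidates else None
```

    Because missingness is handled inside the split search, no imputation step is needed before the tree is grown, which is consistent with the abstract's claim of conceptual and computational simplicity.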

    Computational information geometry in statistics: foundations

    This paper lays the foundations for a new framework for numerically and computationally applying information geometric methods to statistical modelling.

    Visual Data Mining with ILOG Discovery

    Data mining deals with the discovery of useful and previously unknown knowledge from large data sets [3]. Traditional data mining tools use a combination of machine learning, statistical analysis, modeling techniques and database technology to find patterns, exceptions and subtle relationships in data. Typical applications include market segmentation, customer profiling, fraud detection, credit risk analysis, and business data development.

    Kernel density classification and boosting: an L2 sub analysis

    Kernel density estimation is a commonly used approach to classification. However, most of the theoretical results for kernel methods apply to estimation per se and not necessarily to classification. In this paper we show that when estimating the difference between two densities, the optimal smoothing parameters are increasing functions of the sample size of the complementary group, and we provide a small simulation study which examines the relative performance of kernel density methods when the final goal is classification. A relative newcomer to the classification portfolio is “boosting”, and this paper proposes an algorithm for boosting kernel density classifiers. We note that boosting is closely linked to a previously proposed method of bias reduction in kernel density estimation and indicate how it will enjoy similar properties for classification. We show that boosting kernel classifiers reduces the bias whilst only slightly increasing the variance, with an overall reduction in error. Numerical examples and simulations are used to illustrate the findings, and we also suggest further areas of research.
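
    As a rough illustration of the baseline this abstract starts from, the sketch below classifies a one-dimensional point by the sign of the difference between two prior-weighted Gaussian kernel density estimates. It does not implement the paper's boosting algorithm or its bandwidth results; the function names, the Gaussian kernel, and the hand-picked bandwidths `h0` and `h1` are all assumptions.

```python
import math


def gaussian_kde(sample, h):
    """Return a function x -> kernel density estimate with bandwidth h."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / norm

    return density


def kde_classifier(sample0, sample1, h0, h1):
    """Two-class rule based on the sign of p1*f1(x) - p0*f0(x)."""
    n0, n1 = len(sample0), len(sample1)
    f0 = gaussian_kde(sample0, h0)
    f1 = gaussian_kde(sample1, h1)
    # Class priors estimated from the sample sizes (an assumption).
    p0 = n0 / (n0 + n1)
    p1 = n1 / (n0 + n1)

    def classify(x):
        return 1 if p1 * f1(x) - p0 * f0(x) > 0 else 0

    return classify


# Hypothetical toy data: class 0 clusters near -2, class 1 near +2.
classify = kde_classifier(
    [-2.1, -1.9, -2.0, -1.5], [1.8, 2.0, 2.2, 2.5], h0=0.5, h1=0.5
)
```

    Since the rule depends only on the difference of the two estimates, the bandwidths that are best for estimating each density separately need not be best for classification, which is the distinction the abstract draws.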

    Citizen Science 2.0 : Data Management Principles to Harness the Power of the Crowd

    Citizen science refers to voluntary participation by the general public in scientific endeavors. Although citizen science has a long tradition, the rise of online communities and user-generated web content has the potential to greatly expand its scope and contributions. Citizens spread across a large area will collect more information than an individual researcher can. Because citizen scientists tend to make observations about areas they know well, data are likely to be very detailed. Although the potential for engaging citizen scientists is extensive, there are challenges as well. In this paper we consider one such challenge – creating an environment in which non-experts in a scientific domain can provide appropriate and accurate data regarding their observations. We describe the problem in the context of a research project that includes the development of a website to collect citizen-generated data on the distribution of plants and animals in a geographic region. We propose an approach that can improve the quantity and quality of data collected in such projects by organizing data using instance-based data structures. Potential implications of this approach are discussed and plans for future research to validate the design are described.

    Assessment of Financial Risk Prediction Models with Multi-criteria Decision Making Methods

    A wide range of classification models have been explored for financial risk prediction, but conclusions on which technique behaves better may vary when different performance evaluation measures are employed. Accordingly, this paper proposes the use of multiple criteria decision making tools in order to give a ranking of algorithms. More specifically, the selection of the most appropriate credit risk prediction method is here modeled as a multi-criteria decision making problem that involves a number of performance measures (criteria) and classification techniques (alternatives). An empirical study is carried out to evaluate the performance of ten algorithms over six real-life credit risk data sets. The results reveal that relying on a single performance measure may lead to unreliable conclusions, whereas this situation can be overcome by the application of multi-criteria decision making techniques.
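
    The abstract does not name the decision making tools it applies, so the sketch below uses TOPSIS purely as one well-known example of ranking alternatives (classifiers) over several criteria (performance measures). The classifier names, scores, and weights are invented for illustration and are not results from the paper.

```python
import math


def topsis(scores, weights, benefit):
    """Rank alternatives by closeness to the ideal solution.

    scores:  dict mapping alternative name -> list of criterion values
    weights: list of criterion weights
    benefit: list of bools, True if larger is better for that criterion
    Assumes no criterion column is all zeros and alternatives are not
    all identical (either would cause a division by zero).
    """
    names = list(scores)
    m = len(weights)
    # Vector-normalise each criterion column, then apply the weights.
    norms = [math.sqrt(sum(scores[n][j] ** 2 for n in names)) for j in range(m)]
    weighted = {
        n: [weights[j] * scores[n][j] / norms[j] for j in range(m)] for n in names
    }
    # Ideal and anti-ideal points, criterion by criterion.
    ideal, anti = [], []
    for j in range(m):
        column = [weighted[n][j] for n in names]
        best, worst = (max(column), min(column)) if benefit[j] else (min(column), max(column))
        ideal.append(best)
        anti.append(worst)

    def distance(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    closeness = {}
    for n in names:
        d_plus = distance(weighted[n], ideal)
        d_minus = distance(weighted[n], anti)
        closeness[n] = d_minus / (d_plus + d_minus)
    ranking = sorted(names, key=lambda n: closeness[n], reverse=True)
    return ranking, closeness


# Hypothetical scores: [accuracy (higher is better), misclassification cost (lower is better)]
scores = {"A": [0.90, 0.10], "B": [0.80, 0.20], "C": [0.85, 0.30]}
ranking, closeness = topsis(scores, weights=[0.5, 0.5], benefit=[True, False])
```

    In this toy example classifier A is best on both criteria, so any sensible method ranks it first; B and C each win on one criterion, which is the situation where the choice of a single measure would change the conclusion.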

    Partitional clustering of protein sequences - An inductive logic programming approach

    We present a novel approach to cluster sets of protein sequences, based on Inductive Logic Programming (ILP). Preliminary results show that the proposed method produces understandable descriptions/explanations of the clusters. Furthermore, it can be used as a knowledge elicitation tool to explain clusters proposed by other clustering approaches, such as standard phylogenetic programs.

    Thermal treatment of nuclear fuel-containing Magnox sludge radioactive waste

    Magnesium aluminosilicate and magnesium borosilicate glass formulations were developed and evaluated for the immobilisation of the radioactive waste known as Magnox sludge. Glass compositions were synthesised using two simplified bounding waste simulants, including corroded and metallic uranium and magnesium at waste loadings of up to 50 wt.%. The glasses immobilising corroded simulant waste formed heterogeneous and fully amorphous glasses, while those immobilising metallic wastes contained crystallites of UO2 and U3O8. Uranium speciation within the glass was investigated by micro-focus X-ray absorption near edge spectroscopy and it was shown that the borosilicate glass compositions were characterised by a slightly lower mean uranium oxidation state than the aluminosilicate counterparts. This had an impact upon the durability, and uranium within glasses of higher mean oxidation states was dissolved more readily. All materials showed dissolution rates that were comparable to simulant high level radioactive waste glasses, while the borosilicate-based formulations melted at a temperature suitable for modern vitrification technologies used in radioactive waste applications. These data highlight the potential for vitrification of hazardous radioactive Magnox sludge waste in borosilicate or aluminosilicate glass formulations, with the potential to achieve >95 % reduction in conditioned waste volume over the current baseline plan.