
    Decision Tree Classifiers for Star/Galaxy Separation

    We study the star/galaxy classification efficiency of 13 different decision tree algorithms applied to photometric objects in the Sloan Digital Sky Survey Data Release Seven (SDSS DR7). Each algorithm is defined by a set of parameters which, when varied, produce different final classification trees. We extensively explore the parameter space of each algorithm, using the set of 884,126 SDSS objects with spectroscopic data as the training set. The efficiency of star-galaxy separation is measured using the completeness function. We find that the Functional Tree algorithm (FT) yields the best results as measured by the mean completeness in two magnitude intervals: 14 ≤ r ≤ 21 (85.2%) and r ≥ 19 (82.1%). We compare the performance of the tree generated with the optimal FT configuration to the classifications provided by the SDSS parametric classifier, 2DPHOT and Ball et al. (2006). We find that our FT classifier is comparable or better in completeness over the full magnitude range 15 ≤ r ≤ 21, with much lower contamination than all but the Ball et al. classifier. At the faintest magnitudes (r > 19), our classifier is the only one able to maintain high completeness (>80%) while still achieving low contamination (~2.5%). Finally, we apply our FT classifier to separate stars from galaxies in the full set of 69,545,326 SDSS photometric objects in the magnitude range 14 ≤ r ≤ 21.
    Comment: Submitted to A
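    A minimal sketch of the evaluation described above, assuming scikit-learn's DecisionTreeClassifier as a stand-in for the Functional Tree algorithm (which scikit-learn does not implement) and synthetic photometric features in place of SDSS data:

```python
# Hedged sketch: a decision tree classifier scored by completeness and
# contamination, the two metrics used in the abstract. The features and
# labels are synthetic stand-ins for SDSS photometry, not real data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))   # e.g. colours u-g, g-r, r-i, i-z and magnitude r
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
# label convention here: 1 = galaxy, 0 = star

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
pred = DecisionTreeClassifier(max_depth=8).fit(X_tr, y_tr).predict(X_te)

# Completeness: fraction of true galaxies recovered as galaxies.
completeness = ((pred == 1) & (y_te == 1)).sum() / (y_te == 1).sum()
# Contamination: fraction of predicted galaxies that are actually stars.
contamination = ((pred == 1) & (y_te == 0)).sum() / (pred == 1).sum()
print(f"completeness={completeness:.3f}  contamination={contamination:.3f}")
```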

    Taxonomic evidence applying intelligent information algorithm and the principle of maximum entropy: the case of asteroids families

    Numeric taxonomy aims to group operational taxonomic units (OTUs, also called taxons or taxa) into clusters by means of so-called structure analysis, using numerical methods. These clusters, which constitute families, are the object of this series of projects: they emerge from the structural analysis of the phenotypic characteristics of the OTUs and exhibit their relationships in terms of degrees of similarity, computed with tools such as i) the Euclidean distance and ii) nearest-neighbor techniques. Taxonomic evidence is thus gathered to quantify the similarity of each pair of OTUs (the pair-group method) obtained from the basic data matrix, and in this way the concept of the spectrum of an OTU, based on the states of its characters, is introduced. A new taxonomic criterion is thereby formulated, and a new approach to computational taxonomy is presented. This approach has already been employed in data mining, applying machine learning techniques, in particular Quinlan's C4.5 algorithm, to measure the efficiency achieved by the TDIDT family of algorithms when generating valid models of the data in classification problems via entropy gain under the maximum entropy principle.
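    The distance-based clustering step sketched in this abstract can be illustrated with SciPy's hierarchical clustering standing in for the authors' tools; the OTU character matrix below is synthetic and purely illustrative:

```python
# Hedged sketch of the numeric-taxonomy pipeline: pairwise Euclidean
# distances over a basic data matrix of OTU characters, then
# nearest-neighbour (single-linkage) agglomeration into candidate families.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
otus = rng.normal(size=(12, 4))      # 12 OTUs x 4 phenotypic characters

d = pdist(otus, metric="euclidean")  # condensed pairwise distance matrix
Z = linkage(d, method="single")      # nearest-neighbour linkage
families = fcluster(Z, t=2.0, criterion="distance")
print(families)                      # family label assigned to each OTU
```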

    Probabilistic Inference from Arbitrary Uncertainty using Mixtures of Factorized Generalized Gaussians

    This paper presents a general and efficient framework for probabilistic inference and learning from arbitrary uncertain information. It exploits the calculation properties of finite mixture models, conjugate families and factorization. Both the joint probability density of the variables and the likelihood function of the (objective or subjective) observation are approximated by a special mixture model, in such a way that any desired conditional distribution can be directly obtained without numerical integration. We have developed an extended version of the expectation maximization (EM) algorithm to estimate the parameters of mixture models from uncertain training examples (indirect observations). As a consequence, any piece of exact or uncertain information about both input and output values is consistently handled in the inference and learning stages. This ability, extremely useful in certain situations, is not found in most alternative methods. The proposed framework is formally justified from standard probabilistic principles, and illustrative examples are provided in the fields of nonparametric pattern classification, nonlinear regression and pattern completion. Finally, experiments on a real application and comparative results over standard databases provide empirical evidence of the utility of the method in a wide range of applications.
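    For orientation, here is a minimal sketch of the standard EM algorithm for a one-dimensional Gaussian mixture, i.e. the baseline scheme the paper extends so the E-step handles uncertain (indirect) observations rather than point values; the data and all parameter choices are illustrative:

```python
# Hedged sketch: classic EM for a two-component 1-D Gaussian mixture.
# The paper's extension replaces the point observations x with uncertain
# observations; this toy shows only the standard E/M iteration.
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

K = 2
pi = np.full(K, 1 / K)        # mixing weights
mu = rng.choice(x, K)         # initial component means
var = np.full(K, x.var())     # initial component variances

for _ in range(50):
    # E-step: responsibility of each component for each observation.
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft assignments.
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(pi, mu, var)
```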

    Classification of Categorical Uncertain Data Using Decision Tree

    Certain data are data whose values are known precisely, whereas the values of uncertain data are not known precisely; in real-life applications, data are often uncertain. Under data uncertainty, an attribute value is represented by a set of possible values. Data sets contain two types of attributes, numerical and categorical, and uncertainty can arise in both. Traditional decision tree algorithms work with certain data only, but the classification performance of a decision tree can be improved if the complete information in the data is considered; a probability density function (PDF) is used to improve the accuracy of the decision tree classifier. Existing systems for handling uncertain data work only on numerical attributes, i.e., on ranges of values; they cannot handle uncertain categorical attributes. This paper proposes a method for handling data uncertainty in categorical attributes, extending the decision tree algorithm to handle uncertain data. Experiments show that the classification performance of this decision tree can be enhanced.
    DOI: 10.17762/ijritcc2321-8169.15066
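    The core idea, records that carry a probability distribution over a categorical attribute's values and splits scored with fractional counts, can be sketched as follows; this toy information-gain computation is an illustrative assumption, not the paper's exact algorithm:

```python
# Hedged sketch: information gain for an uncertain categorical attribute.
# Each record stores P(value) over the attribute's domain, so the class
# counts in each branch are fractional (probability-weighted).
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# 4 records, attribute domain {A, B}; each row gives P(A), P(B).
attr_dist = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.6, 0.4],
                      [0.1, 0.9]])
labels = np.array([0, 1, 0, 1])            # binary class labels

gain = entropy(np.bincount(labels).astype(float))
for v in range(attr_dist.shape[1]):
    w = attr_dist[:, v]                    # fractional branch membership
    branch = np.array([w[labels == c].sum() for c in (0, 1)])
    gain -= (w.sum() / len(labels)) * entropy(branch)
print(f"information gain under uncertainty: {gain:.3f}")
```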

    Assessing and Remedying Coverage for a Given Dataset

    Data analysis impacts virtually every aspect of our society today. Often, this analysis is performed on an existing dataset, possibly collected through a process that the data scientists had limited control over. The existing data analyzed may not include the complete universe, but it is expected to cover the diversity of items in the universe. Lack of adequate coverage in the dataset can result in undesirable outcomes such as biased decisions and algorithmic racism, as well as creating vulnerabilities such as opening up room for adversarial attacks. In this paper, we assess the coverage of a given dataset over multiple categorical attributes. We first provide efficient techniques for traversing the combinatorial explosion of value combinations to identify any regions of attribute space not adequately covered by the data. Then, we determine the least amount of additional data that must be obtained to resolve this lack of adequate coverage. We confirm the value of our proposal through both theoretical analyses and comprehensive experiments on real data.
    Comment: in ICDE 201
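    A brute-force sketch of the coverage question, flagging value combinations with too few records, may help fix ideas; the paper's contribution is avoiding this full enumeration, and the dataset, domains, and threshold below are hypothetical:

```python
# Hedged sketch: coverage assessment over two categorical attributes by
# exhaustive enumeration. Combinations with fewer than `threshold` records
# are reported as inadequately covered.
from collections import Counter
from itertools import product

data = [("F", "young"), ("F", "young"), ("M", "old"),
        ("M", "young"), ("F", "old")]          # hypothetical records
domains = [("F", "M"), ("young", "middle", "old")]
threshold = 2                                   # minimum adequate count

counts = Counter(data)
uncovered = [c for c in product(*domains) if counts[c] < threshold]
print(uncovered)   # every combination except ('F', 'young')
```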

    Cost-Sensitive Decision Trees with Completion Time Requirements

    In many classification tasks, managing costs and completion times are the main concerns. In this paper, we assume that the completion time for classifying an instance is determined by its class label, and that a late penalty cost is incurred if the deadline is not met. This time requirement enriches the classification problem but poses a challenge to developing a solution algorithm. We propose an innovative approach for decision tree induction, which produces multiple candidate trees by allowing more than one splitting attribute at each node. The user can specify the maximum number of candidate trees to control the computational effort required to produce the final solution. In the tree-induction process, an allocation scheme is used to dynamically distribute the given number of candidate trees to splitting attributes according to their estimated contributions to cost reduction. The algorithm finds the final tree by backtracking. An extensive experiment shows that the algorithm outperforms the top-down heuristic and can effectively obtain optimal or near-optimal decision trees without excessive computation time.
    Keywords: classification, decision tree, cost- and time-sensitive learning, late penalty
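    The cost model described above can be sketched as follows, assuming the completion time is determined by the predicted class and a late penalty is charged whenever the deadline is missed; all labels, times, and costs are illustrative:

```python
# Hedged sketch: per-instance cost = misclassification cost plus a late
# penalty when the completion time implied by the predicted class exceeds
# the instance's deadline. Numbers and labels are made up for illustration.
completion_time = {"standard": 2.0, "expedited": 0.5}   # hours per class
misclass_cost = {("standard", "expedited"): 10.0,       # (true, pred) -> cost
                 ("expedited", "standard"): 25.0}
LATE_PENALTY = 50.0

def total_cost(true_label, pred_label, deadline):
    """Misclassification cost plus penalty if the deadline is missed."""
    cost = misclass_cost.get((true_label, pred_label), 0.0)
    if completion_time[pred_label] > deadline:
        cost += LATE_PENALTY
    return cost

print(total_cost("expedited", "standard", deadline=1.0))  # 25.0 + 50.0 = 75.0
```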