Decision Tree Classifiers for Star/Galaxy Separation
We study the star/galaxy classification efficiency of 13 different decision
tree algorithms applied to photometric objects in the Sloan Digital Sky Survey
Data Release Seven (SDSS DR7). Each algorithm is defined by a set of parameters
which, when varied, produce different final classification trees. We
extensively explore the parameter space of each algorithm, using the set of
SDSS objects with spectroscopic data as the training set. The
efficiency of star-galaxy separation is measured using the completeness
function. We find that the Functional Tree algorithm (FT) yields the best
results as measured by the mean completeness in two magnitude intervals: () and (). We compare the performance of the
tree generated with the optimal FT configuration to the classifications
provided by the SDSS parametric classifier, 2DPHOT and Ball et al. (2006). We
find that our FT classifier is comparable or better in completeness over the
full magnitude range , with much lower contamination than all but
the Ball et al. classifier. At the faintest magnitudes (), our classifier
is the only one able to maintain high completeness (80%) while still
achieving low contamination (). Finally, we apply our FT classifier
to separate stars from galaxies in the full set of SDSS
photometric objects in the magnitude range . Comment: Submitted to A
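The completeness and contamination figures quoted above are standard confusion-matrix ratios. A minimal sketch of how they are computed (function names are illustrative, not from the paper):

```python
def completeness(true_pos, false_neg):
    """Fraction of genuine members of a class (e.g. galaxies) that the
    classifier recovers."""
    return true_pos / (true_pos + false_neg)

def contamination(true_pos, false_pos):
    """Fraction of objects assigned to a class that are interlopers
    from the other class."""
    return false_pos / (true_pos + false_pos)

# Example: 800 galaxies correctly classified, 200 missed,
# 50 stars mislabeled as galaxies.
c = completeness(800, 200)    # 0.8
f = contamination(800, 50)    # 50/850, about 0.059
```

High completeness with low contamination, as claimed for the FT classifier at faint magnitudes, means both ratios are favorable simultaneously.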
Taxonomic evidence applying intelligent information algorithm and the principle of maximum entropy: the case of asteroids families
Numeric Taxonomy aims to group operational taxonomic units (OTUs, or taxa) into clusters by means of structure analysis with numeric methods. These clusters, which constitute families, are the purpose of this series of projects: they emerge from the structural analysis of phenotypical characteristics and express the relationships among OTUs in terms of degrees of similarity, computed with tools such as (i) the Euclidean distance and (ii) nearest-neighbor techniques. Taxonomic evidence is thus gathered to quantify the similarity of each pair of OTUs (pair-group method) obtained from the basic data matrix, and in this way the significant concept of the spectrum of an OTU is introduced, based on the states of its characters. A new taxonomic criterion is thereby formulated and a new approach to Computational Taxonomy is presented, one already employed in Data Mining with Machine Learning techniques, in particular Quinlan's C4.5 algorithm, measuring the efficiency achieved by the TDIDT family of algorithms when generating valid models of the data in classification problems, using entropy gain under the Maximum Entropy Principle.
Authors: Gregorio Perichinsky, Elizabeth Miriam Jiménez Rey, María Delia Grossi, Arturo Carlos Servetto (Universidad de Buenos Aires, Facultad de Ingeniería, Argentina); Félix Anibal Vallejos (Universidad de Buenos Aires, Facultad de Ingeniería; Universidad Nacional de La Plata, Facultad de Ciencias Astronómicas y Geofísicas, Argentina); Rosa Beatriz Orellana (Universidad Nacional de La Plata, Facultad de Ciencias Astronómicas y Geofísicas; Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina); Ángel Luis Plastino (Universidad Nacional de La Plata, Facultad de Ciencias Exactas, Departamento de Física, Argentina)
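The entropy gain that the C4.5/TDIDT family uses to choose splitting attributes can be sketched as follows (a minimal illustration; names are not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy reduction obtained by partitioning `labels` according to
    the values of a categorical attribute (C4.5-style gain)."""
    n = len(labels)
    groups = {}
    for lab, val in zip(labels, attribute_values):
        groups.setdefault(val, []).append(lab)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ['a', 'a', 'b', 'b']
attr   = ['x', 'x', 'y', 'y']   # this attribute perfectly separates the labels
information_gain(labels, attr)  # 1.0 bit
```

TDIDT algorithms grow the tree by repeatedly choosing the attribute with the highest such gain at each node.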
The effect of missing values using genetic programming on evolvable diagnosis
Medical databases usually contain missing values due to the policy of
reducing stress and harm to the patient. In practice, missing values have been a
problem mainly because of the need to evaluate mathematical equations obtained
by genetic programming. One solution to this problem is to use fill-in methods to
estimate the missing values. This paper analyses three fill-in methods: (1) attribute
means, (2) conditional means, and (3) random number generation. The methods
are evaluated using sensitivity, specificity, and entropy to explain the change in
knowledge in the results. The results are illustrated on a breast cancer
database. Conditional means produced the best fill-in experimental results.
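The three fill-in methods compared above can be sketched on a single numeric column, with `None` marking missing values (a minimal illustration under assumed data layout; the paper's exact procedure may differ):

```python
import random
import statistics

def fill_attribute_mean(column):
    """(1) Replace each missing value with the mean of all observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in column]

def fill_conditional_mean(column, classes):
    """(2) Replace each missing value with the mean of observed values
    that share the same class label (e.g. benign vs. malignant)."""
    by_class = {}
    for v, c in zip(column, classes):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    means = {c: statistics.mean(vs) for c, vs in by_class.items()}
    return [means[c] if v is None else v for v, c in zip(column, classes)]

def fill_random(column, rng=None):
    """(3) Replace each missing value with a random draw from the
    observed value range."""
    rng = rng or random.Random(0)
    observed = [v for v in column if v is not None]
    lo, hi = min(observed), max(observed)
    return [rng.uniform(lo, hi) if v is None else v for v in column]
```

Conditional means uses the class label as extra information, which is consistent with it performing best in the paper's experiments.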
Probabilistic Inference from Arbitrary Uncertainty using Mixtures of Factorized Generalized Gaussians
This paper presents a general and efficient framework for probabilistic
inference and learning from arbitrary uncertain information. It exploits the
calculation properties of finite mixture models, conjugate families and
factorization. Both the joint probability density of the variables and the
likelihood function of the (objective or subjective) observation are
approximated by a special mixture model, in such a way that any desired
conditional distribution can be directly obtained without numerical
integration. We have developed an extended version of the expectation
maximization (EM) algorithm to estimate the parameters of mixture models from
uncertain training examples (indirect observations). As a consequence, any
piece of exact or uncertain information about both input and output values is
consistently handled in the inference and learning stages. This ability,
extremely useful in certain situations, is not found in most alternative
methods. The proposed framework is formally justified from standard
probabilistic principles and illustrative examples are provided in the fields
of nonparametric pattern classification, nonlinear regression and pattern
completion. Finally, experiments on a real application and comparative results
over standard databases provide empirical evidence of the utility of the method
in a wide range of applications.
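For orientation, the standard EM algorithm for a mixture model (which the paper extends to uncertain, indirect observations) can be sketched for a one-dimensional Gaussian mixture on exact data; this is the textbook baseline, not the paper's extended version:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(data, k=2, iters=100):
    """Standard EM for a 1-D Gaussian mixture on exact observations."""
    lo, hi = min(data), max(data)
    mus = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]  # spread initial means
    sigmas = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weights, means, and standard deviations
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)
    return weights, mus, sigmas

data = [0.1, -0.2, 0.05, 9.9, 10.2, 10.05]
w, mu, s = em_gmm_1d(data)   # means converge near 0 and near 10
```

The paper's contribution is to replace the exact observations `x` with uncertain ones, so that the E-step integrates over the observation's likelihood rather than evaluating it at a point.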
Classification of Categorical Uncertain Data Using Decision Tree
Certain data are data whose values are known precisely, whereas the values of uncertain data are not known precisely; in real-life applications, data are often uncertain. Under data uncertainty, an attribute value is represented by a set of possible values. Data sets contain two types of attributes, numerical and categorical, and uncertainty can arise in both. Traditional decision tree algorithms work with certain data only, but classification performance can be improved if the complete information in the data is considered; a Probability Density Function (PDF) is used to improve the accuracy of the decision tree classifier. Existing systems for handling uncertain data work only on numerical attributes, i.e., ranges of values, and cannot handle uncertain categorical attributes. This paper proposes a method for handling data uncertainty in categorical attributes, extending the decision tree algorithm to handle uncertain data. The experiments show that the classification performance of this decision tree can be enhanced.
DOI: 10.17762/ijritcc2321-8169.15066
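One common way to let a decision tree split on an uncertain categorical attribute is to represent the value as a probability distribution over categories and route fractional instance weights down the branches. A minimal sketch of that idea (data layout and names are illustrative, not from the paper):

```python
def split_uncertain(instances, attribute):
    """Distribute each instance's weight across branches according to the
    probability distribution of its uncertain categorical attribute.
    Each instance is a dict: {'weight': float, attribute: {category: prob}}."""
    branches = {}
    for inst in instances:
        for category, prob in inst[attribute].items():
            child = dict(inst)
            child['weight'] = inst['weight'] * prob
            branches.setdefault(category, []).append(child)
    return branches

rows = [
    {'weight': 1.0, 'color': {'red': 0.7, 'blue': 0.3}},  # uncertain value
    {'weight': 1.0, 'color': {'red': 1.0}},               # certain value
]
b = split_uncertain(rows, 'color')
# red branch carries weight 0.7 + 1.0; blue branch carries 0.3
```

Entropy and purity at each node are then computed over these fractional weights instead of raw instance counts.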
Assessing and Remedying Coverage for a Given Dataset
Data analysis impacts virtually every aspect of our society today. Often,
this analysis is performed on an existing dataset, possibly collected through a
process that the data scientists had limited control over. The existing data
analyzed may not include the complete universe, but it is expected to cover the
diversity of items in the universe. Lack of adequate coverage in the dataset
can result in undesirable outcomes such as biased decisions and algorithmic
racism, as well as creating vulnerabilities such as opening up room for
adversarial attacks.
In this paper, we assess the coverage of a given dataset over multiple
categorical attributes. We first provide efficient techniques for traversing
the combinatorial explosion of value combinations to identify any regions of
attribute space not adequately covered by the data. Then, we determine the
least amount of additional data that must be obtained to resolve this lack of
adequate coverage. We confirm the value of our proposal through both
theoretical analyses and comprehensive experiments on real data. Comment: in ICDE 201
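The coverage problem above can be illustrated with a naive baseline that exhaustively enumerates value combinations and flags those with too few supporting rows; the paper's techniques are precisely about avoiding this combinatorial enumeration, so this sketch (with illustrative names) only shows what is being computed:

```python
from itertools import product
from collections import Counter

def uncovered_patterns(rows, domains, threshold=1):
    """Return value combinations covered by fewer than `threshold` rows.
    `domains` maps each categorical attribute to its full value set."""
    attrs = list(domains)
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return [combo
            for combo in product(*(domains[a] for a in attrs))
            if counts[combo] < threshold]

rows = [{'sex': 'F', 'race': 'white'}, {'sex': 'M', 'race': 'black'}]
domains = {'sex': ['F', 'M'], 'race': ['white', 'black']}
uncovered_patterns(rows, domains)   # [('F', 'black'), ('M', 'white')]
```

The second step described in the abstract, finding the least additional data to repair coverage, amounts to choosing new rows whose value combinations hit as many uncovered patterns as possible.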
Cost-Sensitive Decision Trees with Completion Time Requirements
In many classification tasks, managing costs and completion times is the main concern. In this paper, we assume that the completion time for classifying an instance is determined by its class label, and that a late penalty cost is incurred if the deadline is not met. This time requirement enriches the classification problem but poses a challenge to developing a solution algorithm. We propose an innovative approach to decision tree induction which produces multiple candidate trees by allowing more than one splitting attribute at each node. The user can specify the maximum number of candidate trees to control the computational effort required to produce the final solution. In the tree-induction process, an allocation scheme is used to dynamically distribute the given number of candidate trees to splitting attributes according to their estimated contributions to cost reduction. The algorithm finds the final tree by backtracking. An extensive experiment shows that the algorithm outperforms the top-down heuristic and can effectively obtain the optimal or near-optimal decision tree without excessive computation time. Keywords: classification, decision tree, cost- and time-sensitive learning, late penalty
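The cost model described above, a class-dependent completion time plus a penalty when the deadline is missed, can be sketched as follows (a minimal illustration with an assumed linear penalty; the paper's exact cost structure may differ):

```python
def total_cost(misclass_cost, completion_time, deadline, late_penalty):
    """Cost of one classification decision: the misclassification cost of
    the predicted label, plus a late penalty proportional to how far the
    class-dependent completion time overshoots the deadline."""
    cost = misclass_cost
    if completion_time > deadline:
        cost += late_penalty * (completion_time - deadline)
    return cost

# On time: only the misclassification cost is paid.
total_cost(misclass_cost=5.0, completion_time=8.0, deadline=10.0, late_penalty=2.0)   # 5.0
# Two units late: 5.0 + 2.0 * 2.0
total_cost(misclass_cost=5.0, completion_time=12.0, deadline=10.0, late_penalty=2.0)  # 9.0
```

Because the completion time depends on the predicted class, a tree that is slightly less accurate but predicts faster-to-process classes can have lower total cost, which is what makes the induction problem harder than standard cost-sensitive learning.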