Decision Tree Classifiers for Star/Galaxy Separation
We study the star/galaxy classification efficiency of 13 different decision
tree algorithms applied to photometric objects in the Sloan Digital Sky Survey
Data Release Seven (SDSS DR7). Each algorithm is defined by a set of parameters
which, when varied, produce different final classification trees. We
extensively explore the parameter space of each algorithm, using the set of
SDSS objects with spectroscopic data as the training set. The
efficiency of star-galaxy separation is measured using the completeness
function. We find that the Functional Tree algorithm (FT) yields the best
results as measured by the mean completeness in two magnitude intervals: () and (). We compare the performance of the
tree generated with the optimal FT configuration to the classifications
provided by the SDSS parametric classifier, 2DPHOT and Ball et al. (2006). We
find that our FT classifier is comparable or better in completeness over the
full magnitude range , with much lower contamination than all but
the Ball et al. classifier. At the faintest magnitudes (), our classifier
is the only one able to maintain high completeness (80%) while still
achieving low contamination (). Finally, we apply our FT classifier
to separate stars from galaxies in the full set of SDSS
photometric objects in the magnitude range . Comment: Submitted to A
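The completeness and contamination figures quoted above are standard confusion-matrix ratios. A minimal sketch of how they are computed (function names are illustrative, not from the paper):

```python
def completeness(true_pos, false_neg):
    """Fraction of genuine members of a class (e.g. galaxies) that the
    classifier recovers."""
    return true_pos / (true_pos + false_neg)

def contamination(true_pos, false_pos):
    """Fraction of objects assigned to a class that are interlopers
    from the other class."""
    return false_pos / (true_pos + false_pos)

# Example: 800 galaxies correctly classified, 200 missed,
# 50 stars mislabeled as galaxies.
c = completeness(800, 200)    # 0.8
f = contamination(800, 50)    # 50/850, about 0.059
```

High completeness with low contamination, as claimed for the FT classifier at faint magnitudes, means both ratios are favorable simultaneously.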
Taxonomic evidence applying intelligent information algorithm and the principle of maximum entropy: the case of asteroids families
Numeric Taxonomy aims to group operational taxonomic units (OTUs, or taxa) into clusters by means of structure analysis with numeric methods. These clusters, which constitute families, are the purpose of this series of projects: they emerge from the structural analysis of phenotypical characteristics and express the relationships among OTUs in terms of degrees of similarity, computed with tools such as (i) the Euclidean distance and (ii) nearest-neighbor techniques. Taxonomic evidence is thus gathered to quantify the similarity of each pair of OTUs (pair-group method) obtained from the basic data matrix, and in this way the significant concept of the spectrum of an OTU is introduced, based on the states of its characters. A new taxonomic criterion is thereby formulated and a new approach to Computational Taxonomy is presented, one already employed in Data Mining with Machine Learning techniques, in particular Quinlan's C4.5 algorithm, measuring the efficiency achieved by the TDIDT family of algorithms when generating valid models of the data in classification problems, using entropy gain under the Maximum Entropy Principle.
Authors: Gregorio Perichinsky, Elizabeth Miriam Jiménez Rey, María Delia Grossi, Arturo Carlos Servetto (Universidad de Buenos Aires, Facultad de Ingeniería, Argentina); Félix Anibal Vallejos (Universidad de Buenos Aires, Facultad de Ingeniería; Universidad Nacional de La Plata, Facultad de Ciencias Astronómicas y Geofísicas, Argentina); Rosa Beatriz Orellana (Universidad Nacional de La Plata, Facultad de Ciencias Astronómicas y Geofísicas; Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina); Ángel Luis Plastino (Universidad Nacional de La Plata, Facultad de Ciencias Exactas, Departamento de Física, Argentina)
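The entropy gain that the C4.5/TDIDT family uses to choose splitting attributes can be sketched as follows (a minimal illustration; names are not from the paper):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy reduction obtained by partitioning `labels` according to
    the values of a categorical attribute (C4.5-style gain)."""
    n = len(labels)
    groups = {}
    for lab, val in zip(labels, attribute_values):
        groups.setdefault(val, []).append(lab)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ['a', 'a', 'b', 'b']
attr   = ['x', 'x', 'y', 'y']   # this attribute perfectly separates the labels
information_gain(labels, attr)  # 1.0 bit
```

TDIDT algorithms grow the tree by repeatedly choosing the attribute with the highest such gain at each node.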
The effect of missing values using genetic programming on evolvable diagnosis
Medical databases usually contain missing values due to the policy of
reducing stress and harm to the patient. In practice, missing values have been a
problem mainly because of the need to evaluate mathematical equations obtained
by genetic programming. One solution to this problem is to use fill-in methods to
estimate the missing values. This paper analyses three fill-in methods: (1) attribute
means, (2) conditional means, and (3) random number generation. The methods
are evaluated using sensitivity, specificity, and entropy to explain the change in
knowledge in the results. The results are illustrated on a breast cancer
database. Conditional means produced the best fill-in experimental results.
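The three fill-in methods compared above can be sketched on a single numeric column, with `None` marking missing values (a minimal illustration under assumed data layout; the paper's exact procedure may differ):

```python
import random
import statistics

def fill_attribute_mean(column):
    """(1) Replace each missing value with the mean of all observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in column]

def fill_conditional_mean(column, classes):
    """(2) Replace each missing value with the mean of observed values
    that share the same class label (e.g. benign vs. malignant)."""
    by_class = {}
    for v, c in zip(column, classes):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    means = {c: statistics.mean(vs) for c, vs in by_class.items()}
    return [means[c] if v is None else v for v, c in zip(column, classes)]

def fill_random(column, rng=None):
    """(3) Replace each missing value with a random draw from the
    observed value range."""
    rng = rng or random.Random(0)
    observed = [v for v in column if v is not None]
    lo, hi = min(observed), max(observed)
    return [rng.uniform(lo, hi) if v is None else v for v in column]
```

Conditional means uses the class label as extra information, which is consistent with it performing best in the paper's experiments.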
Probabilistic Inference from Arbitrary Uncertainty using Mixtures of Factorized Generalized Gaussians
This paper presents a general and efficient framework for probabilistic
inference and learning from arbitrary uncertain information. It exploits the
calculation properties of finite mixture models, conjugate families and
factorization. Both the joint probability density of the variables and the
likelihood function of the (objective or subjective) observation are
approximated by a special mixture model, in such a way that any desired
conditional distribution can be directly obtained without numerical
integration. We have developed an extended version of the expectation
maximization (EM) algorithm to estimate the parameters of mixture models from
uncertain training examples (indirect observations). As a consequence, any
piece of exact or uncertain information about both input and output values is
consistently handled in the inference and learning stages. This ability,
extremely useful in certain situations, is not found in most alternative
methods. The proposed framework is formally justified from standard
probabilistic principles and illustrative examples are provided in the fields
of nonparametric pattern classification, nonlinear regression and pattern
completion. Finally, experiments on a real application and comparative results
over standard databases provide empirical evidence of the utility of the method
in a wide range of applications.
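For orientation, the standard EM algorithm for a mixture model (which the paper extends to uncertain, indirect observations) can be sketched for a one-dimensional Gaussian mixture on exact data; this is the textbook baseline, not the paper's extended version:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_gmm_1d(data, k=2, iters=100):
    """Standard EM for a 1-D Gaussian mixture on exact observations."""
    lo, hi = min(data), max(data)
    mus = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]  # spread initial means
    sigmas = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(p)
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weights, means, and standard deviations
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)
    return weights, mus, sigmas

data = [0.1, -0.2, 0.05, 9.9, 10.2, 10.05]
w, mu, s = em_gmm_1d(data)   # means converge near 0 and near 10
```

The paper's contribution is to replace the exact observations `x` with uncertain ones, so that the E-step integrates over the observation's likelihood rather than evaluating it at a point.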
Classification of Categorical Uncertain Data Using Decision Tree
Certain data are data whose values are known precisely, whereas the values of uncertain data are not known precisely; in real-life applications, data are often uncertain. Under data uncertainty, an attribute value is represented by a set of possible values. Data sets contain two types of attributes, numerical and categorical, and uncertainty can arise in both. Traditional decision tree algorithms work with certain data only, but classification performance can be improved if the complete information in the data is considered; a Probability Density Function (PDF) is used to improve the accuracy of the decision tree classifier. Existing systems for handling uncertain data work only on numerical attributes, i.e., ranges of values, and cannot handle uncertain categorical attributes. This paper proposes a method for handling data uncertainty in categorical attributes, extending the decision tree algorithm to handle uncertain data. The experiments show that the classification performance of this decision tree can be enhanced.
DOI: 10.17762/ijritcc2321-8169.15066
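One common way to let a decision tree split on an uncertain categorical attribute is to represent the value as a probability distribution over categories and route fractional instance weights down the branches. A minimal sketch of that idea (data layout and names are illustrative, not from the paper):

```python
def split_uncertain(instances, attribute):
    """Distribute each instance's weight across branches according to the
    probability distribution of its uncertain categorical attribute.
    Each instance is a dict: {'weight': float, attribute: {category: prob}}."""
    branches = {}
    for inst in instances:
        for category, prob in inst[attribute].items():
            child = dict(inst)
            child['weight'] = inst['weight'] * prob
            branches.setdefault(category, []).append(child)
    return branches

rows = [
    {'weight': 1.0, 'color': {'red': 0.7, 'blue': 0.3}},  # uncertain value
    {'weight': 1.0, 'color': {'red': 1.0}},               # certain value
]
b = split_uncertain(rows, 'color')
# red branch carries weight 0.7 + 1.0; blue branch carries 0.3
```

Entropy and purity at each node are then computed over these fractional weights instead of raw instance counts.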
Assessing and Remedying Coverage for a Given Dataset
Data analysis impacts virtually every aspect of our society today. Often,
this analysis is performed on an existing dataset, possibly collected through a
process that the data scientists had limited control over. The existing data
analyzed may not include the complete universe, but it is expected to cover the
diversity of items in the universe. Lack of adequate coverage in the dataset
can result in undesirable outcomes such as biased decisions and algorithmic
racism, as well as creating vulnerabilities such as opening up room for
adversarial attacks.
In this paper, we assess the coverage of a given dataset over multiple
categorical attributes. We first provide efficient techniques for traversing
the combinatorial explosion of value combinations to identify any regions of
attribute space not adequately covered by the data. Then, we determine the
least amount of additional data that must be obtained to resolve this lack of
adequate coverage. We confirm the value of our proposal through both
theoretical analyses and comprehensive experiments on real data. Comment: in ICDE 201
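The coverage problem above can be illustrated with a naive baseline that exhaustively enumerates value combinations and flags those with too few supporting rows; the paper's techniques are precisely about avoiding this combinatorial enumeration, so this sketch (with illustrative names) only shows what is being computed:

```python
from itertools import product
from collections import Counter

def uncovered_patterns(rows, domains, threshold=1):
    """Return value combinations covered by fewer than `threshold` rows.
    `domains` maps each categorical attribute to its full value set."""
    attrs = list(domains)
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return [combo
            for combo in product(*(domains[a] for a in attrs))
            if counts[combo] < threshold]

rows = [{'sex': 'F', 'race': 'white'}, {'sex': 'M', 'race': 'black'}]
domains = {'sex': ['F', 'M'], 'race': ['white', 'black']}
uncovered_patterns(rows, domains)   # [('F', 'black'), ('M', 'white')]
```

The second step described in the abstract, finding the least additional data to repair coverage, amounts to choosing new rows whose value combinations hit as many uncovered patterns as possible.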
Cost-Sensitive Decision Trees with Completion Time Requirements
In many classification tasks, managing costs and completion times is the main concern. In this paper, we assume that the completion time for classifying an instance is determined by its class label, and that a late penalty cost is incurred if the deadline is not met. This time requirement enriches the classification problem but poses a challenge to developing a solution algorithm. We propose an innovative approach to decision tree induction which produces multiple candidate trees by allowing more than one splitting attribute at each node. The user can specify the maximum number of candidate trees to control the computational effort required to produce the final solution. In the tree-induction process, an allocation scheme is used to dynamically distribute the given number of candidate trees to splitting attributes according to their estimated contributions to cost reduction. The algorithm finds the final tree by backtracking. An extensive experiment shows that the algorithm outperforms the top-down heuristic and can effectively obtain the optimal or near-optimal decision tree without excessive computation time. Keywords: classification, decision tree, cost- and time-sensitive learning, late penalty
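The cost model described above, a class-dependent completion time plus a penalty when the deadline is missed, can be sketched as follows (a minimal illustration with an assumed linear penalty; the paper's exact cost structure may differ):

```python
def total_cost(misclass_cost, completion_time, deadline, late_penalty):
    """Cost of one classification decision: the misclassification cost of
    the predicted label, plus a late penalty proportional to how far the
    class-dependent completion time overshoots the deadline."""
    cost = misclass_cost
    if completion_time > deadline:
        cost += late_penalty * (completion_time - deadline)
    return cost

# On time: only the misclassification cost is paid.
total_cost(misclass_cost=5.0, completion_time=8.0, deadline=10.0, late_penalty=2.0)   # 5.0
# Two units late: 5.0 + 2.0 * 2.0
total_cost(misclass_cost=5.0, completion_time=12.0, deadline=10.0, late_penalty=2.0)  # 9.0
```

Because the completion time depends on the predicted class, a tree that is slightly less accurate but predicts faster-to-process classes can have lower total cost, which is what makes the induction problem harder than standard cost-sensitive learning.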