A Naive Bayes Source Classifier for X-ray Sources
The Chandra Carina Complex Project (CCCP) provides a sensitive X-ray survey
of a nearby starburst region over >1 square degree in extent. Thousands of
faint X-ray sources are found, many concentrated into rich young stellar
clusters. However, significant contamination from unrelated Galactic and
extragalactic sources is present in the X-ray catalog. We describe the use of a
naive Bayes classifier to assign membership probabilities to individual
sources, based on source location, X-ray properties, and visual/infrared
properties. For the particular membership decision rule adopted, 75% of CCCP
sources are classified as members, 11% are classified as contaminants, and 14%
remain unclassified. The resulting sample of stars likely to be Carina members
is used in several other studies, which appear in a Special Issue of the ApJS
devoted to the CCCP.
Comment: Accepted for the ApJS Special Issue on the Chandra Carina Complex
Project (CCCP), scheduled for publication in May 2011. All 16 CCCP Special
Issue papers are available at
http://cochise.astro.psu.edu/Carina_public/special_issue.html through 2011 at
least. 19 pages, 7 figures
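As a rough illustration of the calculation a naive Bayes membership classifier performs, the sketch below multiplies a class prior by per-feature likelihoods (treated as conditionally independent given the class) and normalizes the result. The class names, the numbers, the `threshold` parameter, and the "unclassified" fallback are all hypothetical; the abstract does not state the actual CCCP likelihoods or decision rule.

```python
import math


def naive_bayes_posterior(priors, likelihoods):
    """Return posterior class probabilities under the naive Bayes assumption.

    priors: dict mapping class name -> prior probability (must be > 0)
    likelihoods: dict mapping class name -> list of per-feature likelihoods
                 p(feature_i | class), all > 0, assumed conditionally
                 independent given the class.
    """
    # Work in log space so products of many small likelihoods do not underflow.
    log_scores = {
        c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
        for c in priors
    }
    # Subtract the largest log score before exponentiating for stability.
    peak = max(log_scores.values())
    unnormalized = {c: math.exp(s - peak) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


def classify(posterior, threshold=0.9):
    """Hypothetical decision rule: accept the best class only if it is confident."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else "unclassified"


# Made-up example: two features (say, position and an X-ray property).
priors = {"member": 0.7, "contaminant": 0.3}
likelihoods = {"member": [0.8, 0.6], "contaminant": [0.1, 0.5]}
posterior = naive_bayes_posterior(priors, likelihoods)
label = classify(posterior)
```

A rule of this shape produces three outcomes (member, contaminant, unclassified), which matches the three categories reported in the abstract, although the real rule may differ.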
Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications
Food authenticity studies are concerned with determining if food samples have been correctly labelled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification
performance on several high-dimensional multiclass food authenticity datasets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be efficient in terms of computation and achieves excellent classification performance. In applications to several food authenticity datasets, our proposed method outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins.
Exploration of Large Digital Sky Surveys
We review some of the scientific opportunities and technical challenges posed
by the exploration of the large digital sky surveys, in the context of a
Virtual Observatory (VO). The VO paradigm will profoundly change the way
observational astronomy is done. Clustering analysis techniques can be used to
discover samples of rare, unusual, or even previously unknown types of
astronomical objects and phenomena. Exploration of the previously poorly probed
portions of the observable parameter space is especially promising. We
illustrate some of the possible types of studies with examples drawn from
DPOSS; much more complex and interesting applications are forthcoming.
Development of the new tools needed for an efficient exploration of these vast
data sets requires a synergy between astronomy and information sciences, with
great potential returns for both fields.
Comment: To appear in: Mining the Sky, eds. A. Banday et al., ESO Astrophysics
Symposia, Berlin: Springer Verlag, in press (2001). Latex file, 18 pages, 6
encapsulated postscript figures, style files included
Exploration of Parameter Spaces in a Virtual Observatory
Like every other field of intellectual endeavor, astronomy is being
revolutionised by the advances in information technology. There is an ongoing
exponential growth in the volume, quality, and complexity of astronomical data
sets, mainly through large digital sky surveys and archives. The Virtual
Observatory (VO) concept represents a scientific and technological framework
needed to cope with this data flood. Systematic exploration of the observable
parameter spaces, covered by large digital sky surveys spanning a range of
wavelengths, will be one of the primary modes of research with a VO. This is
where the truly new discoveries will be made, and new insights be gained about
the already known astronomical objects and phenomena. We review some of the
methodological challenges posed by the analysis of large and complex data sets
expected in the VO-based research. The challenges are driven both by the size
and the complexity of the data sets (billions of data vectors in parameter
spaces of tens or hundreds of dimensions), by the heterogeneity of the data and
measurement errors, including differences in basic survey parameters for the
federated data sets (e.g., in the positional accuracy and resolution,
wavelength coverage, time baseline, etc.), various selection effects, as well
as the intrinsic clustering properties (functional form, topology) of the data
distributions in the parameter spaces of observed attributes. Answering these
challenges will require substantial collaborative efforts and partnerships
between astronomers, computer scientists, and statisticians.
Comment: Invited review, 10 pages, Latex file with 4 eps figures, style files
included. To appear in Proc. SPIE, v. 4477 (2001)
Comparison of different processing approaches by SVM and RF on HS-MS eNose and NIR Spectrometry data for the discrimination of gasoline samples
In the quality control of flammable and combustible liquids, such as gasoline, both rapid analysis and automated data processing are of great importance from an economic viewpoint for the petroleum industry. The present work aims to evaluate chemometric tools applied to Headspace Mass Spectrometry (HS-MS eNose) and Near-Infrared Spectroscopy (NIRS) results to discriminate gasoline samples according to their Research Octane Number (RON). For this purpose, data from a total of 50 gasoline samples of two types of RON (95 and 98), analyzed by the two above-mentioned techniques, were studied. The HS-MS eNose and NIRS data were combined with non-supervised exploratory techniques, such as Hierarchical Cluster Analysis (HCA), as well as supervised classification techniques, namely Support Vector Machine (SVM) and Random Forest (RF). For supervised classification, low-level data fusion was additionally applied to evaluate whether the combined use of the data increases the scope of relevant information. The HCA results showed a clear clustering trend of the gasoline samples according to their RON with HS-MS eNose data. SVM in combination with 5-Fold Cross-Validation successfully classified 100% of the samples with the HS-MS eNose data set. The RF algorithm in combination with 5-Fold Cross-Validation achieved the best accuracy rate for the test set with the low-level data fusion system. Furthermore, it allowed us to identify the most important features that could define the differences between RON 95 and RON 98 gasoline. On the other hand, the HS-MS eNose and NIRS low-level data fusion reached better results than those obtained using NIRS data individually, with accuracy rates of 100% for both SVM and RF on the test set. In general, the performance of the SVM and RF algorithms was found to be similar.
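Two of the steps named in this abstract can be sketched in a few lines: low-level data fusion, which in chemometrics usually means concatenating the feature blocks of the two instruments sample by sample, and the index bookkeeping behind 5-fold cross-validation. The function names and the toy numbers are hypothetical, and the sketch omits the scaling, shuffling, and stratification a real analysis would likely apply.

```python
def fuse_low_level(block_a, block_b):
    """Concatenate two feature blocks row by row (one row per sample)."""
    if len(block_a) != len(block_b):
        raise ValueError("both blocks must describe the same samples")
    return [list(a) + list(b) for a, b in zip(block_a, block_b)]


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Toy data: two samples, two "HS-MS" features and one "NIRS" feature each.
hsms = [[1.0, 2.0], [3.0, 4.0]]
nirs = [[5.0], [6.0]]
fused = fuse_low_level(hsms, nirs)

# Fold layout for 50 samples, the sample count given in the abstract.
folds = list(k_fold_indices(50, k=5))
```

Each fold holds out 10 of the 50 samples for testing and trains on the remaining 40; a classifier such as SVM or RF would be fitted once per fold on the fused rows.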
Rapid Detection and Quantification of Adulterants in Fruit Juices Using Machine Learning Tools and Spectroscopy Data
Fruit juice production is one of the most important sectors in the beverage industry, and its adulteration by adding cheaper juices is very common. This study presents a methodology based on the combination of machine learning models and near-infrared spectroscopy for the detection and quantification of juice-to-juice adulteration. We evaluated 100% squeezed apple, pineapple, and orange juices, which were adulterated with grape juice at different percentages (5%, 10%, 15%, 20%, 30%, 40%, and 50%). The spectroscopic data have been combined with different machine learning tools to develop predictive models for the control of the juice quality. The use of non-supervised techniques, specifically model-based clustering, revealed a grouping trend of the samples depending on the type of juice. The use of supervised techniques such as random forest and linear discriminant analysis models has allowed for the detection of the adulterated samples with an accuracy of 98% in the test set. In addition, a Boruta algorithm was applied which selected 89 variables as significant for adulterant quantification, and support vector regression achieved a regression coefficient of 0.989 and a root mean squared error of 1.683 in the test set. These results show the suitability of the machine learning tools combined with spectroscopic data as a screening method for the quality control of fruit juices. In addition, a prototype application has been developed to share the models with other users and facilitate the detection and quantification of adulteration in juices.
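For readers unfamiliar with the error measure quoted above, the sketch below computes the root mean squared error of a set of predictions, together with the coefficient of determination. Treating the abstract's "regression coefficient" as R squared is an assumption (it could also denote a correlation coefficient), and the adulteration percentages used here are invented for illustration only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented adulteration levels (percent) and model predictions.
actual = [5.0, 10.0, 15.0, 20.0]
predicted = [6.0, 9.0, 15.0, 22.0]
error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
```

The RMSE is expressed in the same units as the target, so for a quantification model of this kind it reads directly as percentage points of adulterant.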