
    A Naive Bayes Source Classifier for X-ray Sources

    The Chandra Carina Complex Project (CCCP) provides a sensitive X-ray survey of a nearby starburst region over >1 square degree in extent. Thousands of faint X-ray sources are found, many concentrated into rich young stellar clusters. However, significant contamination from unrelated Galactic and extragalactic sources is present in the X-ray catalog. We describe the use of a naive Bayes classifier to assign membership probabilities to individual sources, based on source location, X-ray properties, and visual/infrared properties. For the particular membership decision rule adopted, 75% of CCCP sources are classified as members, 11% are classified as contaminants, and 14% remain unclassified. The resulting sample of stars likely to be Carina members is used in several other studies, which appear in a Special Issue of the ApJS devoted to the CCCP. Comment: Accepted for the ApJS Special Issue on the Chandra Carina Complex Project (CCCP), scheduled for publication in May 2011. All 16 CCCP Special Issue papers are available at http://cochise.astro.psu.edu/Carina_public/special_issue.html through 2011 at least. 19 pages, 7 figures.
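The naive Bayes membership assignment described in this abstract can be illustrated with a minimal sketch. It is not the CCCP implementation: the class names, priors, per-feature likelihoods, and the 0.7 threshold are all hypothetical, and the decision rule shown is only one plausible way to produce "member", "contaminant", and "unclassified" labels. The sketch assumes every likelihood is strictly positive and that features are conditionally independent given the class.

```python
import math


def membership_posteriors(priors, likelihoods):
    """Return posterior class probabilities under the naive Bayes assumption.

    priors: dict mapping class name -> prior probability (hypothetical values).
    likelihoods: dict mapping class name -> list of per-feature likelihoods
                 p(feature | class), all assumed to be > 0.
    """
    # Work in log space so that many small likelihoods do not underflow.
    log_joint = {
        cls: math.log(priors[cls]) + sum(math.log(p) for p in likelihoods[cls])
        for cls in priors
    }
    peak = max(log_joint.values())
    unnormalized = {cls: math.exp(v - peak) for cls, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {cls: u / total for cls, u in unnormalized.items()}


def classify_source(posteriors, member_class="member", threshold=0.7):
    """Hypothetical decision rule: accept the top class only if it is confident."""
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] < threshold:
        return "unclassified"
    return "member" if best == member_class else "contaminant"


# Illustrative numbers only; they are not taken from the CCCP catalog.
priors = {"member": 0.6, "galactic": 0.2, "extragalactic": 0.2}
likelihoods = {
    "member": [0.5, 0.4],         # e.g. position, X-ray property
    "galactic": [0.1, 0.1],
    "extragalactic": [0.1, 0.2],
}
label = classify_source(membership_posteriors(priors, likelihoods))
```

Leaving low-confidence sources unclassified, rather than forcing every source into a class, is what allows a rule of this kind to report a third "unclassified" fraction alongside members and contaminants.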

    Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications

    Food authenticity studies are concerned with determining if food samples have been correctly labelled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification performance on several high-dimensional multiclass food authenticity datasets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be efficient in terms of computation and achieves excellent classification performance. In applications to several food authenticity datasets, our proposed method outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins.

    Exploration of Large Digital Sky Surveys

    We review some of the scientific opportunities and technical challenges posed by the exploration of the large digital sky surveys, in the context of a Virtual Observatory (VO). The VO paradigm will profoundly change the way observational astronomy is done. Clustering analysis techniques can be used to discover samples of rare, unusual, or even previously unknown types of astronomical objects and phenomena. Exploration of the previously poorly probed portions of the observable parameter space is especially promising. We illustrate some of the possible types of studies with examples drawn from DPOSS; much more complex and interesting applications are forthcoming. Development of the new tools needed for an efficient exploration of these vast data sets requires a synergy between astronomy and information sciences, with great potential returns for both fields. Comment: To appear in: Mining the Sky, eds. A. Banday et al., ESO Astrophysics Symposia, Berlin: Springer Verlag, in press (2001). Latex file, 18 pages, 6 encapsulated postscript figures, style files included.

    Exploration of Parameter Spaces in a Virtual Observatory

    Like every other field of intellectual endeavor, astronomy is being revolutionised by the advances in information technology. There is an ongoing exponential growth in the volume, quality, and complexity of astronomical data sets, mainly through large digital sky surveys and archives. The Virtual Observatory (VO) concept represents a scientific and technological framework needed to cope with this data flood. Systematic exploration of the observable parameter spaces, covered by large digital sky surveys spanning a range of wavelengths, will be one of the primary modes of research with a VO. This is where the truly new discoveries will be made, and new insights gained about the already known astronomical objects and phenomena. We review some of the methodological challenges posed by the analysis of large and complex data sets expected in the VO-based research. The challenges are driven by the size and complexity of the data sets (billions of data vectors in parameter spaces of tens or hundreds of dimensions); by the heterogeneity of the data and measurement errors, including differences in basic survey parameters for the federated data sets (e.g., in the positional accuracy and resolution, wavelength coverage, time baseline, etc.); by various selection effects; and by the intrinsic clustering properties (functional form, topology) of the data distributions in the parameter spaces of observed attributes. Answering these challenges will require substantial collaborative efforts and partnerships between astronomers, computer scientists, and statisticians. Comment: Invited review, 10 pages, Latex file with 4 eps figures, style files included. To appear in Proc. SPIE, v. 4477 (2001).

    Comparison of different processing approaches by SVM and RF on HS-MS eNose and NIR Spectrometry data for the discrimination of gasoline samples

    In the quality control of flammable and combustible liquids, such as gasoline, both rapid analysis and automated data processing are of great importance from an economic viewpoint for the petroleum industry. The present work aims to evaluate chemometric tools applied to Headspace Mass Spectrometry (HS-MS eNose) and Near-Infrared Spectroscopy (NIRS) results to discriminate gasoline samples according to their Research Octane Number (RON). For this purpose, data from a total of 50 gasoline samples of two types (RON 95 and RON 98), analyzed by the two above-mentioned techniques, were studied. The HS-MS eNose and NIRS data were combined with non-supervised exploratory techniques, such as Hierarchical Cluster Analysis (HCA), as well as with supervised classification techniques, namely Support Vector Machine (SVM) and Random Forest (RF). For supervised classification, low-level data fusion was additionally applied to evaluate whether the combined use of the data increases the scope of relevant information. The HCA results showed a clear clustering trend of the gasoline samples according to their RON with HS-MS eNose data. SVM in combination with 5-Fold Cross-Validation successfully classified 100% of the samples with the HS-MS eNose data set. The RF algorithm in combination with 5-Fold Cross-Validation achieved the best accuracy rate for the test set with the low-level data fusion system. Furthermore, it allowed us to identify the most important features that could define the differences between RON 95 and RON 98 gasoline. On the other hand, the HS-MS eNose and NIRS low-level data fusion reached better results than those obtained using NIRS data individually, with accuracy rates of 100% for both SVM and RF on the test set. In general, the performance of the SVM and RF algorithms was found to be similar.
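Low-level data fusion and 5-fold cross-validation, as mentioned in this abstract, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the per-block autoscaling step is an assumed preprocessing choice that the abstract does not specify, and the fold assignment is a simple interleaved split without shuffling or stratification.

```python
import statistics


def autoscale(block):
    """Standardize each column of a block (list of rows) to zero mean, unit variance."""
    columns = list(zip(*block))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 to avoid a zero division.
    spreads = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(x - m) / s for x, m, s in zip(row, means, spreads)]
        for row in block
    ]


def low_level_fusion(block_a, block_b):
    """Concatenate two instrument blocks sample by sample after scaling each block."""
    if len(block_a) != len(block_b):
        raise ValueError("both blocks must describe the same samples")
    scaled_a, scaled_b = autoscale(block_a), autoscale(block_b)
    return [row_a + row_b for row_a, row_b in zip(scaled_a, scaled_b)]


def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, test_indices) pairs covering every sample once."""
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    return [
        ([i for i in range(n_samples) if i not in set(fold)], fold)
        for fold in folds
    ]


# Toy data: 4 samples, 2 hypothetical HS-MS variables and 3 hypothetical NIRS variables.
hs_ms = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
nirs = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.3], [0.3, 0.4, 0.3], [0.4, 0.3, 0.3]]
fused = low_level_fusion(hs_ms, nirs)   # each fused row has 2 + 3 = 5 variables
splits = k_fold_indices(50, k=5)        # 50 samples, as in the study
```

Scaling each block before concatenation keeps one instrument's numeric range from dominating the fused feature vector; any classifier, such as an SVM or a random forest, could then be trained on each training split and scored on the held-out fold.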

    Rapid Detection and Quantification of Adulterants in Fruit Juices Using Machine Learning Tools and Spectroscopy Data

    Fruit juice production is one of the most important sectors in the beverage industry, and its adulteration by adding cheaper juices is very common. This study presents a methodology based on the combination of machine learning models and near-infrared spectroscopy for the detection and quantification of juice-to-juice adulteration. We evaluated 100% squeezed apple, pineapple, and orange juices, which were adulterated with grape juice at different percentages (5%, 10%, 15%, 20%, 30%, 40%, and 50%). The spectroscopic data have been combined with different machine learning tools to develop predictive models for the control of the juice quality. The use of non-supervised techniques, specifically model-based clustering, revealed a grouping trend of the samples depending on the type of juice. The use of supervised techniques such as random forest and linear discriminant analysis models has allowed for the detection of the adulterated samples with an accuracy of 98% in the test set. In addition, a Boruta algorithm was applied which selected 89 variables as significant for adulterant quantification, and support vector regression achieved a regression coefficient of 0.989 and a root mean squared error of 1.683 in the test set. These results show the suitability of the machine learning tools combined with spectroscopic data as a screening method for the quality control of fruit juices. In addition, a prototype application has been developed to share the models with other users and facilitate the detection and quantification of adulteration in juices.