
    Data Mining and Machine Learning in Astronomy

    We review the current state of data mining and machine learning in astronomy. 'Data Mining' can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black-box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those where data mining techniques directly resulted in improved science, and important current and future directions, including probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm, and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box. Comment: Published in IJMPD. 61 pages, uses ws-ijmpd.cls. Several extra figures, some minor additions to the tex

    Learning Interpretable Rules for Multi-label Classification

    Multi-label classification (MLC) is a supervised learning problem in which, contrary to standard multiclass classification, an instance can be associated with several class labels simultaneously. In this chapter, we advocate a rule-based approach to multi-label classification. Rule learning algorithms are often employed when one is not only interested in accurate predictions, but also requires an interpretable theory that can be understood, analyzed, and qualitatively evaluated by domain experts. Ideally, by revealing patterns and regularities contained in the data, a rule-based theory yields new insights in the application domain. Recently, several authors have started to investigate how rule-based models can be used for modeling multi-label data. Discussing this task in detail, we highlight some of the problems that make rule learning considerably more challenging for MLC than for conventional classification. While mainly focusing on our own previous work, we also provide a short overview of related work in this area. Comment: Preprint version. To appear in: Explainable and Interpretable Models in Computer Vision and Machine Learning. The Springer Series on Challenges in Machine Learning. Springer (2018). See http://www.ke.tu-darmstadt.de/bibtex/publications/show/3077 for further informatio

    SparsePak: A Formatted Fiber Field-Unit for The WIYN Telescope Bench Spectrograph. II. On-Sky Performance

    We present a performance analysis of SparsePak and the WIYN Bench Spectrograph for precision studies of stellar and ionized gas kinematics of external galaxies. We focus on spectrograph configurations with echelle and low-order gratings yielding spectral resolutions of ~10000 between 500-900nm. These configurations are of general relevance to the spectrograph performance. Benchmarks include spectral resolution, sampling, vignetting, scattered light, and an estimate of the system absolute throughput. Comparisons are made to other, existing, fiber feeds on the WIYN Bench Spectrograph. Vignetting and relative throughput are found to agree with a geometric model of the optical system. An aperture-correction protocol for spectrophotometric standard-star calibrations has been established using independent WIYN imaging data and the unique capabilities of the SparsePak fiber array. The WIYN point-spread-function is well-fit by a Moffat profile with a constant power-law outer slope of index -4.4. We use SparsePak commissioning data to debunk a long-standing myth concerning sky-subtraction with fibers: By properly treating the multi-fiber data as a "long-slit" it is possible to achieve precision sky subtraction with a signal-to-noise performance as good as or better than conventional long-slit spectroscopy. No beam-switching is required, and hence the method is efficient. Finally, we give several examples of science measurements which SparsePak now makes routine. These include Hα velocity fields of low surface-brightness disks, gas and stellar velocity-fields of nearly face-on disks, and stellar absorption-line profiles of galaxy disks at spectral resolutions of ~24,000. Comment: To appear in ApJSupp (Feb 2005); 19 pages text; 7 tables; 27 figures (embedded); high-resolution version at http://www.astro.wisc.edu/~mab/publications/spkII_pre.pd

    The sense of rotation of subhaloes in cosmological dark matter haloes

    We present a detailed analysis of the velocity distribution and orientation of orbits of subhaloes in high resolution cosmological simulations of dark matter haloes. We find a trend for substructure to preferentially revolve in the same direction as the sense of rotation of the host halo: there is an excess of prograde satellite haloes. Throughout our suite of nine host haloes (eight cluster sized objects and one galactic halo), on average 59% of the satellites corotate with the host. Even when including satellites out to five virial radii of the host, the signal remains, pointing to a relation between the signal and the infall pattern of subhaloes. However, the fraction of prograde satellites weakens to about 53% when observing the data along a (random) line-of-sight and deriving the distributions in the way an observer would infer them. This decrease in the observed prograde fraction has its origin in the technique used by the observer to determine the sense of rotation, which results in a possible misclassification of non-circular orbits. We conclude that the existence of satellites on corotating orbits is another prediction of the cold dark matter structure formation scenario, although it will be difficult to verify observationally. Since the galactic halo simulation gave the same result as the cluster-sized simulations, we assume that the fraction of prograde orbits is independent of the scale of the system, though more galactic simulations would be necessary to confirm this. Comment: 16 pages, 9 figures, accepted by MNRAS; extended comparison with previous work (mistake corrected) and observations, typos correcte

    Structural advances for pattern discovery in multi-relational databases

    With ever-growing storage needs and a drift towards very large relational storage settings, multi-relational data mining has become a prominent and pertinent field for discovering unique and interesting relational patterns. As a consequence, a whole suite of multi-relational data mining techniques is being developed. These techniques may either be extensions to the already existing single-table mining techniques or may be developed from scratch. For the traditionalists, single-table mining algorithms can be used to work on multi-relational settings by making inelegant and time-consuming joins of all target relations. However, complex relational patterns cannot be expressed in a single-table format and thus cannot be discovered. This work presents a new multi-relational frequent pattern mining algorithm termed Multi-Relational Frequent Pattern Growth (MRFP Growth). MRFP Growth is capable of mining multiple relations, linked with referential integrity, for frequent patterns that satisfy a user-specified support threshold. Empirical results on MRFP Growth performance and its comparison with state-of-the-art multi-relational data mining algorithms like WARMR and Decentralized Apriori are discussed at length. MRFP Growth scores over the latter two techniques in number of patterns generated and speed. The realm of multi-relational clustering is also explored in this thesis. A multi-Relational Item Clustering approach based on Hypergraphs (RICH) is proposed. Experimentally, RICH combined with MRFP Growth proves to be a competitive approach for clustering multi-relational data. The performance and quality of clusters generated by RICH are compared with other clustering algorithms. Finally, the thesis demonstrates the applied utility of the theoretical implications of the above-mentioned algorithms in an application framework for auto-annotation of images in an image database. The system is called CoMMA, which stands for Combining Multi-relational Multimedia for Associations.
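
    To make the notion of a "user-specified support threshold" concrete, here is a minimal sketch of support counting over a single table of transactions. It is not MRFP Growth, WARMR, or Decentralized Apriori: it is plain brute-force enumeration, and the function name `frequent_itemsets`, the `max_size` cutoff, and the toy image-tag data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets whose support >= min_support.

    Support is the fraction of transactions containing the itemset.
    Brute-force enumeration up to `max_size` items; illustrative only.
    """
    n = len(transactions)
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # de-duplicate, fix a canonical order
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {
        itemset: count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


# Hypothetical toy data: tags attached to four images.
transactions = [
    ["sky", "cloud"],
    ["sky", "sea"],
    ["sky", "cloud", "sea"],
    ["tree"],
]
result = frequent_itemsets(transactions, min_support=0.5)
```

    With a threshold of 0.5, an itemset is kept only if it occurs in at least two of the four transactions, so `("tree",)` is dropped while `("sky",)` and `("cloud", "sky")` survive.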

    Non-Query-Based Pattern Mining and Sentiment Analysis for Massive Microblogging Online Texts

    Pattern mining has been widely studied in the last decade given its great interest for research and its numerous applications in the real world. In this paper the definition of query and non-query based systems is proposed, highlighting the need for non-query based systems in the era of Big Data. For this, we propose a new approach to a non-query based system that combines association rules, generalized rules and sentiment analysis in order to catalogue and discover opinion patterns in the social network Twitter. Association rules have been previously applied for sentiment analysis, but in most cases they are used once the process of sentiment analysis is finished, to see which tokens commonly appear related to a certain sentiment. On the other hand, they have also been used to discover patterns between sentiments. Our work differs from these in that it proposes a non-query based system which combines both techniques, in a mixed proposal of sentiment analysis and association rules to discover patterns and sentiment patterns in microblogging texts. The obtained rules generalize and summarize the sentiments obtained from a group of tweets about any character, brand or product mentioned in them. To study the performance of the proposed system, an initial set of 1.7 million tweets has been employed to analyse the most salient sentiments during the American pre-election campaign. The analysis of the obtained results supports the capability of the system to obtain association rules and patterns with great descriptive value in this use case. Parallelisms can be established in these patterns that match perfectly with real life events. COPKIT Project, through the European Union's Horizon 2020 Research and Innovation Programme 786687; Spanish Ministry for Economy and Competitiveness TIN2015-64776-C3-1-R; Andalusian Government, through Data Analysis in Medicine: from Medical Records to Big Data Project P18-RT-2947; Spanish Ministry of Education, Culture, and Sport FPU18/00150; University of Granad