Data Mining and Machine Learning in Astronomy
We review the current state of data mining and machine learning in astronomy.
'Data Mining' can have a somewhat mixed connotation from the point of view of a
researcher in this field. If used correctly, it can be a powerful approach,
holding the potential to fully exploit the exponentially increasing amount of
available data, promising great scientific advance. However, if misused, it can
be little more than the black-box application of complex computing algorithms
that may give little physical insight, and provide questionable results. Here,
we give an overview of the entire data mining process, from data collection
through to the interpretation of results. We cover common machine learning
algorithms, such as artificial neural networks and support vector machines,
applications from a broad range of astronomy, emphasizing those where data
mining techniques directly resulted in improved science, and important current
and future directions, including probability density functions, parallel
algorithms, petascale computing, and the time domain. We conclude that, so long
as one carefully selects an appropriate algorithm, and is guided by the
astronomical problem at hand, data mining can be very much the powerful tool,
and not the questionable black box.
Comment: Published in IJMPD. 61 pages, uses ws-ijmpd.cls. Several extra
figures, some minor additions to the text
Learning Interpretable Rules for Multi-label Classification
Multi-label classification (MLC) is a supervised learning problem in which,
contrary to standard multiclass classification, an instance can be associated
with several class labels simultaneously. In this chapter, we advocate a
rule-based approach to multi-label classification. Rule learning algorithms are
often employed when one is not only interested in accurate predictions, but
also requires an interpretable theory that can be understood, analyzed, and
qualitatively evaluated by domain experts. Ideally, by revealing patterns and
regularities contained in the data, a rule-based theory yields new insights in
the application domain. Recently, several authors have started to investigate
how rule-based models can be used for modeling multi-label data. Discussing
this task in detail, we highlight some of the problems that make rule learning
considerably more challenging for MLC than for conventional classification.
While mainly focusing on our own previous work, we also provide a short
overview of related work in this area.
Comment: Preprint version. To appear in: Explainable and Interpretable Models
in Computer Vision and Machine Learning. The Springer Series on Challenges in
Machine Learning. Springer (2018). See
http://www.ke.tu-darmstadt.de/bibtex/publications/show/3077 for further
information
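To make the multi-label setting concrete, here is a minimal sketch of how a set of rules can assign several labels to one instance. It assumes the simplest possible scheme (take the union of the heads of all rules whose conditions hold); the attribute names, labels, and the function `predict_labels` are hypothetical and are not taken from the chapter, whose own algorithms may differ.

```python
def predict_labels(instance, rules):
    """Return the union of the label sets of all rules that cover the instance.

    `rules` is a list of (condition, head) pairs: `condition` maps attribute
    names to required values, `head` is the set of labels the rule predicts.
    This union scheme is an illustrative assumption, not the chapter's method.
    """
    labels = set()
    for condition, head in rules:
        if all(instance.get(attr) == value for attr, value in condition.items()):
            labels |= head
    return labels


# Hypothetical toy rules: a single rule head may carry more than one label.
rules = [
    ({"outlook": "sunny"}, {"beach"}),
    ({"wind": "strong"}, {"sailing", "kite"}),
    ({"outlook": "rainy"}, {"museum"}),
]

instance = {"outlook": "sunny", "wind": "strong"}
predicted = predict_labels(instance, rules)
```

Because each rule is a readable condition paired with a label set, a domain expert can inspect which rule produced which label, which is the interpretability argument the abstract makes.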
SparsePak: A Formatted Fiber Field-Unit for The WIYN Telescope Bench Spectrograph. II. On-Sky Performance
We present a performance analysis of SparsePak and the WIYN Bench
Spectrograph for precision studies of stellar and ionized gas kinematics of
external galaxies. We focus on spectrograph configurations with echelle and
low-order gratings yielding spectral resolutions of ~10000 between 500-900nm.
These configurations are of general relevance to the spectrograph performance.
Benchmarks include spectral resolution, sampling, vignetting, scattered light,
and an estimate of the system absolute throughput. Comparisons are made to
other, existing, fiber feeds on the WIYN Bench Spectrograph. Vignetting and
relative throughput are found to agree with a geometric model of the optical
system. An aperture-correction protocol for spectrophotometric standard-star
calibrations has been established using independent WIYN imaging data and the
unique capabilities of the SparsePak fiber array. The WIYN
point-spread-function is well-fit by a Moffat profile with a constant power-law
outer slope of index -4.4. We use SparsePak commissioning data to debunk a
long-standing myth concerning sky-subtraction with fibers: By properly treating
the multi-fiber data as a "long-slit" it is possible to achieve precision sky
subtraction with a signal-to-noise performance as good as or better than
conventional long-slit spectroscopy. No beam-switching is required, and hence
the method is efficient. Finally, we give several examples of science
measurements which SparsePak now makes routine. These include Hα
velocity fields of low surface-brightness disks, gas and stellar
velocity-fields of nearly face-on disks, and stellar absorption-line profiles
of galaxy disks at spectral resolutions of ~24,000.
Comment: To appear in ApJSupp (Feb 2005); 19 pages text; 7 tables; 27 figures
(embedded); high-resolution version at
http://www.astro.wisc.edu/~mab/publications/spkII_pre.pd
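The Moffat fit quoted in the abstract can be illustrated with a short sketch. It assumes the standard Moffat form I(r) = [1 + (r/alpha)^2]^(-beta) and that the quoted index of -4.4 is the logarithmic slope of the intensity profile at large radius, which would correspond to beta = 2.2. The scale radius `alpha` and the function names are hypothetical; the abstract gives no scale length.

```python
import math


def moffat(r, alpha, beta=2.2):
    """Unnormalised Moffat profile I(r) = [1 + (r/alpha)^2]^(-beta).

    beta = 2.2 is an assumption derived from the quoted outer slope of -4.4
    (for r >> alpha the profile tends to a power law of index -2*beta).
    """
    return (1.0 + (r / alpha) ** 2) ** (-beta)


def log_slope(r, alpha, beta=2.2, eps=1e-4):
    """Central-difference estimate of d ln I / d ln r at radius r."""
    r_hi, r_lo = r * (1.0 + eps), r * (1.0 - eps)
    d_ln_i = math.log(moffat(r_hi, alpha, beta)) - math.log(moffat(r_lo, alpha, beta))
    d_ln_r = math.log(r_hi) - math.log(r_lo)
    return d_ln_i / d_ln_r


# Far outside the (hypothetical) scale radius the slope approaches -2 * beta.
outer_slope = log_slope(1000.0, alpha=1.0)
```

Evaluating `log_slope` at smaller radii shows how the profile flattens toward the core, where the slope tends to zero.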
The sense of rotation of subhaloes in cosmological dark matter haloes
We present a detailed analysis of the velocity distribution and orientation
of orbits of subhaloes in high resolution cosmological simulations of dark
matter haloes. We find a trend for substructure to preferentially revolve in
the same direction as the sense of rotation of the host halo: there is an
excess of prograde satellite haloes. Throughout our suite of nine host haloes
(eight cluster sized objects and one galactic halo), on average 59% of
the satellites corotate with the host. Even when including satellites out to
five virial radii of the host, the signal remains, pointing to a
relation between the signal and the infall pattern of subhaloes. However, the
fraction of prograde satellites weakens to about 53% when observing the data
along a (random) line-of-sight and deriving the distributions in a way an
observer would infer them. This decrease in the observed prograde fraction has
its origin in the technique used by the observer to determine the sense of
rotation, which results in a possible misclassification of non-circular orbits.
We conclude that the existence of satellites on corotating orbits is another
prediction of the cold dark matter structure formation scenario, although it
will be difficult to verify observationally. Since the galactic halo
simulation gave the same result as the cluster-sized simulations, we assume
that the fraction of prograde orbits is independent of the scale of the system,
though more galactic simulations would be necessary to confirm this.
Comment: 16 pages, 9 figures, accepted by MNRAS; extended comparison with
previous work (mistake corrected) and observations, typos corrected
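One way to picture the prograde fraction is the sketch below. It assumes that a satellite counts as prograde when its orbital angular momentum (position cross velocity, measured relative to the host centre) has a positive projection onto the host's angular momentum vector. The abstract does not spell out the paper's exact criterion, so this definition, the function names, and the toy data are all assumptions.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def prograde_fraction(host_spin, satellites):
    """Fraction of satellites whose orbital angular momentum aligns with the host.

    `satellites` is a list of (position, velocity) pairs relative to the host
    centre; a satellite is counted as prograde if (r x v) . L_host > 0.
    """
    flags = [dot(cross(pos, vel), host_spin) > 0 for pos, vel in satellites]
    return sum(flags) / len(flags)


# Hypothetical toy data: host spin along +z, four satellites.
host_spin = (0.0, 0.0, 1.0)
satellites = [
    ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),    # r x v = (0, 0, 1): prograde
    ((1.0, 0.0, 0.0), (0.0, -1.0, 0.0)),   # r x v = (0, 0, -1): retrograde
    ((0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)),   # r x v = (0, 0, 1): prograde
    ((0.0, 2.0, 0.0), (0.5, 0.0, 0.1)),    # r x v = (0.2, 0, -1): retrograde
]
fraction = prograde_fraction(host_spin, satellites)
```

An observer only has projected positions and line-of-sight velocities, so the full cross product is not available; that loss of information is the source of the misclassification of non-circular orbits the abstract describes.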
Structural advances for pattern discovery in multi-relational databases
With ever-growing storage needs and drift towards very large relational storage settings, multi-relational data mining has become a prominent and pertinent field for discovering unique and interesting relational patterns. As a consequence, a whole suite of multi-relational data mining techniques is being developed. These techniques may either be extensions to the already existing single-table mining techniques or may be developed from scratch. For the traditionalists, single-table mining algorithms can be used to work on multi-relational settings by making inelegant and time-consuming joins of all target relations. However, complex relational patterns cannot be expressed in a single-table format and thus cannot be discovered. This work presents a new multi-relational frequent pattern mining algorithm termed Multi-Relational Frequent Pattern Growth (MRFP Growth). MRFP Growth is capable of mining multiple relations, linked with referential integrity, for frequent patterns that satisfy a user-specified support threshold. Empirical results on MRFP Growth performance and its comparison with the state-of-the-art multi-relational data mining algorithms like WARMR and Decentralized Apriori are discussed at length. MRFP Growth scores over the latter two techniques in number of patterns generated and speed. The realm of multi-relational clustering is also explored in this thesis. A multi-Relational Item Clustering approach based on Hypergraphs (RICH) is proposed. Experimentally RICH combined with MRFP Growth proves to be a competitive approach for clustering multi-relational data. The performance and quality of clusters generated by RICH are compared with other clustering algorithms. Finally, the thesis demonstrates the applied utility of the theoretical implications of the above-mentioned algorithms in an application framework for auto-annotation of images in an image database.
The system is called CoMMA, which stands for Combining Multi-relational Multimedia for Associations.
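The notion of a frequent pattern satisfying a user-specified support threshold can be sketched on a single table as follows. This is a brute-force, level-wise enumeration for illustration only; it is not MRFP Growth, it does not handle multiple relations, and the function name and toy transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support.

    Support is the fraction of transactions containing the itemset. Itemsets
    are enumerated by increasing size; the search stops at the first size with
    no frequent itemset, since every superset of an infrequent itemset is
    itself infrequent.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[candidate] = count / n
                found_any = True
        if not found_any:
            break
    return result


# Hypothetical toy transactions, each a set of items.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
patterns = frequent_itemsets(transactions, min_support=0.5)
```

Applying a single-table routine like this to relational data would first require joining the target relations, which is the costly step the abstract argues against.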
Non-Query-Based Pattern Mining and Sentiment Analysis for Massive Microblogging Online Texts
Pattern mining has been widely studied in the last decade given its great interest for research and its numerous applications in the real world. In this paper the definition of query and non-query based systems is proposed, highlighting the needs of non-query based systems in the era of Big Data. For this, we propose a new approach of a non-query based system that combines association rules, generalized rules and sentiment analysis in order to catalogue and discover opinion patterns in the social network Twitter. Association rules have been previously applied for sentiment analysis, but in most cases, they are used once the process of sentiment analysis is finished to see which tokens appear commonly related to a certain sentiment. On the other hand, they have also been used to discover patterns between sentiments. Our work differs from these in that it proposes a non-query based system which combines both techniques, in a mixed proposal of sentiment analysis and association rules to discover patterns and sentiment patterns in microblogging texts. The obtained rules generalize and summarize the sentiments obtained from a group of tweets about any character, brand or product mentioned in them. To study the performance of the proposed system, an initial set of 1.7 million tweets has been employed to analyse the most salient sentiments during the American pre-election campaign. The analysis of the obtained results supports the capability of the system to obtain association rules and patterns with great descriptive value in this use case. Parallelisms can be established in these patterns that match perfectly with real life events.
COPKIT Project, through the European Union's Horizon 2020 Research and Innovation Programme: 786687
Spanish Ministry for Economy and Competitiveness: TIN2015-64776-C3-1-R
Andalusian Government, through Data Analysis in Medicine: from Medical Records to Big Data Project: P18-RT-2947
Spanish Ministry of Education, Culture, and Sport: FPU18/00150
University of Granada
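The link between tokens and sentiments that an association rule expresses can be sketched with the two standard rule measures, support and confidence. The example assumes each tweet has already been reduced to a set of tokens plus a sentiment tag; the tag format, the toy tweets, and the function `rule_metrics` are hypothetical and do not reproduce the paper's pipeline.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of transactions containing antecedent and consequent
    confidence = that count divided by the count containing the antecedent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical pre-processed tweets: mentioned entity plus a sentiment tag.
tweets = [
    {"brand_x", "sent:positive"},
    {"brand_x", "sent:positive"},
    {"brand_x", "sent:negative"},
    {"brand_y", "sent:negative"},
]
support, confidence = rule_metrics(tweets, {"brand_x"}, {"sent:positive"})
```

Here the rule "brand_x implies positive sentiment" holds in half of all tweets and in two of the three tweets that mention brand_x, which is the kind of summary over a group of tweets that the abstract describes.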