31 research outputs found

    Data Mining and Machine Learning in Astronomy

    We review the current state of data mining and machine learning in astronomy. 'Data Mining' can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black-box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those where data mining techniques directly resulted in improved science, and important current and future directions, including probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm, and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box. Comment: Published in IJMPD. 61 pages, uses ws-ijmpd.cls. Several extra figures, some minor additions to the text.
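    As a concrete illustration of the kind of algorithm the review covers, the sketch below trains a support vector machine on synthetic photometric colours for a toy star/galaxy separation task. The feature choices, class labels, and scikit-learn pipeline are illustrative assumptions, not taken from the review itself.

```python
# Minimal sketch: a support vector machine classifier of the kind discussed
# above, applied to a toy star/galaxy separation problem. The feature names
# and synthetic data are illustrative assumptions, not from the review.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Hypothetical photometric features: two colour indices per object.
n = 1000
colours_stars = rng.normal([0.3, 0.1], 0.15, size=(n, 2))
colours_galaxies = rng.normal([0.8, 0.6], 0.25, size=(n, 2))
X = np.vstack([colours_stars, colours_galaxies])
y = np.array([0] * n + [1] * n)  # 0 = star, 1 = galaxy

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale features, then fit an RBF-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")
```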

    Data challenges of time domain astronomy

    Astronomy has been at the forefront of the development of the techniques and methodologies of data intensive science for over a decade with large sky surveys and distributed efforts such as the Virtual Observatory. However, it faces a new data deluge with the next generation of synoptic sky surveys which are opening up the time domain for discovery and exploration. This brings new scientific opportunities but also fresh challenges, in terms of data rates from robotic telescopes and exponential complexity in linked data, as well as for the data mining algorithms used in classification and decision making. In this paper, we describe how an informatics-based approach, part of the so-called "fourth paradigm" of scientific discovery, is emerging to deal with these challenges. We review our experiences with the Palomar-Quest and Catalina Real-Time Transient Sky Surveys; in particular, addressing the issue of the heterogeneity of data associated with transient astronomical events (and other sensor networks) and how to manage and analyze it. Comment: 15 pages, 3 figures, to appear in special issue of Distributed and Parallel Databases on Data Intensive eScience.
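    One way to picture the heterogeneity problem described above is a minimal event record that carries a few core fields plus arbitrary survey-specific measurements. The sketch below uses Python dataclasses and invented field names; it is not the schema used by Palomar-Quest or CRTS.

```python
# Minimal sketch of one way to represent a heterogeneous transient event:
# a few core fields every survey provides, plus free-form measurements that
# differ between instruments. Field names are illustrative assumptions and
# not the schema used by Palomar-Quest or CRTS.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TransientEvent:
    event_id: str
    ra_deg: float          # right ascension, degrees
    dec_deg: float         # declination, degrees
    discovery_mjd: float   # modified Julian date of first detection
    survey: str
    # Survey-specific measurements (magnitudes, light-curve points, flags, ...)
    measurements: dict[str, Any] = field(default_factory=dict)

    def add_measurement(self, key: str, value: Any) -> None:
        """Attach an instrument-specific quantity without changing the schema."""
        self.measurements[key] = value

# Usage: two events from different surveys can carry different payloads.
e1 = TransientEvent("crts-0001", 150.1, 2.2, 55000.5, "CRTS",
                    {"mag_unfiltered": 18.4})
e2 = TransientEvent("pq-0042", 210.7, -5.3, 54321.1, "Palomar-Quest")
e2.add_measurement("lightcurve_r", [(54321.1, 19.2), (54322.1, 18.9)])
```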

    Fitting the integrated Spectral Energy Distributions of Galaxies

    Fitting the spectral energy distributions (SEDs) of galaxies is an almost universally used technique that has matured significantly in the last decade. Model predictions and fitting procedures have improved significantly over this time, attempting to keep up with the vastly increased volume and quality of available data. We review here the field of SED fitting, describing the modelling of ultraviolet to infrared galaxy SEDs, the creation of multiwavelength data sets, and the methods used to fit model SEDs to observed galaxy data sets. We touch upon the achievements and challenges in the major ingredients of SED fitting, with a special emphasis on describing the interplay between the quality of the available data, the quality of the available models, and the best fitting technique to use in order to obtain a realistic measurement as well as realistic uncertainties. We conclude that SED fitting can be used effectively to derive a range of physical properties of galaxies, such as redshift, stellar masses, star formation rates, dust masses, and metallicities, with care taken not to over-interpret the available data. Yet there still exist many issues such as estimating the age of the oldest stars in a galaxy, finer details of dust properties and dust-star geometry, and the influences of poorly understood, luminous stellar types and phases. The challenge for the coming years will be to improve both the models and the observational data sets to resolve these uncertainties. The present review will be made available on an interactive, moderated web page (sedfitting.org), where the community can access and change the text. The intention is to expand the text and keep it up to date over the coming years. Comment: 54 pages, 26 figures, Accepted for publication in Astrophysics & Space Science.
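    The core step of SED fitting, comparing an observed photometric SED against a grid of models with a goodness-of-fit statistic, can be sketched as below. The chi-squared fit, analytic scaling, and synthetic model grid are assumptions for illustration; real SED-fitting codes add redshift, dust attenuation, priors, and proper uncertainty estimation.

```python
# Minimal sketch of the basic step in SED fitting: compare an observed
# photometric SED against a grid of model SEDs with a chi-squared statistic
# and keep the best-scaled model. The model grid and fluxes are synthetic
# assumptions; real fitting codes add priors, redshift, dust, and more.
import numpy as np

def fit_sed(obs_flux, obs_err, model_grid):
    """Return (best_index, best_scale, chi2) for a grid of model SEDs.

    obs_flux, obs_err : arrays of shape (n_bands,)
    model_grid        : array of shape (n_models, n_bands), unit-normalised models
    """
    obs_flux = np.asarray(obs_flux, dtype=float)
    w = 1.0 / np.asarray(obs_err, dtype=float) ** 2

    # Best linear scaling of each model to the data (analytic least squares).
    scales = (model_grid * obs_flux * w).sum(axis=1) / (model_grid**2 * w).sum(axis=1)
    resid = obs_flux[None, :] - scales[:, None] * model_grid
    chi2 = (resid**2 * w).sum(axis=1)

    best = int(np.argmin(chi2))
    return best, float(scales[best]), float(chi2[best])

# Synthetic example: three toy models, one noisy observation drawn from model 1.
rng = np.random.default_rng(1)
models = np.array([[1.0, 0.8, 0.5, 0.3],
                   [0.4, 0.7, 1.0, 0.9],
                   [0.2, 0.4, 0.6, 1.0]])
obs = 2.5 * models[1] + rng.normal(0, 0.05, size=4)
print(fit_sed(obs, np.full(4, 0.05), models))
```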

    Report from the TeraGrid Evaluation Study, Part 1: Project Findings

    TeraGrid integrates multiple high-performance computing resources at distributed provider facilities. In 2006, the National Science Foundation (NSF) awarded a grant to the University of Michigan's School of Information (UM-SI) to conduct an external evaluation of TeraGrid. The primary goals of the evaluation were to provide specific information to TeraGrid managers that will increase the likelihood of TeraGrid success, and to give NSF and policy makers general data that will assist them in making strategic decisions about future directions for cyberinfrastructure. In order to accomplish these objectives, the UM-SI study assessed four aspects of the TeraGrid project: 1) progress in meeting user requirements; 2) impact of TeraGrid on research outcomes; 3) quality and content of TeraGrid education, outreach, and training activities; and 4) satisfaction among TeraGrid partners. We employed a mixed method approach that consisted of a user workshop; participant observation; document analysis; interviews with 86 individuals representing five different categories; a survey of a sample of 595 TeraGrid users; and two surveys to assess TeraGrid tutorials held in 2006 and 2007. Most of the data were collected from June 2006 through May 2007. Findings from the evaluation study are presented in two parts. In this first part, we report results from analyses of all data collected during the investigation. Detailed findings from the user survey are presented in Part 2 of the report. Funded by the National Science Foundation. Final report: https://deepblue.lib.umich.edu/bitstream/2027.42/61838/2/TeraGrid_Evaluation_Report_Project_Findings_August_2008.pdf

    Visual Analysis of Large Particle Data

    Particle simulations are a well-established and widely used numerical method in research and engineering. For example, particle simulations are employed to study fuel atomization in aircraft turbines, and the formation of the universe is investigated by simulating dark matter particles. The amounts of data produced in the process are immense: current simulations contain trillions of particles that move over time and interact with each other. Visualization offers great potential for the exploration, validation, and analysis of scientific data sets and the models underlying them. However, the focus is usually on structured data with a regular topology. Particles, in contrast, move freely through space and time; this viewpoint is known in physics as the Lagrangian frame of reference. Particles can be converted from the Lagrangian into a regular Eulerian frame of reference, such as a uniform grid, but for a large number of particles this involves considerable effort, and the conversion usually leads to a loss of precision together with increased memory consumption. In this dissertation, I investigate new visualization techniques based specifically on the Lagrangian viewpoint, enabling efficient and effective visual analysis of large particle data.
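    The Lagrangian-to-Eulerian conversion mentioned above can be sketched as a simple nearest-grid-point binning of particle positions onto a uniform grid. The grid resolution, box size, and synthetic particles below are assumptions; the point of the dissertation is precisely that such a conversion is expensive and lossy for very large particle counts.

```python
# Minimal sketch of the Lagrangian-to-Eulerian conversion mentioned above:
# depositing unordered particle positions onto a uniform grid via
# nearest-grid-point binning. Grid size and the synthetic particles are
# illustrative assumptions.
import numpy as np

def particles_to_grid(positions, box_size, n_cells):
    """Count particles per cell of a uniform n_cells^3 grid spanning [0, box_size)^3."""
    cell = np.floor(positions / box_size * n_cells).astype(int)
    cell = np.clip(cell, 0, n_cells - 1)
    flat = np.ravel_multi_index(cell.T, (n_cells, n_cells, n_cells))
    counts = np.bincount(flat, minlength=n_cells**3)
    return counts.reshape(n_cells, n_cells, n_cells)

rng = np.random.default_rng(42)
pos = rng.uniform(0.0, 100.0, size=(100_000, 3))   # synthetic particle positions
density = particles_to_grid(pos, box_size=100.0, n_cells=32)
print(density.sum(), density.shape)                # 100000 (32, 32, 32)
```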

    Galaxy evolution, cosmology and HPC : clustering studies applied to astronomy

    Tools to measure clustering are essential for the analysis of astronomical datasets and can potentially be used in other fields for data mining. The Two-point Correlation Function (TPCF), in particular, is used to characterize the distribution of matter and objects such as galaxies in the Universe. However, its computation time will be prohibitively slow given the significant increase in the size of datasets expected from future surveys. Thus, new computational techniques are necessary in order to measure clustering efficiently. The objective of this research was to investigate methods to accelerate the computation of the TPCF and to use the TPCF to probe an interesting scientific question dealing with the masses of galaxy clusters measured using data from the Planck satellite. An investigation was conducted to explore different techniques and architectures that can be used to accelerate the computation of the TPCF. The code CUTE was selected in particular to test shared-memory systems using OpenMP and GPU acceleration using CUDA. Modifications were then made to the code to improve the nearest-neighbour boxing technique. The results show that the modified code offers significantly improved performance. Additionally, a particularly effective implementation was used to measure the clustering of galaxy clusters detected by the Planck satellite: our results indicated that the clusters were more massive than had been inferred in previous work, providing an explanation for apparent inconsistencies in the Planck data.
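    For orientation, the sketch below shows the brute-force pair counting behind a simple two-point correlation function estimator (DD/RR - 1 for equal-sized catalogues). Codes such as CUTE replace this O(N^2) loop with neighbour boxing, OpenMP threads, or CUDA kernels; the toy catalogues and bin choices here are assumptions.

```python
# Minimal sketch of the brute-force pair counting behind a two-point
# correlation function estimator (the natural estimator DD/RR - 1).
# The synthetic catalogues are illustrative assumptions.
import numpy as np

def pair_counts(points, bins):
    """Histogram all pairwise separations of `points` into `bins`."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    iu = np.triu_indices(len(points), k=1)          # unique pairs only
    counts, _ = np.histogram(dist[iu], bins=bins)
    return counts

rng = np.random.default_rng(3)
data = rng.uniform(0, 1, size=(500, 3))             # toy "galaxy" catalogue
rand = rng.uniform(0, 1, size=(500, 3))             # matching random catalogue
bins = np.linspace(0.01, 0.5, 11)

dd = pair_counts(data, bins).astype(float)
rr = pair_counts(rand, bins).astype(float)
xi = dd / rr - 1.0                                  # natural estimator of the TPCF
print(np.round(xi, 3))
```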

    High performance computing in the cloud

    In recent years, interest in both scientific and business workflows has increased. A workflow is composed of a series of tools which should be executed in a predefined order to perform an analysis. Traditionally, these workflows were executed manually, sending the output of one tool to the next one in the analysis process. Many applications for executing workflows automatically have appeared recently; they ease the work of users while executing their analyses. In addition, from the computational point of view, some workflows require a significant amount of resources. Consequently, workflow execution has moved from single workstations to distributed environments such as Grids or Clouds. Data management and task scheduling are required to execute workflows efficiently in such environments. In this thesis, we propose a cloud-based HPC environment, focusing on task scheduling, resource auto-scaling, data management, and simplifying access to the resources with software clients. First, the cloud computing infrastructure is devised, which includes the base software (i.e. OpenStack) plus several additional modules aimed at improving authentication (i.e. LDAP) and data management (i.e. GridFTP, Globus Online and CloudFuse). Second, built on top of this infrastructure, the TORQUE distributed resource manager and the Maui scheduler have been configured to schedule and distribute tasks to the cloud-based workers. To reduce the number of idle nodes and the cost incurred by active cloud resources, we also propose a configurable auto-scaling technique, which is able to scale the execution cluster depending on the workload. Additionally, in order to simplify task submission to the TORQUE execution cluster, we have interconnected the Galaxy workflow management system with it, so users benefit from a simple way to execute their tasks. Finally, we conducted an experimental evaluation, composed of a number of studies with synthetic and real-world applications, to show the behaviour of the auto-scaled execution cluster managed by TORQUE and Maui. All experiments were performed using an OpenStack cloud computing environment, and the benchmarked applications come from a benchmarking suite specially designed for workflow scheduling in cloud computing environments. Cybershake, Ligo and Montage were the selected synthetic applications from the benchmarking suite; GECKO and a GWAS pipeline represent the real-world use cases, both having a diverse and heterogeneous set of tasks.
    The numerous technological advances in data acquisition techniques allow the massive production of enormous amounts of data in diverse fields such as astronomy, health and social networks. Nowadays, only a small part of this data can be analysed because of the lack of computational resources. High Performance Computing (HPC) strategies represent the only practical way to analyse such an overwhelming amount of data. However, in general, HPC techniques require large and expensive computing and storage infrastructures, usually not affordable or available for most users. Cloud computing, where users pay for the resources they need and when they actually need them, appears as an interesting alternative. Besides the savings in hardware infrastructure, cloud computing offers further advantages such as the removal of installation, administration and provisioning requirements. In addition, it enables users to access better hardware than they could usually afford, to scale the resources depending on their needs, and to benefit from greater fault tolerance, amongst others. The efficient utilisation of HPC resources becomes a fundamental task, particularly in cloud computing. We need to consider the cost of using HPC resources, especially in the case of cloud-based infrastructures, where users have to pay for storing, transferring and analysing data. It is therefore important to use generic task scheduling and auto-scaling techniques to exploit the computational resources efficiently. It is equally important to make these tasks user-friendly through the development of tools/applications (software clients) which act as an interface between the user and the infrastructure.
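    A queue-driven auto-scaling rule of the kind described above can be sketched as a small decision function: add workers when jobs queue up, release idle workers down to a minimum pool. The thresholds, names, and ClusterState fields below are hypothetical and are not the thesis's actual TORQUE/Maui integration.

```python
# Minimal sketch of a queue-based auto-scaling decision: grow the worker pool
# when jobs are waiting, shrink it when nodes sit idle. Thresholds and names
# are illustrative assumptions, not the thesis's actual implementation.
from dataclasses import dataclass

@dataclass
class ClusterState:
    queued_jobs: int    # jobs waiting in the batch queue
    busy_nodes: int     # workers currently running jobs
    idle_nodes: int     # workers with no assigned jobs

def scaling_decision(state: ClusterState,
                     min_nodes: int = 1,
                     max_nodes: int = 32,
                     jobs_per_new_node: int = 4) -> int:
    """Return how many nodes to add (positive) or remove (negative)."""
    total = state.busy_nodes + state.idle_nodes
    if state.queued_jobs > 0:
        # Scale out: one new node per batch of waiting jobs, capped at max_nodes.
        wanted = -(-state.queued_jobs // jobs_per_new_node)   # ceiling division
        return min(wanted, max_nodes - total)
    if state.idle_nodes > 0 and total > min_nodes:
        # Scale in: release idle nodes but never drop below min_nodes.
        return -min(state.idle_nodes, total - min_nodes)
    return 0

# Example: 10 queued jobs with 4 busy and 0 idle nodes -> add 3 nodes.
print(scaling_decision(ClusterState(queued_jobs=10, busy_nodes=4, idle_nodes=0)))
```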

    LDRD Annual Report FY2006
