Machine Learning and Data Analysis in Astroinformatics
Astroinformatics is a new discipline at the crossroads of astronomy, advanced statistics and computer science. With next-generation sky surveys, space missions and modern instrumentation, astronomy will enter the petascale regime, raising the demand for advanced computer science techniques and for hardware and software solutions for data management, analysis, efficient automation and knowledge discovery. This tutorial reviews important developments in astroinformatics over the past years and discusses some relevant research questions and concrete problems. The contribution ends with a short review of the special session papers in these proceedings, as well as perspectives and challenges for the near future.
Accelerating inference in cosmology and seismology with generative models
Statistical analyses in many physical sciences require running simulations of the system being examined. Such simulations provide information complementary to theoretical analytic models and represent an invaluable tool for investigating the dynamics of complex systems. However, running simulations is often computationally expensive, and the large number of mock realisations required to obtain sufficient statistical precision often makes the problem intractable. In recent years, machine learning has emerged as a possible solution to speed up the generation of scientific simulations. Machine-learning generative models usually rely on iteratively feeding true simulations to the algorithm until it learns the important common features and is capable of producing accurate simulations in a fraction of the time. In this thesis, advanced machine-learning algorithms are explored and applied to the challenge of accelerating physical simulations. Various techniques are applied to problems in cosmology and seismology, and the benefits and limitations of such an approach are shown through a critical analysis. The algorithms are applied to compelling problems in these fields, including surrogate models for the seismic wave equation, the emulation of cosmological summary statistics, and the fast generation of large simulations of the Universe. These problems are formulated within a relevant statistical framework and tied to real data-analysis pipelines. In the conclusions, a critical overview of the results is provided, together with an outlook on possible future extensions of the work presented in the thesis.
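As a minimal illustration of the emulation idea described above (not the thesis's actual models), a cheap surrogate can be fitted to a handful of expensive simulator runs and then evaluated at new parameter values almost for free. The "simulation" and the polynomial surrogate below are toy assumptions:

```python
import numpy as np

# Toy "expensive simulation": a smooth mapping from a parameter to a summary
# statistic. In practice this would be a cosmological or seismic simulation.
def expensive_simulation(theta):
    return np.sin(theta) + 0.5 * theta**2

# Run the expensive simulator only at a few training points...
train_theta = np.linspace(0.0, 2.0, 8)
train_out = expensive_simulation(train_theta)

# ...then fit a cheap surrogate (here a cubic polynomial) to emulate it.
surrogate = np.polynomial.Polynomial.fit(train_theta, train_out, deg=3)

# The surrogate can now be evaluated at arbitrary parameters in microseconds.
theta_new = 1.3
print(abs(float(surrogate(theta_new)) - expensive_simulation(theta_new)) < 0.05)  # True
```

Real emulators replace the polynomial with Gaussian processes or neural networks, but the train-on-few, predict-on-many structure is the same.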
Classification of Supernovae and Stars in the Era of Big Data and Artificial Intelligence
In recent years, artificial intelligence (AI) has been applied in many fields of research. It is particularly well suited to astronomy, in which very large datasets from sky surveys cover a wide range of observations. The upcoming Legacy Survey of Space and Time (LSST) presents unprecedented big-data challenges, requiring state-of-the-art methods to produce, process and analyse information. Observations of Type Ia supernovae help constrain cosmological parameters such as the dark energy equation of state, and AI will be instrumental in the next generation of cosmological measurements due to limited spectroscopic resources. AI also has the ability to improve our astrophysical understanding by perceiving patterns in data that may not be obvious to humans. In this thesis we investigate how advanced AI methods can be used in classification tasks: to identify Type Ia supernovae for cosmology from photometry using supervised learning, and to determine a low-dimensional representation of stellar spectra, inferring astrophysical concepts through unsupervised learning. In preparation for photometric classification of transients from LSST we run tests with different training samples. Using estimates of the depth to which the 4-metre Multi-Object Spectroscopic Telescope (4MOST) Time-Domain Extragalactic Survey (TiDES) can classify transients, we simulate a magnitude-limited training sample reaching rAB = 22.5 mag. We run our simulations with the software snmachine, a photometric classification pipeline using machine learning. The machine-learning algorithms struggle to classify supernovae when the training sample is magnitude-limited, as its features are not representative of the test set. In contrast, representative training samples perform very well, particularly when redshift information is included.
Classification performance noticeably improves when we combine the magnitude-limited training sample with a simulated realistic sample of faint, high-redshift supernovae observed from larger spectroscopic facilities; the algorithms' average area under the ROC curve (AUC) scores over 10 runs increase from the range 0.547-0.628 to 0.946-0.969, and the purity of the classified sample reaches 95% in all runs for two of the four algorithms. By creating new, artificial light curves using the augmentation software avocado, we achieve a purity in our classified sample of 95% in all 10 runs for all machine-learning algorithms considered. We also reach a highest average AUC score of 0.986 with the artificial neural network algorithm. Having real faint supernovae to complement our magnitude-limited sample is a crucial requirement for optimising a 4MOST spectroscopic sample. However, our results are a proof of concept that augmentation is also necessary to achieve the best classification results. During our investigation into an optimised training sample, we assumed that every training object has the correct class label. Spectroscopy is a reliable method to confirm object classification and is used to define our training sample. However, it is not necessarily perfect, and we therefore consider the impact of potential misclassifications of training objects. Taking the predicted error rates in spectroscopic classification from the literature, we apply contamination to a TiDES training sample using simulated LSST data. With the recurrent neural network from the software SuperNNova, we determine appropriate hyperparameters using a perfect, uncontaminated TiDES training sample and then train a model on its contaminated counterpart to study the effects of contamination on photometric classification. We find that a contaminated training sample produces very little difference in classification performance, even when contamination is increased to 5%.
Contamination causes more objects of both Type Ia and non-Ia to be classified as Ia, increasing efficiency but decreasing purity, with changes of less than 1% on average. Similarly, we see a decrease of 0.1% in average accuracy and no clear difference in AUC score, which varies only at the fourth significant figure. These results are promising for photometric classification. Contaminated training appears to have little impact, and propagation to cosmological measurements is expected to be minimal. In a separate study, we apply deep learning to data in the European Southern Observatory (ESO) archive using an autoencoder neural network, with the aim of improving similarity-based searches using the network's own interpretation of the data. We train the network to reconstruct stellar spectra by passing them through an information bottleneck, creating a low-dimensional representation of the data. We find that this representation includes several informative dimensions and, comparing to known astrophysical labels, see clear correlations for two key nodes: the network learns concepts of radial velocity and effective temperature, completely unsupervised. The interpretation of the other informative nodes appears ambiguous, leaving room for future investigation. The results presented in this thesis emphasise the practical capabilities of AI in an astronomical context: classification of astrophysical objects can be conducted through supervised learning using known labels, as well as through unsupervised learning in a physics-agnostic process.
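The purity and efficiency figures quoted above follow the standard confusion-matrix definitions: purity (precision) is the fraction of objects classified as Ia that truly are Ia, and efficiency (completeness) is the fraction of true Ia recovered. A minimal sketch with illustrative labels, not survey data:

```python
# Purity (precision) and efficiency (completeness) for a binary Ia vs non-Ia
# classification, computed from true and predicted labels (1 = Ia, 0 = non-Ia).
def purity_efficiency(true_ia, pred_ia):
    tp = sum(t and p for t, p in zip(true_ia, pred_ia))        # true Ia classified Ia
    fp = sum((not t) and p for t, p in zip(true_ia, pred_ia))  # contaminants
    fn = sum(t and (not p) for t, p in zip(true_ia, pred_ia))  # missed Ia
    purity = tp / (tp + fp) if tp + fp else 0.0
    efficiency = tp / (tp + fn) if tp + fn else 0.0
    return purity, efficiency

true_ia = [1, 1, 1, 1, 0, 0, 0, 0]  # illustrative labels only
pred_ia = [1, 1, 1, 0, 1, 0, 0, 0]
print(purity_efficiency(true_ia, pred_ia))  # (0.75, 0.75)
```

Classifying more objects as Ia, as contaminated training does, raises efficiency (fewer missed Ia) while lowering purity (more contaminants), exactly the trade-off reported above.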
Advancing the search for gravitational waves using machine learning
Over 100 years ago Einstein formulated his now-famous theory of general relativity. In it he lays out a set of equations that led to the beginning of a brand-new astronomical field: gravitational-wave (GW) astronomy. The aim of the LIGO-Virgo-KAGRA Collaboration (LVK) is the detection of GW events from some of the most violent and cataclysmic events in the known universe. The LVK detectors are large-scale Michelson interferometers able to detect GWs from a range of sources including binary black holes (BBHs), binary neutron stars (BNSs), neutron star-black hole binaries (NSBHs), supernovae and stochastic GW backgrounds. Although these events release an incredible amount of energy, the amplitudes of the GWs that reach Earth are incredibly small.
The LVK uses sophisticated techniques such as matched filtering and Bayesian inference to both detect GW events and infer their source parameters. Although optimal under many circumstances, these standard methods are computationally expensive. Given that the number of GW detections by the LVK is expected to reach the hundreds in the coming years, there is an urgent need for less computationally expensive detection and parameter-inference techniques. A possible solution for reducing the computational expense of such techniques is the exciting field of machine learning (ML).
In the first chapter of this thesis, GWs are introduced and it is explained how they are detected by the LVK. The sources of GWs are described, as well as methodologies for detecting the various source types, such as matched filtering. In addition to GW signal detection techniques, the method for estimating the parameters of detected GW signals (Bayesian inference) is described. In the second chapter several machine-learning algorithms are introduced, including perceptrons, convolutional neural networks (CNNs), autoencoders (AEs), variational autoencoders (VAEs) and conditional variational autoencoders (CVAEs). Practical advice on training and data-augmentation techniques is also provided. In the third chapter, a survey of several ML techniques applied to a variety of GW problems is presented.
In this thesis, various ML and statistical techniques, such as CVAEs and CNNs, were deployed in two first-of-their-kind proof-of-principle studies. The fourth chapter describes how a CNN may be used to match the sensitivity of matched filtering, the standard technique used by the LVK for detecting GWs. It is shown how a CNN may be trained using simulated BBH waveforms buried in Gaussian noise, alongside samples of Gaussian noise alone. The CNN's classification predictions were compared to those of matched filtering on the same testing data. Receiver operating characteristic and efficiency curves demonstrate that the ML approach is able to achieve the same levels of sensitivity as matched filtering. It is also shown that the CNN approach can generate predictions with low latency: given approximately 25,000 GW time series, the CNN produces classification predictions for all of them in one second.
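For white noise, the matched-filtering statistic that the CNN is benchmarked against reduces to correlating the data with a unit-norm template. The sketch below uses a toy chirp rather than a physical BBH waveform and omits the frequency-domain noise-PSD weighting that real LVK pipelines apply:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy chirp-like template (not a physical BBH waveform) in white Gaussian noise.
t = np.linspace(0.0, 1.0, 1024)
template = np.sin(2.0 * np.pi * (20.0 + 30.0 * t) * t)
template /= np.linalg.norm(template)  # unit-norm template

data_signal = 20.0 * template + rng.normal(0.0, 1.0, t.size)  # contains a signal
data_noise = rng.normal(0.0, 1.0, t.size)                     # noise only

# For white noise, the matched-filter statistic is the inner product of the
# data with the unit-norm template; its expected value is the signal amplitude
# for a matching signal and zero for pure noise.
def matched_filter_snr(data, tmpl):
    return float(np.dot(data, tmpl))

print(matched_filter_snr(data_signal, template) > 10.0)  # True: loud signal
print(matched_filter_snr(data_noise, template) < 10.0)   # True: noise only
```

A detection pipeline slides the template over the data stream and thresholds this statistic; the CNN study above learns an equivalent decision rule directly from examples.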
In the fifth and sixth chapters, it is shown how CVAEs may be used to perform Bayesian inference. A CVAE was trained using simulated BBH waveforms in Gaussian noise, together with the source-parameter values of those waveforms. At test time, the CVAE is supplied only the BBH waveform and is able to produce samples from the Bayesian posterior. Results were compared to those of several standard Bayesian samplers used by the LVK, including Dynesty, ptemcee, emcee, and CPNest. It is shown that, when properly trained, the CVAE method produces Bayesian posteriors consistent with those of the other samplers. Results are quantified using a variety of figures of merit, such as probability-probability (p-p) plots, to check that the one-dimensional marginalised posteriors from all approaches are self-consistent from the frequentist perspective. The Jensen-Shannon (JS) divergence was also employed to quantify the similarity of the posterior distributions to one another, alongside other figures of merit. It was also demonstrated that the CVAE model is able to produce posteriors with 8000 samples in under a second, representing a six-order-of-magnitude speed-up over traditional sampling methods.
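The JS divergence used as a figure of merit above is the symmetrised, smoothed form of the Kullback-Leibler divergence, bounded by ln 2 (in nats). A minimal sketch on discrete (binned) distributions, with made-up probability vectors standing in for histogrammed posterior samples:

```python
import numpy as np

# Kullback-Leibler divergence between discrete distributions p and q
# (terms with p = 0 contribute nothing).
def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Jensen-Shannon divergence: symmetrised KL against the mixture m.
def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.1, 0.4, 0.5])  # illustrative binned "posteriors"
q = np.array([0.1, 0.4, 0.5])
r = np.array([0.8, 0.1, 0.1])

print(js_divergence(p, q))                       # 0.0 for identical distributions
print(0.0 < js_divergence(p, r) <= np.log(2))    # True: positive, bounded by ln 2
```

A small JS divergence between the CVAE posterior and a sampler's posterior, as reported above, indicates the two distributions are nearly interchangeable.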
Unsupervised machine learning clustering and data exploration of radio-astronomical images
In this thesis, I demonstrate a novel and efficient unsupervised clustering and data exploration method combining a Self-Organising Map (SOM) and a convolutional autoencoder, applied to radio-astronomical images from the Radio Galaxy Zoo (RGZ) dataset. The rapidly increasing volume and complexity of radio-astronomical data have ushered in a new era of big-data astronomy and increased the demand for machine learning (ML) solutions. In this era, the sheer amount of image data produced by modern instruments has resulted in a significant data deluge. Furthermore, the morphologies of objects captured in these images are highly complex and challenging to classify conclusively due to their intricate and indistinct nature. Additionally, major radio-astronomical discoveries are unplanned and found in the unexpected, making unsupervised ML highly desirable because it operates with few assumptions and without labelled training data. In this thesis, I developed a novel unsupervised ML approach as a practical solution to these challenges. Using this system, I demonstrated the use of convolutional autoencoders and SOMs as a dimensionality reduction method to tame the complexity and volume of astronomical data. My optimised system shows that the coupling of these methods is a powerful means of data exploration and unsupervised clustering of radio-astronomical images. The results of this thesis show the approach is capable of accurately separating features by complexity on a SOM manifold and unified distance matrix, with neighbourhood similarity and hierarchical clustering of the mapped astronomical features. This method provides an effective means to automatically explore the high-level topological relationships of image features and morphology in large datasets with minimal processing time and computational resources.
I achieved these capabilities with a new and innovative method of SOM training that uses the autoencoder's compressed latent feature-vector representations of radio-astronomical data rather than raw images. Using this system, I investigated SOM affine-transformation invariance and analysed the true nature of rotational effects on the manifold using random-rotation training augmentations for the autoencoder. Throughout this thesis, I present my method as a powerful new data exploration technique and a contribution to the field. The speed and effectiveness of this method indicate excellent scalability and hold promise for large future surveys, large-scale instruments such as the Square Kilometre Array, and other big-data and complexity-analysis applications.
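The core training loop of a SOM on latent feature vectors can be sketched in a few lines. This is a generic minimal SOM on synthetic 4-d "latent" clusters, not the thesis's optimised system, and the grid size, learning-rate and neighbourhood schedules are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for autoencoder latent vectors: two well-separated 4-d clusters.
latents = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 4)),  # cluster A
    rng.normal(1.0, 0.1, size=(50, 4)),  # cluster B
])

grid = rng.normal(0.5, 0.1, size=(3, 3, 4))  # 3x3 map of 4-d weight vectors
coords = np.array([[i, j] for i in range(3) for j in range(3)]).reshape(3, 3, 2)

for epoch in range(20):
    lr = 0.5 * (1 - epoch / 20)            # decaying learning rate
    sigma = 1.5 * (1 - epoch / 20) + 0.3   # decaying neighbourhood radius
    for x in latents:
        # Best-matching unit (BMU): node whose weights are closest to the input.
        dists = np.linalg.norm(grid - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Pull the BMU and its grid neighbours towards the input.
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        h = np.exp(-grid_dist**2 / (2 * sigma**2))[..., None]
        grid += lr * h * (x - grid)

# After training, the two clusters map to different regions of the SOM.
bmu_a = np.unravel_index(np.argmin(np.linalg.norm(grid - latents[0], axis=2)), (3, 3))
bmu_b = np.unravel_index(np.argmin(np.linalg.norm(grid - latents[-1], axis=2)), (3, 3))
print(bmu_a != bmu_b)  # True: separated on the SOM manifold
```

Training on compact latent vectors instead of raw images, as the thesis does, keeps the per-sample distance computations cheap, which is where the scalability advantage comes from.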
Deep learning in high angular-resolution radio interferometry
This thesis has addressed several challenges of the big-data era in the field of high angular-resolution radio astronomy using machine-learning algorithms. The methodologies presented were designed to minimize the need for human interaction while still providing robust results. The thesis takes an interdisciplinary approach and uses knowledge from computer science to advance our understanding of the radio sky. Its main objectives can be categorized into four subjects. First, it provides an analysis of the properties of radio sources detected with the Very Long Baseline Array (VLBA). We then provide the details of our source detection and characterization pipeline, which can localize sources in any image observed with the VLBA. Besides source detection, the pipeline can remove observational noise, restore the structure of celestial sources and predict their properties, such as size and brightness. In the fourth chapter, we designed an algorithm that can find rare types of galaxies, namely strongly gravitationally lensed systems, among the many radio-emitting objects observed with the International LOFAR Telescope. We also provide preliminary results on using machine-learning algorithms to predict lensing parameters such as the Einstein radius, axis ratio and position angle.
Computational studies of genome evolution and regulation
This thesis takes on the challenge of extracting information from the large volumes of biological data produced by newly established experimental techniques. The different types of information present in a particular dataset have been carefully identified to maximise the information gained from the data; this also precludes attempts to infer types of information that are not present in the data. In the first part of the thesis I examined the evolutionary origins of de novo taxonomically restricted genes (TRGs) in the Drosophila subgenus. De novo TRGs are genes that originated after the speciation of a particular clade from previously non-coding regions: functional ncRNA, introns or alternative frames of older protein-coding genes, or intergenic sequences. TRGs are clade-specific tool-kits likely to contain proteins with as yet undocumented functions and new protein folds yet to be discovered. One of the main challenges in studying de novo TRGs is the trade-off between false positives (non-functional open reading frames) and false negatives (true TRGs with properties distinct from those of well-established genes). Here I identified two de novo TRG families in the Drosophila subgenus that have not previously been reported as de novo originated genes; to our knowledge they are the best candidates identified so far for experimental studies aimed at elucidating the properties of de novo genes. In the second part of the thesis I examined the information contained in single-cell RNA sequencing (scRNA-seq) data and propose a method for extracting biological knowledge from these data using generative neural networks. The main challenge is the noisiness of scRNA-seq data: the number of transcripts sequenced is not proportional to the number of mRNAs present in the cell. I used an autoencoder to reduce the dimensionality of the data without making untestable assumptions about them.
This embedding into a lower-dimensional space, alongside the features learned by the autoencoder, contains information about cell populations, differentiation trajectories and the regulatory relationships between genes. Unlike most methods currently in use, an autoencoder does not assume that these regulatory relationships are the same in all cells in the dataset. The main advantages of our approach are that it makes minimal assumptions about the data, it is robust to noise, and its performance can be assessed. In the final part of the thesis I summarise lessons learnt from analysing various types of biological data and make suggestions for the future direction of similar computational studies.
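The dimensionality-reduction step above can be illustrated with the simplest possible autoencoder: a linear network with a 1-d bottleneck trained by gradient descent. The toy 3-d "expression profiles" below are synthetic; real scRNA-seq models use deep nonlinear networks with count-noise models:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 3-d points lying near a line, so one latent dimension suffices.
direction = np.array([1.0, 2.0, -1.0])
X = rng.normal(size=(200, 1)) * direction + rng.normal(0, 0.05, size=(200, 3))

W_enc = rng.normal(0, 0.1, size=(3, 1))  # encoder weights (3-d -> 1-d bottleneck)
W_dec = rng.normal(0, 0.1, size=(1, 3))  # decoder weights (1-d -> 3-d)

def loss(X, W_enc, W_dec):
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))  # reconstruction error

initial = loss(X, W_enc, W_dec)
lr = 0.01
for _ in range(500):
    Z = X @ W_enc                              # latent embedding (bottleneck)
    R = Z @ W_dec - X                          # reconstruction residual
    grad_dec = 2 * Z.T @ R / len(X)            # dL/dW_dec
    grad_enc = 2 * X.T @ R @ W_dec.T / len(X)  # dL/dW_enc
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(loss(X, W_enc, W_dec) < initial)  # True: reconstruction error decreases
```

After training, the 1-d embedding Z orders the points along their underlying direction of variation, which is the sense in which an autoencoder embedding can expose population structure or differentiation trajectories.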
Numerics and Theory of High-Energy Relativistic Astrophysical Transients (Alternative Format Thesis)
End-to-end anomaly detection in stream data
Nowadays, huge volumes of data are generated at increasing velocity by various systems, applications, and activities. This increases the demand for stream and time-series analysis that reacts to changing conditions in real time, for enhanced efficiency and quality of service delivery as well as improved safety and security in the private and public sectors. Despite its very rich history, time-series anomaly detection is still one of the vital topics in machine-learning research and is receiving increasing attention. Identifying hidden patterns and selecting a model that fits the observed data well and also generalises to unobserved data is not a trivial task. Due to the increasing diversity of data sources and associated stochastic processes, this pivotal data-analysis topic is loaded with challenges like complex latent patterns, concept drift, and overfitting that may mislead a model and cause a high false-alarm rate. Handling these challenges leads advanced anomaly detection methods to develop sophisticated decision logic, which turns them into opaque, inexplicable black boxes. Contrary to this trend, end-users expect transparency and verifiability in order to trust a model and the outcomes it produces. Also, pointing users to the most anomalous or malicious regions of a time series and the causal features could save them time, energy, and money. For these reasons, this thesis addresses the crucial challenges in an end-to-end pipeline of stream-based anomaly detection through three essential phases: behaviour prediction, inference, and interpretation. The first step is focused on devising a time-series model that achieves high average accuracy as well as small error deviation. On this basis, we propose higher-quality anomaly detection and scoring techniques that use the related context to reclassify observations and post-prune unjustified events.
Last but not least, we make the predictive process transparent and verifiable by providing meaningful reasoning behind its generated results, based on concepts understandable to a human. The provided insight can pinpoint the anomalous regions of a time series and explain why the current status of a system has been flagged as anomalous. Stream-based anomaly detection research is a principal area of innovation supporting our economy, security, and even the safety and health of societies worldwide. We believe our proposed analysis techniques can contribute to building situational-awareness platforms and open new perspectives in a variety of domains such as cybersecurity and health.
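A transparent baseline of the kind this thesis contrasts with sophisticated black boxes is a rolling z-score detector: flag a point that deviates from the recent window mean by more than k standard deviations. This is a generic sketch, not the thesis's pipeline, and the window size and threshold are illustrative:

```python
from collections import deque
import math

# Streaming anomaly scorer: flag a point whose distance from the rolling mean
# exceeds k standard deviations of a sliding window.
class RollingZScoreDetector:
    def __init__(self, window=20, k=3.0):
        self.window = deque(maxlen=window)
        self.k = k

    def update(self, x):
        """Score x against the current window, then add it to the window."""
        anomalous = False
        if len(self.window) >= 5:  # need a few points before scoring
            mean = sum(self.window) / len(self.window)
            var = sum((v - mean) ** 2 for v in self.window) / len(self.window)
            std = math.sqrt(var)
            anomalous = std > 0 and abs(x - mean) > self.k * std
        self.window.append(x)
        return anomalous

detector = RollingZScoreDetector(window=20, k=3.0)
stream = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 50.0, 10.1]
flags = [detector.update(x) for x in stream]
print(flags.index(True))  # 8: the spike at value 50.0 is flagged
```

Because the flag reduces to "more than k standard deviations from the recent mean", the decision is directly explainable to an end-user, which is the transparency property the thesis argues for, though such a baseline handles neither concept drift nor complex latent patterns.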