286 research outputs found

    Contributions to probabilistic non-negative matrix factorization - Maximum marginal likelihood estimation and Markovian temporal models

    Non-negative matrix factorization (NMF) has become a popular dimensionality reduction technique, and has found applications in many different fields, such as audio signal processing, hyperspectral imaging, or recommender systems. In its simplest form, NMF aims at finding an approximation of a non-negative data matrix (i.e., with non-negative entries) as the product of two non-negative matrices, called the factors. One of these two matrices can be interpreted as a dictionary of characteristic patterns of the data, and the other one as activation coefficients of these patterns. This low-rank approximation is traditionally retrieved by optimizing a measure of fit between the data matrix and its approximation. As it turns out, for many choices of measures of fit, the problem can be shown to be equivalent to the joint maximum likelihood estimation of the factors under a certain statistical model describing the data. This leads us to an alternative paradigm for NMF, where the learning task revolves around probabilistic models whose observation density is parametrized by the product of non-negative factors. This general framework, coined probabilistic NMF, encompasses many well-known latent variable models of the literature, such as models for count data. In this thesis, we consider specific probabilistic NMF models in which a prior distribution is assumed on the activation coefficients, but the dictionary remains a deterministic variable. The objective is then to maximize the marginal likelihood in these semi-Bayesian NMF models, i.e., the joint likelihood integrated over the activation coefficients. This amounts to learning the dictionary only; the activation coefficients may be inferred in a second step if necessary. We proceed to study in greater depth the properties of this estimation process. In particular, two scenarios are considered. In the first one, we assume the activation coefficients to be independent sample-wise. Previous experimental work showed that dictionaries learned with this approach exhibited a tendency to automatically regularize the number of components, a favorable property which was left unexplained. In the second one, we lift this standard assumption and instead consider Markov structures that add statistical correlation to the model, in order to better analyze temporal data.
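    The classical optimization view of NMF described above can be illustrated with a minimal numpy sketch. This shows Lee-Seung multiplicative updates for the Frobenius measure of fit, not the semi-Bayesian marginal-likelihood estimator developed in the thesis; the matrix sizes and rank are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative data matrix V, approximated as V ≈ W @ H with rank K = 2.
V = rng.random((20, 30))
K = 2
W = rng.random((20, K)) + 0.1   # dictionary of characteristic patterns
H = rng.random((K, 30)) + 0.1   # activation coefficients

# Lee-Seung multiplicative updates for the Frobenius measure of fit;
# non-negativity of both factors is preserved by construction.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)

# Relative approximation error of the learned low-rank factorization.
err = np.linalg.norm(V - W @ H, "fro") / np.linalg.norm(V, "fro")
```

    Replacing the Frobenius measure of fit with the Kullback-Leibler divergence changes only the update rules, and corresponds in the probabilistic view to a Poisson observation model, one of the count-data models mentioned above.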

    soMLier: A South African Wine Recommender System

    Though several commercial wine recommender systems exist, they are largely tailored to consumers outside of South Africa (SA). Consequently, these systems are of limited use to novice wine consumers in SA. To address this, the aim of this research is to develop a system for South African consumers that yields high-quality wine recommendations, maximises the accuracy of predicted ratings for those recommendations, and provides insights into why those suggestions were made. To achieve this, a hybrid system “soMLier” (pronounced “sommelier”) is built in this thesis that makes use of two datasets. Firstly, a database containing several attributes of South African wines, such as the chemical composition, style, aroma, price and description, was supplied by wine.co.za (a SA wine retailer). Secondly, for each wine in that database, the numeric 5-star ratings and textual reviews made by users worldwide were scraped from Vivino.com to serve as a dataset of user preferences. Together, these are used to develop and compare several systems, the best of which are combined in the final system. Item-based collaborative filtering methods are investigated first, along with model-based techniques (such as matrix factorisation and neural networks), when applied to the user rating dataset to generate wine recommendations through the ranking of rating predictions. These methods are found to excel at generating lists of relevant wine recommendations and at producing accurate corresponding rating predictions, respectively. Next, the wine attribute data is used to explore the efficacy of content-based systems. Numeric features (such as price) are compared along with categorical features (such as style) using various distance measures, and the relationships between the textual descriptions of the wines are determined using natural language processing methods. These methods are found to be most appropriate for explaining wine recommendations. Hence, the final hybrid system makes use of collaborative filtering to generate recommendations, matrix factorisation to predict user ratings, and content-based techniques to rationalise the wine suggestions made. This thesis contributes the “soMLier” system that is of specific use to SA wine consumers, as it bridges the gap between the technologies used by highly developed existing systems and the SA wine market. Though this final system would benefit from more explicit user data to establish a richer model of user preferences, it can ultimately assist consumers in exploring unfamiliar wines, discovering wines they will likely enjoy, and understanding their preferences for SA wine.
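    As a rough illustration of the matrix-factorisation component, the sketch below fits user and wine latent factors by stochastic gradient descent on a handful of ratings. The data, dimensions, and hyperparameters are invented for illustration and are not soMLier's actual model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical (user, wine, 5-star rating) triples -- illustrative only.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 2.0),
           (2, 1, 4.0), (2, 2, 5.0)]
n_users, n_wines, k = 3, 3, 2

P = 0.1 * rng.standard_normal((n_users, k))   # user latent factors
Q = 0.1 * rng.standard_normal((n_wines, k))   # wine latent factors

# Plain SGD on squared error with L2 regularisation.
lr, reg = 0.05, 0.01
for _ in range(500):
    for u, i, r in ratings:
        e = r - P[u] @ Q[i]                   # prediction error
        pu = P[u].copy()
        P[u] += lr * (e * Q[i] - reg * P[u])
        Q[i] += lr * (e * pu - reg * Q[i])

# Fit on the observed ratings, and one reconstructed prediction.
rmse = np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings]))
pred = P[0] @ Q[0]
```

    A rating for an unseen (user, wine) pair is predicted with the same dot product, and ranking those predictions yields the recommendation list.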

    Gaussian Processes on Hypergraphs

    We derive a Matérn Gaussian process (GP) on the vertices of a hypergraph. This enables estimation of regression models of observed or latent values associated with the vertices, in which the correlation and uncertainty estimates are informed by the hypergraph structure. We further present a framework for embedding the vertices of a hypergraph into a latent space using the hypergraph GP. Finally, we provide a scheme for identifying a small number of representative inducing vertices that enables scalable inference through sparse GPs. We demonstrate the utility of our framework on three challenging real-world problems: multi-class classification of the political party affiliation of legislators on the basis of voting behaviour, probabilistic matrix factorisation of movie reviews, and embedding a hypergraph of animals into a low-dimensional latent space.
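    The construction can be sketched numerically as follows. This assumes a clique-expansion hypergraph Laplacian and the standard spectral form of a graph Matérn kernel, applying the transform (2ν/κ² + λ)^(−ν) to the Laplacian eigenvalues; the paper's exact construction may differ:

```python
import numpy as np

# Toy hypergraph: 5 vertices, 3 hyperedges given as vertex lists.
hyperedges = [[0, 1, 2], [2, 3], [3, 4]]
n = 5

# Incidence matrix and a clique-expansion Laplacian (one common choice;
# not necessarily the construction used in the paper).
H = np.zeros((n, len(hyperedges)))
for j, e in enumerate(hyperedges):
    H[e, j] = 1.0
De = np.diag(H.sum(axis=0))              # hyperedge degrees
Dv = np.diag(H.sum(axis=1))              # vertex degrees
L = Dv - H @ np.linalg.inv(De) @ H.T

# Matern-type graph kernel via the Laplacian eigendecomposition.
nu, kappa = 1.5, 1.0
evals, evecs = np.linalg.eigh(L)
K = evecs @ np.diag((2.0 * nu / kappa**2 + evals) ** (-nu)) @ evecs.T
```

    The resulting K is a positive-definite covariance over the vertices in which vertices sharing a hyperedge correlate more strongly than distant ones, so it can be dropped into any standard GP regression or classification model.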

    Automatic Detection of Volcanic Unrest Using Interferometric Synthetic Aperture Radar

    A diverse set of hazards is posed by the world's 1500 subaerial volcanoes, yet the majority of them remain unmonitored. Measurements of deformation provide a way to monitor volcanoes, and synthetic aperture radar (SAR) provides a powerful tool to measure deformation at the majority of the world's subaerial volcanoes. This is due to recent changes in how regularly SAR data are acquired, how they are distributed to the scientific community, and how quickly they can be processed to create time series of interferograms. However, for interferometric SAR (InSAR) to be used to monitor the world's volcanoes, an algorithm is required to automatically detect signs of deformation-generating volcanic unrest in a time series of interferograms, as the volume of new interferograms produced each week precludes this task being achieved by human interpreters. In this thesis, I introduce two complementary methods that can be used to detect signs of volcanic unrest. The first method centres on the use of blind signal separation (BSS) methods to isolate signals of geophysical interest from nuisance signals, such as those due to changes in the refractive index of the atmosphere between two SAR acquisitions. This is achieved by first comparing which of non-negative matrix factorisation (NMF), principal component analysis (PCA), and independent component analysis (ICA) is best suited for solving BSS problems involving time series of InSAR data, and how InSAR data should best be arranged for use with these methods. I find that NMF can be used with InSAR data, provided the time series is formatted in a novel way that reduces the likelihood of any pixels having negative values. However, when NMF, PCA, and ICA are applied to a set of synthetic data, I find that the most accurate recovery of signals of interest is achieved when ICA is set to recover spatially independent sources (termed sICA). 
    I find that the best results are produced by sICA when interferograms are ordered as a simple "daisy chain" of short temporal baselines, and when sICA is set to recover around 1-3 more sources than were thought to have contributed to the time series. However, I also show that in cases such as deformation centred under a stratovolcano, the overlapping nature of a topographically correlated atmospheric phase screen (APS) signal and a deformation signal produces a pair of signals that are no longer spatially statistically independent, and so cannot be recovered accurately by sICA. To validate these results, I apply sICA to a time series of Sentinel-1 interferograms that span the 2015 eruption of Wolf volcano (Galapagos archipelago, Ecuador) and automatically isolate three signals of geophysical interest, which I validate by comparing with the results of other studies. I also apply the sICA algorithm to a time series of interferograms that image Mt Etna, and through isolating signals that are likely to be due to instability of the east flank of the volcano, show that the method can be applied to stratovolcanoes to recover useful signals. Utilising the ability of sICA to isolate signals of interest, I introduce a prototype detection algorithm that tracks changes in the behaviour of a subaerial volcano, and show that it could have been used to detect the onset of the 2015 eruption of Wolf. However, for use in a detection algorithm that is to be applied globally, the signals recovered by sICA cannot be manually validated through comparison with other studies. Therefore, I seek to incorporate a module into my detection algorithm that is able to quantify the significance of the sources recovered by sICA. I achieve this through extensively modernising the ICASO algorithm to create a new algorithm, ICASAR, that is optimised for use with InSAR time series. 
    This algorithm allows me to assess the significance of signals recovered by sICA at a given volcano, and to then prioritise the tracking of any changes they exhibit when they are used in my detection algorithm. To further develop the detection algorithm, I create two synthetic time series that characterise the different types of unrest that could occur at a volcanic centre. The first features the introduction of a new signal, and my algorithm is able to detect when this signal enters the time series by tracking how well the baseline sources are able to fit new interferograms. The second features a change in rate of a signal that was present during the baseline stage, and my algorithm is able to detect when this change in rate occurs by tracking how sources recovered from the baseline data are used through time. To further test the algorithm, I extend the Sentinel-1 time series I used to study the 2015 eruption of Wolf to include the 2018 eruption of Sierra Negra, and I find that my algorithm is able to detect the increase in inflation that precedes the eruption, and the eruption itself. I also perform a small study into the pre-eruptive inflation seen at Sierra Negra using the deformation signal and its time history that are output by ICASAR. A Bayesian inversion is performed using the GBIS software package, in which the inflation signal is modelled as a horizontal rectangular dislocation with variable opening and uniform overpressure. Coupled with the time history of the inflation signal provided by ICASAR, this allows me to determine the temporal evolution of the pre-eruptive overpressure since the beginning of the Sentinel-1 time series in 2014. To extend this back to the end of the previous eruption in 2005, I use GPS data that span the entire inter-eruptive period. I find that the total inter-eruptive pressure change is ~13.5 MPa, which is significantly larger than the values required for tensile failure of an elastic medium overlying an inflating body. 
    I conclude that it is likely that one or more processes occurred to reduce the overpressure within the sill, and that the change in rate of inflation prior to the final failure of the sill is unlikely to be coincidental. The second method I develop to detect volcanic deformation in a time series of interferograms uses a convolutional neural network (CNN) to classify and locate deformation signals as each new interferogram is added to the time series. I achieve this by building a model that uses the five convolutional blocks of a previously state-of-the-art classification and localisation model, VGG16, but incorporates both a classification head and a localisation head. In order to train the model, I perform transfer learning and utilise the weights made freely available for the convolutional blocks of a version of VGG16 that was trained to classify natural images. I then synthesise a set of training data, but find that better performance is achieved on a testing set of Sentinel-1 interferograms when the model is trained with a mixture of both synthetic and real data. I conclude that CNNs can be built that are able to differentiate between different styles of volcanic deformation, and that they can perform localisation by globally reasoning with a 224 x 224 pixel interferogram, without the need for a sliding-window approach. The results I present in this thesis show that many machine learning methods can be applied both to time series of interferograms and to individual interferograms. sICA provides a powerful tool to separate some geophysical signals from atmospheric ones, and the ICASAR algorithm that I develop allows a user to evaluate the significance of the results provided by sICA. I incorporate these methods into a deformation detection algorithm, and show that this could be used to detect several types of volcanic unrest using data produced by the latest generation of SAR satellites. 
    Additionally, the CNN I develop is able to differentiate between deformation signals in a single interferogram, and provides a complementary way to monitor volcanoes using InSAR.
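    The role of sICA in the first method can be illustrated on toy one-dimensional signals with a hand-rolled FastICA. This is a numpy-only sketch, not the ICASAR implementation: two statistically independent sources stand in for a deformation signal and a nuisance signal, are mixed into six observations, whitened, and recovered up to sign and ordering:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two statistically independent synthetic "spatial" sources, mixed into
# six observed interferogram-like signals.
t = np.linspace(0, 1, 400)
S = np.vstack([np.sin(2 * np.pi * 3 * t),             # smooth signal
               np.sign(np.sin(2 * np.pi * 5 * t))])   # non-Gaussian signal
A = rng.random((6, 2))                                # mixing matrix
X = A @ S                                             # observations

# Centre and whiten via the SVD, keeping the two dominant directions.
X = X - X.mean(axis=1, keepdims=True)
U, d, _ = np.linalg.svd(X, full_matrices=False)
Z = (U[:, :2] / d[:2]).T @ X * np.sqrt(X.shape[1])

# FastICA fixed-point iteration with a tanh nonlinearity and deflation.
W = np.zeros((2, 2))
for i in range(2):
    w = rng.standard_normal(2)
    for _ in range(200):
        wx = np.tanh(w @ Z)
        w_new = (Z * wx).mean(axis=1) - (1 - wx**2).mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)    # decorrelate from earlier rows
        w = w_new / np.linalg.norm(w_new)
    W[i] = w
S_hat = W @ Z   # recovered sources, up to sign and ordering
```

    PCA, by contrast, would return the orthogonal directions of maximum variance, which here are mixtures of the two sources rather than the sources themselves.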

    On-premise containerized, light-weight software solutions for Biomedicine

    Bioinformatics software systems are critical tools for analysing large-scale biological data, but their design and implementation can be challenging due to the need for reliability, scalability, and performance. This thesis investigates the impact of several software approaches on the design and implementation of bioinformatics software systems. These approaches include software patterns, microservices, distributed computing, containerisation and container orchestration. The research focuses on understanding how these techniques affect the reliability, scalability, performance, and efficiency of bioinformatics software systems. Furthermore, this research highlights the challenges and considerations involved in their implementation. This study also examines potential solutions for implementing container orchestration in bioinformatics research teams with limited resources and the challenges of using container orchestration. Additionally, the thesis considers microservices and distributed computing, and how these can be optimised in the design and implementation process to enhance the productivity and performance of bioinformatics software systems. The research was conducted using a combination of software development, experimentation, and evaluation. The results show that implementing software patterns can significantly improve the code accessibility and structure of bioinformatics software systems. Microservices and containerisation in particular enhanced system reliability, scalability, and performance. Additionally, the study indicates that adopting advanced software engineering practices, such as model-driven design and container orchestration, can facilitate efficient and productive deployment and management of bioinformatics software systems, even for researchers with limited resources. Overall, we develop a software system that integrates all of our findings, and the proposed system demonstrated its ability to address challenges in bioinformatics. 
    The thesis makes several key contributions in addressing the research questions surrounding the design, implementation, and optimisation of bioinformatics software systems using software patterns, microservices, containerisation, and advanced software engineering principles and practices. Our findings suggest that incorporating these technologies can significantly improve the reliability, scalability, performance, efficiency, and productivity of bioinformatics software systems.

    Machine learning for improving heuristic optimisation

    Heuristics, metaheuristics and hyper-heuristics are search methodologies which have been preferred by many researchers and practitioners for solving computationally hard combinatorial optimisation problems whenever exact methods fail to produce high-quality solutions in a reasonable amount of time. In this thesis, we introduce an advanced machine learning technique, namely tensor analysis, into the field of heuristic optimisation. We show how the relevant data should be collected in tensorial form, analysed, and used during the search process. Four case studies are presented to illustrate the capability of single- and multi-episode tensor analysis, processing data at high and low abstraction levels, to improve heuristic optimisation. A single-episode tensor analysis using data at a high abstraction level is employed to improve an iterated multi-stage hyper-heuristic for cross-domain heuristic search. The empirical results across six different problem domains from a hyper-heuristic benchmark show that significant overall performance improvement is possible. A similar approach embedding a multi-episode tensor analysis is applied to the nurse rostering problem and evaluated on a benchmark of a diverse collection of instances obtained from different hospitals across the world. The empirical results indicate the success of the tensor-based hyper-heuristic, improving upon the best-known solutions for four particular instances. A genetic algorithm is a nature-inspired metaheuristic which uses a population of multiple interacting solutions during the search. Mutation is the key variation operator in a genetic algorithm and adjusts the diversity in a population throughout the evolutionary process. Often, a fixed mutation probability is used to perturb the value at each locus, representing a unique component of a given solution. 
    A single-episode tensor analysis using data with a low abstraction level is applied to an online bin packing problem, generating locus-dependent mutation probabilities. The tensor approach significantly improves the performance of a standard genetic algorithm on almost all instances. A multi-episode tensor analysis using data with a low abstraction level is embedded into a multi-agent cooperative search approach. The empirical results once again show the success of the proposed approach on a benchmark of flow shop problem instances, as compared to the approach which does not make use of tensor analysis. Tensor analysis can handle data with different levels of abstraction, leading to a learning approach which can be used within different types of heuristic optimisation methods based on different underlying design philosophies, indeed improving their overall performance.
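    The idea of locus-dependent mutation can be caricatured on a OneMax toy problem. The per-locus rates below are set by hand purely for illustration, standing in for the probabilities that the thesis derives from tensor analysis of the search history:

```python
import random

random.seed(3)

# OneMax toy problem: maximise the number of 1s in a 30-bit string.
n_loci, pop_size, gens = 30, 20, 60

def fitness(ind):
    return sum(ind)

# Locus-dependent mutation rates: each locus gets its own probability
# instead of the usual single fixed rate (values here are illustrative).
mut = [0.5 / n_loci] * (n_loci // 2) + [2.0 / n_loci] * (n_loci - n_loci // 2)

pop = [[random.randint(0, 1) for _ in range(n_loci)] for _ in range(pop_size)]
best = max(pop, key=fitness)
for _ in range(gens):
    # Binary tournament selection.
    parents = [max(random.sample(pop, 2), key=fitness) for _ in range(pop_size)]
    # One-point crossover followed by locus-wise mutation.
    nxt = []
    for a, b in zip(parents[::2], parents[1::2]):
        cut = random.randrange(1, n_loci)
        for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
            nxt.append([bit ^ (random.random() < p)
                        for bit, p in zip(child, mut)])
    pop = nxt
    best = max(pop + [best], key=fitness)
```

    Swapping the list `mut` for a learned, per-locus probability vector is the only change a standard genetic algorithm needs in order to use such data-driven mutation rates.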

    Strain Elevation Tension Spring embedding and Cascading failures on the power-grid

    Understanding the dynamics and properties of networks is of great importance in our highly connected, data-driven society. When the networks relate to infrastructure, such understanding can have a substantial impact on public welfare. As such, there is a need for algorithms that can provide insights into the observable and latent properties of these structures. This thesis presents a novel embedding algorithm, the Strain Elevation Tension Spring embedding (SETSe), as a method of understanding complex networks. The algorithm is a deterministic physics model that incorporates both node and edge features into the final embedding. SETSe distinguishes itself from most embedding methods by not having a loss function in the conventional sense and by not trying to place similar nodes close together. Instead, SETSe acts as a smoothing function for node features across the network topology. This approach produces embeddings that are intuitive and interpretable. In this thesis, I demonstrate how SETSe outperforms alternative embedding methods on node-level and graph-level tasks using networks made from stochastic block models and social networks with over 40,000 nodes and over 1 million edges. I also highlight a weakness of traditional methods for analysing cascading failures on power grids and demonstrate that SETSe is not susceptible to such issues. I then show how SETSe can be used as a measure of robustness, in addition to providing a means to create interpretable maps in geographical space given its smoothing embedding method. The framework has been made widely available through two open-source R package contributions: 1) the implementation of SETSe ("rsetse" on CRAN), and 2) a package for analysing cascading failures on power grids.
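    The physics can be caricatured in one dimension. This is a heavily simplified sketch, not the rsetse implementation: a node feature acts as a vertical force, springs along the edges pull neighbours together, and damped integration runs the system to equilibrium, smoothing the feature over the topology:

```python
import numpy as np

# Toy graph: two communities of three nodes joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
force = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])   # node feature (net force)
n = len(force)

# Damped integration: elevations move until spring tension along the edges
# balances the node forces.
k, dt, damping = 1.0, 0.05, 0.9
z = np.zeros(n)   # node elevations (the embedding)
v = np.zeros(n)   # node velocities
for _ in range(2000):
    f = force.copy()
    for i, j in edges:
        tension = k * (z[j] - z[i])   # spring pulls i towards j's elevation
        f[i] += tension
        f[j] -= tension
    v = damping * (v + dt * f)
    z = z + dt * v
```

    At equilibrium the elevations separate the two communities, and the bridge edge spans the largest elevation difference, which is the kind of interpretable structure the embedding is designed to expose.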
