
    APPLICATIONS OF MACHINE LEARNING IN MICROBIAL FORENSICS

    Microbial ecosystems are complex, with hundreds of members interacting with each other and the environment. The intricate and hidden behaviors underlying these interactions make research questions challenging, but they can be better understood through machine learning. However, most machine learning used in microbiome work is a black-box form of investigation: accurate predictions can be made, but the inner logic driving those predictions is hidden behind nontransparent layers of complexity. Accordingly, the goal of this dissertation is to provide an interpretable, in-depth machine learning approach to investigate microbial biogeography and to use micro-organisms as novel tools to detect geospatial location and object provenance (previously known origin). These contributions are accompanied by a framework for extracting interpretable metrics and actionable insights from microbiome-based machine learning models. The first part of this work provides an overview of machine learning in the context of microbial ecology, human microbiome studies and environmental monitoring, outlining common practice and shortcomings. The second part presents a field study demonstrating how machine learning can characterize global patterns in microbial biogeography, using microbes sampled from ports around the world. The third part studies the persistence and stability of natural environmental microbial communities that colonize objects (vessels) and remain attached as they travel through the water. Finally, the last part of the dissertation provides a robust framework for investigating the microbiome, giving a sound understanding of the data used in microbiome-based machine learning and allowing researchers to better interpret results. Together, these experiments build an understanding of how to carry an in-silico design that characterizes candidate microbial biomarkers in real-world settings through to a rapid, field-deployable diagnostic assay. The work presented here provides evidence for microbial forensics as a toolkit to expand our basic understanding of microbial biogeography, of microbial community stability and persistence in complex systems, and of the ability of machine learning to feed downstream molecular detection platforms for rapid and accurate detection.
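    As a concrete illustration of the interpretable approach described above, the following minimal Python sketch trains a random forest on a toy taxon-abundance table and ranks taxa by feature importance, the kind of output that can seed candidate biomarkers; all data, sample sizes, and port labels are invented for illustration and are not taken from the dissertation.

        # Hypothetical sketch: classify sample origin from microbial taxon abundances
        # and extract interpretable feature importances (candidate biomarkers).
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)

        # Toy abundance table: 120 samples x 50 taxa, three ports as origin labels.
        X = rng.poisson(lam=5.0, size=(120, 50)).astype(float)
        X = X / X.sum(axis=1, keepdims=True)   # counts -> relative abundances
        y = np.repeat(["port_A", "port_B", "port_C"], 40)

        clf = RandomForestClassifier(n_estimators=500, random_state=0)
        print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

        clf.fit(X, y)
        # Rank taxa by importance: a starting point for downstream assay design.
        for idx in np.argsort(clf.feature_importances_)[::-1][:10]:
            print(f"taxon_{idx:02d}  importance={clf.feature_importances_[idx]:.3f}")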

    On relational learning and discovery in social networks: a survey

    The social networking scene has evolved tremendously over the years, growing in relational complexity and extending a vast presence onto popular social media platforms on the internet. With advances in sentiment and affective computing, relationships once thought to be simple have become multi-dimensional and widespread online. This explosion of the online social scene has attracted much research attention. The main aims of this work revolve around knowledge discovery and data mining over these feature-rich relations. In this paper, we provide a survey of relational learning and discovery through popular social analysis of different structure types that are integral to applications within the emerging field of sentiment and affective computing. It is hoped that this contribution will clarify how social networks are analyzed with the latest methods and suggest directions for future improvements.

    Computerized Analysis of Magnetic Resonance Images to Study Cerebral Anatomy in Developing Neonates

    The study of cerebral anatomy in developing neonates is of great importance for understanding brain development during the early period of life. This dissertation therefore focuses on three challenges in the modelling of cerebral anatomy in neonates during brain development. The methods developed all use Magnetic Resonance Imaging (MRI) as source data. To facilitate study of vascular development in the neonatal period, a set of image analysis algorithms is developed to automatically extract and model cerebral vessel trees. The whole process consists of cerebral vessel tracking from automatically placed seed points, vessel tree generation, and vasculature registration and matching. These algorithms have been tested on clinical Time-of-Flight (TOF) MR angiographic datasets. To facilitate study of the neonatal cortex, a complete cerebral cortex segmentation and reconstruction pipeline has been developed. Segmentation of the neonatal cortex is not handled effectively by existing algorithms designed for the adult brain because the contrast between grey and white matter is reversed, which causes voxels containing tissue mixtures to be mislabelled by conventional methods. The neonatal cortical segmentation method developed here is based on a novel expectation-maximization (EM) method with explicit correction for mislabelled partial-volume voxels. Based on the resulting cortical segmentation, an implicit surface evolution technique is adopted for reconstructing the neonatal cortex. The performance of the method is assessed through a detailed landmark study. To facilitate study of cortical development, a cortical surface registration algorithm for aligning cortical surfaces is developed. The method first inflates extracted cortical surfaces and then performs a non-rigid surface registration using free-form deformations (FFDs) to remove residual misalignment. Validation experiments using data labelled by an expert observer demonstrate that the method can capture local changes and follow the growth of specific sulci.
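    The EM-based tissue classification underlying the segmentation pipeline can be sketched in a few lines; the toy example below fits a two-class Gaussian intensity mixture with plain expectation-maximization on simulated 1-D intensities. The spatial priors and the explicit partial-volume correction of the actual method are not reproduced, and all numbers are illustrative.

        # Minimal EM for a two-class (grey/white matter) intensity mixture.
        import numpy as np

        def em_two_class(intensities, n_iter=50):
            x = np.asarray(intensities, dtype=float)
            mu = np.percentile(x, [25, 75])        # initial class means
            sigma = np.array([x.std(), x.std()])
            pi = np.array([0.5, 0.5])
            for _ in range(n_iter):
                # E-step: class responsibilities for every voxel intensity
                lik = np.stack([
                    pi[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / sigma[k]
                    for k in range(2)
                ])
                resp = lik / lik.sum(axis=0, keepdims=True)
                # M-step: update means, variances and mixing proportions
                for k in range(2):
                    w = resp[k]
                    mu[k] = (w * x).sum() / w.sum()
                    sigma[k] = np.sqrt((w * (x - mu[k]) ** 2).sum() / w.sum()) + 1e-6
                    pi[k] = w.mean()
            return mu, sigma, pi, resp.argmax(axis=0)

        # Simulated intensities (grey/white matter contrast is reversed in neonates).
        rng = np.random.default_rng(1)
        voxels = np.concatenate([rng.normal(80, 10, 4000), rng.normal(120, 12, 6000)])
        mu, sigma, pi, labels = em_two_class(voxels)
        print("estimated class means:", mu)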

    Machine learning model selection with multi-objective Bayesian optimization and reinforcement learning

    A machine learning system, including one used for reinforcement learning, is usually fed with only limited data, while aiming to train a model with good predictive performance that generalizes to an underlying data distribution. Within certain hypothesis classes, model selection chooses a model based on selection criteria calculated from the available data, which usually serve as estimators of the model's generalization performance. One major challenge for model selection that has drawn increasing attention is the discrepancy between the distribution from which training data are sampled and the data distribution at deployment. The model can over-fit the training distribution and fail to extrapolate to unseen deployment distributions, which can greatly harm the reliability of a machine learning system. Such distribution shift can become even more pronounced for high-dimensional data types like gene expression data, functional data and image data, especially in decentralized learning scenarios. Another challenge for model selection is efficient search in the hypothesis space. Since training a machine learning model usually takes a fair amount of resources, searching for an appropriate model with favorable configurations is inherently an expensive process, calling for efficient optimization algorithms. To tackle the challenge of distribution shift, novel resampling methods for evaluating the robustness of neural networks are proposed, together with a domain generalization method that uses multi-objective Bayesian optimization in a decentralized learning scenario and variational inference in a domain-unsupervised manner. To tackle the expensive model search problem, combining Bayesian optimization and reinforcement learning in an interleaved manner is proposed for efficient search in a hierarchical conditional configuration space. Additionally, the effectiveness of multi-objective Bayesian optimization for model search in decentralized learning scenarios is proposed and verified. A model selection perspective on reinforcement learning is presented, with associated contributions to tackling exploration in high-dimensional state-action spaces and sparse rewards, and connections between statistical inference and control are summarized. Finally, contributions to open-source software in related machine learning sub-topics, such as feature selection and functional data analysis with advanced tuning methods and extensive benchmarking, are also made.
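    To make the search component concrete, here is a minimal single-objective Bayesian optimization loop with a Gaussian-process surrogate and an expected-improvement acquisition over one hyperparameter; the multi-objective and reinforcement-learning-interleaved variants developed in the thesis are not reproduced, and the objective function is a synthetic stand-in for a cross-validated error surface.

        # Sketch of Bayesian optimization for hyperparameter search.
        import numpy as np
        from scipy.stats import norm
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import Matern

        def objective(log_c):
            # Synthetic stand-in for a cross-validated error over one hyperparameter.
            return np.sin(3 * log_c) + 0.1 * log_c ** 2

        bounds = (-2.0, 2.0)
        X = np.random.uniform(*bounds, size=(3, 1))      # initial design points
        y = np.array([objective(x[0]) for x in X])

        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        grid = np.linspace(*bounds, 400).reshape(-1, 1)

        for _ in range(15):
            gp.fit(X, y)
            mu, sd = gp.predict(grid, return_std=True)
            best = y.min()
            # Expected improvement (minimization)
            z = (best - mu) / np.maximum(sd, 1e-9)
            ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)
            x_next = grid[np.argmax(ei)]
            X = np.vstack([X, x_next])
            y = np.append(y, objective(x_next[0]))

        print("best hyperparameter (log scale):", X[np.argmin(y)][0], "error:", y.min())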

    Democratizing machine learning

    Machine learning artifacts are increasingly embedded in society, often in the form of automated decision-making processes. One major reason for this, along with methodological improvements, is the increasing accessibility of data, but also of machine learning toolkits that enable access to machine learning methodology for non-experts. The core focus of this thesis is exactly this: democratizing access to machine learning in order to enable a wider audience to benefit from its potential.
    Contributions in this manuscript stem from several different areas within this broad field. A major section is dedicated to automated machine learning (AutoML), with the goal of abstracting away the tedious task of obtaining an optimal predictive model for a given dataset. This process mostly consists of finding said optimal model, often through hyperparameter optimization, while the user in turn only selects the appropriate performance metric(s) and validates the resulting models. The process can be improved or sped up by learning from previous experiments. Three such methods are presented in this thesis: one aims to obtain a fixed set of candidate hyperparameter configurations that likely contain good solutions for any new dataset, and two use dataset characteristics to propose new configurations. The thesis furthermore presents a collection of the required experiment metadata and shows how such metadata can be used for the development of, and as a test bed for, new hyperparameter optimization methods. The pervasiveness of ML-derived models in many aspects of society simultaneously calls for increased scrutiny of how such models shape society and of the biases they may exhibit. Therefore, this thesis presents an AutoML tool that allows incorporating fairness considerations into the search for an optimal model. This requirement for fairness simultaneously poses the question of whether we can reliably estimate a model's fairness, which is studied in a further contribution of this thesis. Since access to machine learning methods also heavily depends on access to software and toolboxes, several contributions in the form of software are part of this thesis. The mlr3pipelines R package allows models to be embedded in so-called machine learning pipelines that include pre- and postprocessing steps often required in machine learning and AutoML. The mlr3fairness R package, on the other hand, enables users to audit models for potential biases and to reduce those biases through different debiasing techniques. One such technique, multi-calibration, is published as a separate software package, mcboost.
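    Since the software contributions above are R packages, the following Python sketch is only an analogue of two of the ideas they implement: a preprocessing-plus-model pipeline (the role mlr3pipelines plays) and a simple fairness audit comparing positive-prediction rates across a protected group (one kind of check mlr3fairness provides). All column names and data are invented.

        # Pipeline plus a demographic-parity style fairness audit (illustrative only).
        import numpy as np
        import pandas as pd
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        n = 2000
        df = pd.DataFrame({
            "income": rng.normal(50, 15, n),
            "tenure": rng.normal(5, 2, n),
            "group": rng.integers(0, 2, n),        # protected attribute (0/1)
        })
        df["label"] = (df["income"] + 5 * df["group"] + rng.normal(0, 10, n) > 55).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(
            df[["income", "tenure"]], df["label"], test_size=0.3, random_state=0)
        model = Pipeline([("scale", StandardScaler()),
                          ("clf", LogisticRegression())]).fit(X_tr, y_tr)

        pred = model.predict(X_te)
        groups = df.loc[X_te.index, "group"].to_numpy()
        rate0, rate1 = pred[groups == 0].mean(), pred[groups == 1].mean()
        print("demographic parity difference:", abs(rate0 - rate1))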

    Quantum Monte Carlo simulations for estimating FOREX markets: A speculative attacks experience

    The foreign exchange markets, renowned as the largest financial markets globally, also stand out as among the most intricate due to their substantial volatility, nonlinearity, and irregular nature. Owing to these challenging attributes, various research endeavors have been undertaken to forecast future currency prices in foreign exchange with precision. The studies performed have built models using statistical methods, with the Monte Carlo algorithm being the most popular. In this study, we propose applying Auxiliary-Field Quantum Monte Carlo to increase the precision of FOREX market models across different sample sizes, testing simulations in different stress contexts. Our findings reveal that the implementation of Auxiliary-Field Quantum Monte Carlo significantly enhances the accuracy of these models, as evidenced by the minimal error and consistent estimations achieved in the FOREX market. This research holds valuable implications for both the general public and financial institutions, empowering them to anticipate significant volatility in exchange rate trends and the associated risks. These insights provide crucial guidance for future decision-making processes.
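    For readers unfamiliar with the baseline, a classical Monte Carlo simulation of exchange-rate paths under geometric Brownian motion looks as follows; the Auxiliary-Field Quantum Monte Carlo refinement studied in the paper is not reproduced here, and all parameter values are illustrative.

        # Classical Monte Carlo baseline: exchange-rate paths under geometric
        # Brownian motion (not the Auxiliary-Field Quantum variant).
        import numpy as np

        def simulate_fx_paths(s0, mu, sigma, days, n_paths, seed=0):
            rng = np.random.default_rng(seed)
            dt = 1.0 / 252.0                        # one trading day in years
            shocks = rng.standard_normal((n_paths, days))
            log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * shocks
            return s0 * np.exp(np.cumsum(log_returns, axis=1))

        paths = simulate_fx_paths(s0=1.10, mu=0.01, sigma=0.08, days=252, n_paths=10_000)
        terminal = paths[:, -1]
        print("mean terminal rate:", terminal.mean())
        print("5%-95% interval:", np.percentile(terminal, [5, 95]))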


    Scale and ecological and historical determinants of a species' geographic range: The plant parasite Phoradendron californicum Nutt. (Viscaceae)

    Geographic ranges of species are fundamental units of study in ecology and evolutionary biology, since they summarize how species' populations and individuals are organized in space and time. Here, I assess how abiotic and biotic factors limit and constrain a species' geographic range, structure its distributions, and change in importance across multiple spatial and temporal scales. I approach this challenge using models and testable hypothesis frameworks in the context of ecological, geographic, and historical conditions. Concentrating on a single species, the desert mistletoe Phoradendron californicum, I assess the relative importance of factors associated with dispersal, host-parasite-vector niche overlap, and phylogeographic patterns for cpDNA within a 6 mya timeframe and at local-to-regional geographic extents. Results from a comparison of correlative and process-based modeling approaches at resolutions of 1-50 km show that dispersal-related parameters are more relevant at finer resolutions (1-5 km), but that the importance of extinction-related parameters did not change with scale. A clearer and more comprehensive mechanistic understanding was derived from the process-based algorithm than could be obtained from correlative approaches. In a range-wide analysis, niche comparisons among parasite, hosts, and dispersers supported the parasite niche hypothesis, but not alternative hypotheses, suggesting that mistletoe infections occur in non-random environmental subsets of host and disperser ecological niches, but that different hosts get infected under similar climatic conditions, essentially where their distributions overlap that of the mistletoe. In a study of 40 species, including insects, plants, birds, mammals, and worms distributed across the globe, genetic diversity showed a negative relationship with distance to the environmental niche centroid, but no consistent relationship with distance to the geographic range center. Finally, P. californicum's cpDNA phylogenetic/phylogeographic relationships were most probable under a model of geologic events related to the formation of the Baja California Peninsula and seaways across it in the Pliocene and Pleistocene; however, the fossil record, niche projections to the LGM, and haplotype distributions suggested shifting distributions of host-mistletoe interactions and evidence of host races, which may explain some of the genealogical history of the cpDNA. In sum, the chapters presented here provide robust examples and methodologies for estimating the importance and scale at which different sets of abiotic and biotic factors act to structure a species' geographic range.
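    The niche-centrality analysis mentioned above can be illustrated with a short sketch that computes each site's distance to the environmental niche centroid and correlates it with genetic diversity; the data below are simulated for illustration and do not come from the 40-species dataset.

        # Toy niche-centrality analysis: distance to niche centroid vs. genetic diversity.
        import numpy as np
        from scipy.stats import pearsonr

        rng = np.random.default_rng(2)

        # Environmental conditions (e.g., temperature, precipitation) at occupied sites.
        env = rng.normal(size=(200, 2))
        centroid = env.mean(axis=0)
        dist_to_centroid = np.linalg.norm(env - centroid, axis=1)

        # Simulated diversity that declines away from the niche centroid, plus noise.
        diversity = 0.8 - 0.15 * dist_to_centroid + rng.normal(0, 0.05, 200)

        r, p = pearsonr(dist_to_centroid, diversity)
        print(f"Pearson r = {r:.2f}, p = {p:.1e}")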

    A Computational and Experimental Investigation of Lignin Metabolism in Arabidopsis thaliana.

    Predominantly localized in plant secondary cell walls, lignin is a highly crosslinked, aromatic polymer that imparts structural support to plant vasculature and renders biomass recalcitrant to pretreatment techniques, impeding the economical production of biofuels. Lignin is synthesized via the phenylpropanoid pathway, in which the primary precursor phenylalanine (Phe) undergoes a series of functional modifications catalyzed by 11 enzyme families to produce p-coumaryl, coniferyl, and sinapyl alcohol, which undergo random polymerization into lignin. Several metabolic engineering efforts have aimed to alter lignin content and composition and make biofuel feedstock more amenable to pretreatment techniques. Despite significant advances, several questions pertaining to carbon flux distribution in the phenylpropanoid network remain unanswered. Furthermore, the complexity of the metabolic pathway and a lack of sensitive analytical tools add to the challenges of mechanistically understanding lignin synthesis. In this work, I describe improvements in the analytical techniques used to characterize phenylpropanoid metabolism, which have been applied to obtain a comprehensive quantitative mass balance of the phenylpropanoid pathway. Machine learning and artificial intelligence were then used to make predictions about the optimal lignin amount and composition for improving saccharification. In summary, the overarching goal of this thesis was to further the understanding of lignin metabolism in the model system Arabidopsis thaliana, employing a combination of experimental and computational strategies. First, we developed comprehensive and sensitive analytical methods based on liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) to quantify intermediates of the phenylpropanoid pathway. Compared to existing targeted profiling techniques, the methods were capable of quantifying a wider range of phenylpropanoid intermediates, at lower concentrations, with minimal sample preparation. The technique was used to generate flux maps for wild-type and mutant Arabidopsis stems exogenously fed 13C6-Phe. Flux maps computed in this work (i) suggest the presence of a hitherto uncharacterized alternative route to caffeic acid and lignin synthesis, (ii) shed light on flux splits at key branch points of the network, and (iii) indicate the presence of inactive pools for a number of metabolites. Finally, we present a machine learning based model that captures the non-linear relationship between lignin content and composition and saccharification efficiency. A support vector machine (SVM) based regression technique was developed to predict saccharification efficiency and biomass yields as a function of lignin content and the composition of the monomers that make up lignin, namely p-coumaryl (H), coniferyl (G), and sinapyl (S) alcohol derived lignin. The model was trained on data obtained from the literature and validated on Arabidopsis mutants that were excluded from the training data set. Functional forms obtained from SVM regression were further optimized using genetic algorithms (GA) to maximize total sugar yields. Our efforts resulted in two optimal solutions with lower lignin content and, interestingly, varying H:G:S compositions that were conducive to saccharide extractability.
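    The final modelling step can be sketched as follows: a support vector regression maps lignin content and H:G:S monomer composition to a saccharification score, and an evolutionary optimizer then searches the fitted surface for high-yield compositions. Training data here are simulated, and SciPy's differential evolution stands in for the genetic algorithm used in the thesis.

        # SVR surrogate for saccharification efficiency, optimized over lignin inputs.
        import numpy as np
        from sklearn.svm import SVR
        from scipy.optimize import differential_evolution

        rng = np.random.default_rng(3)

        # Features: [lignin content (% dry weight), H fraction, G fraction]; S = 1 - H - G.
        X = np.column_stack([
            rng.uniform(10, 30, 300),
            rng.uniform(0.0, 0.2, 300),
            rng.uniform(0.3, 0.8, 300),
        ])
        # Simulated efficiency: lower lignin and more H-rich lignin help in this toy model.
        y = 100 - 2.0 * X[:, 0] + 40 * X[:, 1] - 10 * X[:, 2] + rng.normal(0, 2, 300)

        model = SVR(kernel="rbf", C=10.0).fit(X, y)

        # Maximize predicted efficiency (minimize its negative) over feasible inputs.
        result = differential_evolution(
            lambda v: -model.predict(v.reshape(1, -1))[0],
            bounds=[(10, 30), (0.0, 0.2), (0.3, 0.8)],
            seed=3,
        )
        lignin, h_frac, g_frac = result.x
        print(f"optimal: lignin={lignin:.1f}%, H={h_frac:.2f}, G={g_frac:.2f}, "
              f"S={1 - h_frac - g_frac:.2f}, predicted efficiency={-result.fun:.1f}")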