    Understanding Data Manipulation and How to Leverage it To Improve Generalization

    Augmentations and other transformations of data, either in the input or latent space, are a critical component of modern machine learning systems. While these techniques are widely used in practice and known to provide improved generalization in many cases, it is still unclear how data manipulation impacts learning and generalization. To take a step toward addressing the problem, this thesis focuses on understanding and leveraging data augmentation and alignment for improving machine learning performance and transfer. In the first part of the thesis, we establish a novel theoretical framework to understand how data augmentation (DA) impacts learning in linear regression and classification tasks. The results demonstrate how the spectrum of the augmented, transformed data plays a key role in characterizing the behavior of different augmentation strategies, especially in the overparameterized regime. The tools developed in this part provide simple guidelines for building new augmentation strategies and a simple framework for comparing the generalization of different types of DA. In the second part of the thesis, we demonstrate how latent data alignment can be used to tackle the domain transfer problem, where training and testing datasets vary in distribution. Our algorithm builds upon joint clustering and data matching through optimal transport, and outperforms pure matching baselines on both synthetic and real datasets. Extensions of the generalization analysis and algorithm design for data augmentation and alignment to nonlinear models, such as artificial neural networks and random feature models, are discussed. This thesis provides tools and analyses for better data manipulation design, which benefit both supervised and unsupervised learning schemes.
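
    As a concrete instance of how an augmentation's effect shows up through the data spectrum, consider the classical fact (not the thesis's actual framework) that training linear regression on Gaussian-noise-augmented inputs is, in expectation, ridge regression, which shrinks the estimator along the singular directions of the data. A minimal NumPy sketch, with illustrative dimensions, noise level, and copy count:

```python
# A minimal sketch, assuming Gaussian input-noise augmentation for linear
# regression (a classical setting, not the thesis's framework): training on
# noisy copies of X is, in expectation, ridge regression, which shrinks the
# estimator along the singular directions of X.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 200, 0.5                     # overparameterized: d > n
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Empirical augmented estimator: least squares on stacked noisy copies.
copies = 200
Xa = np.vstack([X + sigma * rng.normal(size=X.shape) for _ in range(copies)])
ya = np.tile(y, copies)
w_aug = np.linalg.lstsq(Xa, ya, rcond=None)[0]

# Its expectation: ridge with lambda = n * sigma^2, i.e. each singular
# value s_i of X is shrunk by the factor s_i^2 / (s_i^2 + lambda).
lam = n * sigma**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.linalg.norm(w_aug - w_ridge) / np.linalg.norm(w_ridge))  # small
```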

    Using Gaze for Behavioural Biometrics

    A principled approach to the analysis of eye movements for behavioural biometrics is laid down. The approach is grounded in foraging theory, which provides a sound basis for capturing the uniqueness of individual eye movement behaviour. We propose a composite Ornstein-Uhlenbeck process for quantifying the exploration/exploitation signature characterising foraging eye behaviour. The relevant parameters of the composite model, inferred from eye-tracking data via Bayesian analysis, are shown to yield a suitable feature set for biometric identification; the latter is eventually accomplished via a classical classification technique. A proof of concept of the method is provided by measuring its identification performance on a publicly available dataset. Data and code for reproducing the analyses are made available. Overall, we argue that the approach offers a fresh view on both the analysis of eye-tracking data and prospective applications in this field.
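
    For readers unfamiliar with the building block, the sketch below simulates a single Ornstein-Uhlenbeck component via Euler-Maruyama. The parameter values are hypothetical, chosen only to contrast a fixation-like (strongly mean-reverting) regime with an exploration-like (diffusive) one; the paper's composite model and its Bayesian inference are not reproduced here.

```python
# A minimal sketch of one Ornstein-Uhlenbeck component (Euler-Maruyama);
# all parameter values are illustrative assumptions, not the paper's.
import numpy as np

def simulate_ou(theta, mu, sigma, x0, dt, steps, rng):
    """dx = theta * (mu - x) dt + sigma dW  (mean-reverting gaze coordinate)."""
    x = np.empty(steps)
    x[0] = x0
    for t in range(1, steps):
        x[t] = x[t-1] + theta * (mu - x[t-1]) * dt \
               + sigma * np.sqrt(dt) * rng.normal()
    return x

rng = np.random.default_rng(1)
# "Exploitation" (fixation-like): strong mean reversion, small diffusion.
fixation = simulate_ou(theta=8.0, mu=0.0, sigma=0.3, x0=0.0,
                       dt=0.01, steps=300, rng=rng)
# "Exploration" (diffusive): weak reversion, large diffusion.
explore = simulate_ou(theta=0.5, mu=0.0, sigma=3.0, x0=0.0,
                      dt=0.01, steps=300, rng=rng)
```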

    Learning, Teaching and Research in a Digitally Shaped World. Gesellschaft für Didaktik der Chemie und Physik, Annual Conference in Aachen 2022

    The conference of the Gesellschaft für Didaktik der Chemie und Physik (GDCP) took place from 12 to 15 September 2022 at RWTH Aachen. This volume comprises the participants' written contributions on the conference theme, "Learning, Teaching and Research in a Digitally Shaped World" ("Lernen, Lehren und Forschen in der digital geprägten Welt").

    Probabilistic Numerical Linear Algebra for Machine Learning

    Machine learning models are becoming increasingly essential in domains where critical decisions must be made under uncertainty, such as public policy, medicine or robotics. For a model to be useful for decision-making, it must convey a degree of certainty in its predictions. Bayesian models are well-suited to such settings due to their principled uncertainty quantification, given a set of assumptions about the problem and data-generating process. While inference in a Bayesian model is fully specified in theory, in practice numerical approximations have a significant impact on the resulting posterior. Therefore, model-based decisions are not just determined by the data but also by the numerical method. This raises the question of how we can account for the adverse impact of numerical approximations on inference. Arguably, the most common numerical task in scientific computing is the solution of linear systems, which arise in probabilistic inference, graph theory, differential equations and optimization. In machine learning, these systems are typically large-scale, subject to noise and arise from generative processes. These unique characteristics call for specialized solvers. In this thesis, we propose a class of probabilistic linear solvers, which infer the solution to a linear system and can be interpreted as learning algorithms themselves. Importantly, they can leverage problem structure and propagate their error to the prediction of the underlying probabilistic model. Next, we apply such solvers to accelerate Gaussian process inference. While Gaussian processes are a principled and flexible model class, for large datasets inference is computationally prohibitive in both time and memory due to the required computations with the kernel matrix. We show that by approximating the posterior with a probabilistic linear solver, we can invest an arbitrarily small amount of computation and still obtain a provably coherent prediction that quantifies uncertainty exactly. Finally, we demonstrate that Gaussian process hyperparameter optimization can similarly be accelerated by leveraging structural prior knowledge in the model via preconditioning of iterative methods. Combined with modern parallel hardware, this enables training Gaussian process models on datasets with hundreds of thousands of data points. In summary, we demonstrate that interpreting numerical methods in linear algebra as probabilistic learning algorithms unlocks significant performance improvements for Gaussian process models. Crucially, we show how to account for the impact of numerical approximations on model predictions via uncertainty quantification. This enables an explicit trade-off between computational resources and confidence in a prediction. The techniques developed in this thesis have advanced the understanding of probabilistic linear solvers, shifted the goalposts of what can be expected from Gaussian process approximations, and defined the way large-scale Gaussian process hyperparameter optimization is performed in GPyTorch, arguably the most popular library for Gaussian processes in Python.
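
    The central computational object here is the linear system (K + sigma^2 I) v = y. A minimal sketch of the truncation idea, using plain conjugate gradients rather than the thesis's probabilistic solver (which additionally propagates the solver's error into the posterior uncertainty): each CG iteration refines the GP posterior mean, so compute can be cut off at any budget. All names and values below are illustrative.

```python
# A minimal sketch (illustrative, not the thesis's solver): GP posterior-mean
# prediction via truncated conjugate gradients on (K + noise * I) v = y,
# avoiding a full Cholesky factorization of the kernel matrix.
import numpy as np

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel matrix between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def cg(matvec, b, iters):
    """Conjugate gradients for a symmetric positive-definite system."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
K, noise = rbf(X, X), 0.01
v = cg(lambda u: K @ u + noise * u, y, iters=20)  # fewer iters = less compute
Xs = np.linspace(-3, 3, 100)[:, None]
mean = rbf(Xs, X) @ v                             # approximate posterior mean
```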

    Pessimistic Bayesianism for conservative optimization and imitation

    Subject to several assumptions, sufficiently advanced reinforcement learners would likely face an incentive and likely have an ability to intervene in the provision of their reward, with catastrophic consequences. In this thesis, I develop a theory of pessimism and show how it can produce safe advanced artificial agents. Not only do I demonstrate that the assumptions mentioned above can be avoided; I prove theorems which demonstrate safety. First, I develop an idealized pessimistic reinforcement learner. For any given novel event that a mentor would never cause, a sufficiently pessimistic reinforcement learner trained with the help of that mentor would probably avoid causing it. This result is without precedent in the literature. Next, on similar principles, I develop an idealized pessimistic imitation learner. If the probability of an event when the demonstrator acts can be bounded above, then the probability can be bounded above when the imitator acts instead; this kind of result is unprecedented when the imitator learns online and the environment never resets. In an environment that never resets, no one has previously demonstrated, to my knowledge, that an imitation learner even exists. Finally, both of the agents above demand more efficient algorithms for high-quality uncertainty quantification, so I have developed a new kernel for Gaussian process modelling that allows for log-linear time complexity and linear space complexity, instead of a naïve cubic time complexity and quadratic space complexity. This is not the first Gaussian process with this time complexity—inducing points methods have linear complexity—but we do outperform such methods significantly on regression benchmarks, as one might expect given the much higher dimensionality of our kernel. This thesis shows the viability of pessimism with respect to well-quantified epistemic uncertainty as a path to safe artificial agency
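
    A toy sketch of the pessimism principle as described, under assumptions of my own (a fixed set of posterior samples over action values and a quantile-based pessimism level beta, none of which come from the thesis): acting on a low quantile of the value posterior rejects actions whose consequences are poorly understood.

```python
# A hypothetical toy, not the thesis's agent: act on the worst-case
# (low-quantile) value across sampled world models, so actions with
# highly uncertain outcomes are avoided.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_actions = 50, 4
# Posterior samples of each action's value (e.g., from Bayesian inference).
value_samples = rng.normal(loc=[1.0, 1.2, 0.9, 1.1],
                           scale=[0.1, 0.1, 0.1, 2.0],
                           size=(n_models, n_actions))

beta = 0.1  # pessimism level: act on the beta-quantile of the value posterior
pessimistic_value = np.quantile(value_samples, beta, axis=0)
action = int(np.argmax(pessimistic_value))
# The last action has the highest mean but huge uncertainty; pessimism rejects it.
```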

    Advances in Molecular Simulation

    Molecular simulations are commonly used in physics, chemistry, biology, materials science, engineering, and even medicine. This book provides a wide range of molecular simulation methods and their applications in various fields. It reflects the power of molecular simulation as an effective research tool. We hope that the presented results can provide an impetus for further fruitful studies.

    Multivariate Statistical Machine Learning Methods for Genomic Prediction

    This open access book, published under a CC BY 4.0 license, brings together the latest genome-based prediction models currently being used by statisticians, breeders and data scientists. It provides an accessible way to understand the theory behind each statistical learning tool, the required pre-processing, the basics of model building, how to train statistical learning methods, the basic R scripts needed to implement each statistical learning tool, and the output of each tool. To do so, for each tool the book provides background theory, some elements of the R statistical software for its implementation, the conceptual underpinnings, and at least two illustrative examples with data from real-world genomic selection experiments. Lastly, worked-out examples help readers check their own comprehension. The book will greatly appeal to readers in plant (and animal) breeding, and to geneticists and statisticians, as it provides in a very accessible way the necessary theory, the appropriate R code, and illustrative examples for a complete understanding of each statistical learning tool. In addition, it weighs the advantages and disadvantages of each tool.
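
    The book's worked examples are in R; as a language-neutral illustration of the core idea (genome-based prediction as penalized regression on a centered marker matrix, the ridge/GBLUP connection), here is a hypothetical Python sketch on simulated genotypes and phenotypes. The sizes and penalty are illustrative only.

```python
# A minimal sketch of genome-based prediction as ridge regression on a
# marker matrix (the GBLUP idea), with simulated data; not the book's code.
import numpy as np

rng = np.random.default_rng(0)
n_lines, n_markers = 300, 2000
M = rng.integers(0, 3, size=(n_lines, n_markers)).astype(float)  # 0/1/2 genotypes
M -= M.mean(axis=0)                                # center marker columns
beta = rng.normal(scale=0.05, size=n_markers)      # small additive effects
y = M @ beta + rng.normal(size=n_lines)            # phenotype = genetics + noise

train, test = slice(0, 250), slice(250, None)
lam = 50.0                                         # shrinkage; tuned by CV in practice
A = M[train].T @ M[train] + lam * np.eye(n_markers)
b_hat = np.linalg.solve(A, M[train].T @ y[train])
accuracy = np.corrcoef(M[test] @ b_hat, y[test])[0, 1]  # predictive correlation
```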

    Non-parametric machine learning for biological sequence data

    In the past decade there has been a massive increase in the volume of biological sequence data, driven by massively parallel sequencing technologies. This has enabled data-driven statistical analyses using non-parametric predictive models (including those from machine learning) to complement more traditional, hypothesis-driven approaches. This thesis addresses several challenges that arise when applying non-parametric predictive models to biological sequence data. Some of these challenges arise from the nature of the biological system of interest. For example, in the study of the human microbiome the phylogenetic relationships between microorganisms are often ignored in statistical analyses. This thesis outlines a novel approach to modelling phylogenetic similarity using string kernels and demonstrates its utility in the two-sample test and in host-trait prediction. Other challenges arise from limitations in our understanding of the models themselves. For example, calculating variable importance (a key task in biomedical applications) is not possible for many models. This thesis describes a novel extension of an existing approach to compute importance scores for grouped variables in a Bayesian neural network. It also explores the behaviour of random forest classifiers when applied to microbial datasets, with a focus on the robustness of the biological findings under different modelling assumptions.
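
    For context, a minimal example of the string-kernel machinery such work builds on: the classic k-mer spectrum kernel, which compares sequences via inner products of their k-mer count vectors. The thesis's phylogeny-aware kernel is more involved; this sketch shows only the standard baseline, with toy sequences.

```python
# A minimal sketch of the k-mer spectrum kernel for biological sequences
# (the standard baseline, not the thesis's phylogeny-aware kernel).
from collections import Counter
import numpy as np

def kmer_counts(seq, k=3):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s1, s2, k=3):
    """Inner product of the two sequences' k-mer count vectors."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1 if m in c2)

seqs = ["ACGTACGTGG", "ACGTTCGTGA", "TTTTGGGGCC"]  # toy sequences
K = np.array([[spectrum_kernel(a, b) for b in seqs] for a in seqs])
# K is positive semi-definite and can feed a kernel two-sample test
# or a kernel machine for host-trait prediction.
```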

    Postharvest Management of Fruits and Vegetables

    All articles in the presented collection are high-quality examples of both basic and applied research. The publications collectively refer to apples, bananas, cherries, kiwi fruit, mango, grapes, green bean pods, pomegranates, sweet pepper, sweet potato tubers and tomato, and are aimed at improving the postharvest quality and storage extension of fresh produce. The experimental works include the following postharvest treatments: 1-methylcyclopropene, methyl jasmonate, immersion in edible coatings (aloe, chitosan, plant extracts, nanoemulsions, ethanol, ascorbic acid and essential oil solutions), heat treatments, packaging, innovative packaging materials, low temperature, low-O2 and high-CO2 modified atmospheres, and the development of non-destructive techniques to measure soluble solids with infrared and near-infrared spectroscopy. Preharvest treatments were also included, such as chitosan application, fruit kept on the vine, and cultivation under far-red light. Quality assessment depended on species, treatment and storage conditions in each case and included evaluation of color, bruising, water loss, organoleptic estimation and texture changes, in addition to changes in the concentrations of sugars, organic acids, amino acids, fatty acids, carotenoids, tocopherols, phytosterols, phenolic compounds and aroma volatiles. Gene transcription related to ethylene biosynthesis, modification of cell wall components, synthesis of aroma compounds and lipid metabolism was also the focus of some of the articles.

    Log-Linear-Time Gaussian Processes Using Binary Tree Kernels

    Gaussian processes (GPs) produce good probabilistic models of functions, but most GP kernels require $O((n+m)n^2)$ time, where $n$ is the number of data points and $m$ the number of predictive locations. We present a new kernel that allows for Gaussian process regression in $O((n+m)\log(n+m))$ time. Our "binary tree" kernel places all data points on the leaves of a binary tree, with the kernel depending only on the depth of the deepest common ancestor. We can store the resulting kernel matrix in $O(n)$ space in $O(n \log n)$ time, as a sum of sparse rank-one matrices, and approximately invert the kernel matrix in $O(n)$ time. Sparse GP methods also offer linear run time, but they predict less well than higher dimensional kernels. On a classic suite of regression tasks, we compare our kernel against Matérn, sparse, and sparse variational kernels. The binary tree GP assigns the highest likelihood to the test data on a plurality of datasets, usually achieves lower mean squared error than the sparse methods, and often ties or beats the Matérn GP. On large datasets, the binary tree GP is fastest, and much faster than a Matérn GP. Comment: NeurIPS 2022; 9 pages + appendices
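
    A minimal sketch of the kernel's core idea: with each leaf encoded as a root-to-leaf bit string, the deepest common ancestor of two leaves corresponds to their longest common prefix. The weights, paths, and the naive dense construction below are illustrative only; the paper's sparse rank-one storage and fast inversion are not shown.

```python
# A minimal sketch of the binary tree kernel's core idea: k(x, x') depends
# only on the depth of the deepest common ancestor of the two leaves.
# Leaves are encoded as root-to-leaf bit strings (an illustrative choice).
import numpy as np

def dca_depth(path_a, path_b):
    """Depth of the deepest common ancestor = longest common prefix length."""
    depth = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        depth += 1
    return depth

def tree_kernel(path_a, path_b, weights):
    # weights[d] = kernel value when the deepest common ancestor has depth d
    return weights[dca_depth(path_a, path_b)]

paths = ["0010", "0011", "0110", "1000"]       # 4 leaves of a depth-4 tree
weights = np.array([0.1, 0.3, 0.6, 0.9, 1.0])  # increasing with shared depth
K = np.array([[tree_kernel(a, b, weights) for b in paths] for a in paths])
# The paper exploits this structure to store K as a sum of sparse rank-one
# matrices in O(n) space rather than materializing it densely as done here.
```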