
    Statistical computation with kernels

    Modern statistical inference has seen a tremendous increase in the size and complexity of models and datasets. As such, it has become reliant on advanced computational tools for implementation. A first canonical problem in this area is the numerical approximation of integrals of complex and expensive functions. Numerical integration is required for a variety of tasks, including prediction, model comparison and model choice. A second canonical problem is that of statistical inference for models with intractable likelihoods. These include models with intractable normalisation constants, or models which are so complex that their likelihood cannot be evaluated, but from which data can be generated. Examples include large graphical models, as well as many models in imaging or spatial statistics. This thesis proposes to tackle these two problems using tools from the kernel methods and Bayesian non-parametrics literature. First, we analyse a well-known algorithm for numerical integration called Bayesian quadrature, and provide consistency and contraction rates. The algorithm is then assessed on a variety of statistical inference problems, and extended in several directions in order to reduce its computational requirements. We then demonstrate how the combination of reproducing kernels with Stein's method can lead to computational tools which can be used with unnormalised densities, including numerical integration and approximation of probability measures. We conclude by studying two minimum distance estimators derived from kernel-based statistical divergences which can be used for unnormalised and generative models. In each instance, the tractability provided by reproducing kernels and their properties allows us to provide easily-implementable algorithms whose theoretical foundations can be studied in depth.
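    Not from the thesis itself, the following is a minimal numerical sketch of the Bayesian quadrature idea mentioned above, assuming a squared-exponential kernel and a standard Gaussian integration measure in one dimension (the closed-form kernel mean used below is specific to that pairing); names and constants are illustrative.

```python
import numpy as np

def bayesian_quadrature(f, nodes, lengthscale=1.0, jitter=1e-10):
    """Bayesian quadrature estimate of int f(x) N(x; 0, 1) dx in 1D.

    Places a zero-mean GP prior with squared-exponential kernel on f,
    conditions on evaluations at `nodes`, and integrates the posterior mean.
    """
    x = np.asarray(nodes, dtype=float)
    fx = np.array([f(xi) for xi in x])

    # Gram matrix of k(x, x') = exp(-(x - x')^2 / (2 l^2)).
    K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale**2)

    # Kernel mean z_i = int k(x, x_i) N(x; 0, 1) dx, closed form for this
    # kernel/measure pair.
    z = lengthscale / np.sqrt(lengthscale**2 + 1.0) * np.exp(
        -0.5 * x**2 / (lengthscale**2 + 1.0)
    )

    weights = np.linalg.solve(K + jitter * np.eye(len(x)), z)
    estimate = weights @ fx

    # Posterior variance of the integral: int int k dpi dpi - z^T K^{-1} z.
    prior_integral_var = lengthscale / np.sqrt(lengthscale**2 + 2.0)
    variance = prior_integral_var - z @ weights
    return estimate, variance

# Example: int x^2 N(x; 0, 1) dx = 1.
est, var = bayesian_quadrature(lambda t: t**2, np.linspace(-3, 3, 15))
```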

    Bayesian inference by active sampling


    Clustering and the Three-Point Function

    We develop analytical methods for computing the structure constant for three heavy operators, starting from the recently proposed hexagon approach. Such a structure constant is a semiclassical object, with the scale set by the inverse length of the operators playing the role of the Planck constant. We reformulate the hexagon expansion in terms of multiple contour integrals and recast it as a sum over clusters generated by the residues of the measure of integration. We test the method on two examples. First, we compute the asymptotic three-point function of heavy fields at any coupling and show that, in the semiclassical limit, the result matches both the string theory computation at strong coupling and the tree-level results obtained before. Second, in the case of one non-BPS and two BPS operators at strong coupling we sum up all wrapping corrections associated with the opposite bridge to the non-trivial operator, or the "bottom" mirror channel. We also give an alternative interpretation of the results in terms of a gas of fermions and show that they can be expressed compactly as an operator-valued super-determinant. Comment: 52 pages + a few appendices; v2 typos corrected.

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
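    The following is not the BlogForever pipeline, only a hedged illustration of the RSS-plus-HTML idea described above: each feed entry's summary is used as a weak signal to locate the post body in the rendered page. It assumes the feedparser, requests and BeautifulSoup libraries, and the word-overlap scoring heuristic is purely illustrative.

```python
import feedparser
import requests
from bs4 import BeautifulSoup

def locate_post_content(feed_url):
    """Use RSS summaries as weak supervision to find the HTML element
    that holds the post body on each linked page."""
    feed = feedparser.parse(feed_url)
    results = []
    for entry in feed.entries:
        summary_text = BeautifulSoup(entry.get("summary", ""), "html.parser").get_text()
        summary_words = set(summary_text.split())

        html = requests.get(entry.link, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")

        # Score candidate containers by word overlap with the feed summary;
        # the best-scoring <div> or <article> is taken as the content block.
        best, best_score = None, 0
        for node in soup.find_all(["div", "article"]):
            overlap = len(summary_words & set(node.get_text().split()))
            if overlap > best_score:
                best, best_score = node, overlap
        results.append((entry.link, best))
    return results
```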

    Automating data mart construction from semi-structured data sources

    The global food and agricultural industry had a total market value of USD 8 trillion in 2016, and decision makers in the Agri sector require appropriate tools and up-to-date information to make predictions across a range of products and areas. Traditionally, these requirements are met with information processed into a data warehouse and data marts constructed for analyses. Increasingly, however, data is coming from outside the enterprise, and often in unprocessed forms. As these sources are outside the control of companies, they are prone to change, and new sources may appear. In these cases, the process of accommodating these sources can be costly and very time consuming. Automating this process requires a sufficiently robust Extract-Transform-Load (ETL) process, in which external sources are mapped to some form of ontology, together with an integration process to merge the specific data sources. In this paper, we present an approach to automating the integration of data sources in an Agri environment, where new sources are examined before any attempt to merge them with existing data marts. Our validation uses three separate case studies of real-world data to demonstrate the robustness of our approach and the efficiency of materialising data marts.
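    As a rough illustration of the mapping step described above (not the authors' system), the toy sketch below matches the field names of a new semi-structured source against a small controlled vocabulary before appending rows to a data-mart table; the ontology, aliases and schema are invented for the example.

```python
# Toy sketch: map a new semi-structured source onto a shared vocabulary
# ("ontology") before merging it into an existing data-mart table.
ONTOLOGY = {                      # canonical attribute -> known source aliases
    "commodity": {"commodity", "product", "crop"},
    "price_usd": {"price", "price_usd", "value_usd"},
    "report_date": {"date", "report_date", "timestamp"},
}

def infer_mapping(sample_record):
    """Match source field names against the ontology's aliases."""
    mapping = {}
    for field in sample_record:
        for canonical, aliases in ONTOLOGY.items():
            if field.lower() in aliases:
                mapping[field] = canonical
    return mapping

def load_into_mart(records, mart_rows):
    """Transform records with the inferred mapping and append them to the mart."""
    mapping = infer_mapping(records[0])
    for rec in records:
        mart_rows.append({canonical: rec[src] for src, canonical in mapping.items()})
    return mart_rows

new_source = [{"Crop": "wheat", "Value_USD": 212.5, "Date": "2016-07-01"}]
mart = load_into_mart(new_source, mart_rows=[])
```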

    On Novel Approaches to Model-Based Structural Health Monitoring

    Structural health monitoring (SHM) strategies have classically fallen into two main categories: model-driven and data-driven methods. The former utilises physics-based models and inverse techniques to infer the health state of a structure from changes in updated parameters, and is hence described as an inverse model-driven approach. The latter frames SHM within a statistical pattern recognition paradigm; these methods require no physical modelling, instead inferring relationships between data and health states directly. Although successes have been achieved with both approaches, each suffers from significant drawbacks, namely parameter estimation and interpretation difficulties within the inverse model-driven framework, and a lack of available full-system damage-state data for data-driven techniques.

    Consequently, this thesis seeks to outline and develop a framework for an alternative category of approach: forward model-driven SHM. This class of strategies utilises calibrated physics-based models, in a forward manner, to generate health-state data (i.e. the undamaged condition and the damage states of interest) for training machine learning or pattern recognition technologies. The framework thereby seeks to address the issues above by removing the need to make health decisions from updated parameters and by providing a mechanism for obtaining health-state data. In light of this objective, a framework for forward model-driven SHM is established, highlighting the key challenges and technologies required for realising this category of approach. The framework is constructed from two main components: generating physics-based models that accurately predict outputs under various damage scenarios, and machine learning methods used to infer decision bounds. This thesis deals with the former, developing technologies and strategies for producing statistically representative predictions from physics-based models. Specifically, this work seeks to define validation within this context and propose a validation strategy, to develop technologies that infer uncertainties from various sources, including model discrepancy, and to offer a solution to the issue of validating full-system predictions when data are not available at this level.

    The first section defines validation within a forward model-driven context, offering a strategy built on hypothesis testing, statistical distance metrics, visualisation tools such as the witness function, and deterministic metrics. The field of statistical distances is shown to provide a wealth of potential validation metrics that consider whole probability distributions, and existing validation metrics can be categorised within this field's terminology, providing greater insight.

    In the second part of this study, emulator technologies, specifically Gaussian Process (GP) methods, are discussed. Practical implementation considerations are examined, including the establishment of validation and diagnostic techniques. Various GP extensions are outlined, with particular focus on technologies for dealing with large data sets and their applicability as emulators. Utilising these technologies, two techniques for calibrating models whilst accounting for and inferring model discrepancies are demonstrated: Bayesian Calibration and Bias Correction (BCBC) and Bayesian History Matching (BHM). Both methods were applied to representative building structures in order to demonstrate their effectiveness within a forward model-driven SHM strategy. Sequential design heuristics were developed for BHM, along with an importance-sampling-based technique for inferring the functional model discrepancy uncertainties.

    The third body of work proposes a multi-level uncertainty integration strategy based on a subfunction discrepancy approach. This technique seeks to provide a methodology for producing valid full-system predictions through a combination of validated sub-system models in which uncertainties and model discrepancy have been quantified; the procedure is demonstrated on a numerical shear structure, where it is shown to be effective. Finally, conclusions about the aforementioned technologies are provided, and a review of future directions for forward model-driven SHM is outlined, with the hope that this category receives wider investigation within the SHM community.
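    A minimal sketch of one ingredient mentioned above, Bayesian History Matching, is given below: a GP emulator (here scikit-learn's GaussianProcessRegressor, an assumption rather than the implementation used in the thesis) combined with the standard implausibility measure used to rule out parameter settings. The simulator, observation and variance terms are toy placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Emulate an expensive simulator from a small design of runs (toy 1D example).
rng = np.random.default_rng(0)
X_design = rng.uniform(0.0, 1.0, size=(12, 1))
y_design = np.sin(6.0 * X_design[:, 0])          # stand-in for simulator output

emulator = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-6)
emulator.fit(X_design, y_design)

# Implausibility I(x) = |z - E[f(x)]| / sqrt(Var_em(x) + V_disc + V_obs):
# parameter settings with I(x) above a cut-off (commonly 3) are ruled out.
z_obs = 0.4                          # observed system output (toy value)
v_disc, v_obs = 0.05**2, 0.02**2     # assumed discrepancy / observation variances

X_grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
mean, std = emulator.predict(X_grid, return_std=True)
implausibility = np.abs(z_obs - mean) / np.sqrt(std**2 + v_disc + v_obs)
not_ruled_out = X_grid[implausibility < 3.0]
```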

    ToyArchitecture: Unsupervised Learning of Interpretable Models of the World

    Research in Artificial Intelligence (AI) has focused mostly on two extremes: either on small improvements in narrow AI domains, or on universal theoretical frameworks which are usually uncomputable or incompatible with theories of biological intelligence, or which lack practical implementations. The goal of this work is to combine the main advantages of the two: to follow a big-picture view while providing a particular theory and its implementation. In contrast with purely theoretical approaches, the resulting architecture should be usable in realistic settings, but also form the core of a framework containing all the basic mechanisms, into which it should be easier to integrate additional required functionality. In this paper, we present a novel, purposely simple, and interpretable hierarchical architecture which combines multiple different mechanisms into one system: unsupervised learning of a model of the world, learning the influence of one's own actions on the world, model-based reinforcement learning, hierarchical planning and plan execution, and symbolic/sub-symbolic integration in general. The learned model is stored in the form of hierarchical representations with the following properties: 1) they are increasingly more abstract, but can retain details when needed, and 2) they are easy to manipulate in their local and symbolic-like form, thus also allowing one to observe the learning process at each level of abstraction. On all levels of the system, the representation of the data can be interpreted in both a symbolic and a sub-symbolic manner. This enables the architecture to learn efficiently using sub-symbolic methods and to employ symbolic inference. Comment: Revision: changed the pdftitle.
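    The sketch below is emphatically not the ToyArchitecture implementation; it is only a generic illustration, using k-means as a stand-in, of the idea that each level of a hierarchy can compress its input into discrete (symbolic-like) codes while retaining continuous (sub-symbolic) centroids, with higher levels operating on windows of lower-level codes and hence becoming more abstract.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy two-level hierarchy: each level compresses its input into discrete codes
# (symbolic view) while keeping cluster centroids (sub-symbolic view).
rng = np.random.default_rng(1)
observations = rng.normal(size=(500, 8))          # stand-in for raw sensory input

level1 = KMeans(n_clusters=16, n_init=10, random_state=0).fit(observations)
codes1 = level1.labels_                           # symbolic codes at level 1

# Level 2 sees short windows of level-1 codes (one-hot averaged), so its
# clusters capture longer-range, more abstract regularities.
window = 5
onehot = np.eye(16)[codes1]
windows = np.stack([onehot[i:i + window].mean(axis=0)
                    for i in range(len(codes1) - window)])
level2 = KMeans(n_clusters=8, n_init=10, random_state=0).fit(windows)
codes2 = level2.labels_                           # more abstract symbolic codes
```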

    Advances in Non-parametric Hypothesis Testing with Kernels

    Non-parametric statistical hypothesis testing procedures aim to distinguish the null hypothesis from the alternative with minimal assumptions on the model distributions. In recent years, the maximum mean discrepancy (MMD) has been developed as a measure for comparing two distributions, applicable to two-sample problems and independence tests. With the aid of sufficiently rich reproducing kernel Hilbert spaces (RKHS), MMD enjoys desirable statistical properties, including the characteristic property (the ability to distinguish any two distributions), consistency, and maximal test power. Moreover, MMD has seen empirical success in complex tasks such as training and comparing generative models. Stein's method also provides an elegant probabilistic tool for comparing unnormalised distributions, which commonly appear in practical machine learning tasks. Combined with a sufficiently rich RKHS, the kernel Stein discrepancy (KSD) has been developed as a proper discrepancy measure between distributions, which can be used to tackle one-sample problems (or goodness-of-fit tests). The existing development of KSD applies to a limited choice of domains, such as Euclidean spaces or finite discrete sets, and requires complete data observations, while current MMD constructions are limited by the choice of simple kernels, for which test power suffers, e.g. on high-dimensional image data. The main focus of this thesis is the further advancement of kernel-based statistics for hypothesis testing. Firstly, Stein operators compatible with broader data domains are developed in order to perform the corresponding goodness-of-fit tests; in particular, goodness-of-fit tests for general unnormalised densities on Riemannian manifolds, which have non-Euclidean topology, are developed. In addition, novel non-parametric goodness-of-fit tests for data with censoring are studied. Tests for data observations with left truncation are then studied: for example, the time of entering a hospital always precedes the time of death in the hospital, and we say the death time is truncated by the entry time. We test the notion of independence beyond truncation by proposing a kernelised measure for quasi-independence. Finally, we study deep kernel architectures to improve two-sample testing performance.
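    As a concrete reference point for the MMD-based two-sample test discussed above, the sketch below computes the standard unbiased estimate of MMD squared with a Gaussian kernel (the classical estimator, not the deep-kernel variant developed in the thesis); the sample sizes and bandwidth are illustrative.

```python
import numpy as np

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between samples X (m, d) and Y (n, d)
    with the Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq / bandwidth**2)

    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    m, n = len(X), len(Y)
    # Drop diagonal terms so the within-sample averages are unbiased.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))       # near 0
diff = mmd2_unbiased(rng.normal(size=(200, 2)), rng.normal(1.0, 1.0, (200, 2)))  # clearly > 0
```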