6 research outputs found

    Learning Kernel Tests Without Data Splitting

    Modern large-scale kernel-based tests such as the maximum mean discrepancy (MMD) and the kernelized Stein discrepancy (KSD) optimize kernel hyperparameters on a held-out sample, obtained via data splitting, to maximize the power of the test statistic. While data splitting yields a tractable null distribution, it reduces test power because the test is then run on a smaller sample. Inspired by the selective inference framework, we propose an approach that enables learning the hyperparameters and testing on the full sample without data splitting. Our approach correctly calibrates the test in the presence of the dependency this reuse introduces, and yields a test threshold in closed form. At the same significance level, our approach's test power is empirically larger than that of the data-splitting approach, regardless of its split proportion. Comment: 24 (10+14) pages, 9 figures. Under review. v2: added missing references and acknowledgment
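The baseline this paper improves on can be sketched as a plain kernel two-sample test: an MMD^2 estimate compared against a permutation-based null. This is only an illustrative stand-in (the paper's contribution is a closed-form threshold that avoids both permutation and data splitting), and every name below is hypothetical.

```python
# Sketch of a kernel two-sample test: biased MMD^2 estimate with a
# permutation null. Illustrative only; not the paper's method.
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))

def mmd2(x, y, k=gaussian_kernel):
    """Biased V-statistic estimate of MMD^2 between samples x and y."""
    kxx = sum(k(a, b) for a in x for b in x) / (len(x) ** 2)
    kyy = sum(k(a, b) for a in y for b in y) / (len(y) ** 2)
    kxy = sum(k(a, b) for a in x for b in y) / (len(x) * len(y))
    return kxx + kyy - 2 * kxy

def permutation_test(x, y, n_perm=100, alpha=0.05, seed=0):
    """Reject H0 (same distribution) if the observed MMD^2 exceeds the
    (1 - alpha) quantile of the permutation null."""
    rng = random.Random(seed)
    observed = mmd2(x, y)
    pooled = list(x) + list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        null.append(mmd2(pooled[:len(x)], pooled[len(x):]))
    null.sort()
    threshold = null[int((1 - alpha) * n_perm)]
    return observed > threshold

rng = random.Random(1)
same = permutation_test([rng.gauss(0, 1) for _ in range(40)],
                        [rng.gauss(0, 1) for _ in range(40)])
diff = permutation_test([rng.gauss(0, 1) for _ in range(40)],
                        [rng.gauss(3, 1) for _ in range(40)])
```

Note that any kernel hyperparameter (here the bandwidth) tuned on the same data would invalidate this null distribution, which is precisely the dependency the paper's selective-inference correction accounts for.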

    On the use of random forest for two-sample testing

    Following the line of classification-based two-sample testing, tests based on the Random Forest classifier are proposed. The developed tests are easy to use, require almost no tuning, and are applicable to any distribution on R^d. Furthermore, the Random Forest's built-in variable importance measure gives potential insight into which variables account for the difference in distribution. An asymptotic power analysis of the proposed tests is conducted. Finally, two real-world applications illustrate the usefulness of the introduced methodology. To simplify the use of the method, the R package “hypoRF” is provided.
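The general recipe behind classification-based two-sample testing can be sketched as follows: pool and label the two samples, train a classifier to tell them apart, and test whether its held-out accuracy beats chance. The paper's tests use a Random Forest; here a 1-nearest-neighbour classifier stands in so the sketch needs only the standard library, and all names are hypothetical.

```python
# Sketch of a classification-based two-sample test with a 1-NN
# classifier standing in for the paper's Random Forest.
import math
import random

def classifier_two_sample_test(x, y, alpha=0.05, seed=0):
    rng = random.Random(seed)
    data = [(v, 0) for v in x] + [(v, 1) for v in y]
    rng.shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    # Classify each held-out point by the label of its nearest training point.
    correct = 0
    for v, label in test:
        nearest = min(train, key=lambda t: abs(t[0] - v))
        correct += (nearest[1] == label)
    n = len(test)
    acc = correct / n
    # Under H0 (same distribution), accuracy ~ Binomial(n, 0.5)/n;
    # one-sided z-test for accuracy above chance.
    z = (acc - 0.5) / math.sqrt(0.25 / n)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # 1 - Phi(z)
    return p_value < alpha, acc

rng = random.Random(1)
reject, acc = classifier_two_sample_test(
    [rng.gauss(0, 1) for _ in range(100)],
    [rng.gauss(2, 1) for _ in range(100)])
```

A Random Forest in place of the 1-NN step additionally yields variable importances, which is the interpretability benefit the abstract highlights.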

    Non-parametric machine learning for biological sequence data

    In the past decade there has been a massive increase in the volume of biological sequence data, driven by massively parallel sequencing technologies. This has enabled data-driven statistical analyses using non-parametric predictive models (including those from machine learning) to complement more traditional, hypothesis-driven approaches. This thesis addresses several challenges that arise when applying non-parametric predictive models to biological sequence data. Some of these challenges arise due to the nature of the biological system of interest. For example, in the study of the human microbiome the phylogenetic relationships between microorganisms are often ignored in statistical analyses. This thesis outlines a novel approach to modelling phylogenetic similarity using string kernels and demonstrates its utility in the two-sample test and host-trait prediction. Other challenges arise from limitations in our understanding of the models themselves. For example, calculating variable importance (a key task in biomedical applications) is not possible for many models. This thesis describes a novel extension of an existing approach to compute importance scores for grouped variables in a Bayesian neural network. It also explores the behaviour of random forest classifiers when applied to microbial datasets, with a focus on the robustness of the biological findings under different modelling assumptions.
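The string-kernel idea can be illustrated with a plain spectrum kernel: two sequences are similar when they share many length-k substrings. The thesis's kernel additionally encodes phylogenetic similarity, so the version below is only a generic stand-in, and all names are hypothetical.

```python
# Sketch of a spectrum (k-mer) string kernel for biological sequences.
# Illustrative stand-in for the thesis's phylogeny-aware kernel.
import math
from collections import Counter

def kmer_counts(seq, k=3):
    """Multiset of all length-k substrings of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s, t, k=3):
    """Inner product of k-mer count vectors, cosine-normalised to [0, 1]."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    dot = sum(cs[m] * ct[m] for m in cs)
    norm = math.sqrt(sum(v * v for v in cs.values()) *
                     sum(v * v for v in ct.values()))
    return dot / norm if norm else 0.0

high = spectrum_kernel("GATTACAGATTACA", "GATTACA")  # many shared 3-mers
zero = spectrum_kernel("GATTACA", "CCGGCCGG")        # no shared 3-mers
```

Any positive-definite kernel of this form can be plugged directly into a kernel two-sample test (e.g. MMD) or a kernel-based predictor, which is how the thesis uses it for two-sample testing and host-trait prediction.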

    Validation of computational models: a study integrating generative adversarial networks and discrete event simulation

    Computer model validation is essential to the success of a Discrete Event Simulation (DES) project, since this stage guarantees that the simulation model corresponds to the real system. Nevertheless, it is not possible to ensure that the model represents the real system 100%. The literature suggests using more than one validation technique, with statistical tests preferred because they provide mathematical evidence. However, such tests have limitations: they usually test the mean or standard deviation individually and do not consider that the data may lie within a pre-established tolerance limit. Generative Adversarial Networks (GANs), being two competing neural networks in which one generates data and the other discriminates it, can be used to train, evaluate, and discriminate data and thereby validate DES models. This thesis therefore proposes a method for validating DES computer models that evaluates one or more output metrics, considering a tolerance when comparing simulated data with real data. The proposed method is divided into two phases: a "Training Phase", which trains on the data, and a "Test Phase", which discriminates the data. In the second phase, an Equivalence Test is also performed, which statistically analyzes whether the difference between the judgments lies within the tolerance range determined by the modeler. To validate the proposed method and verify its power, experiments were carried out on continuous, discrete, and conditional distributions and on a DES model, and power curves were generated for real tolerances of 5.0%, 10.0%, and 20.0%. The results showed that it is more efficient to use the larger of the two datasets in the "Test Phase" and the smaller one in the "Training Phase". Moreover, the confidence of the power analysis increases with a larger dataset in the first phase, yielding narrower confidence intervals, and the more metrics are evaluated at once, the more data must be fed into the GANs' training.
The method also suggests classifying a validation according to the achieved tolerance: Very Strong, Strong, Satisfying, Marginal, Deficient, and Unsatisfying. Finally, the method was applied to three real models, two in manufacturing and one in the health sector. We conclude that the proposed method is efficient and able to show the degree to which the models that represent the real system are valid.
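The equivalence-testing step described above flips the usual null hypothesis: instead of testing "the means are equal", it tests whether the real-vs-simulated difference lies inside a modeller-chosen tolerance. A standard way to do this is the two one-sided tests (TOST) procedure, sketched below with a normal approximation; the GAN discriminator part is omitted and all names are illustrative, not the thesis's implementation.

```python
# Sketch of an equivalence test (TOST) between real and simulated data:
# conclude equivalence only if mean(real) - mean(sim) is shown to lie
# strictly within (-tolerance, +tolerance).
import math
import random
import statistics

def tost_equivalence(real, sim, tolerance, alpha=0.05):
    diff = statistics.mean(real) - statistics.mean(sim)
    se = math.sqrt(statistics.variance(real) / len(real) +
                   statistics.variance(sim) / len(sim))
    z_lower = (diff + tolerance) / se   # test H0: diff <= -tolerance
    z_upper = (diff - tolerance) / se   # test H0: diff >= +tolerance
    p_lower = 0.5 * math.erfc(z_lower / math.sqrt(2))   # 1 - Phi(z_lower)
    p_upper = 0.5 * math.erfc(-z_upper / math.sqrt(2))  # Phi(z_upper)
    # Equivalence is concluded only if both one-sided tests reject.
    return max(p_lower, p_upper) < alpha

rng = random.Random(0)
real = [rng.gauss(10.0, 1.0) for _ in range(500)]
sim = [rng.gauss(10.1, 1.0) for _ in range(500)]   # small, tolerable bias
equivalent = tost_equivalence(real, sim, tolerance=0.5)
```

A small simulation bias within the tolerance is accepted as valid, while a large bias is not, which matches the thesis's notion of grading validity against an achieved tolerance.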

    Learning and Testing Powerful Hypotheses

    Progress in science is driven by formulating hypotheses about phenomena of interest and by collecting evidence for their validity or refuting them. While some hypotheses are amenable to deductive proofs, others can only be assessed in a data-driven manner. For most phenomena, scientists cannot control all degrees of freedom, and hence the data are often inherently stochastic. This stochasticity makes it impossible to test hypotheses with absolute certainty. The field of statistical hypothesis testing formalizes the probabilistic assessment of hypotheses, enabling researchers to control error rates, for example the rate at which they reject a true hypothesis, while aiming to reject false hypotheses as often as possible. But how do we come up with promising hypotheses, and how can we test them efficiently? Can we use machine learning systems to automatically generate promising hypotheses? This thesis studies different aspects of this question. A simple rule for statistical hypothesis testing states that one should not peek at the data when formulating a hypothesis. This is indeed true if done naively, that is, when the hypothesis is then simply tested with the data as if one had not looked at it yet. However, we show that using the same data for learning a hypothesis and testing it is, in principle, feasible if we can correct for the selection of the hypothesis. We treat this in the case of the two-sample problem: given two samples, the hypothesis to be tested is whether they originate from the same distribution. We can reformulate this as testing whether the maximum mean discrepancy over the unit ball of a reproducing kernel Hilbert space is zero. We show that we can learn the kernel function, and hence the exact test we use, and perform the test with the same data, while still correctly controlling the Type-I error rate. 
    Likewise, we demonstrate experimentally that taking all data into account can lead to more powerful testing procedures than the data-splitting approach. However, deriving the formulae that correct for the selection procedure requires strong assumptions, which are only valid for a specific, linear-time estimate of the maximum mean discrepancy. In more general settings it is difficult, if not impossible, to adjust for the selection. We thus also analyze the case where we split the data and use part of it to learn a test statistic. The maximum mean discrepancy implicitly optimizes a mean discrepancy over the unit ball of a reproducing kernel Hilbert space, and often the kernel itself is optimized on held-out data. We instead propose to optimize a witness function directly on held-out data and to use its mean discrepancy as a test statistic. This allows us to directly maximize the test power, simplifies the theoretical treatment, and makes testing more efficient. We provide and implement algorithms to learn the test statistics. Furthermore, we show analytically that the optimization objective for learning powerful two-sample tests is closely related to the objectives used in standard supervised learning tasks, namely the least-squares loss and the cross-entropy loss. This allows us to use existing machine learning tools when learning powerful hypotheses. Moreover, since we use held-out data for learning the test statistic, we can apply any kind of model selection and cross-validation to maximize performance. To facilitate this for practitioners, we provide an open-source Python package, 'autotst', implementing an interface to existing libraries and running the whole testing pipeline, including the learning of the hypothesis. Our presented methods reach state-of-the-art performance on two-sample testing tasks. 
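The witness-function idea can be sketched in a few lines: learn a witness f on held-out data, then use the difference of its means on fresh data as the test statistic, whose null distribution is asymptotically normal. Below f is simply the difference of Gaussian-kernel mean maps of the training halves; the thesis instead fits the witness with standard supervised learners (and ships this in 'autotst'), so treat every name here as an illustrative assumption.

```python
# Sketch of a witness-function two-sample test: learn f on one half of
# the data, z-test its mean discrepancy on the other half.
import math
import random
import statistics

def make_witness(x_train, y_train, bandwidth=1.0):
    """Witness f(t) = mean kernel similarity to x_train minus to y_train."""
    def k(a, b):
        return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))
    def f(t):
        return (sum(k(t, a) for a in x_train) / len(x_train) -
                sum(k(t, b) for b in y_train) / len(y_train))
    return f

def witness_test(x, y, alpha=0.05):
    hx, hy = len(x) // 2, len(y) // 2
    f = make_witness(x[:hx], y[:hy])          # learn on first halves
    fx = [f(t) for t in x[hx:]]               # evaluate on fresh halves
    fy = [f(t) for t in y[hy:]]
    stat = statistics.mean(fx) - statistics.mean(fy)
    se = math.sqrt(statistics.variance(fx) / len(fx) +
                   statistics.variance(fy) / len(fy))
    z = stat / se
    # One-sided test: under H1 the witness is larger on x than on y.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # 1 - Phi(z)
    return p_value < alpha

rng = random.Random(0)
reject = witness_test([rng.gauss(0, 1) for _ in range(200)],
                      [rng.gauss(1, 1) for _ in range(200)])
```

Because the witness is learned on held-out data, its values on the test halves are i.i.d., so a plain z-test calibrates the statistic; this is the simplification of the theoretical treatment the abstract refers to.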
    We also show how to trade off the computational resources required by the test against some statistical power, which can be important in practice. Furthermore, our test makes the results easy to interpret. Having more computational power potentially allows extracting more information from data and thus obtaining more significant results. Hence, investigating whether quantum computers can help in machine learning tasks has gained popularity over the past years, and we investigate this in light of the two-sample problem. We define the quantum mean embedding, mapping probability distributions onto quantum states, and analyze when this mapping is injective. While this is conceptually interesting on its own, we do not find a straightforward way of harnessing any speed-up. The main problem is that there is no known way to efficiently create the quantum mean embedding; on the contrary, fundamental results in quantum information theory show that this might generally be hard to do. For two-sample testing, the use of reproducing kernel Hilbert spaces has been established for many years and has proven important both theoretically and practically, so here we focused on practically relevant aspects that make the tests as powerful and easy to use as possible. For other hypothesis testing tasks, the use of advanced machine learning tools still lags far behind. For specification tests based on conditional moment restrictions, popular in econometrics, we take the first steps by defining a consistent test based on kernel methods. Our test already shows promising performance, but optimizing it, potentially using the other insights gained in this thesis, remains an open task.