6 research outputs found
Learning Kernel Tests Without Data Splitting
Modern large-scale kernel-based tests such as maximum mean discrepancy (MMD)
and kernelized Stein discrepancy (KSD) optimize kernel hyperparameters on a
held-out sample via data splitting to obtain the most powerful test statistics.
While data splitting results in a tractable null distribution, it suffers from
a reduction in test power due to smaller test sample size. Inspired by the
selective inference framework, we propose an approach that enables learning the
hyperparameters and testing on the full sample without data splitting. Our
approach correctly calibrates the test despite the dependency introduced by
reusing the data, and yields a test threshold in closed form. At the same significance level, our
approach's test power is empirically larger than that of the data-splitting
approach, regardless of its split proportion.Comment: 24 (10+14) pages, 9 figures. Under Review v2: added missing
references and acknowledgment
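The MMD statistic underlying this line of work can be illustrated with a minimal sketch: an unbiased quadratic-time MMD² estimate with a fixed Gaussian kernel, calibrated by a permutation test. This is illustrative only; the paper's contribution is to learn the kernel and test on the same sample with a closed-form threshold, which this sketch does not do.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix between rows of a and rows of b.
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    # Unbiased estimate of squared MMD: within-sample means exclude
    # the diagonal; the cross term uses all pairs.
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * kxy.mean()

def permutation_test(x, y, n_perm=200, seed=0):
    # Approximate the null distribution by re-splitting the pooled sample.
    rng = np.random.default_rng(seed)
    stat = mmd2_unbiased(x, y)
    pooled = np.vstack([x, y])
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null.append(mmd2_unbiased(pooled[idx[:len(x)]], pooled[idx[len(x):]]))
    p = (1 + sum(s >= stat for s in null)) / (1 + n_perm)
    return stat, p
```

A fixed bandwidth is used here; choosing it (or a full kernel) on the same data is precisely the selection problem the abstract addresses.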
On the use of random forest for two-sample testing
Following the line of classification-based two-sample testing, tests based on the Random Forest classifier are proposed. The developed tests are easy to use, require almost no tuning, and are applicable to any distribution on R^d. Furthermore, the built-in variable importance measure of the Random Forest gives potential insights into which variables drive the difference in distribution. An asymptotic power analysis of the proposed tests is conducted. Finally, two real-world applications illustrate the usefulness of the introduced methodology. To simplify the use of the method, the R package “hypoRF” is provided.
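The classification-based testing idea can be sketched as follows; this is a minimal Python analogue using scikit-learn, not the hypoRF package (which is in R, and whose exact test statistic differs). Label the pooled sample by origin, fit a random forest, and compare held-out accuracy to chance with a binomial test.

```python
import numpy as np
from scipy.stats import binom
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def rf_two_sample_test(x, y, seed=0):
    # Pool the samples and label each point by which sample it came from.
    data = np.vstack([x, y])
    labels = np.r_[np.zeros(len(x)), np.ones(len(y))]
    x_tr, x_te, y_tr, y_te = train_test_split(
        data, labels, test_size=0.5, stratify=labels, random_state=seed)
    clf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(x_tr, y_tr)
    correct = int((clf.predict(x_te) == y_te).sum())
    n = len(y_te)
    # Under H0 (same distribution) each held-out prediction is a fair coin
    # flip, so the number of correct predictions is Binomial(n, 0.5).
    p_value = binom.sf(correct - 1, n, 0.5)
    return correct / n, p_value
```

The fitted forest's `feature_importances_` attribute then gives the variable-importance readout the abstract refers to.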
Non-parametric machine learning for biological sequence data
In the past decade there has been a massive increase in the volume of biological sequence data, driven by massively parallel sequencing technologies. This has enabled data-driven statistical analyses using non-parametric predictive models (including those from machine learning) to complement more traditional, hypothesis-driven approaches. This thesis addresses several challenges that arise when applying non-parametric predictive models to biological sequence data.
Some of these challenges arise due to the nature of the biological system of interest. For example, in the study of the human microbiome the phylogenetic relationships between microorganisms are often ignored in statistical analyses. This thesis outlines a novel approach to modelling phylogenetic similarity using string kernels and demonstrates its utility in the two-sample test and host-trait prediction.
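For intuition, a k-mer spectrum kernel is one standard string kernel for sequence data; this is a minimal sketch, and the thesis's phylogeny-aware kernel is more elaborate.

```python
from collections import Counter
import math

def kmer_counts(seq, k=3):
    # Count all overlapping substrings of length k.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(s1, s2, k=3):
    # Inner product of k-mer count vectors, cosine-normalized so that
    # identical sequences have similarity 1.
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Such a kernel can be plugged directly into kernel two-sample tests like the MMD, which is how string kernels connect to the testing theme of this thesis.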
Other challenges arise from limitations in our understanding of the models themselves. For example, calculating variable importance (a key task in biomedical applications) is not possible for many models. This thesis describes a novel extension of an existing approach to compute importance scores for grouped variables in a Bayesian neural network. It also explores the behaviour of random forest classifiers when applied to microbial datasets, with a focus on the robustness of the biological findings under different modelling assumptions.
Validation of computational models: a study integrating generative adversarial networks and discrete-event simulation
Computer model validation of Discrete Event Simulation (DES) is essential for project success, since this stage guarantees that the simulation model corresponds to the real system. Nevertheless, it is not possible to assure that the model represents the real system 100%. The literature suggests using more than one validation technique, but statistical tests are preferable. However, they have limitations: the tests usually assess the mean or standard deviation individually and do not consider that the data may lie within a pre-established tolerance limit. Generative Adversarial Networks (GANs) can be used to train, evaluate, and discriminate data and thereby validate DES models, because they consist of two competing neural networks, where one generates data and the other discriminates them. The proposed method is divided into two phases. The first, the "Training Phase", aims to train the data. The second, the "Test Phase", aims to discriminate the data. In addition, in the second phase, an Equivalence Test is performed, which statistically analyzes whether the difference between the judgments is within the tolerance range determined by the modeler. To validate the proposed method and to verify the Power Test, experiments were carried out on continuous, discrete, and conditional distributions and on a DES model. From these tests, the Power Test curves were generated considering a real tolerance of 5.0%, 10.0%, and 20.0%. The results showed that it is more efficient to use the larger of the two datasets in the "Test Phase" and the smaller one in the "Training Phase". In addition, the confidence of the Power Test increases with larger datasets in the first phase, yielding smaller confidence intervals. Also, the more metrics are evaluated at once, the greater the amount of data required in the GANs' training.
The method suggests classifying a validation based on the achieved tolerance: Very Strong, Strong, Satisfying, Marginal, Deficient, and Unsatisfying. Finally, the method was applied to three real models, two of them in manufacturing and the last one in the health sector. We conclude that the proposed method was efficient and was able to show the degree of validation of the models that represent the real system.
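The equivalence-test idea underlying the "Test Phase" can be sketched with two one-sided t-tests (TOST). This is a generic illustration applied directly to two samples, not the thesis's exact procedure on discriminator judgments.

```python
import numpy as np
from scipy import stats

def tost_equivalence(real, sim, tolerance, alpha=0.05):
    # Two one-sided t-tests: equivalence is declared when the mean
    # difference is shown to lie inside (-tolerance, +tolerance).
    diff = np.mean(sim) - np.mean(real)
    se = np.sqrt(np.var(real, ddof=1) / len(real) + np.var(sim, ddof=1) / len(sim))
    df = len(real) + len(sim) - 2  # simple pooled degrees-of-freedom choice
    t_lower = (diff + tolerance) / se  # H0: diff <= -tolerance
    t_upper = (diff - tolerance) / se  # H0: diff >= +tolerance
    p = max(stats.t.sf(t_lower, df), stats.t.cdf(t_upper, df))
    return p, p < alpha  # validated at the given tolerance if p < alpha
```

Unlike an ordinary t-test, rejecting here is evidence that the simulated and real outputs agree to within the modeler's tolerance, which matches the validation logic described above.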
Learning and Testing Powerful Hypotheses
Progress in science is driven through the formulation of hypotheses about phenomena of interest and by
collecting evidence that supports or refutes them. While some hypotheses are amenable to deductive
proofs, other hypotheses can only be accessed in a data-driven manner. For most phenomena, scientists cannot
control all degrees of freedom and hence data is often inherently stochastic. This stochasticity makes it
impossible to test hypotheses with absolute certainty. The field of statistical hypothesis testing formalizes the probabilistic
assessment of hypotheses, enabling researchers to control the error rates, for example, at which they reject a
true hypothesis, while aiming to reject false hypotheses as often as possible.
But how do we come up with promising hypotheses, and how can we test them efficiently? Can we use
machine learning systems to automatically generate promising hypotheses? This thesis studies different
aspects of this question.
A simple rule for statistical hypothesis testing states that one should not peek at the data when formulating
a hypothesis. This is indeed true if done naively, that is, when the hypothesis is then simply tested with
the data as if one had not looked at it yet. However, we show that in principle using the same data for
learning the hypothesis and testing it is feasible if we can correct for the selection of the hypothesis. We treat
this in the case of the two-sample problem. Given two samples, the hypothesis to be tested is whether the
samples originate from the same distribution. We can reformulate this by testing whether the maximum
mean discrepancy over a (unit ball of a) reproducing kernel Hilbert space is zero. We show that we can
learn the kernel function, hence the exact test we use, and perform the test with the same data, while still
correctly controlling the Type-I error rates. Likewise, we demonstrate experimentally that taking all data into
account can lead to more powerful testing procedures than the data splitting approach. However, deriving
the formulae that correct for the selection procedure requires strong assumptions, which hold only for one
specific estimate of the maximum mean discrepancy: the linear-time estimate. In more general settings it is difficult, if
not impossible, to adjust for the selection.
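The linear-time estimate mentioned here averages independent terms, so it is asymptotically normal under the null and admits a closed-form threshold. A minimal sketch with a fixed Gaussian kernel follows; the thesis additionally corrects for kernel selection, which is not shown.

```python
import numpy as np
from scipy.stats import norm

def gaussian_k(a, b, bw=1.0):
    # Gaussian kernel evaluated pairwise on aligned rows of a and b.
    return np.exp(-np.sum((a - b)**2, axis=-1) / (2 * bw**2))

def linear_mmd_test(x, y, alpha=0.05, bw=1.0):
    # Pair up consecutive observations; each h-term is independent
    # of the others, giving a simple average of i.i.d. terms.
    m = (min(len(x), len(y)) // 2) * 2
    x1, x2 = x[0:m:2], x[1:m:2]
    y1, y2 = y[0:m:2], y[1:m:2]
    h = (gaussian_k(x1, x2, bw) + gaussian_k(y1, y2, bw)
         - gaussian_k(x1, y2, bw) - gaussian_k(x2, y1, bw))
    # Studentized statistic: asymptotically N(0, 1) under the null,
    # so the test threshold is available in closed form.
    stat = np.sqrt(m // 2) * h.mean() / h.std(ddof=1)
    return stat, stat > norm.ppf(1 - alpha)
```

The closed-form Gaussian threshold is what makes the selective-inference correction tractable for this particular estimate.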
We thus also analyze the case where we split the data and use part of it to learn a test statistic. The maximum
mean discrepancy implicitly optimizes a mean discrepancy over the unit ball of a reproducing kernel Hilbert
space, and often the kernel itself is optimized on held-out data. We instead propose to optimize a witness
function directly on held-out data and use its mean discrepancy as a test statistic. This allows us to directly
maximize the test power, simplifies the theoretical treatment, and makes testing more efficient. We provide
and implement algorithms to learn the test statistics. Furthermore, we show analytically that the optimization
objective to learn powerful tests for the two-sample problem is closely related to the objectives used in
standard supervised learning tasks, namely the least-square loss and cross-entropy loss. This allows us to
indeed use existing machine learning tools when learning powerful hypotheses. Furthermore, since we use
held-out data for learning the test statistic, we can use any kind of model-selection and cross-validation
techniques to maximize the performance. To facilitate this for practitioners, we provide an open-source
Python package 'autotst' implementing an interface to existing libraries and running the whole testing
pipeline, including the learning of the hypothesis. Our presented methods reach state-of-the-art performance
on two-sample testing tasks. We also show how to trade off the computational resources required for the test
by sacrificing some statistical power, which can be important in practice. Furthermore, our test easily allows
interpreting the results.
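The witness-function approach can be sketched as follows, in the spirit of autotst but not its exact implementation: the witness here is a logistic-regression score, and any model selected by cross-validation could be substituted.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

def witness_two_sample_test(x, y, alpha=0.05):
    # First halves learn the witness; second halves are held out for testing.
    mx, my = len(x) // 2, len(y) // 2
    x_tr, x_te = x[:mx], x[mx:]
    y_tr, y_te = y[:my], y[my:]
    # Witness: a classifier score trained to separate the two samples
    # (any model and any cross-validation scheme could be used here).
    clf = LogisticRegression().fit(
        np.vstack([x_tr, y_tr]),
        np.r_[np.zeros(len(x_tr)), np.ones(len(y_tr))])
    wx = clf.decision_function(x_te)
    wy = clf.decision_function(y_te)
    # Normalized mean discrepancy of the witness on the held-out split:
    # asymptotically standard normal under the null.
    se = np.sqrt(wx.var(ddof=1) / len(wx) + wy.var(ddof=1) / len(wy))
    stat = (wy.mean() - wx.mean()) / se
    return stat, stat > norm.ppf(1 - alpha)
```

Because only the witness's mean difference enters the statistic, the null calibration is a simple normal approximation regardless of how complex the learned model is, which is the simplification the text describes.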
Having more computational power potentially allows extracting more information from data and thus obtaining
more significant results. Hence, investigating whether quantum computers can help in machine learning tasks
has gained popularity over the past years. We investigate this in light of the two-sample problem. We define
the quantum mean embedding, mapping probability distributions onto quantum states, and analyze when
this mapping is injective. While this is conceptually interesting on its own, we do not find a straightforward
way of harnessing any speed-up. The main problem here is that there is no known way to efficiently create
the quantum mean embedding. On the contrary, fundamental results in quantum information theory show
that this might generally be hard to do.
For two-sample testing, the usage of reproducing kernel Hilbert spaces has been established for many years
and proven important both theoretically and practically. In this case, we thus focused on practically relevant
aspects to make the tests as powerful and easy to use as possible. For other hypothesis testing tasks, the
usage of advanced machine learning tools still lags far behind. For specification tests based on conditional
moment restrictions, popular in econometrics, we take the first steps by defining a consistent test based on
kernel methods. Our test already has promising performance, but optimizing it, potentially with the other
insights gained in this thesis, remains an open task.