Large-Scale Kernel Methods for Independence Testing
Representations of probability measures in reproducing kernel Hilbert spaces
provide a flexible framework for fully nonparametric hypothesis tests of
independence, which can capture any type of departure from independence,
including nonlinear associations and multivariate interactions. However, these
approaches come with an at least quadratic computational cost in the number of
observations, which can be prohibitive in many applications. Arguably, it is
exactly in such large-scale datasets that capturing any type of dependence is
of interest, so striking a favourable tradeoff between computational efficiency
and test performance for kernel independence tests would have a direct impact
on their applicability in practice. In this contribution, we provide an
extensive study of the use of large-scale kernel approximations in the context
of independence testing, contrasting block-based, Nyström and random Fourier
feature approaches. Through a variety of synthetic data experiments, we
demonstrate that our novel large-scale methods give performance comparable
to that of existing methods whilst using significantly less computation time and
memory.
Comment: 29 pages, 6 figures
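The random-Fourier-feature approach contrasted in this abstract can be sketched in a few lines. The snippet below is a generic illustration of an RFF-approximated, HSIC-style dependence statistic (function names, the Gaussian kernel choice, and all parameters are ours), not the authors' exact test:

```python
import numpy as np

def rff(x, n_features, sigma, rng):
    """Random Fourier features approximating a Gaussian kernel (Rahimi & Recht)."""
    w = rng.normal(0.0, 1.0 / sigma, size=(x.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

def rff_cov_statistic(x, y, n_features=50, sigma=1.0, seed=0):
    """Squared Frobenius norm of the empirical cross-covariance between the
    two feature maps: an O(n) approximation of an HSIC-type dependence
    statistic, versus the quadratic cost of the exact kernel version."""
    rng = np.random.default_rng(seed)
    zx = rff(x, n_features, sigma, rng)
    zy = rff(y, n_features, sigma, rng)
    zx -= zx.mean(axis=0)          # centre the feature maps
    zy -= zy.mean(axis=0)
    c = zx.T @ zy / x.shape[0]     # cross-covariance in feature space
    return float(np.sum(c ** 2))
```

Under independence the statistic concentrates near zero, so dependent pairs produce visibly larger values; a permutation scheme on one variable would turn this into a test with a calibrated threshold.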
Resampling-based confidence regions and multiple tests for a correlated random vector
We derive non-asymptotic confidence regions for the mean of a random vector
whose coordinates have an unknown dependence structure. The random vector is
supposed to be either Gaussian or to have a symmetric bounded distribution, and
we observe i.i.d. copies of it. The confidence regions are built using a
data-dependent threshold based on a weighted bootstrap procedure. We consider
two approaches, the first based on a concentration approach and the second on a
direct bootstrapped quantile approach. The first allows us to handle a very
large class of resampling weights, while our results for the second are
restricted to Rademacher weights. However, the second method seems more
accurate in practice. Our results are motivated by multiple testing problems,
and we show on simulations that our procedures are better than the Bonferroni
procedure (union bound) as soon as the observed vector has sufficiently
correlated coordinates.
Comment: submitted to COL
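The Rademacher-weighted bootstrap quantile behind the second approach can be sketched roughly as follows. This is a simplified illustration of a sign-flip bootstrap threshold for the sup-norm of the empirical mean (names and details are ours, not the paper's exact construction):

```python
import numpy as np

def rademacher_quantile_threshold(samples, alpha=0.05, n_boot=1000, seed=0):
    """Data-dependent sup-norm threshold for the mean of a random vector
    with unknown coordinate dependence, via a Rademacher-weighted
    (sign-flip) bootstrap on the centred observations."""
    rng = np.random.default_rng(seed)
    n, k = samples.shape
    centered = samples - samples.mean(axis=0)
    sups = np.empty(n_boot)
    for b in range(n_boot):
        eps = rng.choice([-1.0, 1.0], size=n)       # Rademacher weights
        sups[b] = np.abs(eps @ centered / n).max()  # sup-norm of reweighted mean
    return float(np.quantile(sups, 1.0 - alpha))
```

Because the bootstrap sees the empirical dependence structure, strongly correlated coordinates do not inflate the threshold the way a Bonferroni (union bound) correction over k coordinates would.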
Large-Scale Multiple Testing of Composite Null Hypotheses Under Heteroskedasticity
Heteroskedasticity poses several methodological challenges in designing valid
and powerful procedures for simultaneous testing of composite null hypotheses.
In particular, the conventional practice of standardizing or re-scaling
heteroskedastic test statistics in this setting may severely affect the power
of the underlying multiple testing procedure. Additionally, when the
inferential parameter of interest is correlated with the variance of the test
statistic, methods that ignore this dependence may fail to control the type I
error at the desired level. We propose a new Heteroskedasticity Adjusted
Multiple Testing (HAMT) procedure that avoids data reduction by
standardization, and directly incorporates the side information from the
variances into the testing procedure. Our approach relies on an improved
nonparametric empirical Bayes deconvolution estimator that offers a practical
strategy for capturing the dependence between the inferential parameter of
interest and the variance of the test statistic. We develop theory to show that
HAMT is asymptotically valid and optimal for FDR control. Simulation results
demonstrate that HAMT outperforms existing procedures with substantial power
gain across many settings at the same FDR level. The method is illustrated on
an application involving the detection of engaged users on a mobile game app.
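For context, the classical Benjamini-Hochberg step-up procedure, the standard FDR-controlling baseline that heteroskedasticity-adjusted methods such as HAMT are compared against, fits in a few lines. This is the generic baseline, not HAMT itself, which replaces p-value ranking with variance-aware empirical Bayes quantities:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Classical Benjamini-Hochberg step-up procedure: sort p-values,
    find the largest rank i with p_(i) <= alpha * i / m, and reject
    all hypotheses with the i smallest p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

Applied to standardized statistics, this procedure discards the variance side information that the abstract argues is essential when the parameter of interest is correlated with the variance.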
Specification Testing in Hawkes Models
We propose various specification tests for Hawkes models based on the Lagrange Multiplier (LM) principle. Hawkes models can be used to model the occurrence of extreme events in financial markets. Our specific testing focus is on extending a univariate model to a multivariate model, that is, we examine whether there is a conditional dependence between extreme events in markets. Simulations show that the test has good size and power, in particular for sample sizes that are typically encountered in practice. Applying the specification test for dependence
to US stocks, bonds and exchange rate data, we find strong evidence for cross-excitation within segments as well as between segments. Therefore, we recommend that univariate Hawkes models be extended to account for the cross-triggering phenomenon.
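A univariate Hawkes process of the kind being extended here has conditional intensity λ(t) = μ + Σ_{t_i&lt;t} α·exp(−β(t − t_i)). A minimal simulation by Ogata's thinning algorithm, with illustrative parameter values chosen by us (α/β &lt; 1 for stationarity):

```python
import numpy as np

def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate a univariate Hawkes process with exponential kernel by
    Ogata's thinning: propose the next event at the current intensity
    (an upper bound, since intensity decays between events), then accept
    with probability lambda(t_candidate) / bound."""
    rng = np.random.default_rng(seed)
    t, events = 0.0, []
    while t < horizon:
        lam_bar = mu + sum(alpha * np.exp(-beta * (t - ti)) for ti in events)
        t += rng.exponential(1.0 / lam_bar)        # candidate event time
        if t >= horizon:
            break
        lam_t = mu + sum(alpha * np.exp(-beta * (t - ti)) for ti in events)
        if rng.uniform() <= lam_t / lam_bar:       # thinning step
            events.append(t)
    return events
```

Self-excitation lifts the long-run event rate from the baseline μ to μ/(1 − α/β), which is the clustering of extreme events the specification tests above probe; the multivariate extension adds cross-excitation terms from other markets' events.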
Representing time-dependent freezing behaviour in immersion mode ice nucleation
In order to understand the impact of ice formation in clouds, a quantitative understanding of ice nucleation is required, along with an accurate and efficient representation for use in cloud resolving models. Ice nucleation by atmospherically relevant particle types is complicated by interparticle variability in nucleating ability, as well as a stochastic, time-dependent nature inherent to nucleation. Here we present a new and computationally efficient Framework for Reconciling Observable Stochastic Time-dependence (FROST) in immersion mode ice nucleation. This framework is underpinned by the finding that the temperature dependence of the nucleation-rate coefficient controls the residence-time and cooling-rate dependence of freezing. It is shown that this framework can be used to reconcile experimental data obtained on different timescales with different experimental systems, and it also provides a simple way of representing the complexities of ice nucleation in cloud resolving models. The routine testing and reporting of time-dependent behaviour in future experimental studies is recommended, along with the practice of presenting normalised data sets following the methods outlined here.
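The controlling role of the nucleation-rate coefficient's temperature dependence can be illustrated numerically: when J(T) grows as temperature falls, a slower cooling rate gives droplets more residence time at each temperature and hence a higher frozen fraction at the same temperature. The exponential form and all parameter values below are our illustrative choices, not FROST's parameterisation:

```python
import numpy as np

def frozen_fraction(temps, cooling_rate, j0, c, area):
    """Frozen fraction of a monodisperse droplet population cooled at a
    constant rate, for an illustrative nucleation-rate coefficient
    J(T) = j0 * exp(-c * T) (T in deg C, so J rises as T drops).
    Survival follows P_unfrozen = exp(-area * integral of J dt)."""
    dt = np.abs(np.diff(temps)) / cooling_rate     # residence time per T step
    j_mid = j0 * np.exp(-c * 0.5 * (temps[:-1] + temps[1:]))
    cum = np.concatenate([[0.0], np.cumsum(area * j_mid * dt)])
    return 1.0 - np.exp(-cum)
```

Halving the cooling rate doubles the residence time in every temperature interval and therefore doubles the accumulated nucleation probability, which is exactly the time dependence the framework argues should be reported.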
Empowering customer engagement by informative billing: a European approach
Programmes aimed at improving end-use energy efficiency are a keystone in the market strategies of leading distribution system operators (DSOs) and energy retail companies, and their application is growing, soon expected to become mainstream practice. Informative services based on electricity meter data collected for billing are powerful tools for energy savings at scale; they increase customer engagement with energy suppliers and enable the deployment of demand-response programmes that help optimise distribution grid operation. These
services are completely in line with Europe’s 2020 strategy for overall energy performance improvement (cf. directives 2006/32/EC, 2009/72/EC, 2012/27/EU).
The Intelligent Energy Europe project EMPOWERING involves 4 European utilities and an international team of university researchers, social scientists and energy experts to develop and provide insight-based services and tools for 344,000 residential customers in Austria, France, Italy and Spain. The project adopts a systematic, iterative approach to service development that accounts for the utilities', customers' and legal requirements, and incorporates feedback from testing into the design process.
The technological solution provided by the leading partner CIMNE is a scalable, open-source Big Data analytics system coupled with the DSOs' information systems, delivering a range of value-adding services to the customer, such as:
- comparison with similar households
- indications of performance improvements over time
- consumption-weather dependence
- detailed consumption visualisation and breakdown
- personalised energy saving tips
- alerts (high consumption, high bill, extreme temperature, etc.)
The paper presents the development approach, describes the ICT system architecture and analyses the legal and regulatory context for providing these kinds of services in the European Community. The limitations on third-party data access, customer consent and data privacy are discussed, and it is explained how these have been overcome through the implementation of the “privacy by design” principle.
Method for Enabling Causal Inference in Relational Domains
The analysis of data from complex systems is quickly becoming a fundamental aspect of modern business, government, and science. The field of causal learning is concerned with developing a set of statistical methods that allow practitioners to make inferences about unseen interventions. This field has seen significant advances in recent years. However, the vast majority of this work assumes that data instances are independent, whereas many systems are best described in terms of interconnected instances, i.e., relational systems. This discrepancy prevents causal inference techniques from being reliably applied in many real-world settings. In this thesis, I will present three contributions to the field of causal inference that seek to enable the analysis of relational systems. First, I will present theory for consistently testing statistical dependence in relational domains. I then show how the significance of this test can be measured in practice using a novel bootstrap method for structured domains. Second, I show that statistical dependence in relational domains is inherently asymmetric, implying a simple test of causal direction from observational data. This test requires no assumptions on either the marginal distributions of variables or the functional form of dependence. Third, I describe relational causal adjustment, a procedure to identify the effects of arbitrary interventions from observational relational data via an extension of Pearl's backdoor criterion. A series of evaluations on synthetic domains shows that the estimates obtained by relational causal adjustment are close to those obtained from explicit experimentation.
Hypothesis Testing and Model Estimation with Dependent Observations in Heterogeneous Sensor Networks
Advances in microelectronics, communication and signal processing have enabled the development of inexpensive sensors that can be networked to collect vital information from their environment to be used in decision-making and inference. The sensors transmit their data to a central processor which integrates the information from the sensors using a so-called fusion algorithm. Many applications of sensor networks (SNs) involve hypothesis testing or the detection of a phenomenon. Many approaches to data fusion for hypothesis testing assume that, given each hypothesis, the sensors' measurements are conditionally independent. However, since the sensors are densely deployed in practice, their fields of view overlap and consequently their measurements are dependent. Moreover, a sensor's measurement samples may be correlated over time. Another assumption often used in data fusion algorithms is that the underlying statistical model of sensors' observations is completely known. However, in practice these statistics may not be available prior to deployment and may change over the lifetime of the network due to hardware changes, aging, and environmental conditions. In this dissertation, we consider the problem of data fusion in heterogeneous SNs (SNs in which the sensors are not identical) collecting dependent data. We develop the expectation maximization algorithm for hypothesis testing and model estimation. Copula distributions are used to model the correlation in the data. Moreover, it is assumed that the distribution of the sensors' measurements is not completely known. We consider both parametric and non-parametric model estimation. The proposed approach is developed for both batch and online processing. In batch processing, fusion can only be performed after a block of data samples is received from each sensor, while in online processing, fusion is performed upon arrival of each data sample.
Online processing is of great interest since, for many applications, the long delay required for the accumulation of data in batch processing is not acceptable. To evaluate the proposed algorithms, both simulation data and real-world datasets are used. Detection performances of the proposed algorithms are compared with well-known supervised and unsupervised learning methods as well as with similar EM-based methods, which either partially or entirely ignore the dependence in the data.
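Copula modelling, as used here for the sensors' dependence, factors a joint distribution into its marginals plus a dependence function. A minimal sketch for the bivariate Gaussian copula evaluated on standard-normal scores (a generic illustration of the technique, not the dissertation's estimator):

```python
import numpy as np

def gaussian_copula_logdensity(z1, z2, rho):
    """Log-density of a bivariate Gaussian copula with correlation rho,
    evaluated on standard-normal scores z = Phi^{-1}(F(x)) obtained by
    passing each sensor's raw measurement through its own marginal CDF.
    Separating marginals from dependence in this way is what makes
    copulas convenient for heterogeneous (non-identical) sensors."""
    det = 1.0 - rho ** 2
    return (-0.5 * np.log(det)
            - (rho ** 2 * (z1 ** 2 + z2 ** 2) - 2.0 * rho * z1 * z2)
            / (2.0 * det))
```

In an EM-style fusion scheme, a dependence parameter such as rho would be re-estimated in each M-step alongside the (parametric or non-parametric) marginal models.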