
    Large-Scale Kernel Methods for Independence Testing

    Representations of probability measures in reproducing kernel Hilbert spaces provide a flexible framework for fully nonparametric hypothesis tests of independence, which can capture any type of departure from independence, including nonlinear associations and multivariate interactions. However, these approaches come with an at least quadratic computational cost in the number of observations, which can be prohibitive in many applications. Arguably, it is exactly in such large-scale datasets that capturing any type of dependence is of interest, so striking a favourable tradeoff between computational efficiency and test performance for kernel independence tests would have a direct impact on their applicability in practice. In this contribution, we provide an extensive study of the use of large-scale kernel approximations in the context of independence testing, contrasting block-based, Nyström and random Fourier feature approaches. Through a variety of synthetic data experiments, we demonstrate that our novel large-scale methods give performance comparable to existing methods whilst using significantly less computation time and memory. (Comment: 29 pages, 6 figures)
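
    The core trick these approximations share can be illustrated with random Fourier features: replacing the n × n kernel matrices in HSIC-style statistics with explicit D-dimensional feature maps reduces the cost from quadratic to linear in n. Below is a minimal sketch under assumed choices (Gaussian kernel, feature dimension D, bandwidth sigma); it is not the authors' code.

```python
# Sketch: random-Fourier-feature approximation of an HSIC-style
# independence statistic. D and sigma are illustrative choices.
import numpy as np

def rff(x, D, sigma, rng):
    """Map samples (n, d) to D random Fourier features of a Gaussian kernel."""
    n, d = x.shape
    W = rng.normal(scale=1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(x @ W + b)

def rff_hsic(x, y, D=100, sigma=1.0, seed=0):
    """Approximate HSIC as the squared Frobenius norm of the
    cross-covariance of the two feature maps: O(n D^2), not O(n^2)."""
    rng = np.random.default_rng(seed)
    zx = rff(x, D, sigma, rng)
    zy = rff(y, D, sigma, rng)
    zx -= zx.mean(axis=0)  # centre the features
    zy -= zy.mean(axis=0)
    C = zx.T @ zy / x.shape[0]
    return np.sum(C ** 2)
```

    A permutation test (recomputing the statistic with the rows of y shuffled) then calibrates the rejection threshold at the same reduced cost.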

    Resampling-based confidence regions and multiple tests for a correlated random vector

    We derive non-asymptotic confidence regions for the mean of a random vector whose coordinates have an unknown dependence structure. The random vector is assumed to be either Gaussian or to have a symmetric bounded distribution, and we observe n i.i.d. copies of it. The confidence regions are built using a data-dependent threshold based on a weighted bootstrap procedure. We consider two approaches, the first based on a concentration argument and the second on a direct bootstrapped quantile. The first allows us to deal with a very large class of resampling weights, while our results for the second are restricted to Rademacher weights. However, the second method seems more accurate in practice. Our results are motivated by multiple testing problems, and we show in simulations that our procedures outperform the Bonferroni procedure (union bound) as soon as the observed vector has sufficiently correlated coordinates. (Comment: submitted to COL)
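
    As a concrete illustration of the second (quantile) approach, the sketch below bootstraps the sup-norm deviation of the empirical mean with Rademacher weights; the resampling size B and level alpha are illustrative, and this is a simplified reading of the procedure rather than the authors' implementation.

```python
# Sketch: Rademacher-weighted bootstrap threshold for a sup-norm
# confidence region around the empirical mean of dependent coordinates.
import numpy as np

def rademacher_threshold(Y, alpha=0.05, B=1000, seed=0):
    """Y: (n, K) array of n i.i.d. copies of the K-dimensional vector."""
    rng = np.random.default_rng(seed)
    n, _ = Y.shape
    Yc = Y - Y.mean(axis=0)                     # centre at the empirical mean
    eps = rng.choice([-1.0, 1.0], size=(B, n))  # Rademacher resampling weights
    stats = np.abs(eps @ Yc / n).max(axis=1)    # sup over the K coordinates
    return np.quantile(stats, 1.0 - alpha)

# Confidence region: all mu with max_k |Ybar_k - mu_k| <= threshold.
```

    Because the bootstrap statistic sees the empirical correlations between coordinates, the threshold adapts to the dependence structure instead of paying the full Bonferroni price.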

    Large-Scale Multiple Testing of Composite Null Hypotheses Under Heteroskedasticity

    Heteroskedasticity poses several methodological challenges in designing valid and powerful procedures for the simultaneous testing of composite null hypotheses. In particular, the conventional practice of standardizing or re-scaling heteroskedastic test statistics in this setting may severely affect the power of the underlying multiple testing procedure. Additionally, when the inferential parameter of interest is correlated with the variance of the test statistic, methods that ignore this dependence may fail to control the type I error at the desired level. We propose a new Heteroskedasticity Adjusted Multiple Testing (HAMT) procedure that avoids data reduction by standardization and directly incorporates the side information from the variances into the testing procedure. Our approach relies on an improved nonparametric empirical Bayes deconvolution estimator that offers a practical strategy for capturing the dependence between the inferential parameter of interest and the variance of the test statistic. We develop theory to show that HAMT is asymptotically valid and optimal for FDR control. Simulation results demonstrate that HAMT outperforms existing procedures, with substantial power gains across many settings at the same FDR level. The method is illustrated on an application involving the detection of engaged users on a mobile game app.
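
    To make the contrast with standardization concrete, the toy sketch below ranks hypotheses by a conditional local false discovery rate that keeps each variance s_i as side information. The two-group mixture (null proportion pi0, normal alternative) is an assumed toy model standing in for the paper's nonparametric deconvolution estimator.

```python
# Toy sketch: variance-aware ranking instead of standardizing x_i / s_i.
# The mixture model below is assumed for illustration only.
import numpy as np
from scipy.stats import norm

def toy_clfdr(x, s, pi0=0.9, mu1=2.0):
    """Conditional lfdr of x_i given its own standard deviation s_i."""
    f0 = norm.pdf(x, loc=0.0, scale=s)                    # null: N(0, s_i^2)
    f1 = norm.pdf(x, loc=mu1, scale=np.sqrt(s**2 + 1.0))  # alt effect ~ N(mu1, 1)
    return pi0 * f0 / (pi0 * f0 + (1.0 - pi0) * f1)

def stepup_reject(clfdr, q=0.1):
    """Reject the largest prefix (in increasing clfdr order) whose
    running mean stays below the target FDR level q."""
    order = np.argsort(clfdr)
    running = np.cumsum(clfdr[order]) / np.arange(1, clfdr.size + 1)
    k = int(np.sum(running <= q))
    reject = np.zeros(clfdr.size, dtype=bool)
    reject[order[:k]] = True
    return reject
```

    Two tests with the same standardized value x/s can receive different ranks here, which is precisely the information that standardization throws away.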

    Specification Testing in Hawkes Models

    We propose various specification tests for Hawkes models based on the Lagrange Multiplier (LM) principle. Hawkes models can be used to model the occurrence of extreme events in financial markets. Our specific focus is on testing whether a univariate model should be extended to a multivariate one, that is, whether there is conditional dependence between extreme events across markets. Simulations show that the tests have good size and power, in particular for sample sizes typically encountered in practice. Applying the specification test for dependence to US stock, bond and exchange rate data, we find strong evidence for cross-excitation within segments as well as between segments. We therefore recommend that univariate Hawkes models be extended to account for this cross-triggering phenomenon.
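
    For intuition about what the test probes, the sketch below evaluates the conditional intensity of a bivariate Hawkes process with exponential kernels; the parameter values are illustrative. Cross-excitation corresponds to a nonzero off-diagonal alpha[m, n], which is exactly what the univariate specification restricts to zero.

```python
# Sketch: conditional intensity of a bivariate Hawkes process with
# exponential decay; parameters are illustrative, not estimated values.
import numpy as np

def hawkes_intensity(t, events, mu, alpha, beta):
    """events[n]: past event times of component n; returns lambda(t)."""
    lam = mu.astype(float).copy()
    for n, times in enumerate(events):
        past = times[times < t]
        for m in range(lam.size):
            lam[m] += alpha[m, n] * np.exp(-beta * (t - past)).sum()
    return lam

mu = np.array([0.1, 0.1])
alpha = np.array([[0.3, 0.2],   # alpha[0, 1] > 0: market 1 excites market 0
                  [0.2, 0.3]])
events = [np.array([1.0, 2.5]), np.array([2.0])]
print(hawkes_intensity(3.0, events, mu, alpha, beta=1.0))
```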

    Representing time-dependent freezing behaviour in immersion mode ice nucleation

    In order to understand the impact of ice formation in clouds, a quantitative understanding of ice nucleation is required, along with an accurate and efficient representation for use in cloud-resolving models. Ice nucleation by atmospherically relevant particle types is complicated by interparticle variability in nucleating ability, as well as by the stochastic, time-dependent nature inherent to nucleation. Here we present a new and computationally efficient Framework for Reconciling Observable Stochastic Time-dependence (FROST) in immersion mode ice nucleation. This framework is underpinned by the finding that the temperature dependence of the nucleation-rate coefficient controls the residence-time and cooling-rate dependence of freezing. We show that the framework can be used to reconcile experimental data obtained on different timescales with different experimental systems, and that it provides a simple way of representing the complexities of ice nucleation in cloud-resolving models. We recommend the routine testing and reporting of time-dependent behaviour in future experimental studies, along with the practice of presenting normalised data sets following the methods outlined here.
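
    The key quantity is the temperature dependence of the nucleation-rate coefficient J(T): under cooling at rate r, the frozen fraction depends on r only through the time integral of J. The sketch below uses an assumed exponential form and illustrative constants for J, not the paper's fitted parameterisation.

```python
# Sketch: frozen fraction of droplets after cooling to temperature T at
# rate r (K/s), with an assumed nucleation-rate coefficient J(T).
import numpy as np

def frozen_fraction(T, T0=0.0, r=1.0, area=1e-6, a=-5.0, b=0.5, steps=2000):
    """f(T) = 1 - exp(-(A/r) * integral_T^T0 J(T') dT'), J(T) = exp(a - b*T)."""
    Ts = np.linspace(T, T0, steps)   # temperatures (deg C), T < T0
    J = np.exp(a - b * Ts)           # nucleation-rate coefficient per area-time
    return 1.0 - np.exp(-area * np.trapz(J, Ts) / r)

# Slower cooling leaves more time for nucleation at each temperature,
# so the frozen fraction at -20 deg C is higher:
print(frozen_fraction(-20.0, r=1.0), frozen_fraction(-20.0, r=0.1))
```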

    Empowering customer engagement by informative billing: a European approach

    Programmes aimed at improving end-use energy efficiency are a keystone in the market strategies of leading distribution system operators (DSOs) and energy retail companies, and their application is growing; they are soon expected to become mainstream practice. Informative services based on electricity meter data collected for billing are powerful tools for energy savings at scale; they increase customer engagement with energy suppliers and enable the deployment of demand response programmes that help optimise distribution grid operation. These services are fully in line with Europe’s 2020 strategy for overall energy performance improvement (cf. directives 2006/32/EC, 2009/72/EC, 2012/27/EU). The Intelligent Energy Europe project EMPOWERING involves four European utilities and an international team of university researchers, social scientists and energy experts developing and providing insight-based services and tools for 344,000 residential customers in Austria, France, Italy and Spain. The project adopts a systematic, iterative approach to service development that starts from the utilities’, customers’ and legal requirements and incorporates feedback from testing into the design process. The technological solution provided by the leading partner CIMNE is a scalable open-source Big Data analytics system coupled with the DSOs’ information systems, delivering a range of value-adding services for the customer, such as:
    - comparison with similar households
    - indications of performance improvements over time
    - consumption-weather dependence
    - detailed consumption visualisation and breakdown
    - personalised energy saving tips
    - alerts (high consumption, high bill, extreme temperature, etc.)
    The paper presents the development approach, describes the ICT system architecture and analyses the legal and regulatory context for providing this kind of service in the European Community. The limitations on third-party data access, customer consent and data privacy are discussed, and it is explained how these have been overcome through implementation of the “privacy by design” principle.
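
    As a purely illustrative example of the simplest listed service, the sketch below places a household's consumption within a peer group of similar households; the peer-group definition and alert threshold are assumed, and this is not project code.

```python
# Sketch: "comparison with similar households" as a percentile rank,
# with a simple high-consumption flag. Thresholds are assumed.
import numpy as np

def similar_household_report(kwh, peer_kwh, high_pct=75.0):
    """Percentile of this household's monthly kWh within its peer group."""
    pct = 100.0 * np.mean(np.asarray(peer_kwh) < kwh)
    return {"percentile": pct, "high_consumption_alert": pct >= high_pct}

print(similar_household_report(320.0, [210, 250, 280, 300, 340, 410]))
```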

    Hypothesis Testing and Model Estimation with Dependent Observations in Heterogeneous Sensor Networks

    Advances in microelectronics, communication and signal processing have enabled the development of inexpensive sensors that can be networked to collect vital information from their environment for use in decision-making and inference. The sensors transmit their data to a central processor, which integrates the information using a so-called fusion algorithm. Many applications of sensor networks (SNs) involve hypothesis testing or the detection of a phenomenon. Many approaches to data fusion for hypothesis testing assume that, given each hypothesis, the sensors' measurements are conditionally independent. However, since sensors are densely deployed in practice, their fields of view overlap and consequently their measurements are dependent. Moreover, a sensor's measurement samples may be correlated over time. Another assumption often made in data fusion algorithms is that the underlying statistical model of the sensors' observations is completely known. In practice, however, these statistics may not be available prior to deployment and may change over the lifetime of the network due to hardware changes, aging, and environmental conditions. In this dissertation, we consider the problem of data fusion in heterogeneous SNs (SNs in which the sensors are not identical) collecting dependent data. We develop an expectation-maximization (EM) algorithm for hypothesis testing and model estimation. Copula distributions are used to model the correlation in the data. Moreover, since the distribution of the sensors' measurements is not assumed to be completely known, we consider both parametric and non-parametric model estimation. The proposed approach is developed for both batch and online processing. In batch processing, fusion can only be performed after a block of data samples has been received from each sensor, while in online processing, fusion is performed upon arrival of each data sample. Online processing is of great interest since, for many applications, the long delay required for the accumulation of data in batch processing is not acceptable. To evaluate the proposed algorithms, both simulated data and real-world datasets are used. The detection performance of the proposed algorithms is compared with well-known supervised and unsupervised learning methods, as well as with similar EM-based methods that either partially or entirely ignore the dependence in the data.
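
    The modelling idea can be illustrated with a Gaussian copula, which couples heterogeneous marginal distributions into a joint model so that inter-sensor dependence is retained. The marginals and correlation below are illustrative; in the dissertation these quantities are estimated (parametrically or nonparametrically) inside the EM iterations.

```python
# Sketch: log copula density of a Gaussian copula over K sensors. This is
# the term a joint log-likelihood adds on top of the marginal terms, and
# it vanishes (R = I) under the conditional-independence assumption.
import numpy as np
from scipy.stats import norm, multivariate_normal, expon

def gaussian_copula_logdensity(u, R):
    """u: (n, K) marginal CDF values per sensor; R: copula correlation."""
    z = norm.ppf(np.clip(u, 1e-12, 1.0 - 1e-12))  # map to Gaussian scores
    mvn = multivariate_normal(mean=np.zeros(R.shape[0]), cov=R)
    return np.sum(mvn.logpdf(z) - norm.logpdf(z).sum(axis=1))

# Two heterogeneous sensors: exponential and Gaussian marginals.
x1 = np.array([0.5, 1.2])
x2 = np.array([0.1, -0.3])
u = np.column_stack([expon.cdf(x1), norm.cdf(x2)])
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])
print(gaussian_copula_logdensity(u, R))
```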