Detecting relevant changes in time series models
Most of the literature on change-point analysis by means of hypothesis
testing considers hypotheses of the form H0 : \theta_1 = \theta_2 vs. H1 :
\theta_1 != \theta_2, where \theta_1 and \theta_2 denote parameters of the
process before and after a change point. This paper takes a different
perspective and investigates the null hypotheses of no relevant changes, i.e.
H0 : ||\theta_1 - \theta_2|| \leq \Delta, where || \cdot || is an
appropriate norm. This formulation of the testing problem is motivated by the
fact that in many applications a modification of the statistical analysis might
not be necessary, if the difference between the parameters before and after the
change-point is small. A general approach to problems of this type is developed
which is based on the CUSUM principle. For the asymptotic analysis weak
convergence of the sequential empirical process has to be established under the
alternative of non-stationarity, and it is shown that the resulting test
statistic is asymptotically normally distributed. Several applications of the
methodology are given including tests for relevant changes in the mean,
variance, parameter in a linear regression model and distribution function
among others. The finite sample properties of the new tests are investigated by
means of a simulation study and illustrated by analyzing a data example from
economics.
Comment: Keywords: change-point analysis, CUSUM, relevant changes, precise
hypotheses, strong mixing, weak convergence under the alternative AMS Subject
Classification: 62M10, 62F05, 62G1
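As a rough illustration of the CUSUM principle mentioned in the abstract above, the following Python sketch locates a single change in the mean and compares the estimated jump with a threshold \Delta. It is not the test developed in the paper: the function names, the argmax estimator of the change point, and the plain comparison (with no variance estimate or asymptotic critical value) are all assumptions made for illustration.

```python
def cusum_change_point(xs):
    """Estimate a single mean change with the CUSUM statistic.

    Returns (k_hat, diff), where k_hat is the number of observations
    before the estimated change and diff is the estimated jump
    (mean after minus mean before). Illustrative helper only.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(xs)
    best_k, best_val = 1, -1.0
    partial = 0.0
    for k in range(1, n):
        partial += xs[k - 1]
        # |S_k - (k/n) S_n|: deviation of the partial sum from its
        # expected value under a constant mean.
        val = abs(partial - k * total / n)
        if val > best_val:
            best_k, best_val = k, val
    mean_before = sum(xs[:best_k]) / best_k
    mean_after = sum(xs[best_k:]) / (n - best_k)
    return best_k, mean_after - mean_before


def exceeds_relevance_threshold(xs, delta):
    """Descriptive check of |theta_1 - theta_2| > delta (not a formal test)."""
    _, diff = cusum_change_point(xs)
    return abs(diff) > delta
```

For a series of ten zeros followed by ten twos, `cusum_change_point` returns `(10, 2.0)`, so the jump counts as relevant for \Delta = 1 but not for \Delta = 3. A formal test would also need a long-run variance estimate and the asymptotic distribution derived in the paper.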
Exchangeable Variable Models
A sequence of random variables is exchangeable if its joint distribution is
invariant under variable permutations. We introduce exchangeable variable
models (EVMs) as a novel class of probabilistic models whose basic building
blocks are partially exchangeable sequences, a generalization of exchangeable
sequences. We prove that a family of tractable EVMs is optimal under zero-one
loss for a large class of functions, including parity and threshold functions,
and strictly subsumes existing tractable independence-based model families.
Extensive experiments show that EVMs outperform state of the art classifiers
such as SVMs and probabilistic models which are solely based on independence
assumptions.
Comment: ICML 201
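The definition of exchangeability in the abstract above can be checked by brute force for a small number of binary variables. The sketch below is only an illustration of that definition, not of the EVM model class itself; `is_exchangeable` and `product_pmf` are hypothetical helper names.

```python
from itertools import permutations, product


def is_exchangeable(pmf, n, tol=1e-12):
    """Check whether pmf(x) is invariant under all permutations of x.

    pmf maps a tuple of n binary values to a probability. The check
    enumerates all 2**n assignments and n! permutations, so it is
    only practical for small n.
    """
    for x in product((0, 1), repeat=n):
        for perm in permutations(range(n)):
            y = tuple(x[i] for i in perm)
            if abs(pmf(x) - pmf(y)) > tol:
                return False
    return True


def product_pmf(ps):
    """Joint pmf of independent Bernoulli variables with success probs ps."""
    def pmf(x):
        prob = 1.0
        for xi, p in zip(x, ps):
            prob *= p if xi == 1 else 1.0 - p
        return prob
    return pmf
```

Independent variables with a common success probability pass the check, while independent variables with different success probabilities do not. This shows that independence alone does not imply exchangeability.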
Testing statistical hypothesis on random trees and applications to the protein classification problem
Efficient automatic protein classification is of central importance in
genomic annotation. As an independent way to check the reliability of the
classification, we propose a statistical approach to test if two sets of
protein domain sequences coming from two families of the Pfam database are
significantly different. We model protein sequences as realizations of Variable
Length Markov Chains (VLMC) and we use the context trees as a signature of each
protein family. Our approach is based on a Kolmogorov--Smirnov-type
goodness-of-fit test proposed by Balding et al. [Limit theorems for sequences
of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is
a supremum over the space of trees of a function of the two samples; its
computation grows, in principle, exponentially fast with the maximal number of
nodes of the potential trees. We show how to transform this problem into a
max-flow over a related graph which can be solved using a Ford--Fulkerson
algorithm in polynomial time on that number. We apply the test to 10 randomly
chosen protein domain families from the seed of Pfam-A database (high quality,
manually curated families). The test shows that the distributions of context
trees coming from different families are significantly different. We emphasize
that this is a novel mathematical approach to validate the automatic clustering
of sequences in any context. We also study the performance of the test via
simulations on Galton--Watson related processes.
Comment: Published at http://dx.doi.org/10.1214/08-AOAS218 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
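The abstract above reduces the test statistic to a max-flow problem solved with a Ford--Fulkerson algorithm. The sketch below shows a generic max-flow routine using the Edmonds-Karp variant (Ford--Fulkerson with breadth-first augmenting paths), which runs in polynomial time. The graph built from context trees is not described in the abstract, so the input format and the function name `max_flow` are assumptions made for illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow.

    capacity: dict mapping node -> {neighbor: non-negative capacity}.
    Returns the value of a maximum flow from source to sink.
    """
    if source == sink:
        raise ValueError("source and sink must differ")
    # Residual graph: copy forward capacities, add zero-capacity reverse edges.
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u, vs in capacity.items():
        for v in vs:
            residual.setdefault(v, {}).setdefault(u, 0)
    residual.setdefault(source, {})
    residual.setdefault(sink, {})

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left
        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow along the path and update the residual capacities.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck
```

On a small example with edges s->a (3), s->b (2), a->b (1), a->t (2), and b->t (3), the routine returns 5, which matches the capacity of the cut around the source.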
"Building" exact confidence nets
Confidence nets, that is, collections of confidence intervals that fill out
the parameter space and whose exact parameter coverage can be computed, are
familiar in nonparametric statistics. Here, the distributional assumptions are
based on invariance under the action of a finite reflection group. Exact
confidence nets are exhibited for a single parameter, based on the root system
of the group. The main result is a formula for the generating function of the
coverage interval probabilities. The proof makes use of the theory of
"buildings" and the Chevalley factorization theorem for the length distribution
on Cayley graphs of finite reflection groups.
Comment: 20 pages. To appear in Bernoull
Enrichment Procedures for Soft Clusters: A Statistical Test and its Applications
Clusters, typically mined by modeling locality of attribute spaces, are often evaluated for their ability to demonstrate "enrichment" of categorical features. A cluster enrichment procedure evaluates the membership of a cluster for significant representation in pre-defined categories of interest. While classical enrichment procedures assume a hard clustering definition, in this paper we introduce a new statistical test that computes enrichments for soft clusters. We demonstrate an application of this test in refining and evaluating soft clusters for classification of remotely sensed images.
Recent advances in directional statistics
Mainstream statistical methodology is generally applicable to data observed
in Euclidean space. There are, however, numerous contexts of considerable
scientific interest in which the natural supports for the data under
consideration are Riemannian manifolds like the unit circle, torus, sphere and
their extensions. Typically, such data can be represented using one or more
directions, and directional statistics is the branch of statistics that deals
with their analysis. In this paper we provide a review of the many recent
developments in the field since the publication of Mardia and Jupp (1999),
still the most comprehensive text on directional statistics. Many of those
developments have been stimulated by interesting applications in fields as
diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics,
image analysis, text mining, environmetrics, and machine learning. We begin by
considering developments for the exploratory analysis of directional data
before progressing to distributional models, general approaches to inference,
hypothesis testing, regression, nonparametric curve estimation, methods for
dimension reduction, classification and clustering, and the modelling of time
series, spatial and spatio-temporal data. An overview of currently available
software for analysing directional data is also provided, and potential future
developments discussed.
Comment: 61 page
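A small example may help show why data on the circle need their own methods, as the abstract above argues. The arithmetic mean of the angles 350 and 10 degrees is 180 degrees, which points the wrong way. The standard mean direction, computed from the averaged sine and cosine, is 0 degrees. The sketch below uses only Python's `math` module; the function names are illustrative and are not taken from any software reviewed in the paper.

```python
import math


def mean_direction(angles):
    """Mean direction of angles in radians, returned in (-pi, pi]."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)


def mean_resultant_length(angles):
    """Length of the mean resultant vector, between 0 and 1.

    Values near 1 indicate tightly concentrated directions,
    values near 0 indicate widely dispersed ones.
    """
    n = len(angles)
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.hypot(s, c) / n
```

For the two angles above, `mean_direction` returns a value within floating-point error of 0, and `mean_resultant_length` is close to cos(10 degrees), so the sample is highly concentrated.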