Probabilistic Blocking with An Application to the Syrian Conflict
Entity resolution seeks to merge databases so as to remove duplicate entries
where unique identifiers are typically unknown. We review modern blocking
approaches for entity resolution, focusing on those based upon locality
sensitive hashing (LSH). First, we introduce k-means locality sensitive
hashing (KLSH), which is based upon the information retrieval literature and
clusters similar records into blocks using a vector-space representation and
projections. Second, we introduce to the literature a subquadratic variant of
LSH known as Densified One Permutation Hashing (DOPH). Third, we propose a
weighted variant of DOPH. We illustrate each method on an application to a
subset of the ongoing Syrian conflict, giving a discussion of each method.
Comment: 16 pages, 3 figures. arXiv admin note: substantial text overlap with
arXiv:1510.07714, arXiv:1710.0269
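The blocking step sketched in this abstract can be illustrated in a few lines of Python. The snippet below is our own toy illustration of the KLSH idea (not the authors' implementation): records are embedded as character-shingle TF-IDF vectors, reduced with a random projection, and clustered with k-means, with each cluster treated as a block. The example records, shingle size, projection dimension, and number of blocks are illustrative assumptions.

```python
# Toy KLSH-style blocking: TF-IDF shingle vectors -> random projection -> k-means.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import KMeans

records = [
    "Jamal al-Halabi 1975 Aleppo",
    "Jamal Halabi 1975 Aleppo",
    "Rana Haddad 1983 Homs",
    "Rana Hadad 1983 Homs",
]

# Vector-space representation via character 3-shingles.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(records)

# Random projection followed by k-means clustering into blocks.
proj = GaussianRandomProjection(n_components=16, random_state=0)
X_proj = proj.fit_transform(X.toarray())
blocks = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_proj)

for rec, b in zip(records, blocks):
    print(b, rec)
```

Records assigned to the same block would then be compared by a downstream entity-resolution model, so that only within-block pairs need to be scored.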
Statistical Challenges in Modeling Big Brain Signals
Brain signal data are inherently big: massive in amount, complex in
structure, and high in dimensions. These characteristics impose great
challenges for statistical inference and learning. Here we review several key
challenges, discuss possible solutions, and highlight future research
directions.
Fast computation of p-values for the permutation test based on Pearson's correlation coefficient and other statistical tests
Permutation tests are among the simplest and most widely used statistical
tools. Their p-values can be computed by a straightforward sampling of
permutations. However, this way of computing p-values is often so slow that it
is replaced by an approximation, which is accurate only for part of the
interesting range of parameters. Moreover, the accuracy of the approximation
usually cannot be improved by increasing the computation time.
We introduce a new sampling-based algorithm which uses the fast Fourier
transform to compute p-values for the permutation test based on Pearson's
correlation coefficient. The algorithm is practically and asymptotically faster
than straightforward sampling. Typically, its complexity is logarithmic in the
input size, while the complexity of straightforward sampling is linear. The
idea behind the algorithm can also be used to accelerate the computation of
p-values for many other common statistical tests. The algorithm is easy to
implement, but its analysis involves results from the representation theory of
the symmetric group.
Comment: 11 pages
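For context, the straightforward sampling approach that the abstract's FFT-based algorithm accelerates can be sketched as follows; the data and the number of permutations are illustrative.

```python
# Plain Monte Carlo permutation test for Pearson's correlation coefficient:
# permute one variable many times and count how often the permuted statistic
# is at least as extreme as the observed one.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.3 * x + rng.normal(size=50)

def pearson_r(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

r_obs = pearson_r(x, y)
n_perm = 10_000
count = sum(abs(pearson_r(x, rng.permutation(y))) >= abs(r_obs)
            for _ in range(n_perm))
# Add-one correction so the estimated p-value is never exactly zero.
p_value = (count + 1) / (n_perm + 1)
print(r_obs, p_value)
```

Each permutation costs a full pass over the data, which is the linear-in-input-size behaviour that the FFT-based algorithm improves upon.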
A Tracy-Widom Empirical Estimator For Valid P-values With High-Dimensional Datasets
Recent technological advances in many domains including both genomics and
brain imaging have led to an abundance of high-dimensional and correlated data
being routinely collected. Classical multivariate approaches like Multivariate
Analysis of Variance (MANOVA) and Canonical Correlation Analysis (CCA) can be
used to study relationships between such multivariate datasets. Yet, special
care is required with high-dimensional data, as the test statistics may be
ill-defined and classical inference procedures break down.
In this work, we explain how valid p-values can be derived for these
multivariate methods even with high-dimensional datasets. Our main contribution
is an empirical estimator for the largest root distribution of a singular
double Wishart problem; this general framework underlies many common
multivariate analysis approaches. From a small number of permutations of the
data, we estimate the location and scale parameters of a parametric Tracy-Widom
family that provides a good approximation of this distribution. Through
simulations, we show that this estimated distribution also leads to valid
p-values that can be used for high-dimensional inference. We then apply our
approach to a pathway-based analysis of the association between DNA methylation
and disease type in patients with systemic auto-immune rheumatic diseases.
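The moment-matching idea behind this estimator can be sketched as follows. This is our own simplified illustration, not the authors' implementation: Roy's largest root for CCA is recomputed under a small number of permutations, and the location and scale of a Tracy-Widom(1) family are fitted by matching the permutation mean and standard deviation. The data sizes, the number of permutations, the use of the raw (untransformed) largest root, and the Tracy-Widom moment constants are assumptions; evaluating the final p-value would additionally require a Tracy-Widom CDF from a dedicated library, which is not reimplemented here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 100, 20, 15
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))

def largest_root(X, Y):
    # Largest squared canonical correlation (Roy's largest root).
    Qx, _ = np.linalg.qr(X - X.mean(0))
    Qy, _ = np.linalg.qr(Y - Y.mean(0))
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0] ** 2

stat_obs = largest_root(X, Y)
# A small number of permutations of one dataset.
perm_stats = np.array([largest_root(X, Y[rng.permutation(n)])
                       for _ in range(50)])

# Approximate mean and standard deviation of the Tracy-Widom(1) law.
TW1_MEAN, TW1_SD = -1.2065, 1.2680

# Fit location and scale by matching the permutation mean and sd.
sigma_hat = perm_stats.std(ddof=1) / TW1_SD
mu_hat = perm_stats.mean() - sigma_hat * TW1_MEAN
z = (stat_obs - mu_hat) / sigma_hat
# The p-value would be 1 - F_TW1(z), with F_TW1 from a Tracy-Widom library.
print(mu_hat, sigma_hat, z)
```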
Dimensionality Reduction and (Bucket) Ranking: a Mass Transportation Approach
Whereas most dimensionality reduction techniques (e.g. PCA, ICA, NMF) for
multivariate data essentially rely on linear algebra to a certain extent,
summarizing ranking data, viewed as realizations of a random permutation Σ
on a set of items indexed by {1, ..., n}, is a great
statistical challenge, due to the absence of vector space structure for the set
of permutations S_n. It is the goal of this article to develop an
original framework for possibly reducing the number of parameters required to
describe the distribution of a statistical population composed of
rankings/permutations, on the premise that the collection of items under study
can be partitioned into subsets/buckets, such that, with high probability,
items in a certain bucket are either all ranked higher or else all ranked lower
than items in another bucket. In this context, Σ's distribution can be
hopefully represented in a sparse manner by a bucket distribution, i.e. a
bucket ordering plus the ranking distributions within each bucket. More
precisely, we introduce a dedicated distortion measure, based on a mass
transportation metric, in order to quantify the accuracy of such
representations. The performance of buckets minimizing an empirical version of
the distortion is investigated through a rate bound analysis. Complexity
penalization techniques are also considered to select the shape of a bucket
order with minimum expected distortion. Beyond theoretical concepts and
results, numerical experiments on real ranking data are displayed in order to
provide empirical evidence of the relevance of the approach promoted.
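As a rough, self-contained illustration of the kind of quantity involved (not the paper's exact mass-transportation distortion), the sketch below computes an empirical Kendall-type distortion for a candidate bucket order: the sum, over pairs of items lying in different buckets, of the fraction of sampled rankings that contradict the bucket order. The simulated rankings and the candidate bucket order are illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n_items, n_samples = 6, 200
# Each row gives the rank position of items 0..5 (smaller = ranked higher);
# generated near the identity ordering with some noise.
scores = np.arange(n_items) + rng.normal(scale=1.0, size=(n_samples, n_items))
ranks = scores.argsort(axis=1).argsort(axis=1)

bucket_order = [[0, 1], [2, 3], [4, 5]]  # candidate bucket order

def empirical_distortion(ranks, buckets):
    total = 0.0
    for k in range(len(buckets)):
        for l in range(k + 1, len(buckets)):
            for i, j in product(buckets[k], buckets[l]):
                # Fraction of rankings placing item j above item i,
                # contradicting "bucket k ranked before bucket l".
                total += np.mean(ranks[:, j] < ranks[:, i])
    return total

print(empirical_distortion(ranks, bucket_order))
```

A bucket order with small empirical distortion is one for which the sampled rankings rarely mix items across buckets, which is the sparsity premise of the abstract.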
Monte Carlo Methods and Path-Generation techniques for Pricing Multi-asset Path-dependent Options
We consider the problem of pricing path-dependent options on a basket of
underlying assets using simulations. As an example we develop our studies using
Asian options. Asian options are derivative contracts in which the underlying
variable is the average price of given assets sampled over a period of time.
Due to this structure, Asian options display a lower volatility and are
therefore cheaper than their standard European counterparts. This paper is a
survey of some recent enhancements to improve efficiency when pricing Asian
options by Monte Carlo simulation in the Black-Scholes model. We analyze the
dynamics with constant and time-dependent volatilities of the underlying asset
returns. We present a comparison between the precision of the standard Monte
Carlo method (MC) and the stratified Latin Hypercube Sampling (LHS). In
particular, we discuss the use of low-discrepancy sequences, also known as the
Quasi-Monte Carlo (QMC) method, and a randomized version of these sequences,
known as Randomized Quasi-Monte Carlo (RQMC). The latter has proven to be a
useful variance reduction technique both for problems of up to 20 dimensions
and for problems in very high dimensions. Moreover, we present and test a new path
generation approach based on a Kronecker product approximation (KPA) in the
case of time-dependent volatilities. KPA proves to be a fast generation
technique and reduces the computational cost of the simulation procedure.
Comment: 34 pages, 4 figures, 15 tables
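The plain Monte Carlo baseline discussed in this abstract can be sketched as follows; all market parameters below are illustrative, and the LHS, QMC, and RQMC variants would replace the pseudo-random normal draws used to build the paths.

```python
# Plain Monte Carlo pricing of an arithmetic-average Asian call under
# Black-Scholes with constant volatility.
import numpy as np

rng = np.random.default_rng(3)
S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
n_steps, n_paths = 50, 100_000
dt = T / n_steps

# GBM paths: S_{t+dt} = S_t * exp((r - sigma^2/2) dt + sigma sqrt(dt) Z).
Z = rng.standard_normal((n_paths, n_steps))
log_increments = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z
S = S0 * np.exp(np.cumsum(log_increments, axis=1))

# Arithmetic-average Asian call payoff, discounted to time 0.
payoff = np.maximum(S.mean(axis=1) - K, 0.0)
price = np.exp(-r * T) * payoff.mean()
std_err = np.exp(-r * T) * payoff.std(ddof=1) / np.sqrt(n_paths)
print(f"MC price: {price:.4f} +/- {1.96 * std_err:.4f}")
```

The confidence half-width printed above is what stratified, quasi-random, or Kronecker-product path-generation schemes aim to shrink for a fixed number of paths.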
FinInG: a package for Finite Incidence Geometry
FinInG is a package for computation in Finite Incidence Geometry. It provides
users with the basic tools to work in various areas of finite geometry from the
realms of projective spaces to the flat lands of generalised polygons. The
algebraic power of GAP is exploited, particularly in its facility with matrix
and permutation groups.
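FinInG itself runs inside GAP, so the sketch below is not the package's API; it is a plain-Python illustration of the kind of object the package manipulates, enumerating the points and lines of the smallest projective plane PG(2, 2), the Fano plane.

```python
# Points of PG(2, 2): nonzero vectors over GF(2) (scaling is trivial here).
# Lines: triples of points whose coordinate-wise sum is zero mod 2.
from itertools import product, combinations

points = [p for p in product((0, 1), repeat=3) if any(p)]  # 7 points
lines = [
    {a, b, c}
    for a, b, c in combinations(points, 3)
    if all((x + y + z) % 2 == 0 for x, y, z in zip(a, b, c))
]

print(len(points), "points,", len(lines), "lines")  # 7 points, 7 lines
for line in lines:
    print(sorted(line))
```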
Efficient Learning of Mixed Membership Models
We present an efficient algorithm for learning mixed membership models when
the number of variables p is much larger than the number of hidden components
k. This algorithm reduces the computational complexity of state-of-the-art
tensor methods, which require decomposing an O(p^3) tensor, to factorizing
O(p/k) sub-tensors each of size O(k^3).
In addition, we address the issue of negative entries in the empirical method
of moments based estimators. We provide sufficient conditions under which our
approach has provable guarantees. Our approach obtains competitive empirical
results on both simulated and real data.
Comment: 23 pages, Proceedings of the 34th International Conference on Machine
Learning (ICML), Sydney, Australia, 2017
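To make the complexity claim concrete, the sketch below (our own illustration, not the paper's algorithm) builds the empirical third-order moment tensor that standard method-of-moments estimators decompose, whose size grows cubically in the number of variables p, and contrasts it with the size of a single k x k x k sub-tensor. The toy data are an assumption, and the paper's sub-tensor factorization itself is not reimplemented.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 500, 40, 5          # samples, observed variables, hidden components
X = rng.random((n, p))

# Empirical (uncentred) third-order moment tensor: M3[a, b, c] = E[x_a x_b x_c].
M3 = np.einsum("na,nb,nc->abc", X, X, X) / n

print("full moment tensor entries:", M3.size)        # p**3 = 64,000
print("one k x k x k sub-tensor entries:", k ** 3)   # 125
```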
Solving Polynomial Systems in the Cloud with Polynomial Homotopy Continuation
Polynomial systems occur in many fields of science and engineering.
Polynomial homotopy continuation methods apply symbolic-numeric algorithms to
solve polynomial systems. We describe the design and implementation of our web
interface and reflect on the application of polynomial homotopy continuation
methods to solve polynomial systems in the cloud. Via the graph isomorphism
problem we organize and classify the polynomial systems we solved. The
classification with the canonical form of a graph identifies newly submitted
systems with systems that have already been solved.
Comment: Accepted for publication in the Proceedings of CASC 201
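The core predictor-corrector mechanism of polynomial homotopy continuation can be illustrated in a few lines. The sketch below is our own toy univariate tracker, not PHCpack or the web interface described here: the roots of a start system x^d - 1 are deformed into the roots of a target polynomial along a convex homotopy with the usual random gamma constant. The target polynomial, step size, and number of Newton corrections are illustrative.

```python
import numpy as np

f = np.array([1.0, 0.0, -5.0, 3.0])        # target f(x) = x^3 - 5x + 3
d = len(f) - 1
g = np.zeros(d + 1, dtype=complex); g[0], g[-1] = 1.0, -1.0   # start g(x) = x^d - 1
gamma = np.exp(1j * 0.37)                   # random complex constant ("gamma trick")

def H(x, t):   return (1 - t) * gamma * np.polyval(g, x) + t * np.polyval(f, x)
def Hx(x, t):  return (1 - t) * gamma * np.polyval(np.polyder(g), x) + t * np.polyval(np.polyder(f), x)
def Ht(x, t):  return np.polyval(f, x) - gamma * np.polyval(g, x)

roots = [np.exp(2j * np.pi * k / d) for k in range(d)]   # known roots of x^d - 1
dt = 0.01
for step in range(int(1 / dt)):
    t = step * dt
    for i, x in enumerate(roots):
        x = x - Ht(x, t) / Hx(x, t) * dt          # Euler predictor in t
        for _ in range(3):                        # Newton corrector at t + dt
            x = x - H(x, t + dt) / Hx(x, t + dt)
        roots[i] = x

print(np.sort_complex(np.array(roots)))
print(np.sort_complex(np.roots(f)))              # reference roots for comparison
```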
A Joint Encryption-Encoding Scheme Using QC-LDPC Codes Based on Finite Geometry
Joint encryption-encoding schemes have been proposed to meet both reliability
and security requirements in a single step. Using Low Density Parity Check
(LDPC) codes in joint encryption-encoding schemes, as an alternative to
classical linear codes, can shorten the key size as well as improve
error-correction capability. In this article, we present a joint
encryption-encoding
scheme using Quasi Cyclic-Low Density Parity Check (QC-LDPC) codes based on
finite geometry. We observe that our proposed scheme not only outperforms its
predecessors in key size and transmission rate, but also remains secure against
all known cryptanalyses of code-based secret-key cryptosystems. We subsequently
show that our scheme benefits from low computational complexity. In our
proposed joint encryption-encoding scheme, by taking advantage of QC-LDPC
codes based on finite geometries, the key size decreases to 1/5 of that of the
best comparable system to date. In addition, using our proposed scheme, a wide
range
of desirable transmission rates are achievable. This variety of codes makes our
cryptosystem suitable for a number of different communication and cryptographic
standards.
Comment: 21 pages, 2 figures
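As a classroom-style illustration of the joint encryption-encoding idea (not the QC-LDPC scheme of this abstract), the sketch below uses a small Hamming(7,4) code in place of a QC-LDPC code: a secret scrambler and permutation play the role of the key, and a weight-one error is added so that a single operation both encrypts the message and encodes it against a channel error. All matrices and parameters are illustrative assumptions.

```python
import numpy as np

G = np.array([[1,0,0,0,1,1,0],      # systematic Hamming(7,4) generator
              [0,1,0,0,1,0,1],
              [0,0,1,0,0,1,1],
              [0,0,0,1,1,1,1]])
H = np.array([[1,1,0,1,1,0,0],      # corresponding parity-check matrix
              [1,0,1,1,0,1,0],
              [0,1,1,1,0,0,1]])

S = np.array([[1,1,0,0],            # secret invertible scrambler ...
              [0,1,1,0],
              [0,0,1,1],
              [0,0,0,1]])
S_inv = np.array([[1,1,1,1],        # ... and its inverse over GF(2)
                  [0,1,1,1],
                  [0,0,1,1],
                  [0,0,0,1]])
rng = np.random.default_rng(5)
P = rng.permutation(7)              # secret permutation of codeword positions

def encrypt(m):
    e = np.zeros(7, dtype=int); e[rng.integers(7)] = 1     # weight-1 error
    c = ((m @ S % 2) @ G + e) % 2                          # scramble and encode
    return c[P]                                            # permute positions

def decrypt(c):
    y = np.empty(7, dtype=int); y[P] = c                   # undo permutation
    syndrome = H @ y % 2
    err = (H.T == syndrome).all(axis=1)                    # locate the error bit
    y = (y + err.astype(int)) % 2                          # correct it
    return y[:4] @ S_inv % 2                               # unscramble message

m = np.array([1, 0, 1, 1])
c = encrypt(m)
assert (decrypt(c) == m).all()
print("message:", m, "ciphertext:", c)
```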