13,279 research outputs found

    Probabilistic Blocking with An Application to the Syrian Conflict

    Entity resolution seeks to merge databases so as to remove duplicate entries where unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce $k$-means locality sensitive hashing (KLSH), which is based upon the information retrieval literature and clusters similar records into blocks using a vector-space representation and projections. Second, we introduce a subquadratic variant of LSH to the literature, known as Densified One Permutation Hashing (DOPH). Third, we propose a weighted variant of DOPH. We illustrate each method on an application to a subset of the ongoing Syrian conflict, giving a discussion of each method.
    Comment: 16 pages, 3 figures. arXiv admin note: substantial text overlap with arXiv:1510.07714, arXiv:1710.0269
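
    As a rough illustration of LSH-based blocking in general (generic MinHash banding rather than the paper's KLSH, DOPH, or weighted DOPH variants), the sketch below hashes shingled record strings and groups records that agree on at least one band of their signature into candidate blocks. The example records, shingle length, and band/row counts are illustrative assumptions.

```python
import random
from collections import defaultdict

# Generic MinHash banding for blocking: records that agree on every row of at
# least one band share a block key and become candidate duplicates.

def shingles(text, q=3):
    text = text.lower()
    return {text[i:i + q] for i in range(max(1, len(text) - q + 1))}

def minhash_signature(shingle_set, num_hashes, seed=42):
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

def lsh_blocks(records, bands=16, rows=4):
    blocks = defaultdict(set)
    for idx, rec in enumerate(records):
        sig = minhash_signature(shingles(rec), num_hashes=bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            blocks[key].add(idx)
    # Only blocks holding more than one record yield candidate pairs.
    return {frozenset(ids) for ids in blocks.values() if len(ids) > 1}

records = ["Jon Smith 1980-01-02", "John Smith 1980-01-02", "Maria Garcia 1975-07-30"]
print(lsh_blocks(records))  # the two Smith records typically end up blocked together
```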

    Statistical Challenges in Modeling Big Brain Signals

    Brain signal data are inherently big: massive in amount, complex in structure, and high in dimension. These characteristics impose great challenges for statistical inference and learning. Here we review several key challenges, discuss possible solutions, and highlight future research directions.

    Fast computation of p-values for the permutation test based on Pearson's correlation coefficient and other statistical tests

    Permutation tests are among the simplest and most widely used statistical tools. Their p-values can be computed by straightforward sampling of permutations. However, this way of computing p-values is often so slow that it is replaced by an approximation, which is accurate only for part of the interesting range of parameters. Moreover, the accuracy of the approximation usually cannot be improved by increasing the computation time. We introduce a new sampling-based algorithm which uses the fast Fourier transform to compute p-values for the permutation test based on Pearson's correlation coefficient. The algorithm is practically and asymptotically faster than straightforward sampling. Typically, its complexity is logarithmic in the input size, while the complexity of straightforward sampling is linear. The idea behind the algorithm can also be used to accelerate the computation of p-values for many other common statistical tests. The algorithm is easy to implement, but its analysis involves results from the representation theory of the symmetric group.
    Comment: 11 pages
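
    For reference, here is a minimal sketch of the straightforward sampling baseline that the paper accelerates; it is not the FFT-based algorithm itself, and the sample size, two-sided statistic, and add-one correction are standard choices rather than details taken from the paper.

```python
import numpy as np

def permutation_pvalue(x, y, num_perms=10_000, seed=0):
    """Two-sided permutation p-value for Pearson's correlation by plain sampling."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    count = 0
    for _ in range(num_perms):
        # Permuting one variable breaks any association while keeping the marginals.
        if abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= observed:
            count += 1
    # Add-one correction keeps the estimate a valid p-value.
    return (count + 1) / (num_perms + 1)

rng = np.random.default_rng(1)
x = rng.standard_normal(50)
y = 0.3 * x + rng.standard_normal(50)
print(permutation_pvalue(x, y))
```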

    A Tracy-Widom Empirical Estimator For Valid P-values With High-Dimensional Datasets

    Recent technological advances in many domains, including both genomics and brain imaging, have led to an abundance of high-dimensional and correlated data being routinely collected. Classical multivariate approaches like Multivariate Analysis of Variance (MANOVA) and Canonical Correlation Analysis (CCA) can be used to study relationships between such multivariate datasets. Yet, special care is required with high-dimensional data, as the test statistics may be ill-defined and classical inference procedures break down. In this work, we explain how valid p-values can be derived for these multivariate methods even with high-dimensional datasets. Our main contribution is an empirical estimator for the largest root distribution of a singular double Wishart problem; this general framework underlies many common multivariate analysis approaches. From a small number of permutations of the data, we estimate the location and scale parameters of a parametric Tracy-Widom family that provides a good approximation of this distribution. Through simulations, we show that this estimated distribution also leads to valid p-values that can be used for high-dimensional inference. We then apply our approach to a pathway-based analysis of the association between DNA methylation and disease type in patients with systemic auto-immune rheumatic diseases.
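
    The sketch below is a hedged illustration of the general moment-matching idea: permute one dataset a handful of times, compute a largest-root statistic (here the largest squared canonical correlation), and fit the location and scale of a Tracy-Widom reference law to the permuted statistics. The TW(1) moment constants are approximate published values, the choice of Tracy-Widom order is an assumption, and `tw_sf` stands in for a Tracy-Widom survival function the user supplies from elsewhere; none of this reproduces the authors' implementation.

```python
import numpy as np

TW1_MEAN, TW1_SD = -1.2065, 1.2680  # approximate mean and sd of the Tracy-Widom(1) law

def largest_root(X, Y):
    """Largest squared canonical correlation between column-centered X and Y."""
    qx, _ = np.linalg.qr(X - X.mean(axis=0))
    qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return np.linalg.svd(qx.T @ qy, compute_uv=False)[0] ** 2

def empirical_tw_pvalue(X, Y, tw_sf, num_perms=50, seed=0):
    """Estimate location/scale from a few permutations, then read off a p-value."""
    rng = np.random.default_rng(seed)
    perm_stats = np.array([largest_root(X[rng.permutation(len(X))], Y)
                           for _ in range(num_perms)])
    # Moment matching: (stat - loc) / scale should roughly follow TW(1) under the null.
    scale = perm_stats.std(ddof=1) / TW1_SD
    loc = perm_stats.mean() - scale * TW1_MEAN
    return tw_sf((largest_root(X, Y) - loc) / scale)  # tw_sf: user-supplied TW survival function
```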

    Dimensionality Reduction and (Bucket) Ranking: a Mass Transportation Approach

    Whereas most dimensionality reduction techniques (e.g. PCA, ICA, NMF) for multivariate data essentially rely on linear algebra to a certain extent, summarizing ranking data, viewed as realizations of a random permutation $\Sigma$ on a set of items indexed by $i \in \{1, \ldots, n\}$, is a great statistical challenge, due to the absence of vector space structure for the set of permutations $\mathfrak{S}_n$. It is the goal of this article to develop an original framework for possibly reducing the number of parameters required to describe the distribution of a statistical population composed of rankings/permutations, on the premise that the collection of items under study can be partitioned into subsets/buckets, such that, with high probability, items in a certain bucket are either all ranked higher or else all ranked lower than items in another bucket. In this context, $\Sigma$'s distribution can hopefully be represented in a sparse manner by a bucket distribution, i.e. a bucket ordering plus the ranking distributions within each bucket. More precisely, we introduce a dedicated distortion measure, based on a mass transportation metric, in order to quantify the accuracy of such representations. The performance of buckets minimizing an empirical version of the distortion is investigated through a rate bound analysis. Complexity penalization techniques are also considered to select the shape of a bucket order with minimum expected distortion. Beyond theoretical concepts and results, numerical experiments on real ranking data are displayed in order to provide empirical evidence of the relevance of the approach promoted.
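
    To make the bucket idea concrete, the sketch below scores a candidate bucket order by the fraction of cross-bucket item pairs that the observed rankings place in the opposite order. This Kendall-style disagreement count is only a simplified stand-in for the paper's mass-transportation distortion, and the example rankings and buckets are arbitrary.

```python
import itertools
import numpy as np

def bucket_disagreement(rankings, buckets):
    """Fraction of cross-bucket pairs whose observed order contradicts the bucket order.

    rankings: array of shape (num_samples, n) with rankings[s, i] = rank of item i (0 = best).
    buckets:  list of lists of item indices, ordered from best bucket to worst.
    """
    bucket_of = {item: b for b, items in enumerate(buckets) for item in items}
    n = rankings.shape[1]
    disagree, total = 0, 0
    for i, j in itertools.combinations(range(n), 2):
        if bucket_of[i] == bucket_of[j]:
            continue  # within-bucket pairs are left unconstrained by the bucket order
        # The item in the earlier bucket should carry the smaller rank value.
        better, worse = (i, j) if bucket_of[i] < bucket_of[j] else (j, i)
        disagree += int(np.sum(rankings[:, better] > rankings[:, worse]))
        total += rankings.shape[0]
    return disagree / total

rng = np.random.default_rng(0)
rankings = np.array([rng.permutation(6) for _ in range(100)])
print(bucket_disagreement(rankings, buckets=[[0, 1], [2, 3], [4, 5]]))  # ~0.5 for uniform rankings
```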

    Monte Carlo Methods and Path-Generation techniques for Pricing Multi-asset Path-dependent Options

    We consider the problem of pricing path-dependent options on a basket of underlying assets using simulations. As an example we develop our studies using Asian options. Asian options are derivative contracts in which the underlying variable is the average price of given assets sampled over a period of time. Due to this structure, Asian options display lower volatility and are therefore cheaper than their standard European counterparts. This paper is a survey of some recent enhancements to improve efficiency when pricing Asian options by Monte Carlo simulation in the Black-Scholes model. We analyze the dynamics with constant and time-dependent volatilities of the underlying asset returns. We present a comparison between the precision of the standard Monte Carlo method (MC) and of stratified Latin Hypercube Sampling (LHS). In particular, we discuss the use of low-discrepancy sequences, also known as the Quasi-Monte Carlo method (QMC), and a randomized version of these sequences, known as Randomized Quasi-Monte Carlo (RQMC). The latter has proven to be a useful variance reduction technique both for problems of up to 20 dimensions and for very high-dimensional problems. Moreover, we present and test a new path-generation approach based on a Kronecker product approximation (KPA) in the case of time-dependent volatilities. KPA proves to be a fast generation technique and reduces the computational cost of the simulation procedure.
    Comment: 34 pages, 4 figures, 15 tables
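
    Below is a minimal pricing sketch in the constant-volatility Black-Scholes setting, comparing plain Monte Carlo with a scrambled Sobol (quasi-Monte Carlo) point set from SciPy. The contract parameters, monitoring dates, and sample sizes are illustrative assumptions, and the LHS, RQMC, and KPA variants discussed in the paper are not reproduced here.

```python
import numpy as np
from scipy.stats import norm, qmc

# Illustrative contract: arithmetic-average Asian call, 12 monitoring dates.
S0, K, r, sigma, T, m = 100.0, 100.0, 0.05, 0.2, 1.0, 12
dt = T / m

def discounted_payoffs(z):
    """Map standard normal increments (n_paths, m) to discounted Asian call payoffs."""
    log_paths = np.log(S0) + np.cumsum((r - 0.5 * sigma**2) * dt
                                       + sigma * np.sqrt(dt) * z, axis=1)
    average = np.exp(log_paths).mean(axis=1)
    return np.exp(-r * T) * np.maximum(average - K, 0.0)

n = 2**14
# Plain Monte Carlo draws.
z_mc = np.random.default_rng(0).standard_normal((n, m))
# Quasi-Monte Carlo: scrambled Sobol points mapped to normals via the inverse CDF.
z_qmc = norm.ppf(qmc.Sobol(d=m, scramble=True, seed=0).random(n))

print("MC estimate: ", discounted_payoffs(z_mc).mean())
print("QMC estimate:", discounted_payoffs(z_qmc).mean())
```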

    FinInG: a package for Finite Incidence Geometry

    FinInG is a package for computation in Finite Incidence Geometry. It provides users with the basic tools to work in various areas of finite geometry, from the realms of projective spaces to the flat lands of generalised polygons. The algebraic power of GAP is exploited, particularly in its facility with matrix and permutation groups.

    Efficient Learning of Mixed Membership Models

    We present an efficient algorithm for learning mixed membership models when the number of variables $p$ is much larger than the number of hidden components $k$. This algorithm reduces the computational complexity of state-of-the-art tensor methods, which require decomposing an $O(p^3)$ tensor, to factorizing $O(p/k)$ sub-tensors, each of size $O(k^3)$. In addition, we address the issue of negative entries in the empirical method-of-moments based estimators. We provide sufficient conditions under which our approach has provable guarantees. Our approach obtains competitive empirical results on both simulated and real data.
    Comment: 23 pages, Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 2017
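
    To illustrate the "factorize a small sub-tensor" step in isolation, the sketch below runs a plain CP alternating-least-squares decomposition on a synthetic rank-$k$ tensor with $O(k^3)$ entries. It is not the paper's moment-based algorithm, it ignores the negative-entry issue mentioned above, and the tensor size and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5

# Synthetic rank-k tensor T[i, j, l] = sum_r A[i, r] * B[j, r] * C[l, r].
A_true, B_true, C_true = (rng.standard_normal((k, k)) for _ in range(3))
T = np.einsum('ir,jr,lr->ijl', A_true, B_true, C_true)

def unfold(T, mode):
    """Matricize T along the given mode (that mode becomes the row index)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(X, Y):
    """Column-wise Kronecker product of X and Y."""
    return np.einsum('ir,jr->ijr', X, Y).reshape(-1, X.shape[1])

# Alternating least squares: cycle through the three factor matrices.
A, B, C = (rng.standard_normal((k, k)) for _ in range(3))
for _ in range(500):
    A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C)).T
    B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C)).T
    C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B)).T

T_hat = np.einsum('ir,jr,lr->ijl', A, B, C)
print("relative reconstruction error:", np.linalg.norm(T - T_hat) / np.linalg.norm(T))
```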

    Solving Polynomial Systems in the Cloud with Polynomial Homotopy Continuation

    Polynomial systems occur in many fields of science and engineering. Polynomial homotopy continuation methods apply symbolic-numeric algorithms to solve polynomial systems. We describe the design and implementation of our web interface and reflect on the application of polynomial homotopy continuation methods to solve polynomial systems in the cloud. Via the graph isomorphism problem we organize and classify the polynomial systems we solved. The classification with the canonical form of a graph identifies newly submitted systems with systems that have already been solved.
    Comment: Accepted for publication in the Proceedings of CASC 201
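
    As a toy, self-contained illustration of polynomial homotopy continuation (independent of the paper's web interface and of any particular solver package), the sketch below tracks the roots of a univariate start system to a target system with an Euler predictor and Newton corrector; the target polynomial, step count, and random gamma are arbitrary choices.

```python
import numpy as np

# Deform the start system g(x) = x^3 - 1 (roots: cube roots of unity) into the
# target f(x) = x^3 - 2x + 5 along H(x, t) = gamma*(1 - t)*g(x) + t*f(x), t: 0 -> 1.
# The random complex gamma ("gamma trick") keeps the paths away from singularities
# with probability one.

rng = np.random.default_rng(7)
gamma = np.exp(2j * np.pi * rng.random())

def f(x):   return x**3 - 2*x + 5
def df(x):  return 3*x**2 - 2
def g(x):   return x**3 - 1
def dg(x):  return 3*x**2

def H(x, t):    return gamma * (1 - t) * g(x) + t * f(x)
def dHdx(x, t): return gamma * (1 - t) * dg(x) + t * df(x)
def dHdt(x, t): return f(x) - gamma * g(x)

def track(x, steps=400, newton_iters=5):
    ts = np.linspace(0.0, 1.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * (-dHdt(x, t0) / dHdx(x, t0))  # Euler predictor
        for _ in range(newton_iters):                      # Newton corrector at t1
            x = x - H(x, t1) / dHdx(x, t1)
    return x

for k in range(3):
    root = track(np.exp(2j * np.pi * k / 3))
    print(root, "|f(root)| =", abs(f(root)))
```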

    A Joint Encryption-Encoding Scheme Using QC-LDPC Codes Based on Finite Geometry

    Joint encryption-encoding schemes have been proposed to fulfill both reliability and security requirements in a single step. Using Low-Density Parity-Check (LDPC) codes in joint encryption-encoding schemes, as an alternative to classical linear codes, shortens the key size as well as improving error-correction capability. In this article, we present a joint encryption-encoding scheme using Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) codes based on finite geometry. We observed that our proposed scheme not only outperforms its predecessors in key size and transmission rate, but also remains secure against all known cryptanalyses of code-based secret-key cryptosystems. We subsequently show that our scheme benefits from low computational complexity. In our proposed joint encryption-encoding scheme, by taking advantage of QC-LDPC codes based on finite geometries, the key size decreases to 1/5 of that of the best comparable previous system. In addition, a wide range of desirable transmission rates are achievable with our proposed scheme. This variety of codes makes our cryptosystem suitable for a number of different communication and cryptographic standards.
    Comment: 21 pages, 2 figures
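
    As a conceptual toy of the generic joint encryption-encoding idea (scramble, encode, add an intentional error, permute), the sketch below uses a tiny Hamming(7,4) code in place of QC-LDPC codes; the key sizes, parameters, and security properties of the actual scheme are not represented here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Systematic Hamming(7,4) generator and parity-check matrices over GF(2)
# (a stand-in for the scheme's QC-LDPC code).
A = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]], dtype=np.uint8)
G = np.concatenate([np.eye(4, dtype=np.uint8), A], axis=1)    # 4 x 7 generator
H = np.concatenate([A.T, np.eye(3, dtype=np.uint8)], axis=1)  # 3 x 7 parity check

# Secret key: a GF(2) scrambler S (with its inverse) and a position permutation.
S     = np.array([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]], dtype=np.uint8)
S_inv = np.array([[1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]], dtype=np.uint8)
perm = rng.permutation(7)

def encrypt(m):
    codeword = (m @ S % 2) @ G % 2      # scramble the message, then encode it
    e = np.zeros(7, dtype=np.uint8)
    e[rng.integers(7)] = 1              # add an intentional single-bit error
    return (codeword ^ e)[perm]         # permute the codeword positions

def decrypt(c):
    x = c[np.argsort(perm)]             # undo the permutation
    syndrome = H @ x % 2
    if syndrome.any():                  # locate and flip the erroneous bit
        j = int(np.where((H.T == syndrome).all(axis=1))[0][0])
        x[j] ^= 1
    return x[:4] @ S_inv % 2            # drop the parity bits, unscramble

m = np.array([1, 0, 1, 1], dtype=np.uint8)
c = encrypt(m)
print("ciphertext:", c, "-> recovered:", decrypt(c), "ok:", np.array_equal(decrypt(c), m))
```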