539,001 research outputs found

    Unsupervised Discovery of Co-occurrence in Sparse High Dimensional Data

    Get PDF
    An efficient min-Hash based algorithm for discovery of dependencies in sparse high-dimensional data is presented. The dependencies are represented by sets of features co-occurring with high probability and are called co-ocsets. Sparse high dimensional descriptors, such as bag of words, have been proven very effective in the domain of image retrieval. To maintain high efficiency even for very large data collection, features are assumed independent. We show experimentally that co-ocsets are not rare, i.e. the independence assumption is often violated, and that they may ruin retrieval performance if present in the query image. Two methods for managing co-ocsets in such cases are proposed. Both methods significantly outperform the state-of-the-art in image retrieval, one is also significantly faster

    Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience.

    Get PDF
    Identifying low-dimensional features that describe large-scale neural recordings is a major challenge in neuroscience. Repeated temporal patterns (sequences) are thought to be a salient feature of neural dynamics, but are not succinctly captured by traditional dimensionality reduction techniques. Here, we describe a software toolbox-called seqNMF-with new methods for extracting informative, non-redundant, sequences from high-dimensional neural data, testing the significance of these extracted patterns, and assessing the prevalence of sequential structure in data. We test these methods on simulated data under multiple noise conditions, and on several real neural and behavioral datas. In hippocampal data, seqNMF identifies neural sequences that match those calculated manually by reference to behavioral events. In songbird data, seqNMF discovers neural sequences in untutored birds that lack stereotyped songs. Thus, by identifying temporal structure directly from neural data, seqNMF enables dissection of complex neural circuits without relying on temporal references from stimuli or behavioral outputs

    Discovery of low-dimensional structure in high-dimensional inference problems

    Full text link
    Many learning and inference problems involve high-dimensional data such as images, video or genomic data, which cannot be processed efficiently using conventional methods due to their dimensionality. However, high-dimensional data often exhibit an inherent low-dimensional structure, for instance they can often be represented sparsely in some basis or domain. The discovery of an underlying low-dimensional structure is important to develop more robust and efficient analysis and processing algorithms. The first part of the dissertation investigates the statistical complexity of sparse recovery problems, including sparse linear and nonlinear regression models, feature selection and graph estimation. We present a framework that unifies sparse recovery problems and construct an analogy to channel coding in classical information theory. We perform an information-theoretic analysis to derive bounds on the number of samples required to reliably recover sparsity patterns independent of any specific recovery algorithm. In particular, we show that sample complexity can be tightly characterized using a mutual information formula similar to channel coding results. Next, we derive major extensions to this framework, including dependent input variables and a lower bound for sequential adaptive recovery schemes, which helps determine whether adaptivity provides performance gains. We compute statistical complexity bounds for various sparse recovery problems, showing our analysis improves upon the existing bounds and leads to intuitive results for new applications. In the second part, we investigate methods for improving the computational complexity of subgraph detection in graph-structured data, where we aim to discover anomalous patterns present in a connected subgraph of a given graph. This problem arises in many applications such as detection of network intrusions, community detection, detection of anomalous events in surveillance videos or disease outbreaks. Since optimization over connected subgraphs is a combinatorial and computationally difficult problem, we propose a convex relaxation that offers a principled approach to incorporating connectivity and conductance constraints on candidate subgraphs. We develop a novel nearly-linear time algorithm to solve the relaxed problem, establish convergence and consistency guarantees and demonstrate its feasibility and performance with experiments on real networks

    Classes of Multiple Decision Functions Strongly Controlling FWER and FDR

    Full text link
    This paper provides two general classes of multiple decision functions where each member of the first class strongly controls the family-wise error rate (FWER), while each member of the second class strongly controls the false discovery rate (FDR). These classes offer the possibility that an optimal multiple decision function with respect to a pre-specified criterion, such as the missed discovery rate (MDR), could be found within these classes. Such multiple decision functions can be utilized in multiple testing, specifically, but not limited to, the analysis of high-dimensional microarray data sets.Comment: 19 page

    Data-driven discovery of coordinates and governing equations

    Full text link
    The discovery of governing equations from scientific data has the potential to transform data-rich fields that lack well-characterized quantitative descriptions. Advances in sparse regression are currently enabling the tractable identification of both the structure and parameters of a nonlinear dynamical system from data. The resulting models have the fewest terms necessary to describe the dynamics, balancing model complexity with descriptive ability, and thus promoting interpretability and generalizability. This provides an algorithmic approach to Occam's razor for model discovery. However, this approach fundamentally relies on an effective coordinate system in which the dynamics have a simple representation. In this work, we design a custom autoencoder to discover a coordinate transformation into a reduced space where the dynamics may be sparsely represented. Thus, we simultaneously learn the governing equations and the associated coordinate system. We demonstrate this approach on several example high-dimensional dynamical systems with low-dimensional behavior. The resulting modeling framework combines the strengths of deep neural networks for flexible representation and sparse identification of nonlinear dynamics (SINDy) for parsimonious models. It is the first method of its kind to place the discovery of coordinates and models on an equal footing.Comment: 25 pages, 6 figures; added acknowledgment

    Information shares in the U.S. treasury market

    Get PDF
    This paper is the first to characterize the tatonnement of high-frequency returns from U.S. Treasury spot and futures markets. In particular, we highlight the previously neglected role of the futures markets in price discovery. The lower-bound estimate of bivariate information shares for 30-year Treasury futures typically exceeds 50% from 1998 on. Standard liquidity measures, including the proportion of trades and relative bid-ask spreads, explain daily information shares. These conclusions still hold when one controls for days of macroeconomic announcements. Finally, a 5-dimensional cointegrated system explains a high percentage of Treasury returns. In that system, the 30-year futures contract and the 5-year spot market dominate price discovery. ; Earlier title: The microstructure of bond market tatonnement; Price discovery in the U.S. treasury marketBond market
    corecore