
    Comparing reverse complementary genomic words based on their distance distributions and frequencies

    In this work we study reverse complementary genomic word pairs in the human DNA, comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and the peak dissimilarity is found to work best in this setting. We report the existence of reverse complementary word pairs with very dissimilar distance distributions, as well as word pairs with very similar distance distributions even when both distributions are irregular and contain strong peaks. The association between distribution dissimilarity and frequency discrepancy is also explored, and we speculate that symmetric pairs combining low and high values of each measure may uncover features of interest. Taken together, our results suggest that some asymmetries in the human genome go far beyond Chargaff's rules. This study uses both the complete human genome and its repeat-masked version.
    Comment: Post-print of a paper accepted for publication in "Interdisciplinary Sciences: Computational Life Sciences" (ISSN: 1913-2751, ESSN: 1867-1462)
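    The two quantities this abstract compares can be illustrated with a minimal sketch (helper names are hypothetical; the paper's actual dissimilarity measures, such as the peak dissimilarity, are not reproduced here):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Reverse complement of a DNA word, e.g. 'AAG' -> 'CTT'."""
    return word.translate(COMPLEMENT)[::-1]

def distance_distribution(sequence, word):
    """Distances between consecutive (possibly overlapping) occurrences
    of `word` in `sequence`; the paper compares this distribution for a
    word against that of its reverse complement."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return [b - a for a, b in zip(positions, positions[1:])]
```

    Comparing `distance_distribution(genome, w)` with `distance_distribution(genome, reverse_complement(w))` then requires a dissimilarity measure between the two empirical distributions, which is where the paper's analysis begins.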

    Semantic Stability in Social Tagging Streams

    One potential disadvantage of social tagging systems is that, due to the lack of a centralized vocabulary, a crowd of users may never manage to reach a consensus on the description of resources (e.g., books, users or songs) on the Web. Yet, previous research has provided interesting evidence that the tag distributions of resources may become semantically stable over time as more and more users tag them. At the same time, previous work has raised an array of new questions, such as: (i) How can we assess the semantic stability of social tagging systems in a robust and methodical way? (ii) Does semantic stabilization of tags vary across different social tagging systems? And ultimately, (iii) what are the factors that can explain semantic stabilization in such systems? In this work we tackle these questions by (i) presenting a novel and robust method which overcomes a number of limitations in existing methods, (ii) empirically investigating semantic stabilization processes in a wide range of social tagging systems with distinct domains and properties, and (iii) detecting potential causes for semantic stabilization, specifically imitation behavior, shared background knowledge, and intrinsic properties of natural language. Our results show that tagging streams which are generated by a combination of imitation dynamics and shared background knowledge exhibit faster and higher semantic stability than tagging streams which are generated via imitation dynamics or natural language streams alone.
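    A toy version of the underlying notion of stability (not the paper's actual method, which is more robust): track the tag distribution of a resource as the stream grows and declare it stable once consecutive snapshots stop changing much. All names and the L1 threshold are illustrative assumptions:

```python
from collections import Counter

def tag_distribution(tag_stream):
    """Relative frequency of each tag in a stream of tag assignments."""
    counts = Counter(tag_stream)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def distribution_distance(p, q):
    """L1 distance between two tag distributions (0 = identical)."""
    tags = set(p) | set(q)
    return sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tags)

def is_stable(stream, window, eps=0.05):
    """Deem a stream semantically stable if every pair of consecutive
    growing-prefix snapshots differs by less than eps."""
    snapshots = [tag_distribution(stream[:i])
                 for i in range(window, len(stream) + 1, window)]
    return all(distribution_distance(a, b) < eps
               for a, b in zip(snapshots, snapshots[1:]))
```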

    Sketching for Large-Scale Learning of Mixture Models

    Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a "compressive learning" framework where we estimate model parameters from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass on the training set, and is easily computable on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims at drastically reducing the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics on the choice of the sketching procedure and theoretical guarantees of reconstruction. We experimentally show on synthetic data that the proposed algorithm yields results comparable to the classical Expectation-Maximization (EM) technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over 10^8 training samples) for the task of model-based speaker verification. Finally, we draw some connections between the proposed framework and approximate Hilbert space embedding of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive information preservation guarantees, in the spirit of infinite-dimensional compressive sensing.
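    The single-pass sketch of generalized moments can be sketched as follows, assuming (as is common for random-feature sketches) that the moments are samples of the empirical characteristic function at random frequency vectors; the paper's actual heuristics for choosing those frequencies are not reproduced here:

```python
import numpy as np

def compute_sketch(data_stream, omegas):
    """One-pass sketch: the empirical characteristic function of the data,
    sampled at m frequency vectors `omegas` (shape m x d). Each entry is a
    generalized moment E[exp(i w.x)], accumulated incrementally, so the
    sketch works on streams and is trivially mergeable across machines."""
    sketch = np.zeros(len(omegas), dtype=complex)
    n = 0
    for x in data_stream:                  # single pass over the data
        sketch += np.exp(1j * (omegas @ x))
        n += 1
    return sketch / n
```

    Parameter estimation then fits a GMM whose (analytically known) characteristic function best matches this sketch, via the sparse-reconstruction-style iterative algorithm the abstract mentions.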

    Universal expressions of population change by the Price equation: natural selection, information, and maximum entropy production

    The Price equation shows the unity between the fundamental expressions of change in biology, in information and entropy descriptions of populations, and in aspects of thermodynamics. The Price equation partitions the change in the average value of a metric between two populations. A population may be composed of organisms or particles or any members of a set to which we can assign probabilities. A metric may be biological fitness or physical energy or the output of an arbitrarily complicated function that assigns quantitative values to members of the population. The first part of the Price equation describes how directly applied forces change the probabilities assigned to members of the population when holding constant the metrical values of the members---a fixed metrical frame of reference. The second part describes how the metrical values change, altering the metrical frame of reference. In canonical examples, the direct forces balance the changing metrical frame of reference, leaving the average or total metrical values unchanged. In biology, relative reproductive success (fitness) remains invariant as a simple consequence of the conservation of total probability. In physics, systems often conserve total energy. Nonconservative metrics can be described by starting with conserved metrics, and then studying how coordinate transformations between conserved and nonconserved metrics alter the geometry of the dynamics and the aggregate values of populations. From this abstract perspective, key results from different subjects appear more simply as universal geometric principles for the dynamics of populations subject to the constraints of particular conserved quantities.
    Comment: v2: Complete rewrite, new title and abstract. Changed focus to Price equation as basis for universal expression of changes in populations. v3: Cleaned up usage of terms virtual and reversible displacements and virtual work and usage of d'Alembert's principle. v4: minor editing and correction
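    The two-part partition described in the abstract matches the standard form of the Price equation. Writing $\bar{w}$ for mean fitness, $w_i$ for the fitness of member $i$, and $z_i$ for its metrical value (symbols assumed here, following the usual presentation rather than the paper's own notation):

```latex
\bar{w}\,\Delta\bar{z}
  \;=\;
  \underbrace{\operatorname{Cov}(w_i, z_i)}_{\substack{\text{direct forces acting on}\\ \text{probabilities, fixed frame}}}
  \;+\;
  \underbrace{\operatorname{E}\!\left(w_i\,\Delta z_i\right)}_{\substack{\text{change in the metrical}\\ \text{frame of reference}}}
```

    In the canonical conserved cases the abstract mentions, the two terms cancel, leaving $\Delta\bar{z} = 0$.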

    Predicting stock market movements using network science: An information theoretic approach

    A stock market is considered one of the most complex systems: it consists of many components whose prices move up and down without a clear pattern. The complex nature of a stock market challenges us to make a reliable prediction of its future movements. In this paper, we aim to build a new method to forecast the future movements of the Standard & Poor's 500 Index (S&P 500) by constructing time-series complex networks of S&P 500 underlying companies, connecting them with links whose weights are given by the mutual information of 60-minute price movements of pairs of companies over consecutive 5,340-minute price records. We show that the changes in the strength distributions of the networks provide important information on the market's future movements. We built several metrics using the strength distributions and network measurements such as centrality, and we combined the best two predictors by performing a linear combination. We found that the combined predictor and the changes in the S&P 500 show a quadratic relationship, which allows us to predict the amplitude of the one-step future change in the S&P 500. The result showed significant fluctuations in the S&P 500 Index when the combined predictor was high. In terms of making actual index predictions, we built ARIMA models. We found that adding the network measurements into the ARIMA models improves the model accuracy. These findings are useful for financial market policy makers, as an indicator based on which they can intervene in the markets before the markets make a drastic change, and for quantitative investors seeking to improve their forecasting models.
    Comment: 13 pages, 7 figures, 3 tables
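    The link weights described above are mutual information between discretized price-movement series. A minimal sketch for discrete (e.g. up/down-coded) movement sequences, with names assumed for illustration:

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete
    sequences, e.g. up/down-coded 60-minute price movements of two
    stocks; used here as the edge weight between the two companies."""
    n = len(x)
    pxy = Counter(zip(x, y))   # joint counts
    px = Counter(x)            # marginal counts
    py = Counter(y)
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        # p_ab * log( p_ab / (p_a * p_b) ), with marginals c_a/n, c_b/n
        mi += p_ab * np.log(p_ab * n * n / (px[a] * py[b]))
    return mi
```

    Summing these weights over a node's edges gives its strength; the strength distribution across the network is the quantity whose changes the paper tracks.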

    FSMJ: Feature Selection with Maximum Jensen-Shannon Divergence for Text Categorization

    In this paper, we present a new wrapper feature selection approach based on Jensen-Shannon (JS) divergence, termed feature selection with maximum JS-divergence (FSMJ), for text categorization. Unlike most existing feature selection approaches, the proposed FSMJ approach is based on real-valued features, which provide more information for discrimination than the binary-valued features used in conventional approaches. We show that FSMJ is a greedy approach and that the JS-divergence monotonically increases as more features are selected. We conduct several experiments on real-life data sets and compare FSMJ with state-of-the-art feature selection approaches for text categorization. The superior performance of the proposed FSMJ approach demonstrates its effectiveness and further indicates its wide potential for application in data mining.
    Comment: 8 pages, 6 figures, World Congress on Intelligent Control and Automation, 201
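    The selection criterion FSMJ maximizes is the Jensen-Shannon divergence; a minimal self-contained definition (the greedy wrapper loop itself is not reproduced here):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (bits) for aligned distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (bits): symmetric, bounded in [0, 1],
    the quantity FSMJ greedily maximizes over candidate feature sets."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

    Its boundedness and symmetry (unlike plain KL divergence) are what make it a well-behaved objective for greedy selection.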