2,291 research outputs found

    Determining Principal Component Cardinality through the Principle of Minimum Description Length

    PCA (Principal Component Analysis) and its variants are ubiquitous techniques for matrix dimension reduction and reduced-dimension latent-factor extraction. One significant challenge in using PCA is the choice of the number of principal components. The information-theoretic MDL (Minimum Description Length) principle gives objective compression-based criteria for model selection, but it is difficult to analytically apply its modern definition - NML (Normalized Maximum Likelihood) - to the problem of PCA. This work shows a general reduction of NML problems to lower-dimension problems. Applying this reduction, it bounds the NML of PCA in terms of the NML of linear regression, which are known. Comment: LOD 201

    MDL Convergence Speed for Bernoulli Sequences

    The Minimum Description Length principle for online sequence estimation/prediction in a proper learning setup is studied. If the underlying model class is discrete, then the total expected square loss is a particularly interesting performance measure: (a) this quantity is finitely bounded, implying convergence with probability one, and (b) it additionally specifies the convergence speed. For MDL, in general one can only have loss bounds which are finite but exponentially larger than those for Bayes mixtures. We show that this is even the case if the model class contains only Bernoulli distributions. We derive a new upper bound on the prediction error for countable Bernoulli classes. This implies a small bound (comparable to the one for Bayes mixtures) for certain important model classes. We discuss the application to Machine Learning tasks such as classification and hypothesis testing, and generalization to countable classes of i.i.d. models. Comment: 28 pages
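The contrast between MDL and Bayes-mixture prediction described in this abstract can be illustrated with a minimal sketch. It assumes a small finite Bernoulli class with prior weights (the abstract covers countable classes), and it assumes the MDL predictor is the single model maximizing prior weight times likelihood. The names `mdl_predict` and `bayes_predict` and the example class are hypothetical and are not taken from the paper.

```python
import math


def log_score(theta, weight, ones, zeros):
    """Log of (prior weight * Bernoulli likelihood); theta must lie strictly in (0, 1)."""
    return math.log(weight) + ones * math.log(theta) + zeros * math.log(1.0 - theta)


def mdl_predict(thetas, weights, bits):
    """Assumed MDL predictor: probability of a 1 under the single best-scoring model."""
    ones = sum(bits)
    zeros = len(bits) - ones
    scores = [log_score(t, w, ones, zeros) for t, w in zip(thetas, weights)]
    best = max(range(len(thetas)), key=scores.__getitem__)
    return thetas[best]


def bayes_predict(thetas, weights, bits):
    """Bayes mixture: posterior-weighted average of the models' probabilities of a 1."""
    ones = sum(bits)
    zeros = len(bits) - ones
    scores = [log_score(t, w, ones, zeros) for t, w in zip(thetas, weights)]
    top = max(scores)  # subtract the maximum before exponentiating, for numerical stability
    posterior = [math.exp(s - top) for s in scores]
    total = sum(posterior)
    return sum(p * t for p, t in zip(posterior, thetas)) / total


# Hypothetical model class and data: 30 ones and 10 zeros.
thetas = [0.25, 0.5, 0.75]
weights = [0.25, 0.5, 0.25]
bits = [1] * 30 + [0] * 10

mdl_p = mdl_predict(thetas, weights, bits)
bayes_p = bayes_predict(thetas, weights, bits)
print("MDL prediction:", mdl_p)
print("Bayes mixture prediction:", bayes_p)
```

The MDL predictor commits to one model, while the mixture hedges across all of them; the loss bounds discussed in the abstract concern how this difference accumulates over a sequence.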

    Indefinitely Oscillating Martingales

    We construct a class of nonnegative martingale processes that oscillate indefinitely with high probability. For these processes, we state a uniform rate of the number of oscillations and show that this rate is asymptotically close to the theoretical upper bound. These bounds on probability and expectation of the number of upcrossings are compared to classical bounds from the martingale literature. We discuss two applications. First, our results imply that the limit of the minimum description length operator may not exist. Second, we give bounds on how often one can change one's belief in a given hypothesis when observing a stream of data. Comment: ALT 2014, extended technical report

    Associations of tissue transglutaminase antibody seropositivity with coronary heart disease: Findings from a prospective cohort study.

    Clinical experience and observational studies suggest that individuals with coeliac disease are at increased risk of coronary heart disease (CHD), but the precise mechanism for this is unclear. Laboratory studies suggest that it may relate to tissue transglutaminase antibodies (tTGAs). Our aim was to examine whether seropositivity for tTGAs and endomysial antibodies (EMAs) is associated with incident CHD in humans. We used data from the Mini-Finland Health Survey, a prospective cohort study of Finnish men and women aged 35-80 at study baseline in 1978-80. tTGA and EMA seropositivities were ascertained from baseline blood samples, and incident CHD events were identified from national hospitalisation and death registers. Cox regression was used to examine the associations between antibody seropositivity and incident CHD. Of 6887 men and women, 562 were seropositive for tTGAs and 72 for EMAs. During a median follow-up of 26 years, 2367 individuals experienced a CHD event. We found no clear evidence for an association between tTGA positivity (hazard ratio, HR: 1.04, 95% confidence interval, CI: 0.83, 1.30) or EMA positivity (HR: 1.16, 95% CI: 0.77, 1.74) and incident CHD, once pre-existing CVD and known CHD risk factors had been adjusted for. We found no clear evidence for an association of tTGA or EMA seropositivity with incident CHD outcomes, suggesting that tTG autoimmunity is unlikely to be the biological link between coeliac disease and CHD.

    Mutual Information of Population Codes and Distance Measures in Probability Space

    We studied the mutual information between a stimulus and a large system consisting of stochastic, statistically independent elements that respond to a stimulus. The Mutual Information (MI) of the system saturates exponentially with system size. A theory of the rate of saturation of the MI is developed. We show that this rate is controlled by a distance function between the response probabilities induced by different stimuli. This function, which we term the Confusion Distance between two probabilities, is related to the Rényi α-Information. Comment: 11 pages, 3 figures, accepted to PR

    Challenges of Religious Literacy in Education : Islam and the Governance of Religious Diversity in Multi-faith Schools

    This chapter seeks to take part in an emerging line of research in which religion is approached as a whole-school endeavor. Previous research and policy recommendations have typically focused on teaching about religion in school, but the accommodation of religious diversity in the wider school culture merits more attention. Based on observations in our multiple case studies, we discuss the multi-level governance of religious diversity in Finnish multi-faith schools with a particular focus on the challenges of religious literacy for educators. The three examples we present focus on the inclusion of Muslims in Finnish schools and in particular on the challenges for educators (1) in interpreting the distinction between religion and culture, (2) in recognizing and handling intra-religious diversity, and (3) in being aware of Protestant conceptions of religion and culture. A theme cutting across these examples is how they reflect the tendencies either to see different situations merely through the lens of religion (religionisation), or not to recognize the importance of religion at all (religion-blindness). We argue that religious literacy should be recognized and developed as a vital part of the intercultural competencies of educators. Peer reviewed

    Nonparametric Hierarchical Clustering of Functional Data

    In this paper, we deal with the problem of curve clustering. We propose a nonparametric method which partitions the curves into clusters and discretizes the dimensions of the curve points into intervals. The cross-product of these partitions forms a data-grid which is obtained using a Bayesian model selection approach while making no assumptions regarding the curves. Finally, a post-processing technique, aiming at reducing the number of clusters in order to improve the interpretability of the clustering, is proposed. It consists of optimally merging the clusters step by step, which corresponds to an agglomerative hierarchical classification whose dissimilarity measure is the variation of the criterion. Interestingly, this measure is none other than the sum of the Kullback-Leibler divergences between cluster distributions before and after the merges. The practical interest of the approach for exploratory analysis of functional data is presented and compared with an alternative approach on an artificial and a real-world data set.

    Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies

    Existing sequence alignment algorithms use heuristic scoring schemes which cannot be used as objective distance metrics. Therefore one relies on measures like the p- or log-det distances, or makes explicit, and often simplistic, assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI), which is, in principle, an objective and model-independent similarity measure. MI can be estimated by concatenating and zipping sequences, thereby yielding the "normalized compression distance". So far this has produced promising results, but with uncontrolled errors. We describe a simple approach to get robust estimates of MI from global pairwise alignments. Using standard alignment algorithms, this gives estimates for animal mitochondrial DNA that are strikingly close to estimates obtained from the alignment-free methods mentioned above. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. Because it is not additive, normalized compression distance is not an optimal metric for phylogenetics, but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances. Even a simplified version based on single-letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments. Comment: 19 pages + 16 pages of supplementary material
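The "concatenating and zipping" estimate mentioned in this abstract can be sketched as follows. This is an illustration under assumptions: it uses the commonly cited form of the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with zlib standing in for the compressor C. It is not the modified, additive measure the authors propose, and the synthetic sequences are made up for the example.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a crude proxy for Kolmogorov complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance in its commonly cited form (an assumption here)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic data for illustration only: a random DNA-like sequence,
# a copy with up to 20 point substitutions, and an unrelated random sequence.
rng = random.Random(0)
seq_a = bytes(rng.choice(b"ACGT") for _ in range(2000))
mutated = bytearray(seq_a)
for pos in rng.sample(range(len(mutated)), 20):
    mutated[pos] = rng.choice(b"ACGT")
seq_b = bytes(mutated)
seq_c = bytes(rng.choice(b"ACGT") for _ in range(2000))

print("NCD(a, mutated copy of a):", ncd(seq_a, seq_b))
print("NCD(a, unrelated sequence):", ncd(seq_a, seq_c))
```

A closely related pair should score noticeably lower than an unrelated pair. The exact values depend on the compressor and its overhead, which is one source of the uncontrolled errors the abstract mentions.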
