Determining Principal Component Cardinality through the Principle of Minimum Description Length
PCA (Principal Component Analysis) and its variants are ubiquitous techniques
for matrix dimension reduction and reduced-dimension latent-factor extraction.
One significant challenge in using PCA is the choice of the number of principal
components. The information-theoretic MDL (Minimum Description Length) principle
gives objective compression-based criteria for model selection, but it is
difficult to analytically apply its modern definition - NML (Normalized Maximum
Likelihood) - to the problem of PCA. This work shows a general reduction of NML
problems to lower-dimension problems. Applying this reduction, it bounds the
NML of PCA by terms of the NML of linear regression, which are known.
Comment: LOD 201
MDL Convergence Speed for Bernoulli Sequences
The Minimum Description Length principle for online sequence
estimation/prediction in a proper learning setup is studied. If the underlying
model class is discrete, then the total expected square loss is a particularly
interesting performance measure: (a) this quantity is finitely bounded,
implying convergence with probability one, and (b) it additionally specifies
the convergence speed. For MDL, in general one can only have loss bounds which
are finite but exponentially larger than those for Bayes mixtures. We show that
this is even the case if the model class contains only Bernoulli distributions.
We derive a new upper bound on the prediction error for countable Bernoulli
classes. This implies a small bound (comparable to the one for Bayes mixtures)
for certain important model classes. We discuss the application to Machine
Learning tasks such as classification and hypothesis testing, and
generalization to countable classes of i.i.d. models.
Comment: 28 page
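The setting in the abstract above can be illustrated with a minimal sketch of a two-part MDL predictor over a small discrete Bernoulli class. The parameter grid `thetas`, the prior `weights`, and both function names are assumptions made for illustration; they are not taken from the paper, and the loss is computed for one observed sequence rather than in expectation.

```python
import math

def mdl_select(bits, thetas, weights):
    """Index of the model minimising the two-part code length
    -log2 w(theta) - log2 p(bits | theta). Assumes 0 < theta < 1."""
    ones = sum(bits)
    zeros = len(bits) - ones
    best_index, best_length = 0, math.inf
    for i, (theta, w) in enumerate(zip(thetas, weights)):
        code_length = (-math.log2(w)
                       - ones * math.log2(theta)
                       - zeros * math.log2(1.0 - theta))
        if code_length < best_length:  # ties keep the earlier model
            best_index, best_length = i, code_length
    return best_index

def cumulative_square_loss(bits, true_theta, thetas, weights):
    """Sum over time of (MDL prediction - true parameter)^2, where the
    prediction at step t uses only the first t observations."""
    loss = 0.0
    for t in range(len(bits)):
        chosen = thetas[mdl_select(bits[:t], thetas, weights)]
        loss += (chosen - true_theta) ** 2
    return loss
```

For example, with `thetas = [0.25, 0.5, 0.75]` and uniform weights, a run of twenty ones makes `mdl_select` choose the model with parameter 0.75.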
Indefinitely Oscillating Martingales
We construct a class of nonnegative martingale processes that oscillate
indefinitely with high probability. For these processes, we state a uniform
rate of the number of oscillations and show that this rate is asymptotically
close to the theoretical upper bound. These bounds on probability and
expectation of the number of upcrossings are compared to classical bounds from
the martingale literature. We discuss two applications. First, our results
imply that the limit of the minimum description length operator may not exist.
Second, we give bounds on how often one can change one's belief in a given
hypothesis when observing a stream of data.
Comment: ALT 2014, extended technical repor
Associations of tissue transglutaminase antibody seropositivity with coronary heart disease: Findings from a prospective cohort study.
Clinical experience and observational studies suggest that individuals with coeliac disease are at increased risk of coronary heart disease (CHD), but the precise mechanism for this is unclear. Laboratory studies suggest that it may relate to tissue transglutaminase antibodies (tTGAs). Our aim was to examine whether seropositivity for tTGAs and endomysial antibodies (EMAs) is associated with incident CHD in humans.
We used data from the Mini-Finland Health Survey, a prospective cohort study of Finnish men and women aged 35-80 at study baseline 1978-80. tTGA and EMA seropositivities were ascertained from baseline blood samples and incident CHD events were identified from national hospitalisation and death registers. Cox regression was used to examine the associations between antibody seropositivity and incident CHD. Of 6887 men and women, 562 were seropositive for tTGAs and 72 for EMAs. During a median follow-up of 26 years, 2367 individuals experienced a CHD event. We found no clear evidence for an association between tTGA positivity (hazard ratio, HR: 1.04, 95% confidence interval, CI: 0.83, 1.30) or EMA positivity (HR: 1.16, 95% CI: 0.77, 1.74) and incident CHD, once pre-existing CVD and known CHD risk factors had been adjusted for.
We found no clear evidence for an association of tTGA or EMA seropositivity with incident CHD outcomes, suggesting that tTG autoimmunity is unlikely to be the biological link between coeliac disease and CHD.
Mutual Information of Population Codes and Distance Measures in Probability Space
We studied the mutual information between a stimulus and a large system
consisting of stochastic, statistically independent elements that respond to a
stimulus. The Mutual Information (MI) of the system saturates exponentially
with system size. A theory of the rate of saturation of the MI is developed. We
show that this rate is controlled by a distance function between the response
probabilities induced by different stimuli. This function, which we term the
"Confusion Distance" between two probabilities, is related to the Renyi
-Information.
Comment: 11 pages, 3 figures, accepted to PR
Challenges of Religious Literacy in Education : Islam and the Governance of Religious Diversity in Multi-faith Schools
This chapter seeks to take part in an emerging line of research in which religion is approached as a whole-school endeavor. Previous research and policy recommendations have typically focused on teaching about religion in school, but the accommodation of religious diversity in the wider school culture merits more attention. Based on observations in our multiple case studies, we discuss the multi-level governance of religious diversity in Finnish multi-faith schools with a particular focus on the challenges of religious literacy for educators. The three examples we present focus on the inclusion of Muslims in Finnish schools and in particular on the challenges for educators (1) in interpreting the distinction between religion and culture, (2) in recognizing and handling intra-religious diversity, and (3) in being aware of Protestant conceptions of religion and culture. A theme cutting across these examples is how they reflect the tendencies either to see different situations merely through the lens of religion (religionisation), or not to recognize the importance of religion at all (religion-blindness). We argue that religious literacy should be recognized and developed as a vital part of the intercultural competencies of educators.
Peer reviewe
Nonparametric Hierarchical Clustering of Functional Data
In this paper, we deal with the problem of curve clustering. We propose a
nonparametric method which partitions the curves into clusters and discretizes
the dimensions of the curve points into intervals. The cross-product of these
partitions forms a data-grid which is obtained using a Bayesian model selection
approach while making no assumptions regarding the curves. Finally, a
post-processing technique, aiming at reducing the number of clusters in order
to improve the interpretability of the clustering, is proposed. It consists in
optimally merging the clusters step by step, which corresponds to an
agglomerative hierarchical classification whose dissimilarity measure is the
variation of the criterion. Interestingly this measure is none other than the
sum of the Kullback-Leibler divergences between cluster distributions before
and after the merges. The practical interest of the approach for functional
data exploratory analysis is presented and compared with an alternative
approach on an artificial and a real world data set.
Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies
Existing sequence alignment algorithms use heuristic scoring schemes which
cannot be used as objective distance metrics. Therefore one relies on measures
like the p- or log-det distances, or makes explicit, and often simplistic,
assumptions about sequence evolution. Information theory provides an
alternative, in the form of mutual information (MI) which is, in principle, an
objective and model independent similarity measure. MI can be estimated by
concatenating and zipping sequences, yielding thereby the "normalized
compression distance". So far this has produced promising results, but with
uncontrolled errors. We describe a simple approach to get robust estimates of
MI from global pairwise alignments. Using standard alignment algorithms, this
gives for animal mitochondrial DNA estimates that are strikingly close to
estimates obtained from the alignment free methods mentioned above. Our main
result uses algorithmic (Kolmogorov) information theory, but we show that
similar results can also be obtained from Shannon theory. Due to the fact that
it is not additive, normalized compression distance is not an optimal metric
for phylogenetics, but we propose a simple modification that overcomes the
issue of additivity. We test several versions of our MI based distance measures
on a large number of randomly chosen quartets and demonstrate that they all
perform better than traditional measures like the Kimura or log-det (resp.
paralinear) distances. Even a simplified version based on single letter Shannon
entropies, which can be easily incorporated in existing software packages, gave
superior results throughout the entire animal kingdom. But we see the main
virtue of our approach in a more general way. For example, it can also help to
judge the relative merits of different alignment algorithms, by estimating the
significance of specific alignments.
Comment: 19 pages + 16 pages of supplementary materia
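The "concatenating and zipping" estimate mentioned in the abstract above can be sketched as follows. This uses the commonly cited form of the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed length. The choice of `zlib` as the compressor and the function names are assumptions for illustration; the paper's own estimators and its proposed modification are not reproduced here.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (a stand-in for C)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte sequences.
    Values near 0 suggest high shared information; values near 1
    suggest the sequences compress almost independently."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A real compressor only approximates the ideal code length, so the values are rough estimates, and zlib's limited match window makes them less reliable for long sequences.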