8,166 research outputs found
Infinite factorization of multiple non-parametric views
Combined analysis of multiple data sources is of increasing applied interest, in particular for distinguishing shared and source-specific aspects. We extend this rationale of classical canonical correlation analysis into a flexible, generative and non-parametric clustering setting by introducing a novel non-parametric hierarchical mixture model. The lower level of the model describes each source with a flexible non-parametric mixture, and the top level combines these to describe commonalities of the sources. The lower-level clusters arise from hierarchical Dirichlet Processes, inducing an infinite-dimensional contingency table between the views. The commonalities between the sources are modeled by an infinite block model of the contingency table, interpretable as non-negative factorization of infinite matrices, or as a prior for infinite contingency tables. With Gaussian mixture components plugged in for continuous measurements, the model is applied to two views of genes, mRNA expression and abundance of the produced proteins, to expose groups of genes that are co-regulated in either or both of the views.
Cluster analysis of co-expression is a standard, simple way of screening for co-regulation, and the two-view analysis extends the approach to distinguishing between pre- and post-translational regulation.
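The contingency table between views described in this abstract can be illustrated on a finite toy example. The sketch below is not the authors' model: it only counts how often a cluster in one view co-occurs with a cluster in the other, assuming each gene already has a cluster label in both views. The function name and the labels are hypothetical.

```python
from collections import Counter


def contingency_table(labels_a, labels_b):
    """Count co-occurrences of (cluster in view A, cluster in view B)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both views must label the same objects")
    return Counter(zip(labels_a, labels_b))


# Hypothetical cluster labels for six genes in two views.
mrna_clusters = [0, 0, 1, 1, 2, 2]
protein_clusters = [0, 0, 0, 1, 1, 1]

table = contingency_table(mrna_clusters, protein_clusters)
for (a, b), count in sorted(table.items()):
    print(f"mRNA cluster {a}, protein cluster {b}: {count} genes")
```

In the abstract, the block model is placed on a table of this kind, with the number of rows and columns left unbounded.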
The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015
In this paper we retrace the recent history of statistics by analyzing all
the papers published in five prestigious statistical journals since 1970,
namely: Annals of Statistics, Biometrika, Journal of the American Statistical
Association, Journal of the Royal Statistical Society, Series B and Statistical
Science. The aim is to construct a kind of "taxonomy" of the statistical papers
by organizing and clustering them into main themes. In this sense, being
identified in a cluster means being important enough to be uncluttered in the
vast and interconnected world of statistical research. Since the main
statistical research topics are naturally born, evolve or die over time, we
also develop a dynamic clustering strategy, where a group in a time period is
allowed to migrate or to merge into different groups in the following one.
Results show that statistics is a very dynamic and evolving science, stimulated
by the rise of new research questions and types of data.
Scalable Bayesian nonparametric measures for exploring pairwise dependence via Dirichlet Process Mixtures
In this article we propose novel Bayesian nonparametric methods using
Dirichlet Process Mixture (DPM) models for detecting pairwise dependence
between random variables while accounting for uncertainty in the form of the
underlying distributions. A key criterion is that the procedures should scale to
large data sets. In this regard we find that the formal calculation of the
Bayes factor for a dependent-vs.-independent DPM joint probability measure is
not feasible computationally. To address this we present Bayesian diagnostic
measures for characterising evidence against a "null model" of pairwise
independence. In simulation studies, as well as for a real data analysis, we
show that our approach provides a useful tool for the exploratory nonparametric
Bayesian analysis of large multivariate data sets.
Estimating parameters of a multipartite loglinear graph model via the EM algorithm
We will amalgamate the Rasch model (for rectangular binary tables) and the
newly introduced α-β models (for random undirected graphs) in the
framework of a semiparametric probabilistic graph model. Our purpose is to give
a partition of the vertices of an observed graph so that the generated
subgraphs and bipartite graphs obey these models, where their strongly
connected parameters give multiscale evaluation of the vertices at the same
time. In this way, a heterogeneous version of the stochastic block model is
built via mixtures of loglinear models and the parameters are estimated with a
special EM iteration. In the context of social networks, the clusters can be
identified with social groups and the parameters with attitudes of people of
one group towards people of the other, where the attitudes depend on the cluster
memberships. The algorithm is applied to randomly generated and real-world data.
Statistical Methods For Detecting Genetic Risk Factors of a Disease with Applications to Genome-Wide Association Studies
This thesis aims to develop various statistical methods for analysing the data derived from genome wide association studies (GWAS).
A GWAS typically involves genotyping individual human genetic variation, using high-throughput genome-wide single nucleotide polymorphism (SNP) arrays, in thousands of individuals and testing for association between those variants and a given disease under the common disease/common variant assumption.
Although GWAS have identified many potential genetic factors in the genome that affect the risk of complex diseases, much of the genetic heritability remains unexplained. The power to detect new genetic risk variants can be improved by considering multiple genetic variants simultaneously with novel statistical methods.
Improving the analysis of the GWAS data has received much attention from statisticians and other scientific researchers over the past decade.
There are several challenges arising in analysing the GWAS data. First, determining the risk SNPs might be difficult due to non-random correlation between SNPs that can inflate type I and II errors in statistical inference. When a group of SNPs is considered together in the context of haplotypes/genotypes, the distribution of the haplotypes/genotypes is sparse, which makes it difficult to detect risk haplotypes/genotypes in terms of disease penetrance.
In this work, we proposed four new methods to identify risk haplotypes/genotypes based on their frequency differences between cases and controls. To evaluate the performance of our methods, we simulated datasets under a wide range of scenarios according to both retrospective and prospective designs.
In the first method, we first reconstruct haplotypes from unphased genotypes, then cluster and threshold the inferred haplotypes into risk and non-risk groups with a two-component binomial-mixture model. In this method, the parameters were estimated with a modified Expectation-Maximization algorithm, in which the maximisation step was replaced by posterior sampling of the component parameters. We also elucidated the relationships between risk and non-risk haplotypes under different modes of inheritance and genotypic relative risk.
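To make the two-component binomial-mixture idea concrete, here is a minimal sketch of the standard EM algorithm for such a mixture. It is not the thesis's method: the thesis replaces the maximisation step with posterior sampling, whereas this sketch keeps the usual closed-form M-step. The data layout (k[i] case copies out of n[i] total copies of haplotype i), the function name, the starting values and the toy numbers are all assumptions made for illustration.

```python
import math


def em_binomial_mixture(k, n, p_low=0.3, p_high=0.7, weight=0.5, iters=200):
    """Standard EM for a two-component binomial mixture.

    k[i] is the number of successes out of n[i] trials. Returns
    (p_low, p_high, weight, resp), where resp[i] is the posterior
    probability that observation i belongs to the high-rate component.
    """
    eps = 1e-9

    def clamp(x):
        # Keep probabilities strictly inside (0, 1) so the logs stay finite.
        return min(max(x, eps), 1.0 - eps)

    resp = [0.5] * len(k)
    for _ in range(iters):
        # E-step: posterior responsibility of the high-rate component.
        # The binomial coefficient cancels, so it is omitted.
        for i, (ki, ni) in enumerate(zip(k, n)):
            log_hi = (math.log(weight) + ki * math.log(p_high)
                      + (ni - ki) * math.log(1.0 - p_high))
            log_lo = (math.log(1.0 - weight) + ki * math.log(p_low)
                      + (ni - ki) * math.log(1.0 - p_low))
            m = max(log_hi, log_lo)
            hi, lo = math.exp(log_hi - m), math.exp(log_lo - m)
            resp[i] = hi / (hi + lo)

        # M-step: closed-form weighted updates (the thesis samples instead).
        weight = clamp(sum(resp) / len(k))
        den_hi = sum(r * ni for r, ni in zip(resp, n))
        den_lo = sum((1.0 - r) * ni for r, ni in zip(resp, n))
        if den_hi > 0:
            p_high = clamp(sum(r * ki for r, ki in zip(resp, k)) / den_hi)
        if den_lo > 0:
            p_low = clamp(sum((1.0 - r) * ki for r, ki in zip(resp, k)) / den_lo)
    return p_low, p_high, weight, resp


# Hypothetical counts: seven haplotypes, each observed 20 times in total.
case_counts = [2, 3, 1, 2, 18, 17, 19]
totals = [20] * 7
p_low, p_high, weight, resp = em_binomial_mixture(case_counts, totals)
risk_flags = [r > 0.5 for r in resp]  # threshold into risk / non-risk
```

Thresholding the responsibilities at 0.5 is one simple way to split observations into two groups; the abstract does not state the threshold used in the thesis.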
In the second method, we fitted a three-component mixture model to genotype data directly, followed by an odds-ratio thresholding.
In the third method, we combined the existing haplotype reconstruction software PHASE and permutation method to infer risk haplotypes.
In the fourth method, we proposed a new way to score the genotypes by clustering and combined it with a logistic regression approach to infer risk haplotypes.
The simulation studies showed that the first three methods outperformed the multiple testing method of Zhu (2010) in terms of average specificity and sensitivity (AVSS) in all scenarios considered. The logistic regression methods also outperformed the standard logistic regression method.
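The AVSS criterion named above can be computed as in the following sketch, which assumes AVSS is the arithmetic mean of sensitivity and specificity; the abstract names the measure but does not give its formula. The function name and the toy labels are hypothetical.

```python
def avss(true_risk, predicted_risk):
    """Average of sensitivity and specificity for boolean risk labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_risk, predicted_risk):
        if truth and pred:
            tp += 1
        elif truth and not pred:
            fn += 1
        elif not truth and not pred:
            tn += 1
        else:
            fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true risk and one true non-risk item")
    sensitivity = tp / (tp + fn)  # share of true risk items that were detected
    specificity = tn / (tn + fp)  # share of non-risk items correctly left out
    return (sensitivity + specificity) / 2


# Hypothetical example: five haplotypes, two of them truly risk haplotypes.
truth = [True, True, False, False, False]
predicted = [True, False, False, False, True]
score = avss(truth, predicted)  # sensitivity 1/2, specificity 2/3
```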
We applied our methods to two GWAS datasets on coronary artery disease (CAD) and hypertension (HT), detecting several new risk haplotypes and recovering a number of the existing disease-associated genetic variants in the literature.