Spatial clustering of array CGH features in combination with hierarchical multiple testing
We propose a new approach for clustering DNA features using array CGH data
from multiple tumor samples. We distinguish data-collapsing: joining contiguous
DNA clones or probes with extremely similar data into regions, from clustering:
joining contiguous, correlated regions based on a maximum likelihood principle.
The model-based clustering algorithm accounts for the apparent spatial patterns
in the data. We evaluate the randomness of the clustering result by a cluster
stability score in combination with cross-validation. Moreover, we argue that
the clustering really captures spatial genomic dependency by showing that
coincidental clustering of independent regions is very unlikely. Using the
region and cluster information, we test both for association
with a clinical variable in a hierarchical multiple testing approach. This
allows for interpreting the significance of both regions and clusters while
controlling the Family-Wise Error Rate simultaneously. We prove that in the
context of permutation tests and permutation-invariant clusters, it is valid
to perform clustering and testing on the same data set. Our procedures are
illustrated on two cancer data sets.
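The abstract above does not spell out its hierarchical testing procedure, so the following is only a minimal sketch of a simpler, related idea: single-step max-statistic permutation p-values, which control the Family-Wise Error Rate across several features. The function names, the absolute mean-difference statistic, and the binary clinical variable are all assumptions made for illustration and are not taken from the paper.

```python
import random


def mean_diff(values, labels):
    """Absolute difference in group means for a binary (0/1) label vector."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


def max_t_adjusted_pvalues(features, labels, n_perm=2000, seed=0):
    """Single-step max-statistic permutation p-values (hypothetical helper).

    features: list of per-feature lists, one value per sample.
    labels:   binary clinical variable, one entry per sample.
    Each feature is compared against the permutation distribution of the
    maximum statistic over all features, which is what gives FWER control.
    """
    rng = random.Random(seed)
    observed = [mean_diff(f, labels) for f in features]
    exceed = [0] * len(features)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)  # permute the clinical variable across samples
        max_stat = max(mean_diff(f, perm) for f in features)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # The +1 terms count the observed labelling as one permutation.
    return [(c + 1) / (n_perm + 1) for c in exceed]
```

A feature whose adjusted p-value falls below the chosen level would be declared significant. A hierarchical procedure such as the one described in the abstract would additionally test clusters before descending to their regions, which this sketch does not attempt.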
Fuzzy Jets
Collimated streams of particles produced in high energy physics experiments
are organized using clustering algorithms to form jets. To construct jets, the
experimental collaborations based at the Large Hadron Collider (LHC) primarily
use agglomerative hierarchical clustering schemes known as sequential
recombination. We propose a new class of algorithms for clustering jets that
use infrared and collinear safe mixture models. These new algorithms, known as
fuzzy jets, are clustered using maximum likelihood techniques and can
dynamically determine various properties of jets like their size. We show that
the fuzzy jet size adds information to conventional jet tagging
variables. Furthermore, we study the impact of pileup and show that with some
slight modifications to the algorithm, fuzzy jets can be stable up to high
pileup interaction multiplicities.
Model-based clustering via linear cluster-weighted models
A novel family of twelve mixture models with random covariates, nested in the
linear t cluster-weighted model (CWM), is introduced for model-based
clustering. The linear t CWM was recently presented as a robust alternative
to the better-known linear Gaussian CWM. The proposed family of models provides
a unified framework that also includes the linear Gaussian CWM as a special
case. Maximum likelihood parameter estimation is carried out within the EM
framework, and both the BIC and the ICL are used for model selection. A simple
and effective hierarchical random initialization is also proposed for the EM
algorithm. The novel model-based clustering technique is illustrated in some
applications to real data. Finally, a simulation study for evaluating the
performance of the BIC and the ICL is presented.
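To illustrate the general pattern of EM estimation followed by BIC-based model selection mentioned above, here is a minimal sketch for a plain one-dimensional Gaussian mixture. It is not a cluster-weighted model and does not reproduce the paper's hierarchical random initialization: the quantile-based starting values, the variance floor, and the function names are assumptions chosen to keep the example short. The BIC convention used here is -2 log L + p log n, so lower values are better.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_density(x, weights, means, variances):
    """Mixture density at x, floored to avoid log(0) and division by zero."""
    dens = sum(w * normal_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))
    return max(dens, 1e-300)


def fit_gmm_1d(data, k, n_iter=200):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances, loglik)."""
    n = len(data)
    ordered = sorted(data)
    # Assumed initialization: evenly spaced sample quantiles as starting means.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [max(grand_var, 1e-3)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            total = mixture_density(x, weights, means, variances)
            resp.append([w * normal_pdf(x, m, v) / total
                         for w, m, v in zip(weights, means, variances)])
        # M-step: weighted updates of mixing proportions, means, variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-3)  # floor guards against collapse
    loglik = sum(math.log(mixture_density(x, weights, means, variances))
                 for x in data)
    return weights, means, variances, loglik


def bic(loglik, n_params, n):
    """BIC in the 'lower is better' convention."""
    return -2.0 * loglik + n_params * math.log(n)
```

A 1-D mixture with k components and free variances has 3k - 1 free parameters, so candidate models can be compared with `bic(fit_gmm_1d(data, k)[3], 3 * k - 1, len(data))` for each k, keeping the k with the smallest value.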
Hierarchical maximum likelihood clustering approach
Objective: In this work, we focused on developing a clustering approach for biological data. In many biological analyses, such as multi-omics data analysis and genome-wide association study (GWAS) analysis, it is crucial to find groups of data belonging to subtypes of diseases or tumors.
Methods: Conventionally, the k-means clustering algorithm is overwhelmingly applied in many areas, including the biological sciences. There are, however, several alternative clustering algorithms that can be applied, including support vector clustering. In this paper, taking into consideration the nature of biological data, we propose a maximum likelihood clustering scheme based on a hierarchical framework.
Results: This method can perform clustering even when the data belonging to different groups overlap. It can also perform clustering when the number of samples is lower than the data dimensionality.
Conclusion: The proposed scheme does not require selecting initial settings to begin the search process. In addition, it does not require computing the first and second derivatives of the likelihood function, as many other maximum-likelihood-based methods do.
Significance: This algorithm uses distribution and centroid information to cluster a sample and was applied to biological data. A Matlab implementation of this method can be downloaded from
http://www.riken.jp/en/research/labs/ims/med_sci_math/