Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly.
Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms (Springer, 2014). arXiv admin note: substantial text overlap with arXiv:1304.7465, arXiv:1209.196
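As a rough illustration of the kind of seeding the best performers use, here is a minimal Python sketch of a deterministic, order-invariant divisive initialization in the spirit of Su and Dy's Var-Part: repeatedly split the cluster with the largest within-cluster sum of squared errors along its highest-variance dimension at the mean. The function name and details are illustrative assumptions, not the published algorithm.

import numpy as np

def var_part_init(X, k):
    # Deterministic, order-invariant seeding sketch (Var-Part-like).
    clusters = [X]
    while len(clusters) < k:
        # Split the cluster with the largest within-cluster SSE.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        c = clusters.pop(int(np.argmax(sse)))
        # Cut at the mean of its highest-variance feature.
        d = int(np.argmax(c.var(axis=0)))
        mask = c[:, d] <= c[:, d].mean()
        clusters += [c[mask], c[~mask]]
    return np.array([c.mean(axis=0) for c in clusters])

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
centers = var_part_init(X, k=5)  # would seed a subsequent k-means run

Because the seeding involves no random choices and treats the data as a set, repeated runs are unnecessary, which is exactly the practical appeal the chapter highlights.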
Hierarchical information clustering by means of topologically embedded graphs
We introduce a graph-theoretic approach to extract clusters and hierarchies
in complex datasets in an unsupervised and deterministic manner, without the
use of any prior information. This is achieved by building topologically
embedded networks containing the subset of most significant links and analyzing
the network structure. For a planar embedding, this method provides both the
intra-cluster hierarchy, which describes the way clusters are composed, and the
inter-cluster hierarchy which describes how clusters gather together. We
discuss the performance, robustness, and reliability of this method by first
investigating several artificial datasets, finding that it can significantly
outperform other established approaches. Then we show that our method can
successfully differentiate meaningful clusters and hierarchies in a variety of
real datasets. In particular, we find that the application to gene expression
patterns of lymphoma samples uncovers biologically significant groups of genes
which play key roles in the diagnosis, prognosis, and treatment of some of the most
relevant human lymphoid malignancies.
Comment: 33 pages, 18 figures, 5 tables
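The planar (topologically embedded) graph construction itself is involved; as a simpler hedged illustration of extracting a hierarchy from a sparse graph of the most significant links, the Python sketch below uses the minimum spanning tree, which is a subgraph of the planar filtered graph, together with the single-linkage hierarchy it carries. This is an analogue of the idea, not the authors' algorithm.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))  # 30 items, 200 observations each

# Correlation distance between items; the minimum spanning tree over
# this matrix keeps only the most significant links.
corr = np.corrcoef(X)
dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
np.fill_diagonal(dist, 0.0)

# Single linkage is exactly the hierarchy carried by the spanning tree.
Z = linkage(squareform(dist, checks=False), method="single")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)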
Medoid-based clustering using ant colony optimization
The application of ACO-based algorithms in data mining has been growing over the last few years, and several supervised and unsupervised learning algorithms have been developed using this bio-inspired approach. Most recent work on unsupervised learning has focused on clustering, showing the potential of ACO-based techniques. However, there are still clustering areas that are almost unexplored using these techniques, such as medoid-based clustering. Medoid-based clustering methods are helpful, compared with classical centroid-based techniques, when centroids cannot be easily defined. This paper proposes two medoid-based ACO clustering algorithms in which the only information needed is the distances between data points: one that uses an ACO procedure to determine an optimal medoid set (the METACOC algorithm) and another that additionally selects the number of clusters automatically (the METACOC-K algorithm). The proposed algorithms are compared against classical clustering approaches using synthetic and real-world datasets.
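METACOC's search is ACO-based; as a minimal distance-only illustration of the medoid objective it optimizes, here is a naive alternating k-medoids sketch in Python (the ACO procedure and the automatic selection of k are not reproduced, and the function name is an assumption).

import numpy as np

def k_medoids(D, k, n_iter=100):
    # Naive k-medoids on a precomputed distance matrix D: the only
    # information needed is the pairwise distances between data points.
    medoids = np.arange(k)  # deterministic start: the first k points
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        # Each cluster's new medoid minimizes the within-cluster distances.
        new = np.array([
            members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
            for members in (np.flatnonzero(labels == j) for j in range(k))
        ])
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, labels

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
medoids, labels = k_medoids(D, k=3)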
Missing value imputation improves clustering and interpretation of gene expression microarray data
Background: Missing values frequently pose problems in gene expression microarray experiments as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods, since their performance seems to vary drastically depending on the dataset being used.
Results: We show that this discrepancy can mostly be attributed to the way in which imputation methods have traditionally been developed and evaluated. By comparing a number of advanced imputation methods on recent microarray datasets, we show that even when there are marked differences in the measurement-level imputation accuracies across the datasets, these differences become negligible when the methods are evaluated in terms of how well they can reproduce the original gene clusters or their biological interpretations. Regardless of the evaluation approach, however, imputation always gave better results than ignoring missing data points or replacing them with zeros or average values, emphasizing the continued importance of using more advanced imputation methods.
Conclusion: The results demonstrate that, while missing values still severely complicate microarray data analysis, their impact on the discovery of biologically meaningful gene groups can, up to a certain degree, be reduced by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
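BPCA itself is not part of common Python libraries; to illustrate the abstract's point that even a standard imputer beats zero or average filling on correlated expression-like data, here is a hedged sketch using scikit-learn's KNNImputer as a stand-in (the data and the stand-in method are assumptions, not the paper's setup).

import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
# Correlated, low-rank "expression" matrix plus noise.
true = (rng.normal(size=(200, 3)) @ rng.normal(size=(3, 20))
        + 0.1 * rng.normal(size=(200, 20)))

X = true.copy()
mask = rng.random(X.shape) < 0.05  # hide 5% of the entries
X[mask] = np.nan

imputed = KNNImputer(n_neighbors=10).fit_transform(X)
zero_filled = np.nan_to_num(X, nan=0.0)

# Root-mean-square error on the hidden entries only.
for name, est in [("KNN imputation", imputed), ("zero fill", zero_filled)]:
    print(name, np.sqrt(np.mean((est[mask] - true[mask]) ** 2)))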
Clustering-based approaches to SAGE data mining
Serial analysis of gene expression (SAGE) is one of the most powerful tools for global gene expression profiling. It has led to several biological discoveries and biomedical applications, such as the prediction of new gene functions and the identification of biomarkers in human cancer research. Clustering techniques have become fundamental approaches in these applications. This paper reviews relevant clustering techniques specifically designed for this type of data, with an emphasis on their current limitations and on opportunities for supporting biologically meaningful data mining and visualisation.
Measurement of the cross-section and charge asymmetry of W bosons produced in proton-proton collisions at √s = 8 TeV with the ATLAS detector
This paper presents measurements of the W⁺ → μ⁺ν and W⁻ → μ⁻ν̄ cross-sections
and the associated charge asymmetry as a
function of the absolute pseudorapidity of the decay muon. The data were
collected in proton-proton collisions at a centre-of-mass energy of 8 TeV with
the ATLAS experiment at the LHC and correspond to a total integrated luminosity
of 20.2 fb−1. The precision of the cross-section measurements
varies between 0.8% and 1.5% as a function of the pseudorapidity, excluding the
1.9% uncertainty on the integrated luminosity. The charge asymmetry is measured
with an uncertainty between 0.002 and 0.003. The results are compared with
predictions based on next-to-next-to-leading-order calculations with various
parton distribution functions and have the sensitivity to discriminate between
them.
Comment: 38 pages in total, author list starting page 22, 5 figures, 4 tables, submitted to EPJC. All figures including auxiliary figures are available at https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/STDM-2017-13
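For context, the muon charge asymmetry measured as a function of pseudorapidity is conventionally defined from the two differential cross-sections (standard definition, written in LaTeX; the paper's binning and correction details are not reproduced here):

A_\mu(\eta_\mu) = \frac{\mathrm{d}\sigma_{W^+}/\mathrm{d}\eta_\mu - \mathrm{d}\sigma_{W^-}/\mathrm{d}\eta_\mu}{\mathrm{d}\sigma_{W^+}/\mathrm{d}\eta_\mu + \mathrm{d}\sigma_{W^-}/\mathrm{d}\eta_\mu}

The quoted 0.002 to 0.003 uncertainty applies to this dimensionless ratio.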
Search for chargino-neutralino production with mass splittings near the electroweak scale in three-lepton final states in √s=13 TeV pp collisions with the ATLAS detector
A search for supersymmetry through the pair production of electroweakinos with mass splittings near the electroweak scale and decaying via on-shell W and Z bosons is presented for a three-lepton final state. The analyzed proton-proton collision data taken at a center-of-mass energy of √s=13 TeV were collected between 2015 and 2018 by the ATLAS experiment at the Large Hadron Collider, corresponding to an integrated luminosity of 139 fb−1. A search, emulating the recursive jigsaw reconstruction technique with easily reproducible laboratory-frame variables, is performed. The two excesses observed in the 2015–2016 data recursive jigsaw analysis in the low-mass three-lepton phase space are reproduced. Results with the full data set are in agreement with the Standard Model expectations. They are interpreted to set exclusion limits at the 95% confidence level on simplified models of chargino-neutralino pair production for masses up to 345 GeV.
New resampling method for evaluating stability of clusters
Background: Hierarchical clustering is a widely applied tool in the analysis of microarray gene expression data. The assessment of cluster stability is a major challenge in clustering procedures. Statistical methods are required to distinguish between real and random clusters. Several methods for assessing cluster stability have been published, including resampling methods such as the bootstrap.
We propose a new resampling method based on continuous weights to assess the stability of clusters in hierarchical clustering. While in bootstrapping approximately one third of the original items is lost, continuous weights avoid zero elements and instead allow non-integer diagonal elements, which leads to retention of the full dimensionality of the space, i.e. each variable of the original data set is represented in the resampling sample.
Results: Comparison of continuous weights and bootstrapping using real datasets and simulation studies reveals the advantage of continuous weights, especially when the dataset has only few observations, few differentially expressed genes, and the fold change of differentially expressed genes is low.
Conclusion: We recommend the use of continuous weights in small as well as in large datasets, because according to our results they produce at least the same results as conventional bootstrapping and in some cases surpass it.
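As a hedged numerical illustration of the contrast between bootstrap resampling and strictly positive continuous weights, the Python sketch below uses Dirichlet weights in the style of the Bayesian bootstrap; the paper's exact weighting scheme may differ.

import numpy as np

rng = np.random.default_rng(0)
n = 50  # number of items to resample

# Classical bootstrap: integer counts; on average about 1/e (roughly
# one third) of the original items receive weight zero and are lost.
counts = rng.multinomial(n, np.full(n, 1.0 / n))
print("bootstrap, fraction of items lost:", np.mean(counts == 0))

# Continuous weights: Dirichlet draws are strictly positive, so every
# original item keeps a nonzero weight and full dimensionality is retained.
weights = rng.dirichlet(np.ones(n))
print("continuous, fraction of items lost:", np.mean(weights == 0))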
NeatMap - non-clustering heat map alternatives in R
Background: The clustered heat map is the most popular means of visualizing genomic data. It compactly displays a large amount of data in an intuitive format that facilitates the detection of hidden structures and relations in the data. However, it is hampered by its use of cluster analysis, which does not always respect the intrinsic relations in the data, often requiring non-standardized reordering of rows/columns to be performed post-clustering. This sometimes leads to uninformative and/or misleading conclusions. Often it is more informative to use dimension-reduction algorithms (such as Principal Component Analysis and Multi-Dimensional Scaling) which respect the topology inherent in the data. Yet, despite their proven utility in the analysis of biological data, they are not as widely used. This is at least partially due to the lack of user-friendly visualization methods with the visceral impact of the heat map.
Results: NeatMap is an R package designed to meet this need. NeatMap offers a variety of novel plots (in 2 and 3 dimensions) to be used in conjunction with these dimension-reduction techniques. Like the heat map, but unlike traditional displays of such results, it allows the entire dataset to be displayed while visualizing relations between elements. It also allows superimposition of cluster analysis results for mutual validation. NeatMap is shown to be more informative than the traditional heat map with the help of two well-known microarray datasets.
Conclusions: NeatMap thus preserves many of the strengths of the clustered heat map while addressing some of its deficiencies. It is hoped that NeatMap will spur the adoption of non-clustering dimension-reduction algorithms.
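NeatMap itself is an R package; as a language-consistent sketch of the underlying idea (laying samples out by a dimension-reduction embedding instead of a cluster-ordered heat map), here is a minimal Python analogue using scikit-learn's MDS. It illustrates the concept only and uses none of NeatMap's API.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# Toy "expression" matrix: 60 samples from three shifted groups.
X = np.vstack([rng.normal(loc=mu, size=(20, 30)) for mu in (-2.0, 0.0, 2.0)])

# Metric MDS preserves pairwise distances rather than imposing a
# cluster ordering on rows and columns.
xy = MDS(n_components=2, random_state=0).fit_transform(X)

plt.scatter(xy[:, 0], xy[:, 1], c=np.repeat([0, 1, 2], 20))
plt.title("MDS layout of samples (non-clustering heat-map alternative)")
plt.show()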
Search for new phenomena in final states with an energetic jet and large missing transverse momentum in pp collisions at √s = 8 TeV with the ATLAS detector
Results of a search for new phenomena in final states with an energetic jet and large missing transverse momentum are reported. The search uses 20.3 fb−1 of √s = 8 TeV data collected in 2012 with the ATLAS detector at the LHC. Events are required to have at least one jet with pT > 120 GeV and no leptons. Nine signal regions are considered, with increasing missing transverse momentum requirements between E_T^miss > 150 GeV and E_T^miss > 700 GeV. Good agreement is observed between the number of events in data and Standard Model expectations. The results are translated into exclusion limits on models with either large extra spatial dimensions, pair production of weakly interacting dark matter candidates, or production of very light gravitinos in a gauge-mediated supersymmetric model. In addition, limits on the production of an invisibly decaying Higgs-like boson leading to similar topologies in the final state are presented.