119,707 research outputs found
Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data
BACKGROUND: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied. METHODOLOGY/PRINCIPAL FINDINGS: We show that hierarchical clustering algorithms that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains including gene expression experiments, stock trade records and functional protein families. The performance of each of the algorithms is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm that is based on genuine density of the data points is presented and is shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion-channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available. CONCLUSIONS: Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can improve the quality of manually created mappings of protein families.
As demonstrated, it can also provide insights into erroneous and missed annotations.
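The contrast between bottom-up and top-down clustering, and the idea of scoring both against expert labels, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy points and labels are invented, average linkage stands in for the BU method, a farthest-pair split stands in for the TD method (it is not the density-based algorithm of the abstract), and purity is used as a simple substitute for the statistical criteria the authors apply.

```python
import math
from collections import Counter
from itertools import combinations

def agglomerative(points, k):
    """Bottom-up: start from singletons and merge the two closest
    clusters (average linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters

def divisive(points, k):
    """Top-down: start from one cluster and repeatedly split the cluster
    with the largest diameter around its two farthest points.
    Assumes k <= len(points)."""
    clusters = [list(range(len(points)))]

    def farthest_pair(c):
        if len(c) < 2:
            return (0.0, c[0], c[0])
        return max((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(c, 2))

    while len(clusters) < k:
        idx = max(range(len(clusters)),
                  key=lambda n: farthest_pair(clusters[n])[0])
        _, s1, s2 = farthest_pair(clusters[idx])
        left = [i for i in clusters[idx]
                if math.dist(points[i], points[s1]) <= math.dist(points[i], points[s2])]
        right = [i for i in clusters[idx] if i not in left]
        clusters[idx:idx + 1] = [left, right]
    return clusters

def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    correct = sum(max(Counter(labels[i] for i in c).values()) for c in clusters)
    return correct / len(labels)

# Invented toy data: two well-separated groups with "expert" labels.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["a", "a", "a", "b", "b", "b"]

print("BU purity:", purity(agglomerative(points, 2), labels))
print("TD purity:", purity(divisive(points, 2), labels))
```

On data this clean both strategies recover the labels; differences of the kind the abstract reports would only show up on harder data with structure at several scales.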
Functional Data Analysis in Electronic Commerce Research
This paper describes opportunities and challenges of using functional data
analysis (FDA) for the exploration and analysis of data originating from
electronic commerce (eCommerce). We discuss the special data structures that
arise in the online environment and why FDA is a natural approach for
representing and analyzing such data. The paper reviews several FDA methods and
motivates their usefulness in eCommerce research by providing a glimpse into
new domain insights that they allow. We argue that the wedding of eCommerce
with FDA leads to innovations both in statistical methodology, due to the
challenges and complications that arise in eCommerce data, and in online
research, by being able to ask (and subsequently answer) new research questions
that classical statistical methods are not able to address, and also by
expanding on research questions beyond the ones traditionally asked in the
offline environment. We describe several applications originating from online
transactions which are new to the statistics literature, and point out
statistical challenges accompanied by some solutions. We also discuss some
promising future directions for joint research efforts between researchers in
eCommerce and statistics.
Comment: Published at http://dx.doi.org/10.1214/088342306000000132 in
Statistical Science (http://www.imstat.org/sts/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Classification methods for Hilbert data based on surrogate density
Unsupervised and supervised classification approaches for Hilbert random
curves are studied. Both rest on the use of a surrogate of the probability
density which is defined, in a distribution-free mixture context, from an
asymptotic factorization of the small-ball probability. That surrogate density
is estimated by a kernel approach from the principal components of the data.
The focus is on the illustration of the classification algorithms and the
computational implications, with particular attention to the tuning of the
parameters involved. Some asymptotic results are sketched. Applications on
simulated and real datasets show how the proposed methods work.
Comment: 33 pages, 11 figures, 6 tables
Diffusion map for clustering fMRI spatial maps extracted by independent component analysis
Functional magnetic resonance imaging (fMRI) produces data about activity
inside the brain, from which spatial maps can be extracted by independent
component analysis (ICA). A dataset consists of n spatial maps, each containing
p voxels. The number of voxels is very high compared to the number of analyzed
spatial maps. Clustering of the spatial maps is usually based on correlation
matrices. This usually works well, although such a similarity matrix inherently
can explain only a certain amount of the total variance contained in the
high-dimensional data where n is relatively small but p is large. For
high-dimensional space, it is reasonable to perform dimensionality reduction
before clustering. In this research, we used the recently developed diffusion
map for dimensionality reduction in conjunction with spectral clustering. This
research revealed that the diffusion map based clustering worked as well as the
more traditional methods, and produced more compact clusters when needed.
Comment: 6 pages. 8 figures. Copyright (c) 2013 IEEE. Published at 2013 IEEE
International Workshop on Machine Learning for Signal Processing
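The idea of reducing dimensionality with a diffusion map before clustering can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' pipeline: the six toy "maps" with four "voxels" each are invented, the Gaussian kernel uses Euclidean distance with an arbitrary bandwidth `epsilon`, only the first non-trivial diffusion coordinate is computed (by power iteration, up to scale), and a sign split replaces a full spectral clustering step.

```python
import math

def diffusion_coordinate(data, epsilon=1.0, iters=500):
    """Return the first non-trivial diffusion coordinate (up to scale).

    Builds a Gaussian kernel, row-normalises it into a Markov matrix P,
    and runs power iteration while projecting out the trivial constant
    eigenvector of P."""
    n = len(data)
    kernel = [[math.exp(-math.dist(a, b) ** 2 / epsilon) for b in data]
              for a in data]
    degree = [sum(row) for row in kernel]
    P = [[kernel[i][j] / degree[i] for j in range(n)] for i in range(n)]

    # Stationary distribution of P; eigenvectors of P are orthogonal
    # under the inner product weighted by it.
    total = sum(degree)
    pi = [d / total for d in degree]

    # Arbitrary deterministic start; it must not be orthogonal to the target.
    v = [float(i) for i in range(n)]
    for _ in range(iters):
        v = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
        weighted_mean = sum(p * x for p, x in zip(pi, v))
        v = [x - weighted_mean for x in v]      # remove constant component
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Invented toy data: n = 6 maps, p = 4 voxels, two obvious groups.
maps = [
    (1.0, 1.0, 0.0, 0.0), (0.9, 1.1, 0.0, 0.1), (1.1, 0.9, 0.1, 0.0),
    (0.0, 0.0, 1.0, 1.0), (0.1, 0.0, 0.9, 1.1), (0.0, 0.1, 1.1, 0.9),
]

coords = diffusion_coordinate(maps)
groups = [c > 0 for c in coords]  # two-way split on the sign
print(groups)
```

In practice one would keep several diffusion coordinates and cluster in that reduced space; a single coordinate is used here only to keep the sketch short.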
Search for Evergreens in Science: A Functional Data Analysis
Evergreens in science are papers that display a continual rise in annual
citations without decline, at least within a sufficiently long time period.
Aiming to better understand evergreens in particular and patterns of citation
trajectory in general, this paper develops a functional data analysis method to
cluster citation trajectories of a sample of 1699 research papers published in
1980 in the American Physical Society (APS) journals. We propose a functional
Poisson regression model for individual papers' citation trajectories, and fit
the model to the observed 30-year citations of individual papers by functional
principal component analysis and maximum likelihood estimation. Based on the
estimated paper-specific coefficients, we apply the K-means clustering
algorithm to cluster papers into different groups, for uncovering general types
of citation trajectories. The result demonstrates the existence of an evergreen
cluster of papers that do not exhibit any decline in annual citations over 30
years.
Comment: 40 pages, 9 figures
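The final step described above, grouping papers by paper-specific coefficients with K-means, can be sketched as follows. This is only an illustration under clearly stated assumptions: the four citation trajectories are synthetic, and each paper is summarised by its mean annual citations and least-squares slope as a crude stand-in for the coefficients that the authors estimate with a functional Poisson model and functional principal component analysis.

```python
import math

def trajectory_features(counts):
    """Summarise a trajectory by (mean level, least-squares slope).
    A stand-in for model-based coefficients, assumed for illustration."""
    n = len(counts)
    t_mean = (n - 1) / 2
    y_mean = sum(counts) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(counts))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    return (y_mean, sxy / sxx)

def kmeans(points, k, iters=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    assign = []
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members)
                                         for x in zip(*members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return assign, centers

# Synthetic annual citation counts: two rising, two declining trajectories.
trajectories = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [0, 1, 3, 4, 4, 6, 7, 9],
    [8, 7, 5, 4, 3, 2, 1, 0],
    [9, 7, 6, 4, 3, 1, 1, 0],
]

features = [trajectory_features(t) for t in trajectories]
assign, centers = kmeans(features, k=2)
print(assign)
```

A cluster whose center has a clearly positive slope would correspond, loosely, to the evergreen group; the real analysis works with 30-year trajectories and model-based coefficients rather than these two summary features.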