15,275 research outputs found
Ensemble clustering for result diversification
This paper describes the participation of the University of Twente in the Web track of TREC 2012. Our baseline approach uses the Mirex toolkit, an open source tool that sequantially scans all the documents. For result diversification, we experimented with improving the quality of clusters through ensemble clustering. We combined clusters obtained by different clustering methods (such as LDA and K-means) and clusters obtained by using different types of data (such as document text and anchor text). Our two-layer ensemble run performed better than the LDA based diversification and also better than a non-diversification run
Combining Multiple Clusterings via Crowd Agreement Estimation and Multi-Granularity Link Analysis
The clustering ensemble technique aims to combine multiple clusterings into a
probably better and more robust clustering and has been receiving an increasing
attention in recent years. There are mainly two aspects of limitations in the
existing clustering ensemble approaches. Firstly, many approaches lack the
ability to weight the base clusterings without access to the original data and
can be affected significantly by the low-quality, or even ill clusterings.
Secondly, they generally focus on the instance level or cluster level in the
ensemble system and fail to integrate multi-granularity cues into a unified
model. To address these two limitations, this paper proposes to solve the
clustering ensemble problem via crowd agreement estimation and
multi-granularity link analysis. We present the normalized crowd agreement
index (NCAI) to evaluate the quality of base clusterings in an unsupervised
manner and thus weight the base clusterings in accordance with their clustering
validity. To explore the relationship between clusters, the source aware
connected triple (SACT) similarity is introduced with regard to their common
neighbors and the source reliability. Based on NCAI and multi-granularity
information collected among base clusterings, clusters, and data instances, we
further propose two novel consensus functions, termed weighted evidence
accumulation clustering (WEAC) and graph partitioning with multi-granularity
link analysis (GP-MGLA) respectively. The experiments are conducted on eight
real-world datasets. The experimental results demonstrate the effectiveness and
robustness of the proposed methods.Comment: The MATLAB source code of this work is available at:
https://www.researchgate.net/publication/28197031
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
EC3: Combining Clustering and Classification for Ensemble Learning
Classification and clustering algorithms have been proved to be successful
individually in different contexts. Both of them have their own advantages and
limitations. For instance, although classification algorithms are more powerful
than clustering methods in predicting class labels of objects, they do not
perform well when there is a lack of sufficient manually labeled reliable data.
On the other hand, although clustering algorithms do not produce label
information for objects, they provide supplementary constraints (e.g., if two
objects are clustered together, it is more likely that the same label is
assigned to both of them) that one can leverage for label prediction of a set
of unknown objects. Therefore, systematic utilization of both these types of
algorithms together can lead to better prediction performance. In this paper,
We propose a novel algorithm, called EC3 that merges classification and
clustering together in order to support both binary and multi-class
classification. EC3 is based on a principled combination of multiple
classification and multiple clustering methods using an optimization function.
We theoretically show the convexity and optimality of the problem and solve it
by block coordinate descent method. We additionally propose iEC3, a variant of
EC3 that handles imbalanced training data. We perform an extensive experimental
analysis by comparing EC3 and iEC3 with 14 baseline methods (7 well-known
standalone classifiers, 5 ensemble classifiers, and 2 existing methods that
merge classification and clustering) on 13 standard benchmark datasets. We show
that our methods outperform other baselines for every single dataset, achieving
at most 10% higher AUC. Moreover our methods are faster (1.21 times faster than
the best baseline), more resilient to noise and class imbalance than the best
baseline method.Comment: 14 pages, 7 figures, 11 table
The multiplex structure of interbank networks
The interbank market has a natural multiplex network representation. We
employ a unique database of supervisory reports of Italian banks to the Banca
d'Italia that includes all bilateral exposures broken down by maturity and by
the secured and unsecured nature of the contract. We find that layers have
different topological properties and persistence over time. The presence of a
link in a layer is not a good predictor of the presence of the same link in
other layers. Maximum entropy models reveal different unexpected substructures,
such as network motifs, in different layers. Using the total interbank network
or focusing on a specific layer as representative of the other layers provides
a poor representation of interlinkages in the interbank market and could lead
to biased estimation of systemic risk.Comment: 41 pages, 8 figures, 10 table
Ranked List Loss for Deep Metric Learning
The objective of deep metric learning (DML) is to learn embeddings that can
capture semantic similarity and dissimilarity information among data points.
Existing pairwise or tripletwise loss functions used in DML are known to suffer
from slow convergence due to a large proportion of trivial pairs or triplets as
the model improves. To improve this, ranking-motivated structured losses are
proposed recently to incorporate multiple examples and exploit the structured
information among them. They converge faster and achieve state-of-the-art
performance. In this work, we unveil two limitations of existing
ranking-motivated structured losses and propose a novel ranked list loss to
solve both of them. First, given a query, only a fraction of data points is
incorporated to build the similarity structure. Consequently, some useful
examples are ignored and the structure is less informative. To address this, we
propose to build a set-based similarity structure by exploiting all instances
in the gallery. The learning setting can be interpreted as few-shot retrieval:
given a mini-batch, every example is iteratively used as a query, and the rest
ones compose the gallery to search, i.e., the support set in few-shot setting.
The rest examples are split into a positive set and a negative set. For every
mini-batch, the learning objective of ranked list loss is to make the query
closer to the positive set than to the negative set by a margin. Second,
previous methods aim to pull positive pairs as close as possible in the
embedding space. As a result, the intraclass data distribution tends to be
extremely compressed. In contrast, we propose to learn a hypersphere for each
class in order to preserve useful similarity structure inside it, which
functions as regularisation. Extensive experiments demonstrate the superiority
of our proposal by comparing with the state-of-the-art methods.Comment: Accepted to T-PAMI. Therefore, to read the offical version, please go
to IEEE Xplore. Fine-grained image retrieval task. Our source code is
available online: https://github.com/XinshaoAmosWang/Ranked-List-Loss-for-DM
- …