9,599 research outputs found
Non-parametric document clustering by ensemble methods
Los sesgos de los algoritmos individuales para clustering no paramétrico
de documentos pueden conducir a soluciones no óptimas. Los métodos de consenso
podrÃan compensar esta limitación, pero no han sido probados sobre colecciones de
documentos. Este artÃculo presenta una comparación de estrategias para clustering
no paramétrico de documentos por consenso. / The biases of individual algorithms for non-parametric document clustering can lead to non-optimal solutions. Ensemble clustering methods may overcome this limitation, but have not been applied to document collections. This paper presents a comparison of strategies for non-parametric document ensemble clustering.Peer ReviewedPostprint (published version
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows to leverage weakly labeled data.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas are described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight on how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking and promising avenues for research
A network approach to topic models
One of the main computational and scientific challenges in the modern age is
to extract useful information from unstructured texts. Topic models are one
popular machine-learning approach which infers the latent topical structure of
a collection of documents. Despite their success --- in particular of its most
widely used variant called Latent Dirichlet Allocation (LDA) --- and numerous
applications in sociology, history, and linguistics, topic models are known to
suffer from severe conceptual and practical problems, e.g. a lack of
justification for the Bayesian priors, discrepancies with statistical
properties of real texts, and the inability to properly choose the number of
topics. Here we obtain a fresh view on the problem of identifying topical
structures by relating it to the problem of finding communities in complex
networks. This is achieved by representing text corpora as bipartite networks
of documents and words. By adapting existing community-detection methods --
using a stochastic block model (SBM) with non-parametric priors -- we obtain a
more versatile and principled framework for topic modeling (e.g., it
automatically detects the number of topics and hierarchically clusters both the
words and documents). The analysis of artificial and real corpora demonstrates
that our SBM approach leads to better topic models than LDA in terms of
statistical model selection. More importantly, our work shows how to formally
relate methods from community detection and topic modeling, opening the
possibility of cross-fertilization between these two fields.Comment: 22 pages, 10 figures, code available at https://topsbm.github.io
Path Similarity Analysis: a Method for Quantifying Macromolecular Pathways
Diverse classes of proteins function through large-scale conformational
changes; sophisticated enhanced sampling methods have been proposed to generate
these macromolecular transition paths. As such paths are curves in a
high-dimensional space, they have been difficult to compare quantitatively, a
prerequisite to, for instance, assess the quality of different sampling
algorithms. The Path Similarity Analysis (PSA) approach alleviates these
difficulties by utilizing the full information in 3N-dimensional trajectories
in configuration space. PSA employs the Hausdorff or Fr\'echet path
metrics---adopted from computational geometry---enabling us to quantify path
(dis)similarity, while the new concept of a Hausdorff-pair map permits the
extraction of atomic-scale determinants responsible for path differences.
Combined with clustering techniques, PSA facilitates the comparison of many
paths, including collections of transition ensembles. We use the closed-to-open
transition of the enzyme adenylate kinase (AdK)---a commonly used testbed for
the assessment enhanced sampling algorithms---to examine multiple microsecond
equilibrium molecular dynamics (MD) transitions of AdK in its substrate-free
form alongside transition ensembles from the MD-based dynamic importance
sampling (DIMS-MD) and targeted MD (TMD) methods, and a geometrical targeting
algorithm (FRODA). A Hausdorff pairs analysis of these ensembles revealed, for
instance, that differences in DIMS-MD and FRODA paths were mediated by a set of
conserved salt bridges whose charge-charge interactions are fully modeled in
DIMS-MD but not in FRODA. We also demonstrate how existing trajectory analysis
methods relying on pre-defined collective variables, such as native contacts or
geometric quantities, can be used synergistically with PSA, as well as the
application of PSA to more complex systems such as membrane transporter
proteins.Comment: 9 figures, 3 tables in the main manuscript; supplementary information
includes 7 texts (S1 Text - S7 Text) and 11 figures (S1 Fig - S11 Fig) (also
available from journal site
OBOE: Collaborative Filtering for AutoML Model Selection
Algorithm selection and hyperparameter tuning remain two of the most
challenging tasks in machine learning. Automated machine learning (AutoML)
seeks to automate these tasks to enable widespread use of machine learning by
non-experts. This paper introduces OBOE, a collaborative filtering method for
time-constrained model selection and hyperparameter tuning. OBOE forms a matrix
of the cross-validated errors of a large number of supervised learning models
(algorithms together with hyperparameters) on a large number of datasets, and
fits a low rank model to learn the low-dimensional feature vectors for the
models and datasets that best predict the cross-validated errors. To find
promising models for a new dataset, OBOE runs a set of fast but informative
algorithms on the new dataset and uses their cross-validated errors to infer
the feature vector for the new dataset. OBOE can find good models under
constraints on the number of models fit or the total time budget. To this end,
this paper develops a new heuristic for active learning in time-constrained
matrix completion based on optimal experiment design. Our experiments
demonstrate that OBOE delivers state-of-the-art performance faster than
competing approaches on a test bed of supervised learning problems. Moreover,
the success of the bilinear model used by OBOE suggests that AutoML may be
simpler than was previously understood
- …