2,052 research outputs found
Optimism in Active Learning with Gaussian Processes
In the context of Active Learning for classification, the classification error depends on the joint distribution of samples and their labels, which is initially unknown. Minimizing this error requires estimating this distribution. Online estimation of this distribution involves a trade-off between exploration and exploitation. This is a common problem in machine learning, for which multi-armed bandit theory, building upon Optimism in the Face of Uncertainty, has proven very effective in recent years. We introduce two novel algorithms that use Optimism in the Face of Uncertainty along with Gaussian Processes for the Active Learning problem. An evaluation conducted on real-world datasets shows that these new algorithms compare favorably to state-of-the-art methods.
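As an illustration of the Optimism in the Face of Uncertainty principle mentioned in this abstract, here is a minimal sketch of the classic UCB1 selection rule from the bandit literature. It is not one of the paper's two algorithms (those combine optimism with Gaussian Processes, which the abstract does not detail); the function name `ucb1_select` and its arguments are hypothetical.

```python
import math

def ucb1_select(counts, means, t):
    """Pick the arm with the highest optimistic estimate (UCB1 index).

    counts[i] is how often arm i was played, means[i] its average reward,
    and t the total number of plays so far (all names are illustrative).
    """
    # Play every arm once before trusting the confidence bounds.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Optimism: empirical mean plus an exploration bonus that shrinks
    # as an arm is sampled more often.
    scores = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

A rarely sampled arm receives a large bonus and is therefore explored, while a frequently sampled arm is chosen only if its empirical mean is high.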
Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity
A high degree of topical diversity is often considered to be an important
characteristic of interesting text documents. A recent proposal for measuring
topical diversity identifies three elements for assessing diversity: words,
topics, and documents as collections of words. Topic models play a central role
in this approach. Using standard topic models for measuring diversity of
documents is suboptimal due to generality and impurity. General topics only
include common information from a background corpus and are assigned to most of
the documents in the collection. Impure topics contain words that are not
related to the topic; impurity lowers the interpretability of topic models and
impure topics are likely to get assigned to documents erroneously. We propose a
hierarchical re-estimation approach for topic models to combat generality and
impurity; the proposed approach operates at three levels: words, topics, and
documents. Our re-estimation approach for measuring documents' topical
diversity outperforms the state of the art on the PubMed dataset, which is commonly
used for diversity experiments. Comment: Proceedings of the 39th European Conference on Information Retrieval
(ECIR2017)
TK: The Twitter Top-K Keywords Benchmark
Information retrieval from textual data focuses on the construction of
vocabularies that contain weighted term tuples. Such vocabularies can then be
exploited by various text analysis algorithms to extract new knowledge, e.g.,
top-k keywords, top-k documents, etc. Top-k keywords are casually used for
various purposes, are often computed on-the-fly, and thus must be efficiently
computed. To compare competing weighting schemes and database implementations,
benchmarking is customary. To the best of our knowledge, no benchmark currently
addresses these problems. Hence, in this paper, we present a top-k keywords
benchmark, TK, which features a real tweet dataset and queries with
various complexities and selectivities. TK helps evaluate weighting
schemes and database implementations in terms of computing performance. To
illustrate TK's relevance and genericity, we successfully performed
tests on the TF-IDF and Okapi BM25 weighting schemes, on the one hand, and on
different relational (Oracle, PostgreSQL) and document-oriented (MongoDB)
database implementations, on the other hand.
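To make the two weighting schemes named in this abstract concrete, the following sketch computes top-k keywords over a small in-memory corpus. It is an illustration only and not the TK benchmark's queries or schema: the function name `top_k_keywords`, the summation of per-document weights, and the BM25 parameters `k1=1.2`, `b=0.75` with a non-negative IDF variant are all assumptions.

```python
import math
from collections import Counter

def top_k_keywords(docs, k=3, scheme="tfidf", k1=1.2, b=0.75):
    """Return the k highest-weighted terms over a list of tokenized documents."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = Counter()
    for d in docs:
        tf = Counter(d)
        for term, f in tf.items():
            if scheme == "tfidf":
                # TF-IDF: raw term frequency times log inverse document frequency.
                w = f * math.log(n_docs / df[term])
            else:
                # Okapi BM25 with the "+1" IDF variant, which stays non-negative.
                idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
                w = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avg_len))
            # Aggregate each term's weight over the whole corpus.
            scores[term] += w
    return [term for term, _ in scores.most_common(k)]
```

For example, `top_k_keywords([["a", "b", "b"], ["a", "c"], ["a", "b"]], k=2)` ranks terms by summed TF-IDF weight; a term that occurs in every document gets weight zero under this scheme.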
Good Statistical Practice: development of tailored Good Clinical Practice training for statisticians
© The Author(s) 2024. Background: Statisticians are fundamental in ensuring that clinical research, including clinical trials, is conducted with quality, transparency, reproducibility and integrity. Good Clinical Practice (GCP) is an international quality standard for the conduct of clinical trials research. Statisticians are required to undertake training on GCP, but existing training is generic and, crucially, does not cover statistical activities. This results in statisticians undertaking training mostly unrelated to their role, and in variation in awareness and implementation of relevant regulatory requirements with regard to statistical conduct. The need for role-relevant training is recognised by the UK NHS Health Research Authority and the Medicines and Healthcare products Regulatory Agency (MHRA). Methods: The Good Statistical Practice (GCP for Statisticians) project was instigated by the UK Clinical Research Collaboration (UKCRC) Registered Clinical Trials Unit (CTU) Statisticians Operational Group and funded by the National Institute for Health and Care Research (NIHR), to develop materials to enable role-specific GCP training tailored to statisticians. Review of current GCP training was undertaken by survey. Development of training materials was based on MHRA GCP. Critical review and piloting were conducted with UKCRC CTU and NIHR researchers, with comment from MHRA. Final review was conducted through the UKCRC CTU Statistics group. Results: The survey confirmed the need and desire for the development of dedicated GCP training for statisticians. An accessible, comprehensive, piloted training package was developed, tailored to statisticians working in clinical research, particularly the clinical trials arena. The training materials cover legislation and guidance for best practice across all clinical trial processes with statistical involvement, including exercises and real-life scenarios to bridge the gap between theory and practice.
Comprehensive feedback was incorporated. The training materials are freely available for national and international adoption. Conclusion: All research staff should have training in GCP, yet the training undertaken by most academic statisticians does not cover activities related to their role. The Good Statistical Practice (GCP for Statisticians) project has developed and extensively piloted new, role-specific, comprehensive, accessible GCP training tailored to statisticians working in clinical research, particularly the clinical trials arena. This role-specific training will encourage best practice, leading to transparent and reproducible statistical activity, as required by regulatory authorities and funders.
TCGM: An Information-Theoretic Framework for Semi-Supervised Multi-Modality Learning
Fusing data from multiple modalities provides more information to train
machine learning systems. However, it is prohibitively expensive and
time-consuming to label each modality with a large amount of data, which leads
to a crucial problem of semi-supervised multi-modal learning. Existing methods
suffer from either ineffective fusion across modalities or lack of theoretical
guarantees under proper assumptions. In this paper, we propose a novel
information-theoretic approach, namely Total Correlation
Gain Maximization (TCGM), for semi-supervised multi-modal
learning, which is endowed with promising properties: (i) it can effectively
utilize the information across different modalities of unlabeled data
points to facilitate training classifiers of each modality; (ii) it has a
theoretical guarantee to identify Bayesian classifiers, i.e., the ground truth
posteriors of all modalities. Specifically, by maximizing a TC-induced loss
(namely TC gain) over classifiers of all modalities, these classifiers can
cooperatively discover the equivalence class of ground-truth classifiers, and
identify the unique ones by leveraging a limited percentage of labeled data. We
apply our method to various tasks and achieve state-of-the-art results,
including news classification, emotion recognition and disease prediction. Comment: ECCV 2020 (oral)
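For readers unfamiliar with the quantity behind TCGM, the sketch below computes the total correlation of a discrete joint distribution, defined as the sum of the marginal entropies minus the joint entropy. It only illustrates the underlying information-theoretic measure; it is not the paper's TC gain loss, and the function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def total_correlation(joint):
    """Total correlation of a joint probability table with one axis per variable."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()  # normalize defensively
    marginal_entropies = 0.0
    for axis in range(joint.ndim):
        # Marginalize out every axis except the current one.
        other = tuple(a for a in range(joint.ndim) if a != axis)
        marginal_entropies += entropy(joint.sum(axis=other))
    return marginal_entropies - entropy(joint.ravel())
```

Total correlation is zero when the variables are independent and grows as they share more information, which is why it can serve as a signal of agreement across modalities.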
A Compromise between Neutrino Masses and Collider Signatures in the Type-II Seesaw Model
A natural extension of the standard $SU(2)_{\rm L} \times U(1)_{\rm Y}$ gauge
model to accommodate massive neutrinos is to introduce one Higgs triplet and
three right-handed Majorana neutrinos, leading to a $6\times 6$ neutrino mass
matrix which contains three $3\times 3$ sub-matrices $M_{\rm L}$, $M_{\rm D}$
and $M_{\rm R}$. We show that three light Majorana neutrinos (i.e., the mass
eigenstates of $\nu_e$, $\nu_\mu$ and $\nu_\tau$) are exactly massless in this
model, if and only if $M_{\rm L} = M_{\rm D} M_{\rm R}^{-1} M_{\rm D}^T$
exactly holds. This no-go theorem implies that small but non-vanishing neutrino
masses may result from a significant but incomplete cancellation between
$M_{\rm L}$ and $M_{\rm D} M_{\rm R}^{-1} M_{\rm D}^T$ terms in the Type-II
seesaw formula, provided three right-handed Majorana neutrinos are of ${\cal O}(1)$ TeV and experimentally detectable at the LHC. We propose three simple
Type-II seesaw scenarios with the $A_4 \times U(1)_{\rm X}$ flavor symmetry to
interpret the observed neutrino mass spectrum and neutrino mixing pattern. Such
a TeV-scale neutrino model can be tested in two complementary ways: (1)
searching for possible collider signatures of lepton number violation induced
by the right-handed Majorana neutrinos and doubly-charged Higgs particles; and
(2) searching for possible consequences of unitarity violation of the $3\times 3$ neutrino mixing matrix in the future long-baseline neutrino oscillation
experiments. Comment: RevTeX, 19 pages, no figure
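For reference, the Type-II seesaw formula that this abstract refers to is commonly written as below. The block layout and sign convention are the usual textbook ones and are assumed here rather than taken from the abstract. $M_{\rm L}$, $M_{\rm D}$ and $M_{\rm R}$ denote the left-handed Majorana, Dirac and right-handed Majorana mass sub-matrices, and $M_\nu$ is the effective light-neutrino mass matrix.

```latex
\[
\mathcal{M} =
\begin{pmatrix}
M_{\rm L} & M_{\rm D} \\
M_{\rm D}^{T} & M_{\rm R}
\end{pmatrix},
\qquad
M_\nu \simeq M_{\rm L} - M_{\rm D} M_{\rm R}^{-1} M_{\rm D}^{T}
\quad \text{(for } M_{\rm R} \gg M_{\rm D} \text{)}.
\]
```

In this form the two terms on the right-hand side can nearly cancel, so $M_\nu$ can stay small even when the heavy scale is only of order TeV.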
Accurate and Fast Retrieval for Complex Non-metric Data via Neighborhood Graphs
We demonstrate that a graph-based search algorithm, relying on the
construction of an approximate neighborhood graph, can directly work with
challenging non-metric and/or non-symmetric distances without resorting to
metric-space mapping and/or distance symmetrization, which, in turn, lead to
substantial performance degradation. Although straightforward metrization
and symmetrization are usually ineffective, we find that constructing an index
using a modified, e.g., symmetrized, distance can improve performance. This
observation paves the way for a new line of research on designing index-specific
graph-construction distance functions.
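The idea of building the index with one distance and searching with another can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation. It uses KL divergence as the non-symmetric distance, an averaged symmetrization for graph construction, an exact (brute-force) neighbor graph instead of an approximate one, and a plain greedy walk; all function names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q): non-symmetric and non-metric."""
    return float(np.sum(p * np.log(p / q)))

def sym_kl(p, q):
    """Symmetrized variant, used here only to build the graph."""
    return 0.5 * (kl(p, q) + kl(q, p))

def build_knn_graph(data, dist, m=5):
    """Exact m-nearest-neighbor graph (a real index would approximate this)."""
    graph = []
    for i, x in enumerate(data):
        # Distance from point i to every other point; exclude the point itself.
        d = [dist(x, y) if j != i else np.inf for j, y in enumerate(data)]
        graph.append(list(np.argsort(d)[:m]))
    return graph

def greedy_search(data, graph, query, dist, start=0):
    """Move to the neighbor closest to the query until no neighbor improves."""
    current, best = start, dist(data[start], query)
    while True:
        cand = min(graph[current], key=lambda j: dist(data[j], query))
        d = dist(data[cand], query)
        if d >= best:
            return current, best
        current, best = cand, d
```

A typical use would call `build_knn_graph(data, sym_kl)` once and then `greedy_search(data, graph, query, kl)` per query, so the original non-symmetric distance is still the one used at search time. The inputs are assumed to be strictly positive probability vectors.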