Ontology-Based MEDLINE Document Classification
An increasing and overwhelming amount of biomedical information is available in the research literature, mainly in the form of free text. Biologists need tools that automate their information search and deal with the high volume and ambiguity of free text. Ontologies can help automatic information processing by providing standard concepts and information about the relationships between concepts. The Medical Subject Headings (MeSH) ontology is already available and used by MEDLINE indexers to annotate the conceptual content of biomedical articles. This paper presents a domain-independent method that uses the MeSH ontology inter-concept relationships to extend the existing MeSH-based representation of MEDLINE documents. The extension method is evaluated within a document triage task organized by the Genomics track of the 2005 Text REtrieval Conference (TREC). Our method for extending the representation of documents leads to an improvement of 17% over a non-extended baseline in terms of normalized utility, the metric defined for the task. The SVMlight software is used to classify documents.
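The idea of extending a concept-based document representation through ontology relationships can be illustrated with a minimal sketch. It assumes that the extension adds ancestor concepts up to a fixed depth; the abstract does not say which relationships or depth the paper uses. The hierarchy `TOY_PARENTS` and the function `extend_terms` are hypothetical and do not reproduce the actual MeSH tree.

```python
from collections import deque

# Hypothetical child -> parents map; a toy stand-in, not the real MeSH hierarchy.
TOY_PARENTS = {
    "mouse": ["rodent"],
    "rodent": ["mammal"],
    "mammal": ["animal"],
    "gene": ["genetic structure"],
}


def extend_terms(terms, parents, max_depth=1):
    """Return the input terms plus their ancestors up to max_depth levels."""
    extended = set(terms)
    queue = deque((term, 0) for term in terms)
    while queue:
        term, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not climb further than the allowed depth
        for parent in parents.get(term, []):
            if parent not in extended:
                extended.add(parent)
                queue.append((parent, depth + 1))
    return extended


if __name__ == "__main__":
    print(sorted(extend_terms({"mouse", "gene"}, TOY_PARENTS, max_depth=1)))
```

The extended term set could then be turned into a feature vector for a classifier such as an SVM; that step is omitted here.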
Sensitive and Scalable Online Evaluation with Theoretical Guarantees
Multileaved comparison methods generalize interleaved comparison methods to
provide a scalable approach for comparing ranking systems based on regular user
interactions. Such methods enable the increasingly rapid research and
development of search engines. However, existing multileaved comparison methods
that provide reliable outcomes do so by degrading the user experience during
evaluation. Conversely, current multileaved comparison methods that maintain
the user experience cannot guarantee correctness. Our contribution is two-fold.
First, we propose a theoretical framework for systematically comparing
multileaved comparison methods using the notions of considerateness, which
concerns maintaining the user experience, and fidelity, which concerns reliable
correct outcomes. Second, we introduce a novel multileaved comparison method,
Pairwise Preference Multileaving (PPM), that performs comparisons based on
document-pair preferences, and prove that it is considerate and has fidelity.
We show empirically that, compared to previous multileaved comparison methods,
PPM is more sensitive to user preferences and scalable with the number of
rankers being compared.
Comment: CIKM 2017, Proceedings of the 2017 ACM on Conference on Information
and Knowledge Management
Fake View Analytics in Online Video Services
Online video-on-demand (VoD) services invariably maintain a view count for
each video they serve, and it has become an important currency for various
stakeholders, from viewers, to content owners, advertisers, and the online
service providers themselves. There is often significant financial incentive to
use a robot (or a botnet) to artificially create fake views. How can we detect
the fake views? Can we detect them (and stop them) using online algorithms as
they occur? What is the extent of fake views with current VoD service
providers? These are the questions we study in the paper. We develop some
algorithms and show that they are quite effective for this problem.
Comment: 25 pages, 15 figures
Deformable Registration through Learning of Context-Specific Metric Aggregation
We propose a novel weakly supervised discriminative algorithm for learning
context-specific registration metrics as a linear combination of conventional
similarity measures. Conventional metrics have been extensively used over the
past two decades and therefore both their strengths and limitations are known.
The challenge is to find the optimal relative weighting (or parameters) of
different metrics forming the similarity measure of the registration algorithm.
Hand-tuning these parameters would result in suboptimal solutions and quickly
become infeasible as the number of metrics increases. Furthermore, such
hand-crafted combination can only happen at global scale (entire volume) and
therefore will not be able to account for the different tissue properties. We
propose a learning algorithm for estimating these parameters locally,
conditioned on the semantic classes of the data. The objective function of our
formulation is a special case of a non-convex function, a difference of convex
functions, which we optimize using the concave-convex procedure. As a proof of
concept, we show the impact of our approach on three challenging datasets for
different anatomical structures and modalities.
Comment: Accepted for publication in the 8th International Workshop on Machine
Learning in Medical Imaging (MLMI 2017), in conjunction with MICCAI 2017
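The form of a class-conditioned linear combination of similarity measures can be sketched as follows. This only evaluates an aggregated similarity for given weights; the weakly supervised learning of those weights and the concave-convex optimization are not shown. The function `aggregated_similarity`, its argument layout, and the class names in the usage example are assumptions made for illustration.

```python
def aggregated_similarity(metric_values, class_labels, weights):
    """Sum, over locations, the class-specific weighted combination of metrics.

    metric_values: one list of K metric values per location (e.g. per voxel).
    class_labels:  the semantic class of each location.
    weights:       dict mapping each class to its K metric weights.
    """
    total = 0.0
    for values, label in zip(metric_values, class_labels):
        class_weights = weights[label]
        total += sum(w * v for w, v in zip(class_weights, values))
    return total


if __name__ == "__main__":
    # Two locations, two metrics each; class names are hypothetical.
    values = [[1.0, 2.0], [3.0, 4.0]]
    labels = ["bone", "soft_tissue"]
    weights = {"bone": [1.0, 0.0], "soft_tissue": [0.5, 0.5]}
    print(aggregated_similarity(values, labels, weights))
```

With a single class covering the whole volume, this reduces to the global hand-tuned weighting that the abstract argues against.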
An Augmentation Hybrid System for Document Classification and Rating
This paper introduces an augmentation hybrid system, referred to as Rated MCRDR. It uses Multiple Classification Ripple Down Rules (MCRDR), a simple and effective knowledge acquisition technique, combined with a neural network.
Using Linguistic Information and Machine Learning Techniques to Identify Entities from Juridical Documents
Information extraction from legal documents is an important and open problem. A mixed approach, using linguistic information and machine learning techniques, is described in this paper. In this approach, top-level legal concepts are identified and used for document classification using Support Vector Machines. Named entities, such as locations, organizations, dates, and document references, are identified using semantic information from the output of a natural language parser. This information, legal concepts and named entities, may be used to populate a simple ontology, allowing the enrichment of documents and the creation of high-level legal information retrieval systems.
The proposed methodology was applied to a corpus of legal documents from the EUR-Lex site and evaluated. The obtained results were quite good and indicate this may be a promising approach to the legal information extraction problem.
Controlling Fairness and Bias in Dynamic Learning-to-Rank
Rankings are the primary interface through which many online platforms match
users to items (e.g. news, products, music, video). In these two-sided markets,
not only do the users draw utility from the rankings, but the rankings also
determine the utility (e.g. exposure, revenue) for the item providers (e.g.
publishers, sellers, artists, studios). It has already been noted that
myopically optimizing utility to the users, as done by virtually all
learning-to-rank algorithms, can be unfair to the item providers. We,
therefore, present a learning-to-rank approach for explicitly enforcing
merit-based fairness guarantees to groups of items (e.g. articles by the same
publisher, tracks by the same artist). In particular, we propose a learning
algorithm that ensures notions of amortized group fairness, while
simultaneously learning the ranking function from implicit feedback data. The
algorithm takes the form of a controller that integrates unbiased estimators
for both fairness and utility, dynamically adapting both as more data becomes
available. In addition to its rigorous theoretical foundation and convergence
guarantees, we find empirically that the algorithm is highly practical and
robust.
Comment: First two authors contributed equally. In Proceedings of the 43rd
International ACM SIGIR Conference on Research and Development in Information
Retrieval 2020
Enhancing Sensitivity Classification with Semantic Features using Word Embeddings
Government documents must be reviewed to identify any sensitive information
they may contain, before they can be released to the public. However,
traditional paper-based sensitivity review processes are not practical for reviewing
born-digital documents. Therefore, there is a timely need for automatic sensitivity
classification techniques, to assist the digital sensitivity review process.
However, sensitivity is typically a product of the relations between combinations
of terms, such as who said what about whom; therefore, automatic sensitivity
classification is a difficult task. Vector representations of terms, such as word
embeddings, have been shown to be effective at encoding latent term features
that preserve semantic relations between terms, which can also be beneficial to
sensitivity classification. In this work, we present a thorough evaluation of the
effectiveness of semantic word embedding features, along with term and grammatical
features, for sensitivity classification. On a test collection of government
documents containing real sensitivities, we show that extending text classification
with semantic features and additional term n-grams results in significant improvements
in classification effectiveness, correctly classifying 9.99% more sensitive
documents compared to the text classification baseline.
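Extending a term-based representation with n-grams and embedding features can be sketched as below. The sketch assumes that a document's semantic features are the average of its word vectors, which the abstract does not state; the embedding table `TOY_EMBEDDINGS` and the helpers `ngrams` and `featurize` are hypothetical.

```python
from collections import Counter

# Hypothetical 2-dimensional word vectors; real embeddings are learned from a corpus.
TOY_EMBEDDINGS = {
    "minister": [1.0, 0.0],
    "said": [0.0, 1.0],
    "informant": [1.0, 1.0],
}


def ngrams(tokens, n):
    """Return the list of space-joined n-grams of a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(tokens, vocab, embeddings, dim):
    """Concatenate unigram/bigram counts with the mean word embedding."""
    counts = Counter(tokens + ngrams(tokens, 2))
    term_features = [float(counts[entry]) for entry in vocab]
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if vectors:
        mean_vector = [sum(column) / len(vectors) for column in zip(*vectors)]
    else:
        mean_vector = [0.0] * dim  # no known words: fall back to a zero vector
    return term_features + mean_vector


if __name__ == "__main__":
    vocab = ["said", "minister said"]
    print(featurize(["minister", "said"], vocab, TOY_EMBEDDINGS, dim=2))
```

The concatenated vector could be passed to any standard text classifier; the grammatical features mentioned in the abstract are left out of this sketch.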