61 research outputs found
An Empirical Analysis on Point-wise Machine Learning Techniques using Regression Trees for Web-search Ranking
Learning how to rank a set of objects relative to a user-defined query has received much interest in the machine learning community during the past decade. In fact, there have been two recent competitions hosted by internationally prominent search companies to encourage research on ranking web documents. Recent literature on learning to rank has focused on three approaches: point-wise, pair-wise, and list-wise. Many different kinds of models, including boosted decision trees, neural networks, and SVMs, have proven successful in the field. This thesis surveys traditional point-wise techniques that use regression trees for web-search ranking. The thesis contains empirical studies on Random Forests and Gradient Boosted Decision Trees, with novel augmentations to them, on real-world data sets. We also analyze how these point-wise techniques perform on new areas of research for web-search ranking: transfer learning and feature-cost aware models.
Living analytics methods for the social web
[no abstract]
Combining Word Embedding Interactions and LETOR Feature Evidences for Supervised QPP
In information retrieval, query performance prediction aims to predict whether a search engine is likely to succeed in retrieving potentially relevant documents for a user’s query. This problem is usually cast as a regression problem in which a model predicts the effectiveness (in terms of an information retrieval measure) of the search engine on a given query. The solutions range from simple unsupervised approaches, where a single source of information (e.g., the variance of the retrieval similarity scores in NQC) predicts the search engine effectiveness for a given query, to more involved ones that rely on supervised machine learning and make use of several sources of information, e.g., learning to rank (LETOR) features, word embedding similarities, etc. In this paper, we investigate the combination of two different types of evidence into a single neural network model. Our first source of information is the semantic interaction between the terms in queries and their top-retrieved documents; our second source of information is the LETOR features.
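The unsupervised NQC predictor mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly cited formulation, in which the standard deviation of the top-k retrieval scores is divided by the score of the whole collection treated as one document. The function name, the `corpus_score` parameter and the default `k` are illustrative choices, not taken from the paper.

```python
import math


def nqc(scores, corpus_score, k=100):
    """Sketch of a Normalized Query Commitment style predictor.

    scores: retrieval scores of the documents returned for one query.
    corpus_score: score of the whole collection for the same query
        (assumed to be supplied by the caller and to be non-zero).
    k: number of top-ranked documents considered (illustrative default).
    """
    top = sorted(scores, reverse=True)[:k]
    if not top:
        raise ValueError("at least one retrieval score is required")
    mean = sum(top) / len(top)
    variance = sum((s - mean) ** 2 for s in top) / len(top)
    # A larger spread among the top scores is read as an easier query.
    return math.sqrt(variance) / abs(corpus_score)
```

Under this reading, a query whose top scores are all identical gets a prediction of zero, while a query with a few clearly separated top documents gets a higher value.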
Biomedical information extraction for matching patients to clinical trials
Digital medical information has grown astonishingly over the last decades, driven
by an unprecedented number of medical writers, which has led to a complete revolution in
what and how much information is available to health professionals.
The problem with this wave of information is that precisely selecting
the information retrieved from medical information repositories is exhausting and time
consuming for physicians. This is one of the biggest challenges for physicians in the
new digital era: how to reduce the time spent finding the best-matching document for a
patient (e.g. intervention articles, clinical trials, prescriptions).
Precision Medicine (PM) 2017 is a track of the Text REtrieval Conference (TREC)
that focuses on this type of challenge exclusively for oncology. Using a dataset with a
large number of clinical trials, this track is a good real-life example of how information
retrieval solutions can be used to solve these types of problems. The track can be a very
good starting point for applying information extraction and retrieval methods in a very
complex domain.
The purpose of this thesis is to improve a system designed by the NovaSearch team
for the TREC PM 2017 Clinical Trials task, which ranked among the top 5 systems of 2017.
The NovaSearch team also participated in the 2018 track and obtained a 15% increase in
precision compared to 2017. Multiple IR techniques were used for information
extraction and data processing, including rank fusion, query expansion (e.g. pseudo
relevance feedback, MeSH term expansion) and experiments with Learning to Rank
(LETOR) algorithms. Our goal is to retrieve the best possible set of trials for a given
patient, using precise document filters to exclude unwanted clinical trials. This work
can open doors to what can be done in searching and interpreting the criteria to exclude or
include trials, helping physicians even with the more complex and difficult information
retrieval tasks.
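The abstract names rank fusion without saying which variant was used. As an illustration only, the sketch below shows reciprocal rank fusion, one widely used fusion method; it is an assumption that something similar applies here, and the function name, the constant `k=60` and the document identifiers are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    rankings: list of ranked lists, best document first in each list.
    k: smoothing constant (60 is a commonly used value, assumed here).
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical trial ids returned by three different retrieval runs.
runs = [["t1", "t2"], ["t1", "t3"], ["t2", "t1"]]
print(reciprocal_rank_fusion(runs))
```

A document that appears near the top of several runs accumulates the highest fused score, which is the basic intuition behind combining runs.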
Acceleration of ListNet for ranking using reconfigurable architecture
Document ranking is used to order query results by relevance with ranking models. ListNet is a
well-known ranking approach for constructing and training learning-to-rank models. Compared with traditional learning approaches, ListNet delivers better accuracy, but is computationally too expensive to learn models with large data sets due to the large number of permutations and documents involved in computing the gradients. Currently, the long training time limits the practicality of ListNet in ranking applications such as breaking news search and stock prediction, and this situation is getting worse with the increase in data-set size. In order to tackle the challenge of long training time, this thesis optimises the ListNet algorithm, and designs hardware accelerators for learning the ListNet algorithm using Field Programmable Gate Arrays (FPGAs), making the algorithm more practical for real-world application.
The contributions of this thesis include: 1) A novel computation method of the ListNet algorithm for ranking. The proposed computation method exposes more fine-grained parallelism for FPGA implementation. 2) A weighted sampling method that takes into account the ranking positions, along with an effective quantisation method based on FPGA devices. The proposed design achieves a 4.42x speed improvement over a GPU implementation, while still guaranteeing the accuracy. 3) A fully reconfigurable architecture for ListNet training using multiple bitstream kernels. The proposed method achieves a higher model accuracy than pure fixed-point training, and a better throughput than pure floating-point training. By applying the above techniques, this thesis accelerates the ListNet algorithm for ranking using FPGAs, achieving significant improvements in speed over CPU and GPU implementations.
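For readers unfamiliar with ListNet, the following sketch shows the widely used top-one approximation of its list-wise loss: a cross-entropy between the softmax of the relevance labels and the softmax of the predicted scores for one query. It is a plain software illustration under that assumption, not the thesis's optimised computation method or its FPGA design, and the function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top1_loss(predicted_scores, relevance):
    """Cross-entropy between top-one probabilities of labels and predictions.

    Both arguments are lists over the documents of a single query.
    """
    target = softmax(relevance)
    predicted = softmax(predicted_scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))
```

The loss is smallest when the predicted scores induce the same top-one distribution as the labels, so a model that reverses the ordering is penalised more than one that reproduces it.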
iQPP: A Benchmark for Image Query Performance Prediction
To date, query performance prediction (QPP) in the context of content-based
image retrieval remains a largely unexplored task, especially in the
query-by-example scenario, where the query is an image. To boost the
exploration of the QPP task in image retrieval, we propose the first benchmark
for image query performance prediction (iQPP). First, we establish a set of
four data sets (PASCAL VOC 2012, Caltech-101, ROxford5k and RParis6k) and
estimate the ground-truth difficulty of each query as the average precision or
the precision@k, using two state-of-the-art image retrieval models. Next, we
propose and evaluate novel pre-retrieval and post-retrieval query performance
predictors, comparing them with existing or adapted (from text to image)
predictors. The empirical results show that most predictors do not generalize
across evaluation scenarios. Our comprehensive experiments indicate that iQPP
is a challenging benchmark, revealing an important research gap that needs to
be addressed in future work. We release our code and data as open source at
https://github.com/Eduard6421/iQPP, to foster future research.
Comment: Accepted at SIGIR 202
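The two ground-truth difficulty measures named in this abstract, precision@k and average precision, can be computed from a ranked list of binary relevance flags. The sketch below uses the standard textbook definitions; the function and parameter names are illustrative and are not taken from the iQPP code base.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k retrieved items that are relevant.

    ranked_relevance: list of 0/1 flags, best-ranked item first.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance, num_relevant=None):
    """Mean of the precision values at each rank holding a relevant item.

    num_relevant: total number of relevant items for the query; if omitted,
        the number of relevant items found in the list is used (an
        assumption that the list covers every relevant item).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = hits if num_relevant is None else num_relevant
    return precision_sum / denominator if denominator else 0.0
```

A query with a low value under either measure is one the retrieval model handles poorly, which is the sense in which these scores serve as query difficulty.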
Finding related sentence pairs in MEDLINE
We explore the feasibility of automatically identifying sentences in different MEDLINE abstracts that are related in meaning. We compared traditional vector space models with machine learning methods for detecting relatedness, and found that machine learning was superior. The Huber method, a variant of Support Vector Machines which minimizes the modified Huber loss function, achieves 73% precision when the score cutoff is set high enough to identify about one related sentence per abstract on average. We illustrate how an abstract viewed in PubMed might be modified to present the related sentences found in other abstracts by this automatic procedure.
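The modified Huber loss mentioned above has a standard closed form for binary labels in {-1, +1}: it is quadratic for margins of at least -1 and linear below that. The sketch below states that common definition; it is an assumption that the paper uses exactly this form, and the function name is illustrative.

```python
def modified_huber_loss(label, score):
    """Modified Huber loss for one example.

    label: +1 or -1.
    score: real-valued classifier output for the example.
    """
    margin = label * score
    if margin >= -1.0:
        # Squared hinge region: zero once the margin reaches 1.
        return max(0.0, 1.0 - margin) ** 2
    # Linear region: limits the influence of badly misclassified points.
    return -4.0 * margin
```

The two branches meet at a margin of -1, where both give a loss of 4, so the loss is continuous while growing only linearly for outliers.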