14 research outputs found

A cross-benchmark comparison of 87 learning to rank methods

Author: Alcântara
Busa-Fekete
Cai
Chapelle
Chapelle
Chen
Derhami
Djoerd Hiemstra
Duh
Freund
Geng
Geng
Gomes
He
Kao
Lai
Lai
Lai
Laporte
Metzler
Mohan
Niek Tax
Pahikkala
Pan
Qin
Qin
Rousseeuw
Rudin
Sander Bockting
Silva
Song
Sun
Torkestani
Torkestani
Veloso
Wang
Zong
Publication venue: Elsevier
Publication date: 01/01/2015

Learning to rank is an increasingly important scientific field that comprises the use of machine learning for the ranking task. New learning to rank methods are generally evaluated on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by the absence of a standard set of evaluation benchmark collections. In this paper we propose a way to compare learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets. Our comparison methodology consists of two components: (1) Normalized Winning Number, which gives insight into the ranking accuracy of the learning to rank method, and (2) Ideal Winning Number, which gives insight into the degree of certainty concerning its ranking accuracy. Evaluation results of 87 learning to rank methods on 20 well-known benchmark datasets are collected through a structured literature search. ListNet, SmoothRank, FenchelRank, FSMRank, LRUF and LARF are Pareto optimal learning to rank methods in the Normalized Winning Number and Ideal Winning Number dimensions, listed in increasing order of Normalized Winning Number and decreasing order of Ideal Winning Number.
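
The two measures named in this abstract can be illustrated with a minimal sketch. The abstract does not give formulas, so the definitions below are an assumption: a method's Ideal Winning Number is taken as the number of pairwise comparisons possible on the datasets it shares with other methods, and its Normalized Winning Number as the fraction of those comparisons it wins. The function names and the input layout (method mapped to a dataset-to-score dictionary) are hypothetical.

```python
def winning_numbers(results):
    """Return {method: (normalized_winning_number, ideal_winning_number)}.

    `results` maps method -> {dataset: score}; a dataset on which a method
    was not evaluated is simply absent (the sparse case).
    ASSUMPTION: these definitions are one reading of the abstract, not
    necessarily the paper's exact formulas.
    """
    summary = {}
    for method, scores in results.items():
        wins = 0      # comparisons this method won
        possible = 0  # comparisons that could be made at all (ideal number)
        for other, other_scores in results.items():
            if other == method:
                continue
            for dataset, score in scores.items():
                if dataset in other_scores:
                    possible += 1
                    if score > other_scores[dataset]:
                        wins += 1
        normalized = wins / possible if possible else 0.0
        summary[method] = (normalized, possible)
    return summary


def pareto_front(summary):
    """Methods that no other method matches or beats in both dimensions
    while strictly beating them in at least one."""
    front = []
    for method, (n, p) in summary.items():
        dominated = any(
            n2 >= n and p2 >= p and (n2 > n or p2 > p)
            for other, (n2, p2) in summary.items()
            if other != method
        )
        if not dominated:
            front.append(method)
    return sorted(front)
```

With made-up scores for three placeholder methods, for example `{"A": {"d1": 0.50, "d2": 0.60, "d3": 0.55}, "B": {"d1": 0.55, "d2": 0.62}, "C": {"d1": 0.60}}`, method C wins every comparison it takes part in but takes part in few, while B wins fewer but is compared more often. Both sit on the Pareto front, which mirrors the accuracy-versus-certainty trade-off the abstract describes.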

University of Twente Research Information

Information Retrieval using applied Supervised Learning for Personalized E-Commerce

Author: Hellum Kjell Arne
Publication venue: University of Stavanger, Norway
Publication date: 15/06/2017

Master's thesis in Computer Science on the Personalized E-Commerce Search Challenge issued by the International Conference on Information and Knowledge Management. By analyzing historical data containing browsing logs, queries, user interactions, and static data in the domain of an online retail service, we attempt to extract patterns and derive features from the data collection that will subsequently improve prediction of relevant products. A selection of supervised learning models is trained on combinations of these features to make predictions on test data. Prediction is performed on the queries given by the data collection, paired with each product item originally appearing in the query. We experiment with different combinations of features and models and compare the results to maximize predictive power. Lastly, the quality of the predictions is evaluated against a ground truth to yield scores.

UiS Brage

Business Analytics for Non-profit Marketing and Online Advertising

Author: Chang Wei
Publication date: 02/07/2013

Business analytics is facing formidable challenges in the Internet era. Data collected from business websites often contain hundreds of millions of records; the goal of analysis frequently involves predicting rare events; and substantial noise in the form of errors or unstructured text cannot be interpreted automatically. It is thus necessary to identify pertinent techniques or new methods to tackle these difficulties. Learning-to-rank, an emerging approach in information retrieval research, has attracted our attention for its superiority in handling noisy data with rare events. In this dissertation, we introduce this technique to the marketing science community, apply it to predict customers' responses to donation solicitations by the American Red Cross, and show that it outperforms traditional regression methods. We adapt the original learning-to-rank algorithm to better serve the needs of business applications relevant to such solicitations. The proposed algorithm is effective and efficient in predicting potential donors. Namely, through the adapted learning-to-rank algorithm, we are able to identify the most important 20% of potential donors, who would provide 80% of the actual donations. The latter half of the dissertation is dedicated to the application of business analytics to online advertising. The goal is to model visitors' click-through probability on advertising video clips at a hedonic video website. We build a hierarchical linear model with latent variables and show its superiority in comparison to two other benchmark models. This research helps online business managers derive insights into the site visitors' characteristics that affect their click-through propensity, and recommends managerial actions to increase advertising effectiveness.

D-Scholarship@Pitt

Graph Inference with Applications to Low-Resource Audio Search and Indexing

Author: Levin Keith David
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 26/07/2017

The task of query-by-example search is to retrieve, from among a collection of data, the observations most similar to a given query. A common approach to this problem is based on viewing the data as vertices in a graph in which edge weights reflect similarities between observations. Errors arise in this graph-based framework both from errors in measuring these similarities and from approximations required for fast retrieval. In this thesis, we use tools from graph inference to analyze and control the sources of these errors. We establish novel theoretical results related to representation learning and to vertex nomination, and use these results to control the effects of model misspecification, noisy similarity measurement and approximation error on search accuracy. We present a state-of-the-art system for query-by-example audio search in the context of low-resource speech recognition, which also serves as an illustrative example and testbed for applying our theoretical results.

JScholarship

Learning, deducing and linking entities

Author: Tugay Resul
Publication venue: The University of Edinburgh
Publication date: 25/10/2023

Improving the quality of data is a critical issue in data management and machine learning, and finding the most representative and concise way to achieve this is a key challenge. Learning how to represent entities accurately is essential for various tasks in data science, such as generating better recommendations and more accurate question answering. Thus, the amount and quality of information available on an entity can greatly impact the quality of results of downstream tasks. This thesis focuses on two specific areas to improve data quality: (i) learning and deducing entities for data currency (i.e., how up-to-date information is), and (ii) linking entities across different data sources. The first technical contribution is GATE (Get the lATEst), a framework that combines deep learning and rule-based methods to find up-to-date information about an entity. GATE learns and deduces temporal orders on attribute values in a set of tuples that pertain to the same entity. It is based on a creator-critic framework: the creator trains a neural ranking model to learn temporal orders and rank attribute values based on correlations among the attributes. The critic then validates the temporal orders learned and deduces more ranked pairs by chasing the data with currency constraints; it also provides augmented training data as feedback for the creator to improve the ranking in the next round. The process proceeds until the temporal order obtained becomes stable. The second technical contribution is HER (Heterogeneous Entity Resolution), a framework that consists of a set of methods to link entities across relations and graphs. We propose a new notion, parametric simulation, to link entities across a relational database D and a graph G. Taking functions and thresholds for measuring vertex closeness, path associations and important properties as parameters, parametric simulation identifies tuples t in D and vertices v in G that refer to the same real-world entity, based on topological and semantic matching. We develop machine learning methods to learn the parameter functions and thresholds. Rather than concentrating on rule-based methods and machine learning algorithms separately to enhance data quality, we focus on combining both approaches to address the challenges of data currency and entity linking. We combine rule-based methods with state-of-the-art machine learning methods to represent entities, then use the representations of these entities for further tasks. These enhanced models, which combine machine learning and logic rules, help us represent entities better (i) to find the most up-to-date attribute values and (ii) to link them across relations and graphs.

Edinburgh Research Archive

Collecte orientée sur le Web pour la recherche d'information spécialisée

Author: DE GROC Clément
TANNIER Xavier
ZWEIGENBAUM Pierre
Publication date: 01/01/2013

Vertical search engines, which focus on a specific segment of the Web, are becoming more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index to a specific topic. By doing so, language ambiguities are reduced, and both the algorithms and the user interface can take advantage of domain knowledge, such as domain objects or characteristics, to satisfy user information needs. In this thesis, we tackle the first inevitable step of any topical search engine: focused document gathering from the Web. A thorough study of the state of the art leads us to consider two strategies to gather topical documents from the Web: either relying on an existing search engine index (focused search) or directly crawling the Web (focused crawling). The first part of our research is dedicated to focused search. In this context, a standard approach consists in combining domain-specific terms into queries, submitting those queries to a search engine and downloading the top-ranked documents. After empirically evaluating this approach over 340 topics drawn from the OpenDirectory, we propose to enhance it in two ways. Upstream of the search engine, we aim at formulating more relevant queries in order to increase the precision of the top retrieved documents. To do so, we define a metric based on a co-occurrence graph and a random walk algorithm, which aims at predicting the topical relevance of a query. Downstream of the search engine, we filter the retrieved documents in order to improve the document collection quality. We do so by modeling our gathering process as a tripartite graph and applying a random walk with restart algorithm so as to simultaneously order by relevance the documents and the terms appearing in our corpus. In the second part of this thesis, we turn to focused crawling. We describe our focused crawler implementation, which was designed to scale horizontally. Then, we consider the problem of crawl frontier ordering, which is at the very heart of a focused crawler. Such an ordering strategy allows the crawler to prioritize its fetches, maximizing the number of in-domain documents retrieved while minimizing the non-relevant ones. We propose to apply learning to rank algorithms to efficiently order the crawl frontier, and define a method to learn a topic-independent ranking function from existing, automatically annotated crawls.
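
The random walk with restart mentioned in this abstract can be sketched as a simple power iteration. This is an illustrative sketch only, not the thesis's implementation: the abstract does not say what the three node types of the tripartite graph are, so queries, documents and terms are assumed here, and the function name, restart probability and iteration count are hypothetical choices.

```python
from collections import defaultdict


def random_walk_with_restart(edges, seeds, restart=0.15, iterations=100):
    """Score every node by a random walk that keeps restarting at the seeds.

    `edges` is an iterable of (node, node) pairs treated as undirected;
    `seeds` is the set of nodes the walker jumps back to (assumed to be
    present in the graph). Returns {node: score}; scores sum to 1.
    """
    # Build an undirected adjacency list.
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = sorted(neighbours)

    # The walker restarts uniformly over the seed nodes.
    restart_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart_mass)

    for _ in range(iterations):
        # Each step: a fixed share returns to the seeds, and the rest
        # spreads evenly from every node to its neighbours.
        nxt = {n: restart * restart_mass[n] for n in nodes}
        for n in nodes:
            share = (1.0 - restart) * score[n] / len(neighbours[n])
            for m in neighbours[n]:
                nxt[m] += share
        score = nxt
    return score
```

Seeding the walk at the query nodes and then sorting the document nodes and the term nodes separately by score gives one relevance ordering for each, which is the kind of simultaneous ranking the abstract describes. A document linked to more seed queries ends up with a higher score than one linked to fewer, all else being equal.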

OpenGrey Repository

Ranking and Retrieval under Semantic Relevance

Author: Chen Tongfei
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 16/02/2021

This thesis presents a series of conceptual and empirical developments on the ranking and retrieval of candidates under semantic relevance. Part I of the thesis introduces the concept of uncertainty in various semantic tasks (such as recognizing textual entailment) in natural language processing, and the machine learning techniques commonly employed to model these semantic phenomena. A unified view of ranking and retrieval is presented, and the trade-off between model expressiveness, performance, and scalability in model design is discussed. Part II of the thesis focuses on applying these ranking and retrieval techniques to text: Chapter 3 examines the feasibility of ranking hypotheses given a premise with respect to a human's subjective probability of the hypothesis happening, effectively extending the traditional categorical task of natural language inference. Chapter 4 focuses on detecting situation frames for documents using ranking methods. Then we extend the ranking notion to retrieval, and develop both sparse (Chapter 5) and dense (Chapter 6) vector-based methods to facilitate scalable retrieval for potential answer paragraphs in question answering. Part III turns the focus to mentions and entities in text, while continuing the theme of ranking and retrieval: Chapter 7 discusses the ranking of fine-grained types that an entity mention could belong to, leading to state-of-the-art performance on hierarchical multi-label fine-grained entity typing. Chapter 8 extends the semantic relation of coreference to a cross-document setting, enabling models to retrieve from a large corpus, instead of from a single document, when resolving coreferent entity mentions.

Johns Hopkins University

JScholarship