
    A cross-benchmark comparison of 87 learning to rank methods

    Learning to rank is an increasingly important scientific field that comprises the use of machine learning for the ranking task. New learning to rank methods are generally evaluated on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by the absence of a standard set of evaluation benchmark collections. In this paper we propose a way to compare learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets. Our comparison methodology consists of two components: (1) Normalized Winning Number, which gives insight into the ranking accuracy of the learning to rank method, and (2) Ideal Winning Number, which gives insight into the degree of certainty concerning its ranking accuracy. Evaluation results of 87 learning to rank methods on 20 well-known benchmark datasets are collected through a structured literature search. ListNet, SmoothRank, FenchelRank, FSMRank, LRUF and LARF are Pareto optimal learning to rank methods in the Normalized Winning Number and Ideal Winning Number dimensions, listed in increasing order of Normalized Winning Number and decreasing order of Ideal Winning Number.
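    The pairwise-comparison idea behind the two winning numbers can be sketched in a few lines. The method names below match methods from the abstract, but the scores and datasets are invented for illustration and are not results from the paper:

```python
# Hedged sketch of Normalized Winning Number (NWN): for each method, count
# pairwise comparisons won against every other method on datasets where both
# reported a score, and normalize by the number of comparisons possible
# (the Ideal Winning Number). results[method][dataset] = evaluation score.

def winning_numbers(results):
    stats = {}
    for m in results:
        wins = 0       # pairwise comparisons won
        possible = 0   # comparisons where both methods reported a score
        for other in results:
            if other == m:
                continue
            shared = set(results[m]) & set(results[other])
            possible += len(shared)
            wins += sum(1 for d in shared if results[m][d] > results[other][d])
        # (Ideal Winning Number, Normalized Winning Number)
        stats[m] = (possible, wins / possible if possible else 0.0)
    return stats

results = {
    "ListNet":    {"MQ2007": 0.44, "MQ2008": 0.48},
    "SmoothRank": {"MQ2007": 0.46},
    "FSMRank":    {"MQ2008": 0.49},
}
print(winning_numbers(results))
```

    A sparse results table is handled naturally: a method evaluated on few datasets can still reach a high Normalized Winning Number, but its low Ideal Winning Number flags the limited evidence.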

    Information Retrieval using applied Supervised Learning for Personalized E-Commerce

    Master's thesis in Computer Science. This thesis addresses the Personalized E-Commerce Search Challenge issued by the International Conference on Information and Knowledge Management. By analyzing historical data containing browsing logs, queries, user interactions, and static data in the domain of an online retail service, we attempt to extract patterns and derive features from the data collection that will subsequently improve prediction of relevant products. A selection of supervised learning models utilizes an assembly of these features and is trained to make predictions on test data. Prediction is performed on the queries given by the data collection, paired with each product item originally appearing in the query. We experiment with the possible assemblies of features along with the models and compare the results to achieve maximum prediction power. Lastly, the quality of the predictions is evaluated against a ground truth to yield scores.
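    The feature-derivation and scoring pipeline described above can be illustrated with a toy example. The features (title overlap, click count), the linear weights, and the data below are all hypothetical and are not the thesis's actual feature set or model:

```python
# Illustrative sketch: derive simple features from browsing logs for each
# (query, product) pair and rank a query's candidate products by a linear
# score. Real systems would train the weights with a supervised learner.

def features(query, product, clicks):
    # Feature 1: number of query terms appearing in the product title.
    overlap = len(set(query.split()) & set(product["title"].split()))
    # Feature 2: historical click count for this product.
    return [overlap, clicks.get(product["id"], 0)]

def rank(query, products, clicks, weights=(1.0, 0.5)):
    scored = [(sum(w * f for w, f in zip(weights, features(query, p, clicks))),
               p["id"])
              for p in products]
    return [pid for _, pid in sorted(scored, reverse=True)]

clicks = {"p1": 3, "p2": 0}
products = [{"id": "p1", "title": "red shoes"}, {"id": "p2", "title": "blue hat"}]
print(rank("red shoes", products, clicks))
```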

    Business Analytics for Non-profit Marketing and Online Advertising

    Business analytics is facing formidable challenges in the Internet era. Data collected from business websites often contain hundreds of millions of records; the goal of analysis frequently involves predicting rare events; and substantial noise in the form of errors or unstructured text cannot be interpreted automatically. It is thus necessary to identify pertinent techniques or new methods to tackle these difficulties. Learning-to-rank, an emerging approach in information retrieval research, has attracted our attention for its superiority in handling noisy data with rare events. In this dissertation, we introduce this technique to the marketing science community, apply it to predict customers’ responses to donation solicitations by the American Red Cross, and show that it outperforms traditional regression methods. We adapt the original learning-to-rank algorithm to better serve the needs of business applications relevant to such solicitations. The proposed algorithm is effective and efficient in predicting potential donors. Namely, through the adapted learning-to-rank algorithm, we are able to identify the most important 20% of potential donors, who would provide 80% of the actual donations. The latter half of the dissertation is dedicated to the application of business analytics to online advertising. The goal is to model visitors’ click-through probability on advertising video clips at a hedonic video website. We build a hierarchical linear model with latent variables and show its superiority in comparison to two other benchmark models. This research helps online business managers derive insights into the site visitors’ characteristics that affect their click-through propensity, and recommends managerial actions to increase advertising effectiveness.
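    The 20%/80% evaluation described above amounts to ranking donors by predicted score and measuring what share of total donations the top fifth accounts for. The scores and donation amounts below are made-up illustration data, not figures from the dissertation:

```python
# Hedged sketch of the top-20% lift evaluation: sort candidates by model
# score, take the top `fraction` of the ranking, and report the share of
# total donations captured by that slice.

def top_share(scores, donations, fraction=0.2):
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    captured = sum(donations.get(d, 0) for d in ranked[:k])
    total = sum(donations.values())
    return captured / total

scores = {f"d{i}": 10 - i for i in range(10)}          # d0 scored highest
donations = {"d0": 50, "d1": 30, "d2": 10, "d9": 10}   # others gave nothing
print(top_share(scores, donations))
```

    A ranking that concentrates actual donors near the top yields a share well above the 20% a random ordering would capture in expectation.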

    Graph Inference with Applications to Low-Resource Audio Search and Indexing

    The task of query-by-example search is to retrieve, from among a collection of data, the observations most similar to a given query. A common approach to this problem is based on viewing the data as vertices in a graph in which edge weights reflect similarities between observations. Errors arise in this graph-based framework both from errors in measuring these similarities and from approximations required for fast retrieval. In this thesis, we use tools from graph inference to analyze and control the sources of these errors. We establish novel theoretical results related to representation learning and to vertex nomination, and use these results to control the effects of model misspecification, noisy similarity measurement, and approximation error on search accuracy. We present a state-of-the-art system for query-by-example audio search in the context of low-resource speech recognition, which also serves as an illustrative example and testbed for applying our theoretical results.
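    The similarity-graph view above can be made concrete with a minimal sketch. The collection, feature vectors, and the choice of cosine similarity as the edge weight are assumptions for illustration, not the thesis's actual audio features:

```python
# Minimal query-by-example retrieval: edge weights between the query and
# stored observations are cosine similarities of their feature vectors,
# and the answer is the k highest-weight neighbours.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def query_by_example(query_vec, collection, k=2):
    ranked = sorted(collection,
                    key=lambda name: cosine(query_vec, collection[name]),
                    reverse=True)
    return ranked[:k]

collection = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0)}
print(query_by_example((1.0, 0.05), collection))
```

    In practice the exact scan above is replaced by approximate nearest-neighbour search over the graph, which is precisely where the approximation errors the thesis analyzes enter.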

    Learning, deducing and linking entities

    Improving the quality of data is a critical issue in data management and machine learning, and finding the most representative and concise way to achieve this is a key challenge. Learning how to represent entities accurately is essential for various tasks in data science, such as generating better recommendations and more accurate question answering. Thus, the amount and quality of information available on an entity can greatly impact the quality of results of downstream tasks. This thesis focuses on two specific areas to improve data quality: (i) learning and deducing entities for data currency (i.e., how up-to-date information is), and (ii) linking entities across different data sources. The first technical contribution is GATE (Get the lATEst), a framework that combines deep learning and rule-based methods to find up-to-date information about an entity. GATE learns and deduces temporal orders on attribute values in a set of tuples that pertain to the same entity. It is based on a creator-critic framework: the creator trains a neural ranking model to learn temporal orders and rank attribute values based on correlations among the attributes. The critic then validates the temporal orders learned and deduces more ranked pairs by chasing the data with currency constraints; it also provides augmented training data as feedback for the creator to improve the ranking in the next round. The process proceeds until the temporal order obtained becomes stable. The second technical contribution is HER (Heterogeneous Entity Resolution), a framework that consists of a set of methods to link entities across relations and graphs. We propose a new notion, parametric simulation, to link entities across a relational database D and a graph G. Taking functions and thresholds for measuring vertex closeness, path associations and important properties as parameters, parametric simulation identifies tuples t in D and vertices v in G that refer to the same real-world entity, based on topological and semantic matching. We develop machine learning methods to learn the parameter functions and thresholds. Rather than concentrating on rule-based methods and machine learning algorithms separately to enhance data quality, we focus on combining both approaches to address the challenges of data currency and entity linking. We combine rule-based methods with state-of-the-art machine learning methods to represent entities, then use these representations for further tasks. These enhanced models, combining machine learning and logic rules, help us represent entities better (i) to find the most up-to-date attribute values and (ii) to link them across relations and graphs.
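    One simple deduction the critic's chase step can perform is taking the transitive closure of the temporal orders the creator has learned. The sketch below implements only that one step, with made-up attribute values; GATE's actual chase uses richer currency constraints:

```python
# Hedged sketch of deducing more ranked pairs by transitivity: each pair
# (x, y) means "value x is more current than value y". Chasing to a fixpoint
# adds every pair implied by chaining existing ones.

def chase(pairs):
    closed = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed

learned = {("divorced", "married"), ("married", "single")}
print(chase(learned))   # also deduces ("divorced", "single")
```

    The deduced pairs can then feed back to the creator as augmented training data, mirroring the creator-critic loop described above.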

    Collecte orientée sur le Web pour la recherche d'information spécialisée

    Vertical search engines, which focus on a specific segment of the Web, are becoming more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index to a specific topic. By doing so, language ambiguities are reduced, and both the algorithms and the user interface can take advantage of domain knowledge, such as domain objects or characteristics, to satisfy user information needs. In this thesis, we tackle the first inevitable step of any topical search engine: focused document gathering from the Web. A thorough study of the state of the art leads us to consider two strategies to gather topical documents from the Web: either relying on an existing search engine index (focused search) or directly crawling the Web (focused crawling). 
    The first part of our research is dedicated to focused search. In this context, a standard approach consists in combining domain-specific terms into queries, submitting those queries to a search engine, and downloading the top-ranked documents. After empirically evaluating this approach over 340 topics drawn from the OpenDirectory, we propose to enhance it in two ways. Upstream of the search engine, we aim at formulating queries that are more relevant to the topic, in order to increase the precision of the top retrieved documents. To do so, we define a metric based on a co-occurrence graph and a random walk algorithm, which aims at predicting the topical relevance of a query. Downstream of the search engine, we filter the retrieved documents in order to improve the quality of the resulting collection. We do so by modeling the gathering process as a tripartite graph and applying a random walk with restart algorithm, so as to simultaneously order by relevance the documents and the terms appearing in them. 
    In the second part of this thesis, we turn to focused crawling. We describe our focused crawler implementation, which was designed to scale horizontally. We then consider the problem of crawl frontier ordering, which is at the very heart of a focused crawler: the ordering strategy allows the crawler to prioritize its fetches, maximizing the number of in-domain documents retrieved while minimizing the number of irrelevant ones. We propose to apply learning-to-rank algorithms to efficiently order the crawl frontier, and define a method to learn a topic-independent ranking function from existing, automatically annotated crawls.
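    The random walk with restart used for the downstream filtering step can be sketched on a tiny graph. The graph below (one query node, two documents, one term), the damping value, and the iteration count are assumptions for the example, not the thesis's actual tripartite construction:

```python
# Illustrative random walk with restart (RWR): the walker follows edges
# uniformly with probability 1 - alpha and teleports back to the restart
# nodes with probability alpha; stationary scores order nodes by relevance.

def rwr(graph, restart_nodes, alpha=0.15, iters=50):
    nodes = list(graph)
    score = {n: (1.0 / len(restart_nodes) if n in restart_nodes else 0.0)
             for n in nodes}
    restart = dict(score)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            for m in out:
                nxt[m] += (1 - alpha) * score[n] / len(out)
        score = nxt
    return score

# Toy graph: query q links to documents d1, d2; d1 also contains term t1.
graph = {"q": ["d1", "d2"], "d1": ["q", "t1"], "d2": ["q"], "t1": ["d1"]}
scores = rwr(graph, restart_nodes={"q"})
print(sorted(scores, key=scores.get, reverse=True))
```

    Here d1 outranks d2 because it receives probability mass both from the query and from its term, which is exactly the mutual document-term reinforcement the tripartite model exploits.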

    Ranking and Retrieval under Semantic Relevance

    This thesis presents a series of conceptual and empirical developments on the ranking and retrieval of candidates under semantic relevance. Part I of the thesis introduces the concept of uncertainty in various semantic tasks (such as recognizing textual entailment) in natural language processing, and the machine learning techniques commonly employed to model these semantic phenomena. A unified view of ranking and retrieval is presented, and the trade-off between model expressiveness, performance, and scalability in model design is discussed. Part II of the thesis focuses on applying these ranking and retrieval techniques to text: Chapter 3 examines the feasibility of ranking hypotheses given a premise with respect to a human's subjective probability of the hypothesis happening, effectively extending the traditional categorical task of natural language inference. Chapter 4 focuses on detecting situation frames for documents using ranking methods. We then extend the ranking notion to retrieval, and develop both sparse (Chapter 5) and dense (Chapter 6) vector-based methods to facilitate scalable retrieval of potential answer paragraphs in question answering. Part III turns the focus to mentions and entities in text, while continuing the theme of ranking and retrieval: Chapter 7 discusses the ranking of fine-grained types that an entity mention could belong to, leading to state-of-the-art performance on hierarchical multi-label fine-grained entity typing. Chapter 8 extends the semantic relation of coreference to a cross-document setting, enabling models to retrieve from a large corpus, instead of within a single document, when resolving coreferent entity mentions.
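    The sparse retrieval setting mentioned above (scalable retrieval of candidate answer paragraphs) can be illustrated with an inverted index. The paragraphs and the simple term-overlap scoring below are toy assumptions, not the thesis's actual models:

```python
# Sketch of sparse retrieval via an inverted index: map each term to the
# paragraphs containing it, then score candidates by how many question
# terms they match. Real systems would weight terms (e.g. TF-IDF/BM25).
from collections import defaultdict

def build_index(paragraphs):
    index = defaultdict(set)
    for pid, text in paragraphs.items():
        for term in text.lower().split():
            index[term].add(pid)
    return index

def retrieve(index, question, k=1):
    counts = defaultdict(int)
    for term in question.lower().split():
        for pid in index.get(term, ()):
            counts[pid] += 1
    return sorted(counts, key=counts.get, reverse=True)[:k]

paragraphs = {"p1": "the capital of France is Paris",
              "p2": "Berlin is the capital of Germany"}
index = build_index(paragraphs)
print(retrieve(index, "what is the capital of France"))
```

    Dense retrieval replaces the exact term match with nearest-neighbour search over learned embeddings, trading exactness for semantic generalization.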