
    Embedding-based Retrieval in Facebook Search

    Search in social networks such as Facebook poses different challenges than classical web search: besides the query text, it is important to take the searcher's context into account to provide relevant results. The searcher's social graph is an integral part of this context and a unique aspect of Facebook search. While embedding-based retrieval (EBR) has been applied in web search engines for years, Facebook search was still based mainly on a Boolean matching model. In this paper, we discuss the techniques for applying EBR to a Facebook Search system. We introduce the unified embedding framework developed to model semantic embeddings for personalized search, and the system built to serve embedding-based retrieval in a typical search system based on an inverted index. We discuss various tricks and experiences from end-to-end optimization of the whole system, including ANN parameter tuning and full-stack optimization. Finally, we present our progress on two selected advanced modeling topics. We evaluated EBR on verticals of Facebook Search and observed significant metrics gains in online A/B experiments. We believe this paper will provide useful insights and experiences to help people develop embedding-based retrieval systems in search engines. Comment: 9 pages, 3 figures, 3 tables, to be published in KDD '20
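
    As a concrete illustration of the retrieval stage described above, the sketch below ranks documents against a query by cosine similarity of their embeddings, which is what an ANN index approximates at scale. It is a generic, hypothetical example rather than Facebook's unified embedding model: the embed function here is a random placeholder standing in for a trained two-tower encoder.

    # Minimal embedding-based retrieval sketch (illustrative; the encoder is a
    # placeholder for a trained model such as a two-tower network).
    import numpy as np

    def embed(text: str, dim: int = 64) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(dim)
        return v / np.linalg.norm(v)          # unit-normalize so dot product = cosine

    documents = ["coffee shops near me", "friends named Alex", "hiking groups"]
    doc_matrix = np.stack([embed(d) for d in documents])

    query_vec = embed("local cafes")
    scores = doc_matrix @ query_vec           # cosine similarity against all documents
    for i in np.argsort(-scores)[:2]:         # top-k nearest-neighbour candidates
        print(documents[i], float(scores[i]))

    In production, an approximate nearest-neighbour index replaces this exhaustive dot product; the ANN parameter tuning discussed in the paper concerns exactly that trade-off between recall and latency.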

    Towards Personalized and Human-in-the-Loop Document Summarization

    The ubiquitous availability of computing devices and the widespread use of the internet continuously generate large amounts of data. As a result, the amount of available information on any given topic far exceeds humans' capacity to process it properly, causing what is known as information overload. To cope efficiently with large amounts of information and generate content of significant value to users, we need to identify, merge and summarise information. Data summaries can help gather related information and collect it into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries. This thesis focuses on three main challenges to alleviate information overload using novel summarisation techniques. It further aims to facilitate the analysis of documents to support personalised information extraction. The thesis separates the research issues into four areas, covering (i) feature engineering in document summarisation, (ii) traditional static and inflexible summaries, (iii) traditional generic summarisation approaches, and (iv) the need for reference summaries. We propose novel approaches to tackle these challenges by (i) enabling automatic intelligent feature engineering, (ii) enabling flexible and interactive summarisation, and (iii) utilising intelligent and personalised summarisation approaches. The experimental results demonstrate the efficiency of the proposed approaches compared with other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data. Comment: PhD thesis
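
    The summarisation techniques themselves are not detailed in the abstract; as a rough illustration of extractive summarisation in general (not the thesis's personalised, interactive approach), the sketch below scores sentences by the frequency of their words and keeps the top-scoring ones.

    # Generic extractive-summarisation sketch (illustrative only; the thesis's
    # personalised, interactive methods are more sophisticated).
    import re
    from collections import Counter

    def summarise(text, k=2):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"\w+", text.lower()))            # word frequencies
        score = lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower()))
        return sorted(sentences, key=score, reverse=True)[:k]       # top-k sentences

    print(summarise("Data grows quickly. Summaries condense data. Summaries help users answer questions."))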

    Meta-Learning and the Full Model Selection Problem

    When working as a data analyst, one of my daily tasks is to select appropriate tools from a set of existing data analysis techniques in my toolbox, including data preprocessing, outlier detection, feature selection, learning algorithms and evaluation techniques, for a given data project. This was indeed an enjoyable job at the beginning, because to me finding patterns and valuable information in data is always fun. Things became tricky when several projects needed to be done in a relatively short time. Naturally, as a computer science graduate, I started to ask myself, "What can be automated here?", because, intuitively, part of my work is more or less a loop that can be programmed. Literally, the loop is "choose, run, test and choose again... until some criterion or goal is met". In other words, I use my experience and knowledge of machine learning and data mining to guide and speed up the process of selecting and applying techniques in order to build a reasonably good predictive model for a given dataset and purpose. So the following questions arise: "Is it possible to design and implement a system that helps a data analyst choose from a set of data mining tools? Or at least one that provides useful recommendations about tools and thereby saves some time for a human analyst?" To answer these questions, I decided to undertake a long-term study on this topic: to think about, define, research, and simulate this problem before coding my dream system. This thesis presents research results, including new methods, algorithms, and theoretical and empirical analyses, from two directions, both of which propose systematic and efficient solutions to the questions above with different resource requirements, namely the meta-learning-based algorithm/parameter ranking approach and the meta-heuristic search-based full-model selection approach. Some of the results have been published in research papers; thus, this thesis also serves as a coherent collection of those results in a single volume.
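
    The "choose, run, test and choose again" loop can be sketched concretely with scikit-learn. The example below is a hypothetical illustration of that loop only, not the thesis's meta-learning or meta-heuristic search systems: each candidate pipeline is cross-validated and the best-scoring one is kept.

    # A bare-bones "choose, run, test and choose again" loop (illustrative).
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    candidates = {
        "scaled_logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)),
        "decision_tree": make_pipeline(DecisionTreeClassifier(max_depth=3)),
    }

    best_name, best_score = None, -1.0
    for name, model in candidates.items():
        score = cross_val_score(model, X, y, cv=5).mean()   # run and test
        if score > best_score:                              # choose again
            best_name, best_score = name, score
    print(best_name, round(best_score, 3))

    A full model selection system would also search over preprocessing steps, feature selection and hyperparameters, which is what makes the problem expensive and motivates meta-learning to prune the search.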

    Label Ranking with Probabilistic Models

    This thesis focuses on a particular prediction task, so-called label ranking. In a nutshell, label ranking can be seen as an extension of the conventional classification problem. Given a query (e.g., from a customer) and a predefined set of candidate labels (e.g., AUDI, BMW, VW), classification requires the prediction of a single label (e.g., BMW), whereas label ranking requires a complete ranking of all labels (e.g., BMW > VW > AUDI). Since predictions of this kind are useful in many real-world problems, label ranking methods can be applied in several domains, including information retrieval, customer preference learning and e-commerce. This thesis presents a selection of methods for label ranking that combine machine learning with statistical ranking models. We focus on two statistical ranking models, the Mallows and the Plackett-Luce model, and two machine learning techniques, instance-based learning and generalized linear models.
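
    To make the Plackett-Luce model mentioned above concrete, the sketch below computes the probability of a ranking as a product of successive choice probabilities, where each label is chosen in proportion to a positive worth parameter; the worth values here are hypothetical, and the thesis's learning methods for fitting them are not shown.

    # Plackett-Luce ranking probability (minimal sketch with hypothetical worths).
    def plackett_luce_prob(ranking, worth):
        prob = 1.0
        remaining = list(ranking)
        while remaining:
            prob *= worth[remaining[0]] / sum(worth[l] for l in remaining)
            remaining.pop(0)                       # the chosen label leaves the pool
        return prob

    worth = {"BMW": 3.0, "VW": 2.0, "AUDI": 1.0}   # hypothetical worth parameters
    print(plackett_luce_prob(["BMW", "VW", "AUDI"], worth))   # ~0.333, most likely ranking
    print(plackett_luce_prob(["AUDI", "VW", "BMW"], worth))   # ~0.067, least likely ranking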

    On-the-fly Table Generation

    Many information needs revolve around entities and would be better answered by summarizing results in a tabular format rather than presenting them as a ranked list. Unlike previous work, which is limited to retrieving existing tables, we aim to answer queries by automatically compiling a table in response to a query. We introduce and address the task of on-the-fly table generation: given a query, generate a relational table that contains relevant entities (as rows) along with their key properties (as columns). This problem is decomposed into three specific subtasks: (i) core column entity ranking, (ii) schema determination, and (iii) value lookup. We employ a feature-based approach for entity ranking and schema determination, combining deep semantic features with task-specific signals. We further show that these two subtasks are not independent of each other and can assist each other in an iterative manner. For value lookup, we combine information from existing tables and a knowledge base. Using two sets of entity-oriented queries, we evaluate our approach both at the component level and on the end-to-end table generation task. Comment: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval
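
    The three-stage decomposition lends itself to a simple pipeline skeleton. The sketch below is purely schematic: the overlap-based entity scorer, the fixed schema and the toy knowledge base are hypothetical stand-ins for the paper's feature-based ranking and iterative schema determination.

    # Schematic pipeline for on-the-fly table generation (stand-in scorers).
    def rank_core_entities(query, entity_pool, k=2):
        q = set(query.lower().split())
        return sorted(entity_pool,
                      key=lambda e: -len(q & set(e.lower().split())))[:k]

    def determine_schema(query, entities):
        return ["name", "country", "founded"]      # fixed here; learned in the paper

    def lookup_values(entities, schema, kb):
        return [[e] + [kb.get((e, col), "") for col in schema[1:]] for e in entities]

    kb = {("Airbus", "country"): "France", ("Airbus", "founded"): "1970",
          ("Boeing", "country"): "USA",    ("Boeing", "founded"): "1916"}
    entities = rank_core_entities("airbus boeing aircraft", ["Airbus", "Boeing", "Tesla"])
    print(lookup_values(entities, determine_schema("airbus boeing aircraft", entities), kb))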

    Computational Diagnosis of Skin Lesions from Dermoscopic Images using Combined Features

    There has been an alarming increase in the number of skin cancer cases worldwide in recent years, which has raised interest in computational systems for automatic diagnosis to assist early diagnosis and prevention. Feature extraction to describe skin lesions is a challenging research area due to the difficulty of selecting meaningful features. The main objective of this work is to find the best combination of features, based on shape properties, colour variation and texture analysis, to be extracted using various feature extraction methods. Several colour spaces are used for the extraction of both colour- and texture-related features. Different categories of classifiers were adopted to evaluate the proposed feature extraction step, and several feature selection algorithms were compared for the classification of skin lesions. The developed computational diagnosis system for skin lesions was applied to a set of 1104 dermoscopic images using a cross-validation procedure. The best results were obtained with an optimum-path forest classifier: the proposed system achieved an accuracy of 92.3%, sensitivity of 87.5% and specificity of 97.1% when the full set of features was used. Furthermore, it achieved an accuracy of 91.6%, sensitivity of 87% and specificity of 96.2% when 50 features were selected using a correlation-based feature selection algorithm.
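
    As a rough sketch of what combined features can look like in code, the example below concatenates simple colour statistics with crude texture contrasts and evaluates a classifier under cross-validation. It is illustrative only: neither the feature set nor the random forest matches the paper's features or its optimum-path forest classifier, and the images and labels are synthetic placeholders.

    # Illustrative combined colour + texture features for lesion classification.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def combined_features(image):                  # image: H x W x 3 array in [0, 1]
        colour = np.concatenate([image.mean(axis=(0, 1)), image.std(axis=(0, 1))])
        grey = image.mean(axis=2)
        texture = [np.abs(np.diff(grey, axis=0)).mean(),   # crude vertical contrast
                   np.abs(np.diff(grey, axis=1)).mean()]   # crude horizontal contrast
        return np.concatenate([colour, texture])

    rng = np.random.default_rng(0)
    images = rng.random((40, 32, 32, 3))           # placeholder "dermoscopic" images
    labels = rng.integers(0, 2, size=40)           # placeholder benign/malignant labels
    X = np.stack([combined_features(im) for im in images])
    print(cross_val_score(RandomForestClassifier(random_state=0), X, labels, cv=5).mean())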

    Active Learning of Classification Models from Enriched Label-related Feedback

    Our ability to learn accurate classification models from data is often limited by the number of available labeled data instances. This limitation is of particular concern when data instances need to be manually labeled by human annotators and when the labeling process carries a significant cost. Recent years have witnessed increased research interest in methods, in different directions, capable of learning models from a smaller number of examples. One such direction is active learning, which finds the most informative unlabeled instances to be labeled next. Another, more recent direction showing great promise utilizes enriched label-related feedback. In this case, feedback from the human annotator provides additional information reflecting the relations among possible labels. The cost of such feedback is often negligible compared with the cost of instance review. Enriched label-related feedback may come in different forms. In this work, we propose, develop and study classification models for binary, multi-class and multi-label classification problems that utilize the different forms of enriched label-related feedback. We show that this new feedback can help us improve the quality of classification models compared with standard class-label feedback. For each of the studied feedback forms, we also develop new active learning strategies for selecting the most informative unlabeled instances compatible with the respective feedback form, effectively combining two approaches for reducing the number of required labeled instances. We demonstrate the effectiveness of our new framework on both simulated and real-world datasets.
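
    For context, the sketch below shows plain uncertainty (margin) sampling with standard class-label feedback, i.e. the baseline setting the work improves upon; the enriched-feedback strategies proposed in the work are not reproduced here, and the dataset is synthetic.

    # Standard active learning by uncertainty (margin) sampling -- baseline setting;
    # enriched label-related feedback is not modelled here.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    labeled, unlabeled = list(range(10)), list(range(10, 200))

    for _ in range(5):                                      # five acquisition rounds
        model = LogisticRegression(max_iter=500).fit(X[labeled], y[labeled])
        probs = model.predict_proba(X[unlabeled])
        margins = np.abs(probs[:, 0] - probs[:, 1])         # small margin = uncertain
        query = unlabeled.pop(int(np.argmin(margins)))      # most informative instance
        labeled.append(query)                               # annotator supplies its label
    print(model.score(X, y))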

    ARDA: Automatic Relational Data Augmentation for Machine Learning

    Automatic machine learning (AutoML) is a family of techniques to automate the process of training predictive models, aiming both to improve performance and to make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline such as model selection, hyperparameter tuning, and feature selection, relatively few have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search for and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of the different system components and benchmark our feature selection algorithm on real-world datasets.
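
    The two components map naturally onto a join step followed by feature pruning. The sketch below uses pandas plus a random-forest importance threshold as hypothetical stand-ins; ARDA's actual join discovery and feature selection algorithm are considerably more sophisticated, and the tables here are toy data.

    # Schematic join-then-prune augmentation (stand-in for ARDA's components).
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    base = pd.DataFrame({"key": [1, 2, 3, 4], "x": [0.1, 0.4, 0.3, 0.9],
                         "target": [1.0, 2.1, 1.7, 3.2]})
    repo = pd.DataFrame({"key": [1, 2, 3, 4], "useful": [0.9, 2.0, 1.5, 3.0],
                         "noise": [5.0, 5.0, 5.0, 5.0]})

    # (1) Join a candidate table from the repository onto the input data.
    augmented = base.merge(repo, on="key", how="left")

    # (2) Prune joined features that carry little signal for the target.
    X = augmented.drop(columns=["key", "target"])
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, augmented["target"])
    keep = [c for c, imp in zip(X.columns, model.feature_importances_) if imp > 0.05]
    print(keep)    # the constant "noise" column should be dropped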