Embedding-based Retrieval in Facebook Search
Search in social networks such as Facebook poses different challenges than in
classical web search: besides the query text, it is important to take into
account the searcher's context to provide relevant results. Their social graph
is an integral part of this context and is a unique aspect of Facebook search.
While embedding-based retrieval (EBR) has been applied in web search engines for
years, Facebook search was still mainly based on a Boolean matching model. In
this paper, we discuss the techniques for applying EBR to a Facebook Search
system. We introduce the unified embedding framework developed to model
semantic embeddings for personalized search, and the system to serve
embedding-based retrieval in a typical search system based on an inverted
index. We discuss various tricks and experiences on end-to-end optimization of
the whole system, including ANN parameter tuning and full-stack optimization.
Finally, we present our progress on two selected advanced topics about
modeling. We evaluated EBR on verticals for Facebook Search with significant
metrics gains observed in online A/B experiments. We believe this paper will
provide useful insights and experiences to help people develop
embedding-based retrieval systems in search engines.
Comment: 9 pages, 3 figures, 3 tables, to be published in KDD '2
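To make the core idea of embedding-based retrieval concrete, the following is a minimal sketch, not the system described in the abstract. It assumes queries and documents are already embedded as small vectors (the values below are made up) and ranks documents by cosine similarity with an exact brute-force scan. A production system would replace the scan with an approximate nearest neighbor (ANN) index; the function names `cosine` and `retrieve_top_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query.

    This is an exact scan over all documents; it stands in for the ANN
    lookup that a real retrieval system would use.
    """
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    # Highest similarity first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (illustrative values only).
docs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.7, 0.7]}
top = retrieve_top_k([1.0, 0.1], docs, k=2)
print(top)  # ['doc_a', 'doc_c']
```

The exact scan is linear in the number of documents, which is why large-scale systems trade a small amount of recall for speed by using ANN search instead.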
Towards Personalized and Human-in-the-Loop Document Summarization
The ubiquitous availability of computing devices and the widespread use of
the internet continuously generate large amounts of data. As a result, the
amount of available information on any given topic far exceeds humans'
processing capacity, causing what is known as information
overload. To efficiently cope with large amounts of information and generate
content with significant value to users, we need to identify, merge and
summarise information. Data summaries can help gather related information and
collect it into a shorter format that enables answering complicated questions,
gaining new insight and discovering conceptual boundaries.
This thesis focuses on three main challenges to alleviate information
overload using novel summarisation techniques. It further intends to facilitate
the analysis of documents to support personalised information extraction. This
thesis separates the research issues into four areas, covering (i) feature
engineering in document summarisation, (ii) traditional static and inflexible
summaries, (iii) traditional generic summarisation approaches, and (iv) the
need for reference summaries. We propose novel approaches to tackle these
challenges by: (i) enabling automatic intelligent feature engineering, (ii)
enabling flexible and interactive summarisation, and (iii) utilising intelligent and
personalised summarisation approaches. The experimental results prove the
efficiency of the proposed approaches compared to other state-of-the-art
models. We further propose solutions to the information overload problem in
different domains through summarisation, covering network traffic data, health
data and business process data.
Comment: PhD thesi
Meta-Learning and the Full Model Selection Problem
When working as a data analyst, one of my daily tasks is to select appropriate tools from a set of existing data analysis techniques in my toolbox, including data preprocessing, outlier detection, feature selection, learning algorithms and evaluation techniques, for a given data project. This was indeed an enjoyable job at the beginning, because to me finding patterns and valuable information in data is always fun. Things became tricky when several projects needed to be done in a relatively short time.
Naturally, as a computer science graduate, I started to ask myself, "What can be automated here?"; because, intuitively, part of my work is more or less a loop that can be programmed. Literally, the loop is "choose, run, test and choose again... until some criterion/goals are met".
In other words, I use my experience or knowledge about machine learning and data mining to guide and speed up the process of selecting and applying techniques in order to build a relatively good predictive model for a given dataset for some purpose. So the following questions arise:
"Is it possible to design and implement a system that helps a data analyst to choose from a set of data mining tools? Or at least one that provides useful recommendations about tools and so potentially saves a human analyst some time?"
To answer these questions, I decided to undertake a long-term study on this topic, to think, define, research, and simulate this problem before coding my dream system. This thesis presents research results, including new methods, algorithms, and theoretical and empirical analysis from two directions, both of which try to propose systematic and efficient solutions to the questions above, using different resource requirements, namely, the meta-learning-based algorithm/parameter ranking approach and the meta-heuristic search-based full-model selection approach.
Some of the results have been published in research papers; thus, this thesis also serves as a coherent collection of results in a single volume
Label Ranking with Probabilistic Models
This thesis focuses on a special form of prediction known as label ranking. In a nutshell, label ranking can be viewed as an extension of the conventional classification problem. Given a query (e.g., from a customer) and a predefined set of candidate labels (e.g., AUDI, BMW, VW), classification requires a single label (e.g., BMW) as the prediction, whereas label ranking requires a complete ranking of all labels (e.g., BMW > VW > Audi). Since predictions of this kind are useful in many real-world problems, label ranking methods can be applied in several areas, including information retrieval, customer preference learning, and e-commerce. This thesis presents a selection of label ranking methods that combine machine learning with statistical ranking models.
We focus on two statistical ranking models, the Mallows model and the Plackett-Luce model, and two machine learning techniques, instance-based learning and the generalized linear model
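As an illustration of one of the two ranking models named above, the sketch below computes the probability of a complete ranking under the standard Plackett-Luce model: labels are drawn one at a time without replacement, each with probability proportional to a positive "skill" parameter. The skill values are made up for the example, and the function name `plackett_luce_probability` is hypothetical; this is not the thesis's estimation procedure, only the model's likelihood.

```python
import itertools


def plackett_luce_probability(ranking, skill):
    """Probability of `ranking` (best label first) under a Plackett-Luce model.

    At each position, the next label is chosen from the labels not yet
    ranked with probability skill[label] / (sum of remaining skills).
    """
    remaining = sum(skill[label] for label in ranking)
    prob = 1.0
    for label in ranking:
        prob *= skill[label] / remaining
        remaining -= skill[label]
    return prob


# Illustrative skill parameters (assumed, not taken from the thesis).
skill = {"BMW": 3.0, "VW": 2.0, "AUDI": 1.0}

p = plackett_luce_probability(["BMW", "VW", "AUDI"], skill)
print(p)  # (3/6) * (2/3) * (1/1) = 1/3

# Sanity check: the probabilities of all 3! rankings sum to 1.
total = sum(plackett_luce_probability(list(r), skill)
            for r in itertools.permutations(skill))
```

A classifier would only report the single most likely top label (here BMW), whereas the model assigns a probability to every full ordering of the labels.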
On-the-fly Table Generation
Many information needs revolve around entities, which would be better
answered by summarizing results in a tabular format, rather than presenting
them as a ranked list. Unlike previous work, which is limited to retrieving
existing tables, we aim to answer queries by automatically compiling a table in
response to a query. We introduce and address the task of on-the-fly table
generation: given a query, generate a relational table that contains relevant
entities (as rows) along with their key properties (as columns). This problem
is decomposed into three specific subtasks: (i) core column entity ranking,
(ii) schema determination, and (iii) value lookup. We employ a feature-based
approach for entity ranking and schema determination, combining deep semantic
features with task-specific signals. We further show that these two subtasks
are not independent of each other and can assist each other in an iterative
manner. For value lookup, we combine information from existing tables and a
knowledge base. Using two sets of entity-oriented queries, we evaluate our
approach both on the component level and on the end-to-end table generation
task.
Comment: The 41st International ACM SIGIR Conference on Research and
Development in Information Retrieva
Computational Diagnosis of Skin Lesions from Dermoscopic Images using Combined Features
There has been an alarming increase in the number of skin cancer cases worldwide in recent years, which has raised interest in computational systems for automatic diagnosis to assist early diagnosis and prevention. Feature extraction to describe skin lesions is a challenging research area due to the difficulty in selecting meaningful features. The main objective of this work is to find the best combination of features, based on shape properties, colour variation and texture analysis, to be extracted using various feature extraction methods. Several colour spaces are used for the extraction of both colour- and texture-related features. Different categories of classifiers were adopted to evaluate the proposed feature extraction step, and several feature selection algorithms were compared for the classification of skin lesions. The developed skin lesion computational diagnosis system was applied to a set of 1104 dermoscopic images using a cross-validation procedure. The best results were obtained with an optimum-path forest classifier. The proposed system achieved an accuracy of 92.3%, sensitivity of 87.5% and specificity of 97.1% when the full set of features was used. Furthermore, it achieved an accuracy of 91.6%, sensitivity of 87% and specificity of 96.2% when 50 features were selected using a correlation-based feature selection algorithm
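For readers unfamiliar with the three metrics reported above, the sketch below shows how accuracy, sensitivity and specificity are computed from the counts of a binary confusion matrix. The counts are invented for illustration and are not the paper's data; the function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp: positive cases correctly flagged     fn: positive cases missed
    tn: negative cases correctly cleared     fp: negative cases wrongly flagged
    """
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity


# Made-up counts for 100 images: 50 positive cases and 50 negative cases.
accuracy, sensitivity, specificity = diagnostic_metrics(tp=40, fn=10, tn=45, fp=5)
print(accuracy, sensitivity, specificity)  # 0.85 0.8 0.9
```

Sensitivity and specificity are reported separately because a missed positive case and a false alarm carry different costs, which a single accuracy figure hides.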
Active Learning of Classification Models from Enriched Label-related Feedback
Our ability to learn accurate classification models from data is often limited by the number of available labeled data instances. This limitation is of particular concern when data instances need to be manually labeled by human annotators and when the labeling process carries a significant cost. Recent years have witnessed increased research interest in developing methods, in different directions, capable of learning models from a smaller number of examples. One such direction is active learning, which finds the most informative unlabeled instances to be labeled next. Another, more recent direction showing great promise utilizes enriched label-related feedback. In this case, feedback from the human annotator provides additional information reflecting the relations among possible labels. The cost of such feedback is often negligible compared with the cost of instance review. The enriched label-related feedback may come in different forms. In this work, we propose, develop and study classification models for binary, multi-class and multi-label classification problems that utilize the different forms of enriched label-related feedback. We show that this new feedback can help us improve the quality of classification models compared with the standard class-label feedback. For each of the studied feedback forms, we also develop new active learning strategies for selecting the most informative unlabeled instances that are compatible with the respective feedback form, effectively combining two approaches for reducing the number of required labeled instances. We demonstrate the effectiveness of our new framework on both simulated and real-world datasets
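The abstract does not spell out its selection strategies, so the sketch below illustrates only the generic idea of active learning with a common baseline, uncertainty sampling for binary classification: query the unlabeled instance whose predicted positive-class probability is closest to 0.5. The probabilities, instance ids and the function name `most_uncertain` are assumptions made for the example, not the methods proposed in this work.

```python
def most_uncertain(probabilities):
    """Pick the unlabeled instance the current model is least sure about.

    `probabilities` maps an instance id to the model's predicted probability
    of the positive class. The instance closest to 0.5 is returned.
    """
    return min(probabilities, key=lambda item_id: abs(probabilities[item_id] - 0.5))


# Hypothetical model outputs for three unlabeled instances.
predicted = {"x1": 0.95, "x2": 0.55, "x3": 0.10}
pick = most_uncertain(predicted)
print(pick)  # x2
```

In a full loop, the chosen instance would be sent to the annotator, the model retrained on the enlarged labeled set, and the selection repeated until the labeling budget is exhausted.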
ARDA: Automatic Relational Data Augmentation for Machine Learning
Automatic machine learning (AutoML) is a family of techniques to automate the
process of training predictive models, aiming to both improve performance and
make machine learning more accessible. While many recent works have focused on
aspects of the machine learning pipeline like model selection, hyperparameter
tuning, and feature selection, relatively few works have focused on automatic
data augmentation. Automatic data augmentation involves finding new features
relevant to the user's predictive task with minimal "human-in-the-loop"
involvement.
We present ARDA, an end-to-end system that takes as input a dataset and a
data repository, and outputs an augmented dataset such that training a
predictive model on this augmented dataset results in improved performance. Our
system has two distinct components: (1) a framework to search and join data
with the input data, based on various attributes of the input, and (2) an
efficient feature selection algorithm that prunes out noisy or irrelevant
features from the resulting join. We perform an extensive empirical evaluation
of different system components and benchmark our feature selection algorithm on
real-world datasets