Multi-Task Learning for Email Search Ranking with Auxiliary Query Clustering
User information needs vary significantly across different tasks, and
therefore their queries will also differ considerably in their expressiveness
and semantics. Many approaches have been proposed to model such query diversity by
obtaining query types and building query-dependent ranking models. These
studies typically require either a labeled query dataset or clicks from
multiple users aggregated over the same document. These techniques, however,
are not applicable when manual query labeling is not viable, and aggregated
clicks are unavailable due to the private nature of the document collection,
e.g., in email search scenarios. In this paper, we study how to obtain query
types in an unsupervised fashion and how to incorporate this information into
query-dependent ranking models. We first develop a hierarchical clustering
algorithm based on truncated SVD and varimax rotation to obtain coarse-to-fine
query types. Then, we study three query-dependent ranking models, including two
neural models that leverage query type information as additional features, and
one novel multi-task neural model that views query type as the label for the
auxiliary query cluster prediction task. This multi-task model is trained to
simultaneously rank documents and predict query types. Our experiments on tens
of millions of real-world email search queries demonstrate that the proposed
multi-task model can significantly outperform the baseline neural ranking
models, which either do not incorporate query type information or simply
feed query type as an additional feature.
Comment: CIKM 201
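The multi-task idea in the abstract above, one model trained to rank documents and predict the query's cluster at the same time, can be illustrated with a joint loss. This is a minimal sketch and not the paper's implementation: the listwise softmax ranking loss, the function names, and the `aux_weight` mixing parameter are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ranking_loss(scores, clicks):
    """Listwise softmax cross-entropy for one query.
    Assumes at least one clicked document (hypothetical choice of loss)."""
    probs = softmax(scores)
    n_clicks = sum(clicks)
    return -sum(c * math.log(p) for c, p in zip(clicks, probs)) / n_clicks

def cluster_loss(cluster_logits, cluster_id):
    """Cross-entropy for the auxiliary task: predict the query's cluster label."""
    return -math.log(softmax(cluster_logits)[cluster_id])

def multi_task_loss(scores, clicks, cluster_logits, cluster_id, aux_weight=0.5):
    """Main ranking loss plus a weighted auxiliary cluster-prediction loss.
    aux_weight is a hypothetical hyperparameter, not taken from the paper."""
    return ranking_loss(scores, clicks) + aux_weight * cluster_loss(
        cluster_logits, cluster_id
    )

# Example: four candidate emails, the second was clicked; three query clusters.
loss = multi_task_loss(
    scores=[0.2, 1.5, -0.3, 0.0],
    clicks=[0, 1, 0, 0],
    cluster_logits=[0.1, 0.4, 2.0],
    cluster_id=2,
)
```

Setting `aux_weight` to zero recovers a plain ranking model, which makes the contribution of the auxiliary task easy to isolate in an ablation.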
Accelerated Convergence for Counterfactual Learning to Rank
Counterfactual Learning to Rank (LTR) algorithms learn a ranking model from
logged user interactions, often collected using a production system. Employing
such an offline learning approach has many benefits compared to an online one,
but it is challenging as user feedback often contains high levels of bias.
Unbiased LTR uses Inverse Propensity Scoring (IPS) to enable unbiased learning
from logged user interactions. One of the major difficulties in applying
Stochastic Gradient Descent (SGD) approaches to counterfactual learning
problems is the large variance introduced by the propensity weights. In this
paper we show that the convergence rate of SGD approaches with IPS-weighted
gradients suffers from the large variance introduced by the IPS weights:
convergence is slow, especially when there are large IPS weights. To overcome
this limitation, we propose a novel learning algorithm, called CounterSample,
that has provably better convergence than standard IPS-weighted gradient
descent methods. We prove that CounterSample converges faster and complement
our theoretical findings with empirical results by performing extensive
experimentation in a number of biased LTR scenarios -- across optimizers, batch
sizes, and different degrees of position bias.
Comment: SIGIR 2020 full conference paper
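The variance problem described in the abstract above can be made concrete with a small calculation. The sketch below is not CounterSample; it only assumes a simple position-based click model, in which a document is clicked with probability relevance times propensity, and computes the mean and variance of a single IPS-weighted click. All function names are hypothetical.

```python
def ips_weight(propensity):
    """Inverse propensity weight for a document shown at some rank."""
    return 1.0 / propensity

def expected_ips_click(relevance, propensity):
    """Mean of the IPS-weighted click under the assumed click model
    P(click) = relevance * propensity. It equals the true relevance."""
    click_prob = relevance * propensity
    return click_prob * ips_weight(propensity)

def ips_click_variance(relevance, propensity):
    """Variance of the IPS-weighted click, i.e. of w * C with C ~ Bernoulli(p)."""
    click_prob = relevance * propensity
    w = ips_weight(propensity)
    second_moment = click_prob * w * w
    mean = click_prob * w
    return second_moment - mean ** 2

# Same relevance, decreasing examination propensity (e.g. lower ranks).
for propensity in (1.0, 0.5, 0.1):
    mean = expected_ips_click(0.5, propensity)
    var = ips_click_variance(0.5, propensity)
    print(f"propensity={propensity}: mean={mean:.3f}, variance={var:.3f}")
```

The mean stays at the true relevance for every propensity, which is the unbiasedness property, while the variance grows roughly as 1/propensity. Gradients scaled by the same weights inherit that variance, which is the difficulty for SGD that the abstract describes.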
Separate and Attend in Personal Email Search
In personal email search, user queries often impose different requirements on
different aspects of the retrieved emails. For example, the query "my recent
flight to the US" requires emails to be ranked based on both textual contents
and recency of the email documents, while other queries such as "medical
history" do not impose any constraints on the recency of the email. Recent deep
learning-to-rank models for personal email search often directly concatenate
dense numerical features (e.g., document age) with embedded sparse features
(e.g., n-gram embeddings). In this paper, we first show with a set of
experiments on synthetic datasets that direct concatenation of dense and sparse
features does not lead to the optimal search performance of deep neural ranking
models. To effectively incorporate both sparse and dense email features into
personal email search ranking, we propose a novel neural model, SepAttn.
SepAttn first builds two separate neural models to learn from sparse and dense
features respectively, and then applies an attention mechanism at the
prediction level to derive the final prediction from these two models. We
conduct a comprehensive set of experiments on a large-scale email search
dataset, and demonstrate that our SepAttn model consistently improves the
search quality over the baseline models.
Comment: WSDM 202
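The "separate, then attend" structure described in the abstract above can be sketched in a few lines. This is an illustrative toy and not the SepAttn architecture itself: each branch is reduced to a single linear layer, the attention logits are assumed to come from a second linear head per branch, and every name and parameter is hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def branch(features, score_weights, attn_weights):
    """One branch (stand-in for a neural sub-model).
    Returns a relevance score and an attention logit for that branch."""
    score = sum(f * w for f, w in zip(features, score_weights))
    attn_logit = sum(f * w for f, w in zip(features, attn_weights))
    return score, attn_logit

def sep_attn_score(sparse_feats, dense_feats, params):
    """Score sparse and dense features separately, then combine the two
    predictions with softmax attention at the prediction level."""
    s_score, s_logit = branch(
        sparse_feats, params["sparse_score"], params["sparse_attn"]
    )
    d_score, d_logit = branch(
        dense_feats, params["dense_score"], params["dense_attn"]
    )
    a_sparse, a_dense = softmax([s_logit, d_logit])
    return a_sparse * s_score + a_dense * d_score

# Example: a 2-d embedded sparse feature and one dense feature (document age).
params = {
    "sparse_score": [0.5, 0.25], "sparse_attn": [0.0, 0.0],
    "dense_score": [1.0], "dense_attn": [0.0],
}
score = sep_attn_score([1.0, 2.0], [3.0], params)
```

With all attention weights at zero the two branch predictions are simply averaged; learned attention weights let the model lean on the dense branch for recency-sensitive queries and on the sparse branch otherwise.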
Reducción de dimensionalidad en Machine Learning. Diagnóstico de cáncer de mama basado en datos genómicos y de imagen
The aim of this project is to analyze some of the Machine Learning techniques currently used to extract information from Big Data. It includes a study of the statistical and algebraic tools involved in the calculations, and an application to the diagnosis and classification of breast cancer based on genomic and image data.
Before information can be extracted from Big Data, the data must be pre-processed. In this project we present several pre-processing techniques and analyze their impact on the resulting prediction models.
Two Machine Learning models are presented. The first focuses on diagnosing breast cancer from image data. The second is devoted to classifying breast cancer subtypes and discovering patterns using genomic and proteomic data. The two datasets are particularly well suited to demonstrating the Machine Learning techniques analyzed in the project and the pre-processing they require.
Galarza Hernández, J. (2017). Reducción de dimensionalidad en Machine Learning. Diagnóstico de cáncer de mama basado en datos genómicos y de imagen. http://hdl.handle.net/10251/92565
TFG