6 research outputs found

    Multi-Task Learning for Email Search Ranking with Auxiliary Query Clustering

    User information needs vary significantly across tasks, so their queries also differ considerably in expressiveness and semantics. Many approaches have been proposed to model this query diversity by inferring query types and building query-dependent ranking models. These approaches typically require either a labeled query dataset or clicks from multiple users aggregated over the same document, and are therefore not applicable when manual query labeling is not viable and aggregated clicks are unavailable due to the private nature of the document collection, e.g., in email search. In this paper, we study how to obtain query types in an unsupervised fashion and how to incorporate this information into query-dependent ranking models. We first develop a hierarchical clustering algorithm based on truncated SVD and varimax rotation to obtain coarse-to-fine query types. We then study three query-dependent ranking models: two neural models that leverage query type information as additional features, and a novel multi-task neural model that treats the query type as the label for an auxiliary query-cluster prediction task. The multi-task model is trained to rank documents and predict query types simultaneously. Our experiments on tens of millions of real-world email search queries demonstrate that the proposed multi-task model significantly outperforms baseline neural ranking models that either do not incorporate query type information or simply feed the query type in as an additional feature.
    Comment: CIKM 201
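
    A minimal sketch of the unsupervised query-typing step described above, assuming TF-IDF n-gram query features; the toy queries, the number of latent factors, and the second-level KMeans step are illustrative assumptions, not the paper's exact configuration.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.decomposition import TruncatedSVD
        from sklearn.feature_extraction.text import TfidfVectorizer

        def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
            """Standard varimax rotation of a factor matrix."""
            p, k = loadings.shape
            rotation = np.eye(k)
            var = 0.0
            for _ in range(max_iter):
                rotated = loadings @ rotation
                u, s, vt = np.linalg.svd(
                    loadings.T @ (rotated ** 3
                                  - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0))))
                rotation = u @ vt
                new_var = s.sum()
                if new_var < var * (1 + tol):
                    break
                var = new_var
            return loadings @ rotation

        # Toy query log; in the paper this is tens of millions of email search queries.
        queries = ["my recent flight to the US", "medical history",
                   "amazon order confirmation", "flight itinerary last week"]
        tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(queries)

        # Coarse query types: dominant varimax-rotated latent factor per query.
        svd = TruncatedSVD(n_components=2, random_state=0)
        factors = varimax(svd.fit_transform(tfidf))
        coarse = np.abs(factors).argmax(axis=1)

        # Fine query types: cluster again within each coarse type (coarse-to-fine hierarchy).
        fine = np.zeros_like(coarse)
        for c in np.unique(coarse):
            idx = np.where(coarse == c)[0]
            if len(idx) > 1:
                fine[idx] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(factors[idx])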

    Accelerated Convergence for Counterfactual Learning to Rank

    Counterfactual Learning to Rank (LTR) algorithms learn a ranking model from logged user interactions, often collected using a production system. Such an offline learning approach has many benefits over an online one, but it is challenging because user feedback often contains high levels of bias. Unbiased LTR uses Inverse Propensity Scoring (IPS) to enable unbiased learning from logged user interactions. A major difficulty in applying Stochastic Gradient Descent (SGD) to counterfactual learning problems is the large variance introduced by the propensity weights. In this paper, we show that the convergence rate of SGD with IPS-weighted gradients suffers from this variance: convergence is slow, especially when some IPS weights are large. To overcome this limitation, we propose a novel learning algorithm, called CounterSample, that has provably better convergence than standard IPS-weighted gradient descent methods. We prove that CounterSample converges faster and complement our theoretical findings with empirical results from extensive experiments in a number of biased LTR scenarios, across optimizers, batch sizes, and different degrees of position bias.
    Comment: SIGIR 2020 full conference paper
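
    A hedged sketch of the standard IPS-weighted SGD update the abstract refers to, on synthetic data; it illustrates why low-propensity clicks produce very large gradient steps (the variance problem), and it is not the paper's CounterSample algorithm itself.

        import numpy as np

        rng = np.random.default_rng(0)
        n, d = 1000, 10
        X = rng.normal(size=(n, d))             # features of logged clicked documents (synthetic)
        propensity = rng.uniform(0.05, 1.0, n)  # examination probability at the logged rank
        ips = 1.0 / propensity                  # inverse propensity weights (up to 20x here)

        w = np.zeros(d)
        lr = 0.1
        for step in range(1000):
            i = rng.integers(n)                 # uniform sampling, as in standard IPS-weighted SGD
            score = X[i] @ w
            # Gradient of ips[i] * (-log sigmoid(score)); the IPS weight scales the whole step,
            # so rare clicks (small propensity) dominate updates and inflate gradient variance.
            grad = -ips[i] * X[i] / (1.0 + np.exp(score))
            w -= lr * grad

    CounterSample, as described in the abstract, replaces this scheme with one whose convergence is provably faster; the sketch only reproduces the baseline whose variance problem it addresses.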

    Separate and Attend in Personal Email Search

    In personal email search, user queries often impose different requirements on different aspects of the retrieved emails. For example, the query "my recent flight to the US" requires emails to be ranked based on both the textual content and the recency of the email documents, while other queries such as "medical history" do not impose any constraint on recency. Recent deep learning-to-rank models for personal email search often directly concatenate dense numerical features (e.g., document age) with embedded sparse features (e.g., n-gram embeddings). In this paper, we first show with a set of experiments on synthetic datasets that direct concatenation of dense and sparse features does not lead to the optimal search performance of deep neural ranking models. To effectively incorporate both sparse and dense email features into personal email search ranking, we propose a novel neural model, SepAttn. SepAttn first builds two separate neural models to learn from sparse and dense features respectively, and then applies an attention mechanism at the prediction level to derive the final prediction from these two models. We conduct a comprehensive set of experiments on a large-scale email search dataset, and demonstrate that our SepAttn model consistently improves search quality over the baseline models.
    Comment: WSDM 202
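
    A hypothetical sketch of the separate-and-attend idea: one sub-network scores embedded sparse features, another scores dense features, and an attention layer mixes the two predictions; all layer sizes and the exact attention form are assumptions, not the published SepAttn architecture.

        import torch
        import torch.nn as nn

        class SepAttnSketch(nn.Module):
            def __init__(self, sparse_dim=128, dense_dim=8, hidden=64):
                super().__init__()
                # Separate towers for sparse (n-gram embedding) and dense (e.g. document age) features.
                self.sparse_net = nn.Sequential(nn.Linear(sparse_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
                self.dense_net = nn.Sequential(nn.Linear(dense_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
                # Attention over the two sub-model predictions, conditioned on both feature sets.
                self.attn = nn.Linear(sparse_dim + dense_dim, 2)

            def forward(self, sparse_x, dense_x):
                scores = torch.cat([self.sparse_net(sparse_x), self.dense_net(dense_x)], dim=-1)    # (batch, 2)
                weights = torch.softmax(self.attn(torch.cat([sparse_x, dense_x], dim=-1)), dim=-1)  # (batch, 2)
                return (weights * scores).sum(dim=-1)  # final ranking score per candidate email

        model = SepAttnSketch()
        scores = model(torch.randn(4, 128), torch.randn(4, 8))  # scores for 4 candidate emails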

    Reducción de dimensionalidad en Machine Learning. Diagnóstico de cáncer de mama basado en datos genómicos y de imagen (Dimensionality reduction in Machine Learning: breast cancer diagnosis based on genomic and image data)

    The aim of this project is to analyze some of the Machine Learning techniques used in the current treatment of Big Data. It includes a study of the statistical and algebraic tools involved in the calculations, and an application to the diagnosis and classification of breast cancer based on genomic and image data. Extracting information from Big Data requires the data to be pre-processed first; in this project we present different pre-processing techniques, analyze them, and assess their impact on the resulting prediction models. Two Machine Learning models are presented: the first is focused on the diagnosis of breast cancer based on image data; the second is devoted to the classification of the different subtypes of breast cancer and to the discovery of patterns using genomic and proteomic data. The two chosen datasets are particularly well suited to demonstrating the Machine Learning techniques analyzed in the project and the corresponding pre-processing strategies.
    Galarza Hernández, J. (2017). Reducción de dimensionalidad en Machine Learning. Diagnóstico de cáncer de mama basado en datos genómicos y de imagen. http://hdl.handle.net/10251/92565 (TFG)
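
    An illustrative sketch, not the thesis's exact models: preprocessing plus dimensionality reduction feeding a diagnostic classifier, using scikit-learn's Wisconsin breast-cancer dataset of image-derived features as a stand-in for the image data described above.

        from sklearn.datasets import load_breast_cancer
        from sklearn.decomposition import PCA
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        X, y = load_breast_cancer(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

        # Standardize the 30 image descriptors, project onto 10 principal components, then classify.
        clf = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000))
        clf.fit(X_train, y_train)
        print("held-out accuracy:", clf.score(X_test, y_test))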