9 research outputs found

    Domain Adaptation for Enterprise Email Search

    Full text link
    In the enterprise email search setting, the same search engine often powers multiple enterprises from various industries: technology, education, manufacturing, etc. However, using the same global ranking model across different enterprises may result in suboptimal search quality, due to the corpora differences and distinct information needs. On the other hand, training an individual ranking model for each enterprise may be infeasible, especially for smaller institutions with limited data. To address this data challenge, in this paper we propose a domain adaptation approach that fine-tunes the global model to each individual enterprise. In particular, we propose a novel application of the Maximum Mean Discrepancy (MMD) approach to information retrieval, which attempts to bridge the gap between the global data distribution and the data distribution for a given individual enterprise. We conduct a comprehensive set of experiments on a large-scale email search engine, and demonstrate that the MMD approach consistently improves the search quality for multiple individual domains, both in comparison to the global ranking model, as well as several competitive domain adaptation baselines including adversarial learning methods.Comment: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieva

    ESAM: Discriminative Domain Adaptation with Non-Displayed Items to Improve Long-Tail Performance

    Full text link
    Most of ranking models are trained only with displayed items (most are hot items), but they are utilized to retrieve items in the entire space which consists of both displayed and non-displayed items (most are long-tail items). Due to the sample selection bias, the long-tail items lack sufficient records to learn good feature representations, i.e. data sparsity and cold start problems. The resultant distribution discrepancy between displayed and non-displayed items would cause poor long-tail performance. To this end, we propose an entire space adaptation model (ESAM) to address this problem from the perspective of domain adaptation (DA). ESAM regards displayed and non-displayed items as source and target domains respectively. Specifically, we design the attribute correlation alignment that considers the correlation between high-level attributes of the item to achieve distribution alignment. Furthermore, we introduce two effective regularization strategies, i.e. \textit{center-wise clustering} and \textit{self-training} to improve DA process. Without requiring any auxiliary information and auxiliary domains, ESAM transfers the knowledge from displayed items to non-displayed items for alleviating the distribution inconsistency. Experiments on two public datasets and a large-scale industrial dataset collected from Taobao demonstrate that ESAM achieves state-of-the-art performance, especially in the long-tail space. Besides, we deploy ESAM to the Taobao search engine, leading to significant improvement on online performance. The code is available at \url{https://github.com/A-bone1/ESAM.git}Comment: Accept by SIGIR-202

    Learning List-Level Domain-Invariant Representations for Ranking

    Full text link
    Domain adaptation aims to transfer the knowledge learned on (data-rich) source domains to (low-resource) target domains, and a popular method is invariant representation learning, which matches and aligns the data distributions on the feature space. Although this method is studied extensively and applied on classification and regression problems, its adoption on ranking problems is sporadic, and the few existing implementations lack theoretical justifications. This paper revisits invariant representation learning for ranking. Upon reviewing prior work, we found that they implement what we call item-level alignment, which aligns the distributions of the items being ranked from all lists in aggregate but ignores their list structure. However, the list structure should be leveraged, because it is intrinsic to ranking problems where the data and the metrics are defined and computed on lists, not the items by themselves. To close this discrepancy, we propose list-level alignment -- learning domain-invariant representations at the higher level of lists. The benefits are twofold: it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method, and it achieves better empirical transfer performance for unsupervised domain adaptation on ranking tasks, including passage reranking.Comment: NeurIPS 2023. Comparison to v1: revised presentation and proof of Corollary 4.

    Separate and Attend in Personal Email Search

    Full text link
    In personal email search, user queries often impose different requirements on different aspects of the retrieved emails. For example, the query "my recent flight to the US" requires emails to be ranked based on both textual contents and recency of the email documents, while other queries such as "medical history" do not impose any constraints on the recency of the email. Recent deep learning-to-rank models for personal email search often directly concatenate dense numerical features (e.g., document age) with embedded sparse features (e.g., n-gram embeddings). In this paper, we first show with a set of experiments on synthetic datasets that direct concatenation of dense and sparse features does not lead to the optimal search performance of deep neural ranking models. To effectively incorporate both sparse and dense email features into personal email search ranking, we propose a novel neural model, SepAttn. SepAttn first builds two separate neural models to learn from sparse and dense features respectively, and then applies an attention mechanism at the prediction level to derive the final prediction from these two models. We conduct a comprehensive set of experiments on a large-scale email search dataset, and demonstrate that our SepAttn model consistently improves the search quality over the baseline models.Comment: WSDM 202

    UNCERTAINTY MITIGATION IN IMAGE-BASED MACHINE LEARNING MODELS FOR PRECISION MEDICINE

    Get PDF
    Machine learning (ML) algorithms have been developed to build predictive models in medicine and healthcare. In most cases, the performance of ML models/algorithms is measured by predictive accuracy or accuracy-related measures only. In medicine, the model results are intended to guide physicians to make critical decisions regarding patient care. This means that quantifying and mitigating the uncertainty of the output is also very important as it will allow decision makers to know how much they can rely on the model output. My dissertation focuses on studying model uncertainty of image-based ML in the context of precision medicine of brain cancer. Specifically, I focus on developing ML models to predict intra-tumor heterogeneity of genomic and molecular markers based on multi-contrast magnetic resonance imaging (MRI) data for glioblastoma (GBM) – the most aggressive type of brain cancer. Intra-tumor heterogeneity has been found to be a leading cause of treatment failure of GBM. Devising a non-invasive approach to map out the molecular/genomic distribution using MRI helps develop treatment with high precision. My dissertation research addresses the model uncertainties due to high-dimensional and noisy features, sparsity of labeled data, and utility of domain knowledge. In the first study, we developed a Semi-supervised Gaussian Process with Uncertainty-minimizing Feature-selection (SGP-UF), which can incorporate selected unlabeled samples (i.e. unbiopsied regions of a tumor) in the model training, and integrate feature selection with a new criterion of seeking features that minimize the prediction uncertainty. In the second study, we developed a Knowledge-infused Global-Local data fusion (KGL) framework, which optimally fuses three sources of data/information including biopsy samples (labeled data, local/sparse), images (unlabeled data, global), and knowledge-driven mechanistic models. In the third study, we developed a Weakly Supervised Ordinal Support Vector Machine (WSO-SVM), which aims to leverage a combination of data sources including biopsy/labeled samples and unlabeled samples from the tumor and image data from the normal brain, as well as their intrinsic ordinal relationship. We demonstrate that these novel methods significantly reduce prediction uncertainty while at the same time achieving higher accuracy in precision medicine, which can inform personalized targeted treatment decisions that potentially improve clinical outcome.Ph.D
    corecore