Search CORE

9 research outputs found

Domain Adaptation for Enterprise Email Search

Author: Bendersky Michael
Karimzadehgan Maryam
Metzler Donald
Pasumarthi Rama Kumar
Tran Brandon
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 18/06/2019
Field of study

In the enterprise email search setting, the same search engine often powers multiple enterprises from various industries: technology, education, manufacturing, etc. However, using the same global ranking model across different enterprises may result in suboptimal search quality, due to the corpora differences and distinct information needs. On the other hand, training an individual ranking model for each enterprise may be infeasible, especially for smaller institutions with limited data. To address this data challenge, in this paper we propose a domain adaptation approach that fine-tunes the global model to each individual enterprise. In particular, we propose a novel application of the Maximum Mean Discrepancy (MMD) approach to information retrieval, which attempts to bridge the gap between the global data distribution and the data distribution for a given individual enterprise. We conduct a comprehensive set of experiments on a large-scale email search engine, and demonstrate that the MMD approach consistently improves the search quality for multiple individual domains, both in comparison to the global ranking model, as well as several competitive domain adaptation baselines including adversarial learning methods.Comment: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieva

arXiv.org e-Print Archive

Crossref

ESAM: Discriminative Domain Adaptation with Non-Displayed Items to Improve Long-Tail Performance

Author: Chen Zhihong
Deng Hongbo
Li Chenliang
Sun Haochuan
Xiao Rong
Ye Gangfeng
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 21/05/2020
Field of study

Most of ranking models are trained only with displayed items (most are hot items), but they are utilized to retrieve items in the entire space which consists of both displayed and non-displayed items (most are long-tail items). Due to the sample selection bias, the long-tail items lack sufficient records to learn good feature representations, i.e. data sparsity and cold start problems. The resultant distribution discrepancy between displayed and non-displayed items would cause poor long-tail performance. To this end, we propose an entire space adaptation model (ESAM) to address this problem from the perspective of domain adaptation (DA). ESAM regards displayed and non-displayed items as source and target domains respectively. Specifically, we design the attribute correlation alignment that considers the correlation between high-level attributes of the item to achieve distribution alignment. Furthermore, we introduce two effective regularization strategies, i.e. \textit{center-wise clustering} and \textit{self-training} to improve DA process. Without requiring any auxiliary information and auxiliary domains, ESAM transfers the knowledge from displayed items to non-displayed items for alleviating the distribution inconsistency. Experiments on two public datasets and a large-scale industrial dataset collected from Taobao demonstrate that ESAM achieves state-of-the-art performance, especially in the long-tail space. Besides, we deploy ESAM to the Taobao search engine, leading to significant improvement on online performance. The code is available at \url{https://github.com/A-bone1/ESAM.git}Comment: Accept by SIGIR-202

arXiv.org e-Print Archive

Crossref

Learning List-Level Domain-Invariant Representations for Ranking

Author: Bendersky Michael
Hui Kai
Lu Jing
Ma Ji
Qin Zhen
Wang Xuanhui
Xian Ruicheng
Zamani Hamed
Zhao Han
Zhuang Honglei
Publication venue
Publication date: 31/10/2023
Field of study

Domain adaptation aims to transfer the knowledge learned on (data-rich) source domains to (low-resource) target domains, and a popular method is invariant representation learning, which matches and aligns the data distributions on the feature space. Although this method is studied extensively and applied on classification and regression problems, its adoption on ranking problems is sporadic, and the few existing implementations lack theoretical justifications. This paper revisits invariant representation learning for ranking. Upon reviewing prior work, we found that they implement what we call item-level alignment, which aligns the distributions of the items being ranked from all lists in aggregate but ignores their list structure. However, the list structure should be leveraged, because it is intrinsic to ranking problems where the data and the metrics are defined and computed on lists, not the items by themselves. To close this discrepancy, we propose list-level alignment -- learning domain-invariant representations at the higher level of lists. The benefits are twofold: it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method, and it achieves better empirical transfer performance for unsupervised domain adaptation on ranking tasks, including passage reranking.Comment: NeurIPS 2023. Comparison to v1: revised presentation and proof of Corollary 4.

arXiv.org e-Print Archive

Separate and Attend in Personal Email Search

Author: Abadi Martin
Devlin Jacob
Duchi John C.
Jaana Kalervo
Soboroff Ian
Soboroff Ian
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 21/11/2019
Field of study

In personal email search, user queries often impose different requirements on different aspects of the retrieved emails. For example, the query "my recent flight to the US" requires emails to be ranked based on both textual contents and recency of the email documents, while other queries such as "medical history" do not impose any constraints on the recency of the email. Recent deep learning-to-rank models for personal email search often directly concatenate dense numerical features (e.g., document age) with embedded sparse features (e.g., n-gram embeddings). In this paper, we first show with a set of experiments on synthetic datasets that direct concatenation of dense and sparse features does not lead to the optimal search performance of deep neural ranking models. To effectively incorporate both sparse and dense email features into personal email search ranking, we propose a novel neural model, SepAttn. SepAttn first builds two separate neural models to learn from sparse and dense features respectively, and then applies an attention mechanism at the prediction level to derive the final prediction from these two models. We conduct a comprehensive set of experiments on a large-scale email search dataset, and demonstrate that our SepAttn model consistently improves the search quality over the baseline models.Comment: WSDM 202

arXiv.org e-Print Archive

Crossref

Recommended from our members

Data Scarcity in Event Analysis and Abusive Language Detection

Author: Sarwar Sheikh Muhammad
Publication venue: ScholarWorks@UMass Amherst
Publication date: 26/10/2022
Field of study

Lack of data is almost always the cause of the suboptimal performance of neural networks. Even though data scarce scenarios can be simulated for any task by assuming limited access to training data, we study two problem areas where data scarcity is a practical challenge: event analysis and abusive content detection} Journalists, social scientists and political scientists need to retrieve and analyze event mentions in unstructured text to compute useful statistical information to understand society. We claim that it is hard to specify information need about events using keyword-based representation and propose a Query by Example (QBE) setting for event retrieval. In the QBE setting, we assume that there are a few example sentences mentioning the event class a user is interested in and we aim to retrieve relevant events using only the examples as a query. Traditional event detection approaches are not applicable in this setting as event detection datasets are constructed based on pre-defined schemas which limits them to a small set of event and event-argument types. Moreover, the amount of annotated data in event detection datasets is limited that only allows us to build a retrieval corpus for evaluation. Thus we assume that there are no relevance judgments to train an event retrieval model -- except for the few examples of a specific event type. We create three QBE evaluation settings from three event detection datasets: PoliceKilling, ACE, and IndiaPoliceEvents. For the PoliceKilling dataset, where a relevant sentence describes a police killing event, we show that a query model constructed from the NLP features extracted from the few given examples is effective compared to event detection baselines. For the ACE dataset, where there are thirty-three types of events, we construct a QBE setting for each type and show that a sentence embedding approach effectively transfers for event matching. Finally, we conducted a unified evaluation of all three datasets using the sentence-embedding-based model and showed that it outperforms strong baselines. We further examine the effect of data scarcity in abusive language detection. We first study a specific type of abusive language -- hate speech. Neural hate speech detection models trained from one dataset poorly generalize to another dataset from a different domain. This is because characteristics of hate speech vary based on racial and cultural aspects. Our data scarcity scenario assumes that we have a hate speech dataset from a domain and it needs to generalize to a test set from another domain using the unlabeled data from the test domain only. Thus we assume zero target domain data in this scenario. To tackle the data scarcity, we propose an unsupervised domain adaptation approach to augment labeled data for hate speech detection. We evaluate the approach with three different models (character CNNs, BiLSTMs, and BERT) on three different collections. We show our approach improves Area under the Precision/Recall curve by as much as 42% and recall by as much as 278%, with no loss (and in some cases a significant gain) in precision. Finally, we examine the cross-lingual abusive language detection problem. Abusive language is a superclass of hate speech that includes profanity, aggression, offensiveness, cyberbullying, toxicity, and hate speech itself. There is a large collection of abusive language detection datasets in English such as Jigsaw. For other languages there exist datasets for abusive language detection but with very limited data. We propose a cross-lingual transfer learning approach to learn an effective neural abusive language classifier for such low-resource languages with help from a dataset from a resource-rich language. The framework is based on a nearest-neighbor architecture and is thus interpretable by design. It is a modern instantiation of the classic k-nearest neighbor model, as we use transformer representations in all its components. Unlike prior work on neighborhood-based approaches, we encode the neighborhood information based on query-neighbor interactions. We propose two encoding schemes and show their effectiveness using both qualitative and quantitative analyses. Our evaluation results on eight languages from two different datasets for abusive language detection show sizable improvements in F1 over strong baselines

ScholarWorks@UMass Amherst

UNCERTAINTY MITIGATION IN IMAGE-BASED MACHINE LEARNING MODELS FOR PRECISION MEDICINE

Author: Wang Lujia
Publication venue: Georgia Institute of Technology
Publication date: 25/08/2022
Field of study

Machine learning (ML) algorithms have been developed to build predictive models in medicine and healthcare. In most cases, the performance of ML models/algorithms is measured by predictive accuracy or accuracy-related measures only. In medicine, the model results are intended to guide physicians to make critical decisions regarding patient care. This means that quantifying and mitigating the uncertainty of the output is also very important as it will allow decision makers to know how much they can rely on the model output. My dissertation focuses on studying model uncertainty of image-based ML in the context of precision medicine of brain cancer. Specifically, I focus on developing ML models to predict intra-tumor heterogeneity of genomic and molecular markers based on multi-contrast magnetic resonance imaging (MRI) data for glioblastoma (GBM) – the most aggressive type of brain cancer. Intra-tumor heterogeneity has been found to be a leading cause of treatment failure of GBM. Devising a non-invasive approach to map out the molecular/genomic distribution using MRI helps develop treatment with high precision. My dissertation research addresses the model uncertainties due to high-dimensional and noisy features, sparsity of labeled data, and utility of domain knowledge. In the first study, we developed a Semi-supervised Gaussian Process with Uncertainty-minimizing Feature-selection (SGP-UF), which can incorporate selected unlabeled samples (i.e. unbiopsied regions of a tumor) in the model training, and integrate feature selection with a new criterion of seeking features that minimize the prediction uncertainty. In the second study, we developed a Knowledge-infused Global-Local data fusion (KGL) framework, which optimally fuses three sources of data/information including biopsy samples (labeled data, local/sparse), images (unlabeled data, global), and knowledge-driven mechanistic models. In the third study, we developed a Weakly Supervised Ordinal Support Vector Machine (WSO-SVM), which aims to leverage a combination of data sources including biopsy/labeled samples and unlabeled samples from the tumor and image data from the normal brain, as well as their intrinsic ordinal relationship. We demonstrate that these novel methods significantly reduce prediction uncertainty while at the same time achieving higher accuracy in precision medicine, which can inform personalized targeted treatment decisions that potentially improve clinical outcome.Ph.D

Scholarly Materials And Research @ Georgia Tech