Domain Adaptation for Enterprise Email Search
In the enterprise email search setting, the same search engine often powers
multiple enterprises from various industries: technology, education,
manufacturing, etc. However, using the same global ranking model across
different enterprises may result in suboptimal search quality, due to the
corpora differences and distinct information needs. On the other hand, training
an individual ranking model for each enterprise may be infeasible, especially
for smaller institutions with limited data. To address this data challenge, in
this paper we propose a domain adaptation approach that fine-tunes the global
model to each individual enterprise. In particular, we propose a novel
application of the Maximum Mean Discrepancy (MMD) approach to information
retrieval, which attempts to bridge the gap between the global data
distribution and the data distribution for a given individual enterprise. We
conduct a comprehensive set of experiments on a large-scale email search
engine, and demonstrate that the MMD approach consistently improves the search
quality for multiple individual domains, both in comparison to the global
ranking model, as well as several competitive domain adaptation baselines
including adversarial learning methods.
Comment: Proceedings of the 42nd International ACM SIGIR Conference on
Research and Development in Information Retrieval
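To illustrate the quantity that the MMD approach in the abstract above tries to reduce, here is a minimal sketch of a squared Maximum Mean Discrepancy estimate between two feature samples. The Gaussian (RBF) kernel, the biased estimator, the bandwidth value, and the names `rbf_kernel`, `mmd_squared`, `global_feats`, `same_dist` and `shifted` are assumptions made for illustration; the abstract does not state which kernel or estimator the authors use, nor how the term enters their training loss.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples of vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Toy data: "global" features versus two hypothetical enterprise samples.
random.seed(0)
global_feats = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
same_dist = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
shifted = [[random.gauss(1.5, 1.0) for _ in range(4)] for _ in range(50)]

mmd_same = mmd_squared(global_feats, same_dist)
mmd_shift = mmd_squared(global_feats, shifted)
print(f"same distribution: {mmd_same:.4f}, shifted: {mmd_shift:.4f}")
```

In a domain adaptation setting, a term like this would typically be added to the ranking loss so that representations of global and enterprise-specific data are pulled closer together; the weighting of such a term is not given in the abstract.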
ESAM: Discriminative Domain Adaptation with Non-Displayed Items to Improve Long-Tail Performance
Most ranking models are trained only with displayed items (most are hot
items), but they are utilized to retrieve items in the entire space which
consists of both displayed and non-displayed items (most are long-tail items).
Due to the sample selection bias, the long-tail items lack sufficient records
to learn good feature representations, i.e. data sparsity and cold start
problems. The resultant distribution discrepancy between displayed and
non-displayed items would cause poor long-tail performance. To this end, we
propose an entire space adaptation model (ESAM) to address this problem from
the perspective of domain adaptation (DA). ESAM regards displayed and
non-displayed items as source and target domains respectively. Specifically, we
design the attribute correlation alignment that considers the correlation
between high-level attributes of the item to achieve distribution alignment.
Furthermore, we introduce two effective regularization strategies, i.e.
center-wise clustering and self-training, to improve the DA
process. Without requiring any auxiliary information and auxiliary domains,
ESAM transfers the knowledge from displayed items to non-displayed items for
alleviating the distribution inconsistency. Experiments on two public datasets
and a large-scale industrial dataset collected from Taobao demonstrate that
ESAM achieves state-of-the-art performance, especially in the long-tail space.
Besides, we deploy ESAM to the Taobao search engine, leading to significant
improvement on online performance. The code is available at
https://github.com/A-bone1/ESAM.git
Comment: Accept by SIGIR-202
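As a rough illustration of aligning feature correlations between displayed (source) and non-displayed (target) items, the sketch below computes a CORAL-style loss: the squared Frobenius distance between the two feature covariance matrices. This is a generic covariance-matching technique swapped in for illustration and may differ from the attribute correlation alignment defined in the ESAM paper; the function names, the normalization constant, and the toy data are assumptions.

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length feature vectors."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [
            sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


def correlation_alignment_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1/(4 d^2)."""
    cov_s = covariance(source)
    cov_t = covariance(target)
    d = len(cov_s)
    sq_diff = sum(
        (cov_s[i][j] - cov_t[i][j]) ** 2 for i in range(d) for j in range(d)
    )
    return sq_diff / (4.0 * d * d)


# Toy high-level features for displayed and non-displayed items.
displayed = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
non_displayed = [[1.0, -1.0], [2.0, -2.0], [3.0, -3.0]]

loss = correlation_alignment_loss(displayed, non_displayed)
print(loss)
```

A loss of zero means the two domains already share the same second-order feature statistics; minimizing it during training pushes the representations of non-displayed items toward those of displayed items.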
Learning List-Level Domain-Invariant Representations for Ranking
Domain adaptation aims to transfer the knowledge learned on (data-rich)
source domains to (low-resource) target domains, and a popular method is
invariant representation learning, which matches and aligns the data
distributions on the feature space. Although this method is studied extensively
and applied on classification and regression problems, its adoption on ranking
problems is sporadic, and the few existing implementations lack theoretical
justifications. This paper revisits invariant representation learning for
ranking. Upon reviewing prior work, we found that they implement what we call
item-level alignment, which aligns the distributions of the items being ranked
from all lists in aggregate but ignores their list structure. However, the list
structure should be leveraged, because it is intrinsic to ranking problems
where the data and the metrics are defined and computed on lists, not the items
by themselves. To close this discrepancy, we propose list-level alignment --
learning domain-invariant representations at the higher level of lists. The
benefits are twofold: it leads to the first domain adaptation generalization
bound for ranking, in turn providing theoretical support for the proposed
method, and it achieves better empirical transfer performance for unsupervised
domain adaptation on ranking tasks, including passage reranking.
Comment: NeurIPS 2023. Comparison to v1: revised presentation and proof of
Corollary 4.
Separate and Attend in Personal Email Search
In personal email search, user queries often impose different requirements on
different aspects of the retrieved emails. For example, the query "my recent
flight to the US" requires emails to be ranked based on both textual contents
and recency of the email documents, while other queries such as "medical
history" do not impose any constraints on the recency of the email. Recent deep
learning-to-rank models for personal email search often directly concatenate
dense numerical features (e.g., document age) with embedded sparse features
(e.g., n-gram embeddings). In this paper, we first show with a set of
experiments on synthetic datasets that direct concatenation of dense and sparse
features does not lead to the optimal search performance of deep neural ranking
models. To effectively incorporate both sparse and dense email features into
personal email search ranking, we propose a novel neural model, SepAttn.
SepAttn first builds two separate neural models to learn from sparse and dense
features respectively, and then applies an attention mechanism at the
prediction level to derive the final prediction from these two models. We
conduct a comprehensive set of experiments on a large-scale email search
dataset, and demonstrate that our SepAttn model consistently improves the
search quality over the baseline models.
Comment: WSDM 202
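The idea of applying attention at the prediction level, as described for SepAttn above, can be sketched as a softmax-weighted combination of the scores produced by the sparse-feature and dense-feature models. The abstract does not specify how the attention weights are computed, so the softmax over two logits, the function names, and the example numbers below are assumptions, not the authors' architecture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_predictions(sparse_score, dense_score, attention_logits):
    """Blend two model scores with attention weights over the two models."""
    weights = softmax(attention_logits)
    final_score = weights[0] * sparse_score + weights[1] * dense_score
    return final_score, weights


# Equal attention logits give a plain average of the two scores.
balanced, _ = combine_predictions(2.0, 4.0, [0.0, 0.0])

# A recency-sensitive query might put more weight on the dense (e.g.
# document age) model; the logits here are made-up values.
recency_heavy, recency_weights = combine_predictions(2.0, 4.0, [0.0, 3.0])
print(balanced, recency_heavy, recency_weights)
```

Keeping the two models separate until this final step means neither feature type can dominate the other inside a shared hidden layer, which is the concern the abstract raises about direct concatenation.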
Data Scarcity in Event Analysis and Abusive Language Detection
Lack of data is almost always the cause of the suboptimal performance of neural networks. Even though data-scarce scenarios can be simulated for any task by assuming limited access to training data, we study two problem areas where data scarcity is a practical challenge: event analysis and abusive content detection.
Journalists, social scientists and political scientists need to retrieve and analyze event mentions in unstructured text to compute useful statistical information to understand society. We claim that it is hard to specify an information need about events using a keyword-based representation and propose a Query by Example (QBE) setting for event retrieval. In the QBE setting, we assume that there are a few example sentences mentioning the event class a user is interested in and we aim to retrieve relevant events using only the examples as a query. Traditional event detection approaches are not applicable in this setting as event detection datasets are constructed based on pre-defined schemas, which limits them to a small set of event and event-argument types. Moreover, the amount of annotated data in event detection datasets is so limited that it only allows us to build a retrieval corpus for evaluation. Thus we assume that there are no relevance judgments to train an event retrieval model -- except for the few examples of a specific event type. We create three QBE evaluation settings from three event detection datasets: PoliceKilling, ACE, and IndiaPoliceEvents. For the PoliceKilling dataset, where a relevant sentence describes a police killing event, we show that a query model constructed from the NLP features extracted from the few given examples is effective compared to event detection baselines. For the ACE dataset, where there are thirty-three types of events, we construct a QBE setting for each type and show that a sentence embedding approach effectively transfers for event matching. Finally, we conduct a unified evaluation of all three datasets using the sentence-embedding-based model and show that it outperforms strong baselines.
We further examine the effect of data scarcity in abusive language detection. We first study a specific type of abusive language -- hate speech. Neural hate speech detection models trained on one dataset generalize poorly to another dataset from a different domain. This is because characteristics of hate speech vary based on racial and cultural aspects. Our data scarcity scenario assumes that we have a hate speech dataset from one domain and that a model trained on it needs to generalize to a test set from another domain using only unlabeled data from the test domain. Thus we assume zero labeled target-domain data in this scenario. To tackle the data scarcity, we propose an unsupervised domain adaptation approach to augment labeled data for hate speech detection. We evaluate the approach with three different models (character CNNs, BiLSTMs, and BERT) on three different collections. We show our approach improves Area under the Precision/Recall curve by as much as 42% and recall by as much as 278%, with no loss (and in some cases a significant gain) in precision.
Finally, we examine the cross-lingual abusive language detection problem. Abusive language is a superclass of hate speech that includes profanity, aggression, offensiveness, cyberbullying, toxicity, and hate speech itself. There is a large collection of abusive language detection datasets in English, such as Jigsaw. For other languages, datasets for abusive language detection exist but contain very limited data. We propose a cross-lingual transfer learning approach to learn an effective neural abusive language classifier for such low-resource languages with help from a dataset from a resource-rich language. The framework is based on a nearest-neighbor architecture and is thus interpretable by design. It is a modern instantiation of the classic k-nearest neighbor model, as we use transformer representations in all its components. Unlike prior work on neighborhood-based approaches, we encode the neighborhood information based on query-neighbor interactions. We propose two encoding schemes and show their effectiveness using both qualitative and quantitative analyses. Our evaluation results on eight languages from two different datasets for abusive language detection show sizable improvements in F1 over strong baselines.
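For readers unfamiliar with the classic k-nearest neighbor model that the framework above builds on, the following sketch classifies a query by majority vote over its most similar labeled examples and returns those neighbors as the explanation. It shows only the classic baseline idea, not the proposed query-neighbor interaction encoding; the two-dimensional vectors stand in for transformer sentence embeddings, and all names and data are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(query, examples, k=3):
    """Majority vote over the k examples most similar to the query.

    `examples` is a list of (embedding, label) pairs. The retrieved
    neighbors are returned too, which is what makes the prediction
    inspectable.
    """
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
    neighbors = ranked[:k]
    label = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
    return label, neighbors


# Toy labeled pool, imagined as coming from a resource-rich language.
examples = [
    ([1.0, 0.0], "abusive"),
    ([0.9, 0.1], "abusive"),
    ([0.0, 1.0], "neutral"),
    ([0.1, 0.9], "neutral"),
    ([0.2, 0.8], "neutral"),
]

label, neighbors = knn_predict([0.95, 0.05], examples, k=3)
print(label, [lbl for _, lbl in neighbors])
```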
Uncertainty Mitigation in Image-Based Machine Learning Models for Precision Medicine
Machine learning (ML) algorithms have been developed to build predictive models in medicine and healthcare. In most cases, the performance of ML models/algorithms is measured by predictive accuracy or accuracy-related measures only. In medicine, the model results are intended to guide physicians to make critical decisions regarding patient care. This means that quantifying and mitigating the uncertainty of the output is also very important as it will allow decision makers to know how much they can rely on the model output.
My dissertation focuses on studying model uncertainty of image-based ML in the context of precision medicine of brain cancer. Specifically, I focus on developing ML models to predict intra-tumor heterogeneity of genomic and molecular markers based on multi-contrast magnetic resonance imaging (MRI) data for glioblastoma (GBM) – the most aggressive type of brain cancer. Intra-tumor heterogeneity has been found to be a leading cause of treatment failure of GBM. Devising a non-invasive approach to map out the molecular/genomic distribution using MRI helps develop treatment with high precision. My dissertation research addresses the model uncertainties due to high-dimensional and noisy features, sparsity of labeled data, and utility of domain knowledge.
In the first study, we developed a Semi-supervised Gaussian Process with Uncertainty-minimizing Feature-selection (SGP-UF), which can incorporate selected unlabeled samples (i.e. unbiopsied regions of a tumor) in the model training, and integrate feature selection with a new criterion of seeking features that minimize the prediction uncertainty.
In the second study, we developed a Knowledge-infused Global-Local data fusion (KGL) framework, which optimally fuses three sources of data/information including biopsy samples (labeled data, local/sparse), images (unlabeled data, global), and knowledge-driven mechanistic models.
In the third study, we developed a Weakly Supervised Ordinal Support Vector Machine (WSO-SVM), which aims to leverage a combination of data sources including biopsy/labeled samples and unlabeled samples from the tumor and image data from the normal brain, as well as their intrinsic ordinal relationship.
We demonstrate that these novel methods significantly reduce prediction uncertainty while at the same time achieving higher accuracy in precision medicine, which can inform personalized targeted treatment decisions that potentially improve clinical outcome.
Ph.D