Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce
In this paper, we present our work towards comparing on-line and off-line
evaluation metrics in the context of small e-commerce recommender systems.
Recommending for small e-commerce enterprises is rather challenging due to the
lower volume of interactions and low user loyalty, which rarely extends beyond a
single session. On the other hand, we usually have to deal with lower volumes
of objects, which are easier to discover by users through various
browsing/searching GUIs.
The main goal of this paper is to determine the applicability of off-line
evaluation metrics for learning the true usability of recommender systems (evaluated
on-line in A/B testing). In total, 800 variants of recommending algorithms were
evaluated off-line w.r.t. 18 metrics covering rating-based, ranking-based,
novelty and diversity evaluation. The off-line results were afterwards compared
with on-line evaluation of 12 selected recommender variants and based on the
results, we tried to learn and utilize an off-line to on-line results
prediction model.
Off-line results showed great variance in performance w.r.t. different
metrics, with the Pareto front covering 68\% of the approaches. Furthermore, we
observed that on-line results are considerably affected by the novelty of
users. On-line metrics correlate positively with ranking-based metrics (AUC,
MRR, nDCG) for novice users, while excessively high values of diversity and novelty
had a negative impact on the on-line results for them. For users with more visited
items, however, diversity became more important, while the relevance of
ranking-based metrics gradually decreased. Comment: Submitted to ACM Hypertext 2020 Conference
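As a reference point for the ranking-based metrics this abstract mentions (MRR, nDCG), here is a minimal sketch of how the two are commonly computed for a single ranked list with binary relevance; the function and variable names are our own illustration, not the paper's code:

```python
import math

def mrr(ranking, relevant):
    """Reciprocal rank of the first relevant item (0.0 if none appears)."""
    for pos, item in enumerate(ranking, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0

def ndcg(ranking, relevant, k=10):
    """Binary-relevance nDCG@k: discounted gain normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, item in enumerate(ranking[:k], start=1)
              if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 1)
                for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

print(mrr(["a", "b", "c", "d"], {"b"}))                   # 0.5
print(round(ndcg(["a", "b", "c", "d"], {"a", "c"}, k=4), 3))  # 0.92
```

Averaging these per-user values over an evaluation set yields the aggregate scores compared against the on-line results.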
Off-Policy Evaluation of Probabilistic Identity Data in Lookalike Modeling
We evaluate the impact of probabilistically-constructed digital identity data
collected from Sep. to Dec. 2017 (approx.), in the context of
Lookalike-targeted campaigns. The backbone of this study is a large set of
probabilistically-constructed "identities", represented as small bags of
cookies and mobile ad identifiers with associated metadata, that are likely all
owned by the same underlying user. The identity data allows us to generate
"identity-based", rather than "identifier-based", user models, giving a fuller
picture of the interests of the users underlying the identifiers. We employ
off-policy techniques to evaluate the potential of identity-powered lookalike
models without incurring the risk of allowing untested models to direct large
amounts of ad spend or the large cost of performing A/B tests. We add to
historical work on off-policy evaluation by noting a significant type of
"finite-sample bias" that occurs for studies combining modestly-sized datasets
and evaluation metrics involving rare events (e.g., conversions). We illustrate
this bias using a simulation study that later informs the handling of inverse
propensity weights in our analyses on real data. We demonstrate significant
lift in identity-powered lookalikes versus an identity-ignorant baseline: on
average ~70% lift in conversion rate. This rises to factors of ~(4-32)x for
identifiers having little data themselves, but that can be inferred to belong
to users with substantial data to aggregate across identifiers. This implies
that identity-powered user modeling is especially important in the context of
identifiers having very short lifespans (i.e., frequently churned cookies). Our
work motivates and informs the use of probabilistically-constructed identities
in marketing. It also deepens the canon of examples in which off-policy
learning has been employed to evaluate the complex systems of the internet
economy. Comment: Accepted by WSDM 201
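The inverse-propensity weighting (and the weight clipping that helps with the finite-sample bias noted above) can be sketched as follows; this is a generic IPS estimator under a known logging propensity, not the paper's implementation, and all names and data are illustrative:

```python
def ips_estimate(logs, target_prob, clip=None):
    """Inverse-propensity-score estimate of a target policy's expected reward
    from logged (action, logging_propensity, reward) triples."""
    total = 0.0
    for action, propensity, reward in logs:
        weight = target_prob(action) / propensity  # importance weight
        if clip is not None:
            weight = min(weight, clip)             # cap weights: less variance, some bias
        total += weight * reward
    return total / len(logs)

# Toy log: uniform logging policy over two actions (propensity 0.5 each).
logs = [("a", 0.5, 1.0), ("b", 0.5, 0.0), ("a", 0.5, 1.0), ("b", 0.5, 0.0)]
always_a = lambda action: 1.0 if action == "a" else 0.0
print(ips_estimate(logs, always_a))            # 1.0
print(ips_estimate(logs, always_a, clip=1.5))  # 0.75
```

With rare rewards such as conversions, only a handful of logged events carry nonzero reward, so a few large weights dominate the estimate; that is the regime in which the finite-sample bias the abstract describes becomes pronounced.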
Monte Carlo Estimates of Evaluation Metric Error and Bias: Work in Progress
Traditional offline evaluations of recommender systems apply metrics from machine learning and information retrieval in settings where their underlying assumptions no longer hold. This results in significant error and bias in measures of top-N recommendation performance, such as precision, recall, and nDCG. Several of the specific causes of these errors, including popularity bias and misclassified decoy items, are well explored in the existing literature. In this paper, we survey a range of work on identifying and addressing these problems, and report on our work in progress to simulate the recommender data generation and evaluation processes in order to quantify the extent of evaluation metric errors and assess their sensitivity to various assumptions.
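A toy version of the kind of simulation described, quantifying the gap between a metric computed on observed (popularity-biased) feedback and the same metric against oracle relevance, might look like this; the setup, probabilities, and names are invented for illustration only:

```python
import random

def precision_gap(n_users=2000, n_items=50, k=5, seed=0):
    """Toy Monte Carlo: precision@k measured on popularity-biased observed
    feedback vs. precision against the full (oracle) relevance set."""
    rng = random.Random(seed)
    # Observation probability decays with item index, mimicking popularity bias.
    obs_prob = [0.9 / (idx + 1) for idx in range(n_items)]
    recs = list(range(k))  # a fixed "most-popular" recommender
    measured = oracle = 0.0
    for _ in range(n_users):
        relevant = set(rng.sample(range(n_items), 10))              # ground truth
        observed = {i for i in relevant if rng.random() < obs_prob[i]}
        oracle += len(set(recs) & relevant) / k
        measured += len(set(recs) & observed) / k
    return measured / n_users, oracle / n_users
```

Because observed feedback is a strict subset of true relevance here, the measured precision systematically understates the oracle value, and varying `obs_prob` shows how sensitive the gap is to the assumed observation process.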
Diversify and Conquer: Bandits and Diversity for an Enhanced E-commerce Homepage Experience
In the realm of e-commerce, popular platforms utilize widgets to recommend
advertisements and products to their users. However, the prevalence of mobile
device usage on these platforms introduces a unique challenge due to the
limited screen real estate available. Consequently, the positioning of relevant
widgets becomes pivotal in capturing and maintaining customer engagement. Given
the restricted screen size of mobile devices, widgets placed at the top of the
interface are more prominently displayed and thus attract greater user
attention. Conversely, widgets positioned further down the page require users
to scroll, resulting in reduced visibility and subsequent lower impression
rates. Therefore, it becomes imperative to place relevant widgets on top.
However, selecting relevant widgets to display is a challenging task, as the
widgets can be heterogeneous and can be introduced to or removed from the
platform at any time. In this work, we model the vertical widget reordering
as a contextual multi-armed bandit problem with delayed batch feedback. The
objective is to rank the vertical widgets in a personalized manner. We present
a two-stage ranking framework that combines contextual bandits with a diversity
layer to improve the overall ranking. We demonstrate its effectiveness through
offline and online A/B results, conducted on proprietary data from Myntra, a
major fashion e-commerce platform in India. Comment: Accepted in Proceedings of Fashionxrecys Workshop, 17th ACM
Conference on Recommender Systems, 202
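A highly simplified sketch of such a two-stage framework (score, then greedily diversify) is shown below; it replaces the contextual bandit with plain empirical CTR estimates and uses made-up widget names, so it only illustrates the ranking-plus-diversity-layer structure, not the paper's method:

```python
def widget_order(stats, categories):
    """Rank widgets by estimated click-through rate, then apply a greedy
    diversity layer that avoids two adjacent widgets of the same category."""
    # Stage 1: score each widget by empirical CTR from (clicks, impressions);
    # a contextual bandit would additionally condition on user features.
    scored = sorted(stats, key=lambda w: stats[w][0] / max(stats[w][1], 1),
                    reverse=True)
    # Stage 2: greedy diversity re-rank over the scored list.
    ranked, pool = [], scored
    while pool:
        pick = next((w for w in pool
                     if not ranked or categories[w] != categories[ranked[-1]]),
                    pool[0])
        pool.remove(pick)
        ranked.append(pick)
    return ranked

stats = {"deals": (30, 100), "shoes": (25, 100), "bags": (20, 100), "sale": (28, 100)}
categories = {"deals": "promo", "sale": "promo", "shoes": "product", "bags": "product"}
print(widget_order(stats, categories))  # ['deals', 'shoes', 'sale', 'bags']
```

Without the diversity layer the two "promo" widgets would occupy the top two slots; the re-rank interleaves categories while keeping the highest-scoring widget first.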
Unbiased Recommender Learning from Missing-Not-At-Random Implicit Feedback
Recommender systems widely use implicit feedback such as click data because
of its general availability. Although the presence of clicks signals the users'
preference to some extent, the lack of such clicks does not necessarily
indicate a negative response from the users, as it is possible that the users
were not exposed to the items (positive-unlabeled problem). This leads to a
difficulty in predicting the users' preferences from implicit feedback.
Previous studies addressed the positive-unlabeled problem by uniformly
upweighting the loss for the positive feedback data or by estimating the
confidence of each data point having relevance information via the EM algorithm.
However, these methods failed to address the missing-not-at-random problem in
which popular or frequently recommended items are more likely to be clicked
than other items even if a user does not have a considerable interest in them.
To overcome these limitations, we first define an ideal loss function to be
optimized to realize recommendations that maximize the relevance and propose an
unbiased estimator for the ideal loss. Subsequently, we analyze the variance of
the proposed unbiased estimator and further propose a clipped estimator that
includes the unbiased estimator as a special case. We demonstrate that the
clipped estimator is expected to improve the performance of the recommender
system, by considering the bias-variance trade-off. We conduct semi-synthetic
and real-world experiments and demonstrate that the proposed method largely
outperforms the baselines. In particular, the proposed method works better for
rare items that are less frequently observed in the training data. The findings
indicate that the proposed method can better achieve the objective of
recommending items with the highest relevance. Comment: accepted at WSDM'2
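The structure of such a propensity-weighted pointwise loss with clipped propensities can be sketched as follows; this is a generic IPS-style estimator consistent with the description above, not the authors' code, and the function and parameter names are illustrative:

```python
def ips_pointwise_loss(clicks, propensities, pos_loss, neg_loss, floor=0.0):
    """Propensity-weighted estimate of an ideal pointwise loss from click data.
    Clicks are up-weighted by 1/propensity so rarely exposed items count more;
    `floor` clips small propensities, trading some bias for lower variance."""
    total = 0.0
    for y, theta, lp, ln in zip(clicks, propensities, pos_loss, neg_loss):
        w = y / max(theta, floor)            # IPS weight (y is 0 or 1)
        total += w * lp + (1.0 - w) * ln     # weighted positive/negative losses
    return total / len(clicks)

clicks, props = [1, 0], [0.5, 0.5]
pos, neg = [0.2, 0.4], [1.0, 0.1]
print(ips_pointwise_loss(clicks, props, pos, neg))             # -0.25
print(ips_pointwise_loss(clicks, props, pos, neg, floor=1.0))  # 0.15
```

Setting `floor=0.0` recovers the unclipped (unbiased) estimator as a special case, mirroring the relationship between the two estimators described in the abstract.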