16,578 research outputs found
Counterfactual Estimation and Optimization of Click Metrics for Search Engines
Optimizing an interactive system against a predefined online metric is
particularly challenging, when the metric is computed from user feedback such
as clicks and payments. The key challenge is the counterfactual nature: in the
case of Web search, any change to a component of the search engine may result
in a different search result page for the same query, but we normally cannot
infer reliably from search log how users would react to the new result page.
Consequently, it appears impossible to accurately estimate online metrics that
depend on user feedback, unless the new engine is run to serve users and
compared with a baseline in an A/B test. This approach, while valid and
successful, is unfortunately expensive and time-consuming. In this paper, we
propose to address this problem using causal inference techniques, under the
contextual-bandit framework. This approach effectively allows one to run
(potentially infinitely) many A/B tests offline from search log, making it
possible to estimate and optimize online metrics quickly and inexpensively.
Focusing on an important component in a commercial search engine, we show how
these ideas can be instantiated and applied, and obtain very promising results
that suggest the wide applicability of these techniques
Data Mining in Electronic Commerce
Modern business is rushing toward e-commerce. If the transition is done
properly, it enables better management, new services, lower transaction costs
and better customer relations. Success depends on skilled information
technologists, among whom are statisticians. This paper focuses on some of the
contributions that statisticians are making to help change the business world,
especially through the development and application of data mining methods. This
is a very large area, and the topics we cover are chosen to avoid overlap with
other papers in this special issue, as well as to respect the limitations of
our expertise. Inevitably, electronic commerce has raised and is raising fresh
research problems in a very wide range of statistical areas, and we try to
emphasize those challenges.Comment: Published at http://dx.doi.org/10.1214/088342306000000204 in the
Statistical Science (http://www.imstat.org/sts/) by the Institute of
Mathematical Statistics (http://www.imstat.org
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performances, and data
scientists spend considerable amount of time on data cleaning before model
training. However, to date, there does not exist a rigorous study on how
exactly cleaning affects ML -- ML community usually focuses on developing ML
algorithms that are robust to some particular noise types of certain
distributions, while database (DB) community has been mostly studying the
problem of data cleaning alone without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both commonly
used algorithms in practice as well as state-of-the-art solutions in academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.Comment: published in ICDE 202
- …