Statistical learning methods for mining marketing and biological data
The value of data is now broadly recognized and emphasized: more and more decisions are made on the basis of data and analysis rather than solely on experience and intuition. With the rapid development of networking, data storage, and data collection capacity, data volumes have grown dramatically across industry, science, and engineering, bringing both great opportunities and challenges. To take advantage of this flood of data, new computational methods are needed to process, analyze, and understand these datasets.
This dissertation focuses on the development of statistical learning methods for online advertising and bioinformatics to model real-world data with temporal or spatial changes. First, a collaborative online change-point detection method is proposed to identify change-points in sparse time series. It leverages signals from auxiliary time series, such as engagement metrics, to compensate for the sparse revenue data and improve detection efficiency and accuracy through smart collaboration. Second, a task-specific multi-task learning algorithm is developed to model ever-changing video viewing behaviors. With ℓ1-regularized task-specific features and jointly estimated shared features, it allows different models to seek common ground while preserving their differences. Third, an empirical Bayes method is proposed to identify 3' and 5' alternative splicing in RNA-seq data. It formulates alternative 3' and 5' splice-site selection as a change-point problem and provides, for the first time, a systematic framework to pool information across genes and to integrate various sources of information when available, in particular the useful junction read information, in order to obtain better performance.
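The first contribution's core idea, borrowing signal from a denser auxiliary series to detect change-points in a sparse one, can be sketched with a toy CUSUM detector. This is a generic illustration, not the dissertation's algorithm: the combined statistic, weight, warmup window, and threshold are all assumed choices.

```python
import numpy as np

def cusum_changepoint(primary, auxiliary, weight=0.5, warmup=40, threshold=10.0):
    """Toy online change-point detector for a sparse primary series that
    borrows strength from a denser auxiliary series.

    Minimal CUSUM-style sketch; all tuning constants are illustrative.
    """
    combined = (1.0 - weight) * np.asarray(primary) + weight * np.asarray(auxiliary)
    mean0 = combined[:warmup].mean()            # baseline level from an initial window
    s_pos = s_neg = 0.0
    for t, x in enumerate(combined):
        s_pos = max(0.0, s_pos + (x - mean0))   # accumulates upward shifts
        s_neg = max(0.0, s_neg - (x - mean0))   # accumulates downward shifts
        if max(s_pos, s_neg) > threshold:
            return t                            # first alarm time
    return None                                 # no change detected

rng = np.random.default_rng(0)
# Sparse revenue-like counts with a small mean shift at t = 50 ...
primary = np.concatenate([rng.poisson(0.2, 50), rng.poisson(0.6, 50)]).astype(float)
# ... and a denser engagement-like series with a clearer shift at the same time.
auxiliary = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(3.0, 1.0, 50)])
print(cusum_changepoint(primary, auxiliary))
```

On the sparse primary series alone the shift from 0.2 to 0.6 is hard to see quickly; combining it with the auxiliary series lets the detector alarm shortly after t = 50.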
Off-Policy Evaluation of Probabilistic Identity Data in Lookalike Modeling
We evaluate the impact of probabilistically-constructed digital identity data
collected from Sep. to Dec. 2017 (approx.), in the context of
Lookalike-targeted campaigns. The backbone of this study is a large set of
probabilistically-constructed "identities", represented as small bags of
cookies and mobile ad identifiers with associated metadata, that are likely all
owned by the same underlying user. The identity data allow us to generate
"identity-based", rather than "identifier-based", user models, giving a fuller
picture of the interests of the users underlying the identifiers. We employ
off-policy techniques to evaluate the potential of identity-powered lookalike
models without incurring the risk of allowing untested models to direct large
amounts of ad spend or the large cost of performing A/B tests. We add to
historical work on off-policy evaluation by noting a significant type of
"finite-sample bias" that occurs for studies combining modestly-sized datasets
and evaluation metrics involving rare events (e.g., conversions). We illustrate
this bias using a simulation study that later informs the handling of inverse
propensity weights in our analyses on real data. We demonstrate significant
lift in identity-powered lookalikes versus an identity-ignorant baseline: on
average ~70% lift in conversion rate. This rises to factors of ~(4-32)x for
identifiers having little data themselves, but that can be inferred to belong
to users with substantial data to aggregate across identifiers. This implies
that identity-powered user modeling is especially important in the context of
identifiers having very short lifespans (i.e., frequently churned cookies). Our
work motivates and informs the use of probabilistically-constructed identities
in marketing. It also deepens the canon of examples in which off-policy
learning has been employed to evaluate the complex systems of the internet
economy.
Comment: Accepted by WSDM 201
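The "finite-sample bias" the authors highlight arises when inverse propensity weights meet rare outcomes such as conversions. A minimal, self-contained simulation of inverse-propensity-scored (IPS) off-policy evaluation; the policies, conversion rates, and estimators below are generic illustrations, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
# Logging policy: action 1 is shown with probability p_log per impression.
p_log = rng.uniform(0.05, 0.5, n)
action = rng.random(n) < p_log
# Rare conversions, more likely under action 1 (illustrative rates).
conv = rng.random(n) < np.where(action, 0.004, 0.001)

# Target policy always takes action 1; estimate its conversion rate offline.
w = action / p_log                     # inverse propensity weights
ips = np.mean(w * conv)                # unbiased but high-variance
snips = np.sum(w * conv) / np.sum(w)   # self-normalized IPS: small bias, lower variance

print(f"IPS: {ips:.5f}  SNIPS: {snips:.5f}  true: 0.00400")
```

With only a handful of conversions contributing, a few large weights dominate both estimators; that sensitivity is what motivates the paper's careful handling of inverse propensity weights on modestly sized datasets.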
Improving Estimation in Functional Linear Regression with Points of Impact: Insights into Google AdWords
The functional linear regression model with points of impact is a recent
augmentation of the classical functional linear model with many practically
important applications. In this work, however, we demonstrate that the existing
data-driven procedure for estimating the parameters of this regression model
can be very unstable and inaccurate. The tendency to omit relevant points of
impact is a particularly problematic aspect resulting in omitted-variable
biases. We explain the theoretical reason for this problem and propose a new
sequential estimation algorithm that leads to significantly improved estimation
results. Our estimation algorithm is compared with the existing estimation
procedure using an in-depth simulation study. The applicability is demonstrated
using data from Google AdWords, today's most important platform for online
advertisements. The \textsf{R}-package \texttt{FunRegPoI} and additional
\textsf{R}-codes are provided in the online supplementary material.
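The model in question augments the classical functional linear model with a finite number of pointwise evaluations of the predictor curve. In commonly used notation (generic symbols, not copied from the paper), the response is modeled as

```latex
Y_i = \alpha + \int_0^1 \beta(t)\, X_i(t)\, dt + \sum_{k=1}^{S} \beta_k\, X_i(\tau_k) + \varepsilon_i ,
\qquad i = 1, \dots, n ,
```

where the \tau_k are the unknown points of impact. Omitting a relevant \tau_k forces its pointwise effect to be absorbed by the remaining terms, which is the omitted-variable bias the abstract warns about.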
Sparse Signal Recovery under Poisson Statistics
We are motivated by problems that arise in a number of applications such as
Online Marketing and explosives detection, where the observations are usually
modeled using Poisson statistics. We model each observation as a Poisson random
variable whose mean is a sparse linear superposition of known patterns. Unlike
many conventional problems, the observations here are not identically distributed
since they are associated with different sensing modalities. We analyze the
performance of a Maximum Likelihood (ML) decoder, which for our Poisson setting
involves a non-linear optimization yet is computationally tractable. We
derive fundamental sample complexity bounds for sparse recovery when the
measurements are contaminated with Poisson noise. In contrast to the
least-squares linear regression setting with Gaussian noise, we observe that in
addition to sparsity, the scale of the parameters also fundamentally impacts
sample complexity. We introduce a novel notion of Restricted Likelihood
Perturbation (RLP), to jointly account for scale and sparsity. We derive sample
complexity bounds for regularized ML estimators in terms of RLP and
further specialize these results for deterministic and random sensing matrix
designs.
Comment: 13 pages, 11 figures, 2 tables, submitted to IEEE Transactions on
Signal Processing
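As a concrete illustration of the setting, here is a sketch of an ℓ1-regularized Poisson maximum-likelihood decoder on synthetic data. The solver, penalty weight, and sensing matrix below are illustrative choices, not the paper's decoder, bounds, or matrix designs.

```python
import numpy as np
from scipy.optimize import minimize

def poisson_lasso(A, y, lam=0.1):
    """l1-regularized Poisson ML: minimize
        sum_i [(A x)_i - y_i log (A x)_i] + lam * sum_j x_j
    over x >= 0 (so the l1 penalty reduces to the sum of entries).
    Illustrative solver only.
    """
    m, n = A.shape

    def nll(x):
        rate = A @ x + 1e-9                 # Poisson means; small floor for log
        return np.sum(rate - y * np.log(rate)) + lam * np.sum(x)

    def grad(x):
        rate = A @ x + 1e-9
        return A.T @ (1.0 - y / rate) + lam  # gradient of the penalized NLL

    res = minimize(nll, np.full(n, 0.5), jac=grad,
                   method="L-BFGS-B", bounds=[(0.0, None)] * n)
    return res.x

rng = np.random.default_rng(2)
A = rng.uniform(0.0, 1.0, (200, 50))         # nonnegative sensing matrix
x_true = np.zeros(50)
x_true[[3, 17, 40]] = [5.0, 8.0, 3.0]        # sparse, scaled parameters
y = rng.poisson(A @ x_true).astype(float)    # Poisson observations
x_hat = poisson_lasso(A, y)
print(np.sort(np.argsort(x_hat)[-3:]))       # largest recovered entries; should match the support
```

The example also hints at the abstract's point about scale: shrinking the entries of `x_true` toward zero makes the counts less informative, so more measurements are needed for the same recovery quality.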
Inefficiencies in Digital Advertising Markets
Digital advertising markets are growing and attracting increased scrutiny. This article explores four market inefficiencies that remain poorly understood: ad effect measurement, frictions between and within advertising channel members, ad blocking, and ad fraud. Although these topics are not unique to digital advertising, each manifests in unique ways in markets for digital ads. The authors identify relevant findings in the academic literature, recent developments in practice, and promising topics for future research.
CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap
After addressing the state-of-the-art during the first year of Chorus and establishing the existing landscape in
multimedia search engines, we have identified and analyzed gaps within European research effort during our second year.
In this period we focused on three directions, notably technological issues, user-centred issues and use-cases, and socio-economic
and legal aspects. These were assessed through two central studies: first, a concerted vision of the functional breakdown
of a generic multimedia search engine, and second, representative use-case descriptions with the related discussion of
requirements for technological challenges. Both studies were carried out in cooperation and consultation with the
community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our
Think-Tank, presentations in international conferences, and surveys addressed to EU projects coordinators as well as
National initiatives coordinators. Based on the obtained feedback we identified two types of gaps, namely core
technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research
challenges, but have impact on innovation progress. New socio-economic trends are presented as well as emerging legal
challenges.
Modeling the formation of attentive publics in social media: the case of Donald Trump
Previous research has shown the importance of Donald Trump’s Twitter activity, and that of his Twitter following, in spreading his message during the primary and general election campaigns of 2015–2016. However, we know little about how the publics who followed Trump and amplified his messages took shape. We take this case as an opportunity to theorize and test questions about the assembly of what we call “attentive publics” in social media. We situate our study in the context of current discussions of audience formation, attention flow, and hybridity in the United States’ political media system. From this we derive propositions concerning how attentive publics aggregate around a particular object, in this case Trump himself, which we test using time series modeling. We also present an exploration of the possible role of automated accounts in these processes. Our results reiterate the media hybridity described by others, while emphasizing the importance of news media coverage in building social media attentive publics.
Accepted manuscript
Scalable Bayesian modeling, monitoring and analysis of dynamic network flow data
Traffic flow count data in networks arise in many applications, such as
automobile or aviation transportation, certain directed social network
contexts, and Internet studies. Using an example of Internet browser traffic
flow through site-segments of an international news website, we present
Bayesian analyses of two linked classes of models which, in tandem, allow fast,
scalable and interpretable Bayesian inference. We first develop flexible
state-space models for streaming count data, able to adaptively characterize
and quantify network dynamics efficiently in real-time. We then use these
models as emulators of more structured, time-varying gravity models that allow
formal dissection of network dynamics. This yields interpretable inferences on
traffic flow characteristics, and on dynamics in interactions among network
nodes. Bayesian monitoring theory defines a strategy for sequential model
assessment and adaptation in cases when network flow data deviates from
model-based predictions. Exploratory and sequential monitoring analyses of
evolving traffic on a network of web site-segments in e-commerce demonstrate
the utility of this coupled Bayesian emulation approach to analysis of
streaming network count data.
Comment: 29 pages, 16 figures
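The flavor of adaptive filtering for streaming counts can be shown in miniature with a conjugate gamma-Poisson filter using discount-factor evolution for a single series. The discount factor and priors are illustrative; the paper's state-space models are far richer (network-structured, with gravity-model emulation and formal monitoring).

```python
import numpy as np

def gamma_poisson_filter(counts, a0=1.0, b0=1.0, discount=0.95):
    """One-step filtering of a time-varying Poisson rate via gamma
    conjugacy with discount-factor evolution: each step inflates the
    posterior's uncertainty (a, b scaled down), then updates with the
    new count. Minimal single-series sketch, not the paper's models.
    """
    a, b = a0, b0
    means = []
    for y in counts:
        a, b = discount * a, discount * b   # evolve: forget old data gradually
        a, b = a + y, b + 1.0               # conjugate update with new count
        means.append(a / b)                 # filtered posterior mean of the rate
    return np.array(means)

rng = np.random.default_rng(3)
rate = np.concatenate([np.full(100, 5.0), np.full(100, 15.0)])  # abrupt level shift
y = rng.poisson(rate)
est = gamma_poisson_filter(y)
print(est[90].round(1), est[190].round(1))  # tracks ~5, then adapts toward ~15
```

The discount factor controls the memory of the filter (effective sample size roughly 1/(1 − discount)), which is the basic trade-off between responsiveness to network dynamics and noise suppression that the paper's monitoring machinery manages formally.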