Data acquisition and cost-effective predictive modeling: targeting offers for electronic commerce
Electronic commerce is revolutionizing the way we think about
data modeling, by making it possible to integrate the processes of
(costly) data acquisition and model induction. The opportunity for
improving modeling through costly data acquisition presents itself
for a diverse set of electronic commerce modeling tasks, from personalization
to customer lifetime value modeling; we illustrate with
the running example of choosing offers to display to web-site visitors,
which captures important aspects in a familiar setting. Considering
data acquisition costs explicitly can allow the building of
predictive models at significantly lower costs, and a modeler may
be able to improve performance via new sources of information that
previously were too expensive to consider. However, existing techniques
for integrating modeling and data acquisition cannot deal
with the rich environment that electronic commerce presents. We
discuss several possible data acquisition settings, the challenges involved
in the integration with modeling, and various research areas
that may supply parts of an ultimate solution. We also present and
demonstrate briefly a unified framework within which one can integrate
acquisitions of different types, with any cost structure and
any predictive modeling objective.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
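As a toy sketch of the trade-off this abstract describes (not the authors' framework; the acquisition names, costs, and benefit estimates below are invented for illustration), one can rank candidate data acquisitions by estimated benefit per unit cost and buy while the budget lasts:

```python
# Greedy, budget-constrained selection of data acquisitions.
# Each candidate is (name, cost, estimated_benefit); we buy in order of
# benefit per unit cost, skipping anything whose benefit is below its cost.

def select_acquisitions(candidates, budget):
    """candidates: list of (name, cost, estimated_benefit) tuples."""
    ranked = sorted(candidates, key=lambda c: c[2] / c[1], reverse=True)
    chosen, spent = [], 0.0
    for name, cost, benefit in ranked:
        if spent + cost <= budget and benefit > cost:
            chosen.append(name)
            spent += cost
    return chosen, spent

candidates = [
    ("purchase_history", 10.0, 25.0),    # high value per unit cost
    ("clickstream_labels", 40.0, 45.0),  # marginal but still worthwhile
    ("survey_panel", 30.0, 20.0),        # benefit below cost: never bought
]
chosen, spent = select_acquisitions(candidates, budget=50.0)
```

A real framework would estimate the benefit of an acquisition from its expected effect on the modeling objective rather than take it as given, which is precisely the integration problem the abstract raises.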
Projection Based Models for High Dimensional Data
In recent years, many machine learning applications have arisen which deal with the
problem of finding patterns in high dimensional data. Principal component analysis
(PCA) has become ubiquitous in this setting. PCA performs dimensionality reduction
by estimating latent factors which minimise the reconstruction error between
the original data and its low-dimensional projection. We initially consider a situation
where influential observations exist within the dataset which have a large,
adverse effect on the estimated PCA model. We propose a measure of “predictive
influence” to detect these points based on the contribution of each point to the
leave-one-out reconstruction error of the model using an analytic PRedicted REsidual
Sum of Squares (PRESS) statistic. We then develop a robust alternative to PCA
to deal with the presence of influential observations and outliers which minimizes
the predictive reconstruction error.
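As a rough illustration of the idea (not the thesis's exact analytic PRESS derivation), the numpy sketch below scores each observation by a leverage-adjusted rank-k PCA reconstruction residual, so that points with large leave-one-out-style error stand out; the function and variable names are ours:

```python
import numpy as np

def pca_predictive_influence(X, k):
    """Score each row of X by a leverage-adjusted rank-k PCA
    reconstruction residual -- a crude stand-in for the analytic
    leave-one-out (PRESS-style) predictive influence."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                                  # top-k loadings
    resid = Xc - Xc @ V @ V.T                     # reconstruction residuals
    lev = np.clip((U[:, :k] ** 2).sum(axis=1), 0.0, 0.999)  # leverages
    return (resid ** 2).sum(axis=1) / (1.0 - lev) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[0] += 10.0                                      # plant an influential point
influence = pca_predictive_influence(X, k=2)      # row 0 scores highest
```

The leverage correction is what distinguishes an influence measure from a plain residual: a point can dominate the fitted subspace (high leverage, small residual) yet be exactly the kind of observation a leave-one-out criterion flags.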
In some applications there may be unobserved clusters in the data, for which
fitting PCA models to subsets of the data would provide a better fit. This is known
as the subspace clustering problem. We develop a novel algorithm for subspace
clustering which iteratively fits PCA models to subsets of the data and assigns observations
to clusters based on their predictive influence on the reconstruction error.
We study the convergence of the algorithm and compare its performance to a number
of subspace clustering methods on simulated data and in real applications from
computer vision involving clustering object trajectories in video sequences and images
of faces.
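The alternating loop can be sketched as a bare-bones k-subspaces procedure; for simplicity this version reassigns points by plain reconstruction error rather than the PRESS-based predictive influence the thesis uses, and all names are illustrative:

```python
import numpy as np

def k_subspaces(X, n_clusters, dim, n_iter=20, seed=0):
    """Alternately fit a rank-`dim` PCA model to each cluster and
    reassign every point to the subspace that reconstructs it best."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(X))
    for _ in range(n_iter):
        errors = np.empty((len(X), n_clusters))
        for c in range(n_clusters):
            Xc = X[labels == c]
            if len(Xc) <= dim:                    # degenerate cluster
                errors[:, c] = np.inf
                continue
            mu = Xc.mean(axis=0)
            _, _, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
            V = Vt[:dim].T
            resid = (X - mu) - (X - mu) @ V @ V.T
            errors[:, c] = (resid ** 2).sum(axis=1)
        new = errors.argmin(axis=1)
        if (new == labels).all():                 # converged
            break
        labels = new
    return labels

# Two noisy 1-D subspaces in 3-D, stacked into one dataset.
rng = np.random.default_rng(1)
t = rng.uniform(-5, 5, size=(50, 1))
A = t * np.array([1.0, 0.0, 0.0]) + rng.normal(scale=0.05, size=(50, 3))
B = t * np.array([0.0, 1.0, 0.0]) + rng.normal(scale=0.05, size=(50, 3))
labels = k_subspaces(np.vstack([A, B]), n_clusters=2, dim=1)
```

Like k-means, this loop is sensitive to initialisation, which is one motivation for the more careful predictive-influence assignment rule studied in the thesis.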
We extend our predictive clustering framework to a setting where two high-dimensional
views of data have been obtained. Often, only either clustering or predictive modelling is performed between the views. Instead, we aim to recover
clusters which are maximally predictive between the views. In this setting two block
partial least squares (TB-PLS) is a useful model. TB-PLS performs dimensionality
reduction in both views by estimating latent factors that are highly predictive. We
fit TB-PLS models to subsets of data and assign points to clusters based on their
predictive influence under each model which is evaluated using a PRESS statistic.
We compare our method to state-of-the-art algorithms in real applications in webpage
and document clustering and find that our approach to predictive clustering
yields superior results.
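A standard way to obtain the highly predictive latent factors that two-block PLS relies on is via the singular vectors of the cross-covariance between the two centred views; the sketch below shows that textbook construction only, not the thesis's full predictive-clustering algorithm:

```python
import numpy as np

def tb_pls_factors(X, Y, k):
    """Return weight vectors for each view: the pairs of directions
    maximising covariance between the projected views are the leading
    singular vectors of the cross-covariance matrix X^T Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    return U[:, :k], Vt[:k].T

# Two views driven by a shared latent variable t, plus noise.
rng = np.random.default_rng(2)
t = rng.normal(size=(200, 1))
X = t @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(200, 5))
Y = t @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(200, 4))
Wx, Wy = tb_pls_factors(X, Y, k=1)
```

Projecting each view onto its leading weight vector recovers two highly correlated score vectors, which is the sense in which the factors are "maximally predictive" between the views.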
Finally, we propose a method for dynamically tracking multivariate data streams
based on PLS. Our method learns a linear regression function from multivariate
input and output streaming data in an incremental fashion while also performing
dimensionality reduction and variable selection. Moreover, the recursive regression
model is able to adapt to sudden changes in the data generating mechanism and also
identifies the number of latent factors. We apply our method to the enhanced index
tracking problem in computational finance.
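As a minimal stand-in for such an incremental learner, the sketch below implements recursive least squares with a forgetting factor, which adapts to sudden changes in the data-generating mechanism; it omits the dimensionality reduction and variable selection that the actual PLS-based method performs:

```python
import numpy as np

class RecursiveLS:
    """Recursive least squares with forgetting factor `forget`:
    each (x, y) pair updates the weights in O(d^2) without
    revisiting past data; forget < 1 downweights old samples."""
    def __init__(self, dim, forget=0.99):
        self.P = np.eye(dim) * 1e3     # inverse covariance estimate
        self.w = np.zeros(dim)
        self.forget = forget

    def update(self, x, y):
        Px = self.P @ x
        gain = Px / (self.forget + x @ Px)
        self.w += gain * (y - x @ self.w)          # correct by residual
        self.P = (self.P - np.outer(gain, Px)) / self.forget
        return self.w

# Stream 300 noiseless samples from a fixed linear map.
rng = np.random.default_rng(3)
w_true = np.array([1.0, -2.0, 0.5])
model = RecursiveLS(dim=3)
for _ in range(300):
    x = rng.normal(size=3)
    model.update(x, float(x @ w_true))
```

The forgetting factor is the knob that trades estimation variance against speed of adaptation after a change point.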
XDC: a proposal for domain integrity constraint control in XML documents
Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Graduate Program in Computer Science. XML (eXtensible Markup Language) has been establishing itself as a standard for exporting data between applications on the Web, thanks to its simple, open textual format. These characteristics make it well suited to representing data from heterogeneous sources. Integrity constraints are mechanisms used to enforce consistency in databases, and they are also used in XML documents.
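As a toy illustration of a domain integrity constraint on an XML document (not the XDC mechanism itself, whose details are not given here; the document and element names are invented), the sketch below checks that every value of an element falls within an allowed range:

```python
import xml.etree.ElementTree as ET

doc = """<people>
  <person><name>Ana</name><age>34</age></person>
  <person><name>Bia</name><age>200</age></person>
</people>"""

def check_age_domain(xml_text, low=0, high=150):
    """Return the names of persons whose <age> violates the
    domain constraint low <= age <= high."""
    root = ET.fromstring(xml_text)
    violations = []
    for person in root.findall("person"):
        age = int(person.findtext("age"))
        if not (low <= age <= high):
            violations.append(person.findtext("name"))
    return violations

violations = check_age_domain(doc)   # ["Bia"]
```

In practice such domain constraints are typically declared in a schema language (e.g. as restricted simple types) rather than checked by ad hoc application code, which is the gap that constraint-control proposals for XML address.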