Learning from aggregated data
Data aggregation is ubiquitous in modern life. For reasons such as privacy, scalability and robustness, ground-truth data is often subjected to aggregation before being released to the public or utilised by researchers and analysts. Learning from aggregated data is a challenging problem that requires significant algorithmic innovation, since naive application of standard techniques to aggregated data is vulnerable to the ecological fallacy. In this work, we explore three different versions of this setting.
First, we tackle the problem of using generalised linear models when features/covariates are fully observed but the targets are only available as histograms, a common scenario in the healthcare domain where many datasets contain both non-sensitive attributes like age, sex and zip code, as well as privacy-sensitive attributes like healthcare records. We introduce an efficient algorithm that alternates between data imputation and GLM estimation steps to learn predictive models in this setting.
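The alternating scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's actual algorithm: it assumes a Gaussian GLM (ordinary least squares), uses bin midpoints as imputed target values, and respects the histogram via a rank-based assignment.

```python
import numpy as np

def fit_glm_from_histogram(X, bin_edges, bin_counts, n_iter=20):
    """Alternate between imputing targets consistent with a histogram
    and refitting a linear (Gaussian GLM) model.

    X          : (n, d) feature matrix
    bin_edges  : histogram bin edges for the hidden targets
    bin_counts : samples per bin (must sum to n)
    """
    n = X.shape[0]
    # Each sample assigned to a bin is imputed with that bin's midpoint.
    mids = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    w = np.zeros(X.shape[1])  # simple initialisation; the scheme is sensitive to it
    for _ in range(n_iter):
        # Imputation step: rank samples by the current prediction and hand
        # out bin midpoints in order, so the histogram constraint holds.
        order = np.argsort(X @ w)
        y_imp = np.empty(n)
        y_imp[order] = np.repeat(mids, bin_counts)
        # Estimation step: ordinary least squares on the imputed targets.
        w, *_ = np.linalg.lstsq(X, y_imp, rcond=None)
    return w
```

In the noiseless one-dimensional case the rank-based imputation pairs features with the right bins after a single iteration; with noise or multiple covariates, more iterations and a careful initialisation matter.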
Next, we look at the problem of learning sparse linear models when both features and targets are in aggregated form, specified as empirical estimates of group-wise means computed over different sub-groups of the population. We show that if the true sub-populations are heterogeneous enough, the optimal sparse parameter can be recovered within an arbitrarily small tolerance even in the presence of noise, provided the empirical estimates are obtained from a sufficiently large number of observations.
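As a toy illustration of recovering a sparse parameter from group-wise aggregates, the sketch below runs ISTA (iterative soft-thresholding, i.e. a generic lasso solver) directly on the group means. The solver, the penalty level and the problem sizes are illustrative assumptions, standing in for the recovery algorithm and guarantees described above.

```python
import numpy as np

def sparse_fit_from_group_means(Xm, ym, lam=0.1, n_iter=500):
    """ISTA on aggregate-level data: recover a sparse w from per-group
    means (Xm, ym), assuming the groups are heterogeneous enough that
    the aggregate system identifies the parameter."""
    L = np.linalg.norm(Xm, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(Xm.shape[1])
    for _ in range(n_iter):
        g = Xm.T @ (Xm @ w - ym)            # gradient of 0.5*||Xm w - ym||^2
        z = w - g / L
        # soft-thresholding enforces sparsity (l1 penalty lam)
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return w
```

Heterogeneity matters here: if the group means were all similar, the aggregate design matrix would be ill-conditioned and the support could not be identified.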
Third, we tackle the scenario of predictive modelling with data that is subjected to spatio-temporal aggregation. We show that by formulating the problem in the frequency domain, we can bypass the mathematical and representational challenges that arise due to non-uniform aggregation, misaligned sampling periods and aliasing. We introduce a novel algorithm that uses restricted Fourier transforms to estimate a linear model which, when applied to spatio-temporally aggregated data, has a generalisation error that is provably close to the optimal performance by the best possible linear model that can be learned from the non-aggregated data set.
We then focus our attention on the complementary problem of designing aggregation strategies that permit learning, as well as developing algorithmic techniques that use only the aggregates to train a model that works on individual samples. We motivate our methods using the example of Gaussian regression, and subsequently extend our techniques to subsume binary classifiers and generalised linear models. We demonstrate the effectiveness of our techniques with empirical evaluations on data from the healthcare and telecommunications domains.
Finally, we present a concrete example of our methods applied to a real-life practical problem. Specifically, we consider an application in the domain of online advertising, where the complexity of bidding strategies requires accurate estimates of the most probable cost-per-click (CPC) incurred by advertisers, but the data used for training these CPC prediction models is only available as aggregated invoices supplied by an ad publisher on a daily or hourly basis. We introduce a novel learning framework that can use aggregates computed at varying levels of granularity for building individual-level predictive models. We generalise our modelling and algorithmic framework to handle data from diverse domains, and extend our techniques to cover arbitrary aggregation paradigms like sliding windows and overlapping/non-uniform aggregation. We show empirical evidence for the efficacy of our techniques with experiments on both synthetic data and real data from the online advertising and healthcare domains, demonstrating the wider applicability of our framework.
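The core observation that makes individual-level learning from aggregates possible for linear models can be sketched in a few lines: averaging commutes with a linear model, so per-group means obey the same linear relation as individuals. This is a minimal illustration, not the thesis's framework; the group structure and noiseless setting are assumptions.

```python
import numpy as np

def fit_from_group_means(group_X_means, group_y_means):
    """Least-squares fit using only per-group aggregates.  Because a
    linear model commutes with averaging, the group-level means satisfy
    the same linear relation as the underlying individuals, so the
    fitted w also applies to individual samples."""
    w, *_ = np.linalg.lstsq(group_X_means, group_y_means, rcond=None)
    return w
```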
Predicting Multi-class Customer Profiles Based on Transactions: a Case Study in Food Sales
Predicting the class of a customer profile is a key task in marketing, enabling businesses to approach the right customer with the right product at the right time through the right channel to satisfy the customer's evolving needs. However, due to costs, privacy and/or data protection, only the business's own transactional data is typically available for constructing customer profiles. Predicting the class of customer profiles based on such data is challenging, as the data tends to be very large, heavily sparse and highly skewed. We present a new approach designed to efficiently and accurately handle the multi-class classification of customer profiles built from sparse and skewed transactional data. Our approach first bins the customer profiles by the number of items transacted. The discovered bins are then partitioned, and prototypes within each bin are selected to build the multi-class classifier models. The results obtained from using four multi-class classifiers on real-world transactional data from the food sales domain consistently show the critical numbers of items at which the predictive performance of customer profile classification can be substantially improved.
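The first step of the approach, binning profiles by transacted item count so that a separate classifier can be trained per bin, can be sketched as follows. The bin edges and the sparse 0/1-count representation are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def bin_by_item_count(profiles, bin_edges):
    """Group sparse customer profiles by how many distinct items each
    customer transacted, returning {bin_id: row indices}; a separate
    classifier can then be trained on the prototypes of each bin."""
    counts = (profiles > 0).sum(axis=1)      # distinct items per customer
    bin_ids = np.digitize(counts, bin_edges) # assign each count to a bin
    return {b: np.where(bin_ids == b)[0] for b in np.unique(bin_ids)}
```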
Improving adaptation and interpretability of a short-term traffic forecasting system
Traffic management is more important than ever, especially in overcrowded big cities facing severe pollution and unprecedented changes in mobility. In this scenario, road-traffic prediction plays a key role within Intelligent Transportation Systems, allowing traffic managers to anticipate and take the proper decisions. This paper analyses a commercial real-time prediction system, together with its current problems and limitations. The analysis unveils the trade-off between simple parsimonious models and more complex ones. Finally, we propose an enriched machine learning framework, Adarules, for real-time traffic prediction, framing the problem as learning from continuously incoming data streams subject to all the problems that commonly occur in such a volatile scenario: changes in the network infrastructure and demand, new detection stations, failing stations, and so on. The framework can also automatically infer the features most relevant to the end task, including the relationships within the road network. Although the framework is intended to evolve and grow with newly incoming big data, it can equally be deployed without any prior knowledge, since it learns its structure and parameters automatically from the data. We test this predictive system in different real-world scenarios, and evaluate its performance integrating a multi-task learning paradigm for the traffic prediction task.
MORE: Merged Opinions Reputation Model
Reputation is generally defined as the opinion of a group on an aspect of a
thing. This paper presents a reputation model that follows a probabilistic
modelling of opinions based on three main concepts: (1) the value of an opinion
decays with time, (2) the reputation of the opinion source impacts the
reliability of the opinion, and (3) the certainty of the opinion impacts its
weight with respect to other opinions. Furthermore, the model is flexible with
its opinion sources: it may use explicit opinions or implicit opinions that can
be extracted from agent behavior in domains where explicit opinions are sparse.
We illustrate the latter with an approach to extract opinions from behavioral
information in the sports domain, focusing on football in particular. One of
the uses of a reputation model is predicting behavior. We take up the challenge
of predicting the behavior of football teams in football matches, which we
argue is an interesting yet challenging way to evaluate the model.
Comment: 12th European Conference on Multi-Agent Systems (EUMAS 2014)
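The three weighting concepts of the MORE model can be sketched as a single scoring function. The tuple layout (value, timestamp, source reputation, certainty) and the exponential half-life decay are illustrative assumptions, not the paper's exact probabilistic formulation.

```python
import math

def reputation(opinions, now, half_life=30.0):
    """Combine opinions into one reputation score.  Each opinion's
    weight (1) decays exponentially with its age, (2) is scaled by the
    reputation of its source, and (3) is scaled by its certainty."""
    num = den = 0.0
    for value, t, source_rep, certainty in opinions:
        w = 0.5 ** ((now - t) / half_life) * source_rep * certainty
        num += w * value
        den += w
    return num / den if den else 0.0
```

For example, a fresh opinion of 0.0 outweighs a 30-day-old (one half-life) opinion of 1.0 from an equally reputable, equally certain source, pulling the score to 1/3 rather than the naive average of 1/2.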
Uncovering predictability in the evolution of the WTI oil futures curve
Accurately forecasting the price of oil, the world's most actively traded
commodity, is of great importance to both academics and practitioners. We
contribute by proposing a functional time series based method to model and
forecast oil futures. Our approach boasts a number of theoretical and practical
advantages including effectively exploiting underlying process dynamics missed
by classical discrete approaches. We evaluate the finite-sample performance
against established benchmarks using a model confidence set test. A realistic
out-of-sample exercise provides strong support for the adoption of our approach
with it residing in the superior set of models in all considered instances.
Comment: 28 pages, 4 figures, to appear in European Financial Management
On the Inability of Markov Models to Capture Criticality in Human Mobility
We examine the non-Markovian nature of human mobility by exposing the
inability of Markov models to capture criticality in human mobility. In
particular, the assumed Markovian nature of mobility was used to establish a
theoretical upper bound on the predictability of human mobility (expressed as a
minimum error probability limit), based on temporally correlated entropy. Since
its inception, this bound has been widely used and empirically validated using
Markov chains. We show that recurrent-neural architectures can achieve
significantly higher predictability, surpassing this widely used upper bound.
In order to explain this anomaly, we shed light on several underlying
assumptions in previous research that have resulted in this bias. By
evaluating the mobility predictability on real-world datasets, we show that
human mobility exhibits scale-invariant long-range correlations, bearing
similarity to a power-law decay. This is in contrast to the initial assumption
that human mobility follows an exponential decay. This assumption of
exponential decay coupled with Lempel-Ziv compression in computing Fano's
inequality has led to an inaccurate estimation of the predictability upper
bound. We show that this approach inflates the entropy, consequently lowering
the upper bound on human mobility predictability. We finally highlight that
this approach tends to overlook long-range correlations in human mobility. This
explains why recurrent-neural architectures that are designed to handle
long-range structural correlations surpass the previously computed upper bound
on mobility predictability.
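The predictability upper bound discussed above is obtained by inverting Fano's inequality given an entropy-rate estimate S and N distinct locations. A minimal sketch (bisection on the binary-entropy form), independent of how S itself is estimated, might look like this:

```python
import math

def max_predictability(S, N):
    """Solve Fano's inequality  S = H(p) + (1 - p) * log2(N - 1)
    for the predictability upper bound p, given an entropy-rate
    estimate S (bits) and N distinct locations.  The left-hand side
    is strictly decreasing in p on (1/N, 1), so bisection applies."""
    def H(p):  # binary entropy in bits
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    lo, hi = 1.0 / N, 1.0 - 1e-12
    for _ in range(100):
        mid = (lo + hi) / 2
        if H(mid) + (1 - mid) * math.log2(N - 1) > S:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Note the abstract's point: if the entropy estimate S is inflated (e.g. by assuming exponentially decaying correlations), this bound comes out too low, which is exactly how recurrent architectures can appear to "beat" it.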
Nonfractional Memory: Filtering, Antipersistence, and Forecasting
The fractional difference operator remains the most popular mechanism
for generating long memory, owing to the existence of efficient algorithms for its
simulation and forecasting. Nonetheless, there is no theoretical argument
linking the fractional difference operator with the presence of long memory in
real data. In this regard, one of the most predominant theoretical explanations
for the presence of long memory is cross-sectional aggregation of persistent
micro units. Yet, the type of processes obtained by cross-sectional aggregation
differs from the one due to fractional differencing. Thus, this paper develops
fast algorithms to generate and forecast long memory by cross-sectional
aggregation. Moreover, it is shown that the antipersistent phenomenon that
arises for negative degrees of memory in the fractional difference literature
is not present for cross-sectionally aggregated processes. Pointedly, while the
autocorrelations for the fractional difference operator are negative for
negative degrees of memory by construction, this restriction does not apply to
the cross-sectional aggregated scheme. We show that this has implications for
long memory tests in the frequency domain, which will be misspecified for
cross-sectionally aggregated processes with negative degrees of memory.
Finally, we assess the forecast performance of high-order and
models when the long memory series are generated by cross-sectional
aggregation. Our results are of interest to practitioners developing forecasts
of long memory variables like inflation, volatility, and climate data, where
aggregation may be the source of long memory.
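A quick numerical illustration of long memory arising from cross-sectional aggregation, in the spirit of Granger's classic construction (the Beta mixing distribution, unit count and seed are illustrative assumptions, not the paper's algorithms):

```python
import numpy as np

def aggregate_ar1(n_units=500, T=2000, seed=0):
    """Cross-sectionally aggregate AR(1) micro units whose squared
    coefficients are Beta-distributed with mass near 1.  Each unit is
    short-memory, but the aggregate shows slowly decaying
    autocorrelation, i.e. long memory."""
    rng = np.random.default_rng(seed)
    phi = np.sqrt(rng.beta(2.0, 0.5, size=n_units))  # AR coefficients near 1
    x = np.zeros(n_units)
    agg = np.empty(T)
    for t in range(T):
        x = phi * x + rng.normal(size=n_units)       # update every micro unit
        agg[t] = x.mean()                            # observe only the aggregate
    return agg
```

The autocorrelation of the aggregate stays well above zero even at long lags, unlike any individual AR(1) component, whose autocorrelation decays geometrically.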
Wrapped feature selection for neural networks in direct marketing.
In this paper, we try to validate existing theory on, and develop additional insight into, repeat purchasing behaviour in a direct-marketing setting by means of an illuminating case study. The case involves the detection and qualification of the most relevant RFM (Recency, Frequency and Monetary) features, using a wrapped feature selection method in a neural network context. Results indicate that eliminating redundant/irrelevant features by means of the discussed feature selection method allows us to significantly reduce model complexity without degrading generalisation ability. It is precisely this that allows us to infer some very interesting marketing conclusions concerning the relative importance of the RFM-predictor categories. The empirical findings highlight the importance of a combined use of all three RFM variables in predicting repeat purchase behaviour. However, the study also reveals the dominant role of the frequency variable: a model including only frequency variables still yields satisfactory classification accuracy compared to the optimally reduced model.
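A minimal sketch of the wrapper idea: greedily drop the feature whose removal least hurts held-out accuracy. A least-squares linear scorer stands in for the paper's neural network, and the train/test split, tolerance and stopping rule are illustrative assumptions.

```python
import numpy as np

def wrapped_backward_selection(X, y, min_features=1, tol=0.05):
    """Greedy backward wrapper feature selection: repeatedly remove the
    feature whose removal least hurts held-out accuracy, stopping once
    any removal costs more than `tol` accuracy."""
    def score(cols):
        half = len(y) // 2
        A = np.c_[X[:half][:, cols], np.ones(half)]           # train half + bias
        B = np.c_[X[half:][:, cols], np.ones(len(y) - half)]  # held-out half
        w, *_ = np.linalg.lstsq(A, y[:half], rcond=None)
        return float(np.mean((B @ w > 0.5) == (y[half:] > 0.5)))
    cols = list(range(X.shape[1]))
    while len(cols) > min_features:
        best, drop = max((score([c for c in cols if c != d]), d) for d in cols)
        if best < score(cols) - tol:   # every removal hurts too much: stop
            break
        cols.remove(drop)
    return cols
```

The wrapper re-evaluates the model for every candidate removal, which is exactly why the paper's redundant RFM features can be pruned without degrading generalisation: the decision is based on predictive performance, not on a filter statistic.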