Learning from aggregated data
Data aggregation is ubiquitous in modern life. For reasons such as privacy, scalability and robustness, ground-truth data is often subjected to aggregation before being released to the public or utilised by researchers and analysts. Learning from aggregated data is a challenging problem that requires significant algorithmic innovation, since naive application of standard techniques to aggregated data is vulnerable to the ecological fallacy. In this work, we explore three different versions of this setting.
First, we tackle the problem of using generalised linear models when features/covariates are fully observed but the targets are only available as histograms, a common scenario in the healthcare domain where many datasets contain both non-sensitive attributes like age, sex and zip code, as well as privacy-sensitive attributes like healthcare records. We introduce an efficient algorithm that alternates between data imputation and GLM estimation steps to learn predictive models in this setting.
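The alternating scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's actual algorithm: it assumes a Gaussian GLM (ordinary least squares), uses bin midpoints as imputed target values, and respects the histogram via a rank-based assignment.

```python
import numpy as np

def fit_glm_from_histogram(X, bin_edges, bin_counts, n_iter=20):
    """Alternate between imputing targets consistent with a histogram
    and refitting a linear (Gaussian GLM) model.

    X          : (n, d) feature matrix
    bin_edges  : histogram bin edges for the hidden targets
    bin_counts : samples per bin (must sum to n)
    """
    n = X.shape[0]
    # Each sample assigned to a bin is imputed with that bin's midpoint.
    mids = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    w = np.zeros(X.shape[1])  # simple initialisation; the scheme is sensitive to it
    for _ in range(n_iter):
        # Imputation step: rank samples by the current prediction and hand
        # out bin midpoints in order, so the histogram constraint holds.
        order = np.argsort(X @ w)
        y_imp = np.empty(n)
        y_imp[order] = np.repeat(mids, bin_counts)
        # Estimation step: ordinary least squares on the imputed targets.
        w, *_ = np.linalg.lstsq(X, y_imp, rcond=None)
    return w
```

In the noiseless one-dimensional case the rank-based imputation pairs features with the right bins after a single iteration; with noise or multiple covariates, more iterations and a careful initialisation matter.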
Next, we look at the problem of learning sparse linear models when both features and targets are in aggregated form, specified as empirical estimates of group-wise means computed over different sub-groups of the population. We show that if the true sub-populations are heterogeneous enough, the optimal sparse parameter can be recovered within an arbitrarily small tolerance even in the presence of noise, provided the empirical estimates are obtained from a sufficiently large number of observations.
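As a toy illustration of recovering a sparse parameter from group-wise aggregates, the sketch below runs ISTA (iterative soft-thresholding, i.e. a generic lasso solver) directly on the group means. The solver, the penalty level and the problem sizes are illustrative assumptions, standing in for the recovery algorithm and guarantees described above.

```python
import numpy as np

def sparse_fit_from_group_means(Xm, ym, lam=0.1, n_iter=500):
    """ISTA on aggregate-level data: recover a sparse w from per-group
    means (Xm, ym), assuming the groups are heterogeneous enough that
    the aggregate system identifies the parameter."""
    L = np.linalg.norm(Xm, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(Xm.shape[1])
    for _ in range(n_iter):
        g = Xm.T @ (Xm @ w - ym)            # gradient of 0.5*||Xm w - ym||^2
        z = w - g / L
        # soft-thresholding enforces sparsity (l1 penalty lam)
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return w
```

Heterogeneity matters here: if the group means were all similar, the aggregate design matrix would be ill-conditioned and the support could not be identified.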
Third, we tackle the scenario of predictive modelling with data that is subjected to spatio-temporal aggregation. We show that by formulating the problem in the frequency domain, we can bypass the mathematical and representational challenges that arise due to non-uniform aggregation, misaligned sampling periods and aliasing. We introduce a novel algorithm that uses restricted Fourier transforms to estimate a linear model which, when applied to spatio-temporally aggregated data, has a generalisation error that is provably close to the optimal performance by the best possible linear model that can be learned from the non-aggregated data set.
We then focus our attention on the complementary problem of designing aggregation strategies that permit learning, as well as developing algorithmic techniques that use only the aggregates to train a model that works on individual samples. We motivate our methods using the example of Gaussian regression, and subsequently extend our techniques to subsume binary classifiers and generalised linear models. We demonstrate the effectiveness of our techniques with empirical evaluations on data from the healthcare and telecommunications domains.
Finally, we present a concrete example of our methods applied to a real-life practical problem. Specifically, we consider an application in the domain of online advertising, where the complexity of bidding strategies requires accurate estimates of the most probable cost-per-click (CPC) incurred by advertisers, but the data used for training these CPC prediction models is only available as aggregated invoices supplied by an ad publisher on a daily or hourly basis. We introduce a novel learning framework that can use aggregates computed at varying levels of granularity for building individual-level predictive models. We generalise our modelling and algorithmic framework to handle data from diverse domains, and extend our techniques to cover arbitrary aggregation paradigms like sliding windows and overlapping/non-uniform aggregation. We show empirical evidence for the efficacy of our techniques with experiments on both synthetic data and real data from the online advertising and healthcare domains, demonstrating the wider applicability of our framework.
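The core observation that makes individual-level learning from aggregates possible for linear models can be sketched in a few lines: averaging commutes with a linear model, so per-group means obey the same linear relation as individuals. This is a minimal illustration, not the thesis's framework; the group structure and noiseless setting are assumptions.

```python
import numpy as np

def fit_from_group_means(group_X_means, group_y_means):
    """Least-squares fit using only per-group aggregates.  Because a
    linear model commutes with averaging, the group-level means satisfy
    the same linear relation as the underlying individuals, so the
    fitted w also applies to individual samples."""
    w, *_ = np.linalg.lstsq(group_X_means, group_y_means, rcond=None)
    return w
```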
Predicting Multi-class Customer Profiles Based on Transactions: a Case Study in Food Sales
Predicting the class of a customer profile is a key task in marketing, enabling businesses to approach the right customer with the right product at the right time through the right channel to satisfy the customer's evolving needs. However, due to costs, privacy and/or data protection, only the business's own transactional data is typically available for constructing customer profiles. Predicting the class of customer profiles based on such data is challenging, as the data tends to be very large, heavily sparse and highly skewed. We present a new approach designed to efficiently and accurately handle the multi-class classification of customer profiles built from sparse and skewed transactional data. Our approach first bins the customer profiles by the number of items transacted. The discovered bins are then partitioned, and prototypes within each bin are selected to build the multi-class classifier models. The results obtained from using four multi-class classifiers on real-world transactional data from the food sales domain consistently show the critical numbers of items at which the predictive performance of customer profile classification can be substantially improved.
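The first step of the approach, binning profiles by transacted item count so that a separate classifier can be trained per bin, can be sketched as follows. The bin edges and the sparse 0/1-count representation are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def bin_by_item_count(profiles, bin_edges):
    """Group sparse customer profiles by how many distinct items each
    customer transacted, returning {bin_id: row indices}; a separate
    classifier can then be trained on the prototypes of each bin."""
    counts = (profiles > 0).sum(axis=1)      # distinct items per customer
    bin_ids = np.digitize(counts, bin_edges) # assign each count to a bin
    return {b: np.where(bin_ids == b)[0] for b in np.unique(bin_ids)}
```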
Improving adaptation and interpretability of a short-term traffic forecasting system
Traffic management is more important than ever, especially in overcrowded big cities facing severe pollution and unprecedented changes in mobility. In this scenario, road-traffic prediction plays a key role within Intelligent Transportation Systems, allowing traffic managers to anticipate and take the proper decisions. This paper analyses a commercial real-time prediction system, together with its current problems and limitations. The analysis unveils the trade-off between simple parsimonious models and more complex ones. Finally, we propose an enriched machine learning framework, Adarules, for real-time traffic prediction, framing the problem as learning from continuously incoming data streams subject to all the problems that commonly occur in such a volatile scenario: changes in the network infrastructure and demand, new detection stations, failing stations, and so on. The framework can also automatically infer the features most relevant to the end task, including the relationships within the road network. Although the framework is intended to evolve and grow with newly incoming big data, it can equally be deployed without any prior knowledge, since it learns its structure and parameters automatically from the data. We test this predictive system in different real-world scenarios, and evaluate its performance integrating a multi-task learning paradigm for the traffic prediction task.
MORE: Merged Opinions Reputation Model
Reputation is generally defined as the opinion of a group on an aspect of a
thing. This paper presents a reputation model that follows a probabilistic
modelling of opinions based on three main concepts: (1) the value of an opinion
decays with time, (2) the reputation of the opinion source impacts the
reliability of the opinion, and (3) the certainty of the opinion impacts its
weight with respect to other opinions. Furthermore, the model is flexible with
its opinion sources: it may use explicit opinions or implicit opinions that can
be extracted from agent behavior in domains where explicit opinions are sparse.
We illustrate the latter with an approach to extract opinions from behavioral
information in the sports domain, focusing on football in particular. One of
the uses of a reputation model is predicting behavior. We take up the challenge
of predicting the behavior of football teams in football matches, which we
argue is an interesting yet challenging way to evaluate the model.
Comment: 12th European Conference on Multi-Agent Systems (EUMAS 2014)
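The three weighting concepts of the MORE model can be sketched as a single scoring function. The tuple layout (value, timestamp, source reputation, certainty) and the exponential half-life decay are illustrative assumptions, not the paper's exact probabilistic formulation.

```python
import math

def reputation(opinions, now, half_life=30.0):
    """Combine opinions into one reputation score.  Each opinion's
    weight (1) decays exponentially with its age, (2) is scaled by the
    reputation of its source, and (3) is scaled by its certainty."""
    num = den = 0.0
    for value, t, source_rep, certainty in opinions:
        w = 0.5 ** ((now - t) / half_life) * source_rep * certainty
        num += w * value
        den += w
    return num / den if den else 0.0
```

For example, a fresh opinion of 0.0 outweighs a 30-day-old (one half-life) opinion of 1.0 from an equally reputable, equally certain source, pulling the score to 1/3 rather than the naive average of 1/2.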
Uncovering predictability in the evolution of the WTI oil futures curve
Accurately forecasting the price of oil, the world's most actively traded
commodity, is of great importance to both academics and practitioners. We
contribute by proposing a functional time series based method to model and
forecast oil futures. Our approach boasts a number of theoretical and practical
advantages including effectively exploiting underlying process dynamics missed
by classical discrete approaches. We evaluate the finite-sample performance
against established benchmarks using a model confidence set test. A realistic
out-of-sample exercise provides strong support for the adoption of our approach
with it residing in the superior set of models in all considered instances.
Comment: 28 pages, 4 figures, to appear in European Financial Management
On the Inability of Markov Models to Capture Criticality in Human Mobility
We examine the non-Markovian nature of human mobility by exposing the
inability of Markov models to capture criticality in human mobility. In
particular, the assumed Markovian nature of mobility was used to establish a
theoretical upper bound on the predictability of human mobility (expressed as a
minimum error probability limit), based on temporally correlated entropy. Since
its inception, this bound has been widely used and empirically validated using
Markov chains. We show that recurrent-neural architectures can achieve
significantly higher predictability, surpassing this widely used upper bound.
In order to explain this anomaly, we shed light on several underlying
assumptions in previous research that have resulted in this bias. By
evaluating the mobility predictability on real-world datasets, we show that
human mobility exhibits scale-invariant long-range correlations, bearing
similarity to a power-law decay. This is in contrast to the initial assumption
that human mobility follows an exponential decay. This assumption of
exponential decay coupled with Lempel-Ziv compression in computing Fano's
inequality has led to an inaccurate estimation of the predictability upper
bound. We show that this approach inflates the entropy, consequently lowering
the upper bound on human mobility predictability. We finally highlight that
this approach tends to overlook long-range correlations in human mobility. This
explains why recurrent-neural architectures that are designed to handle
long-range structural correlations surpass the previously computed upper bound
on mobility predictability.
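The predictability upper bound discussed above is obtained by inverting Fano's inequality given an entropy-rate estimate S and N distinct locations. A minimal sketch (bisection on the binary-entropy form), independent of how S itself is estimated, might look like this:

```python
import math

def max_predictability(S, N):
    """Solve Fano's inequality  S = H(p) + (1 - p) * log2(N - 1)
    for the predictability upper bound p, given an entropy-rate
    estimate S (bits) and N distinct locations.  The left-hand side
    is strictly decreasing in p on (1/N, 1), so bisection applies."""
    def H(p):  # binary entropy in bits
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    lo, hi = 1.0 / N, 1.0 - 1e-12
    for _ in range(100):
        mid = (lo + hi) / 2
        if H(mid) + (1 - mid) * math.log2(N - 1) > S:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Note the abstract's point: if the entropy estimate S is inflated (e.g. by assuming exponentially decaying correlations), this bound comes out too low, which is exactly how recurrent architectures can appear to "beat" it.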
Nonfractional Memory: Filtering, Antipersistence, and Forecasting
The fractional difference operator remains the most popular mechanism
for generating long memory, owing to the existence of efficient algorithms for its
simulation and forecasting. Nonetheless, there is no theoretical argument
linking the fractional difference operator with the presence of long memory in
real data. In this regard, one of the most predominant theoretical explanations
for the presence of long memory is cross-sectional aggregation of persistent
micro units. Yet, the type of processes obtained by cross-sectional aggregation
differs from the one due to fractional differencing. Thus, this paper develops
fast algorithms to generate and forecast long memory by cross-sectional
aggregation. Moreover, it is shown that the antipersistent phenomenon that
arises for negative degrees of memory in the fractional difference literature
is not present for cross-sectionally aggregated processes. Pointedly, while the
autocorrelations for the fractional difference operator are negative for
negative degrees of memory by construction, this restriction does not apply to
the cross-sectional aggregated scheme. We show that this has implications for
long memory tests in the frequency domain, which will be misspecified for
cross-sectionally aggregated processes with negative degrees of memory.
Finally, we assess the forecast performance of high-order and
models when the long memory series are generated by cross-sectional
aggregation. Our results are of interest to practitioners developing forecasts
of long memory variables like inflation, volatility, and climate data, where
aggregation may be the source of long memory.
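A quick numerical illustration of long memory arising from cross-sectional aggregation, in the spirit of Granger's classic construction (the Beta mixing distribution, unit count and seed are illustrative assumptions, not the paper's algorithms):

```python
import numpy as np

def aggregate_ar1(n_units=500, T=2000, seed=0):
    """Cross-sectionally aggregate AR(1) micro units whose squared
    coefficients are Beta-distributed with mass near 1.  Each unit is
    short-memory, but the aggregate shows slowly decaying
    autocorrelation, i.e. long memory."""
    rng = np.random.default_rng(seed)
    phi = np.sqrt(rng.beta(2.0, 0.5, size=n_units))  # AR coefficients near 1
    x = np.zeros(n_units)
    agg = np.empty(T)
    for t in range(T):
        x = phi * x + rng.normal(size=n_units)       # update every micro unit
        agg[t] = x.mean()                            # observe only the aggregate
    return agg
```

The autocorrelation of the aggregate stays well above zero even at long lags, unlike any individual AR(1) component, whose autocorrelation decays geometrically.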
Wrapped feature selection for neural networks in direct marketing.
In this paper, we try to validate existing theory on, and develop additional insight into, repeat purchasing behaviour in a direct-marketing setting by means of an illuminating case study. The case involves the detection and qualification of the most relevant RFM (Recency, Frequency and Monetary) features, using a wrapped feature selection method in a neural network context. Results indicate that eliminating redundant/irrelevant features by means of the discussed feature selection method allows us to significantly reduce model complexity without degrading generalisation ability. It is precisely this that allows us to infer some very interesting marketing conclusions concerning the relative importance of the RFM-predictor categories. The empirical findings highlight the importance of a combined use of all three RFM variables in predicting repeat purchase behaviour. However, the study also reveals the dominant role of the frequency variable: a model including only frequency variables still yields satisfactory classification accuracy compared to the optimally reduced model.
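A minimal sketch of the wrapper idea: greedily drop the feature whose removal least hurts held-out accuracy. A least-squares linear scorer stands in for the paper's neural network, and the train/test split, tolerance and stopping rule are illustrative assumptions.

```python
import numpy as np

def wrapped_backward_selection(X, y, min_features=1, tol=0.05):
    """Greedy backward wrapper feature selection: repeatedly remove the
    feature whose removal least hurts held-out accuracy, stopping once
    any removal costs more than `tol` accuracy."""
    def score(cols):
        half = len(y) // 2
        A = np.c_[X[:half][:, cols], np.ones(half)]           # train half + bias
        B = np.c_[X[half:][:, cols], np.ones(len(y) - half)]  # held-out half
        w, *_ = np.linalg.lstsq(A, y[:half], rcond=None)
        return float(np.mean((B @ w > 0.5) == (y[half:] > 0.5)))
    cols = list(range(X.shape[1]))
    while len(cols) > min_features:
        best, drop = max((score([c for c in cols if c != d]), d) for d in cols)
        if best < score(cols) - tol:   # every removal hurts too much: stop
            break
        cols.remove(drop)
    return cols
```

The wrapper re-evaluates the model for every candidate removal, which is exactly why the paper's redundant RFM features can be pruned without degrading generalisation: the decision is based on predictive performance, not on a filter statistic.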