In machine learning, a bias occurs whenever training sets are not
representative of the test data, which results in unreliable models. The most
common biases in data are arguably class imbalance and covariate shift. In this
work, we aim to shed light on this topic in order to increase the overall
attention to this issue in the field of machine learning. We propose a scalable
novel framework for reducing multiple biases in high-dimensional data sets in
order to train more reliable predictors. We apply our methodology to the
detection of irregular power usage from real, noisy industrial data. In
emerging markets, irregular power usage, and electricity theft in particular,
may range up to 40% of the total electricity distributed. Biased data sets are
a particular issue in this domain. We show that reducing these biases
increases the accuracy of the trained predictors. Our models have the potential
to generate significant economic value in a real-world application, as they are
being deployed in commercial software for the detection of irregular power
usage.

Duarte, Diogo

Glauner, Patrick

State, Radu

Valtchev, Petko

English

arXiv

peer reviewed

Open Repository and Bibliography - Luxembourg

ON THE REDUCTION OF BIASES IN BIG DATA SETS FOR THE DETECTION OF IRREGULAR POWER USAGE

PATRICK GLAUNER and RADU STATE
Interdisciplinary Centre for Security, Reliability and Trust, University of Luxembourg
1855 Luxembourg, Luxembourg
Email: {patrick.glauner, radu.state}@uni.lu

PETKO VALTCHEV
Department of Computer Science, University of Quebec in Montreal
H3C 3P8 Montreal, Canada
Email: valtchev.petko@uqam.ca

DIOGO DUARTE
CHOICE Technologies Holding Sàrl
2453 Luxembourg, Luxembourg
Email: diogo.duarte@choiceholding.com

In machine learning, a bias occurs whenever training sets are not representative of the test data, which results in unreliable models. The most common biases in data are arguably class imbalance and covariate shift. In this work, we aim to shed light on this topic in order to increase the overall attention to this issue in the field of machine learning. We propose a scalable novel framework for reducing multiple biases in high-dimensional data sets in order to train more reliable predictors. We apply our methodology to the detection of irregular power usage from real, noisy industrial data. In emerging markets, irregular power usage, and electricity theft in particular, may range up to 40% of the total electricity distributed. Biased data sets are a particular issue in this domain. We show that reducing these biases increases the accuracy of the trained predictors. Our models have the potential to generate significant economic value in a real-world application, as they are being deployed in commercial software for the detection of irregular power usage.

Keywords: Bias; Class Imbalance; Covariate Shift; Non-Technical Losses.

1. Introduction

The contemporary Big Data paradigm can be summarized as follows: “It’s not who has the best algorithm that wins. It’s who has the most data.” [1] However, in many cases, increasing the amount of data is not a panacea, since the data can be biased. One frequently appearing bias results in training data and test data having different distributions, as depicted in Fig. 1. Learning from such training data leads to unreliable predictors that are not able to generalize to the test data. In the literature, this sort of bias is called covariate shift, sampling bias or sample selection bias. Covariate shift has been recognized as an issue in statistics since the mid-20th century [2]. In contrast, it has received only limited attention in machine learning, mainly within the computational learning theory subfield, yet the situation is currently evolving [3,4].

Fig. 1. Example of covariate shift: training and test data having different distributions. (Legend: Training, Test.)

Non-technical losses (NTL) appear in power grids during distribution and include, but are not limited to, the following causes: meter tampering in order to record lower consumption, bypassing meters by rigging lines from the power source, arranged false meter readings by bribing meter readers, or faulty or broken meters. NTL are more common in emerging countries, where electricity theft is the main contributor. NTL are reported to range up to 40% of the total electricity distributed in countries such as Brazil, India, Malaysia or Pakistan [5,6]. NTL are the source of major concerns for electricity providers, including financial losses and a decrease of stability and reliability in power grids. It is therefore crucial to detect customers that cause NTL. Recent research on NTL detection mainly uses machine learning models that learn anomalous behavior from customer data and known irregular behavior that was reported through on-site inspection results. In order to detect NTL more accurately, one may assume that simply having more customer and inspection data would help.
We have previously shown that in many cases, the set of inspected customers is biased [2]. A reason for that is that past inspections have been largely focused on certain criteria and were not sufficiently spread across the population.

This paper builds on top of our previous contributions and aims at bias reduction in data, and further at more generalizable NTL predictors. Its main contributions are:

• We present a framework for reducing biases in data, such as class imbalance and covariate shift, in particular for spatial data.
• We propose a scalable novel methodology for reducing multiple biases in high-dimensional data sets at the same time.
• We report on how our method performs on the detection of NTL. Our method leads to a better detection of anomalous customers, subsequently reduces losses of electricity providers and thus increases stability and reliability of power distribution infrastructure.

2. Background and Related Work

In supervised learning, training examples $(x^{(i)}, y^{(i)})$ are drawn from a training distribution $P_{\text{train}}(X, Y)$, where $X$ denotes the data and $Y$ the label. The training set is biased if $P_{\text{train}}(X, Y) \neq P_{\text{test}}(X, Y)$. In order to reduce the bias, it has been shown that example $(x^{(i)}, y^{(i)})$ can be weighted during training as follows [7]:

$$w_i = \frac{P_{\text{test}}(x^{(i)}, y^{(i)})}{P_{\text{train}}(x^{(i)}, y^{(i)})}.$$

However, computing $P_{\text{train}}(x^{(i)}, y^{(i)})$ is impractical because of the limited amount of data in the training domain. It is for that reason that in the literature, predominantly two different types of biases are discussed: class imbalance and covariate shift. Class imbalance refers to the case where classes are unequally represented in the data.
In this case, we assume $P_{\text{train}}(X|Y) = P_{\text{test}}(X|Y)$, but $P_{\text{train}}(Y) \neq P_{\text{test}}(Y)$ [8]. In contrast, for covariate shift, we assume $P_{\text{train}}(Y|X) = P_{\text{test}}(Y|X)$, but $P_{\text{train}}(X) \neq P_{\text{test}}(X)$ [9]. Instance weighting using density estimation has been proposed for correcting covariate shift [3]. Furthermore, the Heckman method has been proposed to correct covariate shift [10]. However, the Heckman method only applies to linear regression models. Other biases are reported in the literature, for example a change of functional relations, i.e. when $P_{\text{train}}(Y|X) \neq P_{\text{test}}(Y|X)$, or biases created by transforming the feature space [7].

3. Reduction of Biases

We propose the following methodology: Given the assumptions made for class imbalance, we compute the corresponding weight for example $i$ having a label of class $k$ as follows:

$$w_{i,k} = \frac{P_{\text{test}}(x^{(i)}, y_k^{(i)})}{P_{\text{train}}(x^{(i)}, y_k^{(i)})} = \frac{P_{\text{test}}(x^{(i)}|y_k^{(i)})\,P_{\text{test}}(y_k^{(i)})}{P_{\text{train}}(x^{(i)}|y_k^{(i)})\,P_{\text{train}}(y_k^{(i)})} = \frac{P_{\text{test}}(y_k^{(i)})}{P_{\text{train}}(y_k^{(i)})}.$$

We use the empirical counts of classes for computing $P_{\langle\text{dist}\rangle}(y_k)$. Given the assumptions made for covariate shift, we compute the corresponding weight for the bias in feature $k$ of example $i$ as follows:

$$w_{i,k} = \frac{P_{\text{test}}(x_k^{(i)}, y^{(i)})}{P_{\text{train}}(x_k^{(i)}, y^{(i)})} = \frac{P_{\text{test}}(y^{(i)}|x_k^{(i)})\,P_{\text{test}}(x_k^{(i)})}{P_{\text{train}}(y^{(i)}|x_k^{(i)})\,P_{\text{train}}(x_k^{(i)})} = \frac{P_{\text{test}}(x_k^{(i)})}{P_{\text{train}}(x_k^{(i)})}.$$

We use density estimation for computing $P_{\langle\text{dist}\rangle}(x_k^{(i)})$ [11].

A learning problem may contain a variety of biases that go far beyond class imbalance and covariate shift on a single dimension. We have shown previously that there may be multiple types of covariate shift, for example spatial covariate shifts on different hierarchical levels. There may also be covariate shifts for other master data, such as the customer class or the contract status [2]. We now aim to correct $n$ different biases at the same time, e.g. class imbalance as well as different types of covariate shift.
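
The two per-bias weights defined in Sec. 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the class weights assume that labels (or at least class proportions) representative of the test distribution are available, and the hand-written one-dimensional Gaussian kernel density estimate with a fixed bandwidth merely stands in for a library density estimator whose kernel and bandwidth would be selected by search.

```python
import math
from collections import Counter


def class_imbalance_weights(train_labels, test_labels):
    """Weight each training example by P_test(y) / P_train(y), using empirical counts."""
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    n_train, n_test = len(train_labels), len(test_labels)
    return [
        (test_counts[y] / n_test) / (train_counts[y] / n_train)
        for y in train_labels
    ]


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate fitted to `samples` (assumed stand-in)."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def covariate_shift_weights(train_values, test_values, bandwidth=1.0):
    """Weight each training example by P_test(x_k) / P_train(x_k) for a single feature k."""
    p_train = gaussian_kde(train_values, bandwidth)
    p_test = gaussian_kde(test_values, bandwidth)
    # p_train(x) is strictly positive here because x is itself a training sample.
    return [p_test(x) / p_train(x) for x in train_values]
```

For instance, `class_imbalance_weights([0, 0, 0, 1], [0, 1])` up-weights the rare training class, and `covariate_shift_weights` returns weights of 1 when training and test values coincide.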
As $x^{(i)}$ potentially has many dimensions with a considerable covariate shift, computing the joint $P_{\langle\text{dist}\rangle}(x^{(i)})$ becomes impractical for an increasing number of dimensions. We propose a unified and scalable solution for combining the weights that correct the $n$ different biases, comprising, for example, class imbalance and different types of covariate shift. The corresponding weights per bias of an example are $w_{i,1}, w_{i,2}, \ldots, w_{i,n}$. The example weight $w_i$ is the harmonic mean of the weights of the biases considered and is computed as follows:

$$w_i = \frac{n}{\frac{1}{w_{i,1}} + \frac{1}{w_{i,2}} + \cdots + \frac{1}{w_{i,n}}} = \frac{n}{\sum_{k=1}^{n} \frac{1}{w_{i,k}}}. \tag{1}$$

As the different $w_{i,k}$ are computed from noisy, real-world data, special care needs to be paid to outliers. Outliers can potentially lead to very large values of $w_{i,k}$ for the density estimation proposed above. It is for that reason that we choose the harmonic mean, as it penalizes extreme values and gives preference to smaller values.

4. Evaluation

The data used in this paper comes from an electricity provider in Brazil, from which we retain M = 150,700 customers. For these customers, we have a complete time series of 24 monthly meter readings before the most recent inspection. From each time series, we compute 304 features comprising generic time series features, daily average features and difference features, as detailed in Table 1. The computation of these features is explained in detail in our previous work [12].

Table 1. Number of features before and after selection.

Name                             #Features   #Retained features
Daily average                           23                   18
Fixed interval                          36                   34
Generic time series                    222                  162
Intra year difference                   12                   12
Intra year seasonal difference          11                   11
Total                                  304                  237

Next, we apply hypothesis tests to the features in order to retain the ones that are statistically relevant.
These tests are based on the assumption that a feature $x_k$ is meaningful for the prediction of the binary label vector $y$ if $x_k$ and $y$ are not statistically independent [13]. For binary features, we use Fisher’s exact test [14]. In contrast, for continuous features, we use the Kolmogorov-Smirnov test [15]. We retain 237 of the 304 features.

We previously found a random forest (RF) classifier to perform the best on this data compared to decision tree, gradient boosted tree and support vector machine classifiers [12]. It is for this reason that in the following experiments, we only train RF classifiers. When training an RF, we perform model selection by doing randomized grid search, for which the parameters are detailed in Table 2. We use 100 sampled models and perform 10-fold cross-validation for each model.

Table 2. Model parameters for random forest.

Parameter                                          Values
Max. number of leaves                              [2, 1000)
Max. number of levels                              [1, 50)
Measure of the purity of a split                   {entropy, gini}
Min. number of samples required to be at a leaf    [1, 1000)
Min. number of samples required to split a node    [2, 50)
Number of estimators                               20

We have previously shown that the location and class of customers have the strongest covariate shift [2]. When reducing these, we first compute the weights for the class imbalance, the spatial covariate shift and the customer class covariate shift, respectively, as defined in Sec. 3. For covariate shift, we use randomized grid search for model selection of the density estimator, which is parameterized by the kernel type and the kernel bandwidth. The complete list of parameters and considered values is given in Table 3.

Table 3. Density estimation parameters.

Parameter    Values
Kernel       {gaussian, tophat, epanechnikov, exponential, linear, cosine}
Bandwidth    [0.001, 10] (log space)

Next, we use Eq. 1 to combine these weights step by step. For each step, we report the test performance of the NTL classifier in Table 4.
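
The step-by-step combination with Eq. 1 can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the per-bias weights are assumed to be already computed and strictly positive, and `weight_columns` is an assumed layout with one list of example weights per bias, in the order in which the biases are added.

```python
def combine_weights(per_bias_weights):
    """Combine the n per-bias weights of one example via the harmonic mean (Eq. 1)."""
    if not per_bias_weights:
        raise ValueError("need at least one weight")
    if any(w <= 0 for w in per_bias_weights):
        raise ValueError("weights must be strictly positive")
    return len(per_bias_weights) / sum(1.0 / w for w in per_bias_weights)


def combine_step_by_step(weight_columns):
    """Yield the example weights obtained after adding one bias at a time."""
    for n in range(1, len(weight_columns) + 1):
        # Each row holds the first n per-bias weights of a single example.
        rows = zip(*weight_columns[:n])
        yield [combine_weights(row) for row in rows]
```

With a single outlier, e.g. per-bias weights of 1, 1 and 100, the harmonic mean stays below 2 whereas the arithmetic mean would be 34, which illustrates why the harmonic mean dampens very large density-ratio weights. Python's standard library also offers `statistics.harmonic_mean` for the single-example case.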
The results clearly show that the larger the number of addressed biases, the higher the reliability of the learned predictor.

Table 4. Test performance of random forest.

Biases reduced                                                                AUC
None                                                                          0.59535
Class imbalance                                                               0.64445
Class imbalance + spatial covariate shift                                     0.71431
Class imbalance + spatial covariate shift + customer class covariate shift    0.73980

Note: We use the area under the receiver operating characteristic curve (AUC) metric. It is particularly useful for NTL detection, as it handles imbalanced data sets and puts correct and incorrect inspection results in relation to each other [5]. AUC denotes the mean test AUC of the 10 folds of cross-validation for the best model.

5. Conclusions and Future Work

Biases appear in many real-world applications of machine learning and refer to the training data not being representative of the test data. The most common biases are class imbalance and covariate shift. In this work, we proposed a scalable model for reducing multiple biases in high-dimensional data at the same time. We applied our methodology to a real-world, noisy data set on irregular power usage. Our model leads to more reliable predictors, thus allowing better detection of customers with irregular power usage. Next, we aim to evaluate our methodology on other data sets, to derive models that reduce hierarchical spatial biases and to handpick an unbiased test set as ground truth for evaluation.

Acknowledgement

The present project is supported by the National Research Fund, Luxembourg under grant agreement number 11508593.

References

1. M. Banko and E. Brill, Scaling to very very large corpora for natural language disambiguation, in Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, 2001.
2. P. Glauner, A. Migliosi, J. A. Meira et al., Is big data sufficient for a reliable detection of non-technical losses?, in 2017 19th International Conference on Intelligent System Application to Power Systems (ISAP), Sept 2017.
3. H. Shimodaira, Journal of Statistical Planning and Inference 90, 227 (2000).
4. C. Cortes and M. Mohri, Theoretical Computer Science 519, 103 (2014).
5. P. Glauner, J. A. Meira, P. Valtchev et al., International Journal of Computational Intelligence Systems 10, 760 (2017).
6. J. L. Viegas, P. R. Esteves, R. Melício et al., Renewable and Sustainable Energy Reviews 80, 1256 (2017).
7. J. Jiang, A literature survey on domain adaptation of statistical classifiers (2008).
8. N. Japkowicz and S. Stephen, Intelligent Data Analysis 6, 429 (2002).
9. B. Zadrozny, Learning and evaluating classifiers under sample selection bias, in Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
10. J. J. Heckman, Econometrica 47, 153 (1979).
11. F. Pedregosa, G. Varoquaux, A. Gramfort et al., Journal of Machine Learning Research 12, 2825 (2011).
12. P. Glauner, N. Dahringer, O. Puhachov et al., Identifying irregular power usage by turning predictions into holographic spatial visualizations, in Proceedings of the 17th IEEE International Conference on Data Mining Workshops (ICDMW 2017), November 2017.
13. P. Radivojac, Z. Obradovic, A. K. Dunker and S. Vucetic, Feature selection filters based on the permutation test, in ECML, 2004.
14. R. A. Fisher, Journal of the Royal Statistical Society 85, 87 (1922).
15. F. J. Massey Jr, Journal of the American Statistical Association 46, 68 (1951).

On the Reduction of Biases in Big Data Sets for the Detection of Irregular Power Usage

http://orbilu.uni.lu/bitstream/10993/35427/1/On%20the%20Reduction%20of%20Biases%20in%20Big%20Data%20Sets%20for%20the%20Detection%20of%20Irregular%20Power%20Usage.pdf
