In machine learning, bias occurs whenever the training set is not
representative of the test data, which results in unreliable models. The most
common biases in data are arguably class imbalance and covariate shift. In this
work, we aim to shed light on this topic and to draw more attention to it in
the field of machine learning. We propose a novel, scalable framework for
reducing multiple biases in high-dimensional data sets in
order to train more reliable predictors. We apply our methodology to the
detection of irregular power usage in real, noisy industrial data. In
emerging markets, irregular power usage, and electricity theft in particular,
may account for up to 40% of the total electricity distributed. Biased data
sets are a particular issue in this domain. We show that reducing these biases
increases the accuracy of the trained predictors. Our models have the potential
to generate significant economic value in a real-world application, as they are
being deployed in commercial software for the detection of irregular power
usage.