9 research outputs found

    Impact of Biases in Big Data

    Get PDF
    The underlying paradigm of big data-driven machine learning reflects the desire of deriving better conclusions from simply analyzing more data, without the necessity of looking at theory and models. Is having simply more data always helpful? In 1936, The Literary Digest collected 2.3M filled in questionnaires to predict the outcome of that year's US presidential election. The outcome of this big data prediction proved to be entirely wrong, whereas George Gallup only needed 3K handpicked people to make an accurate prediction. Generally, biases occur in machine learning whenever the distributions of training set and test set are different. In this work, we provide a review of different sorts of biases in (big) data sets in machine learning. We provide definitions and discussions of the most commonly appearing biases in machine learning: class imbalance and covariate shift. We also show how these biases can be quantified and corrected. This work is an introductory text for both researchers and practitioners to become more aware of this topic and thus to derive more reliable models for their learning problems

    Impact of Biases in Big Data

    Get PDF
    The underlying paradigm of big data-driven machine learning reflects the desire of deriving better conclusions from simply analyzing more data, without the necessity of looking at theory and models. Is having simply more data always helpful? In 1936, The Literary Digest collected 2.3M filled in questionnaires to predict the outcome of that year's US presidential election. The outcome of this big data prediction proved to be entirely wrong, whereas George Gallup only needed 3K handpicked people to make an accurate prediction. Generally, biases occur in machine learning whenever the distributions of training set and test set are different. In this work, we provide a review of different sorts of biases in (big) data sets in machine learning. We provide definitions and discussions of the most commonly appearing biases in machine learning: class imbalance and covariate shift. We also show how these biases can be quantified and corrected. This work is an introductory text for both researchers and practitioners to become more aware of this topic and thus to derive more reliable models for their learning problems

    Assessing Capsule Networks with Biased Data

    Get PDF
    Machine learning based methods achieves impressive results in object classification and detection. Utilizing representative data of the visual world during the training phase is crucial to achieve good performance with such data driven approaches. However, it not always possible to access bias-free datasets thus, robustness to biased data is a desirable property for a learning system. Capsule Networks have been introduced recently and their tolerance to biased data has received little attention. This paper aims to fill this gap and proposes two experimental scenarios to assess the tolerance to imbalanced training data and to determine the generalization performance of a model with unfamiliar affine transformations of the images. This paper assesses dynamic routing and EM routing based Capsule Networks and proposes a comparison with Convolutional Neural Networks in the two tested scenarios. The presented results provide new insights into the behaviour of capsule networks

    Treating class imbalance in non-technical loss detection : an exploratory analysis of a real dataset

    Get PDF
    Non-Technical Loss (NTL) is a significant concern for many electric supply companies due to the financial impact caused as a result of suspect consumption activities. A range of machine learning classifiers have been tested across multiple synthesized and real datasets to combat NTL. An important characteristic that exists in these datasets is the imbalance distribution of the classes. When the focus is on predicting the minority class of suspect activities, the classifiers' sensitivity to the class imbalance becomes more important. In this paper, we evaluate the performance of a range of classifiers with under-sampling and over-sampling techniques. The results are compared with the untreated imbalanced dataset. In addition, we compare the performance of the classifiers using penalized classification model. Lastly, the paper presents an exploratory analysis of using different sampling techniques on NTL detection in a real dataset and identify the best performing classifiers. We conclude that logistic regression is the most sensitive to the sampling techniques as the change of its recall is measured around 50% for all sampling techniques. While the random forest is the least sensitive to the sampling technique, the difference in its precision is observed between 1% - 6% for all sampling techniques. © 2013 IEEE

    Machine Learning for Data-Driven Smart Grid Applications

    Get PDF
    The field of Machine Learning grew out of the quest for artificial intelligence. It gives computers the ability to learn statistical patterns from data without being explicitly programmed. These patterns can then be applied to new data in order to make predictions. Machine Learning also allows to automatically adapt to changes in the data without amending the underlying model. We deal every day dozens of times with Machine Learning applications such as when doing a Google search, using spam filters, face detection, speaking to voice recognition software or when sitting in a self-driving car. In recent years, machine learning methods have evolved in the smart grid community. This change towards analyzing data rather than modeling specific problems has lead to adaptable, more generic methods, that require less expert knowledge and that are easier to deploy in a number of use cases. This is an introductory level course to discuss what machine learning is and how to apply it to data-driven smart grid applications. Practical case studies on real data sets, such as load forecasting, detection of irregular power usage and visualization of customer data, will be included. Therefore, attendees will not only understand, but rather experience, how to apply machine learning methods to smart grid data

    Detection of Irregular Power Usage using Machine Learning

    Get PDF
    Electricity losses are a frequently appearing problem in power grids. Non-technical losses (NTL) appear during distribution and include, but are not limited to, the following causes: Meter tampering in order to record lower consumptions, bypassing meters by rigging lines from the power source, arranged false meter readings by bribing meter readers, faulty or broken meters, un-metered supply, technical and human errors in meter readings, data processing and billing. NTLs are also reported to range up to 40% of the total electricity distributed in countries such as India, Pakistan, Malaysia, Brazil or Lebanon. This is an introductory level course to discuss how to predict if a customer causes a NTL. In the last years, employing data analytics methods such as machine learning and data mining have evolved as the primary direction to solve this problem. This course will present and compare different approaches reported in the literature. Practical case studies on real data sets will be included. As an additional outcome, attendees will understand the open challenges of NTL detection and learn how these challenges could be solved in the coming years

    A governance framework for algorithmic accountability and transparency

    Get PDF
    Algorithmic systems are increasingly being used as part of decision-making processes in both the public and private sectors, with potentially significant consequences for individuals, organisations and societies as a whole. Algorithmic systems in this context refer to the combination of algorithms, data and the interface process that together determine the outcomes that affect end users. Many types of decisions can be made faster and more efficiently using algorithms. A significant factor in the adoption of algorithmic systems for decision-making is their capacity to process large amounts of varied data sets (i.e. big data), which can be paired with machine learning methods in order to infer statistical models directly from the data. The same properties of scale, complexity and autonomous model inference however are linked to increasing concerns that many of these systems are opaque to the people affected by their use and lack clear explanations for the decisions they make. This lack of transparency risks undermining meaningful scrutiny and accountability, which is a significant concern when these systems are applied as part of decision-making processes that can have a considerable impact on people's human rights (e.g. critical safety decisions in autonomous vehicles; allocation of health and social service resources, etc.). This study develops policy options for the governance of algorithmic transparency and accountability, based on an analysis of the social, technical and regulatory challenges posed by algorithmic systems. Based on a review and analysis of existing proposals for governance of algorithmic systems, a set of four policy options are proposed, each of which addresses a different aspect of algorithmic transparency and accountability: 1. awareness raising: education, watchdogs and whistleblowers; 2. accountability in public-sector use of algorithmic decision-making; 3. regulatory oversight and legal liability; and 4. global coordination for algorithmic governance

    Artificial Intelligence for the Detection of Electricity Theft and Irregular Power Usage in Emerging Markets

    Get PDF
    Power grids are critical infrastructure assets that face non-technical losses (NTL), which include, but are not limited to, electricity theft, broken or malfunctioning meters and arranged false meter readings. In emerging markets, NTL are a prime concern and often range up to 40% of the total electricity distributed. The annual world-wide costs for utilities due to NTL are estimated to be around USD 100 billion. Reducing NTL in order to increase revenue, profit and reliability of the grid is therefore of vital interest to utilities and authorities. In the beginning of this thesis, we provide an in-depth discussion of the causes of NTL and the economic effects thereof. Industrial NTL detection systems are still largely based on expert knowledge when deciding whether to carry out costly on-site inspections of customers. Electric utilities are reluctant to move to large-scale deployments of automated systems that learn NTL profiles from data. This is due to the latter's propensity to suggest a large number of unnecessary inspections. In this thesis, we compare expert knowledge-based decision making systems to automated statistical decision making. We then branch out our research into different directions: First, in order to allow human experts to feed their knowledge in the decision process, we propose a method for visualizing prediction results at various granularity levels in a spatial hologram. Our approach allows domain experts to put the classification results into the context of the data and to incorporate their knowledge for making the final decisions of which customers to inspect. Second, we propose a machine learning framework that classifies customers into NTL or non-NTL using a variety of features derived from the customers' consumption data as well as a selection of master data. The methodology used is specifically tailored to the level of noise in the data. Last, we discuss the issue of biases in data sets. A bias occurs whenever training sets are not representative of the test data, which results in unreliable models. We show how quantifying and reducing these biases leads to an increased accuracy of the trained NTL detectors. This thesis has resulted in appreciable results on real-world big data sets of millions customers. Our systems are being deployed in a commercial NTL detection software. We also provide suggestions on how to further reduce NTL by not only carrying out inspections, but by implementing market reforms, increasing efficiency in the organization of utilities and improving communication between utilities, authorities and customers
    corecore