Focusing on the Big Picture: Insights into a Systems Approach to Deep Learning for Satellite Imagery
Deep learning tasks are often complicated and require a variety of components
working together efficiently to perform well. Due to the often large scale of
these tasks, there is a necessity to iterate quickly in order to attempt a
variety of methods and to find and fix bugs. While participating in IARPA's
Functional Map of the World challenge, we identified challenges along the
entire deep learning pipeline and found various solutions to these challenges.
In this paper, we present the performance, engineering, and deep learning
considerations with processing and modeling data, as well as underlying
infrastructure considerations that support large-scale deep learning tasks. We
also discuss insights and observations with regard to satellite imagery and
deep learning for image classification.
Comment: Accepted to IEEE Big Data 201
Large-Scale Detection of Non-Technical Losses in Imbalanced Data Sets
Non-technical losses (NTL) such as electricity theft cause significant harm
to our economies, as in some countries they may range up to 40% of the total
electricity distributed. Detecting NTLs requires costly on-site inspections.
Accurate prediction of NTLs for customers using machine learning is therefore
crucial. To date, related research has largely ignored that the two classes of
regular and non-regular customers are highly imbalanced and that NTL proportions
may change. It also mostly considers small data sets, which often prevents
deploying the results in production. In this paper, we present a comprehensive approach
to assess three NTL detection models for different NTL proportions in large
real world data sets of 100Ks of customers: Boolean rules, fuzzy logic and
Support Vector Machine. This work has produced appreciable results that are
about to be deployed in a leading industry solution. We believe that the
considerations and observations made in this contribution are necessary for
future smart meter research in order to report their effectiveness on
imbalanced and large real world data sets.
Comment: Proceedings of the Seventh IEEE Conference on Innovative Smart Grid
Technologies (ISGT 2016
Impact of Biases in Big Data
The underlying paradigm of big data-driven machine learning reflects the
desire of deriving better conclusions from simply analyzing more data, without
the necessity of looking at theory and models. Is having simply more data
always helpful? In 1936, The Literary Digest collected 2.3M filled-in
questionnaires to predict the outcome of that year's US presidential election.
The outcome of this big data prediction proved to be entirely wrong, whereas
George Gallup only needed 3K handpicked people to make an accurate prediction.
Generally, biases occur in machine learning whenever the distributions of
training set and test set are different. In this work, we provide a review of
different sorts of biases in (big) data sets in machine learning. We provide
definitions and discussions of the most commonly appearing biases in machine
learning: class imbalance and covariate shift. We also show how these biases
can be quantified and corrected. This work is an introductory text for both
researchers and practitioners to become more aware of this topic and thus to
derive more reliable models for their learning problems.
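The abstract above names class imbalance and covariate shift as biases that can be quantified and corrected. The sketch below illustrates one common way to do each, which is not necessarily the approach taken in the paper: an imbalance ratio with inverse-frequency class weights, and a histogram estimate of the importance weight w(x) = p_test(x) / p_train(x) for a single binned feature. All function names are hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Quantify class imbalance as majority count / minority count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(labels):
    """Per-class weights n / (k * count), so every class has equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def importance_weights(train_bins, test_bins):
    """Estimate w(x) = p_test(x) / p_train(x) for each training sample.

    Assumes the feature is already discretized into bins. A bin that never
    occurs in the test data gets weight 0.
    """
    train_counts, test_counts = Counter(train_bins), Counter(test_bins)
    n_train, n_test = len(train_bins), len(test_bins)
    return [
        (test_counts[b] / n_test) / (train_counts[b] / n_train)
        for b in train_bins
    ]
```

The class weights can be passed to a weighted loss, and the importance weights reweight training samples so that the training distribution of the binned feature matches the test distribution.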
X-ray Astronomical Point Sources Recognition Using Granular Binary-tree SVM
The study on point sources in astronomical images is of special importance,
since most energetic celestial objects in the Universe exhibit a point-like
appearance. An approach to recognize the point sources (PS) in the X-ray
astronomical images using our newly designed granular binary-tree support
vector machine (GBT-SVM) classifier is proposed. First, all potential point
sources are located by peak detection on the image. The image and spectral
features of these potential point sources are then extracted. Finally, a
classifier to recognize the true point sources is built from the extracted
features. Experiments and applications of our approach on real X-ray
astronomical images are demonstrated. Comparisons between our approach and
other SVM-based classifiers are also carried out by evaluating the precision
and recall rates, which show that our approach performs better and achieves a higher
accuracy of around 89%.
Comment: Accepted by ICSP201
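The first step of the pipeline above, locating candidate point sources by peak detection, can be illustrated with a minimal sketch. It assumes a peak is a pixel that is at or above a threshold and strictly brighter than its 8 neighbours. Both the `threshold` parameter and this definition are assumptions for illustration, since the abstract does not specify the detector used.

```python
def find_peaks(image, threshold):
    """Return (row, col) of pixels that are strict local maxima >= threshold.

    `image` is a non-empty list of equal-length rows of numbers.
    """
    rows, cols = len(image), len(image[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = image[r][c]
            if value < threshold:
                continue
            # Collect the up-to-8 neighbours, clipping at the image border.
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c))
    return peaks
```

Each returned position would then be passed on to feature extraction and classification, which this sketch does not cover.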
Learning to Auto Weight: Entirely Data-driven and Highly Efficient Weighting Framework
An example weighting algorithm is an effective solution to the training bias
problem. However, most previous methods depend on human
knowledge and require laborious tuning of hyperparameters. In this paper, we
propose a novel example weighting framework called Learning to Auto Weight
(LAW). The proposed framework finds step-dependent weighting policies
adaptively, and can be jointly trained with target networks without any
assumptions or prior knowledge about the dataset. It consists of three key
components: Stage-based Searching Strategy (3SM) is adopted to shrink the huge
searching space in a complete training process; Duplicate Network Reward (DNR)
gives more accurate supervision by removing randomness during the searching
process; Full Data Update (FDU) further improves the updating efficiency.
Experimental results demonstrate the superiority of the weighting policy explored
by LAW over the standard training pipeline. Compared with baselines, LAW can find a
better weighting schedule that achieves higher accuracy on both
biased CIFAR and ImageNet.
Comment: Accepted by AAAI 202
Unachievable Region in Precision-Recall Space and Its Effect on Empirical Evaluation
Precision-recall (PR) curves and the areas under them are widely used to
summarize machine learning results, especially for data sets exhibiting class
skew. They are often used analogously to ROC curves and the area under ROC
curves. It is known that PR curves vary as class skew changes. What was not
recognized before this paper is that there is a region of PR space that is
completely unachievable, and the size of this region depends only on the skew.
This paper precisely characterizes the size of that region and discusses its
implications for empirical evaluation methodology in machine learning.
Comment: ICML2012, fixed citations to use correct tech report number
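The dependence of the unachievable region on skew can be worked out from first principles. The sketch below is a derivation under stated assumptions, not a quotation from the paper. Let the skew π be the fraction of positive examples. At recall r, precision is lowest when every negative is predicted positive, giving p_min(r) = πr / (πr + 1 − π). Integrating p_min over r from 0 to 1 gives the area 1 + (1 − π) ln(1 − π) / π in the continuous approximation. The function names are hypothetical.

```python
import math


def min_precision(recall, pi):
    """Lowest achievable precision at a given recall for skew pi (0 < pi < 1)."""
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must be strictly between 0 and 1")
    return pi * recall / (pi * recall + (1.0 - pi))


def unachievable_area(pi):
    """Area under the minimum PR curve: 1 + (1 - pi) * ln(1 - pi) / pi."""
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must be strictly between 0 and 1")
    return 1.0 + (1.0 - pi) * math.log(1.0 - pi) / pi
```

Under these assumptions, the area depends only on π and grows as positives become more common, so areas under PR curves from data sets with different skews are not directly comparable.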
Comparison of Classification Algorithms with a Data-Level Approach for Handling Imbalanced Class Data (Komparasi Algoritma Kasifikasi dengan Pendekatan Level Data untuk Menangani Data Kelas Tidak Seimbang)
The imbalanced class data problem has an adverse effect on prediction accuracy. To address it, many previous studies have applied classification algorithms to imbalanced class data. This study presents under-sampling and over-sampling techniques for handling imbalanced class data. These techniques are applied at the preprocessing stage to balance the class distribution in the data. Experimental results show that the neural network (NN) outperforms decision tree (DT), linear regression (LR), naïve Bayes (NB) and support vector machine (SVM).
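The under-sampling and over-sampling preprocessing mentioned above can be sketched as follows. This is a minimal illustration that assumes plain random sampling. The abstract does not state which sampling variants were used, and the function names and `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def _group_by_class(samples, labels):
    """Map each class label to the list of its samples."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return groups


def _flatten(pairs, rng):
    """Shuffle (sample, label) pairs and split them back into two lists."""
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


def random_under_sample(samples, labels, seed=0):
    """Drop samples at random until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = min(len(g) for g in groups.values())
    pairs = [(x, y) for y, g in groups.items() for x in rng.sample(g, target)]
    return _flatten(pairs, rng)


def random_over_sample(samples, labels, seed=0):
    """Duplicate samples at random until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = max(len(g) for g in groups.values())
    pairs = []
    for y, g in groups.items():
        extra = [rng.choice(g) for _ in range(target - len(g))]
        pairs.extend((x, y) for x in g + extra)
    return _flatten(pairs, rng)
```

Either function should be applied to the training split only, so that the evaluation data keeps its original class distribution.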