Deep Boosting: Layered Feature Mining for General Image Classification
Constructing effective representations is a critical but challenging problem in multimedia understanding. Traditional handcrafted features often rely on domain knowledge, limiting the performance of existing methods. This paper discusses a novel computational architecture for general image feature mining, which assembles primitive filters (i.e., Gabor wavelets) into compositional features in a layer-wise manner. In each layer, we produce a number of base classifiers (i.e., regression stumps) associated with the generated features, and discover informative compositions by using the boosting algorithm. The output compositional features of each layer are treated as the base components to build up the next layer. Our framework is able to generate expressive image representations while inducing highly discriminative functions for image classification. The experiments are conducted on several public datasets, and we demonstrate superior performance over state-of-the-art approaches.
Comment: 6 pages, 4 figures, ICME 201
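As a concrete illustration of the layer-wise mining loop described in this abstract, the Python sketch below composes pairwise products of a previous layer's filter responses into candidate features and selects them with AdaBoost-style reweighting. The pairwise-product composition rule, the threshold grid, and the decision-stump stand-ins for the regression stumps are simplifying assumptions for illustration, not the authors' implementation.

    import numpy as np

    def stump_predict(x, thresh, sign):
        # A one-feature stump with +/-1 output.
        return sign * np.where(x > thresh, 1.0, -1.0)

    def best_stump(feature, y, w):
        # Pick the threshold/sign minimizing weighted error for one feature.
        best = (np.inf, 0.0, 1.0)
        for thresh in np.percentile(feature, [10, 25, 50, 75, 90]):
            for sign in (1.0, -1.0):
                err = np.sum(w * (stump_predict(feature, thresh, sign) != y))
                if err < best[0]:
                    best = (err, thresh, sign)
        return best  # (weighted error, threshold, sign)

    def boost_layer(F, y, n_select=5):
        # One layer: compose pairwise-product features from the previous
        # layer's outputs F (n_samples x n_features), select by boosting.
        n, d = F.shape
        pairs = [(i, j) for i in range(d) for j in range(i, d)]
        w = np.ones(n) / n
        selected = []
        for _ in range(n_select):
            scored = [(best_stump(F[:, i] * F[:, j], y, w), (i, j))
                      for (i, j) in pairs]
            (err, thresh, sign), (i, j) = min(scored, key=lambda t: t[0][0])
            err = np.clip(err, 1e-12, 1 - 1e-12)
            alpha = 0.5 * np.log((1 - err) / err)
            pred = stump_predict(F[:, i] * F[:, j], thresh, sign)
            w *= np.exp(-alpha * y * pred)   # AdaBoost reweighting
            w /= w.sum()
            selected.append((i, j))
        # Selected compositions become the base components of the next layer.
        return np.column_stack([F[:, i] * F[:, j] for (i, j) in selected])

The returned matrix would feed the same routine again to build the next layer, mirroring the stacking described above; the label vector y is assumed to be in {-1, +1}.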
Confidential Boosting with Random Linear Classifiers for Outsourced User-generated Data
User-generated data is crucial to predictive modeling in many applications.
With a web/mobile/wearable interface, a data owner can continuously record data
generated by distributed users and build various predictive models from the
data to improve their operations, services, and revenue. Due to the large size
and evolving nature of users' data, data owners may rely on public cloud service
providers (Cloud) for storage and computation scalability. Exposing sensitive
user-generated data and advanced analytic models to Cloud raises privacy
concerns. We present a confidential learning framework, SecureBoost, for data
owners that want to learn predictive models from aggregated user-generated data
but offload the storage and computational burden to Cloud without having to
worry about protecting the sensitive data. SecureBoost allows users to submit
encrypted or randomly masked data to the designated Cloud directly. Our framework
utilizes random linear classifiers (RLCs) as the base classifiers in the
boosting framework to dramatically simplify the design of the proposed
confidential boosting protocols, yet still preserve the model quality. A
Cryptographic Service Provider (CSP) is used to assist the Cloud's processing,
reducing the complexity of the protocol constructions. We present two
constructions of SecureBoost: HE+GC and SecSh+GC, using combinations of
homomorphic encryption, garbled circuits, and random masking to achieve both
security and efficiency. For a boosted model, Cloud learns only the RLCs and
the CSP learns only the weights of the RLCs. Finally, the data owner collects
the two parts to get the complete model. We conduct extensive experiments to
understand the quality of the RLC-based boosting and the cost distribution of
the constructions. Our results show that SecureBoost can efficiently learn
high-quality boosting models from protected user-generated data.
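To make the role of the random linear classifiers concrete, here is a minimal in-the-clear Python sketch of RLC-based boosting, with the cryptographic machinery (homomorphic encryption, garbled circuits, masking) deliberately omitted. It illustrates the property the protocol exploits: drawing each RLC requires no access to the data at all, and only the weighted error, and hence the weight alpha, depends on the protected data.

    import numpy as np

    rng = np.random.default_rng(0)

    def fit_rlc_boost(X, y, rounds=50):
        # y in {-1, +1}. Returns a list of (hyperplane, bias, alpha).
        n, d = X.shape
        sample_w = np.ones(n) / n
        model = []
        for _ in range(rounds):
            # Drawing the RLC needs no data (the Cloud-side step).
            w_vec, bias = rng.normal(size=d), rng.normal()
            pred = np.sign(X @ w_vec + bias)
            err = np.sum(sample_w * (pred != y))
            if err > 0.5:   # flip so the RLC beats random guessing
                w_vec, bias, pred, err = -w_vec, -bias, -pred, 1 - err
            err = np.clip(err, 1e-12, 1 - 1e-12)
            # In the protocol, this error/alpha step is what runs on
            # protected data with the CSP's assistance.
            alpha = 0.5 * np.log((1 - err) / err)
            sample_w *= np.exp(-alpha * y * pred)
            sample_w /= sample_w.sum()
            model.append((w_vec, bias, alpha))
        return model

    def predict(model, X):
        return np.sign(sum(a * np.sign(X @ w + b) for (w, b, a) in model))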
Graph-based Regularization in Machine Learning: Discovering Driver Modules in Biological Networks
Curiosity of human nature drives us to explore the origins of what makes each of us different. From ancient legends and mythology, Mendel's laws, and Punnett squares to modern genetic research, we carry on this old but eternal question. Thanks to the technological revolution, today's scientists try to answer this question using easily measurable gene expression and other profiling data. However, the exploration can easily get lost in data of growing volume, dimension, noise, and complexity. This dissertation is aimed at developing new machine learning methods that take data from different classes as input, augment them with knowledge of feature relationships, and train classification models that serve two goals: 1) class prediction for previously unseen samples; 2) knowledge discovery of the underlying causes of class differences. Application of our methods in genetic studies can help scientists take advantage of existing biological networks, generate diagnoses with higher accuracy, and discover the driver networks behind the differences. We propose three new graph-based regularization algorithms. The Graph Connectivity Constrained AdaBoost algorithm combines a connectivity module, a deletion function, and a model-retraining procedure with the AdaBoost classifier. The Graph-regularized Linear Programming Support Vector Machine integrates a penalty term based on a submodular graph-cut function into the linear classifier's objective function. Proximal Graph LogisticBoost adds lasso and graph-based penalties to the logistic risk function of an ensemble classifier. Tests of our models on simulated biological datasets show that the proposed methods produce accurate, sparse classifiers and can help discover true genetic differences between phenotypes.
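All three algorithms attach a network-derived penalty to a classification loss. As a generic stand-in (not any of the three methods above), the Python sketch below adds a graph-Laplacian smoothness term w^T L w plus a lasso term to logistic regression and optimizes by proximal gradient descent, so that genes connected in the network are encouraged to receive similar weights while the classifier stays sparse.

    import numpy as np

    def fit_graph_logistic(X, y, A, lam_l1=0.01, lam_graph=0.1,
                           lr=0.1, iters=500):
        # X: samples x genes; y in {0, 1}; A: gene-gene adjacency matrix.
        n, d = X.shape
        L = np.diag(A.sum(axis=1)) - A        # graph Laplacian
        w = np.zeros(d)
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-(X @ w)))            # sigmoid
            grad = X.T @ (p - y) / n + 2 * lam_graph * (L @ w)
            w -= lr * grad
            # Proximal (soft-thresholding) step for the lasso term.
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam_l1, 0.0)
        return w

The nonzero entries of the returned weight vector, read against the network, play the role of a candidate driver module.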
The Superiority of the Ensemble Classification Methods: A Comprehensive Review
Modern technologies, characterized by cyber-physical systems and the Internet of Things, expose organizations to big data, which in turn can be processed to derive actionable knowledge. Machine learning techniques have been widely employed in both supervised and unsupervised environments in an effort to develop systems that are capable of making feasible decisions in light of past data. In order to enhance the accuracy of supervised learning algorithms, various classification-based ensemble methods have been developed. Herein, we review the superiority exhibited by ensemble learning algorithms based on the work that has been carried out over the years. Moreover, we proceed to compare and discuss the common classification-based ensemble methods, with an emphasis on the boosting and bagging ensemble-learning models. We conclude by setting out the superiority of the ensemble learning models over individual base learners.
Keywords: Ensemble, supervised learning, Ensemble model, AdaBoost, Bagging, Randomization, Boosting, Strong learner, Weak learner, classifier fusion, classifier selection, Classifier combination.
DOI: 10.7176/JIEA/9-5-05
Publication date: August 31st 2019
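For readers who want to see the bagging-versus-boosting contrast the review emphasizes, the following illustrative Python snippet (dataset and hyperparameters are arbitrary choices, not taken from the paper) pits a single weak learner against its bagged and boosted ensembles using scikit-learn:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    stump = DecisionTreeClassifier(max_depth=1)   # weak base learner
    models = {
        "single stump": stump,
        "bagged stumps": BaggingClassifier(stump, n_estimators=200),
        "boosted stumps": AdaBoostClassifier(n_estimators=200),
    }
    for name, clf in models.items():
        print(name, cross_val_score(clf, X, y, cv=5).mean())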
AdaBook and MultiBook: Adaptive Boosting with Chance Correction
There has been considerable interest in boosting and bagging, including the combination of the adaptive
techniques of AdaBoost with the random selection with replacement techniques of Bagging. At the same
time there has been a revisiting of the way we evaluate, with chance-corrected measures like Kappa,
Informedness, Correlation or ROC AUC being advocated. This leads to the question of whether learning
algorithms can do better by optimizing an appropriate chance-corrected measure. Indeed, it is possible for a
weak learner to optimize Accuracy to the detriment of the more realistic chance-corrected measures, and
when this happens the booster can give up too early. This phenomenon is known to occur with conventional
Accuracy-based AdaBoost, and the MultiBoost algorithm has been developed to overcome such problems
using restart techniques based on bagging. This paper thus complements the theoretical work showing the
necessity of using chance-corrected measures for evaluation, with empirical work showing how use of a
chance-corrected measure can improve boosting. We show that the early surrender problem occurs in
MultiBoost too, in multiclass situations, so that chance-corrected AdaBook and MultiBook can beat standard
MultiBoost or AdaBoost, and we further identify which chance-corrected measures to use and when.
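The paper's central point, that a weak learner can look strong on Accuracy while sitting at chance level, is easy to see numerically. The Python sketch below scores a majority-class predictor by weighted Accuracy and by Bookmaker Informedness (TPR + TNR - 1), the kind of chance-corrected measure involved here; the toy data are illustrative only.

    import numpy as np

    def weighted_informedness(y, pred, w):
        # y, pred in {-1, +1}; w are the boosting sample weights.
        pos, neg = (y == 1), (y == -1)
        tpr = np.sum(w[pos] * (pred[pos] == 1)) / max(np.sum(w[pos]), 1e-12)
        tnr = np.sum(w[neg] * (pred[neg] == -1)) / max(np.sum(w[neg]), 1e-12)
        return tpr + tnr - 1.0   # 0 = chance level, 1 = perfect

    def weighted_accuracy(y, pred, w):
        return np.sum(w * (pred == y))

    y = np.array([1, 1, 1, 1, 1, 1, 1, 1, -1, -1])
    pred = np.ones(10)           # always predict the majority class
    w = np.ones(10) / 10
    print(weighted_accuracy(y, pred, w))      # 0.8: looks like progress
    print(weighted_informedness(y, pred, w))  # 0.0: pure chance

An Accuracy-based booster treats such a learner as informative, whereas a chance-corrected booster correctly scores it at zero and keeps searching.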
Machine learning on a budget
Thesis (Ph.D.)--Boston University
In a typical discriminative learning setting, a set of labeled training examples is given, and the goal is to learn a decision rule that accurately classifies (or labels) unseen test examples. Much of machine learning research has focused on improving accuracy, but more recently the costs of learning and decision making have become more important. Such costs arise both during training and testing. Labeling data for training is often an expensive process. During testing, acquiring or processing measurements for every decision is also costly. This work deals with two problems: how to reduce the amount of labeled data during training, and how to minimize measurement cost in making decisions during testing, while maintaining system accuracy.
The first part falls into an area known as active learning. It deals with the problem of selecting a small subset of examples to label, from a pool of unlabeled data, for training a good classifier. This problem is relevant in many applications where a large collection of unlabeled data is readily available but labeling an instance requires an expensive expert (e.g., a radiologist annotating a medical image). We study active learning in the boosting framework. We develop a practical algorithm that labels examples so as to maximally reduce the space of feasible classifiers. We show that, under certain assumptions, our strategy achieves the generalization error performance of a system trained on the entire data set while selecting only logarithmically many samples to label.
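As a rough proxy for the label-selection strategy just described (the exact feasible-space criterion is the thesis's contribution and is not reproduced here), the Python sketch below queries, at each round, the unlabeled example with the smallest absolute margin under the current boosted ensemble, a standard uncertainty heuristic:

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def active_loop(X_pool, y_oracle, n_init=10, n_queries=40):
        rng = np.random.default_rng(0)
        labeled = list(rng.choice(len(X_pool), n_init, replace=False))
        for _ in range(n_queries):
            clf = AdaBoostClassifier(n_estimators=50)
            clf.fit(X_pool[labeled], y_oracle[labeled])
            margin = np.abs(clf.decision_function(X_pool))
            margin[labeled] = np.inf        # never re-query labeled points
            labeled.append(int(np.argmin(margin)))  # most ambiguous point
        return clf, labeled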
In the second part, we study sequential classifiers under budget constraints. In many systems, such as medical diagnosis and homeland security, sensors have varying acquisition costs, and these costs account for delay, throughput, or monetary value. While some decisions require all measurements, it is often unnecessary to use every modality to classify every example. So the problem is to learn a system that, for every decision, sequentially selects sensors to meet a measurement budget while minimizing classification error. Initially, we study the case where the sensor order in which measurements are acquired is given. For every instance, our system has to decide whether to seek more measurements from the next sensor or to terminate by classifying based on the available information. We use Bayesian analysis of this problem to construct a novel multi-stage empirical risk objective and directly learn sequential decision functions from training data. We provide practical algorithms for binary and multi-class settings and derive generalization error guarantees. We compare our approach to alternative strategies on real-world data. In the last section, we explore a decision system where the order of sensors is no longer fixed. We investigate how to combine ideas from reinforcement and imitation learning with empirical risk minimization to learn a dynamic sensor selection policy.
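A minimal Python sketch of the fixed-order setting described above: after each sensor, the system either stops and classifies with the information in hand or pays for the next measurement. Here the learned stage decision functions are simplified to per-stage confidence thresholds; the thesis instead learns them by multi-stage empirical risk minimization.

    def sequential_classify(stage_probs, costs, thresholds):
        # stage_probs[k]: P(y=1 | measurements up to sensor k)
        # costs[k]: acquisition cost of sensor k+1
        # thresholds[k]: confidence at which stopping beats acquiring more
        spent = 0.0
        for k, p in enumerate(stage_probs):
            last = (k == len(stage_probs) - 1)
            if max(p, 1 - p) >= thresholds[k] or last:
                return int(p >= 0.5), spent   # classify now
            spent += costs[k]                 # pay for the next sensor

    # Example: confidence grows with each sensor; stop after the third.
    label, cost = sequential_classify([0.55, 0.62, 0.93],
                                      [1.0, 2.5], [0.9, 0.9, 0.0])
    print(label, cost)   # -> 1 3.5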