Fast Incremental SVDD Learning Algorithm with the Gaussian Kernel
Support vector data description (SVDD) is a machine learning technique that
is used for single-class classification and outlier detection. The idea of SVDD
is to find a set of support vectors that defines a boundary around data. When
dealing with online or large data, existing batch SVDD methods have to be rerun
in each iteration. We propose an incremental learning algorithm for SVDD that
uses the Gaussian kernel. This algorithm builds on the observation that all
support vectors on the boundary have the same distance to the center of the sphere
in a higher-dimensional feature space as mapped by the Gaussian kernel
function. Each iteration involves only the existing support vectors and the new
data point. Moreover, the algorithm is based solely on matrix manipulations;
the support vectors and their corresponding Lagrange multipliers
are automatically selected and determined in each iteration. The complexity
of each iteration grows only with the number of support vectors. Experimental
results on some real data
sets indicate that FISVDD demonstrates significant gains in efficiency with
almost no loss in either outlier detection accuracy or objective function
value.
Comment: 18 pages, 1 table, 4 figures
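The equidistance observation the algorithm builds on can be checked numerically. In the sketch below (a simplified illustration, not the paper's actual FISVDD update), the multipliers are taken as the normalized solution of the kernel system K a = 1, which is valid only when all entries come out positive; since a Gaussian kernel gives K(x, x) = 1 for every point, all resulting support vectors sit at the same feature-space distance from the sphere center:

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    # Gaussian (RBF) kernel; note K(x, x) = 1 for every x, so all mapped
    # points lie on the unit sphere in feature space.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def interior_alphas(K):
    # Hypothetical interior solution: solve K a = 1 and normalize so the
    # multipliers sum to one (valid only when every entry is positive,
    # i.e. every point really is a support vector).
    a = np.linalg.solve(K, np.ones(len(K)))
    return a / a.sum()

def sq_dist_to_center(k_z, k_zz, alphas, K):
    # ||phi(z) - c||^2 with center c = sum_i alpha_i phi(x_i).
    return k_zz - 2.0 * k_z @ alphas + alphas @ K @ alphas

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K = gaussian_kernel(X, X)
alphas = interior_alphas(K)
dists = [sq_dist_to_center(K[i], K[i, i], alphas, K) for i in range(len(X))]
```

Because K a is a constant vector for this choice of multipliers, the squared distances in `dists` all coincide, which is exactly the property the incremental update exploits.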
Evaluating and Characterizing Incremental Learning from Non-Stationary Data
Incremental learning from non-stationary data poses special challenges to the
field of machine learning. Although new algorithms have been developed for
this, assessment of results and comparison of behaviors are still open
problems, mainly because evaluation metrics, adapted from more traditional
tasks, can be ineffective in this context. Overall, there is a lack of common
testing practices. This paper thus presents a testbed for incremental
non-stationary learning algorithms, based on specially designed synthetic
datasets. Also, test results are reported for some well-known algorithms to
show that the proposed methodology is effective at characterizing their
strengths and weaknesses. It is expected that this methodology will provide a
common basis for evaluating future contributions in the field.
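A minimal version of such a testbed can be sketched as a synthetic stream with one abrupt drift plus prequential (test-then-train) evaluation; the stream parameters and the forgetting nearest-class-mean learner below are illustrative stand-ins, not the paper's datasets or algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)

def drifting_stream(n=2000, drift_at=1000):
    # Class 0 stays centered at 0; class 1 jumps from +2 to -2 at
    # `drift_at`, giving one abrupt concept drift.
    for t in range(n):
        y = int(rng.integers(0, 2))
        mu = 0.0 if y == 0 else (2.0 if t < drift_at else -2.0)
        yield rng.normal(mu, 0.5), y

def prequential_accuracy(stream, decay=0.95):
    # Test-then-train: predict each item first, then update the model.
    means = {0: 0.0, 1: 0.0}
    correct = total = 0
    for x, y in stream:
        pred = min(means, key=lambda c: abs(x - means[c]))  # test
        correct += int(pred == y)
        total += 1
        means[y] = decay * means[y] + (1 - decay) * x       # train
    return correct / total

acc = prequential_accuracy(drifting_stream())
```

Because the learner forgets exponentially, its accuracy dips right after the drift and then recovers, which is the kind of behavior such a testbed is designed to expose.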
Processing Analytical Workloads Incrementally
Analysis of large data collections using popular machine learning and
statistical algorithms has been a topic of increasing research interest. A
typical analysis workload consists of applying an algorithm to build a model on
a data collection and subsequently refining it based on the results.
In this paper we introduce model materialization and incremental model reuse
as first class citizens in the execution of analysis workloads. Instead of
discarding built models, we materialize them in a form that can be reused in
subsequent computations. At the same time we consider manipulating an existing
model (adding or deleting data from it) in order to build a new one. We discuss
our approach in the context of popular machine learning models. We specify the
details of how to incrementally maintain models and outline the
optimizations required to make the best use of models and their incremental
adjustments to build new ones. We detail our techniques for linear regression,
naive Bayes, and logistic regression and present the suitable algorithms and
optimizations to handle these models in our framework.
We present the results of a detailed performance evaluation, using real and
synthetic data sets. Our experiments analyze the various trade-offs inherent in
our approach and demonstrate vast performance benefits.
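The add/delete manipulation is simplest to see for linear regression, where a materialized model can be kept as the sufficient statistics X'X and X'y; the sketch below illustrates that general idea under this assumption and is not the paper's actual framework:

```python
import numpy as np

class IncrementalLinReg:
    # Materialized model state: the sufficient statistics X'X and X'y.
    # Adding or deleting a row is a rank-one update, so no refit over
    # the full data collection is needed.
    def __init__(self, d):
        self.XtX = np.zeros((d, d))
        self.Xty = np.zeros(d)

    def add(self, x, y):
        self.XtX += np.outer(x, x)
        self.Xty += y * x

    def delete(self, x, y):
        self.XtX -= np.outer(x, x)
        self.Xty -= y * x

    def coef(self, ridge=1e-8):
        # Tiny ridge term keeps the solve numerically stable.
        d = len(self.Xty)
        return np.linalg.solve(self.XtX + ridge * np.eye(d), self.Xty)

rng = np.random.default_rng(0)
Xd = rng.normal(size=(100, 3))
yd = Xd @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.01, size=100)

m = IncrementalLinReg(3)
for xi, yi in zip(Xd, yd):
    m.add(xi, yi)
m.delete(Xd[0], yd[0])  # retract one row instead of rebuilding the model
w_inc = m.coef()
w_batch = np.linalg.lstsq(Xd[1:], yd[1:], rcond=None)[0]
```

The incrementally adjusted coefficients match a full batch refit on the remaining rows, which is what makes model reuse cheaper than recomputation.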
Learning Certifiably Optimal Rule Lists for Categorical Data
We present the design and implementation of a custom discrete optimization
technique for building rule lists over a categorical feature space. Our
algorithm produces rule lists with optimal training performance, according to
the regularized empirical risk, with a certificate of optimality. By leveraging
algorithmic bounds, efficient data structures, and computational reuse, we
achieve several orders of magnitude speedup in time and a massive reduction of
memory consumption. We demonstrate that our approach produces optimal rule
lists on practical problems in seconds. Our results indicate that it is
possible to construct optimal sparse rule lists that are approximately as
accurate as the COMPAS proprietary risk prediction tool on data from Broward
County, Florida, but that are completely interpretable. This framework is a
novel alternative to CART and other decision tree methods for interpretable
modeling.
Comment: A short version of this work appeared in KDD '17 as "Learning
Certifiably Optimal Rule Lists"
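The objective being certified, regularized empirical risk, is easy to state concretely: misclassification rate plus a penalty per rule. The sketch below scores a hypothetical rule list against that objective; it illustrates the objective only, not the branch-and-bound search itself:

```python
def rule_list_risk(rules, default, X, y, lam=0.01):
    # Regularized empirical risk of a rule list: misclassification rate
    # plus a penalty of `lam` per rule. `rules` is an ordered list of
    # (condition, label) pairs; the first matching condition fires.
    errors = 0
    for xi, yi in zip(X, y):
        pred = default
        for cond, label in rules:
            if cond(xi):
                pred = label
                break
        errors += int(pred != yi)
    return errors / len(y) + lam * len(rules)

# Hypothetical categorical data and a one-rule list.
X = [{"smoker": "yes"}, {"smoker": "yes"}, {"smoker": "no"}, {"smoker": "no"}]
y = [1, 0, 0, 1]
rules = [(lambda r: r["smoker"] == "yes", 1)]
risk = rule_list_risk(rules, default=0, X=X, y=y, lam=0.01)  # 2/4 + 0.01
```

An optimal search minimizes this quantity over all rule lists and returns a certificate that no shorter or more accurate list can score lower.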
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics and the WG1 working group was created in the NEMIS
project, to carry out a detailed survey of techniques associated with the text
mining process and to identify the relevant research topics in related research
areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.
Comment: 54 pages, Report of Working Group 1 for the European Network of
Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)
Grounding semantics in robots for Visual Question Answering
In this thesis I describe an operational implementation of an object detection and description system that is incorporated into an end-to-end Visual Question Answering system, and I evaluate it on two visual question answering datasets for compositional language and elementary visual reasoning.
Online Machine Learning in Big Data Streams
The area of online machine learning in big data streams covers algorithms
that are (1) distributed and (2) work from data streams with only a limited
possibility to store past data. The first requirement mostly concerns software
architectures and efficient algorithms. The second one also imposes nontrivial
theoretical restrictions on the modeling methods: In the data stream model,
older data is no longer available to revise earlier suboptimal modeling
decisions as the fresh data arrives.
In this article, we provide an overview of distributed software architectures
and libraries as well as machine learning models for online learning. We
highlight the most important ideas for classification, regression,
recommendation, and unsupervised modeling from streaming data, and we show how
they are implemented in various distributed data stream processing systems.
This article is a reference material and not a survey. We do not attempt to
be comprehensive in describing all existing methods and solutions; rather, we
give pointers to the most important resources in the field. The related
sub-fields of online algorithms, online learning, and distributed data
processing are all highly active in current research and development, with
conceptually new
research results and software components emerging at the time of writing. In
this article, we refer to several survey results, both for distributed data
processing and for online machine learning. Compared to past surveys, our
article differs in that we discuss recommender systems in greater detail.
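The data-stream constraint, no access to past data, is what single-pass learners such as stochastic gradient descent satisfy by design; the following sketch (with a made-up simulated stream) keeps only the current model state in memory:

```python
import numpy as np

rng = np.random.default_rng(1)

def stream(n=5000):
    # Simulated unbounded stream; items are seen once and never stored.
    w_true = np.array([2.0, -1.0])
    for _ in range(n):
        x = rng.normal(size=2)
        yield x, x @ w_true + rng.normal(scale=0.1)

def online_sgd(data, lr=0.05):
    # Single-pass stochastic gradient descent on squared error; the
    # weight vector is the only state kept, matching the stream model.
    w = np.zeros(2)
    for x, y in data:
        w -= lr * (x @ w - y) * x
    return w

w = online_sgd(stream())
```

Note that earlier suboptimal steps cannot be revisited: the learner can only correct them through later updates, which is exactly the theoretical restriction the stream model imposes.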
Detection and classification of masses in mammographic images in a multi-kernel approach
According to the World Health Organization, breast cancer is the main cause
of cancer death among adult women in the world. Although breast cancer occurs
indiscriminately in countries at all degrees of social and economic
development, mortality rates in developing and underdeveloped countries
are still high, due to the low availability of early detection technologies. From
the clinical point of view, mammography is still the most effective diagnostic
technology, given the wide diffusion of the use and interpretation of these
images. In this work we propose a method to detect and classify
mammographic lesions using the regions of interest of images. Our proposal
consists in decomposing each image using multi-resolution wavelets. Zernike
moments are extracted from each wavelet component. Using this approach we can
combine both texture and shape features, which can be applied both to the
detection and classification of mammary lesions. We used 355 images of fatty
breast tissue from the IRMA database, with 233 normal instances (no lesion), 72
benign, and 83 malignant cases. Classification was performed by using SVM and
ELM networks with modified kernels, in order to optimize accuracy rates,
reaching 94.11%. To account for both accuracy rates and training times, we
defined the ratio of average percentage accuracy to average training time. For
our proposal this ratio was 50 times higher than that of the best
state-of-the-art method. Since our model combines a high accuracy rate with a
low learning time, it can save hours of retraining whenever new data arrive,
compared to the best state-of-the-art method.
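The wavelet stage of such a pipeline can be illustrated with a single-level 2-D Haar decomposition, after which sub-band energies serve as simple texture features. This sketches the decomposition step only; the paper's actual descriptors are Zernike moments computed on the wavelet components:

```python
import numpy as np

def haar2d(img):
    # Single-level orthonormal 2-D Haar decomposition into approximation
    # (LL) and horizontal/vertical/diagonal detail sub-bands.
    p, q = img[0::2, 0::2], img[0::2, 1::2]
    r, s = img[1::2, 0::2], img[1::2, 1::2]
    ll = (p + q + r + s) / 2
    lh = (p - q + r - s) / 2
    hl = (p + q - r - s) / 2
    hh = (p - q - r + s) / 2
    return ll, lh, hl, hh

img = np.arange(16.0).reshape(4, 4)  # stand-in for a mammogram ROI
bands = haar2d(img)
features = [float((b ** 2).sum()) for b in bands]  # per-band energies
```

Since the transform is orthonormal, the sub-band energies sum to the energy of the original patch; shape descriptors extracted per sub-band then combine texture and shape information as the abstract describes.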
An Online Learning Approach for Dengue Fever Classification
This paper introduces a novel approach for dengue fever classification based
on online learning paradigms. The proposed approach is suitable for practical
implementation as it enables learning using only a few training samples. With
time, the proposed approach is capable of learning incrementally from the data
collected, without the need for retraining the model or redeployment of the
prediction engine. Additionally, we also provide a comprehensive evaluation of
machine learning methods for prediction of dengue fever. The input to the
proposed pipeline comprises recorded patient symptoms and diagnostic
investigations. Offline classifier models have been employed to obtain baseline
scores to establish that the feature set is optimal for classification of
dengue. The primary benefit of the online detection model presented in this
paper is that it effectively identifies patients with a high likelihood of
dengue, and experiments on scalability in terms of the number of training and
test samples validate the use of the proposed model.
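Incremental learning without retraining can be illustrated with a Gaussian naive Bayes classifier whose class counts and feature moments are updated one labelled case at a time; this is an illustrative online learner with made-up data, not the paper's exact pipeline:

```python
import numpy as np

class OnlineGaussianNB:
    # Incremental Gaussian naive Bayes: only class counts, per-feature
    # sums and sums of squares are stored, each updated as a labelled
    # case arrives -- no stored training set, no retraining.
    def __init__(self, n_features, classes=(0, 1)):
        self.n = {c: 0 for c in classes}
        self.s = {c: np.zeros(n_features) for c in classes}
        self.ss = {c: np.zeros(n_features) for c in classes}

    def update(self, x, y):
        self.n[y] += 1
        self.s[y] += x
        self.ss[y] += x * x

    def predict(self, x):
        def log_post(c):
            if self.n[c] == 0:
                return -np.inf
            mu = self.s[c] / self.n[c]
            var = self.ss[c] / self.n[c] - mu ** 2 + 1e-6  # smoothed
            ll = -0.5 * np.sum(np.log(2 * np.pi * var)
                               + (x - mu) ** 2 / var)
            return np.log(self.n[c] / sum(self.n.values())) + ll
        return max(self.n, key=log_post)

# Hypothetical two-feature cases for a negative (0) and positive (1) class.
nb = OnlineGaussianNB(2)
cases = [([0.0, 0.0], 0), ([0.2, 0.1], 0), ([-0.1, 0.2], 0),
         ([5.0, 5.0], 1), ([5.2, 4.9], 1), ([4.8, 5.1], 1)]
for x, y in cases:
    nb.update(np.array(x), y)
```

Each confirmed diagnosis refines the running statistics, so the deployed prediction engine improves in place, which is the practical benefit the abstract emphasizes.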
Secure Multi-Party Computation Based Privacy Preserving Extreme Learning Machine Algorithm Over Vertically Distributed Data
Especially in the Big Data era, the usage of different classification methods
is increasing day by day. The success of these classification methods depends
on the effectiveness of the learning methods. The extreme learning machine (ELM)
classification algorithm is a relatively new learning method built on
feed-forward neural networks. The ELM classification algorithm is a simple and fast
method that can create a model from high-dimensional data sets. Traditional ELM
learning algorithm implicitly assumes complete access to the whole data set.
This is a major privacy concern in most cases. The sharing of private data
(e.g., medical records) is often prevented by security concerns. In this research,
we propose an efficient and secure privacy-preserving learning algorithm for
ELM classification over data that is vertically partitioned among several
parties. The new learning method preserves the privacy of numerical attributes
and builds a classification model without disclosing the private data of any
party to the others.
Comment: 22nd International Conference, ICONIP 201
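The vertical-partitioning structure can be sketched as follows: with shared random input weights, each party projects only its own feature columns, and the partial projections sum to the full hidden-layer input. In the actual protocol that sum would be computed with secure multi-party summation; the plain sum below is a stand-in for it, and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Vertically partitioned data: party A holds columns 0-1, party B 2-3.
X = rng.normal(size=(50, 4))
T = (X @ np.array([1.0, -1.0, 0.5, 0.0]) > 0).astype(float).reshape(-1, 1)

W = rng.normal(size=(4, 8))  # shared random ELM input weights
b = rng.normal(size=8)

# Each party projects only its own feature columns. In the real protocol
# the partial sums are combined via secure multi-party summation; the
# plain sum below stands in for that step.
partial_A = X[:, :2] @ W[:2]
partial_B = X[:, 2:] @ W[2:]
H = np.tanh(partial_A + partial_B + b)

# Ridge solution for the ELM output weights: beta = (H'H + rI)^-1 H'T.
beta = np.linalg.solve(H.T @ H + 1e-3 * np.eye(H.shape[1]), H.T @ T)
```

Because the hidden-layer input decomposes additively over feature columns, each party only ever reveals an aggregated projection, never its raw attribute values.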