
    Fast Incremental SVDD Learning Algorithm with the Gaussian Kernel

    Support vector data description (SVDD) is a machine learning technique used for single-class classification and outlier detection. The idea of SVDD is to find a set of support vectors that defines a boundary around the data. When dealing with online or large data, existing batch SVDD methods have to be rerun in each iteration. We propose an incremental learning algorithm for SVDD that uses the Gaussian kernel. This algorithm builds on the observation that all support vectors on the boundary have the same distance to the center of the sphere in a higher-dimensional feature space as mapped by the Gaussian kernel function. Each iteration involves only the existing support vectors and the new data point. Moreover, the algorithm is based solely on matrix manipulations; the support vectors and their corresponding Lagrange multipliers $\alpha_i$ are automatically selected and determined in each iteration. The complexity of our algorithm in each iteration is only $O(k^2)$, where $k$ is the number of support vectors. Experimental results on some real data sets indicate that FISVDD demonstrates significant gains in efficiency with almost no loss in either outlier detection accuracy or objective function value.
    Comment: 18 pages, 1 table, 4 figure
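
The equal-distance observation in this abstract can be illustrated with the standard SVDD kernel distance formula. The sketch below is not the FISVDD algorithm itself: it only evaluates the squared feature-space distance to the sphere center for a Gaussian kernel, where $K(z, z) = 1$. The function names, the bandwidth value, and the two hand-picked support vectors with equal multipliers are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def squared_distance_to_center(z, support_vectors, alpha, bandwidth):
    """Squared distance from phi(z) to the center sum_i alpha_i * phi(x_i).

    With the Gaussian kernel K(z, z) = 1, so the distance reduces to
    1 - 2 * sum_i alpha_i K(x_i, z) + sum_ij alpha_i alpha_j K(x_i, x_j).
    """
    k_zs = gaussian_kernel(z, support_vectors, bandwidth)
    k_ss = gaussian_kernel(support_vectors, support_vectors, bandwidth)
    return 1.0 - 2.0 * k_zs @ alpha + alpha @ k_ss @ alpha

# Toy example (assumed data): two support vectors with equal multipliers.
support_vectors = np.array([[0.0, 0.0], [2.0, 0.0]])
alpha = np.array([0.5, 0.5])  # multipliers sum to 1
bandwidth = 1.0               # hypothetical kernel bandwidth

d2 = squared_distance_to_center(support_vectors, support_vectors, alpha, bandwidth)
print(d2)  # both boundary support vectors are equally far from the center
```

Only the kernel matrix among the $k$ support vectors enters the computation, which is consistent with a per-iteration cost that depends on $k$ rather than on the full data set.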

    Evaluating and Characterizing Incremental Learning from Non-Stationary Data

    Incremental learning from non-stationary data poses special challenges to the field of machine learning. Although new algorithms have been developed for this, assessment of results and comparison of behaviors are still open problems, mainly because evaluation metrics, adapted from more traditional tasks, can be ineffective in this context. Overall, there is a lack of common testing practices. This paper thus presents a testbed for incremental non-stationary learning algorithms, based on specially designed synthetic datasets. Test results are also reported for some well-known algorithms to show that the proposed methodology is effective at characterizing their strengths and weaknesses. It is expected that this methodology will provide a common basis for evaluating future contributions in the field.

    Processing Analytical Workloads Incrementally

    Analysis of large data collections using popular machine learning and statistical algorithms has been a topic of increasing research interest. A typical analysis workload consists of applying an algorithm to build a model on a data collection and subsequently refining it based on the results. In this paper we introduce model materialization and incremental model reuse as first-class citizens in the execution of analysis workloads. We materialize built models instead of discarding them, so that they can be reused in subsequent computations. At the same time we consider manipulating an existing model (adding data to it or deleting data from it) in order to build a new one. We discuss our approach in the context of popular machine learning models. We specify the details of how to incrementally maintain models and outline the optimizations required to make the best use of models and their incremental adjustments when building new ones. We detail our techniques for linear regression, naive Bayes and logistic regression and present the suitable algorithms and optimizations to handle these models in our framework. We present the results of a detailed performance evaluation, using real and synthetic data sets. Our experiments analyze the various trade-offs inherent in our approach and demonstrate vast performance benefits.
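
One common way to maintain a linear regression model under added or deleted data is to keep its sufficient statistics, $X^\top X$ and $X^\top y$, instead of the raw rows. The sketch below assumes that approach; it is not taken from the paper, and the class and method names are hypothetical.

```python
import numpy as np

class IncrementalLinearRegression:
    """Least-squares model kept as sufficient statistics X^T X and X^T y."""

    def __init__(self, n_features):
        self.xtx = np.zeros((n_features, n_features))
        self.xty = np.zeros(n_features)

    def add(self, X, y):
        # Adding rows only adds their contribution to the statistics.
        self.xtx += X.T @ X
        self.xty += X.T @ y

    def remove(self, X, y):
        # Deleting rows subtracts the same contribution again.
        self.xtx -= X.T @ X
        self.xty -= X.T @ y

    def coefficients(self):
        # Solve the normal equations (assumes X^T X is invertible).
        return np.linalg.solve(self.xtx, self.xty)

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)

model = IncrementalLinearRegression(n_features=3)
model.add(X[:60], y[:60])      # build a model on the first batch
model.add(X[60:], y[60:])      # reuse it when new data arrives
model.remove(X[:10], y[:10])   # derive a new model by deleting rows
incremental = model.coefficients()

# Reference: refit from scratch on the rows that remain.
batch = np.linalg.lstsq(X[10:], y[10:], rcond=None)[0]
```

Here `incremental` and `batch` agree up to floating-point error, while the incremental path never revisits rows it has already absorbed.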

    Learning Certifiably Optimal Rule Lists for Categorical Data

    We present the design and implementation of a custom discrete optimization technique for building rule lists over a categorical feature space. Our algorithm produces rule lists with optimal training performance, according to the regularized empirical risk, with a certificate of optimality. By leveraging algorithmic bounds, efficient data structures, and computational reuse, we achieve several orders of magnitude speedup in time and a massive reduction of memory consumption. We demonstrate that our approach produces optimal rule lists on practical problems in seconds. Our results indicate that it is possible to construct optimal sparse rule lists that are approximately as accurate as the COMPAS proprietary risk prediction tool on data from Broward County, Florida, but that are completely interpretable. This framework is a novel alternative to CART and other decision tree methods for interpretable modeling.
    Comment: A short version of this work appeared in KDD '17 as "Learning Certifiably Optimal Rule Lists
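
To make the objective concrete, the sketch below scores a rule list by a regularized empirical risk, taken here to be the misclassification rate plus a penalty `lam` per rule. That exact form, the toy features, and the function names are assumptions for illustration; the code only evaluates one candidate list and does not perform the certified search the abstract describes.

```python
def predict(rule_list, default, sample):
    """Return the label of the first rule whose conditions all match."""
    for condition, label in rule_list:
        if all(sample.get(feature) == value for feature, value in condition.items()):
            return label
    return default  # no rule fired

def regularized_risk(rule_list, default, samples, labels, lam):
    """Misclassification rate plus lam times the number of rules."""
    errors = sum(predict(rule_list, default, s) != y for s, y in zip(samples, labels))
    return errors / len(samples) + lam * len(rule_list)

# Toy categorical data (hypothetical features and labels).
samples = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "square"},
    {"color": "blue", "shape": "round"},
    {"color": "blue", "shape": "square"},
]
labels = [1, 0, 1, 1]

# One-rule list: "if shape is round then 1, else 0".
rule_list = [({"shape": "round"}, 1)]
risk = regularized_risk(rule_list, default=0, samples=samples, labels=labels, lam=0.01)
```

The penalty term is what favors sparse lists: an extra rule is only worthwhile if it lowers the error rate by more than `lam`.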

    State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"

    Several Networks of Excellence have been set up in the framework of the European FP5 research program. Among these Networks of Excellence, the NEMIS project focuses on the field of Text Mining. Within this field, document processing and visualization was identified as one of the key topics and the WG1 working group was created in the NEMIS project, to carry out a detailed survey of techniques associated with the text mining process and to identify the relevant research topics in related research areas. In this document we present the results of this comprehensive survey. The report includes a description of the current state-of-the-art and practice, a roadmap for follow-up research in the identified areas, and recommendations for anticipated technological development in the domain of text mining.
    Comment: 54 pages, Report of Working Group 1 for the European Network of Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)

    Grounding semantics in robots for Visual Question Answering

    In this thesis I describe an operational implementation of an object detection and description system that is incorporated into an end-to-end Visual Question Answering system, and I evaluate it on two visual question answering datasets for compositional language and elementary visual reasoning.

    Online Machine Learning in Big Data Streams

    The area of online machine learning in big data streams covers algorithms that are (1) distributed and (2) work from data streams with only a limited possibility to store past data. The first requirement mostly concerns software architectures and efficient algorithms. The second one also imposes nontrivial theoretical restrictions on the modeling methods: in the data stream model, older data is no longer available to revise earlier suboptimal modeling decisions as fresh data arrives. In this article, we provide an overview of distributed software architectures and libraries as well as machine learning models for online learning. We highlight the most important ideas for classification, regression, recommendation, and unsupervised modeling from streaming data, and we show how they are implemented in various distributed data stream processing systems. This article is reference material and not a survey. We do not attempt to be comprehensive in describing all existing methods and solutions; rather, we give pointers to the most important resources in the field. All related sub-fields (online algorithms, online learning, and distributed data processing) are hugely dominant in current research and development, with conceptually new research results and software components emerging at the time of writing. In this article, we refer to several survey results, both for distributed data processing and for online machine learning. Compared to past surveys, our article is different because we discuss recommender systems in extended detail.
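
The stream restriction described above, where each example is seen once and cannot be revisited, can be sketched with a single-machine online logistic regression evaluated in test-then-train (prequential) fashion. This is a generic illustration, not a method from the article; the synthetic stream, the learning rate, and all variable names are assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
true_w = np.array([1.0, 1.0])  # hidden concept that generates the labels (assumed)
w = np.zeros(2)                # model weights, updated one example at a time
learning_rate = 0.5            # hypothetical step size
n_samples = 2000
correct = 0

for _ in range(n_samples):
    # A new example arrives from the stream.
    x = rng.normal(size=2)
    y = 1.0 if x @ true_w > 0 else 0.0

    # Test first: predict before the label is used for learning.
    p = sigmoid(w @ x)
    correct += int((p >= 0.5) == (y == 1.0))

    # Then train: one stochastic gradient step on the log loss.
    w += learning_rate * (y - p) * x
    # The example is now discarded and never stored or revisited.

prequential_accuracy = correct / n_samples
print(f"prequential accuracy: {prequential_accuracy:.3f}")
```

Memory use stays constant in the length of the stream, since only the weight vector `w` persists between examples.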

    Detection and classification of masses in mammographic images in a multi-kernel approach

    According to the World Health Organization, breast cancer is the main cause of cancer death among adult women in the world. Although breast cancer occurs indiscriminately in countries with varying degrees of social and economic development, mortality rates are still high in developing and underdeveloped countries, due to the low availability of early detection technologies. From the clinical point of view, mammography is still the most effective diagnostic technology, given the wide diffusion of the use and interpretation of these images. In this work we propose a method to detect and classify mammographic lesions using the regions of interest of images. Our proposal consists of decomposing each image using multi-resolution wavelets. Zernike moments are extracted from each wavelet component. Using this approach we can combine both texture and shape features, which can be applied both to the detection and classification of mammary lesions. We used 355 images of fatty breast tissue from the IRMA database, with 233 normal instances (no lesion), 72 benign, and 83 malignant cases. Classification was performed using SVM and ELM networks with modified kernels, in order to optimize accuracy rates, reaching 94.11%. Considering both accuracy rates and training times, we defined the ratio between average percentage accuracy and average training time in a reverse order. The ratio for our proposal was 50 times higher than that obtained using the best state-of-the-art method. As our proposed model combines a high accuracy rate with a low learning time, whenever new data is received it can save hours in the learning process relative to the best state-of-the-art method.

    An Online Learning Approach for Dengue Fever Classification

    This paper introduces a novel approach for dengue fever classification based on online learning paradigms. The proposed approach is suitable for practical implementation as it enables learning using only a few training samples. With time, the proposed approach is capable of learning incrementally from the data collected without the need for retraining the model or redeploying the prediction engine. Additionally, we provide a comprehensive evaluation of machine learning methods for prediction of dengue fever. The input to the proposed pipeline comprises recorded patient symptoms and diagnostic investigations. Offline classifier models have been employed to obtain baseline scores to establish that the feature set is optimal for classification of dengue. The primary benefit of the online detection model presented in the paper is that it has been established to effectively identify patients with a high likelihood of dengue disease, and experiments on scalability in terms of the number of training and test samples validate the use of the proposed model.

    Secure Multi-Party Computation Based Privacy Preserving Extreme Learning Machine Algorithm Over Vertically Distributed Data

    Especially in the Big Data era, the usage of different classification methods is increasing day by day. The success of these classification methods depends on the effectiveness of learning methods. The extreme learning machine (ELM) classification algorithm is a relatively new learning method built on feed-forward neural networks. The ELM classification algorithm is a simple and fast method that can create a model from high-dimensional data sets. The traditional ELM learning algorithm implicitly assumes complete access to the whole data set. This is a major privacy concern in most cases. Sharing of private data (e.g., medical records) is prevented because of security concerns. In this research, we propose an efficient and secure privacy-preserving learning algorithm for ELM classification over data that is vertically partitioned among several parties. The new learning method preserves the privacy of numerical attributes and builds a classification model without disclosing the private data of each party to the others.
    Comment: 22nd International Conference, ICONIP 201
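
For readers unfamiliar with ELM, the sketch below shows the basic, non-private training step: a feed-forward network whose hidden-layer weights are drawn at random and left fixed, with output weights obtained by least squares. It assumes full access to the data and therefore does not implement the secure multi-party protocol proposed in the abstract. The function names, the `tanh` activation, the hidden layer size, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def train_elm(X, targets, n_hidden, rng):
    """Fit a basic ELM: random fixed hidden layer, least-squares output weights."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                # random hidden biases, never trained
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ targets           # output weights via pseudoinverse
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Synthetic two-class problem with labels in {-1, +1} (assumed data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
targets = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

W, b, beta = train_elm(X, targets, n_hidden=30, rng=rng)
train_accuracy = np.mean(np.sign(predict_elm(X, W, b, beta)) == targets)
print(f"training accuracy: {train_accuracy:.3f}")
```

Training reduces to one linear solve, which is why ELM is described as simple and fast. In this plain form the party doing the training needs every column of `X`, which is the access assumption the abstract identifies as a privacy concern for vertically partitioned data.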