
    Inference and Evaluation of the Multinomial Mixture Model for Text Clustering

    In this article, we investigate the use of a probabilistic model for unsupervised clustering in text collections. Unsupervised clustering has become a basic module for many intelligent text processing applications, such as information retrieval, text classification, or information extraction. The model considered in this contribution is a mixture of multinomial distributions over word counts, with each component corresponding to a different theme. We present and contrast various estimation procedures that apply in both supervised and unsupervised contexts. In supervised learning, this work suggests a criterion for evaluating the posterior odds of new documents that is more statistically sound than the "naive Bayes" approach. In the unsupervised context, we propose measures to set up a systematic evaluation framework and begin by examining the Expectation-Maximization (EM) algorithm as the basic tool for inference. We discuss the importance of initialization and the influence of other choices, such as the smoothing strategy and the size of the vocabulary, thereby illustrating the difficulties incurred by the high dimensionality of the parameter space. We also propose a heuristic algorithm based on iterative EM with vocabulary reduction to address this problem. Using the fact that the latent variables can be analytically integrated out, we finally show that a Gibbs sampling algorithm is tractable and compares favorably to the basic EM approach.
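    The EM loop the abstract describes (E-step responsibilities, M-step re-estimation with smoothing) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the Laplace smoothing constant `alpha`, and the Dirichlet initialization are all assumptions:

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, alpha=1e-2, seed=0):
    """Fit a K-component multinomial mixture to a document-term
    count matrix X (n_docs x vocab_size) via EM, working in log
    space for numerical stability."""
    rng = np.random.default_rng(seed)
    n, V = X.shape
    pi = np.full(K, 1.0 / K)                    # mixing weights
    theta = rng.dirichlet(np.ones(V), size=K)   # per-theme word distributions
    for _ in range(n_iter):
        # E-step: unnormalized log-responsibilities log p(z=k, doc)
        log_r = np.log(pi)[None, :] + X @ np.log(theta).T   # (n, K)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)       # posterior p(z=k | doc)
        # M-step: re-estimate weights and smoothed word probabilities
        pi = r.mean(axis=0)
        counts = r.T @ X + alpha                # Laplace smoothing
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta, r
```

    The smoothing term `alpha` in the M-step is one of the choices the abstract flags as influential in high-dimensional vocabularies.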

    Naive Bayes multi-label classification approach for high-voltage condition monitoring

    This paper addresses for the first time the multi-label classification of High-Voltage (HV) discharges captured using the Electromagnetic Interference (EMI) method for HV machines. The approach extracts features from the EMI time signals emitted during discharge events by means of the 1D Local Binary Pattern (LBP) and 1D Histogram of Oriented Gradients (HOG) techniques. Their combination provides a feature vector that feeds a naive Bayes classifier designed to identify the labels of two or more discharge sources contained within a single signal. The performance of this novel approach is measured using various metrics, including average precision, accuracy, specificity, and Hamming loss. Results demonstrate performance in line with similar applications in other fields such as biology and image processing. This first attempt at multi-label classification of EMI discharge sources opens a new research topic in HV condition monitoring.
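    The 1D-LBP feature extraction mentioned above can be sketched roughly as follows. This is an illustrative reading of the technique applied to a 1D signal, not the paper's code; the function name, the `radius` parameter, and the histogram normalization are assumptions:

```python
import numpy as np

def lbp_1d(signal, radius=4):
    """1D Local Binary Pattern: for each sample, compare its 2*radius
    neighbours to the centre value, pack the comparison bits into an
    integer code, and return the normalized histogram of codes as the
    feature vector."""
    s = np.asarray(signal, dtype=float)
    P = 2 * radius                      # number of neighbours / bits per code
    codes = []
    for i in range(radius, len(s) - radius):
        neigh = np.concatenate([s[i - radius:i], s[i + 1:i + 1 + radius]])
        bits = (neigh >= s[i]).astype(int)
        codes.append(int((bits * (2 ** np.arange(P))).sum()))
    hist, _ = np.histogram(codes, bins=2 ** P, range=(0, 2 ** P))
    return hist / max(len(codes), 1)
```

    In the paper's pipeline such a histogram would be concatenated with HOG features before being passed to the naive Bayes classifier.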

    Tupleware: Redefining Modern Analytics

    There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world: petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet the vast majority of users operate clusters ranging from a few to a few dozen nodes, analyze relatively small datasets of up to a few terabytes, and perform primarily compute-intensive operations. Targeting these users fundamentally changes the way we should build analytics systems. This paper describes the design of Tupleware, a new system specifically aimed at the challenges faced by the typical user. Tupleware's architecture brings together ideas from the database, compiler, and programming languages communities to create a powerful end-to-end solution for data analysis. We propose novel techniques that consider the data, computations, and hardware together to achieve maximum performance on a case-by-case basis. Our experimental evaluation quantifies the impact of our novel techniques and shows orders-of-magnitude performance improvement over alternative systems.

    Linked Data approach for selection process automation in Systematic Reviews

    Background: a systematic review identifies, evaluates, and synthesizes the available literature on a given topic using scientific and repeatable methodologies; the significant workload required and subjectivity bias can affect results. Aim: semi-automate the selection process to reduce the amount of manual work needed and the consequent subjectivity bias. Method: extend and enrich the selection of primary studies using existing technologies from the fields of Linked Data and text mining; we formally define the selection process and develop a prototype that implements it, then conduct a case study that simulates the selection process of a systematic review published in the literature. Results: the process presented in this paper reduced the workload by 20% with respect to fully manual selection, with a recall of 100%. Conclusions: the extraction of knowledge from scientific studies through Linked Data and text mining techniques can be used in the selection phase of the systematic review process to reduce the workload and subjectivity bias.

    Pengoptimalan Naive Bayes Dan Regresi Logistik Menggunakan Algoritma Genetika Untuk Data Klasifikasi (Optimization of Naive Bayes and Logistic Regression Using a Genetic Algorithm for Data Classification)

    Classification over large datasets with many heterogeneous features or attributes often yields low accuracy, so a method is needed that can handle such diverse data. Two methods that can address this problem are Naive Bayes and Logistic Regression. Naive Bayes is a data mining method for classification problems, while Logistic Regression is a classification method suited to a binary response variable with many predictor variables combining categorical and continuous types. Both methods require a selection stage over the independent variables to improve model accuracy, and a good method for this is the Genetic Algorithm (GA), an iterative method for finding a global optimum. On a septic-tank dataset from East Surabaya with 11 independent variables and a binary dependent variable, Naive Bayes achieved higher classification accuracy than Binary Logistic Regression: 72.73% versus 54.55%. However, after variable selection with the GA, Naive Bayes and Binary Logistic Regression reached the same classification accuracy of 90.91%.
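    The GA-based variable-selection stage described above can be sketched generically as follows. This is a toy illustration, not the study's implementation; the function names, tournament selection, one-point crossover, and mutation rate are all assumptions, and `fitness` stands in for a cross-validated classifier accuracy:

```python
import numpy as np

def ga_feature_selection(fitness, n_features, pop_size=20, n_gen=30,
                         p_mut=0.05, seed=0):
    """Toy genetic algorithm over binary feature masks. `fitness`
    maps a boolean mask (selected features) to a score; higher is
    better. Returns the best mask in the final population."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_features)) < 0.5   # random initial masks
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])

        def pick():
            # tournament selection: keep the better of two random parents
            i, j = rng.integers(pop_size, size=2)
            return pop[i] if scores[i] >= scores[j] else pop[j]

        children = []
        for _ in range(pop_size):
            a, b = pick(), pick()
            cut = rng.integers(1, n_features)        # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < p_mut    # bit-flip mutation
            children.append(child ^ flip)
        pop = np.array(children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[scores.argmax()], scores.max()
```

    In the study's setting, the mask would range over the 11 independent variables and the fitness would be the accuracy of the Naive Bayes or Logistic Regression model trained on the selected subset.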

    Learning Fair Naive Bayes Classifiers by Discovering and Eliminating Discrimination Patterns

    As machine learning is increasingly used to make real-world decisions, recent research efforts aim to define and ensure fairness in algorithmic decision making. Existing methods often assume a fixed set of observable features to define individuals, but lack a discussion of certain features not being observed at test time. In this paper, we study fairness of naive Bayes classifiers, which allow partial observations. In particular, we introduce the notion of a discrimination pattern, which refers to an individual receiving different classifications depending on whether some sensitive attributes were observed. A model is then considered fair if it has no such pattern. We propose an algorithm to discover and mine for discrimination patterns in a naive Bayes classifier, and show how to learn maximum-likelihood parameters subject to these fairness constraints. Our approach iteratively discovers and eliminates discrimination patterns until a fair model is learned. An empirical evaluation on three real-world datasets demonstrates that we can remove exponentially many discrimination patterns by only adding a small fraction of them as constraints.
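    The notion of a discrimination pattern, an individual classified differently depending on whether a sensitive attribute is observed, can be illustrated for a naive Bayes model as follows. This is a minimal sketch of the idea, not the authors' algorithm; the function names and the conditional-probability-table encoding are assumptions. For naive Bayes, leaving a feature out of the evidence is equivalent to marginalizing over it, which is what makes the comparison below well defined:

```python
import numpy as np

def nb_posterior(prior, cpt, evidence):
    """Posterior P(Y=1 | evidence) for a naive Bayes model with a
    binary class Y and binary features. `cpt[f]` is a (2, 2) array
    with cpt[f][y][v] = P(X_f = v | Y = y); `evidence` maps a
    feature name to its observed value (unobserved features are
    simply omitted, i.e. marginalized out)."""
    log_odds = np.log(prior / (1 - prior))
    for f, v in evidence.items():
        log_odds += np.log(cpt[f][1][v]) - np.log(cpt[f][0][v])
    return 1.0 / (1.0 + np.exp(-log_odds))

def discrimination_score(prior, cpt, evidence, sensitive_f, sensitive_v):
    """Shift in the posterior caused by additionally observing the
    sensitive attribute; a pattern is discriminatory when the
    absolute shift exceeds some threshold."""
    with_s = nb_posterior(prior, cpt, {**evidence, sensitive_f: sensitive_v})
    without_s = nb_posterior(prior, cpt, evidence)
    return with_s - without_s
```

    The paper's algorithm searches over such partial observations for patterns whose score exceeds a threshold and constrains the parameter learning to eliminate them.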