Inference and Evaluation of the Multinomial Mixture Model for Text Clustering
In this article, we investigate the use of a probabilistic model for
unsupervised clustering in text collections. Unsupervised clustering has become
a basic module for many intelligent text processing applications, such as
information retrieval, text classification or information extraction. The model
considered in this contribution consists of a mixture of multinomial
distributions over the word counts, each component corresponding to a different
theme. We present and contrast various estimation procedures, which apply both
in supervised and unsupervised contexts. In supervised learning, this work
suggests a criterion for evaluating the posterior odds of new documents which
is more statistically sound than the "naive Bayes" approach. In an unsupervised
context, we propose measures to set up a systematic evaluation framework and
start with examining the Expectation-Maximization (EM) algorithm as the basic
tool for inference. We discuss the importance of initialization and the
influence of other features such as the smoothing strategy or the size of the
vocabulary, thereby illustrating the difficulties incurred by the high
dimensionality of the parameter space. We also propose a heuristic algorithm
based on iterative EM with vocabulary reduction to solve this problem. Using
the fact that the latent variables can be analytically integrated out, we
finally show that the Gibbs sampling algorithm is tractable and compares
favorably to the basic expectation-maximization approach.
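The EM procedure the abstract takes as its baseline can be sketched as follows; the Laplace smoothing constant `alpha` and the Dirichlet initialization are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def em_multinomial_mixture(X, k, n_iter=50, alpha=1e-2, seed=0):
    """EM for a mixture of k multinomials over word counts X (docs x vocab).

    alpha is a small Laplace smoothing constant (an assumption), added because
    the abstract notes that the smoothing strategy matters in high dimensions.
    """
    rng = np.random.default_rng(seed)
    n, v = X.shape
    pi = np.full(k, 1.0 / k)                   # mixture weights
    theta = rng.dirichlet(np.ones(v), size=k)  # per-theme word distributions
    for _ in range(n_iter):
        # E-step: posterior responsibilities, computed in log space for stability
        log_r = np.log(pi) + X @ np.log(theta).T  # shape (n, k)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and smoothed word probabilities
        pi = r.mean(axis=0)
        counts = r.T @ X + alpha
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta, r
```

The log-space E-step and the smoothed M-step are where the high-dimensionality issues the abstract discusses show up in practice.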
Naive Bayes multi-label classification approach for high-voltage condition monitoring
This paper addresses for the first time the multi-label classification of High-Voltage (HV) discharges captured using the Electromagnetic Interference (EMI) method for HV machines. The approach extracts features from the EMI time signals emitted during discharge events by means of the 1D-Local Binary Pattern (LBP) and 1D-Histogram of Oriented Gradients (HOG) techniques. Their combination provides a feature vector that feeds a naive Bayes classifier designed to identify the labels of two or more discharge sources contained within a single signal. The performance of this novel approach is measured using various metrics, including average precision, accuracy, specificity, and Hamming loss. Results demonstrate a successful performance in line with similar applications in other fields such as biology and image processing. This first attempt at multi-label classification of EMI discharge sources opens a new research topic in HV condition monitoring.
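As an illustration of the feature-extraction step, a minimal 1D-LBP descriptor might look like the following; the neighbourhood size `p` and the histogram normalisation are assumptions, since the abstract does not fix the exact variant used:

```python
import numpy as np

def lbp_1d_histogram(signal, p=4):
    """Sketch of a 1D Local Binary Pattern descriptor (assumed variant):
    each sample is compared against its p/2 neighbours on each side; the
    sign comparisons form a p-bit code, and the feature vector is the
    normalised histogram of codes over the whole signal."""
    signal = np.asarray(signal, dtype=float)
    half = p // 2
    codes = []
    for i in range(half, len(signal) - half):
        neighbours = np.concatenate(
            [signal[i - half:i], signal[i + 1:i + 1 + half]])
        bits = (neighbours >= signal[i]).astype(int)  # threshold at centre
        codes.append(int("".join(map(str, bits)), 2))
    hist = np.bincount(codes, minlength=2 ** p).astype(float)
    return hist / hist.sum()
```

In the paper's pipeline this histogram would be concatenated with a 1D-HOG descriptor before being passed to the naive Bayes classifier.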
Tupleware: Redefining Modern Analytics
There is a fundamental discrepancy between the targeted and actual users of
current analytics frameworks. Most systems are designed for the data and
infrastructure of the Googles and Facebooks of the world---petabytes of data
distributed across large cloud deployments consisting of thousands of cheap
commodity machines. Yet, the vast majority of users operate clusters ranging
from a few to a few dozen nodes, analyze relatively small datasets of up to a
few terabytes, and perform primarily compute-intensive operations. Targeting
these users fundamentally changes the way we should build analytics systems.
This paper describes the design of Tupleware, a new system specifically aimed
at the challenges faced by the typical user. Tupleware's architecture brings
together ideas from the database, compiler, and programming languages
communities to create a powerful end-to-end solution for data analysis. We
propose novel techniques that consider the data, computations, and hardware
together to achieve maximum performance on a case-by-case basis. Our
experimental evaluation quantifies the impact of our novel techniques and shows
orders-of-magnitude performance improvements over alternative systems.
Linked Data approach for selection process automation in Systematic Reviews
Background: A systematic review identifies, evaluates, and synthesizes the available literature on a given topic using scientific and repeatable methodologies. The significant workload required and subjectivity bias can affect results. Aim: to semi-automate the selection process in order to reduce the amount of manual work needed and the consequent subjectivity bias. Method: we extend and enrich the selection of primary studies using existing technologies from the fields of Linked Data and text mining. We formally define the selection process and develop a prototype that implements it. Finally, we conduct a case study that simulates the selection process of a systematic review published in the literature. Results: the process presented in this paper could reduce the workload by 20% with respect to fully manual selection, with a recall of 100%. Conclusions: the extraction of knowledge from scientific studies through Linked Data and text-mining techniques could be used in the selection phase of the systematic review process to reduce the workload and subjectivity bias.
Optimization of Naive Bayes and Logistic Regression Using a Genetic Algorithm for Data Classification
Classification of large datasets with many heterogeneous features or attributes often yields low accuracy, so a method is needed that can handle such diverse data. Naïve Bayes and Logistic Regression are two such methods: Naïve Bayes is a data-mining method for classification problems, while Logistic Regression applies when the response variable is binary and there are many predictor variables, possibly a mix of categorical and continuous ones. Both methods require a selection stage over the independent variables to improve model accuracy, and the Genetic Algorithm (GA), an iterative method for finding a global optimum, is used here for that purpose. On a septic-tank dataset from East Surabaya, with 11 independent variables and a binary dependent variable, Naive Bayes achieved higher classification accuracy (72.73%) than Binary Logistic Regression (54.55%). After variable selection with GA, however, Naive Bayes and Binary Logistic Regression reached the same classification accuracy of 90.91%.
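A minimal sketch of GA-based feature selection of the kind described, assuming binary masks over features and a user-supplied `fitness` function (e.g. classifier accuracy on held-out data); the operators and rates are illustrative, not the paper's configuration:

```python
import numpy as np

def ga_feature_selection(fitness, n_features, pop_size=20, n_gen=30,
                         cx_rate=0.8, mut_rate=0.05, seed=0):
    """Simple genetic algorithm over binary feature masks (a sketch):
    tournament selection, one-point crossover, bit-flip mutation."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_features))
    for _ in range(n_gen):
        scores = np.array([fitness(ind) for ind in pop])
        # tournament selection: better of two random individuals
        idx = [max(rng.choice(pop_size, 2, replace=False),
                   key=lambda i: scores[i]) for _ in range(pop_size)]
        parents = pop[idx]
        # one-point crossover on consecutive parent pairs
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            if rng.random() < cx_rate:
                cut = rng.integers(1, n_features)
                children[i, cut:], children[i + 1, cut:] = \
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy()
        # bit-flip mutation
        flips = rng.random(children.shape) < mut_rate
        pop = np.where(flips, 1 - children, children)
    scores = np.array([fitness(ind) for ind in pop])
    return pop[int(np.argmax(scores))]
```

In the study's setting, `fitness` would train Naive Bayes or Logistic Regression on the 11 candidate variables selected by the mask and return the resulting accuracy.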
Learning Fair Naive Bayes Classifiers by Discovering and Eliminating Discrimination Patterns
As machine learning is increasingly used to make real-world decisions, recent
research efforts aim to define and ensure fairness in algorithmic decision
making. Existing methods often assume a fixed set of observable features to
define individuals, but lack a discussion of certain features not being
observed at test time. In this paper, we study fairness of naive Bayes
classifiers, which allow partial observations. In particular, we introduce the
notion of a discrimination pattern, which refers to an individual receiving
different classifications depending on whether some sensitive attributes were
observed. Then a model is considered fair if it has no such pattern. We propose
an algorithm to discover and mine for discrimination patterns in a naive Bayes
classifier, and show how to learn maximum likelihood parameters subject to
these fairness constraints. Our approach iteratively discovers and eliminates
discrimination patterns until a fair model is learned. An empirical evaluation
on three real-world datasets demonstrates that we can remove exponentially many
discrimination patterns by only adding a small fraction of them as constraints.
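The notion of a discrimination pattern can be illustrated with a small binary naive Bayes model: the degree of a pattern is taken here as the change in posterior when the sensitive attributes are revealed. The model structure and numbers below are hypothetical, and the paper's actual pattern-search and parameter-learning algorithms are not reproduced:

```python
def posterior(prior, cpts, evidence):
    """Posterior P(D=1 | evidence) in a binary naive Bayes model.
    `prior` is P(D=1); cpts[i][d] is P(X_i = 1 | D = d).
    Unobserved features simply marginalise out under naive Bayes,
    which is what makes partial observations easy to handle."""
    odds = prior / (1 - prior)
    for i, value in evidence.items():
        p1 = cpts[i][1] if value else 1 - cpts[i][1]
        p0 = cpts[i][0] if value else 1 - cpts[i][0]
        odds *= p1 / p0
    return odds / (1 + odds)

def discrimination_degree(prior, cpts, observed, sensitive):
    """Change in the classifier's posterior when the sensitive attributes
    are revealed on top of the observed ones; a nonzero degree indicates
    a candidate discrimination pattern in the sense of the abstract."""
    with_s = posterior(prior, cpts, {**observed, **sensitive})
    without_s = posterior(prior, cpts, observed)
    return with_s - without_s
```

A fair model in the paper's sense would keep this degree within a tolerance for every combination of observed and sensitive attribute values.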