    Computing Multi-Relational Sufficient Statistics for Large Databases

    Databases contain information about which relationships do and do not hold among entities. Making this information accessible for statistical analysis requires computing sufficient statistics that combine information from different database tables. Such statistics may involve any number of positive and negative relationships. With a naive enumeration approach, computing sufficient statistics for negative relationships is feasible only for small databases. We solve this problem with a new dynamic programming algorithm that performs a virtual join, where the requisite counts are computed without materializing join tables. Contingency table algebra is a new extension of relational algebra that facilitates the efficient implementation of this Möbius virtual join operation. The Möbius Join scales to large datasets (over 1M tuples) with complex schemas. Empirical evaluation with seven benchmark datasets showed that information about the presence and absence of links can be exploited in feature selection, association rule mining, and Bayesian network learning. Comment: 11 pages, 8 figures, 8 tables; CIKM'14, November 3-7, 2014, Shanghai, China.
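
    As an illustration of the idea behind such a virtual join, the sketch below derives counts for combinations with absent relationships from ordinary join counts over present relationships via inclusion-exclusion, so tables with negated relationships are never materialized. The data layout and function names are assumptions for illustration only; the paper's dynamic program and contingency table algebra organize this computation far more efficiently than the direct sum shown here.

        from itertools import combinations, product

        def full_contingency_table(positive_count, rel_vars, total):
            """Derive counts for every true/false assignment to the relationship
            variables rel_vars.

            positive_count: dict mapping a frozenset S of relationship variables to
                            the number of entity tuples for which every relationship
                            in S holds (what an ordinary SQL join can count).
            total:          number of candidate entity tuples overall.
            """
            table = {}
            for assignment in product([True, False], repeat=len(rel_vars)):
                true_set = frozenset(v for v, b in zip(rel_vars, assignment) if b)
                false_vars = [v for v, b in zip(rel_vars, assignment) if not b]
                # Inclusion-exclusion: count(true_set hold, false_vars all absent)
                # = sum over subsets T of false_vars of (-1)^|T| * count(true_set | T hold).
                n = 0
                for r in range(len(false_vars) + 1):
                    for extra in combinations(false_vars, r):
                        s = true_set | frozenset(extra)
                        n += (-1) ** r * (positive_count.get(s, 0) if s else total)
                table[assignment] = n
            return table

        # Example: 100 student-course pairs, 40 with Registered, 30 with TA, 10 with both.
        counts = full_contingency_table(
            {frozenset({"Registered"}): 40, frozenset({"TA"}): 30,
             frozenset({"Registered", "TA"}): 10},
            ["Registered", "TA"], total=100)
        # counts[(True, False)] == 30 and counts[(False, False)] == 40,
        # obtained without ever enumerating the non-registered, non-TA pairs.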

    Text to Emotion Extraction Using Supervised Machine Learning Techniques

    The proliferation of the internet and social media has greatly increased the popularity of text communication. People convey their sentiments and emotions through text, which promotes lively communication. Consequently, a tremendous amount of emotional text is generated on social media and blogs at every moment. This has raised the need for automated tools for emotion mining from text. There are various rule-based approaches to emotion extraction from text that rely on an emotion intensity lexicon. However, creating an emotion intensity lexicon is a time-consuming and tedious process. Moreover, there is no hard-and-fast rule for assigning emotion intensity to words. To address these difficulties, we propose a machine-learning-based approach to emotion extraction from text that relies on annotated examples rather than an emotion intensity lexicon. We investigated the Multinomial Naïve Bayesian (MNB) classifier, Artificial Neural Network (ANN), and Support Vector Machine (SVM) for mining emotion from text. In our setup, SVM outperformed the other classifiers with promising accuracy.
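
    A minimal sketch of the kind of supervised pipeline the abstract describes, assuming scikit-learn and a toy annotated corpus; the paper's actual features, preprocessing, and datasets are not specified here.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        # Toy annotated examples (text, emotion label) standing in for a real corpus.
        texts = ["I am so happy today", "This is terrifying",
                 "I miss you so much", "What a wonderful surprise"]
        labels = ["joy", "fear", "sadness", "joy"]

        # TF-IDF features feeding a linear SVM, the family of classifier that
        # performed best in the authors' experiments.
        model = make_pipeline(TfidfVectorizer(), LinearSVC())
        model.fit(texts, labels)

        print(model.predict(["I can't stop smiling"]))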

    Interpretable multiclass classification by MDL-based rule lists

    Interpretable classifiers have recently witnessed an increase in attention from the data mining community because they are inherently easier to understand and explain than their more complex counterparts. Examples of interpretable classification models include decision trees, rule sets, and rule lists. Learning such models often involves optimizing hyperparameters, which typically requires substantial amounts of data and may result in relatively large models. In this paper, we consider the problem of learning compact yet accurate probabilistic rule lists for multiclass classification. Specifically, we propose a novel formalization based on probabilistic rule lists and the minimum description length (MDL) principle. This results in virtually parameter-free model selection that naturally allows model complexity to be traded off against goodness of fit, by which overfitting and the need for hyperparameter tuning are effectively avoided. Finally, we introduce the Classy algorithm, which greedily finds rule lists according to the proposed criterion. We empirically demonstrate that Classy selects small probabilistic rule lists that outperform state-of-the-art classifiers when it comes to the combination of predictive performance and interpretability. We show that Classy is insensitive to its only parameter, i.e., the candidate set, and that compression on the training set correlates with classification performance, validating our MDL-based selection criterion.
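
    The sketch below illustrates the greedy, compression-driven selection loop the abstract describes: candidate rules are appended to the list only while they reduce the total description length (model cost plus data cost), so no hyperparameter governs model size. The cost functions are simplified stand-ins, not Classy's actual MDL code-length definitions, and the rule representation (a condition function paired with a class-probability table) is an assumption for illustration.

        import math

        def data_cost(rule_list, examples, labels, classes):
            """Bits needed to encode the labels given the rule list: the first
            matching rule's class distribution is used, else a uniform default."""
            bits = 0.0
            for x, y in zip(examples, labels):
                probs = next((p for cond, p in rule_list if cond(x)),
                             {c: 1.0 / len(classes) for c in classes})
                bits += -math.log2(max(probs.get(y, 1e-9), 1e-9))
            return bits

        def model_cost(rule_list):
            """Crude per-rule cost in bits (stand-in for a refined MDL model code)."""
            return 32.0 * len(rule_list)

        def greedy_mdl_rule_list(candidates, examples, labels, classes):
            """Greedily append the candidate rule that most reduces the total cost."""
            rule_list = []
            best = model_cost(rule_list) + data_cost(rule_list, examples, labels, classes)
            while candidates:
                scored = [(model_cost(rule_list + [r]) +
                           data_cost(rule_list + [r], examples, labels, classes), r)
                          for r in candidates]
                cost, rule = min(scored, key=lambda t: t[0])
                if cost >= best:      # no candidate compresses further: stop
                    break
                best, rule_list = cost, rule_list + [rule]
            return rule_list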