Non-Standard Words as Features for Text Categorization
This paper presents the categorization of Croatian texts using Non-Standard
Words (NSWs) as features. Non-Standard Words include numbers, dates, acronyms,
abbreviations, currency expressions, etc. NSWs in Croatian are determined
according to the Croatian NSW taxonomy. For the purpose of this research, 390
text documents were collected to form the SKIPEZ collection, which has 6
classes: official, literary, informative, popular, educational and scientific.
The text categorization experiment was conducted on three different
representations of the SKIPEZ collection: in the first representation, the
frequencies of NSWs are used as features; in the second, statistical measures
of the NSWs (variance, coefficient of variation, standard deviation, etc.) are
used as features; the third representation combines the first two feature sets.
Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms
were used in the text categorization experiments. The best categorization
results were achieved using the first feature set (NSW frequencies), with a
categorization accuracy of 87%. This suggests that NSWs should be considered
as features in highly inflectional languages such as Croatian. NSW-based
features reduce the dimensionality of the feature space without standard
lemmatization procedures, and therefore the bag-of-NSWs representation should
be considered for further Croatian text categorization experiments.
Comment: IEEE 37th International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2014), pp. 1415-1419, 2014
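The NSW-frequency representation can be sketched with simple pattern matching. The categories and regular expressions below are illustrative assumptions, not the Croatian NSW taxonomy used in the paper:

```python
import re

# Hypothetical NSW categories loosely inspired by the paper's feature set;
# these patterns are illustrative, not the authors' taxonomy rules.
NSW_PATTERNS = {
    "number":   re.compile(r"\b\d+(?:[.,]\d+)?\b"),
    "date":     re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{2,4}\b"),
    "acronym":  re.compile(r"\b[A-Z]{2,}\b"),
    "currency": re.compile(r"(?:\bkn\b|€|\$)"),
}

def nsw_frequencies(text):
    """Return a feature vector of NSW counts, one count per category."""
    return {name: len(pat.findall(text)) for name, pat in NSW_PATTERNS.items()}

doc = "HNB je 12.5.2014. objavio tečaj od 7,61 kn za 1 EUR."
print(nsw_frequencies(doc))
```

A document is thus mapped to a short, fixed-length vector of category counts, which is what keeps the feature space small compared to a full bag-of-words over an inflectional vocabulary.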
k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)
Perhaps the most straightforward classifier in the arsenal of machine
learning techniques is the Nearest Neighbour Classifier -- classification is
achieved by identifying the nearest neighbours to a query example and using
those neighbours to determine the class of the query. This approach to
classification is of particular importance because issues of poor run-time
performance are not such a problem these days, given the computational power
that is available. This paper presents an overview of techniques for Nearest
Neighbour classification, focusing on: mechanisms for assessing similarity
(distance), computational issues in identifying nearest neighbours, and
mechanisms for reducing the dimension of the data.
This paper is the second edition of a paper previously published as a
technical report. Sections on similarity measures for time-series, retrieval
speed-up and intrinsic dimensionality have been added. An Appendix is included
providing access to Python code for the key methods.
Comment: 22 pages, 15 figures: an updated edition of an older tutorial on kNN
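The core idea can be captured in a few lines. This is a minimal from-scratch sketch using Euclidean distance and a plurality vote, not the paper's own code:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; similarity is
    assessed with Euclidean distance, the simplest of the distance
    mechanisms surveyed in the paper.
    """
    neighbours = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.0), "b"), ((5.5, 4.5), "b")]
print(knn_predict(train, (1.1, 1.0)))  # all near neighbours are class "a"
```

The brute-force sort here is O(n log n) per query; the retrieval speed-up techniques discussed in the paper (e.g. index structures) exist precisely to avoid this scan on large training sets.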
A multistrategy approach for digital text
The goal of the research described here is to develop a multistrategy classifier system that can be used for document categorization. The system automatically discovers classification patterns by applying several empirical learning methods to different representations for preclassified documents.
The learners work in parallel: each learner carries out its own feature selection based on evolutionary techniques and then obtains a classification model. When classifying documents, the system combines the predictions of the learners by applying evolutionary techniques as well. The system relies on a modular, flexible architecture that makes no assumptions about the design of the learners or the number of learners available, and guarantees independence from the thematic domain.
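A weighted plurality vote is one simple way to combine the learners' predictions. The weights below are fixed, illustrative values, whereas the system described above tunes the combination with evolutionary techniques:

```python
from collections import Counter

def combine_predictions(predictions, weights=None):
    """Weighted plurality vote over labels proposed by independent learners.

    `predictions` is one label per learner; `weights` (hypothetical values
    here) would be evolved rather than hand-set in the system above.
    """
    weights = weights or [1.0] * len(predictions)
    score = Counter()
    for label, w in zip(predictions, weights):
        score[label] += w
    return score.most_common(1)[0][0]

# Three hypothetical learners disagree; the vote resolves the conflict.
print(combine_predictions(["sports", "sports", "politics"], [0.5, 0.4, 0.8]))
```

Because the combiner only consumes (label, weight) pairs, it makes no assumption about how many learners exist or how each one was built, matching the modularity claim above.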
Proceedings of the Workshop Semantic Content Acquisition and Representation (SCAR) 2007
This is the proceedings of the Workshop on Semantic Content Acquisition and Representation, held in conjunction with NODALIDA 2007 on May 24, 2007 in Tartu, Estonia.
In an absolute state: elevated use of absolutist words is a marker specific to anxiety, depression and suicidal ideation
Absolutist thinking is considered a cognitive distortion by most cognitive therapies for anxiety and depression. Yet, there is little empirical evidence of its prevalence or specificity. Across three studies, we conducted a text analysis of 63 internet forums (over 6,400 members) using the Linguistic Inquiry and Word Count software (Pennebaker, Booth, Boyd, & Francis, 2015) to examine absolutism at the linguistic level. We predicted and found that anxiety, depression and suicidal ideation forums contained more absolutist words than control forums (d's > 3.14). Suicidal ideation forums also contained more absolutist words than anxiety and depression forums (d's > 1.71). We show that these differences are more reflective of absolutist thinking than of psychological distress. Interestingly, absolutist words tracked the severity of affective disorder forums more faithfully than 'negative emotion' words. Finally, we found elevated levels of absolutist words in depression 'recovery' forums. This suggests that absolutist thinking may be a vulnerability factor.
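The linguistic measure amounts to counting dictionary hits per token. The word list below is a small illustrative subset, not the validated absolutist dictionary used in the study:

```python
import re

# A few absolutist terms for illustration only; the study used a separately
# validated dictionary of absolutist words, which is not reproduced here.
ABSOLUTIST = {"always", "never", "completely", "totally", "nothing", "everyone"}

def absolutist_ratio(text):
    """Fraction of word tokens that are absolutist dictionary hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in ABSOLUTIST for t in tokens) / len(tokens)

print(absolutist_ratio("I always feel completely alone"))
```

Comparing this ratio across forum categories, as the study does at much larger scale, is what yields the reported effect sizes.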
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting of the manual
definition of a classifier by domain experts) are very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
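As a concrete instance of the inductive process described above, here is a minimal multinomial Naive Bayes text categorizer with Laplace smoothing; the toy documents and labels are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Learn a multinomial Naive Bayes model from (token_list, label) pairs.

    Stores log-priors per class and Laplace-smoothed log-likelihoods per
    (class, word) pair -- the classic inductive construction from
    preclassified documents.
    """
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {"prior": {}, "cond": {}, "vocab": vocab}
    total = sum(class_docs.values())
    for c, n in class_docs.items():
        model["prior"][c] = math.log(n / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        model["cond"][c] = {w: math.log((word_counts[c][w] + 1) / denom)
                            for w in vocab}
    return model

def classify_nb(model, tokens):
    """Pick the class maximizing log-prior plus summed log-likelihoods."""
    def score(c):
        return model["prior"][c] + sum(
            model["cond"][c][t] for t in tokens if t in model["vocab"])
    return max(model["prior"], key=score)

docs = [(["goal", "match", "team"], "sports"),
        (["vote", "election", "party"], "politics"),
        (["team", "coach", "goal"], "sports"),
        (["party", "minister", "vote"], "politics")]
model = train_nb(docs)
print(classify_nb(model, ["goal", "team"]))
```

The split between `train_nb` and `classify_nb` mirrors the survey's division of the problem into classifier construction and its subsequent application to represented documents.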
k-Nearest Neighbour Classifiers - A Tutorial
Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour Classifier – classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance are not such a problem these days, given the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods.
Sparse Representation of High Dimensional Data for Classification
In this thesis we propose the use of sparse Principal Component Analysis (PCA) for representing high-dimensional data for classification. A sparse transformation reduces the data volume/dimensionality without loss of critical information, so that the data can be processed efficiently and assimilated by a human. We obtained sparse representations of high-dimensional datasets using Sparse Principal Component Analysis (SPCA) and the Direct formulation of Sparse Principal Component Analysis (DSPCA). We then performed classification using the k-Nearest Neighbour (kNN) method and compared the results with regular PCA. The experiments were performed on hyperspectral data and various datasets obtained from the University of California, Irvine (UCI) machine learning repository. The results suggest that sparse data representation is desirable because it enhances interpretability. It also improves classification performance with a certain number of features, and in most cases classification performance is similar to that of regular PCA.
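SPCA and DSPCA themselves involve regularized or semidefinite optimization. As a rough stand-in for the idea, the sketch below computes a first sparse principal component by truncated power iteration, keeping only a fixed number of nonzero loadings per step; everything here is an illustrative assumption, not the thesis's formulation:

```python
import math

def sparse_pc1(X, card, iters=100):
    """First sparse principal component by truncated power iteration.

    Power steps on the sample covariance matrix, zeroing all but the `card`
    largest-magnitude loadings each step -- a simple stand-in for SPCA/DSPCA
    that yields an interpretable, mostly-zero loading vector.
    """
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # Sample covariance matrix of the centered data.
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n for b in range(d)]
         for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        # Truncate: keep only the `card` largest-magnitude entries.
        keep = set(sorted(range(d), key=lambda j: -abs(w[j]))[:card])
        w = [w[j] if j in keep else 0.0 for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return v

# Only dimension 0 varies; the sparse loading vector should isolate it.
X = [[1, 0, 5, 5], [2, 0, 5, 5], [3, 0, 5, 5], [4, 0, 5, 5]]
print(sparse_pc1(X, card=1))
```

Projecting documents onto a handful of such sparse components, and then running kNN in the reduced space, is the classification pipeline the thesis compares against regular PCA.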