Fast logistic regression for text categorization with variable-length n-grams
A common representation used in text categorization is the bag-of-words model (also known as the unigram model). Learning with this representation typically involves some preprocessing, e.g. stopword removal and stemming, which results in one explicit tokenization of the corpus. In this work, we introduce a logistic regression approach in which learning involves automatic tokenization. This weakens the a priori knowledge required about the corpus and results in a tokenization with variable-length (word or character) n-grams as basic tokens. We accomplish this by solving logistic regression using gradient ascent in the space of all n-grams. We show that this can be done very efficiently using a branch-and-bound approach that chooses the maximum gradient ascent direction projected onto a single dimension (i.e., a candidate feature). Although the space is very large, our method allows us to investigate variable-length n-gram learning. We demonstrate the efficiency of our approach compared to state-of-the-art classifiers used for text categorization, such as cyclic coordinate descent logistic regression and support vector machines.
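The central idea, gradient ascent that at each step follows the single n-gram coordinate with the largest gradient magnitude, can be sketched in pure Python. This toy version enumerates all candidate n-grams exhaustively instead of using the paper's branch-and-bound search, and the corpus, n-gram lengths, learning rate, and step count are all illustrative choices:

```python
import math

# Toy corpus with binary labels (illustrative only).
docs = ["good movie", "great movie", "bad plot", "awful plot"]
labels = [1, 1, 0, 0]

def ngrams(text, nmin=2, nmax=4):
    """All character n-grams of the text with length nmin..nmax."""
    return {text[i:i + n] for n in range(nmin, nmax + 1)
            for i in range(len(text) - n + 1)}

doc_grams = [ngrams(d) for d in docs]        # precomputed token sets
features = sorted(set().union(*doc_grams))   # candidate n-gram space

weights = {}   # sparse weight vector; a missing n-gram has weight 0
bias = 0.0

def margin(grams):
    return bias + sum(w for g, w in weights.items() if g in grams)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Greedy coordinate ascent: each step picks the single n-gram whose
# log-likelihood gradient has the largest magnitude (the steepest
# one-dimensional projection of the gradient), then updates only it.
lr = 0.5
for _ in range(200):
    residuals = [y - sigmoid(margin(g)) for g, y in zip(doc_grams, labels)]
    best_g, best_grad = None, 0.0
    for g in features:
        grad = sum(r for gs, r in zip(doc_grams, residuals) if g in gs)
        if abs(grad) > abs(best_grad):
            best_g, best_grad = g, grad
    weights[best_g] = weights.get(best_g, 0.0) + lr * best_grad
    bias += lr * sum(residuals) / len(docs)

preds = [1 if margin(g) > 0 else 0 for g in doc_grams]
print(preds)
```

The exhaustive inner loop is exactly what the paper's branch-and-bound search avoids: a bound on the gradient of any n-gram extending a given prefix lets whole subtrees of the candidate space be pruned.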
Enhanced ontology-based text classification algorithm for structurally organized documents
Text classification (TC) is an important foundation of information retrieval and text mining. The main task of TC is to predict a text's class according to a set of tags given in advance. Most TC algorithms represent a document by its terms without considering the relations among those terms: documents are placed in a space where every word is assumed to be a dimension. Such representations are high-dimensional, which negatively affects classification performance. The objectives of this thesis are to formulate algorithms for classifying text by creating suitable feature vectors and reducing the dimensionality of the data, thereby improving classification accuracy. This research combines ontology and text representation for classification by developing five algorithms. The first and second algorithms, Concept Feature Vector (CFV) and Structure Feature Vector (SFV), create feature vectors to represent the document. The third algorithm, Ontology Based Text Classification (OBTC), is designed to reduce the dimensionality of training sets. The fourth and fifth algorithms, Concept Feature Vector_Text Classification (CFV_TC) and Structure Feature Vector_Text Classification (SFV_TC), classify the document into its related set of classes. The proposed algorithms were tested on five scientific-paper datasets downloaded from different digital libraries and repositories. Experimental results obtained from the proposed CFV_TC and SFV_TC algorithms show better average precision, recall, F-measure, and accuracy compared against the SVM and RSS approaches. This study contributes to exploring related documents in information retrieval and text mining research by using ontology in TC.
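The general idea behind a concept feature vector can be sketched as follows: instead of treating every word as its own dimension, tokens are mapped to a much smaller set of ontology concepts. The mini-ontology, concept names, and document below are hypothetical illustrations, not the thesis's actual data or algorithm:

```python
# Hypothetical mini-ontology mapping concepts to related terms
# (illustrative data, not the thesis's actual ontology).
ontology = {
    "machine_learning": {"classifier", "training", "svm", "regression"},
    "biology":          {"cell", "protein", "gene", "enzyme"},
}

def concept_feature_vector(doc, ontology):
    """Count, per ontology concept, how many of the document's tokens
    map to that concept; the concepts become the (few) dimensions."""
    tokens = doc.lower().split()
    return {c: sum(t in terms for t in tokens) for c, terms in ontology.items()}

doc = "The SVM classifier needs training data"
cfv = concept_feature_vector(doc, ontology)
print(cfv)
```

A document is thus represented by as many dimensions as there are concepts, rather than one dimension per distinct word, which is the dimensionality reduction the abstract describes.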
Sentiment Analysis of Czech Texts: An Algorithmic Survey
In the area of online communication, commerce and transactions, analyzing
sentiment polarity of texts written in various natural languages has become
crucial. While there have been a lot of contributions in resources and studies
for the English language, "smaller" languages like Czech have not received much
attention. In this survey, we explore the effectiveness of many existing
machine learning algorithms for sentiment analysis of Czech Facebook posts and
product reviews. We report the sets of optimal parameter values for each
algorithm and the scores on both datasets. We find that support vector machines are the best-performing classifier, and that attempts to increase performance further with bagging, boosting, or voting ensemble schemes do not succeed.

Comment: 7 pages, 2 figures, 7 tables. Published in the proceedings of the 11th International Conference on Agents and Artificial Intelligence - ICAART 2019 and available at http://www.scitepress.org/PublicationsDetail.aspx?ID=1InVq6xKdwE=&t=1 The paper content is identical to the previous version; only the publication metadata has been updated.
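One of the algorithm families such surveys typically evaluate, multinomial Naive Bayes, can be sketched from scratch. The tiny English training set below is a stand-in for the Czech Facebook posts and product reviews, and the implementation is a minimal illustration rather than the survey's actual experimental setup:

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative training set (English stand-ins for the Czech data).
train = [("great product love it", "pos"),
         ("really great service", "pos"),
         ("terrible waste of money", "neg"),
         ("really terrible quality", "neg")]

word_counts = defaultdict(Counter)   # per-class term frequencies
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("great quality"))
print(predict("terrible service"))
```

Laplace smoothing keeps unseen words from zeroing out a class's probability, which matters on short social-media texts with sparse vocabularies.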
A sentiment analysis model to evaluate people’s opinion about artificial intelligence
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics.

With the internet, people are far more able to express and share their ideas and what they think about a given topic. Social networks such as Facebook and Twitter, YouTube, online review sites like Zomato, online news sites, and personal blogs are the platforms usually used for this purpose. Every business wants to know what people think about its products; many people and politicians want predictions for political elections; and it can be useful to understand how opinions are distributed on controversial themes. The analysis of textual data is thus a necessity for staying competitive.

In this work, Sentiment Analysis techniques are applied to opinions from different online sources regarding artificial intelligence, a controversial field that has been the target of considerable debate in recent years.

First, a careful review is presented of the concept of Sentiment Analysis and the techniques and processes involved, such as data preprocessing, feature extraction and selection, sentiment classification approaches, and machine learning algorithms: Naïve Bayes, Neural Networks, Random Forest, Support Vector Machines, Logistic Regression, and Stochastic Gradient Descent. Based on previous work, the main conclusions about which techniques work better in which situations are highlighted. The methodology followed in applying Sentiment Analysis to artificial intelligence as a controversial field is then described; the auxiliary tool used for this work is Python. Finally, results are presented and discussed.
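A typical preprocessing and feature-extraction step of such a pipeline can be sketched in Python. The stopword list and example sentence below are illustrative, not the thesis's actual resources:

```python
import re
from collections import Counter

# Illustrative English stopword list (a real pipeline would use a
# curated list for the target language).
STOPWORDS = {"the", "is", "a", "of", "and", "in", "to"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and remove stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(tokens):
    """Term-frequency feature vector as a sparse dict."""
    return dict(Counter(tokens))

doc = "The rise of AI is a topic of debate in the news."
tokens = preprocess(doc)
print(tokens)
print(bag_of_words(tokens))
```

The resulting sparse vectors are what the classifiers listed above (Naïve Bayes, SVM, Logistic Regression, and so on) consume as input.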
Identifying Hidden Visits from Sparse Call Detail Record Data
Despite a large body of literature on trip inference using call detail record
(CDR) data, a fundamental understanding of their limitations is lacking. In
particular, because of the sparse nature of CDR data, users may travel to a
location without being revealed in the data, which we refer to as a "hidden
visit". The existence of hidden visits hinders our ability to extract reliable
information about human mobility and travel behavior from CDR data. In this
study, we propose a data fusion approach to obtain labeled data for statistical
inference of hidden visits. In the absence of complementary data, this can be
accomplished by extracting labeled observations from more granular cellular
data access records, and extracting features from voice call and text messaging
records. The proposed approach is demonstrated using a real-world CDR dataset
of 3 million users from a large Chinese city. Logistic regression, support
vector machine, random forest, and gradient boosting are used to infer whether
a hidden visit exists during a displacement observed from CDR data. The test
results show significant improvement over the naive no-hidden-visit rule, which
is an implicit assumption adopted by most existing studies. Based on the
proposed model, we estimate that over 10% of the displacements extracted from
CDR data involve hidden visits. The proposed data fusion method offers a
systematic statistical approach to inferring individual mobility patterns based
on telecommunication records.
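The core classification step, logistic regression on features of a displacement, can be sketched on synthetic data. The single "observation gap" feature, its threshold, and the labels below are invented for illustration; the paper instead derives labels from cellular data access records and uses richer features from voice call and text messaging records:

```python
import math
import random

random.seed(0)

# Synthetic illustration: one hypothetical feature per displacement,
# the scaled time gap between consecutive CDR observations.
# Label 1 = a hidden visit occurred (toy rule: long gaps hide visits).
def make_example():
    gap_hours = random.uniform(0.1, 8.0)
    hidden = 1 if gap_hours > 3.0 else 0
    return [1.0, (gap_hours - 4.0) / 4.0], hidden   # [intercept, scaled gap]

data = [make_example() for _ in range(200)]

# Plain batch gradient ascent on the logistic log-likelihood.
w = [0.0, 0.0]
lr = 0.5
for _ in range(500):
    grad = [0.0, 0.0]
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1])))
        grad[0] += (y - p) * x[0]
        grad[1] += (y - p) * x[1]
    w = [wi + lr * gi / len(data) for wi, gi in zip(w, grad)]

acc = sum(((w[0] * x[0] + w[1] * x[1]) > 0) == bool(y)
          for x, y in data) / len(data)
print(round(acc, 2))
```

The same fitted model can then be applied to displacements without labels, which is how the study extrapolates its estimate that over 10% of displacements involve hidden visits.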
Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes
We define a family of probability distributions for random count matrices
with a potentially unbounded number of rows and columns. The three
distributions we consider are derived from the gamma-Poisson, gamma-negative
binomial, and beta-negative binomial processes. Because the models lead to
closed-form Gibbs sampling update equations, they are natural candidates for
nonparametric Bayesian priors over count matrices. A key aspect of our analysis
is the recognition that, although the random count matrices within the family
are defined by a row-wise construction, their columns can be shown to be i.i.d.
This fact is used to derive explicit formulas for drawing all the columns at
once. Moreover, by analyzing these matrices' combinatorial structure, we
describe how to sequentially construct a column-i.i.d. random count matrix one
row at a time, and derive the predictive distribution of a new row count vector
with previously unseen features. We describe the similarities and differences
between the three priors, and argue that the greater flexibility of the gamma- and beta-negative binomial processes, especially their ability to model over-dispersed, heavy-tailed count data, makes them well suited to a wide
variety of real-world applications. As an example of our framework, we
construct a naive-Bayes text classifier to categorize a count vector to one of
several existing random count matrices of different categories. The classifier
supports an unbounded number of features, and unlike most existing methods, it
does not require a predefined finite vocabulary to be shared by all the
categories, and needs neither feature selection nor parameter tuning. Both the
gamma- and beta-negative binomial processes are shown to significantly
outperform the gamma-Poisson process for document categorization, with
comparable performance to other state-of-the-art supervised text classification
algorithms.

Comment: To appear in Journal of the American Statistical Association (Theory and Methods). 31 pages + 11-page supplement, 5 figures.
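The row-wise gamma-Poisson construction can be illustrated with a finite truncation: draw a gamma-distributed rate per column, then fill each row with independent Poisson counts at those rates. The truncation level K, the gamma parameters, and the matrix size below are arbitrary choices for illustration; the actual process supports an unbounded number of columns:

```python
import math
import random

random.seed(1)

def poisson(lam):
    """Poisson draw via Knuth's multiplication method (fine for small rates)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Finite truncation of the gamma-Poisson construction: a gamma rate per
# column, then each row filled with Poisson counts at those rates.
K, rows = 6, 4
rates = [random.gammavariate(0.5, 1.0) for _ in range(K)]
matrix = [[poisson(r) for r in rates] for _ in range(rows)]
for row in matrix:
    print(row)
```

Because every row is drawn from the same per-column rates, the columns of the resulting matrix are exchangeable, which is the finite analogue of the column-i.i.d. property the paper exploits.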