6,039 research outputs found
Computing with Granular Words
Computational linguistics is a sub-field of artificial intelligence; it is an interdisciplinary field dealing with statistical and/or rule-based modeling of natural language from a computational perspective. Traditionally, fuzzy logic is used to deal with fuzziness among single linguistic terms in documents. However, linguistic terms may be related to other types of uncertainty. For instance, when different users search for ‘cheap hotel’ in a search engine, they may need distinct pieces of relevant hidden information such as shopping, transportation, or weather. Therefore, this research focuses on studying granular words and developing new algorithms for processing them, so that uncertainty can be dealt with globally. To describe granular words precisely, a new structure called the Granular Information Hyper Tree (GIHT) is constructed. Furthermore, several techniques are developed to support computing with granular words in spam filtering and query recommendation. Simulation results show that the GIHT-Bayesian algorithm achieves a more accurate spam-filtering rate than the conventional Naive Bayesian and SVM methods; computing with granular words also generates better recommendation results, as judged by users' assessments, when applied to a search engine.
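The GIHT structure and the GIHT-Bayesian algorithm are not specified in this abstract, so only the conventional Naive Bayesian baseline it is compared against can be illustrated; below is a minimal sketch of such a baseline, assuming scikit-learn as the tooling and using made-up toy data.

```python
# Minimal sketch of the Naive Bayes spam-filtering baseline mentioned above.
# The GIHT-Bayesian algorithm itself is not described in the abstract; this
# scikit-learn pipeline stands in for the conventional baseline only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus; labels: 1 = spam, 0 = ham.
emails = [
    "cheap hotel deals click now", "meeting agenda attached",
    "win money fast", "quarterly report draft",
]
labels = [1, 0, 1, 0]

baseline = make_pipeline(CountVectorizer(), MultinomialNB())
baseline.fit(emails, labels)
print(baseline.predict(["cheap flights and hotel offers"]))  # likely [1]
```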
Applications of Mining Arabic Text: A Review
Since the appearance of text mining, the Arabic language has attracted interest in applying several text mining tasks to text written in Arabic, and researchers face several challenges in doing so. These tasks include Arabic text summarization, which is one of the challenging open areas for research in the natural language processing (NLP) and text mining fields, Arabic text categorization, and Arabic sentiment analysis. This chapter reviews some of the past and current research and trends in these areas, along with some future challenges that need to be tackled. It also presents case studies for two of the reviewed approaches.
Ariadne's Thread - Interactive Navigation in a World of Networked Information
This work-in-progress paper introduces an interface for the interactive
visual exploration of the context of queries using the ArticleFirst database, a
product of OCLC. We describe a workflow which allows the user to browse live
entities associated with 65 million articles. In the on-line interface, each
query leads to a specific network representation of the most prevailing
entities: topics (words), authors, journals and Dewey decimal classes linked to
the set of terms in the query. This network represents the context of a query.
Each of the network nodes is clickable: by clicking through, a user traverses a
large space of articles along dimensions of authors, journals, Dewey classes
and words simultaneously. We present different use cases of such an interface.
This paper provides a link between the quest for maps of science and on-going
debates in HCI about the use of interactive information visualisation to
empower users in their search.
Comment: CHI'15 Extended Abstracts, April 18-23, 2015, Seoul, Republic of Korea. ACM 978-1-4503-3146-3/15/0
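As a rough illustration of the query-context network described above, the sketch below links a query node to the most frequent entities (authors, journals, Dewey classes, topic words) found in a result set, using networkx; the article metadata, field names, and ranking rule are all assumptions, since the ArticleFirst back end is not described here.

```python
# Illustrative sketch only: builds a small query-context network from
# hypothetical article metadata; the real interface queries OCLC's
# ArticleFirst database, which is not reproduced here.
import networkx as nx
from collections import Counter

# Hypothetical retrieval result: each article carries authors, a journal,
# a Dewey class and topic words.
articles = [
    {"authors": ["Smith"], "journal": "J. Informetrics", "dewey": "020",
     "topics": ["maps", "science"]},
    {"authors": ["Smith", "Lee"], "journal": "Scientometrics", "dewey": "020",
     "topics": ["citation", "maps"]},
]

def query_context_network(articles, query="maps of science", top_k=10):
    """Link the query node to the most frequent entities across the result set."""
    counts = Counter()
    for art in articles:
        for a in art["authors"]:
            counts[("author", a)] += 1
        counts[("journal", art["journal"])] += 1
        counts[("dewey", art["dewey"])] += 1
        for t in art["topics"]:
            counts[("topic", t)] += 1
    g = nx.Graph()
    g.add_node(("query", query))
    for entity, freq in counts.most_common(top_k):
        g.add_edge(("query", query), entity, weight=freq)
    return g

g = query_context_network(articles)
print(g.number_of_nodes(), g.number_of_edges())
```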
MCMC Estimation of Extended Hodrick-Prescott (HP) Filtering Models
The Hodrick-Prescott (HP) method was originally developed to smooth time series, i.e. to extract a smooth (long-term) component. We show that the HP smoother can be viewed as a Bayesian linear model with a strong prior on the smoothness of this component. Extending this Bayesian approach in a linear-model set-up is possible with both a conjugate and a non-conjugate model estimated by MCMC. The Bayesian HP smoothing model is also extended to a spatial smoothing model: after defining spatial neighbours for each observation, a smoothness prior can be used in the same way as for the HP filter in time series. The new smoothing approaches are applied to the (textbook) airline passenger data for time series and to the problem of smoothing spatial regional data. This approach opens up a new class of model-based smoothers for time series and spatial models.
Keywords: Hodrick-Prescott (HP) smoothers, spatial econometrics, MCMC estimation, airline passenger time series, spatial smoothing of regional data, NUTS (nomenclature of territorial units for statistics)
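As a point of reference for the Bayesian reading of the HP smoother, here is a minimal numpy sketch of its classical penalized-least-squares form, whose closed-form solution coincides with the posterior mean under a Gaussian smoothness prior; the MCMC estimation and the spatial extension of the paper are not reproduced, and the smoothing parameter and toy series are illustrative assumptions.

```python
# Classical HP smoother: tau = argmin ||y - tau||^2 + lam * ||D2 tau||^2,
# with closed form tau = (I + lam * D2'D2)^{-1} y, where D2 is the
# second-difference operator. This is the posterior mean under a Gaussian
# smoothness prior, which is the Bayesian view taken in the paper.
import numpy as np

def hp_filter(y, lam=1600.0):
    """Return the smooth (trend) component of a series y."""
    y = np.asarray(y, dtype=float)
    T = y.size
    # Second-difference operator D2 of shape (T-2, T).
    D2 = np.zeros((T - 2, T))
    for t in range(T - 2):
        D2[t, t:t + 3] = [1.0, -2.0, 1.0]
    return np.linalg.solve(np.eye(T) + lam * D2.T @ D2, y)

# Example on a noisy toy series (the airline passenger data is not bundled here).
y = np.log(np.linspace(100, 500, 144)) + 0.05 * np.random.randn(144)
print(hp_filter(y, lam=1600.0)[:5])
```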
An email classification model based on rough set theory
Communication via email is one of the most popular services on the Internet. Email has brought great convenience to our daily work and life. However, unsolicited messages, or spam, flood our mailboxes, wasting bandwidth, time and money. To this end, this paper presents a rough set based model that classifies emails into three categories, spam, non-spam and suspicious, rather than the two classes (spam and non-spam) used in most current approaches. Compared with popular classification methods such as Naive Bayes, the error ratio of non-spam messages being misclassified as spam can be reduced using the proposed model.
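The rough-set attributes and reducts used in the paper are not given in this abstract, so the sketch below only illustrates the underlying three-way intuition: attribute patterns in the lower approximation of the spam class are labelled spam, patterns outside its upper approximation are non-spam, and the boundary region is routed to "suspicious". The feature names and training table are invented for illustration.

```python
# Illustrative sketch of the rough-set intuition behind the three classes.
# Emails with indiscernible attribute patterns form equivalence classes;
# a pattern whose class is purely spam lies in the lower approximation of
# "spam", a purely non-spam pattern lies outside its upper approximation,
# and anything else falls in the boundary region ("suspicious").
from collections import defaultdict

# Hypothetical training table: (has_link, many_caps, known_sender) -> label.
training = [
    ((1, 1, 0), "spam"), ((1, 1, 0), "spam"),
    ((0, 0, 1), "non-spam"), ((1, 0, 1), "non-spam"),
    ((1, 0, 0), "spam"), ((1, 0, 0), "non-spam"),   # conflicting pattern
]

# Group labels by indiscernible attribute pattern (equivalence classes).
classes = defaultdict(set)
for attrs, label in training:
    classes[attrs].add(label)

def classify(attrs):
    labels = classes.get(attrs)
    if labels == {"spam"}:
        return "spam"            # lower approximation of the spam class
    if labels == {"non-spam"}:
        return "non-spam"        # outside the upper approximation of spam
    return "suspicious"          # boundary region (conflicting or unseen)

print(classify((1, 0, 0)))  # -> suspicious
```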
Making AI Meaningful Again
Artificial intelligence (AI) research enjoyed an initial period of enthusiasm in the 1970s and 80s. But this enthusiasm was tempered by a long interlude of frustration when genuinely useful AI applications failed to appear. Today, we are experiencing once again a period of enthusiasm, fired above all by the successes of the technology of deep neural networks or deep machine learning. In this paper we draw attention to what we take to be serious problems underlying current views of artificial intelligence encouraged by these successes, especially in the domain of language processing. We then outline an alternative approach to language-centric AI, in which we identify a role for philosophy.
Inference and Evaluation of the Multinomial Mixture Model for Text Clustering
In this article, we investigate the use of a probabilistic model for
unsupervised clustering in text collections. Unsupervised clustering has become
a basic module for many intelligent text processing applications, such as
information retrieval, text classification or information extraction. The model
considered in this contribution consists of a mixture of multinomial
distributions over the word counts, each component corresponding to a different
theme. We present and contrast various estimation procedures, which apply both
in supervised and unsupervised contexts. In supervised learning, this work
suggests a criterion for evaluating the posterior odds of new documents which
is more statistically sound than the "naive Bayes" approach. In an unsupervised
context, we propose measures to set up a systematic evaluation framework and
start with examining the Expectation-Maximization (EM) algorithm as the basic
tool for inference. We discuss the importance of initialization and the
influence of other features such as the smoothing strategy or the size of the
vocabulary, thereby illustrating the difficulties incurred by the high
dimensionality of the parameter space. We also propose a heuristic algorithm
based on iterative EM with vocabulary reduction to solve this problem. Using
the fact that the latent variables can be analytically integrated out, we
finally show that a Gibbs sampling algorithm is tractable and compares favorably
to the basic expectation-maximization approach.
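Below is a minimal numpy sketch of EM for a mixture of multinomials over word counts, the model described above; the smoothing constant, initialization and iteration count are illustrative assumptions, and the collapsed Gibbs sampler mentioned in the abstract is not shown.

```python
# Minimal EM sketch for a mixture of multinomials over word counts.
# alpha is a Laplace-type smoothing constant (the abstract discusses smoothing
# and initialization but does not fix these choices; values here are assumed).
import numpy as np

def em_multinomial_mixture(X, K, n_iter=50, alpha=1e-2, seed=0):
    """X: (D, V) array of word counts; K: number of themes (components)."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    log_pi = np.log(np.full(K, 1.0 / K))
    theta = rng.dirichlet(np.ones(V), size=K)       # (K, V) word probabilities
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_resp = log_pi + X @ np.log(theta).T     # (D, K), up to a constant
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixture weights and smoothed word distributions.
        log_pi = np.log(resp.sum(axis=0) / D)
        counts = resp.T @ X + alpha                 # (K, V)
        theta = counts / counts.sum(axis=1, keepdims=True)
    return np.exp(log_pi), theta, resp

# Toy usage: 6 tiny documents over a 4-word vocabulary, 2 themes.
X = np.array([[5, 1, 0, 0], [4, 2, 0, 1], [6, 0, 1, 0],
              [0, 1, 5, 4], [1, 0, 4, 5], [0, 0, 6, 3]])
pi, theta, resp = em_multinomial_mixture(X, K=2)
print(resp.argmax(axis=1))   # hard cluster assignments
```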