Wikipedia-based hybrid document representation for textual news classification
The sheer number of news items published every day makes the task of automating their classification worthwhile. The common approach consists in representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts (units of meaning) have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. The reality is that, when classifying news items, the BoW representation has proven to be very strong, with several studies reporting it to perform above different "flavours" of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier, comparing it with BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created ex professo for this study: the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher in the more "concept-friendly" Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) on average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence. Results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.
Funding: Atlantic Research Center for Information and Communication Technologies; Xunta de Galicia | Ref. R2014/034 (RedPlir); Xunta de Galicia | Ref. R2014/029 (TELGalicia)
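For illustration, here is a minimal Python sketch of the hybrid representation idea: word-frequency features concatenated with features for the Wikipedia concepts found in the text. The CONCEPTS phrase-to-article map is a hypothetical stand-in invented for this sketch; the paper's WikiBoC derives concepts from a full semantic analysis against Wikipedia.

from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack

# Hypothetical phrase -> Wikipedia-article mapping (an assumption, not the paper's).
CONCEPTS = {"interest rate": "Interest_rate", "central bank": "Central_bank"}

def concept_features(docs):
    # Represent each document by the Wikipedia concepts it mentions.
    return [" ".join(c for p, c in CONCEPTS.items() if p in d.lower()) for d in docs]

docs = ["The central bank raised the interest rate.",
        "Oil prices fell sharply on Monday."]

bow = CountVectorizer()                 # classic bag-of-words features
boc = CountVectorizer(lowercase=False)  # bag-of-concepts features

X = hstack([bow.fit_transform(docs), boc.fit_transform(concept_features(docs))])
print(X.shape)  # one row per document: BoW columns followed by concept columns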
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys.
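As a concrete illustration of the inductive process the survey describes, the following minimal Python sketch (assuming scikit-learn, with toy data invented here) learns the characteristics of two categories from preclassified documents and then classifies an unseen one.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Preclassified documents: the set the inductive process learns from.
train_docs = ["stocks rallied after the earnings report",
              "the team won the championship final",
              "the central bank cut interest rates",
              "the striker scored twice in the match"]
train_labels = ["finance", "sports", "finance", "sports"]

# Automatically build a classifier from the labelled examples.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

print(clf.predict(["bank stocks fell as interest rates rose"]))  # -> ['finance']

No category rules are written by hand: the pipeline infers them from the examples, which is the manpower saving the survey contrasts with knowledge engineering.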
Support Vector Machines (SVM) in Text Extraction
Text categorization is the process of grouping documents or words into predefined
categories. Each category consists of documents or words having similar attributes.
Numerous algorithms exist to address the need for text categorization, including
Naive Bayes, the k-nearest-neighbour classifier, and decision trees. In this project,
Support Vector Machines (SVM) are studied and experimented with through the
implementation of a textual extractor. This algorithm is used to extract important
points from a lengthy document: it classifies each word in the document under its
relevant category and constructs the structure of the summary with reference to the
categorized words. The performance of the extractor is evaluated on the same corpus
against an existing summarizer that uses a different kind of approach. Summarization
is related to text categorization, is considered an essential part of today's
information-led society, and has been a growing area of research for over 40 years.
This project's objective is to create a summarizer, or extractor, based on machine
learning algorithms, namely SVM and K-Means. Each word in a document is processed
by both algorithms to determine its actual occurrence in the document: it is first
clustered into categories based on parts of speech (verb, noun, adjective) by
K-Means, and then processed by SVM to determine the actual occurrence of each word
in each cluster, taking into account whether the words have similar meanings to
other words in the subsequent cluster. The corpus chosen to evaluate the application
is the Reuters-21578 dataset, comprising newspaper articles. Evaluation is carried
out against another system-generated extract already on the market, to observe the
amount of sentence overlap with the tested applications, in this case the Text
Extractor and the Microsoft Word AutoSummarizer. Results show that the Text
Extractor has optimal results at compression rates of 10-20% and 35-45%.
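A rough Python sketch (scikit-learn, with toy features and labels invented here; the project's real inputs are parts of speech and word occurrences) of the two-stage idea above: K-Means groups word vectors into clusters, and an SVM then separates important from unimportant words within each cluster.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical word vectors: [document frequency, mean sentence position, length].
words = rng.random((40, 3))
importance = (words[:, 0] > 0.5).astype(int)  # toy "important word" labels

# Stage 1: cluster the words (the project clusters by part of speech).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(words)

# Stage 2: within each cluster, an SVM decides which words anchor the summary.
for c in range(3):
    mask = clusters == c
    if len(set(importance[mask])) < 2:
        continue  # an SVM needs both classes present to fit
    svm = SVC(kernel="linear").fit(words[mask], importance[mask])
    print(f"cluster {c}: train accuracy {svm.score(words[mask], importance[mask]):.2f}")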
Integrating Structure and Meaning: Using Holographic Reduced Representations to Improve Automatic Text Classification
Current representation schemes for automatic text classification treat documents as syntactically unstructured collections of words (Bag-of-Words) or 'concepts' (Bag-of-Concepts). Past attempts to encode syntactic structure have treated part-of-speech information as another word-like feature, but have been shown to be less effective than non-structural approaches. We propose a new representation scheme using Holographic Reduced Representations (HRRs) as a technique to encode both semantic and syntactic structure, though in very different ways. This method is unique in the literature in that it encodes the structure across all features of the document vector while preserving text semantics. Our method does not increase the dimensionality of the document vectors, allowing for efficient computation and storage. We present the results of various Support Vector Machine classification experiments that demonstrate the superiority of this method over Bag-of-Concepts representations and improvement over Bag-of-Words in certain classification contexts.
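For readers unfamiliar with HRRs, the following NumPy sketch shows the core operations the paper builds on: circular convolution binds two vectors into a trace of the same dimensionality, and convolution with a cue's involution approximately unbinds it. It illustrates the general HRR mechanism, not the paper's specific document-encoding scheme.

import numpy as np

def bind(a, b):
    # Circular convolution via FFT: the bound trace keeps the same dimension.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(trace, cue):
    # Approximate inverse: convolve the trace with the cue's involution.
    involution = np.concatenate(([cue[0]], cue[:0:-1]))
    return bind(trace, involution)

d = 1024
rng = np.random.default_rng(0)
role, filler = rng.normal(0, 1 / np.sqrt(d), (2, d))  # random HRR vectors

trace = bind(role, filler)       # same dimensionality as the inputs
recovered = unbind(trace, role)  # a noisy copy of `filler`
cos = recovered @ filler / (np.linalg.norm(recovered) * np.linalg.norm(filler))
print(f"cosine(recovered, filler) = {cos:.2f}")  # well above chance

Because binding never grows the vector, document vectors built this way stay fixed-size, which is the computation and storage advantage the abstract notes.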
A new classification technique based on hybrid fuzzy soft set theory and supervised fuzzy c-means
Recent advances in information technology have led to significant changes in today's
world. The generation and collection of data have been increasing rapidly. Popular
use of the World Wide Web (WWW) as a global information system has led to a
tremendous amount of information, much of it in the form of text documents. This
explosive growth has generated an urgent need for new techniques and automated tools
that can assist us in transforming the data into more useful information and
knowledge. Data mining was born of these requirements. One of the essential
processes within data mining is classification, which can be used to classify text
documents and is applied in many useful everyday applications. There are many
classification methods used to classify text documents, such as Bayesian,
K-Nearest Neighbour, Rocchio, SVM, and Soft Set Theory classifiers. Although those
methods are quite successful, accuracy and efficiency remain open issues for the
text classification problem. This study proposes a new approach to the
classification problem based on hybrid fuzzy soft set theory and supervised fuzzy
c-means, called the Hybrid Fuzzy Classifier (HFC). The HFC uses the fuzzy soft set
as the data representation and the supervised fuzzy c-means as the classifier. To
evaluate the performance of the HFC, two well-known datasets are used, namely
20 Newsgroups and Reuters-21578, and its performance is compared with that of
classic fuzzy soft set classifiers and classic text classifiers. The results show
that the HFC performs up to 50.42% better than the classic fuzzy soft set
classifier and up to 0.50% better than the classic text classifier.
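A compact NumPy sketch of the fuzzy c-means core underlying such a classifier: every point receives a graded membership in every cluster rather than a hard label. It shows only the standard unsupervised update loop; the paper's supervised variant and its fuzzy soft set representation are not reproduced here.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=50):
    rng = np.random.default_rng(0)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)  # memberships of each point sum to 1
    for _ in range(iters):
        W = U ** m                                    # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted centroids
        dist = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        U = 1.0 / dist ** (2.0 / (m - 1.0))           # closer centre, higher membership
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

# Toy 2-D "document" vectors: two loose groups plus one ambiguous point.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9], [0.5, 0.5]])
U, centers = fuzzy_c_means(X, c=2)
print(np.round(U, 2))  # the middle point splits its membership across both classes

A supervised variant in the spirit of the paper could seed the centroids from the labelled training documents of each class instead of from random memberships.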
- …