Thematic Annotation: extracting concepts out of documents
In contrast to standard approaches to topic annotation, the technique used in
this work does not centrally rely on any form of (possibly statistical)
keyword extraction. Instead, the proposed annotation algorithm uses a large
scale semantic database -- the EDR Electronic Dictionary -- that provides a
concept hierarchy based on hyponym and hypernym relations. This concept
hierarchy is used to generate a synthetic representation of the document by
aggregating the words present in topically homogeneous document segments into a
set of concepts best preserving the document's content.
This new extraction technique uses an unexplored approach to topic selection.
Instead of applying semantic similarity measures derived from a semantic
resource, the resource itself is processed to extract the part of the
conceptual hierarchy relevant to the document's content. This conceptual
hierarchy is then searched for the
most relevant set of concepts to represent the topics discussed in the
document. Notice that this algorithm is able to extract generic concepts that
are not directly present in the document.
Comment: Technical report EPFL/LIA. 81 pages, 16 figures
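The aggregation step can be sketched in a few lines, assuming a toy child-to-parent hypernym table in place of the EDR dictionary (which is not reproduced here); the words and hierarchy below are purely illustrative:

```python
from collections import Counter

# Toy hypernym table (child -> parent): a stand-in for the EDR concept hierarchy.
TOY_HIERARCHY = {
    "dog": "mammal", "cat": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal", "animal": "entity",
}

def ancestors(word, hierarchy):
    """Walk up the hierarchy, collecting every hypernym above the word."""
    chain = []
    while word in hierarchy:
        word = hierarchy[word]
        chain.append(word)
    return chain

def aggregate_concepts(words, hierarchy):
    """Score each candidate concept by how many document words it covers."""
    scores = Counter()
    for w in words:
        for concept in ancestors(w, hierarchy):
            scores[concept] += 1
    return scores

scores = aggregate_concepts(["dog", "cat", "sparrow"], TOY_HIERARCHY)
# "animal" covers all three words even though it never occurs in the text,
# illustrating how generic concepts absent from the document can be selected.
```

The actual algorithm operates on topically homogeneous document segments and selects the concept set best preserving the document's content; this sketch shows only the word-to-concept aggregation it builds on.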
From Review to Rating: Exploring Dependency Measures for Text Classification
Various text analysis techniques exist, which attempt to uncover unstructured
information from text. In this work, we explore using statistical dependence
measures for textual classification, representing text as word vectors. Student
satisfaction scores on a 3-point scale and their free text comments written
about university subjects are used as the dataset. We compared two textual
representations, a term-frequency (bag-of-words) representation and word
vectors, and found that word vectors provide greater accuracy. However, word
vectors have a large number of features, which
aggravates the burden of computational complexity. Thus, we explored using a
non-linear dependency measure for feature selection by maximizing the
dependence between the text reviews and corresponding scores. Our quantitative
and qualitative analysis on a student satisfaction dataset shows that our
approach achieves comparable accuracy to the full feature vector, while being
an order of magnitude faster in testing. These text analysis and feature
reduction techniques can be used for other textual data applications such as
sentiment analysis.
Comment: 8 pages
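The feature-selection idea can be made concrete with a specific measure. The abstract does not name its non-linear dependency measure, so the sketch below assumes a biased HSIC (Hilbert-Schmidt Independence Criterion) estimate with Gaussian kernels; columns of the feature matrix X are ranked by their dependence on the score vector y:

```python
import math

def _gram(v, sigma=1.0):
    """Gaussian-kernel Gram matrix of a 1-D sample."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in v] for a in v]

def _center(K):
    """Double-center a square matrix (subtract row/column means, add grand mean)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + grand for j in range(n)] for i in range(n)]

def hsic(x, y):
    """Biased HSIC estimate: elementwise <Kc, Lc> / (n - 1)^2."""
    n = len(x)
    Kc, Lc = _center(_gram(x)), _center(_gram(y))
    return sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2

def select_features(X, y, k):
    """Return the indices of the k columns of X most dependent on y."""
    scores = [(hsic([row[j] for row in X], y), j) for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Column 0 tracks the score y exactly; column 1 repeats the same values in
# both score groups and so carries no information about y.
X = [[0.0, 0.3], [0.0, -0.1], [0.0, 0.2],
     [1.0, 0.3], [1.0, -0.1], [1.0, 0.2]]
y = [0, 0, 0, 1, 1, 1]
selected = select_features(X, y, 1)  # keeps column 0
```

Scoring each feature independently, as here, is what makes testing an order of magnitude faster: the classifier sees only the top-k columns instead of the full word-vector dimensionality.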
Wikipedia-based hybrid document representation for textual news classification
Automatic classification of news articles is a relevant problem due to the large volume of news generated every day, so it is crucial that these articles are classified to allow users to access information of interest quickly and effectively. On the one hand, traditional classification systems represent documents as bags of words (BoW), a representation that is oblivious to two problems of language: synonymy and polysemy. On the other hand, several authors propose a bag-of-concepts (BoC) representation of documents, which tackles synonymy and polysemy. This paper shows the benefits of using a hybrid representation of documents for the classification of textual news, leveraging the advantages of both approaches: the traditional BoW representation and a BoC approach based on Wikipedia knowledge. To evaluate the proposal, we used three of the most relevant algorithms in the state of the art (SVM, Random Forest and Naïve Bayes) and two corpora: the Reuters-21578 corpus and a purpose-built corpus, Reuters-27000. The results obtained show that the performance of the classification algorithm depends on the dataset used, and also demonstrate that enriching the BoW representation with the concepts extracted from documents by the semantic annotator adds useful information to the classifier and improves its performance. The experiments conducted show performance increases of up to 4.12% when classifying the Reuters-21578 corpus with the SVM algorithm and up to 49.35% when classifying the Reuters-27000 corpus with the Random Forest algorithm.
Atlantic Research Center for Information and Communication Technologies
Xunta de Galicia | Ref. R2014/034 (RedPlir)
Xunta de Galicia | Ref. R2014/029 (TELGalicia)
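The hybrid representation can be sketched as a simple concatenation of a BoW vector and a BoC vector. The word-to-concept table below is a hypothetical stand-in for the Wikipedia-based semantic annotator, and the vocabularies are illustrative:

```python
from collections import Counter

# Hypothetical annotator output: surface words mapped to Wikipedia concepts.
WORD_TO_CONCEPT = {
    "striker": "Football", "goal": "Football",
    "stocks": "Finance", "shares": "Finance",
}

def hybrid_vector(tokens, vocab, concepts):
    """Concatenate bag-of-words counts with bag-of-concepts counts."""
    word_counts = Counter(tokens)
    concept_counts = Counter(WORD_TO_CONCEPT[t] for t in tokens if t in WORD_TO_CONCEPT)
    bow = [word_counts[w] for w in vocab]
    boc = [concept_counts[c] for c in concepts]
    return bow + boc

vec = hybrid_vector(["striker", "goal", "goal"],
                    vocab=["striker", "goal", "stocks"],
                    concepts=["Football", "Finance"])
# The BoC half groups related surface terms under one concept, which is how
# the hybrid vector mitigates synonymy that a pure BoW vector misses.
```

In the setup described above, vectors of this shape would then be fed to a standard classifier such as SVM, Random Forest or Naïve Bayes.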