
    Detection of barriers to mobility in the smart city using Twitter

    We present a system that analyzes data extracted from the microblogging site Twitter to detect the occurrence of events and obstacles that can affect pedestrian mobility, with a special focus on people with impaired mobility. First, the system extracts tweets that match certain predefined terms. Then, it obtains location information from them, using the location provided by Twitter when available as well as searching the text of the tweet for place names. Finally, it applies natural language processing techniques to confirm that an actual event affecting mobility is reported and to extract its properties (which urban element is affected and how). We also present some empirical results that validate the feasibility of our approach.
    This work was supported in part by the Analytics Using Sensor Data for FLATCity Project (Ministerio de Ciencia, Innovación y Universidades/ERDF, EU), funded by the Spanish Agencia Estatal de Investigación (AEI) under Grant TIN2016-77158-C4-1-R, and in part by the European Regional Development Fund (ERDF).
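
    A minimal sketch of the three-step pipeline described above (keyword filtering, location extraction, event confirmation), assuming tweets have already been fetched as plain dictionaries; the term lists, gazetteer and rules are illustrative placeholders, not the system's actual resources.

```python
# Hypothetical sketch of the tweet-processing pipeline; all term lists,
# place names and coordinates are toy values used only for illustration.

MOBILITY_TERMS = {"elevator", "escalator", "ramp", "sidewalk", "roadworks"}
EVENT_TERMS = {"broken", "closed", "blocked", "out of service"}
GAZETTEER = {"gran via": (40.4203, -3.7058), "plaza mayor": (40.4155, -3.7074)}


def matches_terms(text: str) -> bool:
    """Step 1: keep only tweets that mention a predefined urban element."""
    lowered = text.lower()
    return any(term in lowered for term in MOBILITY_TERMS)


def extract_location(tweet: dict):
    """Step 2: prefer the geolocation provided by Twitter, otherwise look
    for a known place name inside the tweet text."""
    if tweet.get("coordinates"):
        return tuple(tweet["coordinates"])
    lowered = tweet["text"].lower()
    for place, coords in GAZETTEER.items():
        if place in lowered:
            return coords
    return None


def extract_event(text: str):
    """Step 3 (greatly simplified): confirm that an event affecting mobility
    is reported and return the affected element and its state."""
    lowered = text.lower()
    element = next((t for t in MOBILITY_TERMS if t in lowered), None)
    state = next((t for t in EVENT_TERMS if t in lowered), None)
    return (element, state) if element and state else None


def process(tweets):
    """Run the full pipeline and yield detected mobility barriers."""
    for tweet in tweets:
        if not matches_terms(tweet["text"]):
            continue
        location = extract_location(tweet)
        event = extract_event(tweet["text"])
        if location and event:
            yield {"where": location, "element": event[0], "status": event[1]}


if __name__ == "__main__":
    sample = [{"text": "Elevator broken at Gran Via metro station", "coordinates": None}]
    print(list(process(sample)))
```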

    Wikipedia-based hybrid document representation for textual news classification

    The sheer amount of news items that are published every day makes it worthwhile to automate their classification. The common approach consists of representing news items by the frequency of the words they contain and using supervised learning algorithms to train a classifier. This bag-of-words (BoW) approach is oblivious to three aspects of natural language: synonymy, polysemy, and multiword terms. More sophisticated representations based on concepts, or units of meaning, have been proposed, following the intuition that document representations that better capture the semantics of text will lead to higher performance in automatic classification tasks. In practice, when classifying news items, the BoW representation has proven to be remarkably strong, with several studies reporting it to perform above different ‘flavours’ of bag of concepts (BoC). In this paper, we propose a hybrid classifier that enriches the traditional BoW representation with concepts extracted from text, leveraging Wikipedia as background knowledge for the semantic analysis of text (WikiBoC). We benchmarked the proposed classifier against BoW and several BoC approaches: Latent Dirichlet Allocation (LDA), Explicit Semantic Analysis, and word embeddings (doc2vec). We used two corpora: the well-known Reuters-21578, composed of newswire items, and a new corpus created ex professo for this study, the Reuters-27000. Results show that (1) the performance of concept-based classifiers is very sensitive to the corpus used, being higher on the more “concept-friendly” Reuters-27000; (2) the proposed Hybrid-WikiBoC approach offers performance increases over BoW of up to 4.12% and 49.35% when classifying the Reuters-21578 and Reuters-27000 corpora, respectively; and (3) in terms of average performance, the proposed Hybrid-WikiBoC outperforms all the other classifiers, achieving a performance increase of 15.56% over the best state-of-the-art approach (LDA) for the largest training sequence. These results indicate that concepts extracted with the help of Wikipedia add useful information that improves classification performance for news items.
    Atlantic Research Center for Information and Communication Technologies; Xunta de Galicia | Ref. R2014/034 (RedPlir); Xunta de Galicia | Ref. R2014/029 (TELGalicia)
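
    A small sketch of how a hybrid BoW plus bag-of-concepts representation could be assembled, assuming Wikipedia concepts have already been extracted for each document by a semantic annotator; the annotator itself is not shown, and the documents, concepts and labels are toy examples.

```python
# Illustrative hybrid BoW + bag-of-concepts feature construction.
# The per-document Wikipedia concepts below are hypothetical placeholders
# standing in for the output of a semantic annotator.

from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC

docs = [
    "Oil prices rise after OPEC meeting",
    "Central bank cuts interest rates",
]
doc_concepts = [
    ["Petroleum", "OPEC", "Price_of_oil"],
    ["Central_bank", "Interest_rate", "Monetary_policy"],
]
labels = ["energy", "economy"]

# Standard bag-of-words features.
bow = TfidfVectorizer()
X_bow = bow.fit_transform(docs)

# Bag-of-concepts features: each Wikipedia concept is treated as one token.
boc = CountVectorizer(analyzer=lambda concepts: concepts)
X_boc = boc.fit_transform(doc_concepts)

# Hybrid representation: concatenate both feature spaces column-wise.
X_hybrid = hstack([X_bow, X_boc])

clf = LinearSVC().fit(X_hybrid, labels)
print(clf.predict(X_hybrid))
```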

    Wikipedia-based hybrid document representation for textual news classification

    Automatic classification of news articles is a relevant problem due to the large amount of news generated every day, so it is crucial that these news items are classified so that users can access information of interest quickly and effectively. On the one hand, traditional classification systems represent documents as bags of words (BoW), a representation that is oblivious to two problems of language: synonymy and polysemy. On the other hand, several authors propose the use of a bag-of-concepts (BoC) representation of documents, which tackles synonymy and polysemy. This paper shows the benefits of using a hybrid representation of documents for the classification of textual news, leveraging the advantages of both approaches: the traditional BoW representation and a BoC approach based on Wikipedia knowledge. To evaluate the proposal, we used three of the most relevant algorithms in the state of the art (SVM, Random Forest and Naïve Bayes) and two corpora: the Reuters-21578 corpus and a purpose-built corpus, Reuters-27000. The results obtained show that the performance of the classification algorithms depends on the dataset used, and also demonstrate that enriching the BoW representation with the concepts extracted from documents by the semantic annotator adds useful information to the classifier and improves its performance. The experiments conducted show performance increases of up to 4.12% when classifying the Reuters-21578 corpus with the SVM algorithm and up to 49.35% when classifying the Reuters-27000 corpus with the Random Forest algorithm.
    Atlantic Research Center for Information and Communication Technologies; Xunta de Galicia | Ref. R2014/034 (RedPlir); Xunta de Galicia | Ref. R2014/029 (TELGalicia)
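
    A toy sketch of the evaluation setup described above, comparing SVM, Random Forest and Naïve Bayes with cross-validation on a small bag-of-words matrix; the documents, labels and resulting scores are illustrative only and do not reproduce the Reuters experiments.

```python
# Compare the three classification algorithms named in the abstract on a
# toy corpus; all documents and labels are made up for illustration.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = [
    "stocks fall as markets react to inflation data",
    "the central bank raises interest rates again",
    "quarterly earnings beat analyst expectations",
    "bond yields climb after the policy announcement",
    "the home team wins the championship final",
    "star striker signs a record transfer deal",
    "coach praises defence after narrow victory",
    "injury forces captain out of the season opener",
]
labels = ["finance"] * 4 + ["sports"] * 4

# Plain bag-of-words features (the hybrid BoW+BoC matrix would be used
# in the same way once the Wikipedia concepts are added).
X = TfidfVectorizer().fit_transform(docs)

classifiers = {
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Naive Bayes": MultinomialNB(),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, labels, cv=2)
    print(f"{name}: mean accuracy = {scores.mean():.2f}")
```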

    Cross-repository aggregation of educational resources

    The proliferation of educational resource repositories has promoted the development of aggregators to facilitate interoperability, that is, a unified access point that allows users to fetch a given resource independently of its origin. The CROERA system is a repository aggregator that provides access to educational resources independently of the classification taxonomy used in the hosting repository. To that end, an automated classification algorithm is trained on information extracted from the metadata of a collection of educational resources hosted in different repositories, which in turn depends on the classification taxonomy used in each case. Every resource can then be automatically classified on demand, independently of the original classification scheme. As a consequence, resources can be retrieved using any taxonomy supported by the aggregator, regardless of the taxonomy originally applied, and exploratory searches can be made without a prior taxonomy mapping. This approach overcomes one of the recurring problems in taxonomy mapping, namely the one-to-none matching situation. To evaluate the performance of this proposal, two methods were applied. Resource classification into categories existing in all repositories was evaluated automatically, obtaining maximum performance values of 84% (F1 score), 87.8% (area under the receiver operating characteristic curve), 86% (area under the precision-recall curve) and 75.1% (Cohen's κ). For resources not belonging to one of the common categories, human inspection was used as a reference to compute classification performance; in this case, the maximum values obtained were 69.8%, 73.8%, 75% and 54.3%, respectively. These results demonstrate the potential of this approach as a tool to facilitate resource classification, for example by providing a preliminary classification that would require only minor corrections from human classifiers.
    Xunta de Galicia | Ref. R2014/034 (RedPlir); Xunta de Galicia | Ref. R2014/029 (TELGalicia)
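
    A brief sketch of how the four reported metrics could be computed for a single target category, assuming binary per-category decisions and confidence scores; the values below are toy data, not the paper's results.

```python
# Compute the four evaluation metrics mentioned in the abstract for one
# hypothetical category, using made-up ground truth and classifier scores.

from sklearn.metrics import (
    average_precision_score,
    cohen_kappa_score,
    f1_score,
    roc_auc_score,
)

# 1 = resource belongs to the category, 0 = it does not (toy data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]   # classifier confidence
y_pred = [1 if s >= 0.5 else 0 for s in y_score]      # thresholded decision

print("F1:", f1_score(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_score))
print("AUC-PR:", average_precision_score(y_true, y_score))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
```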

    Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach

    Automatic classification of text documents into a set of categories has many applications. Among them, the automatic classification of biomedical literature stands out as an important application of automatic document classification strategies. Biomedical staff and researchers have to deal with a large amount of literature in their daily activities, so a system that allows them to access documents of interest in a simple and effective way would be useful; for this, the documents have to be sorted according to some criteria, that is to say, they have to be classified. Documents to classify are usually represented following the bag-of-words (BoW) paradigm: features are the words in the text, thus suffering from synonymy and polysemy, and their weights are based solely on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, concretely Wikipedia, to create bag-of-concepts (BoC) representations of documents, understanding a concept as a “unit of meaning” and thus tackling synonymy and polysemy. In addition, the weighting of concepts is based on their semantic relevance in the text. For the evaluation of the proposal, empirical experiments were conducted with one of the corpora commonly used for evaluating the classification and retrieval of biomedical information, OHSUMED, and also with a purpose-built corpus of MEDLINE biomedical abstracts, UVigoMED. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label classification problem and up to 155% in the multi-label problem for the UVigoMED corpus.
    Xunta de Galicia | Ref. GRC2013-006; Red Gallega de Procesamiento del Lenguaje y Recuperación de Información | Ref. R2014/03
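
    A minimal sketch of multi-label classification over a bag-of-concepts representation, assuming the concepts and their relevance-based weights are already available from a Wikipedia annotator; the concept names, labels and weights are hypothetical.

```python
# Multi-label classification over a toy bag-of-concepts representation.
# The concept weights stand in for the semantic-relevance weights produced
# by the annotator; corpora and categories are invented for illustration.

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Each abstract is represented by concepts weighted by semantic relevance.
doc_concepts = [
    {"Myocardial_infarction": 0.9, "Aspirin": 0.4},
    {"Diabetes_mellitus": 0.8, "Insulin": 0.7},
    {"Myocardial_infarction": 0.6, "Diabetes_mellitus": 0.5},
]
doc_labels = [["cardiology"], ["endocrinology"], ["cardiology", "endocrinology"]]

# Turn weighted concepts into a sparse feature matrix.
X = DictVectorizer().fit_transform(doc_concepts)

# Encode the label sets as a binary indicator matrix for multi-label learning.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(doc_labels)

# One binary classifier per category (one-vs-rest).
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(mlb.inverse_transform(clf.predict(X)))
```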