2,850 research outputs found

    Semantically Aware Text Categorisation for Metadata Annotation

    In this paper we present a system aimed at solving a longstanding and challenging problem: acquiring a classifier that automatically annotates bibliographic records, starting from a huge set of unbalanced and unlabelled data. We describe the main features of the dataset, the learning algorithm adopted, and how it was used to discriminate philosophical documents from documents of other disciplines. One strength of our approach lies in the novel combination of a standard learning approach with a semantic one: the results of the acquired classifier are improved by accessing a semantic network containing conceptual information. We describe the experiments by explaining the rationale behind the construction of the training and test sets, report and discuss the results, and conclude by outlining future work.

    Familiar Categories and Documentary Forms: Readers’ Perspectives

    This paper presents an evaluation of the ways in which three different groups of readers (recordkeepers, teachers and secondary school students) categorise documents. This is used to show how they understand documents, documentary forms and genre. Drawing on a card sorting activity conducted with a set of cards representing documents related to The Hobbit by J.R.R. Tolkien, the paper discusses the significance of familiar categories as cultural markers (closely linked to particular rhetorical genres). It considers the impact of domain knowledge on the process of sorting and naming categories, and compares the approaches taken by participants with those of library catalogues. It finds that there is no single, consistent approach to categorising the cards: literary genre, rhetorical genre, reason for use, format, accessibility, and form all affected the final categories each participant developed.

    Identifying and Extracting Named Entities from Wikipedia Database Using Entity Infoboxes

    This paper describes an approach to named entity classification based on Wikipedia article infoboxes. It identifies the three fundamental named entity types, namely Person, Location and Organization. An entity is classified by matching the attributes extracted from its article infobox against core entity attributes built from Wikipedia Infobox Templates. Experimental results showed that the classifier can achieve accuracy and F-measure scores of 97%. Based on this approach, a database of around 1.6 million named entities of these three types was created from the 20140203 Wikipedia dump. Experiments on the CoNLL2003 shared task named entity recognition (NER) dataset showed that the system performs strongly in comparison to three different state-of-the-art systems.
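    The attribute-matching step described in this abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the attribute names in `CORE_ATTRIBUTES`, the function `classify_entity`, and the `min_overlap` parameter are all hypothetical, and the paper builds its core attribute sets from Wikipedia Infobox Templates rather than by hand.

```python
from typing import Dict, Optional, Set

# Hypothetical core attribute sets for the three entity types.
# The paper derives these from Wikipedia Infobox Templates; the
# values below are invented purely for illustration.
CORE_ATTRIBUTES: Dict[str, Set[str]] = {
    "Person": {"birth_date", "birth_place", "occupation", "spouse"},
    "Location": {"coordinates", "population_total", "area_km2", "elevation_m"},
    "Organization": {"founded", "headquarters", "industry", "key_people"},
}


def classify_entity(infobox_attributes: Set[str], min_overlap: int = 1) -> Optional[str]:
    """Return the entity type whose core attributes overlap most with the
    attributes found in an article's infobox, or None if no type reaches
    `min_overlap` shared attributes (an assumed cut-off)."""
    best_type: Optional[str] = None
    best_score = 0
    for entity_type, core in CORE_ATTRIBUTES.items():
        score = len(infobox_attributes & core)
        # Strict '>' means ties keep the type listed first in CORE_ATTRIBUTES.
        if score > best_score:
            best_type, best_score = entity_type, score
    return best_type if best_score >= min_overlap else None
```

    Under these assumptions, an infobox containing `birth_date` and `occupation` would be labelled Person, while an infobox sharing no attribute with any core set would be left unclassified.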

    Navigating multilingual news collections using automatically extracted information

    We present a text analysis tool set that allows analysts in various fields to sift through large collections of multilingual news items quickly and to find information that is relevant to them. For a given document collection, the tool set automatically clusters the texts into groups of similar articles, extracts names of places, people and organisations, lists the user-defined specialist terms found, links clusters and entities, and generates hyperlinks. Through its daily news analysis, which operates on thousands of articles per day, the tool also learns relationships between people and other entities. The fully functional prototype system allows users to explore and navigate multilingual document collections across languages and time. Comment: This paper describes the main functionality of the JRC's fully automatic news analysis system NewsExplorer, which is currently freely accessible in thirteen languages at http://press.jrc.it/NewsExplorer/ . 8 page

    Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia

    In this paper we propose a new methodology that exploits Wikipedia features and structure to automatically develop an Arabic NE annotated corpus. Each Wikipedia link is transformed into the NE type of the target article in order to produce the NE annotation. Other Wikipedia features - namely redirects, anchor texts, and inter-language links - are used to tag additional NEs, which appear without links in Wikipedia texts. Furthermore, we have developed a filtering algorithm to eliminate ambiguity when tagging candidate NEs. We also introduce a mechanism based on the high coverage of Wikipedia in order to address two challenges particular to tagging NEs in Arabic text: rich morphology and the absence of capitalisation. The corpus created with our new method (WDC) has been used to train an NE tagger, which has been tested on different domains. Judging by the results, an NE tagger trained on WDC can compete with those trained on manually annotated corpora.

    Mining named entities from search engine query logs

    We present a seed-expansion-based approach to classifying named entities in web search queries. Previous approaches to this classification problem relied on contextual clues in the form of keywords surrounding a named entity in the query. Here we propose an alternative approach in the form of a Bag-of-Context-Words (BoCW), which represents the context words as they appear in the snippets of the top search results for the query. This is particularly useful when the query consists of only the named entity without any context words, since previous approaches discover no context in that case. To construct the BoCW, we employ a novel algorithm that iteratively expands a Class Vector by gradually aggregating the BoCWs of similar named entities appearing in other queries. We provide comprehensive experimental evidence using a commercial query log showing that our approach is competitive with existing approaches.
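    The iterative Class Vector expansion mentioned in this abstract might look something like the sketch below. It is an assumption-laden illustration rather than the authors' algorithm: the use of cosine similarity, the `threshold` and `max_rounds` parameters, and the names `cosine` and `expand_class` are all hypothetical choices, since the abstract does not specify how similarity is measured or when expansion stops.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Set


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of context words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_class(
    bocws: Dict[str, Counter],
    seeds: Iterable[str],
    threshold: float = 0.5,   # assumed similarity cut-off
    max_rounds: int = 10,     # assumed safety limit on iterations
) -> Set[str]:
    """Grow a class from seed entities by repeatedly adding entities whose
    BoCW is similar to the current class vector, then folding their BoCWs
    into that vector."""
    members = set(seeds)
    class_vector: Counter = Counter()
    for seed in members:
        class_vector.update(bocws[seed])

    for _ in range(max_rounds):
        # Candidates are scored against the class vector as it stood at the
        # start of the round, so the result does not depend on dict order.
        new_members = {
            entity
            for entity, vector in bocws.items()
            if entity not in members and cosine(class_vector, vector) >= threshold
        }
        if not new_members:
            break
        for entity in new_members:
            class_vector.update(bocws[entity])
        members |= new_members
    return members
```

    The point of the sketch is the gradual aggregation: an entity that is too dissimilar to the seeds alone can still join the class in a later round, once the class vector has absorbed the context words of intermediate entities.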

    Foreword

    The aim of this Workshop is to focus on building and evaluating resources used to facilitate biomedical text mining, including their design, update, delivery, quality assessment, evaluation and dissemination. Key resources of interest are lexical and knowledge repositories (controlled vocabularies, terminologies, thesauri, ontologies) and annotated corpora, including both task-specific resources and repositories reengineered from biomedical or general language resources. Of particular interest is the process of building annotated resources, including designing guidelines and annotation schemas (aiming at both syntactic and semantic interoperability) and relying on language engineering standards. Challenging aspects are the updating and evolution management of resources, as well as their documentation, dissemination and evaluation.

    Combining Minimally-supervised Methods for Arabic Named Entity Recognition.

    Supervised methods can achieve high performance on NLP tasks, such as Named Entity Recognition (NER), but new annotations are required for every new domain and/or genre change. This has motivated research in minimally supervised methods such as semi-supervised learning and distant learning, but neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-supervised methods tend to have very high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. This complementarity suggests that better results may be obtained by combining the two types of minimally supervised methods. In this paper we present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques. We trained one NER classifier using semi-supervised learning and another using distant learning, and then combined them using a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure recently proposed for sentiment analysis. According to our results, the BCC model leads to an increase in performance of 8 percentage points over the best base classifiers.