5 research outputs found

    Identifying and Extracting Named Entities from Wikipedia Database Using Entity Infoboxes

    An approach to named entity classification based on Wikipedia article infoboxes is described in this paper. It identifies the three fundamental named entity types: Person, Location, and Organization. An entity is classified by matching the attributes extracted from its article's infobox against core entity attributes built from Wikipedia infobox templates. Experimental results showed that the classifier achieves high accuracy and F-measure scores of 97%. Based on this approach, a database of around 1.6 million entities of these three types was created from the 2014-02-03 Wikipedia dump. Experiments on the CoNLL-2003 shared-task named entity recognition (NER) dataset showed the system's outstanding performance in comparison to three different state-of-the-art systems.
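    The classification step described above — matching an article's infobox attributes against per-type core attributes — can be sketched as follows. This is a minimal illustration, not the paper's actual method: the core attribute sets and the overlap-based decision rule are assumptions.

    ```python
    # Hypothetical core attribute sets per entity type; the paper derives these
    # from Wikipedia Infobox Templates, the sets below are illustrative only.
    CORE_ATTRIBUTES = {
        "Person": {"birth_date", "birth_place", "occupation", "nationality"},
        "Location": {"coordinates", "population", "area", "country"},
        "Organization": {"founded", "headquarters", "industry", "num_employees"},
    }

    def classify_entity(infobox_attributes):
        """Return the entity type whose core attributes best overlap the infobox."""
        scores = {
            etype: len(core & set(infobox_attributes))
            for etype, core in CORE_ATTRIBUTES.items()
        }
        best_type, best_score = max(scores.items(), key=lambda kv: kv[1])
        return best_type if best_score > 0 else None

    print(classify_entity({"birth_date", "occupation", "spouse"}))  # -> Person
    ```

    A real implementation would also handle attribute-name normalisation and ties between types; the sketch only shows the core matching idea.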

    Effectively Grouping Named Entities From Click-Through Data Into Clusters Of Generated Keywords

    Many studies show that named entities are closely related to users' search behaviours, which has brought increasing interest in studying named entities in search logs. This paper addresses the problem of forming fine-grained semantic clusters of named entities within a broad domain such as “company”, and of generating keywords for each cluster that help users interpret the semantic information embedded in the cluster. Using contexts, URLs, and session IDs as features of named entities, the three-phase approach proposed in this paper first disambiguates named entities according to these features. It then weights the features with a novel measure, calculates the semantic similarity between named entities in the weighted feature space, and clusters the entities accordingly. Finally, keywords for the clusters are generated using a text-oriented graph-ranking algorithm. Each phase of the proposed approach solves problems not addressed in existing work, and experimental results obtained from real click-through data demonstrate the effectiveness of the approach.
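    The second phase — computing semantic similarity between entities in a weighted feature space — is commonly done with cosine similarity over sparse feature vectors. The sketch below assumes that representation; the feature names, weights, and example entities are invented for illustration, and the paper's "novel measurement" for weighting is not reproduced here.

    ```python
    import math

    def cosine_similarity(vec_a, vec_b):
        """Cosine similarity between two sparse feature vectors (dict: feature -> weight)."""
        shared = set(vec_a) & set(vec_b)
        dot = sum(vec_a[f] * vec_b[f] for f in shared)
        norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
        norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    # Hypothetical weighted features (context terms, URLs, session IDs) per entity.
    microsoft = {"ctx:software": 2.0, "url:msdn.com": 1.5, "session:42": 0.5}
    google = {"ctx:software": 1.8, "url:google.com": 1.5, "session:42": 0.5}
    shell_oil = {"ctx:petrol": 2.5, "url:shell.com": 1.0}

    # Entities sharing contexts and sessions score higher, so a threshold- or
    # graph-based clustering step can group them into the same semantic cluster.
    print(cosine_similarity(microsoft, google) > cosine_similarity(microsoft, shell_oil))  # True
    ```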

    Semantic Dissimilarity Metric Based on Wikipedia

    Despite the vast amount of information now available, it is not always easy to obtain the knowledge one seeks, owing to the difficulty of cataloguing information. Current “knowledge discovery” systems focus on searching for identical words, an approach with several limitations, among them a lack of interpretive capability. Understanding the semantic meaning of a set of expressions is a human trait that is difficult to replicate in computational systems. The main objective of this work is the creation of a system for computing semantic similarity between abstract classes, built on top of a knowledge ontology. To achieve this goal, we first identified and analysed the need for a machine to simulate, or improve upon, human judgement in semantic interpretation. After defining and framing the problem within its field, we built a system capable of computing a similarity measure between entities, taking into account the importance of performance in this type of system.

    Automatic text summarisation using linguistic knowledge-based semantics

    Text summarisation is the task of reducing a text document to a short substitute summary. Since the field's inception, almost all summarisation research to date has involved identifying and extracting the most important document or cluster segments, an approach called extraction. This typically involves scoring each document sentence with a composite scoring function of surface-level and semantic features. Enabling machines to analyse text features and understand their meaning potentially requires both semantic analysis of the text and equipping computers with external semantic knowledge. This thesis addresses extractive text summarisation by proposing a number of semantic and knowledge-based approaches. The work combines the high-quality semantic information in WordNet, the crowdsourced encyclopaedic knowledge in Wikipedia, and the manually crafted categorial variations in CatVar to improve summary quality. These improvements are achieved through sentence-level morphological analysis and the incorporation of Wikipedia-based named-entity semantic relatedness within heuristic algorithms. The study also investigates how sentence-level semantic analysis based on semantic role labelling (SRL), leveraged with background world knowledge, influences sentence textual similarity and text summarisation. The proposed sentence-similarity and summarisation methods were evaluated on standard publicly available datasets, including the Microsoft Research Paraphrase Corpus (MSRPC), the TREC-9 Question Variants, and the Document Understanding Conference corpora (DUC 2002, DUC 2005, DUC 2006). The project uses Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for the quantitative assessment of the proposed summarisers' performance. Results showed the systems' effectiveness compared with related state-of-the-art summarisation methods and baselines. Of the proposed summarisers, the SRL Wikipedia-based system demonstrated the best performance.
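    The ROUGE evaluation mentioned above scores a candidate summary by its n-gram overlap with a reference summary. A simplified ROUGE-N recall can be sketched as follows; this is a bare illustration of the metric's core idea, not the official ROUGE toolkit (which additionally handles stemming, stopword options, and multiple references).

    ```python
    from collections import Counter

    def rouge_n_recall(candidate, reference, n=1):
        """Simplified ROUGE-N recall: clipped n-gram overlap / reference n-gram count."""
        def ngrams(tokens, n):
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        cand = ngrams(candidate.lower().split(), n)
        ref = ngrams(reference.lower().split(), n)
        if not ref:
            return 0.0
        overlap = sum(min(cand[g], ref[g]) for g in ref)
        return overlap / sum(ref.values())

    # 5 of the 6 reference unigrams appear in the candidate ("lay" does not).
    print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))  # -> 0.833...
    ```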

    Priority Area Research “Japanese Corpus”: Proceedings of the FY2009 Open Workshop (Research Results Presentation Meeting)

    Priority Area Research “Japanese Corpus” FY2009 Open Workshop, National Institute for Japanese Language and Linguistics, 15-16 March 2010; overall summary of the Priority Area Research “Japanese Corpus” project