26 research outputs found

    Classifying the Wikipedia articles into the OpenCyc taxonomy

    This article presents a method for classifying Wikipedia articles into the OpenCyc taxonomy. The method draws on several sources of classification information: the Wikipedia category system, the infoboxes attached to the articles, the first sentences of the articles (treated as their definitions), and the direct mapping between the articles and Cyc symbols. The classification decisions made by these methods are reconciled using Cyc's built-in inconsistency detection mechanism. The combination of the best classification methods yields 1.47 million classified articles with a manually verified precision above 97%, while the combination of all of them yields 2.2 million articles with an estimated precision of 93%.
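    The reconciliation step described above can be sketched as follows. This is a hypothetical illustration, not the paper's code: several sources vote for Cyc types, and a better-supported type suppresses any conflicting type, standing in for Cyc's disjointness-based inconsistency detection. The disjointness pairs and type names are invented.

    ```python
    # Disjoint type pairs standing in for Cyc's disjointWith assertions.
    DISJOINT = {frozenset({"Person", "Organization"}),
                frozenset({"Person", "Place"})}

    def combine_votes(votes):
        """votes: list of (source, cyc_type) pairs. Accept types in order
        of support, skipping any type disjoint with an accepted one."""
        support = {}
        for _, t in votes:
            support[t] = support.get(t, 0) + 1
        accepted = []
        for t in sorted(support, key=support.get, reverse=True):
            if all(frozenset({t, a}) not in DISJOINT for a in accepted):
                accepted.append(t)
        return accepted

    votes = [("category", "Person"), ("infobox", "Person"),
             ("definition", "Organization")]
    print(combine_votes(votes))  # ['Person'] — the conflicting minority vote is dropped
    ```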

    Automatic mapping of Wikipedia categories into OpenCyc types

    The aim of the research presented in this article is a mapping between English Wikipedia categories and OpenCyc types. The mapping algorithm is heuristic and takes into account structural similarities between the categories and the corresponding types. The achieved mapping precision ranges from 82% to 92% (depending on the evaluation scheme), and recall from 67% to 76%. The results of the algorithm and its code are available at http://cycloped.i
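    The idea of exploiting structural similarity can be illustrated with a toy scoring function. This sketch is an assumption about the general approach, not the paper's algorithm: a category is scored against a candidate type by name match, with a bonus when a parent category also maps to the type or one of its supertypes. All taxonomy data below is invented.

    ```python
    # Toy fragments of the two hierarchies (illustrative only).
    CATEGORY_PARENTS = {"Polish writers": ["Writers"], "Writers": []}
    TYPE_PARENTS = {"Writer": ["Person"], "Person": []}
    NAME_MATCH = {"Polish writers": ["Writer"], "Writers": ["Writer"]}

    def structural_score(category, cyc_type):
        # Base score from a direct name-based match.
        score = 1.0 if cyc_type in NAME_MATCH.get(category, []) else 0.0
        # Bonus when a parent category maps to the type or a supertype,
        # i.e. the two hierarchies agree structurally.
        for parent in CATEGORY_PARENTS.get(category, []):
            parent_types = set(NAME_MATCH.get(parent, []))
            if parent_types & ({cyc_type} | set(TYPE_PARENTS.get(cyc_type, []))):
                score += 0.5
        return score

    print(structural_score("Polish writers", "Writer"))  # 1.5
    ```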

    The importance of cross-lingual information for matching Wikipedia with the Cyc ontology

    In this paper we try to answer the question of how cross-lingual evidence may improve matching between different classification schemas. We concentrate specifically on the task of mapping between Wikipedia categories and Cyc terms, as well as the classification of Wikipedia articles into the Cyc taxonomy, and show how this process may be improved by consuming the evidence available in different editions of Wikipedia. The results show that the performance of the mapping procedure may be improved by 0.6 to 4.9 percentage points, depending on the number of external Wikipedia editions and the given task.
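    One simple way to consume such cross-lingual evidence, sketched here as an assumption about the general scheme rather than the paper's procedure, is majority voting: each interlinked language edition proposes a Cyc term for a category, and the most frequent proposal wins.

    ```python
    from collections import Counter

    def cross_lingual_vote(votes_by_edition):
        """votes_by_edition: dict mapping a language edition code to the
        Cyc term it proposes for the interlinked category (or None)."""
        counts = Counter(t for t in votes_by_edition.values() if t)
        if not counts:
            return None
        # Return the term proposed by the most editions.
        return counts.most_common(1)[0][0]

    votes = {"en": "City", "de": "City", "pl": "Settlement", "fr": None}
    print(cross_lingual_vote(votes))  # City
    ```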

    Meta-User2Vec model for addressing the user and item cold-start problem in recommender systems

    The cold-start scenario is a critical problem for recommender systems, especially in dynamically changing domains such as online news services. In this research, we aim to address the cold-start situation by adapting an unsupervised neural User2Vec method to represent new users and articles in a multidimensional space. Toward this goal, we propose an extension of the Doc2Vec model that is capable of representing users with unknown history by building embeddings of their metadata labels along with item representations. We evaluate our proposed approach with respect to different parameter configurations on three real-world recommendation datasets with different characteristics. Our results show that this approach may be applied as an efficient alternative to the factorization machine-based method when user and item metadata are used, and hence can be applied in the cold-start scenario for both new users and new items. Additionally, as our solution represents the user and item labels in the same vector space, we can analyze the spatial relations among these labels to reveal latent interest features of the audience groups, as well as possible data biases and disparities.
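    The cold-start representation idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Meta-User2Vec implementation: a user with no reading history is represented by averaging the (jointly trained, Doc2Vec-style) embeddings of its metadata labels, and items are ranked by cosine similarity. The label names and toy vectors below are invented.

    ```python
    import math

    # Toy embeddings standing in for jointly trained label/item vectors.
    LABEL_VECS = {"age:25-34": [1.0, 0.0], "city:Krakow": [0.0, 1.0]}
    ITEM_VECS = {"article_a": [0.9, 0.1], "article_b": [0.1, 0.9]}

    def user_vector(labels):
        """Represent a history-less user as the mean of its label vectors."""
        dims = len(next(iter(LABEL_VECS.values())))
        v = [0.0] * dims
        for lab in labels:
            for i, x in enumerate(LABEL_VECS[lab]):
                v[i] += x
        return [x / len(labels) for x in v]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    def rank_items(labels):
        u = user_vector(labels)
        return sorted(ITEM_VECS, key=lambda it: cosine(u, ITEM_VECS[it]),
                      reverse=True)

    print(rank_items(["age:25-34"]))  # article_a ranked first
    ```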

    Improving the Wikipedia Miner word sense disambiguation algorithm

    This document describes improvements to the Wikipedia Miner word sense disambiguation algorithm. The original algorithm performs very well in detecting key terms in documents and disambiguating them against Wikipedia articles. By replacing the original measure, inspired by the Normalized Google Distance, with one based on the Jaccard coefficient, and by taking additional features into account, the disambiguation algorithm was improved by 8 percentage points (F1-measure), without degrading its runtime performance or introducing any additional preprocessing overhead. This document also presents some statistical data extracted from the Polish Wikipedia by Wikipedia Miner. An automatic evaluation of the performance of the disambiguation algorithm for Polish shows that it is almost as good as for English, even though the Polish Wikipedia has only a quarter of the number of articles of the English Wikipedia.
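    The relatedness swap described above can be illustrated concretely. The sketch below, with invented link sets, computes the Jaccard coefficient over the sets of pages linking to two Wikipedia articles and picks the candidate sense most related to an unambiguous context page; it is a simplification of how such measures are typically used, not the improved algorithm itself.

    ```python
    # Toy inlink sets (in reality, drawn from the Wikipedia link graph).
    INLINKS = {
        "Java (programming language)": {"Python", "Compiler", "JVM"},
        "Java (island)": {"Indonesia", "Volcano"},
        "Compiler": {"Python", "JVM", "Parser"},
    }

    def jaccard(a, b):
        """Jaccard coefficient of the inlink sets of two articles."""
        return len(INLINKS[a] & INLINKS[b]) / len(INLINKS[a] | INLINKS[b])

    def disambiguate(candidates, context_page):
        # Choose the sense whose inlinks overlap most with the context.
        return max(candidates, key=lambda c: jaccard(c, context_page))

    senses = ["Java (programming language)", "Java (island)"]
    print(disambiguate(senses, "Compiler"))  # the programming-language sense
    ```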

    Knowledge-based named entity recognition in Polish

    This document describes an algorithm for recognizing named entities in Polish text, powered by two knowledge sources: the Polish Wikipedia and the Cyc ontology. Besides providing coarse types for the recognized entities, the algorithm links them to their Wikipedia pages and assigns precise semantic types taken from Cyc. The algorithm is verified against manually identified named entities in the one-million-word subcorpus of the National Corpus of Polish.
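    A minimal sketch of the knowledge-based lookup pattern, assuming a Wikipedia-derived lexicon with a semantic type per entry (the entries and types below are invented, standing in for Wikipedia pages and Cyc types); the actual algorithm is more involved:

    ```python
    # Lexicon entries: surface form -> (linked Wikipedia page, semantic type).
    LEXICON = {
        "Adam Mickiewicz": ("Adam Mickiewicz", "Poet"),
        "Kraków": ("Kraków", "City"),
    }

    def recognize(tokens, max_len=3):
        """Greedy longest-match lookup of token spans against the lexicon."""
        entities, i = [], 0
        while i < len(tokens):
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                span = " ".join(tokens[i:i + n])
                if span in LEXICON:
                    page, sem_type = LEXICON[span]
                    entities.append((span, page, sem_type))
                    i += n
                    break
            else:
                i += 1  # no entity starts here
        return entities

    toks = "Adam Mickiewicz lived in Kraków".split()
    print(recognize(toks))
    ```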

    ROD: Ruby Object Database

    ROD (Ruby Object Database) is an open-source object database designed for storing and accessing data that rarely changes. The primary reason for creating it was the need for a storage facility for the dictionaries and corpora used in natural language processing. The database is optimized for read speed and ease of use.
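    The write-once, read-many access pattern ROD targets can be sketched as follows (in Python rather than Ruby, and purely as an illustrative assumption about the pattern, not ROD's design): data is serialized once at build time and then served from an immutable in-memory structure.

    ```python
    import json
    import os
    import tempfile

    class ReadOnlyStore:
        """Loads a serialized record set once; offers read-only access."""
        def __init__(self, path):
            with open(path, encoding="utf-8") as f:
                self._records = json.load(f)  # loaded once, never mutated

        def get(self, key):
            return self._records[key]

    # Build phase: write rarely-changing data (e.g. a dictionary) once.
    path = os.path.join(tempfile.mkdtemp(), "store.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"kot": "cat", "pies": "dog"}, f)

    # Read phase: many cheap lookups against the immutable store.
    store = ReadOnlyStore(path)
    print(store.get("kot"))  # cat
    ```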

    Extraction of "part-whole" relations from Polish texts based on Wikipedia and Cyc


    An ontology-based method for an efficient acquisition of relation extraction training and testing examples

    In this paper, we describe an ontology-based method for selecting test examples for relation extraction, as well as a method for validating them that can be carried out by ordinary speakers of the language. The results will be used to validate the performance of various relation extraction algorithms. In the performed tests, we utilize the ResearchCyc ontology and demonstrate the method's performance in gathering examples from Polish texts.
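    A hedged sketch of ontology-guided example selection, under the assumption that candidate sentences are kept only when their entity pair matches the type signature of the target relation; the relation, types, and sentences below are invented, standing in for ResearchCyc types:

    ```python
    # Invented ontology types and a relation type signature.
    ENTITY_TYPES = {"Warsaw": "City", "Poland": "Country", "Vistula": "River"}
    RELATION_SIG = {"capital-of": ("City", "Country")}

    def select_examples(sentences, relation):
        """sentences: list of (sentence, entity1, entity2). Keep those
        whose entity types match the relation's argument signature."""
        arg1_t, arg2_t = RELATION_SIG[relation]
        selected = []
        for sent, e1, e2 in sentences:
            if ENTITY_TYPES.get(e1) == arg1_t and ENTITY_TYPES.get(e2) == arg2_t:
                selected.append(sent)
        return selected

    sents = [("Warsaw is the capital of Poland.", "Warsaw", "Poland"),
             ("The Vistula flows through Poland.", "Vistula", "Poland")]
    print(select_examples(sents, "capital-of"))
    ```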