4 research outputs found

    Clustering and Bootstrapping Based Framework for News Knowledge Base Completion

    Extracting facts, namely entities and relations, from unstructured sources is an essential step in any knowledge base construction. It is also necessary to keep the knowledge base complete by incrementally extracting new facts from various sources. To date, knowledge base completion has been studied as a knowledge refinement problem, in which missing facts are inferred by reasoning over the information already present in the knowledge base. However, facts missed while extracting information from multilingual sources are ignored. Hence, this work proposes a generic framework for knowledge base completion that enriches a knowledge base of crime-related facts extracted from English-language online news articles with facts extracted from news articles in Hindi, a low-resourced Indian language. Using the framework, information can be extracted from news articles in any low-resourced language with an appropriate machine translation tool and without language-specific tools such as POS taggers. To achieve this, a clustering algorithm is proposed that exploits the redundancy in the bilingual collection of news articles by representing clusters with knowledge base facts rather than the existing Bag of Words representation. Within each cluster, the facts extracted from English-language articles are used to bootstrap the extraction of facts from comparable Hindi-language articles. Bootstrapping within the cluster helps to identify sentences in the low-resourced language that carry new information related to the facts extracted from a high-resourced language such as English. The empirical results show that the proposed clustering algorithm produces more accurate and high-quality clusters for monolingual and cross-lingual facts, respectively. Experiments also show that the proposed framework achieves a high recall rate in extracting new facts from Hindi news articles.
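    To make the idea of representing clusters by knowledge base facts rather than a bag of words more concrete, here is a minimal sketch. It is not the algorithm from the paper: the Jaccard measure, the greedy assignment, the 0.5 threshold, and all names and sample facts are assumptions made for illustration. Hindi facts are assumed to have already been machine-translated into English triples.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# A fact is a (subject, relation, object) triple. Hypothetical format.
Fact = Tuple[str, str, str]


def jaccard(a: Set[Fact], b: Set[Fact]) -> float:
    """Overlap between two fact sets, in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_facts(
    articles: Dict[str, FrozenSet[Fact]], threshold: float = 0.5
) -> List[List[str]]:
    """Greedily assign each article to the cluster whose fact set it
    overlaps most, or start a new cluster if no overlap reaches the
    threshold. Each cluster is represented by the union of its facts."""
    clusters: List[dict] = []
    for art_id, facts in articles.items():
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(set(facts), cluster["facts"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"facts": set(facts), "ids": [art_id]})
        else:
            best["facts"] |= facts
            best["ids"].append(art_id)
    return [c["ids"] for c in clusters]


# Illustrative data only; "hi1" stands for a Hindi article whose facts
# were translated into English before comparison.
articles = {
    "en1": frozenset({("police", "arrested", "suspect"),
                      ("robbery", "location", "Delhi")}),
    "hi1": frozenset({("police", "arrested", "suspect"),
                      ("robbery", "location", "Delhi"),
                      ("suspect", "age", "25")}),
    "en2": frozenset({("court", "sentenced", "accused")}),
}

clusters = cluster_by_facts(articles)
# Facts present in the Hindi article but missing from its English peer.
new_facts = articles["hi1"] - articles["en1"]
```

    In this toy setup, the facts that appear only in the Hindi article of a cluster are the candidates for enriching the knowledge base, which mirrors the bootstrapping idea described in the abstract in a very simplified form.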

    State of the art document clustering algorithms based on semantic similarity

    The continued success of the Internet has hugely increased the number of text documents in electronic form, and techniques for grouping these documents into meaningful clusters are becoming critical. Traditional clustering methods were based on statistical features, and clustering was done on syntactic rather than semantic grounds. As a result, dissimilar data was gathered in the same group because of polysemy and synonymy problems. An important solution to this issue is document clustering based on semantic similarity, in which documents are grouped according to meaning rather than keywords. In this research, eighty papers that use semantic similarity in different fields have been reviewed; forty of them, published between 2014 and 2020 and applying semantic similarity to document clustering, have been selected for deeper study. A comprehensive literature review of all the selected papers is given, together with a detailed comparison of their clustering algorithms, tools, and evaluation methods. This helps in the implementation and evaluation of document clustering. The reviewed research is used in the same direction when preparing the proposed research. Finally, an intensive discussion comparing the works is presented, and the results of our research are shown in figures.
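    The synonymy problem mentioned in this abstract can be illustrated with a small sketch. It compares a plain keyword vector with a vector built after mapping words to shared concepts. The `CONCEPTS` dictionary is a hand-made, hypothetical stand-in for a real semantic resource, and none of this is taken from the surveyed papers.

```python
import math
from collections import Counter

# Hypothetical word-to-concept map standing in for a semantic resource.
CONCEPTS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "buy": "purchase",
    "purchase": "purchase",
}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_vector(text: str) -> Counter:
    """Bag of surface words."""
    return Counter(text.lower().split())


def concept_vector(text: str) -> Counter:
    """Bag of concepts: each word is replaced by its concept if known."""
    return Counter(CONCEPTS.get(tok, tok) for tok in text.lower().split())


doc_a = "buy car"
doc_b = "purchase automobile"

keyword_sim = cosine(keyword_vector(doc_a), keyword_vector(doc_b))
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))
```

    The two documents share no surface words, so the keyword similarity is zero, while the concept-level similarity is close to one. A keyword-based clusterer would separate them; a semantic one would group them.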

    Ontology Based Document Clustering Using MapReduce

    No full text