72,975 research outputs found

    Big Data and Cross-Document Coreference Resolution: Current State and Future Opportunities

    Full text link
    Information Extraction (IE) is the task of automatically extracting structured information from unstructured/semi-structured machine-readable documents. Among various IE tasks, extracting actionable intelligence from ever-increasing amount of data depends critically upon Cross-Document Coreference Resolution (CDCR) - the task of identifying entity mentions across multiple documents that refer to the same underlying entity. Recently, document datasets of the order of peta-/tera-bytes has raised many challenges for performing effective CDCR such as scaling to large numbers of mentions and limited representational power. The problem of analysing such datasets is called "big data". The aim of this paper is to provide readers with an understanding of the central concepts, subtasks, and the current state-of-the-art in CDCR process. We provide assessment of existing tools/techniques for CDCR subtasks and highlight big data challenges in each of them to help readers identify important and outstanding issues for further investigation. Finally, we provide concluding remarks and discuss possible directions for future work

    Doc2RDFa: Semantic Annotation for Web Documents

    Get PDF
    Ever since its conception, the amount of data published on the worldwide web has been rapidly growing to the point where it has become an important source of both general and domain specific information. However, the majority of documents published online are not machine readable by default. Many researchers believe that the answer to this problem is to semantically annotate these documents, and thereby contribute to the linked "Web of Data". Yet, the process of annotating web documents remains an open challenge. While some efforts towards simplifying this process have been made in the recent years, there is still a lack of semantic content creation tools that integrate well with information worker toolsets. Towards this end, we introduce Doc2RDFa, an HTML rich text processor with the ability to automatically and manually annotate domain-specific Content

    Optical Font Recognition in Smartphone-Captured Images, and its Applicability for ID Forgery Detection

    Full text link
    In this paper, we consider the problem of detecting counterfeit identity documents in images captured with smartphones. As the number of documents contain special fonts, we study the applicability of convolutional neural networks (CNNs) for detection of the conformance of the fonts used with the ones, corresponding to the government standards. Here, we use multi-task learning to differentiate samples by both fonts and characters and compare the resulting classifier with its analogue trained for binary font classification. We train neural networks for authenticity estimation of the fonts used in machine-readable zones and ID numbers of the Russian national passport and test them on samples of individual characters acquired from 3238 images of the Russian national passport. Our results show that the usage of multi-task learning increases sensitivity and specificity of the classifier. Moreover, the resulting CNNs demonstrate high generalization ability as they correctly classify fonts which were not present in the training set. We conclude that the proposed method is sufficient for authentication of the fonts and can be used as a part of the forgery detection system for images acquired with a smartphone camera

    RNeXML: a package for reading and writing richly annotated phylogenetic, character, and trait data in R

    Full text link
    NeXML is a powerful and extensible exchange standard recently proposed to better meet the expanding needs for phylogenetic data and metadata sharing. Here we present the RNeXML package, which provides users of the R programming language with easy-to-use tools for reading and writing NeXML documents, including rich metadata, in a way that interfaces seamlessly with the extensive library of phylogenetic tools already available in the R ecosystem
    corecore