Big Data and Cross-Document Coreference Resolution: Current State and Future Opportunities
Information Extraction (IE) is the task of automatically extracting
structured information from unstructured/semi-structured machine-readable
documents. Among various IE tasks, extracting actionable intelligence from
an ever-increasing amount of data depends critically upon Cross-Document
Coreference Resolution (CDCR) - the task of identifying entity mentions across
multiple documents that refer to the same underlying entity. Recently, document
datasets on the order of terabytes to petabytes have raised many challenges for
performing effective CDCR, such as scaling to large numbers of mentions and
limited representational power. The problem of analysing such datasets is
called "big data". The aim of this paper is to provide readers with an
understanding of the central concepts, subtasks, and the current
state of the art in the CDCR process. We provide an assessment of existing
tools and techniques for CDCR subtasks and highlight big data challenges in each of
them to help readers identify important and outstanding issues for further
investigation. Finally, we provide concluding remarks and discuss possible
directions for future work.
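
As a concrete, deliberately naive illustration of the CDCR task itself (not of the scalable techniques surveyed in the paper), the following Python sketch clusters entity mentions from multiple documents by a normalized surface form; the document set and normalization rule are assumptions for exposition only.

from collections import defaultdict

def normalize(mention):
    # Crude canonicalization of a mention's surface form (assumed exact-match key).
    return " ".join(mention.lower().replace(".", "").split())

def cluster_mentions(docs):
    # Group (doc_id, mention) pairs whose normalized forms coincide.
    clusters = defaultdict(list)
    for doc_id, mentions in docs.items():
        for mention in mentions:
            clusters[normalize(mention)].append((doc_id, mention))
    return dict(clusters)

docs = {
    "doc1": ["Barack Obama", "Obama"],
    "doc2": ["barack obama", "B. Obama"],
}
print(cluster_mentions(docs))
# "Barack Obama" and "barack obama" are linked across documents, while "Obama"
# and "B. Obama" fall into separate clusters, illustrating the limited
# representational power of surface-form matching noted above.
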
Doc2RDFa: Semantic Annotation for Web Documents
Ever since its conception, the amount of data published on the World Wide
Web has been growing rapidly, to the point where it has become an important
source of both general and domain-specific information. However, the majority
of documents published online are not machine-readable by default. Many researchers
believe that the answer to this problem is to semantically annotate these
documents, and thereby contribute to the linked "Web of Data". Yet, the process
of annotating web documents remains an open challenge. While some efforts towards
simplifying this process have been made in recent years, there is still a
lack of semantic content creation tools that integrate well with information worker
toolsets. Towards this end, we introduce Doc2RDFa, an HTML rich-text processor
with the ability to automatically and manually annotate domain-specific content.
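
To illustrate what RDFa annotation adds to a web document, here is a small Python sketch that wraps a recognized domain term in RDFa Lite markup (vocab, typeof, property); the example term and the schema.org mapping are assumptions for exposition, not Doc2RDFa's actual implementation.

import html

def annotate(text, term, rdf_type, prop):
    # Wrap `term` inside the escaped `text` with RDFa Lite attributes.
    marked = (f'<span typeof="{rdf_type}">'
              f'<span property="{prop}">{html.escape(term)}</span></span>')
    body = html.escape(text).replace(html.escape(term), marked)
    # `vocab` declares the default vocabulary (here schema.org) for the markup.
    return f'<p vocab="https://schema.org/">{body}</p>'

print(annotate("Acme Corp announced quarterly results.",
               term="Acme Corp", rdf_type="Organization", prop="name"))
# Produces a paragraph in which "Acme Corp" is typed as a schema.org
# Organization whose name is machine-readable to linked-data consumers.
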
Optical Font Recognition in Smartphone-Captured Images, and its Applicability for ID Forgery Detection
In this paper, we consider the problem of detecting counterfeit identity
documents in images captured with smartphones. As such documents contain
special fonts, we study the applicability of convolutional neural networks
(CNNs) for detecting whether the fonts used conform to those prescribed by
government standards. Here, we use multi-task
learning to differentiate samples by both fonts and characters and compare the
resulting classifier with its analogue trained for binary font classification.
We train neural networks for authenticity estimation of the fonts used in
machine-readable zones and ID numbers of the Russian national passport and test
them on samples of individual characters acquired from 3238 images of the
Russian national passport. Our results show that the use of multi-task
learning increases the sensitivity and specificity of the classifier. Moreover, the
resulting CNNs demonstrate high generalization ability as they correctly
classify fonts which were not present in the training set. We conclude that the
proposed method is sufficient for authentication of the fonts and can be used
as part of a forgery detection system for images acquired with a smartphone
camera.
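
To make the multi-task setup concrete, the following PyTorch sketch pairs a shared convolutional trunk with two classification heads, one over fonts and one over characters, trained with a summed cross-entropy loss; the input size, layer widths, and class counts are assumptions, since the paper's exact architecture is not reproduced here.

import torch
import torch.nn as nn

class FontCharNet(nn.Module):
    def __init__(self, n_fonts=2, n_chars=36):  # assumed label-space sizes
        super().__init__()
        # Shared trunk: glyph features used by both tasks.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        # Task-specific heads: font class and character class.
        self.font_head = nn.Linear(32 * 8 * 8, n_fonts)
        self.char_head = nn.Linear(32 * 8 * 8, n_chars)

    def forward(self, x):  # x: (batch, 1, 32, 32) grayscale character crops
        feats = self.trunk(x)
        return self.font_head(feats), self.char_head(feats)

model = FontCharNet()
x = torch.randn(4, 1, 32, 32)
font_logits, char_logits = model(x)
# Joint objective: sum of per-task cross-entropy losses.
loss = (nn.functional.cross_entropy(font_logits, torch.randint(0, 2, (4,))) +
        nn.functional.cross_entropy(char_logits, torch.randint(0, 36, (4,))))
loss.backward()
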
RNeXML: a package for reading and writing richly annotated phylogenetic, character, and trait data in R
NeXML is a powerful and extensible exchange standard recently proposed to
better meet the expanding needs for phylogenetic data and metadata sharing.
Here we present the RNeXML package, which provides users of the R programming
language with easy-to-use tools for reading and writing NeXML documents,
including rich metadata, in a way that interfaces seamlessly with the extensive
library of phylogenetic tools already available in the R ecosystem.
