Privacy in text documents
Sensitive data preservation is currently a manual and semi-automatic procedure. It suffers from various problems that affect the handling of confidential, sensitive and personal information: identifying sensitive data in documents requires human intervention that is costly and error-prone, and the scale of large document collections rules out an approach that depends on human expertise to identify and relate sensitive data. DataSense will be highly exportable software that enables organizations to identify and understand the sensitive data they hold in unstructured textual information (digital documents) in order to meet legal, compliance and security requirements. The goal is to identify and classify sensitive data (Personal Data) present in large-scale structured and non-structured information in a way that allows entities and/or organizations to understand it without compromising security or confidentiality. The DataSense project will be based on European-Portuguese text documents and will apply different NLP (Natural Language Processing) approaches and advances in machine learning, such as Named Entity Recognition, Disambiguation, Co-referencing (ARE) and Automatic Learning and Human Feedback. It will also be characterized by the ability to assist organizations in complying with standards such as the GDPR (General Data Protection Regulation), which regulates data protection in the European Union.
Neural Coreference Resolution for Turkish
Coreference resolution deals with resolving mentions of the same underlying entity in a given text. This challenging task is an indispensable aspect of text understanding and has important applications in various language processing systems such as question answering and machine translation. Although a significant number of studies are devoted to coreference resolution, research on Turkish is scarce and mostly limited to pronoun resolution. To the best of our knowledge, this article presents the first neural Turkish coreference resolution study, in which two learning-based models are explored. Both models follow the mention-ranking approach while forming clusters of mentions. The first model uses a set of hand-crafted features, whereas the second relies on embeddings learned from large-scale pre-trained language models to capture similarities between a mention and its candidate antecedents. Several language models trained specifically for Turkish are used to obtain mention representations, and their effectiveness is compared experimentally using automatic metrics. We argue that the results of this study shed light on the possible contributions of neural architectures to Turkish coreference resolution.
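As a rough illustration of the mention-ranking idea described in this abstract, the sketch below links each mention to its highest-scoring earlier mention, or starts a new cluster when no candidate clears a threshold. It is not the authors' model: the cosine scorer, the `threshold` value, the function names and the toy two-dimensional embeddings are all assumptions made for the example, standing in for a learned scoring function over language-model embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_and_cluster(embeddings, threshold=0.8):
    """Assign a cluster id to each mention, in document order.

    For mention i, every earlier mention j < i is a candidate antecedent.
    The best-scoring candidate wins if it beats `threshold` (a hypothetical
    stand-in for the "no antecedent" option); otherwise mention i opens a
    new cluster.
    """
    cluster_of = []
    n_clusters = 0
    for i, emb in enumerate(embeddings):
        best_j, best_score = None, threshold
        for j in range(i):
            score = cosine(emb, embeddings[j])
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            cluster_of.append(n_clusters)
            n_clusters += 1
        else:
            cluster_of.append(cluster_of[best_j])
    return cluster_of

# Toy mention embeddings (assumed values, for illustration only).
mentions = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(rank_and_cluster(mentions))
```

In a trained system the similarity function would be learned and the candidates would usually be pruned by distance or type, but the ranking-then-linking loop keeps this general shape.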
A Benchmark of Rule-Based and Neural Coreference Resolution in Dutch Novels and News
We evaluate a rule-based (Lee et al., 2013) and neural (Lee et al., 2018)
coreference system on Dutch datasets of two domains: literary novels and
news/Wikipedia text. The results provide insight into the relative strengths of
data-driven and knowledge-driven systems, as well as the influence of domain,
document length, and annotation schemes. The neural system performs best on
news/Wikipedia text, while the rule-based system performs best on literature.
The neural system shows weaknesses with limited training data and long
documents, while the rule-based system is affected by annotation differences.
The code and models used in this paper are available at
https://github.com/andreasvc/crac2020
Comment: Accepted for CRAC 2020 @ COLING
Resource Mention Extraction for MOOC Discussion Forums
In discussions hosted on discussion forums for MOOCs, references to online
learning resources are often of central importance. They contextualize the
discussion, anchoring the discussion participants' presentation of the issues
and their understanding. However, they are usually mentioned in free text,
without appropriate hyperlinking to their associated resource. Automated
learning resource mention hyperlinking and categorization will facilitate
discussion and searching within MOOC forums, and also benefit the
contextualization of such resources across disparate views. We propose the
novel problem of learning resource mention identification in MOOC forums. As
this is a novel task with no publicly available data, we first contribute a
large-scale labeled dataset, dubbed the Forum Resource Mention (FoRM) dataset,
to facilitate our current research and future research on this task. We then
formulate this task as a sequence tagging problem and investigate solution
architectures to address the problem. Importantly, we identify two major
challenges that hinder the application of sequence tagging models to the task:
(1) the diversity of resource mention expression, and (2) long-range contextual
dependencies. We address these challenges by incorporating character-level and
thread context information into an LSTM-CRF model. First, we incorporate a
character encoder to address the out-of-vocabulary problem caused by the
diversity of mention expressions. Second, to address the context dependency
challenge, we encode thread contexts using an RNN-based context encoder, and
apply the attention mechanism to selectively leverage useful context
information during sequence tagging. Experiments on FoRM show that the proposed
method notably improves on the baseline deep sequence tagging models,
significantly bettering performance on instances that exemplify the two
challenges.
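To make the sequence tagging formulation concrete, the sketch below shows how per-token tags can be decoded into resource mention spans. It assumes a BIO scheme with a single hypothetical label `RES`; the abstract does not state the tag set used for FoRM, and the tags here are written by hand rather than predicted by an LSTM-CRF.

```python
def bio_to_spans(tokens, tags):
    """Collect the token spans marked by B-/I- tags and return them as strings."""
    spans = []
    start = None  # index where the currently open mention begins, if any
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A new mention begins; close any mention that was still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag.startswith("I-") and start is not None:
            # The open mention continues.
            continue
        else:
            # "O" (or a stray I- tag with no open mention) ends the mention.
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return [" ".join(tokens[s:e]) for s, e in spans]

# Hand-written example; the tag names are assumptions for illustration.
tokens = "see lecture 3 slides and quiz 2".split()
tags = ["O", "B-RES", "I-RES", "I-RES", "O", "B-RES", "I-RES"]
print(bio_to_spans(tokens, tags))
```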
Review of coreference resolution in English and Persian
Coreference resolution (CR) is one of the most challenging areas of natural
language processing. This task seeks to identify all textual references to the
same real-world entity. Research in this field is divided into coreference
resolution and anaphora resolution. Due to its application in textual
comprehension and its utility in other tasks such as information extraction
systems, document summarization, and machine translation, this field has
attracted considerable interest. Consequently, it has a significant effect on
the quality of these systems. This article reviews the existing corpora and
evaluation metrics in this field. Then, an overview of the coreference
algorithms, from rule-based methods to the latest deep learning techniques, is
provided. Finally, coreference resolution and pronoun resolution systems in
Persian are investigated.
Comment: 44 pages, 11 figures, 5 tables