525 research outputs found

    Semantic Relation Extraction. Resources, Tools and Strategies

    [Abstract] Relation extraction is a subtask of information extraction that aims at obtaining instances of semantic relations present in texts. This information can be arranged in machine-readable formats, useful for several applications that need structured semantic knowledge. The work presented in this paper explores different strategies to automate the extraction of semantic relations from texts in Portuguese, Galician and Spanish. Both machine learning (distant-supervised and supervised) and rule-based techniques are investigated, and the impact of the different levels of linguistic knowledge is analyzed for the various approaches. Regarding domains, the experiments are focused on the extraction of encyclopedic knowledge, by means of the development of biographical relations classifiers (in a closed domain) and the evaluation of an open information extraction tool. To implement the extraction systems, several natural language processing tools have been built for the three research languages: from sentence splitting and tokenization modules to part-of-speech taggers, named entity recognizers and coreference resolution systems. Furthermore, several lexica and corpora have been compiled and enriched with different levels of linguistic annotation, which are useful for both training and testing probabilistic and symbolic models. As a result of the performed work, new resources and tools are available for automated processing of texts in Portuguese, Galician and Spanish. Ministerio de Economía y Competitividad; FFI2014-51978-C2-1-R. Ministerio de Economía y Competitividad; FJCI-2014-2285.

    ParCorFull2.0: a Parallel Corpus Annotated with Full Coreference

    In this paper, we describe ParCorFull2.0, a parallel corpus annotated with full coreference chains for multiple languages, which is an extension of the existing corpus ParCorFull (Lapshinova-Koltunski et al., 2018). Similar to the previous version, this corpus has been created to address translation of coreference across languages, a phenomenon still challenging for machine translation (MT) and other multilingual natural language processing (NLP) applications. The current version of the corpus that we present here contains not only parallel texts for the language pair English-German, but also for English-French and English-Portuguese, which are all major European languages. The two newly added languages belong to the Romance group. The addition of a new language group creates a need for extension not only in terms of texts added, but also in terms of the annotation guidelines. Both French and Portuguese contain structures not found in English and German. Moreover, Portuguese is a pro-drop language, bringing even more systemic differences in the realisation of coreference into our cross-lingual resources. These differences cause problems for multilingual coreference resolution and machine translation. Our parallel corpus with full annotation of coreference will be a valuable resource with a variety of uses not only for NLP applications, but also for contrastive linguists and researchers in translation studies. Christian Hardmeier and Elina Lartaud were supported by the Swedish Research Council under grant 2017-930, which also funded the annotation work of the French data. Pedro Augusto Ferreira was supported by FCT, Foundation for Science and Technology, Portugal, under grant SFRH/BD/146578/2019.

    Privacy in text documents

    Sensitive data preservation is currently a manual and semi-automatic procedure. It suffers from various problems that affect the handling of confidential, sensitive and personal information: identifying sensitive data in documents requires human intervention, which is costly and prone to error, and at the scale of large document collections an approach that depends on human expertise to identify and relate sensitive data is not feasible. DataSense will be highly exportable software that will enable organizations to identify and understand the sensitive data in their possession in unstructured textual information (digital documents) for legal, compliance and security purposes. The goal is to identify and classify sensitive data (Personal Data) present in large-scale structured and non-structured information in a way that allows entities and/or organizations to understand it without calling into question security or confidentiality issues. The DataSense project will be based on European-Portuguese text documents and will combine different NLP (Natural Language Processing) technologies with advances in machine learning, such as Named Entity Recognition, Disambiguation, Co-referencing (ARE) and Automatic Learning and Human Feedback. It will also be characterized by the ability to assist organizations in complying with standards such as the GDPR (General Data Protection Regulation), which regulate data protection in the European Union.

    Learning discourse-new references in Portuguese texts

    This work presents the evaluation of a discourse status classifier for the Portuguese language. It considers two distinct classes of discourse novelty: Brand-new and New references. An evaluation of the relevant features according to different linguistic levels is presented in detail. IFIP International Conference on Artificial Intelligence in Theory and Practice - Speech and Natural Language. Red de Universidades con Carreras en Informática (RedUNCI).

    Improving Relation Extraction From Unstructured Genealogical Texts Using Fine-Tuned Transformers

    Though exploring one's family lineage through genealogical family trees can be insightful to developing one's identity, this knowledge is typically held behind closed doors by private companies or requires expensive technologies, such as DNA testing, to uncover. With the ever-booming explosion of data on the world wide web, many unstructured text documents, both old and new, are being discovered, written, and processed which contain rich genealogical information. Access to this immense amount of data, however, entails a costly process whereby people, typically volunteers, have to read large amounts of text to find relationships between people. This delays having genealogical information be open and accessible to all. This thesis explores state-of-the-art methods for relation extraction across the genealogical and biomedical domains and bridges new and old research by proposing an updated three-tier system for parsing unstructured documents. This system makes use of recently developed and massively pretrained transformers and fine-tuning techniques to take advantage of these deep neural models' inherent understanding of English syntax and semantics for classification. With only a fraction of labeled data typically needed to train large models, fine-tuning a LUKE relation classification model with minimal added features can identify genealogical relationships with macro precision, recall, and F1 scores of 0.880, 0.867, and 0.871, respectively, in data sets with scarce (∼10%) positive relations. Furthermore, with the advent of a modern coreference resolution system utilizing SpanBERT embeddings and a modern named entity parser, our end-to-end pipeline can extract and correctly classify relationships within unstructured documents with macro precision, recall, and F1 scores of 0.794, 0.616, and 0.676, respectively. This thesis also evaluates individual components of the system and discusses future improvements to be made.

    Coreference resolution for Portuguese using parallel corpora word alignment

    The field of Information Extraction essentially aims to investigate methods and techniques for transforming the unstructured information present in natural language texts into structured data. An important step in this process is coreference resolution, the task of identifying different noun phrases that refer to the same entity in the discourse. Coreference resolution has been extensively researched for English (Ng, 2010, lists a number of studies in the area), but it has received less attention in other languages. This is because the vast majority of the approaches used in this research are based on machine learning and therefore require a large amount of annotated data.

    A discursive grid approach to model local coherence in multi-document summaries

    Multi-document summarization is a very important area of Natural Language Processing (NLP) nowadays because of the huge amount of data on the web. People want more and more information, and this information must be coherently organized and summarized. The main focus of this paper is the coherence of multi-document summaries. To that end, a model that uses discursive information to automatically evaluate local coherence in multi-document summaries has been developed. This model obtains 92.69% accuracy in distinguishing coherent from incoherent summaries, outperforming the state of the art in the area. CAPES; FAPESP; University of Goiá

    Coreference chains in Czech, English and Russian: Preliminary findings

    This paper is a pilot comparative study of coreference chains in Czech, English and Russian. We analyzed 16 comparable texts in the three languages. Our motivation was to determine the linguistic structure of coreference chains in these languages and to identify which factors influence that structure.

    Coreference Resolution in FreeLing 4.0

    This paper presents the integration of RelaxCor into FreeLing. RelaxCor is a coreference resolution system based on constraint satisfaction that ranked second in the CoNLL-2011 shared task. FreeLing is an open-source library for NLP with more than fifteen years of existence and a widespread user community. We present the difficulties found in porting RelaxCor from a shared task scenario to a production environment, as well as the solutions devised. We present two strategies for this integration and a rough evaluation of the obtained results. Peer Reviewed. Postprint (published version).