10,080 research outputs found

    many faces, many places (Term21)

    Get PDF
    UIDB/03213/2020 UIDP/03213/2020publishersversionpublishe

    many faces, many places (Term21)

    Get PDF
    UIDB/03213/2020 UIDP/03213/2020Proceedings of the LREC 2022 Workshop Language Resources and Evaluation Conferencepublishersversionpublishe

    Using Linguistic Information and Machine Learning Techniques to Identify Entities from Juridical Documents

    Get PDF
    Information extraction from legal documents is an important and open problem. A mixed approach, using linguistic information and machine learning techniques, is described in this paper. In this approach, top-level legal concepts are identified and used for document classifica- tion using Support Vector Machines. Named entities, such as, locations, organizations, dates, and document references, are identified using se- mantic information from the output of a natural language parser. This information, legal concepts and named entities, may be used to popu- late a simple ontology, allowing the enrichment of documents and the creation of high-level legal information retrieval systems. The proposed methodology was applied to a corpus of legal documents - from the EUR-Lex site – and it was evaluated. The obtained results were quite good and indicate this may be a promising approach to the legal information extraction problem

    Extending the Galician Wordnet Using a Multilingual Bible Through Lexical Alignment and Semantic Annotation

    Get PDF
    In this paper we describe the methodology and evaluation of the expansion of Galnet - the Galician wordnet - using a multilingual Bible through lexical alignment and semantic annotation. For this experiment we used the Galician, Portuguese, Spanish, Catalan and English versions of the Bible. They were annotated with part-of-speech and WordNet sense using FreeLing. The resulting synsets were aligned, and new variants for the Galician language were extracted. After manual evaluation the approach presented a 96.8% accuracy

    Knowledge Representation of Crime-Related Events: a Preliminary Approach

    Get PDF
    The crime is spread in every daily newspaper, and particularly on criminal investigation reports produced by several Police departments, creating an amount of data to be processed by Humans. Other research studies related to relation extraction (a branch of information retrieval) in Portuguese arisen along the years, but with few extracted relations and several computer methods approaches, that could be improved by recent features, to achieve better performance results. This paper aims to present the ongoing work related to SEM (Simple Event Model) ontology population with instances retrieved from crime-related documents, supported by an SVO (Subject, Verb, Object) algorithm using hand-crafted rules to extract events, achieving a performance measure of 0.86 (F-Measure)

    Communicating (in) wine tourism: what are the paths for harmonising the sector and the Translation Process?

    Get PDF
    Wine tourism is an emerging area of specialisation to which several areas of knowledge (marketing, economics, anthropology, viticulture, etc.) converge. Portugal’s wine culture has a long tradition and is internationally recognised, placing it at the forefront of economic, professional, and academic initiatives in this sector. Communication between specialists and between specialists and national and international wine tourists requires an international terminology that is, simultaneously, mindful of tradition and that favours inclusive, efficient, and competitive trade exchanges. 226 Our research aims to contribute to the terminological harmonisation of Portuguese wine tourism, even though no ISO/IPQ standards have been issued in this emerging transdisciplinary area. In this article, two comparable academic sub-corpora (10 theses) on wine tourism will be analysed. Their comparison was carried out with the Sketch Engine programme, which allows, in addition to corpus management, to extract terms, identify keywords and represent their conceptual organisation. Our methodological approach included the analysis of the results based on the 50 most relevant terms in each of the corpus. Ten case studies taken from the corpora emphasise the diversity of terminogenic patterns in each language, the influence of cultural factors in the specialised wine tourism terminology of both languages, and, lastly, the influence of the English language on Portuguese wine tourism terminology. These results should be considered in the proposal of harmonised terminologies and in the translation of specialised wine tourism discourse.info:eu-repo/semantics/publishedVersio

    Privacy in text documents

    Get PDF
    The process of sensitive data preservation is a manual and a semi-automatic procedure. Sensitive data preservation suffers various problems, in particular, affect the handling of confidential, sensitive and personal information, such as the identification of sensitive data in documents requiring human intervention that is costly and propense to generate error, and the identification of sensitive data in large-scale documents does not allow an approach that depends on human expertise for their identification and relationship. DataSense will be highly exportable software that will enable organizations to identify and understand the sensitive data in their possession in unstructured textual information (digital documents) in order to comply with legal, compliance and security purposes. The goal is to identify and classify sensitive data (Personal Data) present in large-scale structured and non-structured information in a way that allows entities and/or organizations to understand it without calling into question security or confidentiality issues. The DataSense project will be based on European-Portuguese text documents with different approaches of NLP (Natural Language Processing) technologies and the advances in machine learning, such as Named Entity Recognition, Disambiguation, Co-referencing (ARE) and Automatic Learning and Human Feedback. It will also be characterized by the ability to assist organizations in complying with standards such as the GDPR (General Data Protection Regulation), which regulate data protection in the European Union.info:eu-repo/semantics/acceptedVersio

    Web 2.0, language resources and standards to automatically build a multilingual named entity lexicon

    Get PDF
    This paper proposes to advance in the current state-of-the-art of automatic Language Resource (LR) building by taking into consideration three elements: (i) the knowledge available in existing LRs, (ii) the vast amount of information available from the collaborative paradigm that has emerged from the Web 2.0 and (iii) the use of standards to improve interoperability. We present a case study in which a set of LRs for different languages (WordNet for English and Spanish and Parole-Simple-Clips for Italian) are extended with Named Entities (NE) by exploiting Wikipedia and the aforementioned LRs. The practical result is a multilingual NE lexicon connected to these LRs and to two ontologies: SUMO and SIMPLE. Furthermore, the paper addresses an important problem which affects the Computational Linguistics area in the present, interoperability, by making use of the ISO LMF standard to encode this lexicon. The different steps of the procedure (mapping, disambiguation, extraction, NE identification and postprocessing) are comprehensively explained and evaluated. The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively. Finally, in order to check the usefulness of the constructed resource, we apply it into a state-of-the-art Question Answering system and evaluate its impact; the NE lexicon improves the system’s accuracy by 28.1%. Compared to previous approaches to build NE repositories, the current proposal represents a step forward in terms of automation, language independence, amount of NEs acquired and richness of the information represented

    Natural language processing

    Get PDF
    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of www and digital libraries ; and (iv) evaluation of NLP systems
    corecore