98 research outputs found
Media monitoring and information extraction for the highly inflected agglutinative language Hungarian
The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically light-weight methods to deal with inflection and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web page.
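As a rough illustration of the linguistically light-weight suffix handling the abstract describes, the sketch below strips a frequent Hungarian case suffix from an inflected name when the remaining stem is a known lemma. The suffix list and names here are a small invented sample for illustration, not the frequency data reported in the paper:

```python
# Minimal sketch of suffix-based name lemmatisation for an agglutinative
# language. The suffix list is a tiny illustrative sample; a real system
# would use empirically derived frequency lists and handle assimilation.
HU_NAME_SUFFIXES = sorted(
    ["nak", "nek", "ban", "ben", "val", "vel", "ra", "re", "ba", "be", "t"],
    key=len, reverse=True,  # try longest suffixes first
)

def lemmatise_name(token, known_lemmas):
    """Strip the longest matching case suffix if the remainder is a known name."""
    for suffix in HU_NAME_SUFFIXES:
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            if stem in known_lemmas:
                return stem
    return token

names = {"Budapest", "Szeged"}
print(lemmatise_name("Budapestnek", names))  # → Budapest
print(lemmatise_name("London", names))       # unchanged, no known stem
```

Checking the stem against a gazetteer of known lemmas keeps the stripping conservative, so ordinary words that happen to end in a case suffix are left untouched.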
JRC-Names: Multilingual Entity Name variants and titles as Linked Data
Since 2004 the European Commission's Joint Research Centre (JRC) has been analysing the online version of printed media in over twenty languages and has automatically recognised and compiled large amounts of named entities (persons and organisations) and their many name variants. The collected variants not only include standard spellings in various countries, languages and scripts, but also frequently found spelling mistakes or lesser-used name forms, all occurring in real-life text (e.g. Benjamin/Binyamin/Bibi/Benyamín/Biniamin/Беньямин/بنیامین Netanyahu/Netanjahu/Nétanyahou/Netahnyahu/Нетаньяху/نتنیاهو). This entity name variant data, known as JRC-Names, has been available for public download since 2011. In this article, we report on our efforts to render JRC-Names as Linked Data (LD), using the lexicon model for ontologies lemon. Besides adhering to Semantic Web standards, this new release goes beyond the initial one in that it includes titles found next to the names, as well as date ranges when the titles and the name variants were found. It also establishes links towards existing datasets, such as DBpedia and Talk-Of-Europe. As a multilingual linguistic linked dataset, JRC-Names can help bridge the gap between structured data and natural languages, thus supporting large-scale data integration, e.g. cross-lingual mapping, and web-based content processing, e.g. entity linking. JRC-Names is publicly available through the dataset catalogue of the European Union's Open Data Portal.
Entity centric neural models for natural language processing
This thesis explores how to enhance natural language understanding by incorporating entity information into neural network models. It tackles three key questions:

1. Leveraging entities for understanding tasks: This work introduces Entity-GCN, a model that performs multi-step reasoning on a graph where nodes represent entity mentions and edges represent relationships. This method achieved state-of-the-art results on a multi-document question-answering dataset.

2. Identifying and disambiguating entities using large language models: This research proposes a novel system that retrieves entities by generating their names token-by-token, overcoming limitations of traditional methods and significantly reducing memory footprint. This approach is also extended to a multilingual setting and further optimized for efficiency.

3. Interpreting and controlling entity knowledge within models: This thesis presents a post-hoc interpretation technique to analyze how decisions are made across layers in neural models, allowing for visualization and analysis of knowledge representation. Additionally, a method for editing factual knowledge about entities is proposed, enabling correction of model predictions without costly retraining.
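The token-by-token entity retrieval idea in point 2 can be illustrated with a prefix trie built over a catalogue of entity names, which constrains generation to valid names only (this is also how the memory footprint stays small: the trie replaces a large table of entity embeddings). The sketch below is a simplification under invented data: tokenisation is plain whitespace splitting, while the actual system decodes subword tokens with a neural language model:

```python
# Sketch of trie-constrained name generation. At each decoding step, only
# tokens that continue some catalogued entity name are permitted.
def build_trie(entity_names):
    """Build a nested-dict prefix trie; '<end>' marks a complete name."""
    trie = {}
    for name in entity_names:
        node = trie
        for tok in name.split():
            node = node.setdefault(tok, {})
        node["<end>"] = {}
    return trie

def allowed_next_tokens(trie, prefix_tokens):
    """Tokens a constrained decoder may emit after the given prefix."""
    node = trie
    for tok in prefix_tokens:
        node = node[tok]
    return sorted(node)

catalogue = ["Benjamin Netanyahu", "Benjamin Franklin", "European Commission"]
trie = build_trie(catalogue)
print(allowed_next_tokens(trie, ["Benjamin"]))  # ['Franklin', 'Netanyahu']
```

In a real decoder this lookup would mask the language model's output distribution at each step, so the beam search can only ever produce names present in the catalogue.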
Biographical information extraction: A language-agnostic methodology for datasets and models
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.

Information extraction (IE) refers to the task of detecting and linking information contained in written texts. While it includes various subtasks, relation extraction (RE) is used to link two entities in a text via a common relation. RE can therefore be used to build linked databases of knowledge across a wide area of topics. Today, the task of RE is treated as a supervised machine learning (ML) task, where a model is trained using a specific architecture and a specific annotated dataset. These specific datasets typically aim to represent common patterns that the model is to learn, albeit at the cost of manual annotation, which can be costly and time-consuming. In addition, due to the nature of the training process, the models can be sensitive to a specific genre or topic, and are generally monolingual. It therefore stands to reason that certain genres and topics have better models, as they are treated with a higher priority due to financial interests, for instance. This in turn leads to RE models not being available to every area of research, leaving linked databases of knowledge incomplete. For instance, if the birthplace of a person is not correctly extracted, the place and the person cannot be linked correctly, leaving the linked database incomplete.

To address this problem, this thesis explores aspects of RE that could be adapted in ways which require little human effort, therefore making RE models more widely available. The first aspect is the annotated data. During the course of this thesis, Wikipedia and its subsidiaries are used as sources to automatically annotate sentences for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), information is matched with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain: birthdate, birthplace, deathdate, deathplace, occupation, parent, educated, child, sibling and other (all other relations). Furthermore, the effectiveness of the dataset is demonstrated by training a state-of-the-art neural model to classify relation pairs. For its evaluation, a manually annotated gold-standard set is used. An investigation of the necessary adaptations to recreate the automatic process in a multilingual setting is also undertaken, looking specifically at English and German, for which similar neural models are trained and evaluated on a gold-standard dataset. While the process is aimed here at training neural models for RE within the domain of digital humanities and history, it may be transferable to other domains.
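The automatic annotation step described in this abstract, aligning article sentences with matching structured records, can be sketched as simple distant supervision: a sentence is labelled with a relation when it mentions both the subject of the article and the object of a known fact. All names, records and the matching rule below are invented for illustration; the thesis itself draws on Pantheon and Wikidata and uses NER rather than plain string containment:

```python
# Toy distant-supervision alignment: label a sentence with a relation when
# it contains both the subject entity and the fact's object string.
def annotate(sentences, person, facts):
    """facts: mapping relation -> object string, e.g. {'birthplace': 'Ulm'}."""
    pairs = []
    for sent in sentences:
        for relation, obj in facts.items():
            if person in sent and obj in sent:
                pairs.append((sent, person, obj, relation))
    return pairs

sentences = ["Einstein was born in Ulm in 1879.", "He moved to Bern later."]
facts = {"birthplace": "Ulm", "birthdate": "1879"}
print(annotate(sentences, "Einstein", facts))  # two pairs, both from sentence 1
```

String containment is noisy in practice (coreference, ambiguous names, partial matches), which is why the thesis relies on article structure and robust NER to keep precision high.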
Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme
Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies