98 research outputs found
Media monitoring and information extraction for the highly inflected agglutinative language Hungarian
The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically light-weight methods to deal with inflection and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web page.
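As a rough illustration of the linguistically light-weight suffix handling the abstract describes, the sketch below strips a frequent Hungarian case suffix from an inflected name when the remaining stem is a known lemma. The suffix list and names here are a small invented sample for illustration, not the frequency data reported in the paper:

```python
# Minimal sketch of suffix-based name lemmatisation for an agglutinative
# language. The suffix list is a tiny illustrative sample; a real system
# would use empirically derived frequency lists and handle assimilation.
HU_NAME_SUFFIXES = sorted(
    ["nak", "nek", "ban", "ben", "val", "vel", "ra", "re", "ba", "be", "t"],
    key=len, reverse=True,  # try longest suffixes first
)

def lemmatise_name(token, known_lemmas):
    """Strip the longest matching case suffix if the remainder is a known name."""
    for suffix in HU_NAME_SUFFIXES:
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            if stem in known_lemmas:
                return stem
    return token

names = {"Budapest", "Szeged"}
print(lemmatise_name("Budapestnek", names))  # → Budapest
print(lemmatise_name("London", names))       # unchanged, no known stem
```

Checking the stem against a gazetteer of known lemmas keeps the stripping conservative, so ordinary words that happen to end in a case suffix are left untouched.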
JRC-Names: Multilingual Entity Name variants and titles as Linked Data
Since 2004 the European Commission's Joint Research Centre (JRC) has been analysing the online version of printed media in over twenty languages and has automatically recognised and compiled large amounts of named entities (persons and organisations) and their many name variants. The collected variants not only include standard spellings in various countries, languages and scripts, but also frequently found spelling mistakes or lesser-used name forms, all occurring in real-life text (e.g. Benjamin/Binyamin/Bibi/Benyamín/Biniamin/Беньямин/بنیامین Netanyahu/Netanjahu/Nétanyahou/Netahnyahu/Нетаньяху/نتنیاهو). This entity name variant data, known as JRC-Names, has been available for public download since 2011. In this article, we report on our efforts to render JRC-Names as Linked Data (LD), using the lexicon model for ontologies lemon. Besides adhering to Semantic Web standards, this new release goes beyond the initial one in that it includes titles found next to the names, as well as date ranges when the titles and the name variants were found. It also establishes links towards existing datasets, such as DBpedia and Talk-Of-Europe. As a multilingual linguistic linked dataset, JRC-Names can help bridge the gap between structured data and natural languages, thus supporting large-scale data integration, e.g. cross-lingual mapping, and web-based content processing, e.g. entity linking. JRC-Names is publicly available through the dataset catalogue of the European Union's Open Data Portal.
Entity centric neural models for natural language processing
This thesis explores how to enhance natural language understanding by incorporating entity information into neural network models. It tackles three key questions:

1. Leveraging entities for understanding tasks: This work introduces Entity-GCN, a model that performs multi-step reasoning on a graph where nodes represent entity mentions and edges represent relationships. This method achieved state-of-the-art results on a multi-document question-answering dataset.

2. Identifying and disambiguating entities using large language models: This research proposes a novel system that retrieves entities by generating their names token-by-token, overcoming limitations of traditional methods and significantly reducing memory footprint. This approach is also extended to a multilingual setting and further optimized for efficiency.

3. Interpreting and controlling entity knowledge within models: This thesis presents a post-hoc interpretation technique to analyze how decisions are made across layers in neural models, allowing for visualization and analysis of knowledge representation. Additionally, a method for editing factual knowledge about entities is proposed, enabling correction of model predictions without costly retraining.
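The token-by-token entity retrieval idea in point 2 can be illustrated with a prefix trie built over a catalogue of entity names, which constrains generation to valid names only (this is also how the memory footprint stays small: the trie replaces a large table of entity embeddings). The sketch below is a simplification under invented data: tokenisation is plain whitespace splitting, while the actual system decodes subword tokens with a neural language model:

```python
# Sketch of trie-constrained name generation. At each decoding step, only
# tokens that continue some catalogued entity name are permitted.
def build_trie(entity_names):
    """Build a nested-dict prefix trie; '<end>' marks a complete name."""
    trie = {}
    for name in entity_names:
        node = trie
        for tok in name.split():
            node = node.setdefault(tok, {})
        node["<end>"] = {}
    return trie

def allowed_next_tokens(trie, prefix_tokens):
    """Tokens a constrained decoder may emit after the given prefix."""
    node = trie
    for tok in prefix_tokens:
        node = node[tok]
    return sorted(node)

catalogue = ["Benjamin Netanyahu", "Benjamin Franklin", "European Commission"]
trie = build_trie(catalogue)
print(allowed_next_tokens(trie, ["Benjamin"]))  # ['Franklin', 'Netanyahu']
```

In a real decoder this lookup would mask the language model's output distribution at each step, so the beam search can only ever produce names present in the catalogue.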
Biographical information extraction: A language-agnostic methodology for datasets and models
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.

Information extraction (IE) refers to the task of detecting and linking information contained in written texts. While it includes various subtasks, relation extraction (RE) is used to link two entities in a text via a common relation. RE can therefore be used to build linked databases of knowledge across a wide area of topics. Today, the task of RE is treated as a supervised machine learning (ML) task, where a model is trained using a specific architecture and a specific annotated dataset. These specific datasets typically aim to represent common patterns that the model is to learn, albeit at the cost of manual annotation, which can be costly and time-consuming. In addition, due to the nature of the training process, the models can be sensitive to a specific genre or topic, and are generally monolingual. It therefore stands to reason that certain genres and topics have better models, as they are treated with a higher priority due to financial interests, for instance. This in turn leads to RE models not being available to every area of research, leaving linked databases of knowledge incomplete. For instance, if the birthplace of a person is not correctly extracted, the place and the person cannot be linked correctly, leaving the linked database incomplete.

To address this problem, this thesis explores aspects of RE that could be adapted in ways which require little human effort, therefore making RE models more widely available. The first aspect is the annotated data. During the course of this thesis, Wikipedia and its subsidiaries are used as sources to automatically annotate sentences for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), information is matched with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain: birthdate, birthplace, deathdate, deathplace, occupation, parent, educated, child, sibling and other (all other relations). Furthermore, the effectiveness of the dataset is demonstrated by training a state-of-the-art neural model to classify relation pairs. For its evaluation, a manually annotated gold-standard set is used. An investigation of the necessary adaptations to recreate the automatic process in a multilingual setting is also undertaken, looking specifically at English and German, for which similar neural models are trained and evaluated on a gold-standard dataset. While the process is aimed here at training neural models for RE within the domain of digital humanities and history, it may be transferable to other domains.
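The automatic annotation step described in this abstract, aligning article sentences with matching structured records, can be sketched as simple distant supervision: a sentence is labelled with a relation when it mentions both the subject of the article and the object of a known fact. All names, records and the matching rule below are invented for illustration; the thesis itself draws on Pantheon and Wikidata and uses NER rather than plain string containment:

```python
# Toy distant-supervision alignment: label a sentence with a relation when
# it contains both the subject entity and the fact's object string.
def annotate(sentences, person, facts):
    """facts: mapping relation -> object string, e.g. {'birthplace': 'Ulm'}."""
    pairs = []
    for sent in sentences:
        for relation, obj in facts.items():
            if person in sent and obj in sent:
                pairs.append((sent, person, obj, relation))
    return pairs

sentences = ["Einstein was born in Ulm in 1879.", "He moved to Bern later."]
facts = {"birthplace": "Ulm", "birthdate": "1879"}
print(annotate(sentences, "Einstein", facts))  # two pairs, both from sentence 1
```

String containment is noisy in practice (coreference, ambiguous names, partial matches), which is why the thesis relies on article structure and robust NER to keep precision high.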
Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme
Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies