12 research outputs found

    Acronym recognition and processing in 22 languages

    We present work on recognising acronyms of the form Long-Form (Short-Form), such as "International Monetary Fund (IMF)", in millions of news articles in twenty-two languages, as part of our more general effort to recognise entities and their variants in news text and to use them for the automatic analysis of the news, including the linking of related news across languages. We show how the acronym recognition patterns, initially developed for medical terms, needed to be adapted to the more general news domain, and we present evaluation results. We describe our effort to automatically merge the numerous long-form variants referring to the same short-form, while keeping non-related long-forms separate. Finally, we provide extensive statistics on the frequency and the distribution of short-form/long-form pairs across languages.
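
    The Long-Form (Short-Form) pattern mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the function name `find_acronym_pairs`, the regular expression and the initial-letter matching rule are assumptions made for illustration, and the sketch ignores function words inside long forms and non-Latin scripts.

```python
import re

# Assumed short-form shape: 2 to 10 capital letters in parentheses, e.g. "(IMF)".
SHORT_FORM = re.compile(r"\(([A-Z]{2,10})\)")


def find_acronym_pairs(text):
    """Return (long_form, short_form) pairs where the words directly before
    the parenthesis start with the letters of the short form, in order."""
    pairs = []
    for match in SHORT_FORM.finditer(text):
        short = match.group(1)
        # Take as many preceding words as the short form has letters.
        words = text[:match.start()].split()
        candidate = words[-len(short):]
        initials = "".join(word[0].upper() for word in candidate)
        if len(candidate) == len(short) and initials == short:
            pairs.append((" ".join(candidate), short))
    return pairs


print(find_acronym_pairs("Talks with the International Monetary Fund (IMF) resumed."))
```

    A candidate such as "see note (AB)" is rejected because the initials of the preceding words do not spell the short form.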

    Media monitoring and information extraction for the highly inflected agglutinative language Hungarian

    The Europe Media Monitor (EMM) is a fully automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding Hungarian text mining tools to EMM for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments in which we apply linguistically light-weight methods to deal with inflection, and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online.

    Creation and use of multilingual named entity variant dictionaries

    The highly multilingual media analysis application Europe Media Monitor (EMM) makes extensive use of name dictionaries, including not only large lists of person, organisation and location names, but also many spelling variants for the same named entity, both within the same language and across languages and scripts. As EMM could not operate without these non-traditional dictionaries, we wish to make a strong case in their favour. In this chapter, we will explain how such vocabulary lists are used within EMM and how they were produced automatically by analysing over 100,000 news articles per day in over twenty languages. A large part of EMM’s vocabulary lists is made publicly available for download as part of JRC-Names. JRC.G.2-Global security and crisis management

    COVID-19 news monitoring with Medical Information System (Medisys)

    Dataset of metadata created with the Europe Media Monitor (EMM)/Medical Information System (MediSys) processing chain from news articles. MediSys is a media monitoring system providing event-based surveillance to rapidly identify potential public health threats using information from media reports. The system displays only those articles of interest to public health (e.g. diseases, plant pests, psychoactive substances), analyses news reports and warns users with automatically generated alerts. This dataset focuses on COVID-19. It provides a large set of metadata automatically extracted from news articles related to COVID-19, stored in RSS/XML format. It is publicly available, and anyone can build applications on top of it. The current version contains 4 months of news articles, from December 2019 to April 2020, which corresponds to more than 6 million news articles. There is one zip file per month, containing all the metadata. As an example, the biggest month, March 2020, contains 4.1 million news articles from 76 different languages, 36 million entity occurrences (person names, organization names, location names, …), 15 million dates and 0.8 million quotations. The information processed by MediSys is derived from the Europe Media Monitor (EMM). The freely accessible EMM is a fully automatic system that analyses online media. It gathers and aggregates about 300,000 news articles per day from news portals worldwide in up to 80 languages. JRC.I.3-Text and Data Mining

    Acronym recognition and processing in 22 languages

    We present work on recognising acronyms of the form Long-Form (Short-Form), such as “International Monetary Fund (IMF)”, in millions of news articles in twenty-two languages, as part of our more general effort to recognise entities and their variants in news text and to use them for the automatic analysis of the news, including the linking of related news across languages. We show how the acronym recognition patterns, initially developed for medical terms, needed to be adapted to the more general news domain, and we present evaluation results. We describe our effort to automatically merge the numerous long-form variants referring to the same short-form, while keeping non-related long-forms separate. Finally, we provide extensive statistics on the frequency and the distribution of short-form/long-form pairs across languages. JRC.G.2-Global security and crisis management

    Highly Multilingual Coreference Resolution Exploiting a Mature Entity Repository

    In this paper we present an approach to large-scale coreference resolution for a wide range of human languages, with a particular emphasis on time performance and precision. One of the distinctive features of our approach is the use of a mature multilingual named entity repository (persons and organizations) gradually compiled over the past few years. Our experiments show promising results: an overall precision of 94% tested on seven different languages. We also present an extrinsic evaluation on seven languages in the context of summarization, where we gauge the contribution of the coreference resolver to the final summarization performance. JRC.G.2-Global security and crisis management

    Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili

    The Europe Media Monitor (EMM) family of applications is a set of multilingual tools that gather, cluster and classify news in currently fifty languages and that extract named entities and quotations (reported speech) from twenty languages. In this paper, we describe the recent effort of adding the African Bantu language Swahili to EMM. EMM is designed in an entirely modular way, so that a new language can be plugged in by providing the language-specific resources for that language. We thus describe the type of language-specific resources needed, the effort involved, and ways of bootstrapping the generation of these resources in order to keep the effort of adding a new language to a minimum. The text analysis applications pursued in our efforts include clustering, classification, recognition and disambiguation of named entities (persons, organisations and locations), recognition and normalisation of date expressions, as well as the identification of reported speech quotations by and about people. JRC.G.2-Global security and crisis management

    Supernarrative country distribution.

    The table shows the country distribution of all supernarratives and narratives. The bigger the square, the more articles from the monitored sources of a given country have been assigned to the narrative in question. The colour represents the distribution of articles in percentiles across the narrative. Using percentiles allows us to display each source country’s ranking within each narrative, despite the different numbers of monitored sources for each country.
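
    One way to read the percentile colouring described in this caption is sketched here. The caption does not give the exact formula, so the function name `percentile_ranks`, the "at or below" definition and the sample counts are all assumptions made for illustration.

```python
def percentile_ranks(counts):
    """Map each country to the percentage of countries in the same narrative
    whose article count is at or below its own (assumed definition)."""
    values = list(counts.values())
    total = len(values)
    return {
        country: 100.0 * sum(value <= count for value in values) / total
        for country, count in counts.items()
    }


# Hypothetical article counts for a single narrative.
sample = {"A": 10, "B": 40, "C": 50, "D": 100}
print(percentile_ranks(sample))
```

    Because the result depends only on the ordering of counts within a narrative, a country's rank stays comparable across narratives with very different article totals.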

    Query. Keyword-based query.

    To tackle the COVID-19 infodemic, we analysed 58,625 articles from 460 unverified sources, that is, sources that were indicated by fact checkers and other mis/disinformation experts as frequently spreading mis/disinformation, covering the period from 1 January 2020 to 31 December 2022. Our aim was to identify the main narratives of COVID-19 mis/disinformation, develop a codebook, automate the process of narrative classification by training an automatic classifier, and analyse the spread of narratives over time and across countries. Articles were retrieved with a customised version of the Europe Media Monitor (EMM) processing chain providing a stream of text items. Machine translation was employed to automatically translate non-English text to English and clustering was carried out to group similar articles. A multi-level codebook of COVID-19 mis/disinformation narratives was developed following an inductive approach; a transformer-based model was developed to classify all text items according to the codebook. Using the transformer-based model, we identified 12 supernarratives that evolved over the three years studied. The analysis shows that there are often real events behind mis/disinformation trends, which unverified sources misrepresent or take out of context. We established a process that allows for near real-time monitoring of COVID-19 mis/disinformation. This experience will be useful to analyse mis/disinformation about other topics, such as climate change, migration, and geopolitical developments.