1,128 research outputs found

    Media monitoring and information extraction for the highly inflected agglutinative language Hungarian

    Get PDF
    The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically light-weight methods to deal with inflection and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web pag

    Multilingual person name recognition and transliteration

    Get PDF
    Nous présentons ici un outil de repérage des noms de personnes, à partir d’articles de la presse internationale, capable de reconnaître les différentes variantes d’un même nom. L’originalité de notre approche vient de l’identification des variantes de noms à travers les langues et systèmes d’écriture, grec, cyrillique et arabe compris. Étant donné notre contexte multilingue, nous utilisons une représentation interne standard de chaque nom ainsi qu’une même mesure de similarité (au lieu d’adopter l’approche bilingue habituelle de la translittération). Ce module fait partie d’un outil plus général qui analyse en moyenne 15.000 articles de journaux chaque jour, afin de regrouper les documents similaires, aussi bien dans une même langue que dans des langues différentes.We present an exploratory tool that extracts person names from multilingual news collections, matches name variants referring to the same person, and infers relationships between people based on the co-occurrence of their names in related news. A novel feature is the matching of name variants across languages and writing systems, including names written with the Greek, Cyrillic and Arabic writing system. Due to our highly multilingual setting, we use an internal standard representation for name representation and matching, instead of adopting the traditional bilingual approach to transliteration. This work is part of a news analysis system that clusters an average of 25,000 news articles per day to detect related news within the same and across different languages

    Design of a Controlled Language for Critical Infrastructures Protection

    Get PDF
    We describe a project for the construction of controlled language for critical infrastructures protection (CIP). This project originates from the need to coordinate and categorize the communications on CIP at the European level. These communications can be physically represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an analogous work done during the sixties in the field of nuclear science known as the Euratom Thesaurus.JRC.G.6-Security technology assessmen

    Morphological awareness in readers of IsiXhosa

    Get PDF
    This study focuses particularly on the development of four Morphological Awareness reading tests in isiXhosa and on the relationship of Morphological Awareness to reading success among 74 Grade 3 isiXhosa-speaking foundation-phase learners from three peri-urban schools. It explores in-depth why not all previously established Morphological Awareness tests for other languages suit the morphology of isiXhosa and how these tests have been revised in order to do so. Conventionally, the focus of Morphological Awareness literature has been on derivational morphology and reading comprehension. This study did not find significant correlations with comprehension, but rather with the children's ability to decode. Fluency and Morphological Awareness have not been given as much attention in the literature, but Morphological Awareness could be important for processing the agglutinating structure of the language in reading. This study also argues that it is not a specific awareness of derivational morphology over inflectional morphology, but rather a general awareness of one's language structure that is more important at this stage in their literacy development; specifically a general awareness of prefixes and suffixes. In addition, it was found that an explicit awareness of the morphological structure of the language related more to fluency and tests that accessed an innate and implicit Morphological Awareness had the strongest correlations overall with comprehension. The findings from this report have implications regarding how future curriculum developments for morphologically rich languages like isiXhosa should be approached. The positive and practical implications of including different types of Morphological Awareness tutoring in curricula is argued for, especially when teaching younger readers how to approach morphologically complex words in texts

    Enhancing the Performance of Text Mining

    Get PDF
    The amount of text data produced in science, finance, social media, and medicine is growing at an unprecedented pace. The raw text data typically introduces major computational and analytical obstacles (e.g., extremely high dimensionality) to data mining and machine learning algorithms. Besides, the growth in the size of text data makes the search process more difficult for information retrieval systems, making retrieving relevant results to match the users’ search queries challenging. Moreover, the availability of text data in different languages creates the need to develop new methods to analyze multilingual topics to help policymakers in governmental and health systems to make risk decisions and to create policies to respond to public health crises, natural disasters, and political or social movements. The goal of this thesis is to develop new methods that handle computational and analytical problems for complex high-dimensional text data, develop a new query expansion approach to enhance the performance of information retrieval systems, and to present new techniques for analyzing multilingual topics using a translation service. First, in the field of dimensionality reduction, we develop a new method for detecting and eliminating domain-based words. In this method, we use three different datasets and five classifiers for testing and evaluating the performance of our new approach before and after eliminating domain-based words. We compare the performance of our approach with other feature selection methods. We find that the new approach improves the performance of the binary classifier and reduces the dimensionality of the feature space by 90%. Also, our approach reduces the execution time of the classifier and outperforms one of the feature selection methods. Second, in the field of information retrieval, we design and implement a method that integrates words from a current stream with external data sources in order to predict the occurrence of relevant words that have not yet appeared in the primary source. This algorithm enables the construction of new queries that effectively capture emergent events that a user may not have anticipated when initiating the data collection stream. The added value of using the external data sources appears when we have a stream of data and we want to predict something that has not yet happened instead of using only the stream that is limited to the available information at a specific time. We compare the performance of our approach with two alternative approaches. The first approach (static) expands user queries with words extracted from a probabilistic topic model of the stream. The second approach (emergent) reinforces user queries with emergent words extracted from the stream. We find that our method outperforms alternative approaches, exhibiting particularly good results in identifying future emergent topics. Third, in the field of the multilingual text, we present a strategy to analyze the similarity between multilingual topics in English and Arabic tweets surrounding the 2020 COVID-19 pandemic. We make a descriptive comparison between topics in Arabic and English tweets about COVID-19 using tweets collected in the same way and filtered using the same keywords. We analyze Twitter’s discussion to understand the evolution of topics over time and reveal topic similarity among tweets across the datasets. We use probabilistic topic modeling to identify and extract the key topics of Twitter’s discussion in Arabic and English tweets. We use two methods to analyze the similarity between multilingual topics. The first method (full-text topic modeling approach) translates all text to English and then runs topic modeling to find similar topics. The second method (term-based topic modeling approach) runs topic modeling on the text before translation then translates the top keywords in each topic to find similar topics. We find similar topics related to COVID-19 pandemic covered in English and Arabic tweets for certain time intervals. Results indicate that the term-based topic modeling approach can reduce the cost compared to the full-text topic modeling approach and still have comparable results in finding similar topics. The computational time to translate the terms is significantly lower than the translation of the full text

    Building and evaluating resources for sentiment analysis in the Greek language

    Get PDF
    Sentiment lexicons and word embeddings constitute well-established sources of information for sentiment analysis in online social media. Although their effectiveness has been demonstrated in state-of-the-art sentiment analysis and related tasks in the English language, such publicly available resources are much less developed and evaluated for the Greek language. In this paper, we tackle the problems arising when analyzing text in such an under-resourced language. We present and make publicly available a rich set of such resources, ranging from a manually annotated lexicon, to semi-supervised word embedding vectors and annotated datasets for different tasks. Our experiments using different algorithms and parameters on our resources show promising results over standard baselines; on average, we achieve a 24.9% relative improvement in F-score on the cross-domain sentiment analysis task when training the same algorithms with our resources, compared to training them on more traditional feature sources, such as n-grams. Importantly, while our resources were built with the primary focus on the cross-domain sentiment analysis task, they also show promising results in related tasks, such as emotion analysis and sarcasm detection

    Lynx: A knowledge-based AI service platform for content processing, enrichment and analysis for the legal domain

    Get PDF
    The EU-funded project Lynx focuses on the creation of a knowledge graph for the legal domain (Legal Knowledge Graph, LKG) and its use for the semantic processing, analysis and enrichment of documents from the legal domain. This article describes the use cases covered in the project, the entire developed platform and the semantic analysis services that operate on the documents. © 202
    • …
    corecore