8 research outputs found

    Creating a Thesaurus "Crime-Related Web Content" Based on a Multilingual Corpus

    An overview of the most common ontological resources and of the methods for their construction and application is given. For research purposes, we analyzed the characteristics of ontologies in the public domain and of a corpus of texts with criminal context. Additionally, we have recently developed a Flask-based web application that generates ontologies using the Anytree library. The authors also developed a multilingual basic ontology called "Illegal Web content" based on a corpus of texts with criminal context in English, Ukrainian, Kazakh, and Russian. The development of this ontology was motivated by the need for effective analysis and prevention of criminal activities based on textual information disseminated on the internet. The web application lets users import text files in different languages and then automatically generates an ontology from the text. Users can customize the ontology by adding or removing nodes, changing the labels of nodes and edges, and setting edge weights. Overall, the "Illegal Web content" ontology and the web application contribute to ontology development and text processing for criminal investigation and prevention. The web application's ease of use and customizability make it a valuable tool for researchers and practitioners alike.

    Logical-linguistic model for multilingual Open Information Extraction

    Open Information Extraction (OIE) is a modern strategy for extracting fact triplets from collections of Web documents. However, most current OIE approaches rely on NLP techniques such as POS tagging and dependency parsing, and tools for these are not available for all languages. In this paper, we propose a logical-linguistic model whose basic mathematical means are the logical-algebraic equations of the algebra of finite predicates. These equations express the semantic role of each participant in a fact triplet (Subject-Predicate-Object) through the relations among the grammatical characteristics of the words in a sentence. The model extracts an unlimited, domain-independent number of facts from sentences in different languages. It extracts facts from unstructured texts without requiring a pre-specified vocabulary, by identifying relations in phrases and their associated arguments in arbitrary sentences in English, Kazakh, and Russian. We evaluate our approach on corpora in three languages based on English and Kazakh bilingual news websites. We achieve a fact-extraction precision of over 87% for the English corpus, over 82% for the Russian corpus, and 71% for the Kazakh corpus.

    A Parallel Corpus-Based Approach to the Crime Event Extraction for Low-Resource Languages

    These days, many crime-related events take place all over the world. Most of them are reported on news portals and social media. Extracting crime-related events from published texts makes it possible to monitor, analyze, and compare police or criminal activities in different countries or regions. Existing approaches to event extraction mainly process texts in English, French, Chinese, and some other resource-rich and well-annotated languages. This paper presents a parallel corpus-based approach that follows a closed-domain methodology for event extraction from web news articles in low-resource languages. To identify the event, its arguments, and the arguments' roles in the source-language part of the corpus, we use an enhanced pattern-based method that involves a multilingual synonyms dictionary with knowledge about crime-related concepts and logic-linguistic equations. Event extraction from the target-language part of the corpus uses a cross-lingual transfer technique for crime-related event extraction that is based on supplementary knowledge about the semantic similarity patterns of the considered pair of languages. The presented approach does not require a previously annotated corpus for training, which makes it more attractive for low-resource languages, and it extracts TRANSFER, CRIME, and POLICE event types and their seven subtypes from news articles on various topics simultaneously. Applied to a Russian-Kazakh parallel corpus of news portal articles, our approach achieved an F1-measure for crime-related event extraction of over 82% for the source language and 63% for the target language.

    Understanding the Ukrainian migrants challenges in the EU : a topic modeling approach

    Confronted with the aggression against Ukraine in 2022, Europe faces one of its most important humanitarian challenges: the migration of war refugees from Ukraine, most of them women with children and the elderly. International institutions such as the European Union and the United Nations, national governments and, above all, local governments, which are the main providers of services and resources for refugees, are taking a number of measures to meet these needs. The extraordinary nature and extent of the humanitarian needs pose exceptional challenges for governments, Non-Governmental Organizations (NGOs), and civil society. European countries adopted distinct reception procedures to accommodate war refugees in their territories. The purpose of this paper is to examine the challenges of war refugees from Ukraine and to understand how they vary across selected European countries. Using BERTopic topic modeling, a text analytics approach, we analyzed text messages published on Telegram channels from February 2022 to September 2023, revealing 12 challenges facing Ukrainian migrants. Furthermore, our study examines the distribution of these challenges across 6 major European countries with significant migrant populations, providing insights into regional differences. Additionally, temporal changes in 8 narrative themes in discussions of Ukrainian migration, extracted from official government websites, were examined. Together, this research contributes (1) to demonstrating how analytics-driven methodology can potentially be used to extract in-depth knowledge from textual data freely available on social media; and (2) to a deeper understanding of the various issues affecting the adaptation of Ukrainian migrants in European countries. The study also provides recommendations to improve programs and policies to better support the successful integration of Ukrainian migrants in host countries.

    Topic modelling of Ukraine war-related news using latent Dirichlet allocation with collapsed Gibbs sampling

    This research applies topic modeling to news about the war in Ukraine. The objective is to use Latent Dirichlet Allocation (LDA) with Collapsed Gibbs sampling to identify distinct content groups in war-related news. The method involves scraping data from a Ukrainian news website, preprocessing the data, and applying LDA with Collapsed Gibbs sampling to infer the latent topics within the corpus. The results include the identification of twelve distinct topics and the corresponding keywords that characterize each topic. The analysis of the results provides insights into the context of each topic, such as discussions on safety measures during wartime, consequences of military actions, and reports on military casualties. The research concludes that LDA with Collapsed Gibbs sampling is a valuable tool for identifying and understanding the context of war-related news. However, there may be discrepancies between the results of the model and human interpretation, which may be due to limitations in the results, model parameters, and the presence of noisy data. Future research should focus on optimizing model parameters, filtering noisy data, and improving the analysis of topic context to enhance the reliability and interpretability of the results.
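
    For readers unfamiliar with the technique named in this abstract, the following is a minimal, illustrative sketch of LDA inference with collapsed Gibbs sampling in plain Python. It is not the authors' implementation: the function names (`lda_collapsed_gibbs`, `top_words`), the hyperparameter defaults (`alpha`, `beta`), the iteration count, and the toy corpus are all assumptions chosen for illustration.

```python
import random


def lda_collapsed_gibbs(docs, num_topics, alpha=0.1, beta=0.01,
                        iterations=200, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (all defaults are assumptions).

    docs is a list of tokenized documents (lists of strings).
    Returns the vocabulary, document-topic counts, and topic-word counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []                                            # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k_old = assignments[d][i]
                # Remove the current token from all counts.
                doc_topic[d][k_old] -= 1
                topic_word[k_old][v] -= 1
                topic_total[k_old] -= 1
                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][v] += 1
                topic_total[k_new] += 1

    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, n=3):
    """Return the n highest-count words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


if __name__ == "__main__":
    # Hypothetical toy corpus, already tokenized and preprocessed.
    toy_docs = [
        ["shelter", "alarm", "safety", "shelter"],
        ["missile", "strike", "damage", "strike"],
        ["alarm", "safety", "shelter", "alarm"],
        ["strike", "damage", "missile", "damage"],
    ]
    vocab, doc_topic, topic_word = lda_collapsed_gibbs(toy_docs, num_topics=2)
    for k, words in enumerate(top_words(vocab, topic_word)):
        print(f"topic {k}: {words}")
```

    The sampler is "collapsed" in the sense that the topic and word distributions are integrated out, so only count tables and per-token topic assignments are kept. In practice the number of topics (twelve in the abstract) and the hyperparameters would need tuning, which is the optimization the abstract lists as future work.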