4 research outputs found

    AIKoGAM: An AI-driven Knowledge Graph of the Antiquities Market: Toward Automatised Methods to Identify Illicit Trafficking Networks

    The longstanding illicit trafficking of archaeological artefacts has persistently presented a global issue, posing a substantial threat to cultural heritage. This paper introduces an innovative automated system that utilises Natural Language Processing (NLP), Machine Learning (ML), and Social Network Analysis (SNA) to construct a Knowledge Graph for antiquities. The objective is to offer insights into the provenance of artefacts and to identify potential instances of illicit trafficking. The paper delineates a comprehensive methodology, from the ontology to the Knowledge Graph, comprising four distinct phases. The first phase involves tailoring existing ontologies to project-specific needs. The second phase centres on selecting appropriate technologies and designing scraping and text-mining tools to extract pertinent data from textual sources. The third phase centres on creating a robust and accurate Knowledge Graph that captures artefact provenance: the paper suggests employing NLP models, specifically Named Entity Recognition (NER) techniques, to automatically extract relevant information from unstructured provenance texts and organise it as events in which both objects and actors participated, together with their locations and dates. The final phase is concerned with defining and building the Knowledge Graph itself. The authors explore a property graph model that distinctly represents nodes and relationships, each augmented by associated properties. Employing an SNA approach, the model is projected onto multiple network levels: ownership histories (an actor-object network) or actor relationships (an actor-actor network). This approach reveals patterns within the antiquities market.
When integrated with the authors' recommended strategies, such as crowdsourced ontology definition, collaboration with reputable organisations for quality sources, and the application of transfer learning techniques, the suggested approach holds promising implications for the protection of cultural heritage.
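The projection from ownership histories (actor-object network) to actor relationships (actor-actor network) described in the abstract can be sketched in plain Python. This is a toy illustration with hypothetical actor and artefact names, not the authors' implementation, which uses a full property graph model:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical actor-object ownership edges: each pair records that an
# actor appears in the ownership history of an artefact.
ownership = [
    ("dealer_a", "vase_1"),
    ("collector_b", "vase_1"),
    ("dealer_a", "statue_2"),
    ("auction_c", "statue_2"),
]

def project_actor_network(edges):
    """Project the bipartite actor-object network onto actors:
    two actors are linked if they share at least one artefact,
    weighted by the number of shared artefacts."""
    by_object = defaultdict(set)
    for actor, obj in edges:
        by_object[obj].add(actor)
    weights = defaultdict(int)
    for actors in by_object.values():
        for a, b in combinations(sorted(actors), 2):
            weights[(a, b)] += 1
    return dict(weights)

print(project_actor_network(ownership))
# → {('collector_b', 'dealer_a'): 1, ('auction_c', 'dealer_a'): 1}
```

Repeated high-weight edges in the projected network are the kind of recurring co-ownership pattern an SNA analysis would surface for closer inspection.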

    User Interface of System for Plagiarism Detection

    This thesis describes the development of a web application that provides the graphical user interface of a plagiarism detection system. The aim of the thesis is the design and implementation of this web application. The implemented system provides plagiarism checking for PDF documents and for source code written in virtually any programming language. The system was also specifically designed to let users create collections of files and run plagiarism checks among the files uploaded to a collection. Text annotations, which are supported by PDF viewing programs, were used to highlight plagiarised portions of PDF documents; using them, plagiarised text can be highlighted in colour directly in the PDF. A similar kind of highlighting is also applied to source code. The thesis concludes with experiments performed on a larger collection of final theses written by students of the Department of Computer Science.
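The abstract does not detail the detection backend behind the interface, but a common baseline in such systems is shingle-based overlap scoring between document pairs. A minimal sketch (hypothetical function names; word-level 3-gram shingles assumed):

```python
def shingles(text, n=3):
    """Word-level n-gram shingles of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(doc_a, doc_b, n=3):
    """Jaccard overlap of shingle sets: a simple score for flagging
    candidate plagiarised pairs before highlighting matched spans."""
    a, b = shingles(doc_a, n), shingles(doc_b, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Pairs scoring above a threshold would then have their matching spans located and rendered, e.g. as coloured PDF text annotations as the thesis describes.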

    Index-based n-gram extraction from large document collections

    No full text

    Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora

    The amount of information available through the Internet has grown significantly over the last decade. This information can come from various sources, such as scientific experiments in particle acceleration, flight data recorded from a commercial aircraft, or sets of documents from a given domain such as medical articles, newspaper headlines, or social network content. Due to the volume of data that must be analysed, search engines need new tools that allow the user to obtain the desired information in a timely and accurate manner. One approach is the annotation of documents with their relevant expressions. The extraction of relevant expressions from natural language text documents can be accomplished with semantic, syntactic, or statistical techniques. Although the latter tend to be less accurate, they have the advantage of being language-independent. This investigation was performed in the context of LocalMaxs, a statistical and thus language-independent method capable of extracting relevant expressions from natural language corpora. However, due to the large volume of data involved, sequential implementations of such techniques have severe limitations in both execution time and memory space. This thesis proposes a distributed architecture and strategies for parallel implementations of statistical extraction of relevant expressions from large corpora. A methodology was developed for modelling and evaluating those strategies, based on empirical and theoretical approaches to estimating the statistical distribution of n-grams in natural language corpora. These approaches guided the design and the evaluation of the behaviour of LocalMaxs parallel and distributed implementations on cluster and cloud computing platforms.
The implementation alternatives were compared with regard to precision and recall and to performance metrics, namely execution time, parallel speedup, and sizeup. The performance results indicate almost linear speedup and sizeup over the range of large corpora sizes.
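A simplified sequential sketch of the LocalMaxs idea: compute a "glue" score for each n-gram (here the SCP measure, averaged over the n-gram's binary splits) and keep an n-gram as a relevant expression when its glue is a strict local maximum relative to its contained (n−1)-grams and containing (n+1)-grams. Probabilities are approximated with raw token counts; the thesis's actual contribution is the parallel and distributed realisation, which is not shown here:

```python
from collections import Counter

def scp_glue(ngram, counts, total):
    """SCP glue of an n-gram: p(w1..wn)^2 divided by the average
    product of probabilities over all binary splits of the n-gram."""
    p = counts[ngram] / total
    splits = [
        (counts[ngram[:i]] / total) * (counts[ngram[i:]] / total)
        for i in range(1, len(ngram))
    ]
    return p * p / (sum(splits) / len(splits))

def local_maxs(tokens, max_n=3):
    """Return n-grams (2 <= n <= max_n) whose glue strictly exceeds that
    of their sub- and super-n-grams (simplified strict-maximum form)."""
    counts = Counter()
    total = len(tokens)
    for n in range(1, max_n + 1):
        for i in range(total - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    glue = {g: scp_glue(g, counts, total) for g in counts if len(g) >= 2}
    relevant = []
    for g, gl in glue.items():
        ants = [glue[g[:-1]], glue[g[1:]]] if len(g) > 2 else []
        succ = [v for h, v in glue.items()
                if len(h) == len(g) + 1 and (h[:-1] == g or h[1:] == g)]
        if all(gl > a for a in ants) and all(gl > s for s in succ):
            relevant.append(g)
    return relevant
```

On a toy corpus such as `"new york is in new york state"`, the cohesive bigram `("new", "york")` scores higher glue than any trigram containing it and is kept as a relevant multiword expression.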