
    Fine Tuning Named Entity Extraction Models for the Fantasy Domain

    Named Entity Recognition (NER) is a sequence classification Natural Language Processing task where entities are identified in the text and classified into predefined categories. It acts as a foundation for most information extraction systems. Dungeons and Dragons (D&D) is an open-ended tabletop fantasy game with its own diverse lore. D&D entities are domain-specific and are thus not recognized by even state-of-the-art off-the-shelf NER systems, which are trained on general data for predefined categories such as person (PERS), location (LOC), organization (ORG), and miscellaneous (MISC). For meaningful extraction of information from fantasy text, the entities need to be classified into domain-specific entity categories, and the models need to be fine-tuned on a domain-relevant corpus. This work uses available lore of monsters in the D&D domain to fine-tune Trankit, which is a prolific NER framework that uses a pre-trained model for NER. After this training, the system acquires the ability to extract monster names from relevant domain documents under a novel NER tag. This work compares the accuracy of the monster name identification against the zero-shot Trankit model and two FLAIR models. The fine-tuned Trankit model achieves an 87.86% F1 score, surpassing all the other considered models.

    Exploiting Large Language Models to Train Automatic Detectors of Sensitive Data

    This thesis proposes an automated system designed to identify sensitive data within text documents, aligning with the definitions and regulations outlined in the General Data Protection Regulation (GDPR). It reviews the current state of the art in Personally Identifiable Information (PII) and sensitive data detection, and how machine learning models for Natural Language Processing (NLP) are tailored to perform these tasks. A critical challenge addressed in this work is the acquisition of suitable datasets for training and evaluating the proposed system. To overcome this obstacle, we explore the use of Large Language Models (LLMs) to generate synthetic datasets, which serve as a valuable resource for training classification models. Both proprietary and open-source LLMs are used, in order to investigate the capabilities of local models in document generation. The thesis then presents a comprehensive framework for sensitive data detection, covering six key domains and proposing specific criteria to identify the disclosure of sensitive data, which take into account the context and the domain relevance. To detect sensitive data, a variety of models are explored, mainly based on the Transformer architecture (Bidirectional Encoder Representations from Transformers (BERT)), adapted to the tasks of text classification and Named Entity Recognition (NER). The thesis evaluates the performance of the models using fine-grained metrics and shows that the NER model achieves the best results (90% score) when trained interchangeably on both datasets, which also confirms the quality of the dataset generated with the open-source LLM.

    Exploratory Search on Mobile Devices

    The goal of this thesis is to provide a general framework (MobEx) for exploratory search, especially on mobile devices. The central part is the design, implementation, and evaluation of several core modules for on-demand unsupervised information extraction that are well suited for exploratory search on mobile devices and that together make up the MobEx framework. These core processing elements, combined with a multitouch-able user interface specially designed for two families of mobile devices, i.e. smartphones and tablets, have been implemented in a research prototype. The initial information request, in the form of a query topic description, is issued online by a user to the system. The system then retrieves web snippets by using standard search engines. These snippets are passed through a chain of NLP components which perform on-demand or ad-hoc interactive Query Disambiguation, Named Entity Recognition, and Relation Extraction tasks. By on-demand or ad-hoc we mean that the components are capable of performing their operations on an unrestricted open domain within special time constraints. The result of the whole process is a topic graph containing the detected associated topics as nodes and the extracted relationships as labelled edges between the nodes. The topic graph is presented to the user in different ways depending on the size of the device she is using. Various evaluations have been conducted that help us to understand the potential and limitations of the framework and the prototype.