1,594 research outputs found

    Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers

    Get PDF
    With the advent of the modern pre-trained Transformers, the text preprocessing has started to be neglected and not specifically addressed in recent NLP literature. However, both from a linguistic and from a computer science point of view, we believe that even when using modern Transformers, text preprocessing can significantly impact on the performance of a classification model. We want to investigate and compare, through this study, how preprocessing impacts on the Text Classification (TC) performance of modern and traditional classification models. We report and discuss the preprocessing techniques found in the literature and their most recent variants or applications to address TC tasks in different domains. In order to assess how much the preprocessing affects classification performance, we apply the three top referenced preprocessing techniques (alone or in combination) to four publicly available datasets from different domains. Then, nine machine learning models – including modern Transformers – get the preprocessed text as input. The results presented show that an educated choice on the text preprocessing strategy to employ should be based on the task as well as on the model considered. Outcomes in this survey show that choosing the best preprocessing technique – in place of the worst – can significantly improve accuracy on the classification (up to 25%, as in the case of an XLNet on the IMDB dataset). In some cases, by means of a suitable preprocessing strategy, even a simple Naïve Bayes classifier proved to outperform (i.e., by 2% in accuracy) the best performing Transformer. We found that Transformers and traditional models exhibit a higher impact of the preprocessing on the TC performance. Our main findings are: (1) also on modern pre-trained language models, preprocessing can affect performance, depending on the datasets and on the preprocessing technique or combination of techniques used, (2) in some cases, using a proper preprocessing strategy, simple models can outperform Transformers on TC tasks, (3) similar classes of models exhibit similar level of sensitivity to text preprocessing

    AGI-P: A Gender Identification Framework for Authorship Analysis Using Customized Fine-Tuning of Multilingual Language Model

    Get PDF
    In this investigation, we propose a solution for the author’s gender identification task called AGI-P. This task has several real-world applications across different fields, such as marketing and advertising, forensic linguistics, sociology, recommendation systems, language processing, historical analysis, education, and language learning. We created a new dataset to evaluate our proposed method. The dataset is balanced in terms of gender using a random sampling method and consists of 1944 samples in total. We use accuracy as an evaluation measure and compare the performance of the proposed solution (AGI-P) against state-of-the-art machine learning classifiers and fine-tuned pre-trained multilingual language models such as DistilBERT, mBERT, XLM-RoBERTa, and Multilingual DEBERTa. In this regard, we also propose a customized fine-tuning strategy that improves the accuracy of the pre-trained language models for the author gender identification task. Our extensive experimental studies reveal that our solution (AGI-P) outperforms the well-known machine learning classifiers and fine-tuned pre-trained multilingual language models with an accuracy level of 92.03%. Moreover, the pre-trained multilingual language models, fine-tuned with the proposed customized strategy, outperform the fine-tuned pre-trained language models using an out-of-the-box fine-tuning strategy. The codebase and corpus can be accessed on our GitHub page at: https://github.com/mumairhassan/AGI-

    UMSL Bulletin 2023-2024

    Get PDF
    The 2023-2024 Bulletin and Course Catalog for the University of Missouri St. Louis.https://irl.umsl.edu/bulletin/1088/thumbnail.jp

    Sound Event Detection by Exploring Audio Sequence Modelling

    Get PDF
    Everyday sounds in real-world environments are a powerful source of information by which humans can interact with their environments. Humans can infer what is happening around them by listening to everyday sounds. At the same time, it is a challenging task for a computer algorithm in a smart device to automatically recognise, understand, and interpret everyday sounds. Sound event detection (SED) is the process of transcribing an audio recording into sound event tags with onset and offset time values. This involves classification and segmentation of sound events in the given audio recording. SED has numerous applications in everyday life which include security and surveillance, automation, healthcare monitoring, multimedia information retrieval, and assisted living technologies. SED is to everyday sounds what automatic speech recognition (ASR) is to speech and automatic music transcription (AMT) is to music. The fundamental questions in designing a sound recognition system are, which portion of a sound event should the system analyse, and what proportion of a sound event should the system process in order to claim a confident detection of that particular sound event. While the classification of sound events has improved a lot in recent years, it is considered that the temporal-segmentation of sound events has not improved in the same extent. The aim of this thesis is to propose and develop methods to improve the segmentation and classification of everyday sound events in SED models. In particular, this thesis explores the segmentation of sound events by investigating audio sequence encoding-based and audio sequence modelling-based methods, in an effort to improve the overall sound event detection performance. In the first phase of this thesis, efforts are put towards improving sound event detection by explicitly conditioning the audio sequence representations of an SED model using sound activity detection (SAD) and onset detection. To achieve this, we propose multi-task learning-based SED models in which SAD and onset detection are used as auxiliary tasks for the SED task. The next part of this thesis explores self-attention-based audio sequence modelling, which aggregates audio representations based on temporal relations within and between sound events, scored on the basis of the similarity of sound event portions in audio event sequences. We propose SED models that include memory-controlled, adaptive, dynamic, and source separation-induced self-attention variants, with the aim to improve overall sound recognition

    UMSL Bulletin 2022-2023

    Get PDF
    The 2022-2023 Bulletin and Course Catalog for the University of Missouri St. Louis.https://irl.umsl.edu/bulletin/1087/thumbnail.jp

    VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models

    Full text link
    The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. 300 literary essays have been included, and there are over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and contrasted their performance with that of Vietnamese students to see how well they performed. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have space to grow, though, especially in the areas of mathematics, physics, chemistry, and biology. The VNHSGE dataset seeks to provide an adequate benchmark for assessing the abilities of LLMs with its wide-ranging coverage and variety of activities. We intend to promote future developments in the creation of LLMs by making this dataset available to the scientific community, especially in resolving LLMs' limits in disciplines involving mathematics and the natural sciences.Comment: 74 pages, 44 figure

    Evaluating automated and hybrid neural disambiguation for African historical named entities

    Get PDF
    Documents detailing South African history contain ambiguous names. Ambiguous names may be due to people having the same name or the same person being referred to by multiple different names. Thus when searching for or attempting to extract information about a particular person, the name used may affect the results. This problem may be alleviated by using a Named Entity Disambiguation (NED) system to disambiguate names by linking them to a knowledge base. In recent years, transformer-based language models have led to improvements in NED systems. Furthermore, multilingual language models have shown the ability to learn concepts across languages, reducing the amount of training data required in low-resource languages. Thus a multilingual language model-based NED system was developed to disambiguate people's names within a historical South African context using documents written in English and isiZulu from the 500 Year Archive (FHYA). The multilingual language model-based system substantially improved on a probability-based baseline and achieved a micro F1-score of 0.726. At the same time, the entity linking component was able to link 81.9% of the mentions to the correct entity. However, the system's performance on documents written in isiZulu was significantly lower than on the documents written in English. Thus the system was augmented with handcrafted rules to improve its performance. The addition of handcrafted rules resulted in a small but significant improvement in performance when compared to the unaugmented NED system

    Stress detection in lifelog data for improved personalized lifelog retrieval system

    Get PDF
    Stress can be categorized into acute and chronic types, with acute stress having short-term positive effects in managing hazardous situations, while chronic stress can adversely impact mental health. In a biological context, stress elicits a physiological response indicative of the fight-or-flight mechanism, accompanied by measurable changes in physiological signals such as blood volume pulse (BVP), galvanic skin response (GSR), and skin temperature (TEMP). While clinical-grade devices have traditionally been used to measure these signals, recent advancements in sensor technology enable their capture using consumer-grade wearable devices, providing opportunities for research in acute stress detection. Despite these advancements, there has been limited focus on utilizing low-resolution data obtained from sensor technology for early stress detection and evaluating stress detection models under real-world conditions. Moreover, the potential of physiological signals to infer mental stress information remains largely unexplored in lifelog retrieval systems. This thesis addresses these gaps through empirical investigations and explores the potential of utilizing physiological signals for stress detection and their integration within the state-of-the-art (SOTA) lifelog retrieval system. The main contributions of this thesis are as follows. Firstly, statistical analyses are conducted to investigate the feasibility of using low-resolution data for stress detection and emphasize the superiority of subject-dependent models over subject-independent models, thereby proposing the optimal approach to training stress detection models with low-resolution data. Secondly, longitudinal stress lifelog data is collected to evaluate stress detection models in real-world settings. It is proposed that training lifelog models on physiological signals in real-world settings is crucial to avoid detection inaccuracies caused by differences between laboratory and free-living conditions. Finally, a state-of-the-art lifelog interactive retrieval system called \lifeseeker is developed, incorporating the stress-moment filter function. Experimental results demonstrate that integrating this function improves the overall performance of the system in both interactive and non-interactive modes. In summary, this thesis contributes to the understanding of stress detection applied in real-world settings and showcases the potential of integrating stress information for enhancing personalized lifelog retrieval system performance

    Predicate Matrix: an interoperable lexical knowledge base for predicates

    Get PDF
    183 p.La Matriz de Predicados (Predicate Matrix en inglés) es un nuevo recurso léxico-semántico resultado de la integración de múltiples fuentes de conocimiento, entre las cuales se encuentran FrameNet, VerbNet, PropBank y WordNet. La Matriz de Predicados proporciona un léxico extenso y robusto que permite mejorar la interoperabilidad entre los recursos semánticos mencionados anteriormente. La creación de la Matriz de Predicados se basa en la integración de Semlink y nuevos mappings obtenidos utilizando métodos automáticos que enlazan el conocimiento semántico a nivel léxico y de roles. Asimismo, hemos ampliado la Predicate Matrix para cubrir los predicados nominales (inglés, español) y predicados en otros idiomas (castellano, catalán y vasco). Como resultado, la Matriz de predicados proporciona un léxico multilingüe que permite el análisis semántico interoperable en múltiples idiomas
    corecore