
    An energy-based comparative analysis of common approaches to text classification in the Legal domain

    Most Machine Learning research evaluates the best solutions in terms of performance. However, in the race for the best-performing model, many important aspects are often overlooked when, on the contrary, they should be carefully considered. In fact, the gaps in performance between different approaches are sometimes negligible, whereas factors such as production costs, energy consumption, and carbon footprint must be taken into consideration. Large Language Models (LLMs) are extensively adopted to address NLP problems in academia and industry. In this work, we present a detailed quantitative comparison of LLMs and traditional approaches (e.g., SVM) on the LexGLUE benchmark, which takes into account both performance (standard indices) and alternative metrics such as timing, power consumption, and cost: in a word, the carbon footprint. In our analysis, we considered the prototyping phase (model selection by training-validation-test iterations) and the in-production phase separately, since they follow different implementation procedures and also require different resources. The results indicate that, very often, the simplest algorithms achieve performance very close to that of large LLMs but with very low power consumption and lower resource demands. These results suggest that companies should include such additional evaluations when choosing Machine Learning (ML) solutions.
    Comment: Accepted at The 4th International Conference on NLP & Text Mining (NLTM 2024), January 27-28 2024, Copenhagen, Denmark. 12 pages, 1 figure, 7 tables.
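    As one way to make the "alternative metrics" above concrete, here is a minimal sketch of how energy and carbon figures can be logged for a simple SVM baseline during training. The codecarbon package is an assumption of this sketch, not necessarily the tooling used by the authors:

```python
# Sketch: measuring energy/CO2 of a baseline classifier with codecarbon.
# Assumption: the paper's exact instrumentation is not specified here;
# codecarbon's EmissionsTracker is one common way to obtain such figures.
from codecarbon import EmissionsTracker
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def train_with_energy_log(texts, labels):
    tracker = EmissionsTracker(project_name="svm_baseline")
    tracker.start()
    X = TfidfVectorizer(max_features=50_000).fit_transform(texts)
    clf = LinearSVC().fit(X, labels)
    emissions_kg = tracker.stop()  # estimated kg CO2-eq for this run
    print(f"Estimated emissions: {emissions_kg:.6f} kg CO2-eq")
    return clf
```

    The same tracker can wrap a transformer fine-tuning run, so that prototyping and in-production costs can be compared on equal terms.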

    Text Classification in an Under-Resourced Language via Lexical Normalization and Feature Pooling

    Automatic classification of textual content in an under-resourced language is challenging, since lexical resources and preprocessing tools are not available for such languages. The bag-of-words (BoW) representation of such text is usually highly sparse and noisy, and text classification built on it yields poor performance. In this paper, we explore the effectiveness of lexical normalization of terms and statistical feature pooling for improving text classification in an under-resourced language. We focus on classifying citizen feedback on government services provided through SMS texts, which are written predominantly in Roman Urdu (an informal forward-transliterated version of the Urdu language). Our proposed methodology normalizes lexical variations of terms using phonetic and string similarity. It subsequently employs a supervised feature extraction technique to obtain category-specific, highly discriminating features. Our experiments with several classifiers reveal that lexical normalization plus feature pooling achieves a significant improvement in classification performance over standard representations.
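    A minimal sketch of the kind of phonetic-plus-string-similarity normalization described above: variants that share a phonetic code and lie within a small edit distance are mapped to one canonical spelling. The soundex/edit-distance combination, the threshold, and the jellyfish library are illustrative assumptions, not the paper's exact algorithm:

```python
# Sketch: grouping Roman Urdu spelling variants by phonetic code plus
# edit distance, then mapping each variant to one canonical form.
# Thresholds and similarity functions are illustrative assumptions.
import jellyfish

def normalize_vocabulary(vocab, max_edit_distance=2):
    canonical = {}   # variant -> canonical term
    buckets = {}     # phonetic code -> terms already seen
    for term in sorted(vocab, key=len):
        code = jellyfish.soundex(term)
        match = None
        for seen in buckets.get(code, []):
            if jellyfish.levenshtein_distance(term, seen) <= max_edit_distance:
                match = seen
                break
        if match is None:
            buckets.setdefault(code, []).append(term)
            canonical[term] = term
        else:
            canonical[term] = canonical[match]
    return canonical

# e.g. normalize_vocabulary({"bijli", "bijlee", "bijly"}) maps all
# three variants of the same word onto a single canonical spelling,
# densifying the BoW representation before feature pooling.
```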

    Detecting explicit lyrics: a case study in Italian music

    Preventing the reproduction of songs whose textual content is offensive or inappropriate for kids is an important issue in the music industry. In this paper, we investigate the problem of assessing whether music lyrics contain content unsuitable for children (a.k.a., explicit content). Previous works that have computationally tackled this problem have dealt with English or Korean songs, comparing the performance of various machine learning approaches. We investigate the automatic detection of explicit lyrics for Italian songs, complementing previous analyses performed on different languages. We assess the performance of many classifiers, including some, not fully exploited so far for this task, that leverage neural language models, i.e., rich language representations built from textual corpora in an unsupervised way that can be fine-tuned on various natural language processing tasks, including text classification. For the comparison of the different systems, we exploit a novel dataset we contribute, consisting of approximately 34K songs annotated with labels indicating explicit content. The evaluation shows that, on this dataset, most of the classifiers built on top of neural language models perform substantially better than non-neural approaches. We also provide further analyses, including: a qualitative assessment of the predictions produced by the classifiers, an assessment of the performance of the best-performing classifier in a few-shot learning scenario, and an analysis of the impact of dataset balancing.
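    For illustration, a minimal sketch of fine-tuning a pretrained Italian language model for this binary task with Hugging Face transformers. The checkpoint name, hyperparameters, and dataset wiring are assumptions of this sketch, not the authors' setup:

```python
# Sketch: fine-tuning a pretrained Italian language model for binary
# explicit/non-explicit lyrics classification. The checkpoint and
# hyperparameters are assumptions, not necessarily those of the paper.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "dbmdz/bert-base-italian-uncased"  # one public Italian BERT

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def tokenize(batch):
    # Lyrics can be long; truncate to the model's maximum input length.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# train_ds / eval_ds: datasets with "text" and "label" columns, e.g.
# built from an annotated lyrics corpus such as the one described above.
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="explicit-lyrics", num_train_epochs=3),
#     train_dataset=train_ds.map(tokenize, batched=True),
#     eval_dataset=eval_ds.map(tokenize, batched=True),
# )
# trainer.train()
```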

    Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text Classification

    The introduction of hierarchical thesauri (HTs) that contain significant semantic information has led researchers to investigate their potential for improving the text classification task, extending the traditional “bag of words” representation by incorporating syntactic and semantic relationships among words. In this paper we address this problem by proposing a Word Sense Disambiguation (WSD) approach based on the intuition that word proximity in the document implies proximity in the HT graph as well. We argue that the high precision exhibited by our WSD algorithm on various human-disambiguated benchmark datasets makes it appropriate for the classification task. Moreover, we define a semantic kernel, based on the general concept of GVSM kernels, that captures the semantic relations contained in the hierarchical thesaurus. Finally, we conduct experiments on various corpora, achieving a systematic improvement in classification accuracy with the SVM algorithm, especially when the training set is small.
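    A minimal sketch of a GVSM-style semantic kernel plugged into an SVM, assuming a term-by-concept matrix P derived from the thesaurus is already available. How P is built from the HT, and the exact kernel, are the paper's contribution and are not reproduced here:

```python
# Sketch: a GVSM-style semantic kernel K(d1, d2) = d1 P P^T d2^T, where
# P is a term-by-concept matrix derived from the hierarchical thesaurus
# (here simply an input). Illustrative, not the authors' exact kernel.
import numpy as np
from sklearn.svm import SVC

def semantic_kernel(X, P):
    """X: (n_docs, n_terms) BoW matrix; P: (n_terms, n_concepts)."""
    Z = X @ P        # project documents into concept space
    return Z @ Z.T   # pairwise document similarities

# Train an SVM on the precomputed semantic kernel:
# K_train = semantic_kernel(X_train, P)
# clf = SVC(kernel="precomputed").fit(K_train, y_train)
# K_test = (X_test @ P) @ (X_train @ P).T
# clf.predict(K_test)
```

    With the plain identity P = I this reduces to the standard linear BoW kernel, which is why thesaurus-derived concept relations can only add semantic structure on top of the baseline representation.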

    Improving the Automatic Text Classification Algorithm of Siav, a Case Study

    Siav is an IT services company that provides products for electronic document management, workflow management, and the preservation of digital documents. One of its projects is to create an automatic text classifier suitable for use in business contexts. The primary aim of this thesis is to improve the accuracy and confidence reliability of the current text classifier using neural networks. To accomplish these goals, the baseline implementation is analysed, and a number of approaches from linguistic processing and neural networks are proposed to address limitations in the current technology. The proposed techniques are then implemented, and their performance is compared against the existing metrics. Finally, observations are made regarding the proposed solution and its suitability for business use compared to the existing one.