
    An energy-based comparative analysis of common approaches to text classification in the Legal domain

    Most Machine Learning research evaluates the best solutions in terms of performance. However, in the race for the best-performing model, many important aspects are often overlooked when, on the contrary, they should be carefully considered. In fact, the gaps in performance between different approaches are sometimes negligible, whereas factors such as production costs, energy consumption, and carbon footprint must be taken into consideration. Large Language Models (LLMs) are extensively adopted to address NLP problems in academia and industry. In this work, we present a detailed quantitative comparison of LLMs and traditional approaches (e.g., SVM) on the LexGLUE benchmark, which takes into account both performance (standard indices) and alternative metrics such as timing, power consumption, and cost: in a word, the carbon footprint. In our analysis, we considered the prototyping phase (model selection by training-validation-test iterations) and the in-production phase separately, since they follow different implementation procedures and also require different resources. The results indicate that, very often, the simplest algorithms achieve performance very close to that of large LLMs but with very low power consumption and lower resource demands. These results suggest that companies should include such additional evaluations when choosing Machine Learning (ML) solutions.
    Comment: Accepted at The 4th International Conference on NLP & Text Mining (NLTM 2024), January 27-28 2024, Copenhagen, Denmark. 12 pages, 1 figure, 7 tables.
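    As one way to make the "alternative metrics" above concrete, here is a minimal sketch of how energy and carbon figures can be logged for a simple SVM baseline during training. The codecarbon package is an assumption of this sketch, not necessarily the tooling used by the authors:

```python
# Sketch: measuring energy/CO2 of a baseline classifier with codecarbon.
# Assumption: the paper's exact instrumentation is not specified here;
# codecarbon's EmissionsTracker is one common way to obtain such figures.
from codecarbon import EmissionsTracker
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def train_with_energy_log(texts, labels):
    tracker = EmissionsTracker(project_name="svm_baseline")
    tracker.start()
    X = TfidfVectorizer(max_features=50_000).fit_transform(texts)
    clf = LinearSVC().fit(X, labels)
    emissions_kg = tracker.stop()  # estimated kg CO2-eq for this run
    print(f"Estimated emissions: {emissions_kg:.6f} kg CO2-eq")
    return clf
```

    The same tracker can wrap a transformer fine-tuning run, so that prototyping and in-production costs can be compared on equal terms.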

    Text Classification in an Under-Resourced Language via Lexical Normalization and Feature Pooling

    Automatic classification of textual content in an under-resourced language is challenging, since lexical resources and preprocessing tools are not available for such languages. The bag-of-words (BoW) representation of such text is usually highly sparse and noisy, and text classification built on it yields poor performance. In this paper, we explore the effectiveness of lexical normalization of terms and statistical feature pooling for improving text classification in an under-resourced language. We focus on classifying citizen feedback on government services provided through SMS texts, which are written predominantly in Roman Urdu (an informal forward-transliterated version of the Urdu language). Our proposed methodology normalizes lexical variations of terms using phonetic and string similarity. It subsequently employs a supervised feature extraction technique to obtain category-specific, highly discriminating features. Our experiments with several classifiers reveal that lexical normalization plus feature pooling achieves a significant improvement in classification performance over standard representations.
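    A minimal sketch of the kind of phonetic-plus-string-similarity normalization described above: variants that share a phonetic code and lie within a small edit distance are mapped to one canonical spelling. The soundex/edit-distance combination, the threshold, and the jellyfish library are illustrative assumptions, not the paper's exact algorithm:

```python
# Sketch: grouping Roman Urdu spelling variants by phonetic code plus
# edit distance, then mapping each variant to one canonical form.
# Thresholds and similarity functions are illustrative assumptions.
import jellyfish

def normalize_vocabulary(vocab, max_edit_distance=2):
    canonical = {}   # variant -> canonical term
    buckets = {}     # phonetic code -> terms already seen
    for term in sorted(vocab, key=len):
        code = jellyfish.soundex(term)
        match = None
        for seen in buckets.get(code, []):
            if jellyfish.levenshtein_distance(term, seen) <= max_edit_distance:
                match = seen
                break
        if match is None:
            buckets.setdefault(code, []).append(term)
            canonical[term] = term
        else:
            canonical[term] = canonical[match]
    return canonical

# e.g. normalize_vocabulary({"bijli", "bijlee", "bijly"}) maps all
# three variants of the same word onto a single canonical spelling,
# densifying the BoW representation before feature pooling.
```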

    Detecting explicit lyrics: a case study in Italian music

    Preventing the reproduction of songs whose textual content is offensive or inappropriate for kids is an important issue in the music industry. In this paper, we investigate the problem of assessing whether music lyrics contain content unsuitable for children (a.k.a., explicit content). Previous works that have computationally tackled this problem have dealt with English or Korean songs, comparing the performance of various machine learning approaches. We investigate the automatic detection of explicit lyrics for Italian songs, complementing previous analyses performed on different languages. We assess the performance of many classifiers, including some, not fully exploited so far for this task, that leverage neural language models, i.e., rich language representations built from textual corpora in an unsupervised way that can be fine-tuned on various natural language processing tasks, including text classification. For the comparison of the different systems, we exploit a novel dataset we contribute, consisting of approximately 34K songs annotated with labels indicating explicit content. The evaluation shows that, on this dataset, most of the classifiers built on top of neural language models perform substantially better than non-neural approaches. We also provide further analyses, including: a qualitative assessment of the predictions produced by the classifiers, an assessment of the performance of the best-performing classifier in a few-shot learning scenario, and an analysis of the impact of dataset balancing.
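    For illustration, a minimal sketch of fine-tuning a pretrained Italian language model for this binary task with Hugging Face transformers. The checkpoint name, hyperparameters, and dataset wiring are assumptions of this sketch, not the authors' setup:

```python
# Sketch: fine-tuning a pretrained Italian language model for binary
# explicit/non-explicit lyrics classification. The checkpoint and
# hyperparameters are assumptions, not necessarily those of the paper.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "dbmdz/bert-base-italian-uncased"  # one public Italian BERT

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def tokenize(batch):
    # Lyrics can be long; truncate to the model's maximum input length.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# train_ds / eval_ds: datasets with "text" and "label" columns, e.g.
# built from an annotated lyrics corpus such as the one described above.
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="explicit-lyrics", num_train_epochs=3),
#     train_dataset=train_ds.map(tokenize, batched=True),
#     eval_dataset=eval_ds.map(tokenize, batched=True),
# )
# trainer.train()
```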

    Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text Classification

    The introduction of hierarchical thesauri (HTs) that contain significant semantic information has led researchers to investigate their potential for improving the text classification task, extending the traditional “bag of words” representation by incorporating syntactic and semantic relationships among words. In this paper we address this problem by proposing a Word Sense Disambiguation (WSD) approach based on the intuition that word proximity in the document implies proximity in the HT graph as well. We argue that the high precision exhibited by our WSD algorithm on various human-disambiguated benchmark datasets makes it appropriate for the classification task. Moreover, we define a semantic kernel, based on the general concept of GVSM kernels, that captures the semantic relations contained in the hierarchical thesaurus. Finally, we conduct experiments on various corpora, achieving a systematic improvement in classification accuracy with the SVM algorithm, especially when the training set is small.
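    A minimal sketch of a GVSM-style semantic kernel plugged into an SVM, assuming a term-by-concept matrix P derived from the thesaurus is already available. How P is built from the HT, and the exact kernel, are the paper's contribution and are not reproduced here:

```python
# Sketch: a GVSM-style semantic kernel K(d1, d2) = d1 P P^T d2^T, where
# P is a term-by-concept matrix derived from the hierarchical thesaurus
# (here simply an input). Illustrative, not the authors' exact kernel.
import numpy as np
from sklearn.svm import SVC

def semantic_kernel(X, P):
    """X: (n_docs, n_terms) BoW matrix; P: (n_terms, n_concepts)."""
    Z = X @ P        # project documents into concept space
    return Z @ Z.T   # pairwise document similarities

# Train an SVM on the precomputed semantic kernel:
# K_train = semantic_kernel(X_train, P)
# clf = SVC(kernel="precomputed").fit(K_train, y_train)
# K_test = (X_test @ P) @ (X_train @ P).T
# clf.predict(K_test)
```

    With the plain identity P = I this reduces to the standard linear BoW kernel, which is why thesaurus-derived concept relations can only add semantic structure on top of the baseline representation.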

    Improving the Automatic Text Classification Algorithm of Siav, a Case Study

    Siav is an IT services company that provides products for electronic document management, workflow management, and the preservation of digital documents. One of its projects is to create an automatic text classifier suitable for use in business contexts. The primary aim of this thesis is to improve the accuracy and confidence reliability of the current text classifier using neural networks. To accomplish these goals, the baseline implementation is analysed, and a number of approaches from linguistic processing and neural networks are proposed to address limitations in the current technology. The proposed techniques are then implemented, and their performance is compared against the existing metrics. Finally, observations are made regarding the proposed solution and its suitability for business use compared to the existing one.