
    Attention is Not Always What You Need: Towards Efficient Classification of Domain-Specific Text

    For large-scale IT corpora with hundreds of classes organized in a hierarchy, accurately classifying the classes at the higher levels of the hierarchy is crucial to avoid errors propagating to the lower levels. In the business world, an efficient and explainable ML model is preferred over an expensive black-box model, especially if the performance gain is marginal. A current trend in the Natural Language Processing (NLP) community is to employ huge pre-trained language models (PLMs), also known as self-attention models (e.g., BERT), for almost any NLP task (e.g., question answering, sentiment analysis, text classification). Despite the widespread use of PLMs and their impressive performance on a broad range of NLP tasks, there is no clear and well-justified rationale for employing these models on domain-specific text classification (TC) tasks: the monosemic nature of the specialized words (i.e., jargon) found in domain-specific text renders contextualized embeddings (e.g., PLMs) largely superfluous. In this paper, we compare the accuracies of state-of-the-art (SOTA) models reported in the literature against a Linear SVM classifier with TF-IDF vectorization on three TC datasets. Results show comparable performance for the Linear SVM. The findings of this study show that for domain-specific TC tasks, a linear model can provide a comparable, cheap, reproducible, and interpretable alternative to attention-based models.
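    A minimal sketch of the kind of baseline the paper advocates, assuming scikit-learn; the mini-corpus, labels, and vectorizer settings (ngram_range, sublinear_tf) are illustrative placeholders, not the paper's datasets or hyperparameters:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Hypothetical mini-corpus standing in for a domain-specific ticket dataset.
    texts = [
        "vpn connection drops after login",
        "cannot reach internal network share",
        "printer driver installation fails",
        "laptop battery drains overnight",
    ]
    labels = ["network", "network", "hardware", "hardware"]

    # TF-IDF features feeding a Linear SVM: the cheap, interpretable alternative.
    clf = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
        ("svm", LinearSVC()),
    ])
    clf.fit(texts, labels)
    print(clf.predict(["vpn keeps disconnecting after login"]))  # likely ['network'], given the shared vocabulary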

    Mapping layperson medical terminology into the Human Phenotype Ontology using neural machine translation models

    Supplementary material related to this article can be found online at https://doi.org/10.1016/j.eswa.2022.117446.

    In the medical domain there exists a terminological gap between patients and caregivers on the one hand and healthcare professionals on the other. This gap may hinder successful communication between healthcare consumers and professionals in the field, with negative emotional and clinical consequences. In this work, we build a machine learning-based tool for the automatic translation between the terminology used by laypeople and that of the Human Phenotype Ontology (HPO). HPO is a structured vocabulary of phenotypic abnormalities found in human disease. Our method uses a vector space to represent an HPO-specific embedding as the output space for a neural network model trained on vector representations of layperson versions and other textual descriptors of medical terms. We explored different output embeddings coupled with different neural network architectures for the machine translation stage. We compute a similarity measure to evaluate the ability of the model to assign an HPO term to a layperson input. The best-performing models achieved a similarity higher than 0.7 for more than 80% of the terms, with a median between 0.98 and 1. The translator model is made available in a web application at https://hpotranslator.b2slab.upc.edu.

    This work was supported by the Spanish Ministry of Economy and Competitiveness (www.mineco.gob.es) TEC2014-60337-R, DPI2017-89827-R, the Networking Biomedical Research Centre in the subject area of Bioengineering, Biomaterials and Nanomedicine (CIBER-BBN), initiatives of the Instituto de Investigación Carlos III (ISCIII), and the Share4Rare project (Grant Agreement 780262). This work was partially funded by ACCIÓ (Innotec ACE014/20/000018). B2SLab is certified as 2017 SGR 952. The authors thank the NVIDIA Corporation for the donation of a Titan Xp GPU used to run the models presented in this article. J. Fonollosa acknowledges the support from the Serra Húnter program.
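    A simplified, hypothetical sketch of the translation stage described above, assuming scikit-learn and NumPy with random stand-in vectors in place of the paper's embeddings and architectures: a regressor maps a layperson-phrase vector into the HPO embedding space, and the nearest HPO vector by cosine similarity gives the predicted term.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics.pairwise import cosine_similarity

    rng = np.random.default_rng(0)
    n_terms, in_dim, out_dim = 50, 100, 64

    # Stand-in vectors: layperson-phrase representations (inputs) and the
    # HPO-term embeddings that define the output space.
    lay_vecs = rng.normal(size=(n_terms, in_dim))
    hpo_vecs = rng.normal(size=(n_terms, out_dim))

    # Small neural network regressing input vectors onto HPO embeddings.
    model = MLPRegressor(hidden_layer_sizes=(128,), max_iter=2000, random_state=0)
    model.fit(lay_vecs, hpo_vecs)

    # "Translate" a layperson input: project it, then pick the most similar
    # HPO term by cosine similarity (the evaluation measure used in the paper).
    projected = model.predict(lay_vecs[:1])
    sims = cosine_similarity(projected, hpo_vecs)[0]
    best = int(np.argmax(sims))
    print(f"nearest HPO term index: {best}, similarity: {sims[best]:.2f}")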

    A Hybrid Continual Machine Learning Model for Efficient Hierarchical Classification of Domain-Specific Text in the Presence of Class Overlap (Case Study: IT Support Tickets)

    In today’s world, support ticketing systems are employed by a wide range of businesses. A ticketing system facilitates the interaction between customers and support teams when a customer faces an issue with a product or a service. For large-scale IT companies with many clients and a great volume of communications, automating the classification of incoming tickets is key to retaining long-term clients and ensuring business growth. Although the problem of text classification has been widely studied in the literature, the majority of proposed approaches revolve around state-of-the-art deep learning models. This thesis addresses the following research questions: What are the reasons behind employing black-box models (i.e., deep learning models) for text classification tasks? What is the level of polysemy (i.e., the coexistence of many possible meanings for a word or phrase) in technical (i.e., specialized) text? How do static word embeddings such as Word2vec fare against traditional TF-IDF vectorization? How do dynamic word embeddings (e.g., PLMs) compare against a linear classifier such as a Support Vector Machine (SVM) for classifying domain-specific text? This integrated-article thesis investigates these issues through five empirical studies conducted over the past four years. Our studies converge on an emerging theory that explains why traditional ML models offer a more efficient solution for domain-specific text classification than state-of-the-art DL language models (i.e., PLMs). Based on extensive experiments on a real-world dataset, we propose a novel Hybrid Online Offline Model (HOOM) that can efficiently classify IT support tickets in a real-time (i.e., dynamic) environment. Our classification model is anticipated to build trust and confidence when deployed into production, as the model is interpretable and efficient, and can detect concept drift in the data.
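    The abstract does not detail HOOM's internals; as a loose sketch of the hybrid online/offline idea under assumed components, the snippet below pairs a stateless hashing vectorizer with an incrementally updated SGD classifier and treats a drop in rolling accuracy as a crude concept-drift signal (the 0.7 threshold and label set are hypothetical).

    from collections import deque
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2**18)  # stateless, suits streaming use
    online_clf = SGDClassifier()                      # linear model supporting partial_fit
    classes = ["network", "hardware"]                 # hypothetical label set
    window = deque(maxlen=100)                        # rolling window of correctness flags

    def process_ticket(text, true_label):
        """Predict, track rolling accuracy, then update the online model."""
        X = vectorizer.transform([text])
        if hasattr(online_clf, "coef_"):  # skip prediction before the first update
            window.append(online_clf.predict(X)[0] == true_label)
            if len(window) == window.maxlen and sum(window) / len(window) < 0.7:
                print("possible concept drift: trigger offline retraining")
        online_clf.partial_fit(X, [true_label], classes=classes)

    process_ticket("vpn connection drops after login", "network")
    process_ticket("printer driver installation fails", "hardware")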