Search CORE

15 research outputs found

Review of Extreme Multilabel Classification

Author: Das Shrutimoy
Dasgupta Arpan
Katyan Siddhant
Kumar Pawan
Publication venue
Publication date: 26/03/2023
Field of study

Extreme multilabel classification or XML, is an active area of interest in machine learning. Compared to traditional multilabel classification, here the number of labels is extremely large, hence, the name extreme multilabel classification. Using classical one versus all classification wont scale in this case due to large number of labels, same is true for any other classifiers. Embedding of labels as well as features into smaller label space is an essential first step. Moreover, other issues include existence of head and tail labels, where tail labels are labels which exist in relatively smaller number of given samples. The existence of tail labels creates issues during embedding. This area has invited application of wide range of approaches ranging from bit compression motivated from compressed sensing, tree based embeddings, deep learning based latent space embedding including using attention weights, linear algebra based embeddings such as SVD, clustering, hashing, to name a few. The community has come up with a useful set of metrics to identify correctly the prediction for head or tail labels.Comment: 46 pages, 13 figure

arXiv.org e-Print Archive

The Emerging Trends of Multi-Label Learning

Author: Liu Weiwei
Shen Xiaobo
Tsang Ivor W.
Wang Haobo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 17/11/2021
Field of study

Exabytes of data are generated daily by humans, leading to the growing need for new efforts in dealing with the grand challenges for multi-label learning brought by big data. For example, extreme multi-label classification is an active and rapidly growing research area that deals with classification tasks with an extremely large number of classes or labels; utilizing massive data with limited supervision to build a multi-label classification model becomes valuable for practical applications, etc. Besides these, there are tremendous efforts on how to harvest the strong learning capability of deep learning to better capture the label dependencies in multi-label learning, which is the key for deep learning to address real-world classification tasks. However, it is noted that there has been a lack of systemic studies that focus explicitly on analyzing the emerging trends and new challenges of multi-label learning in the era of big data. It is imperative to call for a comprehensive survey to fulfill this mission and delineate future research directions and new applications.Comment: Accepted to TPAMI 202

arXiv.org e-Print Archive

OPUS - University of Technology Sydney

Light-weight Deep Extreme Multilabel Classification

Author: Dasgupta Arpan
Jawanpuria Pratik
Kumar Pawan
Mishra Bamdev
Mishra Istasis
Publication venue
Publication date: 20/04/2023
Field of study

Extreme multi-label (XML) classification refers to the task of supervised multi-label learning that involves a large number of labels. Hence, scalability of the classifier with increasing label dimension is an important consideration. In this paper, we develop a method called LightDXML which modifies the recently developed deep learning based XML framework by using label embeddings instead of feature embedding for negative sampling and iterating cyclically through three major phases: (1) proxy training of label embeddings (2) shortlisting of labels for negative sampling and (3) final classifier training using the negative samples. Consequently, LightDXML also removes the requirement of a re-ranker module, thereby, leading to further savings on time and memory requirements. The proposed method achieves the best of both worlds: while the training time, model size and prediction times are on par or better compared to the tree-based methods, it attains much better prediction accuracy that is on par with the deep learning based methods. Moreover, the proposed approach achieves the best tail-label prediction accuracy over most state-of-the-art XML methods on some of the large datasets\footnote{accepted in IJCNN 2023, partial funding from MAPG grant and IIIT Seed grant at IIIT, Hyderabad, India. Code: \url{https://github.com/misterpawan/LightDXML}Comment: 9 pages, 2 figures, 5 table

arXiv.org e-Print Archive

Clasificación multi-etiqueta de textos de licitaciones públicas en español

Author: Corcho Oscar
Garijo Daniel
Navas-Loro María
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/09/2022
Field of study

Public procurement accounts for a 14% of the annual budget of the different governments of the European Union. In Europe, contracting processes are classified using Common Procurement Vocabulary codes (CPVs), a taxonomy designed to facilitate statistical reporting, search and the creation of alerts that can be used by potential bidders. CPVs are commonly assigned manually by public employees in charge of contracting processes. However, CPV classification is not a trivial task, as there are more than 9,000 different CPV categories, which are often assigned following heterogeneous criteria. In this paper we have created a CPV classifier that uses as an input the textual description of the contracting process, and assigns CPVs from the 45 top-level CPV categories. We work only with texts in Spanish, although our approach may be easily extended to other languages. Our results improve the state of the art (10% F1-score improvement) and are available online.Las licitaciones públicas suponen el 14% del presupuesto anual de la Unión Europea. En Europa, los procesos de contratación se clasifican usando la taxonomía Common Procurement Vocabulary (CPVs), diseñada para facilitar la generación de estadísticas, las búsquedas y la creación de alertas que puedan utilizar los posibles licitadores. Los códigos CPV suelen ser asignados manualmente por los empleados públicos encargados del proceso de contratación. Sin embargo, la clasificación de textos de acuerdo con estos códigos no es trivial, pues existen más de 9000 CPVs y no siempre se siguen los mismos criterios para su asignación. En este artículo se propone un clasificador que utiliza como entrada la descripción textual del proceso de contratación, y produce códigos de entre las 45 categorías de CPV más generales de la jerarquía. Trabajamos solo con textos en español, aunque nuestro enfoque puede extenderse fácilmente a otros idiomas. Los resultados obtenidos superan el estado del arte (10% de mejora en F1), y se encuentran disponibles online.This work has been supported by NextProcurement European Action (grant agreement INEA/CEF/ICT/A2020/2373713-Action 2020-ES-IA-0255) and the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with Universidad Politécnica de Madrid in the line Support for R&D projects for Beatriz Galindo researchers, in the context of the V PRICIT (Regional Programme of Research and Technological Innovation)

Repositorio Institucional de la Universidad de Alicante

GUDN: A novel guide network with label reinforcement strategy for extreme multi-label text classification

Author: Asamoah Kwame Omono
Shi Jianyang
Shu Hongji
Wang Qing
Zhou Cong
Zhu Jia
Publication venue
Publication date: 21/11/2022
Field of study

In natural language processing, extreme multi-label text classification is an emerging but essential task. The problem of extreme multi-label text classification (XMTC) is to recall some of the most relevant labels for a text from an extremely large label set. Large-scale pre-trained models have brought a new trend to this problem. Though the large-scale pre-trained models have made significant achievements on this problem, the valuable fine-tuned methods have yet to be studied. Though label semantics have been introduced in XMTC, the vast semantic gap between texts and labels has yet to gain enough attention. This paper builds a new guide network (GUDN) to help fine-tune the pre-trained model to instruct classification later. Furthermore, GUDN uses raw label semantics combined with a helpful label reinforcement strategy to effectively explore the latent space between texts and labels, narrowing the semantic gap, which can further improve predicted accuracy. Experimental results demonstrate that GUDN outperforms state-of-the-art methods on Eurlex-4k and has competitive results on other popular datasets. In an additional experiment, we investigated the input lengths' influence on the Transformer-based model's accuracy. Our source code is released at https://t.hk.uy/aFSH.Comment: 12 pages, 6 figure

arXiv.org e-Print Archive

Directory of Open Access Journals

CascadeXML: Rethinking Transformers for End-to-end Multi-resolution Training in Extreme Multi-label Classification

Author: Babbar Rohit
Banerjee Atmadeep
Kharbanda Siddhant
Schultheis Erik
Publication venue
Publication date: 29/10/2022
Field of study

Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign an input with a subset of most relevant labels from millions of label choices. Recent approaches, such as XR-Transformer and LightXML, leverage a transformer instance to achieve state-of-the-art performance. However, in this process, these approaches need to make various trade-offs between performance and computational requirements. A major shortcoming, as compared to the Bi-LSTM based AttentionXML, is that they fail to keep separate feature representations for each resolution in a label tree. We thus propose CascadeXML, an end-to-end multi-resolution learning pipeline, which can harness the multi-layered architecture of a transformer model for attending to different label resolutions with separate feature representations. CascadeXML significantly outperforms all existing approaches with non-trivial gains obtained on benchmark datasets consisting of up to three million labels. Code for CascadeXML will be made publicly available at \url{https://github.com/xmc-aalto/cascadexml}

arXiv.org e-Print Archive

A Survey on Extreme Multi-label Learning

Author: Li Yu-Feng
Mao Zhen
Shi Jiang-Xin
Wei Tong
Zhang Min-Ling
Publication venue
Publication date: 08/10/2022
Field of study

Multi-label learning has attracted significant attention from both academic and industry field in recent decades. Although existing multi-label learning algorithms achieved good performance in various tasks, they implicitly assume the size of target label space is not huge, which can be restrictive for real-world scenarios. Moreover, it is infeasible to directly adapt them to extremely large label space because of the compute and memory overhead. Therefore, eXtreme Multi-label Learning (XML) is becoming an important task and many effective approaches are proposed. To fully understand XML, we conduct a survey study in this paper. We first clarify a formal definition for XML from the perspective of supervised learning. Then, based on different model architectures and challenges of the problem, we provide a thorough discussion of the advantages and disadvantages of each category of methods. For the benefit of conducting empirical studies, we collect abundant resources regarding XML, including code implementations, and useful tools. Lastly, we propose possible research directions in XML, such as new evaluation metrics, the tail label problem, and weakly supervised XML.Comment: A preliminary versio

arXiv.org e-Print Archive