Review of Extreme Multilabel Classification
Extreme multilabel classification, or XML, is an active area of interest in
machine learning. Compared to traditional multilabel classification, the
number of labels here is extremely large, hence the name. Classical
one-versus-all classification won't scale in this case because of the large
number of labels, and the same is true for other classifiers. Embedding the
labels as well as the features into a smaller label space is an essential
first step. Other issues include the existence of head and tail labels, where
tail labels are labels that occur in relatively few of the given samples. The
existence of tail labels creates issues during embedding. This area has
invited a wide range of approaches, including bit compression motivated by
compressed sensing, tree-based embeddings, deep-learning-based latent space
embeddings (including ones using attention weights), linear-algebra-based
embeddings such as SVD, clustering, and hashing, to name a few. The community
has come up with a useful set of metrics to identify correct predictions for
head or tail labels.
Comment: 46 pages, 13 figures
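As a rough illustration of the kind of metric the abstract above alludes to, the sketch below computes precision@k and a propensity-scored variant that gives more credit to correct tail-label predictions. It is a minimal sketch, not taken from the paper: the function names, the example scores, and the per-label propensities are all assumptions made for illustration, and the propensity-scored value is left unnormalized.

```python
from typing import List, Sequence, Set


def top_k_labels(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring labels."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Sort label indices by descending score; ties keep the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[:k]


def precision_at_k(scores: Sequence[float], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k predicted labels that are truly relevant."""
    hits = sum(1 for label in top_k_labels(scores, k) if label in relevant)
    return hits / k


def psp_at_k(
    scores: Sequence[float],
    relevant: Set[int],
    propensities: Sequence[float],
    k: int,
) -> float:
    """Propensity-scored precision@k (unnormalized).

    Each correct label contributes 1 / propensity, so rarely observed
    (tail) labels, which have low propensity, count for more.
    The propensities are assumed to be supplied by the caller.
    """
    gain = sum(
        1.0 / propensities[label]
        for label in top_k_labels(scores, k)
        if label in relevant
    )
    return gain / k


# Hypothetical example: four labels, label 3 is a tail label (low propensity).
example_scores = [0.9, 0.1, 0.8, 0.3]
example_relevant = {0, 3}
example_propensities = [0.5, 0.5, 0.5, 0.25]
p_at_2 = precision_at_k(example_scores, example_relevant, 2)
psp_at_3 = psp_at_k(example_scores, example_relevant, example_propensities, 3)
```

In this toy example the tail label 3 only enters the ranking at k = 3, where it contributes four times the weight of a label with propensity 1.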
The Emerging Trends of Multi-Label Learning
Exabytes of data are generated daily by humans, leading to the growing need
for new efforts in dealing with the grand challenges for multi-label learning
brought by big data. For example, extreme multi-label classification is an
active and rapidly growing research area that deals with classification tasks
with an extremely large number of classes or labels; utilizing massive data
with limited supervision to build a multi-label classification model becomes
valuable for practical applications, etc. Besides these, there are tremendous
efforts on how to harvest the strong learning capability of deep learning to
better capture the label dependencies in multi-label learning, which is the key
for deep learning to address real-world classification tasks. However, it is
noted that there has been a lack of systematic studies that focus explicitly on
analyzing the emerging trends and new challenges of multi-label learning in the
era of big data. It is imperative to call for a comprehensive survey to fulfill
this mission and delineate future research directions and new applications.
Comment: Accepted to TPAMI 202
Light-weight Deep Extreme Multilabel Classification
Extreme multi-label (XML) classification refers to the task of supervised
multi-label learning that involves a large number of labels. Hence, scalability
of the classifier with increasing label dimension is an important
consideration. In this paper, we develop a method called LightDXML which
modifies the recently developed deep learning based XML framework by using
label embeddings instead of feature embeddings for negative sampling and
iterating cyclically through three major phases: (1) proxy training of label
embeddings, (2) shortlisting of labels for negative sampling, and (3) final
classifier training using the negative samples. Consequently, LightDXML also
removes the requirement of a re-ranker module, thereby, leading to further
savings on time and memory requirements. The proposed method achieves the best
of both worlds: while the training time, model size and prediction times are on
par or better compared to the tree-based methods, it attains much better
prediction accuracy that is on par with the deep learning based methods.
Moreover, the proposed approach achieves the best tail-label prediction
accuracy over most state-of-the-art XML methods on some of the large
datasets. (Accepted at IJCNN 2023; partial funding from MAPG grant and
IIIT Seed grant at IIIT, Hyderabad, India. Code:
https://github.com/misterpawan/LightDXML)
Comment: 9 pages, 2 figures, 5 tables
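The shortlisting phase mentioned in the LightDXML abstract above (choosing negative labels with the help of label embeddings) can be pictured with the sketch below. This is not the authors' implementation: the function name, the dot-product similarity, and the rule "pick the non-positive labels closest to any positive label" are assumptions chosen only to show the general idea of embedding-based hard-negative selection.

```python
from typing import List, Sequence, Set


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def shortlist_negatives(
    label_embeddings: Sequence[Sequence[float]],
    positives: Set[int],
    n_negatives: int,
) -> List[int]:
    """Pick the n_negatives labels most similar to the positive labels.

    Assumption: a negative is "hard" when its embedding lies close to the
    embedding of at least one positive label of the sample.
    """
    if not positives:
        return []
    candidates = []
    for label, embedding in enumerate(label_embeddings):
        if label in positives:
            continue  # positives can never be sampled as negatives
        # Similarity to the closest positive label.
        score = max(dot(embedding, label_embeddings[p]) for p in positives)
        candidates.append((score, label))
    # Highest similarity first; break ties by the lower label index.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in candidates[:n_negatives]]


# Hypothetical 2-dimensional embeddings for four labels.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
hard_negatives = shortlist_negatives(toy_embeddings, positives={0}, n_negatives=2)
```

A final classifier would then be trained on each sample's positive labels plus such a shortlist rather than on the full label set, which is what keeps the cost per sample small.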
Clasificación multi-etiqueta de textos de licitaciones públicas en español
Public procurement accounts for 14% of the annual budget of the different governments of the European Union. In Europe, contracting processes are classified using Common Procurement Vocabulary codes (CPVs), a taxonomy designed to facilitate statistical reporting, search, and the creation of alerts that can be used by potential bidders. CPVs are commonly assigned manually by the public employees in charge of contracting processes. However, CPV classification is not a trivial task, as there are more than 9,000 different CPV categories, which are often assigned following heterogeneous criteria. In this paper we have created a CPV classifier that takes as input the textual description of the contracting process and assigns CPVs from the 45 top-level CPV categories. We work only with texts in Spanish, although our approach may be easily extended to other languages. Our results improve on the state of the art (10% F1-score improvement) and are available online.
This work has been supported by NextProcurement European Action (grant agreement INEA/CEF/ICT/A2020/2373713-Action 2020-ES-IA-0255) and the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with Universidad Politécnica de Madrid in the line Support for R&D projects for Beatriz Galindo researchers, in the context of the V PRICIT (Regional Programme of Research and Technological Innovation)
GUDN: A novel guide network with label reinforcement strategy for extreme multi-label text classification
In natural language processing, extreme multi-label text classification is an
emerging but essential task. The problem of extreme multi-label text
classification (XMTC) is to recall some of the most relevant labels for a text
from an extremely large label set. Large-scale pre-trained models have brought
a new trend to this problem. Although these models have made significant
achievements on it, effective fine-tuning methods have yet to be studied.
Likewise, although label semantics have been introduced in XMTC, the vast
semantic gap between texts and labels has yet to gain enough attention.
This paper builds a new guide network (GUDN) to help fine-tune the
pre-trained model to instruct classification later. Furthermore, GUDN uses raw
label semantics combined with a helpful label reinforcement strategy to
effectively explore the latent space between texts and labels, narrowing the
semantic gap, which can further improve predicted accuracy. Experimental
results demonstrate that GUDN outperforms state-of-the-art methods on Eurlex-4k
and has competitive results on other popular datasets. In an additional
experiment, we investigated the input lengths' influence on the
Transformer-based model's accuracy. Our source code is released at
https://t.hk.uy/aFSH.
Comment: 12 pages, 6 figures
CascadeXML: Rethinking Transformers for End-to-end Multi-resolution Training in Extreme Multi-label Classification
Extreme Multi-label Text Classification (XMC) involves learning a classifier
that can assign to an input a subset of the most relevant labels from millions
of label choices. Recent approaches, such as XR-Transformer and LightXML, leverage
a transformer instance to achieve state-of-the-art performance. However, in
this process, these approaches need to make various trade-offs between
performance and computational requirements. A major shortcoming, as compared to
the Bi-LSTM based AttentionXML, is that they fail to keep separate feature
representations for each resolution in a label tree. We thus propose
CascadeXML, an end-to-end multi-resolution learning pipeline, which can harness
the multi-layered architecture of a transformer model for attending to
different label resolutions with separate feature representations. CascadeXML
significantly outperforms all existing approaches with non-trivial gains
obtained on benchmark datasets consisting of up to three million labels. Code
for CascadeXML will be made publicly available at
https://github.com/xmc-aalto/cascadexml
A Survey on Extreme Multi-label Learning
Multi-label learning has attracted significant attention from both academia
and industry in recent decades. Although existing multi-label learning
algorithms achieved good performance in various tasks, they implicitly assume
the size of target label space is not huge, which can be restrictive for
real-world scenarios. Moreover, it is infeasible to directly adapt them to
extremely large label space because of the compute and memory overhead.
Therefore, eXtreme Multi-label Learning (XML) is becoming an important task and
many effective approaches have been proposed. To fully understand XML, we conduct a
survey study in this paper. We first clarify a formal definition for XML from
the perspective of supervised learning. Then, based on different model
architectures and challenges of the problem, we provide a thorough discussion
of the advantages and disadvantages of each category of methods. For the
benefit of conducting empirical studies, we collect abundant resources
regarding XML, including code implementations, and useful tools. Lastly, we
propose possible research directions in XML, such as new evaluation metrics,
the tail label problem, and weakly supervised XML.
Comment: A preliminary versio