Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation
Prompt Tuning is emerging as a scalable and cost-effective method to
fine-tune Pretrained Language Models (PLMs), which are often referred to as
Large Language Models (LLMs). This study benchmarks the performance and
computational efficiency of Prompt Tuning and baselines for multi-label text
classification. This is applied to the challenging task of classifying
companies into an investment firm's proprietary industry taxonomy, supporting
their thematic investment strategy. Text-to-text classification is frequently
reported to outperform task-specific classification heads, but has several
limitations when applied to a multi-label classification problem where each
label consists of multiple tokens: (a) Generated labels may not match any label
in the label taxonomy; (b) The fine-tuning process lacks permutation invariance
and is sensitive to the order of the provided labels; (c) The model provides
binary decisions rather than appropriate confidence scores. Limitation (a) is
addressed by applying constrained decoding using Trie Search, which slightly
improves classification performance. All limitations (a), (b), and (c) are
addressed by replacing the PLM's language head with a classification head,
which is referred to as Prompt Tuned Embedding Classification (PTEC). This
improves performance significantly, while also reducing computational costs
during inference. In our industrial application, the training data is skewed
towards well-known companies. We confirm that the model's performance is
consistent across both well-known and less-known companies. Our overall results
indicate the continuing need to adapt state-of-the-art methods to
domain-specific tasks, even in the era of PLMs with strong generalization
abilities. We release our codebase and a benchmarking dataset at
https://github.com/EQTPartners/PTEC
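The constrained decoding idea from the abstract can be illustrated with a prefix trie over the tokenized taxonomy labels: at each decoding step, only tokens that continue some valid label are permitted, so the generated text can never fall outside the taxonomy. This is a minimal sketch, not the authors' implementation; the toy token ids are purely illustrative.

```python
def build_trie(label_token_ids):
    """Prefix trie over tokenized taxonomy labels. During decoding,
    only tokens that continue some label are allowed, so every
    generated label is guaranteed to exist in the taxonomy."""
    trie = {}
    for ids in label_token_ids:
        node = trie
        for tok in ids:
            node = node.setdefault(tok, {})
        node[None] = {}  # end-of-label marker
    return trie

def allowed_next_tokens(trie, prefix):
    """Tokens permitted after the already-generated prefix."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []  # prefix has left the taxonomy
        node = node[tok]
    return [tok for tok in node if tok is not None]

# toy token ids for three hypothetical labels, e.g. two labels
# sharing the first token, plus one single-token label
trie = build_trie([[1, 2], [1, 3], [4]])
print(allowed_next_tokens(trie, [1]))  # → [2, 3]
```

In practice such a function is plugged into the decoder's per-step logit mask, zeroing out every token the trie disallows.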
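Replacing the language head with a classification head, as PTEC does, amounts to mapping the pooled embedding through one sigmoid per taxonomy label, which yields an order-invariant, per-label confidence score rather than generated text. The sketch below assumes hypothetical dimensions (an 8-dim embedding, 5 labels) and random weights purely for illustration.

```python
import numpy as np

def classification_head(pooled_embedding, W, b):
    """Multi-label head: one independent sigmoid per taxonomy label,
    producing a confidence score for each label instead of text."""
    logits = pooled_embedding @ W + b
    return 1.0 / (1.0 + np.exp(-logits))  # per-label probabilities

# toy dimensions (hypothetical): 8-dim embedding, 5 industry labels
rng = np.random.default_rng(0)
emb = rng.normal(size=8)
W = rng.normal(size=(8, 5))
b = np.zeros(5)
scores = classification_head(emb, W, b)
predicted = [i for i, s in enumerate(scores) if s > 0.5]  # threshold per label
```

Because each label gets its own score, the output is independent of label ordering and supports thresholding or top-k selection downstream.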
Deep Extreme Multi-label Learning
Extreme multi-label learning (XML) or classification has been a practical and
important problem since the boom of big data. The main challenge lies in the
exponential label space, which involves 2^L possible label sets when the
label dimension L is huge, e.g., in the millions for Wikipedia labels.
This paper is motivated to better explore the label space by explicitly
establishing a label graph. Meanwhile, deep learning has been
widely studied and used in various classification problems, including
multi-label classification; however, it has not been properly introduced to XML,
where the label space can be as large as millions of labels. In this paper, we propose
a practical deep embedding method for extreme multi-label classification, which
harvests the ideas of non-linear embedding and graph priors-based label space
modeling simultaneously. Extensive experiments on public datasets for XML show
that our method performs competitively against state-of-the-art results.
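The "graph priors-based label space modeling" mentioned above can be grounded with a simple stand-in: a label co-occurrence graph built from the training indicator matrix. This is only a sketch of one plausible way to form such a graph, not the paper's actual construction.

```python
import numpy as np

def label_cooccurrence_graph(Y):
    """Y: (n_samples, n_labels) binary indicator matrix.
    Returns a row-normalized label-label adjacency matrix whose
    entries reflect how often two labels co-occur in training data."""
    co = (Y.T @ Y).astype(float)  # co-occurrence counts
    np.fill_diagonal(co, 0.0)     # ignore self-co-occurrence
    deg = co.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0           # avoid division by zero for isolated labels
    return co / deg

# toy data: 3 samples, 3 labels
Y = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
A = label_cooccurrence_graph(Y)
```

Such an adjacency matrix can then serve as a prior, e.g. to regularize label embeddings so that frequently co-occurring labels stay close.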