Hierarchical Text Classification with Reinforced Label Assignment
While existing hierarchical text classification (HTC) methods attempt to
capture label hierarchies for model training, they either make local decisions
regarding each label or completely ignore the hierarchy information during
inference. To resolve the mismatch between training and inference and to
model label dependencies in a more principled way, we formulate HTC as a
Markov decision process and propose to learn a Label Assignment Policy via deep
reinforcement learning to determine where to place an object and when to stop
the assignment process. The proposed method, HiLAP, explores the hierarchy
during both training and inference time in a consistent manner and makes
inter-dependent decisions. As a general framework, HiLAP can incorporate
different neural encoders as base models for end-to-end training. Experiments
on five public datasets and four base models show that HiLAP yields an average
improvement of 33.4% in Macro-F1 over flat classifiers and outperforms
state-of-the-art HTC methods by a large margin. Data and code can be found at
https://github.com/morningmoni/HiLAP.
Comment: EMNLP 201
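The abstract above describes label assignment as a sequential decision process over the hierarchy, with an explicit choice of when to stop. The following is only a minimal sketch of that idea under stated assumptions, not the HiLAP implementation: the names `candidate_actions`, `assign_labels`, the `<root>` and `<stop>` markers, and the fixed toy scores are all hypothetical, and a hand-written score table stands in for a learned policy.

```python
from typing import Callable, Dict, List

ROOT = "<root>"   # hypothetical marker for the top of the hierarchy
STOP = "<stop>"   # hypothetical action that ends the assignment process


def candidate_actions(hierarchy: Dict[str, List[str]], assigned: List[str]) -> List[str]:
    """Unassigned children of the root or of any assigned label, plus STOP."""
    frontier: List[str] = []
    for parent in [ROOT] + assigned:
        for child in hierarchy.get(parent, []):
            if child not in assigned and child not in frontier:
                frontier.append(child)
    return frontier + [STOP]


def assign_labels(
    hierarchy: Dict[str, List[str]],
    score: Callable[[List[str], str], float],
    max_steps: int = 10,
) -> List[str]:
    """Greedily pick the best-scoring action until STOP wins or max_steps is hit."""
    assigned: List[str] = []
    for _ in range(max_steps):
        actions = candidate_actions(hierarchy, assigned)
        best = max(actions, key=lambda a: score(assigned, a))
        if best == STOP:
            break
        assigned.append(best)
    return assigned


# Toy hierarchy and a fixed score table standing in for a trained policy.
hierarchy = {
    ROOT: ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["tennis"],
}
toy_scores = {
    "science": 0.9, "physics": 0.8, "biology": 0.2,
    "sports": 0.1, "tennis": 0.05, STOP: 0.5,
}


def toy_score(assigned: List[str], action: str) -> float:
    # A real policy would condition on the document and the labels assigned so far.
    return toy_scores[action]


labels = assign_labels(hierarchy, toy_score)
print(labels)
```

Because the same traversal runs whether the scores come from a policy in training or from a trained policy at inference time, a sketch like this illustrates how the hierarchy can be explored consistently in both phases.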
Interdisciplinary Fairness in Imbalanced Research Proposal Topic Inference: A Hierarchical Transformer-based Method with Selective Interpolation
Topic inference in research proposals aims to obtain the
most suitable disciplinary division from the discipline system defined by a
funding agency. The agency will subsequently find appropriate peer review
experts from their database based on this division. Automated topic inference
can reduce human errors caused by manual topic filling, bridge the knowledge
gap between funding agencies and project applicants, and improve system
efficiency. Existing methods focus on modeling this as a hierarchical
multi-label classification problem, using generative models to iteratively
infer the most appropriate topic information. However, these methods overlook
the gap in scale between interdisciplinary research proposals and
non-interdisciplinary ones, leading to an unjust phenomenon where the automated
inference system categorizes interdisciplinary proposals as
non-interdisciplinary, causing unfairness during the expert assignment. How can
we address this data imbalance issue under a complex discipline system and
hence resolve this unfairness? In this paper, we implement a topic label
inference system based on a Transformer encoder-decoder architecture.
Furthermore, we utilize interpolation techniques to create a series of
pseudo-interdisciplinary proposals from non-interdisciplinary ones during
training based on non-parametric indicators such as cross-topic probabilities
and topic occurrence probabilities. This approach aims to reduce the bias of
the system during model training. Finally, we conduct extensive experiments on
a real-world dataset to verify the effectiveness of the proposed method. The
experimental results demonstrate that our training strategy can significantly
mitigate the unfairness generated in the topic inference task.
Comment: 19 pages, Under review. arXiv admin note: text overlap with
arXiv:2209.1391
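The abstract above mentions interpolating non-interdisciplinary proposals into pseudo-interdisciplinary ones. As a rough illustration only, the sketch below uses a generic mixup-style linear interpolation of two feature vectors with soft topic labels. It is an assumption, not the authors' method: the function names, the fixed mixing weight `lam`, and the soft-label scheme are hypothetical, and the selection of pairs by cross-topic and topic occurrence probabilities is omitted.

```python
from typing import Dict, List, Tuple


def interpolate(x_a: List[float], x_b: List[float], lam: float) -> List[float]:
    """Element-wise convex combination lam * x_a + (1 - lam) * x_b."""
    if len(x_a) != len(x_b):
        raise ValueError("feature vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def make_pseudo_interdisciplinary(
    x_a: List[float],
    topic_a: str,
    x_b: List[float],
    topic_b: str,
    lam: float = 0.5,
) -> Tuple[List[float], Dict[str, float]]:
    """Mix two single-topic proposals into one sample carrying both topics."""
    if topic_a == topic_b:
        raise ValueError("the two proposals must come from different topics")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = interpolate(x_a, x_b, lam)
    soft_labels = {topic_a: lam, topic_b: 1.0 - lam}
    return mixed, soft_labels


# Two toy proposal embeddings from different (hypothetical) disciplines.
mixed, soft_labels = make_pseudo_interdisciplinary(
    [1.0, 0.0], "physics", [0.0, 1.0], "biology", lam=0.25
)
print(mixed, soft_labels)
```

In a fuller version, the partner proposal and the mixing weight would presumably be chosen from statistics of the training set rather than fixed by hand.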
Hierarchical Metadata-Aware Document Categorization under Weak Supervision
Categorizing documents into a given label hierarchy is intuitively appealing
due to the ubiquity of hierarchical topic structures in massive text corpora.
Although related studies have achieved satisfying performance in fully
supervised hierarchical document classification, they usually require massive
human-annotated training data and only utilize text information. However, in
many domains, (1) annotations are expensive, so very few training
samples can be acquired; and (2) documents are accompanied by metadata information.
Hence, this paper studies how to integrate the label hierarchy, metadata, and
text signals for document categorization under weak supervision. We develop
HiMeCat, an embedding-based generative framework for our task. Specifically, we
propose a novel joint representation learning module that allows simultaneous
modeling of category dependencies, metadata information and textual semantics,
and we introduce a data augmentation module that hierarchically synthesizes
training documents to complement the original, small-scale training set. Our
experiments demonstrate a consistent improvement of HiMeCat over competitive
baselines and validate the contribution of our representation learning and data
augmentation modules.
Comment: 9 pages; Accepted to WSDM 202