LSHTC: A Benchmark for Large-Scale Text Classification
LSHTC is a series of challenges that aims to assess the performance of
classification systems in large-scale classification with a large number of
classes (up to hundreds of thousands). This paper describes the datasets that
have been released along the LSHTC series. The paper details the construction
of the datasets and the design of the tracks, as well as the evaluation measures
that we implemented and a quick overview of the results. All of these datasets
are available online, and runs may still be submitted on the online server of
the challenges.
A Route Confidence Evaluation Method for Reliable Hierarchical Text Categorization
Hierarchical Text Categorization (HTC) is becoming increasingly important
with the rapidly growing amount of text data available on the World Wide Web.
Among the different strategies proposed to cope with HTC, the Local Classifier
per Node (LCN) approach attains good performance by mirroring the underlying
class hierarchy while enforcing a top-down strategy in the testing step.
However, the problem of embedding hierarchical information (parent-child
relationship) to improve the performance of HTC systems still remains open. A
confidence evaluation method for a selected route in the hierarchy is proposed
to evaluate the reliability of the final candidate labels in an HTC system. In
order to exploit the information embedded in the hierarchy, weight
factors are used to reflect the importance of each level. An
acceptance/rejection strategy in the top-down decision-making process is
proposed, which improves the overall categorization accuracy by rejecting a small
percentage of samples, i.e., those with a low reliability score. Experimental
results on the Reuters benchmark dataset (RCV1-v2) confirm the effectiveness
of the proposed method compared to other state-of-the-art HTC methods.
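
The abstract does not give the confidence formula, so the following is only an
illustrative sketch of the general idea, not the authors' method. It assumes
that each level of the selected top-down route yields a classifier score in
[0, 1], that the route confidence is a weighted average of those scores, and
that a route is rejected when the confidence falls below a threshold. The names
route_confidence and accept_route, the weights, and the threshold are all
hypothetical.

```python
from typing import Sequence


def route_confidence(level_scores: Sequence[float],
                     level_weights: Sequence[float]) -> float:
    """Weighted average of per-level scores along one root-to-leaf route.

    Assumption: the weighted average is a stand-in for the paper's
    (unspecified) confidence measure.
    """
    if len(level_scores) != len(level_weights):
        raise ValueError("need exactly one weight per hierarchy level")
    total_weight = sum(level_weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(w * s for w, s in zip(level_weights, level_scores))
    return weighted_sum / total_weight


def accept_route(level_scores: Sequence[float],
                 level_weights: Sequence[float],
                 threshold: float) -> bool:
    """Accept the final label only if the route confidence meets the threshold."""
    return route_confidence(level_scores, level_weights) >= threshold


# Hypothetical three-level route; upper levels are weighted more heavily.
weights = [3.0, 2.0, 1.0]
confident_route = [0.9, 0.8, 0.4]
doubtful_route = [0.6, 0.5, 0.3]

print(accept_route(confident_route, weights, threshold=0.7))
print(accept_route(doubtful_route, weights, threshold=0.7))
```

Raising the threshold rejects more samples; in a setup like this one it would
be tuned on held-out data to trade coverage against accuracy on the accepted
samples.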