Czech Text Document Corpus v 2.0
This paper introduces the "Czech Text Document Corpus v 2.0", a collection
of text documents for automatic document classification in the Czech
language. It is composed of text documents provided by the Czech News
Agency and is freely available for research purposes at
http://ctdc.kiv.zcu.cz/. The corpus was created to facilitate a
straightforward comparison of document classification approaches on Czech
data. It is particularly dedicated to the evaluation of multi-label
document classification approaches, because one document is usually
labelled with more than one label. Besides the information about the
document classes, the corpus is also annotated at the morphological layer.
The paper further reports the results of selected state-of-the-art methods
on this corpus to enable an easy comparison with these approaches.
Comment: Accepted for LREC 201
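Multi-label results on corpora like this are typically reported with micro-averaged F1, since each document can carry several labels. The following sketch shows that computation; the label names and documents are invented placeholders, not taken from the corpus itself.

```python
# Micro-averaged F1 over parallel lists of gold and predicted label sets.
# Counts are pooled across all documents before computing precision/recall.

def micro_f1(gold, pred):
    tp = sum(len(g & p) for g, p in zip(gold, pred))   # correctly predicted labels
    fp = sum(len(p - g) for g, p in zip(gold, pred))   # spurious labels
    fn = sum(len(g - p) for g, p in zip(gold, pred))   # missed labels
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [{"politics", "economy"}, {"sport"}, {"culture", "politics"}]
pred = [{"politics"},            {"sport"}, {"culture", "economy"}]
print(round(micro_f1(gold, pred), 3))  # → 0.667
```

Pooling the counts (rather than averaging per-label F1, as macro-F1 does) gives frequent labels proportionally more weight, which is the usual choice for skewed news-category distributions.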
Probability of Semantic Similarity and N-grams Pattern Learning for Data Classification
Semantic learning is an important mechanism for document classification, but most classification approaches consider only the content and word distribution. Traditional classification algorithms cannot accurately represent the meaning of a document because they do not take the semantic relations between words into account. In this paper, we present an approach to document classification that incorporates two similarity scoring methods: first, a semantic similarity method that computes a probable similarity based on Bayes' method, and second, an n-gram-pair similarity score based on the probabilities of frequent terms. Since the semantic and n-gram-pair scores can each play an important role, from separate views, in classifying a document, we design a semantic similarity learning (SSL) algorithm to improve the performance of document classification on a huge quantity of unclassified documents. The experimental evaluation shows an improvement in the accuracy and effectiveness of the proposal on unclassified documents
GeoCLEF 2007: the CLEF 2007 cross-language geographic information retrieval track overview
GeoCLEF ran as a regular track for the second time within the Cross
Language Evaluation Forum (CLEF) 2007. The purpose of GeoCLEF is to test
and evaluate cross-language geographic information retrieval (GIR): retrieval
for topics with a geographic specification. GeoCLEF 2007 consisted of two
subtasks. A search task ran for the third time, and a query classification
task was organized for the first time. For the GeoCLEF 2007 search task,
twenty-five search
topics were defined by the organizing groups for searching English, German,
Portuguese and Spanish document collections. All topics were translated into
English, Indonesian, Portuguese, Spanish and German. Several topics in 2007
were geographically challenging. Thirteen groups submitted 108 runs. The
groups used a variety of approaches. For the classification task, a query log
from a search engine was provided, and the groups needed to identify the
queries with a geographic scope and the geographic components within those
local queries.
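In its simplest form, the classification subtask amounts to spotting place names and locality cues in a query. The sketch below illustrates the idea with a toy gazetteer and an assumed set of trigger words; participating systems used far richer resources.

```python
# Toy geographic-query classifier: flag queries with a geographic scope
# and extract the geographic component. Gazetteer and triggers are
# invented for illustration only.
GAZETTEER = {"berlin", "lisbon", "madrid", "bavaria", "portugal"}
TRIGGERS = {"in", "near", "around"}  # assumed locality cues

def classify_query(query):
    tokens = query.lower().split()
    places = [t for t in tokens if t in GAZETTEER]
    is_local = bool(places) or any(t in TRIGGERS for t in tokens)
    return {"local": is_local, "where": places}

print(classify_query("hotels near Lisbon"))    # geographic query
print(classify_query("open source licenses"))  # non-geographic query
```

Real submissions had to cope with ambiguous place names ("Washington" the person vs. the city), which is exactly what made the track's query log challenging.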
Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches
Text classification of unseen classes is a challenging Natural Language
Processing task and is mainly attempted using two different types of
approaches. Similarity-based approaches attempt to classify instances based on
similarities between text document representations and class description
representations. Zero-shot text classification approaches aim to generalize
knowledge gained from a training task by assigning appropriate labels of
unknown classes to text documents. Although existing studies have already
investigated individual approaches to these categories, the experiments in
the literature do not provide a consistent comparison. This paper addresses this
gap by conducting a systematic evaluation of different similarity-based and
zero-shot approaches for text classification of unseen classes. Different
state-of-the-art approaches are benchmarked on four text classification
datasets, including a new dataset from the medical domain. Additionally, novel
SimCSE and SBERT-based baselines are proposed, as other baselines used in
existing work yield weak classification results and are easily outperformed.
Finally, the novel similarity-based Lbl2TransformerVec approach is presented,
which outperforms previous state-of-the-art approaches in unsupervised text
classification. Our experiments show that similarity-based approaches
significantly outperform zero-shot approaches in most cases. Additionally,
using SimCSE or SBERT embeddings instead of simpler text representations
increases similarity-based classification results even further.
Comment: Accepted to 6th International Conference on Natural Language
Processing and Information Retrieval (NLPIR '22
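The similarity-based recipe described above reduces to: embed each document and each class description, then assign the most similar class. Systems of this kind use SBERT or SimCSE encoders; the bag-of-words "embedding" below is a minimal stand-in so the sketch runs without an external model, and the class descriptions are invented.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a sentence encoder such as SBERT or SimCSE.
    return Counter(text.lower().split())

def cos(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(doc, class_descriptions):
    """Assign the class whose description is most similar to the document."""
    d = embed(doc)
    return max(class_descriptions,
               key=lambda c: cos(d, embed(class_descriptions[c])))

classes = {
    "medicine": "disease patient treatment clinical hospital",
    "finance":  "market stock price investment bank",
}
print(classify("the patient received a new treatment at the hospital", classes))
# → medicine
```

No labeled training data is needed, only textual class descriptions, which is what makes this family of approaches applicable to unseen classes.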
COMPARATIVE ANALYSIS OF PARTICLE SWARM OPTIMIZATION ALGORITHMS FOR TEXT FEATURE SELECTION
With the rapid growth of the Internet, more and more natural-language text documents are available in electronic format, making automated text categorization a must in most fields. Due to the high dimensionality of text categorization tasks, feature selection is needed before executing document classification. There are basically two kinds of feature selection approaches: the filter approach and the wrapper approach. The wrapper approach requires a search algorithm for feature subsets and an evaluation algorithm for assessing the fitness of the selected feature subset. In this work, I focus on the comparison between two wrapper approaches, both of which use Particle Swarm Optimization (PSO) as the search algorithm. The first is a PSO-based K-Nearest Neighbors (KNN) algorithm, while the second is a PSO-based Rocchio algorithm. Three datasets are used in this study. The results show that BPSO-KNN achieves slightly better classification results than BPSO-Rocchio, while BPSO-Rocchio requires far less computation time than BPSO-KNN.
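The wrapper setup above can be sketched as a binary PSO (BPSO) over feature masks. The fitness here is a toy stand-in that rewards a known-good feature subset; in the actual wrapper approach it would instead train and score a KNN or Rocchio classifier on the selected features. Swarm size and coefficients are assumed typical values, not the paper's settings.

```python
import math
import random

random.seed(0)
N_FEATURES = 12
GOOD = {0, 3, 7}  # pretend these are the informative features

def fitness(mask):
    # Toy wrapper objective: reward informative features, penalize extras.
    # A real wrapper would evaluate classifier accuracy on this subset.
    sel = {i for i, b in enumerate(mask) if b}
    return len(sel & GOOD) - 0.1 * len(sel - GOOD)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpso(n_particles=20, iters=50, w=0.7, c1=1.5, c2=1.5):
    pos = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(n_particles)]
    vel = [[0.0] * N_FEATURES for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = max(pbest, key=fitness)[:]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(N_FEATURES):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Binary PSO: velocity sets the probability of the bit being 1.
                pos[i][d] = 1 if random.random() < sigmoid(vel[i][d]) else 0
            if fitness(pos[i]) > fitness(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = max(pbest + [gbest], key=fitness)[:]
    return gbest

best = bpso()
print(sorted(i for i, b in enumerate(best) if b))
```

Because the classifier is retrained for every candidate mask, wrapper fitness evaluation dominates the cost, which is consistent with the abstract's observation that the cheap Rocchio evaluator runs far faster than KNN.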