Two Steps Feature Selection and Neural Network Classification for the TREC-8 Routing
For the TREC-8 routing task, a separate filter is built for each topic. Each
filter is a classifier trained to recognize the documents that are relevant to
its topic. When presented with a document, each classifier estimates the
probability that the document is relevant to the topic for which it was
trained. Since the procedure for building a filter is topic-independent, the
system is fully automatic.
By making use of a sample of documents that have previously been evaluated as
relevant or not relevant to a particular topic, a term selection is performed,
and a neural network is trained. Each document is represented by a vector of
frequencies of a list of selected terms. This list depends on the topic to be
filtered; it is constructed in two steps. The first step identifies the
characteristic words used in the relevant documents of the corpus; the second
one chooses, from that list, the most discriminating ones. The length of
the vector is optimized automatically for each topic. At the end of the term
selection, a vector of typically 25 words is defined for the topic, so that
each document which has to be processed is represented by a vector of term
frequencies.
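
The two-step selection and the resulting term-frequency vector can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the selection criteria, so step 1 here ranks words by how many relevant documents contain them, and step 2 ranks candidates by the difference in document frequency between relevant and non-relevant documents. The function names, the whitespace tokenizer, and the fixed `n_terms` are hypothetical; in the system described above the vector length is optimized per topic rather than fixed.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, chosen for brevity.
    return text.lower().split()


def candidate_terms(relevant_docs, max_candidates=100):
    """Step 1 (assumed criterion): words found in the most relevant documents."""
    doc_freq = Counter()
    for doc in relevant_docs:
        doc_freq.update(set(tokenize(doc)))
    # Sort by document frequency, breaking ties alphabetically for determinism.
    ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
    return ranked[:max_candidates]


def discriminant_terms(candidates, relevant_docs, other_docs, n_terms=25):
    """Step 2 (assumed criterion): largest gap in document frequency between classes."""
    def rate(word, docs):
        hits = sum(word in set(tokenize(d)) for d in docs)
        return hits / max(len(docs), 1)

    def score(word):
        return rate(word, relevant_docs) - rate(word, other_docs)

    ranked = sorted(candidates, key=lambda w: (-score(w), w))
    return ranked[:n_terms]


def to_vector(doc, terms):
    """Represent a document by the frequencies of the selected terms."""
    counts = Counter(tokenize(doc))
    return [counts[t] for t in terms]


relevant = ["neural network training", "neural network filter for routing"]
other = ["the stock market for training horses", "network television schedule"]

terms = discriminant_terms(candidate_terms(relevant), relevant, other, n_terms=3)
print(terms)                                       # ['neural', 'filter', 'network']
print(to_vector("neural neural network", terms))   # [2, 0, 1]
```

Any other scoring rule could be swapped into either step without changing the overall shape: a topic-specific term list first, then one frequency vector per document.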
This vector is subsequently input to a classifier that is trained from the
same sample. After training, the classifier estimates for each document of a
test set its probability of being relevant; for submission to TREC, the top
1000 documents are ranked in order of decreasing relevance.
Comment: 5 pages, 1 figure, Eighth International Text REtrieval Conference
(TREC-8
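
The classification and ranking stage described in the abstract can be sketched as below. The network architecture is not given here, so this sketch substitutes a single logistic unit trained by stochastic gradient descent as a stand-in for the neural network; the function names, learning rate, and epoch count are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(model, x):
    """Estimated probability that the document with vector x is relevant."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(vectors, labels, epochs=500, lr=0.1):
    """Fit a single logistic unit (stand-in for the neural network) by SGD."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            err = predict((weights, bias), x) - y  # gradient of the log loss
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def rank(model, doc_vectors, top_k=1000):
    """Return the ids of the top_k documents by decreasing estimated relevance."""
    scored = sorted(doc_vectors.items(), key=lambda kv: -predict(model, kv[1]))
    return [doc_id for doc_id, _ in scored[:top_k]]


# Toy training sample: term-frequency vectors with relevance labels (1 = relevant).
train_x = [[2, 0], [1, 0], [0, 2], [0, 1]]
train_y = [1, 1, 0, 0]
model = train(train_x, train_y)

test_docs = {"a": [0, 3], "b": [3, 0], "c": [1, 1]}
print(rank(model, test_docs, top_k=2))  # ['b', 'c']
```

With one such model per topic, the routing run amounts to calling `rank` once per topic with `top_k=1000`.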