Entity-Centric Stream Filtering and Ranking: Filtering and Unfilterable Documents
Cumulative Citation Recommendation (CCR) is defined as follows: given a stream of documents on one hand and Knowledge Base (KB) entities on the other, filter, rank, and recommend citation-worthy documents. The pipeline found in systems that approach this problem involves four stages: filtering, classification, ranking (or scoring), and evaluation. Filtering is only an initial step that reduces the web-scale corpus to a working set of documents that is more manageable for the subsequent stages. Nevertheless, this step has a large impact on the maximum recall that can be attained. This study analyzes in depth the main factors that affect recall in the filtering stage. We investigate the impact of choices for corpus cleansing, entity profile construction, entity type, document type, and relevance grade. Because a loss of recall in this first step of the pipeline cannot be repaired later on, we identify and characterize the citation-worthy documents that do not pass the filtering stage by examining their contents.
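To make the role of the filtering stage concrete, the following is a minimal sketch, not the study's actual implementation. It assumes that an entity profile is simply a list of name variants and that filtering is case-insensitive substring matching; the function names, the toy documents, and the entity key `Alan_Turing` are all hypothetical. It shows how the choice of profile bounds recall, and how the citation-worthy documents that never pass the filter can be listed for inspection.

```python
def filter_stream(documents, entity_profiles):
    """Map each entity to the ids of documents that mention any of its name variants.

    Assumption: matching is plain case-insensitive substring search.
    """
    matches = {entity: set() for entity in entity_profiles}
    for doc_id, text in documents.items():
        lowered = text.lower()
        for entity, variants in entity_profiles.items():
            if any(variant.lower() in lowered for variant in variants):
                matches[entity].add(doc_id)
    return matches


def filtering_recall(matches, relevant):
    """Fraction of citation-worthy (entity, document) pairs that survive filtering."""
    total = sum(len(docs) for docs in relevant.values())
    if total == 0:
        return 0.0
    kept = sum(len(docs & matches.get(entity, set()))
               for entity, docs in relevant.items())
    return kept / total


def unfilterable(matches, relevant):
    """Citation-worthy documents per entity that the filter failed to retrieve."""
    return {entity: docs - matches.get(entity, set())
            for entity, docs in relevant.items()}


# Hypothetical toy data: three citation-worthy documents for one entity.
documents = {
    "d1": "Alan Turing was a pioneer of computing.",
    "d2": "The mathematician visited Bletchley Park.",
    "d3": "Turing's 1936 paper introduced a model of computation.",
}
relevant = {"Alan_Turing": {"d1", "d2", "d3"}}

canonical_profile = {"Alan_Turing": ["Alan Turing"]}
richer_profile = {"Alan_Turing": ["Alan Turing", "Turing"]}

canonical_matches = filter_stream(documents, canonical_profile)
richer_matches = filter_stream(documents, richer_profile)

print(filtering_recall(canonical_matches, relevant))
print(filtering_recall(richer_matches, relevant))
print(unfilterable(richer_matches, relevant))
```

In this toy setup the richer profile retrieves more of the citation-worthy documents than the canonical name alone, while the document that never names the entity stays unfilterable under either profile. No later stage can recover such a document, which is why recall lost in filtering cannot be repaired downstream.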