15,837 research outputs found
Searching for Ground Truth: a stepping stone in automating genre classification
This paper examines genre classification of documents and
its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward achieving automated extraction of descriptive metadata for digital material. Here, we present results from experiments using human labellers, conducted to assist in genre characterisation and the prediction of obstacles which need to be overcome by an automated system, and to contribute to the process of creating a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers based on image and stylistic modeling features in labelling the data resulting from the agreement of three human labellers across fifteen genre classes.
Supporting polyrepresentation in a quantum-inspired geometrical retrieval framework
The relevance of a document has many facets, going beyond the usual topical one, which have to be considered to satisfy a user's information need. Multiple representations of documents, like user-given reviews or the actual document content, can give evidence towards certain facets of relevance. In this respect polyrepresentation of documents, where such evidence is combined, is a crucial concept to estimate the relevance of a document. In this paper, we discuss how a geometrical retrieval framework inspired by quantum mechanics can be extended to support polyrepresentation. We show by example how different representations of a document can be modelled in a Hilbert space, similar to physical systems known from quantum mechanics. We further illustrate how these representations are combined by means of the tensor product to support polyrepresentation, and discuss the case that representations of documents are not independent from a user point of view. Besides giving a principled framework for polyrepresentation, the potential of this approach is to capture and formalise the complex interdependent relationships that the different representations can have between each other
Formulating representative features with respect to document genre classification
Genre classification (e.g. whether a document
is a scientific article or magazine article) is closely
bound to the physical and conceptual structure of document
as well as the level of depth involved in the text.
Hence, it provides a means of ranking documents retrieved
by search tools according to metrics other than
topical similarity. Moreover, the structural information
derived from genre classification can be used to locate
target information within the text. In previous studies,
the detection of genre classes has been attempted
by using some normalised frequency of terms or combinations
of terms in the document (here, we are using
term as a reference to words, phrases, syntactic
units, sentences and paragraphs, as well as other patterns
derived from deeper linguistic or semantic analysis).
These approaches largely neglect how the term is
distributed throughout the document. Here, we report
the results of automated experiments based on distributive
statistics of words in order to present evidence that
term distribution pattern is a better indicator of genre
class than term frequency.
Query Expansion for Survey Question Retrieval in the Social Sciences
In recent years, the importance of research data and the need to archive and
to share it in the scientific community have increased enormously. This
introduces a whole new set of challenges for digital libraries. In the social
sciences typical research data sets consist of surveys and questionnaires. In
this paper we focus on the use case of social science survey question reuse and
on mechanisms to support users in the query formulation for data sets. We
describe and evaluate thesaurus- and co-occurrence-based approaches for query
expansion to improve retrieval quality in digital libraries and research data
archives. The challenge here is to translate the information need and the
underlying sociological phenomena into proper queries. As we can show retrieval
quality can be improved by adding related terms to the queries. In a direct
comparison automatically expanded queries using extracted co-occurring terms
can provide better results than queries manually reformulated by a domain
expert and better results than a keyword-based BM25 baseline.Comment: to appear in Proceedings of 19th International Conference on Theory
and Practice of Digital Libraries 2015 (TPDL 2015
Using citation-context to reduce topic drifting on pure citation-based recommendation
Recent works in the area of academic recommender systems have demonstrated the effectiveness of co-citation and citation closeness in related-document recommendations. However, documents recommended from such systems may drift away from the main theme of the query document. In this work, we investigate whether incorporating the textual information in close proximity to a citation as well as the citation position could reduce such drifting and further increase the performance of the recommender system. To investigate this, we run experiments with several recommendation methods on a newly created and now publicly available dataset containing 53 million unique citation-based records. We then conduct a user-based evaluation with domain-knowledgeable participants. Our results show that a new method based on the combination of Citation Proximity Analysis (CPA), topic modelling and word embeddings achieves more than 20% improvement in Normalised Discounted Cumulative Gain (nDCG) compared to CPA
A probabilistic justification for using tf.idf term weighting in information retrieval
This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well known existing models of information retrieval, but is essential in the field of statistical natural language processing. Advances already made in statistical natural language processing will be used in this paper to formulate a probabilistic justification for using tf.idf term weighting. The paper shows that the new probabilistic interpretation of tf.idf term weighting might lead to better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm
Digital Image Access & Retrieval
The 33th Annual Clinic on Library Applications of Data Processing, held at the University of Illinois at Urbana-Champaign in March of 1996, addressed the theme of "Digital Image Access & Retrieval." The papers from this conference cover a wide range of topics concerning digital imaging technology for visual resource collections. Papers covered three general areas: (1) systems, planning, and implementation; (2) automatic and semi-automatic indexing; and (3) preservation with the bulk of the conference focusing on indexing and retrieval.published or submitted for publicatio
Feature Type Analysis in Automated Genre Classification
In this paper, we compare classifiers based on language model, image, and stylistic features for automated genre classification. The majority of previous studies in genre classification have created models based on an amalgamated representation of a document using a multitude of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. By independently modeling and comparing classifiers based on features belonging to three types, describing visual, stylistic, and topical properties, we demonstrate that different genres have distinctive feature strengths.
- …