Retrieval Models for Genre Classification

Author: Sven Meyer zu Eissen
Benno Stein
Publication venue: AIS Electronic Library (AISeL)
Publication date: 01/01/2008
Field of study

Genre characterizes a document with respect to its form or functional trait. Genre is orthogonal to topic, which makes genre information a powerful filter for information seekers in digital libraries. However, how to classify genre efficiently remains an open and controversially discussed issue. This paper gives an overview and presents new results related to the automatic genre classification of text documents. We present a comprehensive survey that contrasts the genre retrieval models developed for Web and non-Web corpora. With the concept of genre-specific core vocabularies, the paper makes an original contribution to the computational aspects and classification performance of genre retrieval models: we show how such vocabularies are acquired automatically and introduce new concentration measures that quantify the vocabulary distribution in a sensible way. Based on these findings we construct lightweight genre retrieval models and evaluate their discriminative power and computational efficiency. The presented concepts go beyond the existing use of vocabulary-centered, genre-revealing features and open new possibilities for constructing genre classifiers that operate in real time.

CiteSeerX

AIS Electronic Library (AISeL)

Service-orientierte Architekturen für Information Retrieval (Service-Oriented Architectures for Information Retrieval)

Author: Sven Meyer zu Eissen
Benno Stein
Publication venue
Publication date: 28/04/2011
Field of study

This paper gives an introduction to TIRA, a software architecture for building tailor-made information retrieval tools. TIRA lets users specify the processing pipeline of a desired IR tool interactively as a graph: the nodes of the graph denote so-called "basic IR services", and the edges model control and data flows. TIRA provides the functionality of a runtime container that executes the specified processing pipelines in a distributed environment. One motivation for our research is the challenge of personalization: there is a discrepancy between IR theory and its algorithms on the one hand and the implementation, distribution, and execution of the corresponding programs, adapted to personal wishes, on the other. This gap can be narrowed with adequate software engineering.
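The graph idea in this abstract can be illustrated with a small sketch. It is not TIRA's API: the function `run_pipeline`, the service names, and the rule that a node without predecessors receives the raw input are all assumptions made for illustration. The sketch runs locally in one process and models data flow only, whereas the abstract describes distributed execution and control flow as well.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(services, edges, source_input):
    """Run a graph of processing services in dependency order.

    services: dict mapping a node name to a callable (hypothetical "basic IR services")
    edges: list of (source_node, target_node) pairs modelling data flow
    source_input: value passed to every node that has no predecessor (an assumption)
    """
    predecessors = {name: [] for name in services}
    for src, dst in edges:
        predecessors[dst].append(src)

    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(predecessors).static_order():
        inputs = [results[p] for p in predecessors[name]] or [source_input]
        results[name] = services[name](*inputs)
    return results


# Hypothetical services, invented for this example.
services = {
    "tokenize": lambda text: text.lower().split(),
    "remove_stopwords": lambda tokens: [t for t in tokens if t not in {"the", "a"}],
    "count": lambda tokens: len(tokens),
}
edges = [("tokenize", "remove_stopwords"), ("remove_stopwords", "count")]
results = run_pipeline(services, edges, "The cat chased a mouse")
```

Representing the pipeline as data (a dict of services plus an edge list) rather than as nested function calls is what would let a tool rearrange it interactively.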

University of Hildesheim

Near Similarity Search and Plagiarism Analysis

Author: Benno Stein
Sven Meyer zu Eissen
Publication venue
Publication date
Field of study

Abstract. Existing methods for text plagiarism analysis are mainly based on “chunking”, a process of grouping a text into meaningful units, each of which is encoded by an integer. Together these numbers form a document’s signature or fingerprint. An overlap between two documents’ fingerprints indicates a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which entails two problems: (i) it is computationally expensive, and (ii) a small chunk size must be chosen to identify matching passages, which further increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage. This paper proposes a new class of fingerprints that can be considered an abstraction of the classical vector space model. These fingerprints operationalize the concept of “near similarity” and enable one to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our fingerprints leads to a speed-up by a factor of five and higher, without compromising the recall performance.
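As a rough illustration of the classical chunk-and-hash baseline that this abstract criticizes (not the new fingerprints it proposes), the sketch below hashes overlapping word n-grams with MD5 and compares the resulting sets. Treating a chunk as a sliding window of `chunk_size` words, lowercasing the text, and normalizing the overlap by the smaller fingerprint are assumptions; the abstract does not say how chunks are formed or compared.

```python
import hashlib


def chunk_words(text, chunk_size):
    """Split text into overlapping word n-grams (one assumed notion of a 'chunk')."""
    words = text.lower().split()
    last_start = len(words) - chunk_size + 1
    return [" ".join(words[i:i + chunk_size]) for i in range(max(last_start, 0))]


def fingerprint(text, chunk_size=3):
    """Encode every chunk as an integer via MD5; the set of integers is the fingerprint."""
    return {
        int(hashlib.md5(chunk.encode("utf-8")).hexdigest(), 16)
        for chunk in chunk_words(text, chunk_size)
    }


def overlap(text_a, text_b, chunk_size=3):
    """Share of chunk hashes in common, relative to the smaller fingerprint."""
    fp_a, fp_b = fingerprint(text_a, chunk_size), fingerprint(text_b, chunk_size)
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))


original = "the quick brown fox jumps over the lazy dog"
suspicious = "yesterday the quick brown fox jumps over a fence"
shared = fingerprint(original) & fingerprint(suspicious)
```

Because each hash matches only on an identical chunk, a single changed word breaks every chunk that contains it. This is why small chunk sizes are needed to find matching passages, and why the number of hashes to compute, compare, and store grows.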

CiteSeerX

Genre Classification of Web Pages: User Study and Feasibility Analysis

Author: Benno Stein
Sven Meyer zu Eissen
Publication venue: Springer
Publication date
Field of study

Genre classification means discriminating between documents by their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents’ contents. While most existing investigations of automated genre classification are based on corpora of news articles, the idea is applied here to arbitrary Web pages. We see genre classification as a powerful instrument to bring Web-based search services closer to a user’s information need. This objective raises two questions: (1) What are useful genres when searching the WWW? (2) Can these genres be reliably identified? The paper at hand presents results from a user study on Web genre usefulness as well as results from the construction of a genre classifier using discriminant analysis, neural network learning, and support vector machines. Particular attention is paid to a classifier’s underlying feature set: aside from the standard feature types, we introduce new features that are based on word frequency classes and that can be computed with minimum computational effort. They allow us to construct compact feature sets with few elements, with which a satisfactory genre diversification is achieved. About 70% of the Web documents are assigned to their true genre; note in this connection that no genre classification benchmark for Web pages has been published so far.
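The abstract does not define word frequency classes. One definition commonly used in corpus linguistics assigns a word w the class floor(log2(f(w*) / f(w))), where f is the corpus frequency and w* is the most frequent word in the corpus. The sketch below assumes that definition and derives a single document feature from it (the mean class of a document's known words). Both function names and the choice of feature are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def word_frequency_classes(corpus_tokens):
    """Map each word to floor(log2(f(w*) / f(w))); the most frequent word gets class 0."""
    counts = Counter(corpus_tokens)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: math.floor(math.log2(top / freq)) for word, freq in counts.items()}


def mean_frequency_class(doc_tokens, classes):
    """Average frequency class over the document's tokens that occur in the corpus."""
    known = [classes[t] for t in doc_tokens if t in classes]
    return sum(known) / len(known) if known else 0.0


# Toy reference corpus, invented for this example.
corpus = ["the"] * 8 + ["cat"] * 4 + ["sat"]
classes = word_frequency_classes(corpus)
feature = mean_frequency_class(["the", "sat", "unseen"], classes)
```

Once the class table has been built from a reference corpus, computing the feature for a document takes one dictionary lookup per token, which fits the abstract's remark about minimum computational effort.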

CiteSeerX

Document Categorization with MajorClust

Author: Benno Stein
Sven Meyer zu Eissen
Publication venue
Publication date
Field of study

Abstract. This paper investigates the text categorization capabilities of two special clustering algorithms: Fuzzy k-Medoid and MAJORCLUST. Aside from quantifying the categorization performance of these algorithms, our experimental setting also helps to answer special questions related to clustering problems, such as cluster number determination and cluster quality evaluation.

CiteSeerX

Automatic Document Categorization: Interpreting the Performance of Clustering Algorithms

Author: Benno Stein
Sven Meyer zu Eissen
Publication venue: Springer
Publication date
Field of study

Abstract. Clustering a document collection is the current approach to automatically deriving underlying document categories. The categorization performance of a document clustering algorithm can be captured by the F-Measure, which quantifies how closely a human-defined categorization has been reproduced. However, a bad F-Measure value tells us nothing about why a clustering algorithm performs poorly. Among several possible explanations, the most interesting question is the following: are the implicit assumptions of the clustering algorithm admissible with respect to a document categorization task? Though the use of clustering algorithms for document categorization is widely accepted, no foundation or rationale has been stated for this admissibility question. The paper at hand is devoted to this gap. It presents considerations and a measure to quantify the sensitivity of a clustering process with regard to geometric distortions of the data space. Along with the method of multidimensional scaling, this measure provides an instrument for assessing a clustering algorithm’s adequacy.
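The abstract does not say which F-Measure variant is used. A widely used one for clusterings matches every human-defined class with its best-fitting cluster and averages the resulting F values weighted by class size. The sketch below assumes that variant; the function name and the toy labels are illustrative.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-size-weighted average of the best per-class F value over all clusters."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    joint = Counter(zip(true_labels, cluster_ids))  # documents per (class, cluster) pair

    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            n_ij = joint[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / clu_size
            recall = n_ij / cls_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (cls_size / n) * best
    return total


labels = ["a", "a", "b", "b"]
perfect = clustering_f_measure(labels, [0, 0, 1, 1])
merged = clustering_f_measure(labels, [0, 0, 0, 0])
```

The value is a single number: putting all documents in one cluster lowers it here, but the number alone does not reveal whether the cause is the algorithm's assumptions or the data, which is the gap the abstract addresses.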

CiteSeerX