
    TERM WEIGHTING BASED ON INDEX OF GENRE FOR WEB PAGE GENRE CLASSIFICATION

    Automating the identification of web page genre has become an important area of web page classification, as it can improve the quality of web search results and reduce search time. To index the terms used in classification, the weighting scheme generally chosen is document-based TF-IDF. However, this method does not consider genre, even though web page documents carry a type of categorization called genre. Given genre, a term appearing often within one genre should be more significant in document indexing than a term appearing frequently across many genres, despite its high TF-IDF value. We propose a new weighting method for web page document indexing called inverse genre frequency (IGF). This method is based on genre, a manual categorization done semantically in previous research. Experimental results show that term weighting based on index of genre (TF-IGF) performed better than term weighting based on index of document (TF-IDF). The highest accuracy, precision, recall, and F-measure were 78%, 80.2%, 78%, and 77.4% respectively when genre-specific keywords were excluded, and 78.9%, 78.7%, 78.9%, and 78.1% respectively when they were included.
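The abstract above does not state the IGF formula. The following is a minimal sketch that assumes IGF mirrors IDF, that is, IGF(t) = log(G / gf(t)), where G is the number of genres and gf(t) is the number of genres in which term t occurs. The function name `tf_igf`, its inputs, and the length-normalised term frequency are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def tf_igf(docs, genres):
    """Weight terms by TF * IGF under an assumed IDF-like definition of IGF.

    docs   : list of token lists, one per web page
    genres : list of genre labels, parallel to docs
    Returns (per-document weight dicts, IGF dict).
    """
    # Record the set of genres in which each term occurs.
    term_genres = defaultdict(set)
    for tokens, genre in zip(docs, genres):
        for term in set(tokens):
            term_genres[term].add(genre)

    n_genres = len(set(genres))
    # Assumption: IGF(t) = log(G / gf(t)), by analogy with IDF.
    igf = {t: math.log(n_genres / len(gs)) for t, gs in term_genres.items()}

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        # Term frequency is normalised by document length (an assumption).
        weights.append({t: (c / len(tokens)) * igf[t] for t, c in counts.items()})
    return weights, igf


docs = [
    ["buy", "cart", "price"],
    ["news", "report", "price"],
    ["buy", "cart", "checkout"],
]
genres = ["shop", "news", "shop"]
weights, igf = tf_igf(docs, genres)
```

Under this assumed definition, a term such as "price" that occurs in every genre receives zero weight, while "buy", which is confined to one genre, keeps a positive weight even though it appears in two of the three documents.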

    Retrieval Models for Genre Classification

    Genre provides a characterization of a document with respect to its form or functional trait. Genre is orthogonal to topic, rendering genre information a powerful filter technology for information seekers in digital libraries. However, an efficient means for genre classification is an open and controversially discussed issue. This paper gives an overview and presents new results related to automatic genre classification of text documents. We present a comprehensive survey which contrasts the genre retrieval models that have been developed for Web and non-Web corpora. With the concept of genre-specific core vocabularies, the paper provides an original contribution related to computational aspects and classification performance of genre retrieval models: we show how such vocabularies are acquired automatically and introduce new concentration measures that quantify the vocabulary distribution in a sensible way. Based on these findings we construct lightweight genre retrieval models and evaluate their discriminative power and computational efficiency. The presented concepts go beyond the existing utilization of vocabulary-centered, genre-revealing features and open new possibilities for the construction of genre classifiers that operate in real time.

    Human Annotation and Automatic Detection of Web Genres

    Texts differ from each other in various dimensions such as topic, sentiment, authorship and genre. In this thesis, the dimension of text variation of interest is genre. Unlike topic classification, genre classification focuses on the functional purpose of documents and classifies them into categories such as news, review, online shop, personal home page and conversational forum. In other words, genre classification allows the identification of documents that are similar in terms of purpose, even if they are topically very diverse. Research on web genres has been motivated by the idea that finding information on the web can be made easier and more effective by automatic classification techniques that differentiate among web documents with respect to their genres. Following this idea, during the past two decades, researchers have investigated the performance of various genre classification algorithms in order to enhance search engines. As a result, current web automatic genre identification research has produced several genre-annotated web corpora as well as a variety of supervised machine learning algorithms on these corpora. However, previous research suffers from shortcomings in corpus collection and annotation (in particular, low human reliability in genre annotation), which makes the supervised machine learning results hard to assess and compare to each other, as no reliable benchmarks exist. This thesis addresses this shortcoming. First, we built the Leeds Web Genre Corpus Balanced-design (LWGC-B), which is the first reliably annotated corpus for web genres, using crowd-sourcing for genre annotation. This corpus, which was compiled by a focused search method, overcomes the drawbacks of previous genre annotation efforts such as low inter-coder agreement and false correlation between genre and topic classes. Second, we use this corpus as a benchmark to determine the best features for closed-set supervised machine learning of web genres. Third, we enhance the prevailing supervised machine learning paradigm by using semi-supervised graph-based approaches that make use of the graph structure of the web to improve classification results. Fourth, we successfully extended our annotation method to the Leeds Web Genre Corpus Random (LWGC-R), where the pages to be annotated are collected randomly by querying search engines. This randomly collected corpus also allowed us to investigate coverage of the underlying genre inventory. The result shows that our 15 genre categories are sufficient to cover the majority but not the vast majority of the random web pages. The unique property of the LWGC-R corpus (i.e. having web pages that do not belong to any of the predefined genre classes, which we refer to as noise) allowed us to evaluate, for the first time, the performance of an open-set genre classification algorithm on a dataset with noise. The outcome of this experiment indicates that automatic open-set genre classification is a much more challenging task than closed-set genre classification due to noise. The results also show that automatic detection of some genre classes is more robust to noise than detection of others.

    Functional genre in Illinois State Government digital documents

    Provisions for collecting or archiving digital documents can be informed by knowledge of the genres of documents likely to be encountered. Although different aspects of collecting and curation may classify documents into genres based on differing criteria (e.g., size, file format, subject), this document addresses classification based on the functional role the document plays in state government, akin to (Toms, 2001), but here specifically Illinois State Government (ISG). The classifications listed herein are based on an overview of ISG digital documents, encountered in over nine years of gathering and archiving work with and for the Illinois State Library (ISL), and on discussions with practitioners in cataloging and in government documents librarianship. This report states definitions and includes examples of each such genre. State government documents are interesting in this regard in that they are presumably somewhat comparable to both federal government documents and business documents. Perhaps surprisingly, there are also portions of the State Web that are somewhat less than businesslike, either in tone or in technological proficiency of implementation. In this respect state government digital documents may also be useful approximations to documents produced either personally or by small activities. Having a list of government document genres can inform work in information promulgation (e.g., through website design, or the design of a series of printed materials), and the grouping of documents for digital library or archival purposes. Library of Congress / NDIIPP-2 A6075; unpublished; not peer reviewed.

    Variation of word frequencies across genre classification tasks

    This paper examines automated genre classification of text documents and its role in enabling the effective management of digital documents by digital libraries and other repositories. Genre classification, which narrows down the possible structure of a document, is a valuable step in realising the general automatic extraction of semantic metadata essential to the efficient management and use of digital objects. In the present report, we present an analysis of word frequencies in different genre classes in an effort to understand the distinction between independent classification tasks. In particular, we examine automated experiments on thirty-one genre classes to determine the relationship between the word frequency metrics and the degree of their significance in carrying out classification in varying environments.

    Examining Variations of Prominent Features in Genre Classification.

    This paper investigates the correlation between features of three types (visual, stylistic and topical types) and genre classes. The majority of previous studies in automated genre classification have created models based on an amalgamated representation of a document using a combination of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. In this paper we use classifiers independently modeled on three groups of features to examine six genre classes, showing that the strongest features for making one classification are not necessarily the best features for carrying out another classification.

    Detecting Family Resemblance: Automated Genre Classification.

    This paper presents results in automated genre classification of digital documents in PDF format. It describes genre classification as an important ingredient in contextualising scientific data and in retrieving targeted material for improving research. The current paper compares the role of visual layout, stylistic features and language model features in clustering documents and presents results in retrieving five selected genres (Scientific Article, Thesis, Periodicals, Business Report, and Form) from a pool of materials populated with documents of the nineteen most popular genres found in our experimental data set.

    Feature Type Analysis in Automated Genre Classification

    In this paper, we compare classifiers based on language model, image, and stylistic features for automated genre classification. The majority of previous studies in genre classification have created models based on an amalgamated representation of a document using a multitude of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. By independently modeling and comparing classifiers based on features belonging to three types, describing visual, stylistic, and topical properties, we demonstrate that different genres have distinctive feature strengths.