14 research outputs found

    Combination Methods for Automatic Document Organization

    Automatic document classification and clustering are useful for a wide range of applications, such as organizing Web, intranet, or portal pages into topic directories, filtering news feeds or mail, focused crawling on the Web or in intranets, and many more. This thesis presents ensemble-based meta methods for supervised learning (i.e., classification based on a small amount of hand-annotated training documents). In addition, we show how these techniques can be carried over to clustering based on unsupervised learning (i.e., automatic structuring of document corpora without training data). The algorithms are applied in a restrictive manner, i.e., by leaving out some 'uncertain' documents (rather than assigning them to inappropriate topics or clusters with low confidence). We show how restrictive meta methods can be used to combine different document representations in the context of Web document classification and author recognition. As another application of meta methods, we study the combination of different information sources in distributed environments, such as peer-to-peer information systems. Furthermore, we address the problem of semi-supervised classification on document collections using retraining. A possible application is focused Web crawling, which may start with very few manually selected training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. The results of our systematic evaluation on real-world data show the viability of the proposed approaches.
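The restrictive idea in the abstract above (abstain on uncertain documents instead of forcing a low-confidence label) can be illustrated with a small voting rule. This is only a sketch under stated assumptions: the function name `restrictive_vote` and the `min_agreement` threshold are hypothetical and are not taken from the thesis, which may combine classifiers differently.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def restrictive_vote(
    predictions: Sequence[Hashable], min_agreement: float = 1.0
) -> Optional[Hashable]:
    """Combine the labels predicted by several classifiers for one document.

    Returns the most frequent label only if the fraction of classifiers
    voting for it reaches `min_agreement` (hypothetical parameter);
    otherwise returns None, meaning the document is left unassigned.
    """
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    if count / len(predictions) >= min_agreement:
        return label
    return None  # abstain on 'uncertain' documents
```

With the default `min_agreement=1.0` the rule is unanimous voting; lowering the threshold trades fewer abstentions for more low-confidence assignments.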

    Nomenclature and Contemporary Affirmation of the Unsupervised Learning in Text and Document Mining

    Document clustering is primarily a method for simplifying document search, analysis, and content review; it can also be described as the automatic classification of similar documents into relevant clusters within a clustering hierarchy. This paper reviews related work in the field of document clustering, from simple word- and phrase-based techniques to the present, more complex techniques based on statistical analysis, machine learning, etc., and discusses their implications for future research.

    A neighborhood-based approach for clustering of linked document collections

    This technical report addresses the problem of automatically structuring linked document collections by using clustering. In contrast to traditional clustering, we study the clustering problem in the light of available link structure information for the data set (e.g., hyperlinks among web documents or co-authorship among bibliographic data entries). Our approach is based on iterative relaxation of cluster assignments, and can be built on top of any clustering algorithm (e.g., k-means or DBSCAN). These techniques yield higher cluster purity and better overall accuracy, and they make self-organization more robust. Our comprehensive experiments on three different real-world corpora demonstrate the benefits of our approach.
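To make "iterative relaxation of cluster assignments" concrete, here is a minimal sketch of one possible relaxation scheme. The abstract does not give the report's actual update rule, so everything here is an assumption: the function name `relax_assignments`, the mixing weight `alpha`, and the rule itself (keep some weight on the base clusterer's label, spread the rest over the current labels of linked neighbors).

```python
from collections import Counter
from typing import Dict, List


def relax_assignments(
    initial: Dict[str, int],
    neighbors: Dict[str, List[str]],
    alpha: float = 0.5,
    max_iters: int = 10,
) -> Dict[str, int]:
    """Smooth base cluster labels over a link graph (illustrative only).

    initial:   label per document from any base clusterer (e.g. k-means).
    neighbors: linked documents per document; every neighbor is assumed
               to appear as a key in `initial`.
    alpha:     hypothetical weight of the document's own initial label.
    """
    labels = dict(initial)
    for _ in range(max_iters):
        updated = {}
        for doc, own_label in initial.items():
            scores = Counter({own_label: alpha})
            linked = neighbors.get(doc, [])
            for other in linked:
                # Each neighbor contributes an equal share of (1 - alpha).
                scores[labels[other]] += (1 - alpha) / len(linked)
            # Sorting first makes tie-breaking deterministic (lowest label).
            updated[doc] = max(sorted(scores), key=scores.__getitem__)
        if updated == labels:
            break  # assignments are stable
        labels = updated
    return labels
```

All documents are updated synchronously from the previous iteration's labels, so the result does not depend on the order in which documents are visited.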

    Clustering and its Application in Requirements Engineering

    Large scale software systems challenge almost every activity in the software development life-cycle, including tasks related to eliciting, analyzing, and specifying requirements. Fortunately many of these complexities can be addressed through clustering the requirements in order to create abstractions that are meaningful to human stakeholders. For example, the requirements elicitation process can be supported through dynamically clustering incoming stakeholders’ requests into themes. Cross-cutting concerns, which have a significant impact on the architectural design, can be identified through the use of fuzzy clustering techniques and metrics designed to detect when a theme cross-cuts the dominant decomposition of the system. Finally, traceability techniques, required in critical software projects by many regulatory bodies, can be automated and enhanced by the use of cluster-based information retrieval methods. Unfortunately, despite a significant body of work describing document clustering techniques, there is almost no prior work which directly addresses the challenges, constraints, and nuances of requirements clustering. As a result, the effectiveness of software engineering tools and processes that depend on requirements clustering is severely limited. This report directly addresses the problem of clustering requirements through surveying standard clustering techniques and discussing their application to the requirements clustering process

    Unsupervised document clustering by weighted combination

    This report proposes a novel unsupervised document clustering approach based on weighted combination of individual clusterings. Two non-weighted combination methods are adapted to work in a weighted fashion: a graph based method and a probability based one. The performance of the weighted approach is evaluated on real-world collections, and compared to that of individual clustering and non-weighted combination. The results of this evaluation confirm that graph based weighted combination consistently outperforms the other approaches.
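One common graph-style way to combine clusterings is a co-association matrix, which extends naturally to weights. The sketch below is an assumption-laden illustration, not the report's method: the function names, the normalization by total weight, and the thresholded connected-components step are all choices made here for the example.

```python
from typing import List, Sequence


def weighted_coassociation(
    clusterings: Sequence[Sequence[int]], weights: Sequence[float]
) -> List[List[float]]:
    """Entry [i][j] is the weight share of clusterings that put i and j together."""
    n = len(clusterings[0])
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, weight in zip(clusterings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += weight / total
    return matrix


def consensus_clusters(matrix: List[List[float]], threshold: float = 0.5) -> List[int]:
    """Connected components of the graph with edges where matrix[i][j] > threshold."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

In a usage such as `consensus_clusters(weighted_coassociation(clusterings, weights))`, giving a more trusted clustering a larger weight lets it outvote the others on pairs where they disagree.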

    Automatic Generation of Thematically Focused Information Portals from Web Data

    Finding the desired information on the Web is often a hard and time-consuming task. This thesis presents a methodology for automatically generating thematically focused portals from Web data. The key component of the proposed Web retrieval framework is a thematically focused Web crawler that is interested only in a specific, typically small, set of topics. The focused crawler uses classification methods to filter fetched documents and to identify the most likely relevant Web sources for further downloads. We show that the human effort needed to prepare a focused crawl can be minimized by automatically extending the training dataset with additional training samples called archetypes. This thesis introduces the combination of classification results and link-based authority ranking methods for selecting archetypes, together with periodic retraining of the classifier. We also explain the architecture of the focused Web retrieval framework and discuss the results of comprehensive use-case studies and evaluations with a prototype system, BINGO!. Furthermore, the thesis addresses aspects of crawl postprocessing, such as refinements of the topic structure and restrictive document filtering. We introduce postprocessing methods and meta methods that are applied in a restrictive manner, i.e., by leaving out some uncertain documents rather than assigning them to inappropriate topics or clusters with low confidence. We also introduce a methodology for collaborative crawl postprocessing for multiple cooperating users in a distributed environment, such as a peer-to-peer overlay network. An important aspect of a thematically focused Web portal is the ranking of search results. This thesis addresses search personalization by aggregating explicit or implicit feedback from multiple users and capturing topic-specific search patterns in profiles. Furthermore, we consider advanced link-based authority ranking algorithms that exploit crawl-specific information, such as classification confidence grades for particular documents. This is achieved by weighting the edges in the link graph of the crawl and by adding virtual links between highly relevant documents of a topic. The results of our systematic evaluation on multiple reference collections and real Web data show the viability of the proposed methodology.
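The edge-weighting idea in the abstract above can be illustrated with a weighted variant of PageRank computed by power iteration. This is a sketch, not the thesis's algorithm: the function name `weighted_pagerank`, the damping value, and the interpretation of an edge weight as, for example, the classifier's confidence in the target page are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def weighted_pagerank(
    edges: List[Tuple[str, str, float]], damping: float = 0.85, iters: int = 50
) -> Dict[str, float]:
    """Power iteration where a page splits its rank in proportion to edge weights.

    edges: (source, target, weight) triples; the weight is hypothetical,
           e.g. a classification confidence grade for the target document.
    """
    nodes = sorted({node for src, dst, _ in edges for node in (src, dst)})
    out_weight = {node: 0.0 for node in nodes}
    for src, _, weight in edges:
        out_weight[src] += weight
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iters):
        # Rank held by pages without weighted out-links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for src, dst, weight in edges:
            if out_weight[src] > 0:
                new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank
```

Virtual links between highly relevant documents, as mentioned in the abstract, could be modeled in this sketch simply as extra `(source, target, weight)` triples.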