
    Combination Methods for Automatic Document Organization

    Automatic document classification and clustering are useful for a wide range of applications such as organizing Web, intranet, or portal pages into topic directories, filtering news feeds or mail, focused crawling on the Web or in intranets, and many more. This thesis presents ensemble-based meta methods for supervised learning (i.e., classification based on a small amount of hand-annotated training documents). In addition, we show how these techniques can be carried forward to clustering based on unsupervised learning (i.e., automatic structuring of document corpora without training data). The algorithms are applied in a restrictive manner, i.e., by leaving out some 'uncertain' documents (rather than assigning them to inappropriate topics or clusters with low confidence). We show how restrictive meta methods can be used to combine different document representations in the context of Web document classification and author recognition. As another application for meta methods, we study the combination of different information sources in distributed environments, such as peer-to-peer information systems. Furthermore, we address the problem of semi-supervised classification on document collections using retraining. A possible application is focused Web crawling, which may start with very few manually selected training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. The results of our systematic evaluation on real-world data show the viability of the proposed approaches.
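
The restrictive idea described in this abstract (abstaining on 'uncertain' documents instead of forcing a low-confidence label) can be illustrated with a minimal sketch. It is not the thesis's actual algorithm: the function name `restrictive_vote`, the `min_agreement` threshold, and the toy keyword classifiers are all assumptions made up for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# A classifier here is any function mapping a document to a label (assumption).
Classifier = Callable[[str], str]


def restrictive_vote(
    classifiers: Sequence[Classifier],
    document: str,
    min_agreement: float = 1.0,
) -> Optional[str]:
    """Return the majority label, or None (abstain) if agreement is too low.

    min_agreement=1.0 demands a unanimous vote; lower values are less
    restrictive. Both the name and the threshold are hypothetical.
    """
    if not classifiers:
        raise ValueError("at least one classifier is required")
    votes = Counter(clf(document) for clf in classifiers)
    label, count = votes.most_common(1)[0]
    if count / len(classifiers) >= min_agreement:
        return label
    return None  # leave the 'uncertain' document unassigned


def by_keyword(keyword: str, label: str, default: str = "other") -> Classifier:
    """Toy classifier: predicts `label` if `keyword` occurs in the text."""
    return lambda doc: label if keyword in doc.lower() else default


ensemble = [
    by_keyword("goal", "sports"),
    by_keyword("match", "sports"),
    by_keyword("team", "sports"),
]
```

With this toy ensemble, a document on which all three members agree receives a label, while a document that splits the vote is left out unless `min_agreement` is lowered.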

    Dublin City University at CLEF 2007: Cross-Language Speech Retrieval Experiments

    The Dublin City University participation in the CLEF 2007 CL-SR English task concentrated primarily on issues of topic translation. Our retrieval system used the BM25F model and pseudo relevance feedback. Topics were translated into English using the Yahoo! BabelFish free online service combined with domain-specific translation lexicons gathered automatically from Wikipedia. We explored alternative topic translation methods using these resources. Our results indicate that extending machine translation tools using automatically generated domain-specific translation lexicons can provide improved CLIR effectiveness for this task.

    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. Comment: Accepted for publication in ACM Computing Surveys.

    Characterizing Question Facets for Complex Answer Retrieval

    Complex answer retrieval (CAR) is the process of retrieving answers to questions that have multifaceted or nuanced answers. In this work, we present two novel approaches for CAR based on the observation that question facets can vary in utility: from structural (facets that can apply to many similar topics, such as 'History') to topical (facets that are specific to the question's topic, such as the 'Westward expansion' of the United States). We first explore a way to incorporate facet utility into ranking models during query term score combination. We then explore a general approach to reform the structure of ranking models to aid in learning of facet utility in the query-document term matching phase. When we use our techniques with a leading neural ranker on the TREC CAR dataset, our methods rank first in the 2017 TREC CAR benchmark, and yield up to 26% higher performance than the next best method. Comment: 4 pages; SIGIR 2018 Short Paper.

    Generating indicative-informative summaries with SumUM

    We present and evaluate SumUM, a text summarization system that takes a raw technical text as input and produces an indicative-informative summary. The indicative part of the summary identifies the topics of the document, and the informative part elaborates on some of these topics according to the reader's interest. SumUM motivates the topics, describes entities, and defines concepts. It is a first step for exploring the issue of dynamic summarization. This is accomplished through a process of shallow syntactic and semantic analysis, concept identification, and text regeneration. Our method was developed through the study of a corpus of abstracts written by professional abstractors. Relying on human judgment, we have evaluated indicativeness, informativeness, and text acceptability of the automatic summaries. The results thus far indicate good performance when compared with other summarization technologies.

    New learning strategies for automatic text categorization.

    Lai Kwok-yin. Thesis (M.Phil.), Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 125-130). Abstracts in English and Chinese.

    Contents:
    Chapter 1: Introduction
        1.1 Automatic Textual Document Categorization
        1.2 Meta-Learning Approach For Text Categorization
        1.3 Contributions
        1.4 Organization of the Thesis
    Chapter 2: Related Work
        2.1 Existing Automatic Document Categorization Approaches
        2.2 Existing Meta-Learning Approaches For Information Retrieval
        2.3 Our Meta-Learning Approaches
    Chapter 3: Document Pre-Processing
        3.1 Document Representation
        3.2 Classification Scheme Learning Strategy
    Chapter 4: Linear Combination Approach
        4.1 Overview
        4.2 Linear Combination Approach - The Algorithm
            4.2.1 Equal Weighting Strategy
            4.2.2 Weighting Strategy Based On Utility Measure
            4.2.3 Weighting Strategy Based On Document Rank
        4.3 Comparisons of Linear Combination Approach and Existing Meta-Learning Methods
            4.3.1 LC versus Simple Majority Voting
            4.3.2 LC versus BORG
            4.3.3 LC versus Restricted Linear Combination Method
    Chapter 5: The New Meta-Learning Model - MUDOF
        5.1 Overview
        5.2 Document Feature Characteristics
        5.3 Classification Errors
        5.4 Linear Regression Model
        5.5 The MUDOF Algorithm
    Chapter 6: Incorporating MUDOF into Linear Combination approach
        6.1 Background
        6.2 Overview of MUDOF2
        6.3 Major Components of the MUDOF2
        6.4 The MUDOF2 Algorithm
    Chapter 7: Experimental Setup
        7.1 Document Collection
        7.2 Evaluation Metric
        7.3 Component Classification Algorithms
        7.4 Categorical Document Feature Characteristics for MUDOF and MUDOF2
    Chapter 8: Experimental Results and Analysis
        8.1 Performance of Linear Combination Approach
        8.2 Performance of the MUDOF Approach
        8.3 Performance of MUDOF2 Approach
    Chapter 9: Conclusions and Future Work
        9.1 Conclusions
        9.2 Future Work
    Appendix A: Details of Experimental Results for Reuters-21578 corpus
    Appendix B: Details of Experimental Results for OHSUMED corpus
    Bibliography
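
The outline names a linear combination approach with an equal weighting strategy but gives no formulas, so the following is only a generic sketch of what a weighted linear combination of classifier scores usually looks like. The function name `linear_combination`, the per-category score dictionaries, and the example weights are assumptions, not details taken from the thesis.

```python
from typing import Dict, Optional, Sequence


def linear_combination(
    scores: Sequence[Dict[str, float]],
    weights: Optional[Sequence[float]] = None,
) -> Dict[str, float]:
    """Combine per-category scores from several classifiers.

    scores[i] maps category -> score from component classifier i.
    With weights=None every classifier gets the same weight 1/n
    (an equal weighting strategy); otherwise the caller supplies
    one weight per classifier, e.g. derived from validation results.
    """
    if not scores:
        raise ValueError("at least one score dictionary is required")
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("need exactly one weight per classifier")

    combined: Dict[str, float] = {}
    for weight, classifier_scores in zip(weights, scores):
        for category, score in classifier_scores.items():
            combined[category] = combined.get(category, 0.0) + weight * score
    return combined


# Hypothetical scores from two component classifiers for one document.
component_scores = [
    {"earn": 0.9, "grain": 0.2},
    {"earn": 0.5, "grain": 0.6},
]
equal = linear_combination(component_scores)
skewed = linear_combination(component_scores, weights=[0.25, 0.75])
best_category = max(equal, key=equal.get)
```

The category with the highest combined score would then be assigned to the document; how the weights are chosen is exactly where the strategies listed in Chapter 4.2 would differ.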

    Smartphone picture organization: a hierarchical approach

    We live in a society where the large majority of the population has a camera-equipped smartphone. In addition, hard drives and cloud storage are getting cheaper and cheaper, leading to a tremendous growth in stored personal photos. Unlike photo collections captured by a digital camera, which typically are pre-processed by the user who organizes them into event-related folders, smartphone pictures are automatically stored in the cloud. As a consequence, photo collections captured by a smartphone are highly unstructured and, because smartphones are ubiquitous, they present a larger variability compared to pictures captured by a digital camera. To address the need to organize large smartphone photo collections automatically, we propose here a new methodology for hierarchical photo organization into topics and topic-related categories. Our approach successfully estimates latent topics in the pictures by applying probabilistic Latent Semantic Analysis, and automatically assigns a name to each topic by relying on a lexical database. Topic-related categories are then estimated by using a set of topic-specific Convolutional Neural Networks. To validate our approach, we assemble and make public a large dataset of more than 8,000 smartphone pictures from 40 persons. Experimental results demonstrate higher user satisfaction than state-of-the-art solutions in terms of organization. Peer reviewed. Preprint.