960 research outputs found

    Minimally Supervised Categorization of Text with Metadata

    Full text link
    Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite the success of existing studies in conventional supervised document classification, they are less concerned with two real problems: (1) the presence of metadata: in many domains, text is accompanied by various additional information such as authors and tags. Such metadata serve as compelling topic indicators and should be leveraged into the categorization framework; (2) label scarcity: labeled training samples are expensive to obtain in some cases, where categorization needs to be performed using only a small set of annotated data. In recognition of these two challenges, we propose MetaCat, a minimally supervised framework to categorize text with metadata. Specifically, we develop a generative process describing the relationships between words, documents, labels, and metadata. Guided by the generative model, we embed text and metadata into the same semantic space to encode heterogeneous signals. Then, based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity. We conduct a thorough evaluation on a wide range of datasets. Experimental results prove the effectiveness of MetaCat over many competitive baselines.Comment: 10 pages; Accepted to SIGIR 2020; Some typos fixe

    MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information

    Full text link
    We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only and without any annotated training document provided. Most existing classifiers leverage textual information in each document. However, in many domains, documents are accompanied by various types of metadata (e.g., authors, venue, and year of a research paper). These metadata and their combinations may serve as strong category indicators in addition to textual contents. In this paper, we explore the potential of using metadata to help weakly supervised text classification. To be specific, we model the relationships between documents and metadata via a heterogeneous information network. To effectively capture higher-order structures in the network, we use motifs to describe metadata combinations. We propose a novel framework, named MotifClass, which (1) selects category-indicative motif instances, (2) retrieves and generates pseudo-labeled training samples based on category names and indicative motif instances, and (3) trains a text classifier using the pseudo training data. Extensive experiments on real-world datasets demonstrate the superior performance of MotifClass to existing weakly supervised text classification approaches. Further analysis shows the benefit of considering higher-order metadata information in our framework.Comment: 11 pages; Accepted to WSDM 202

    Hierarchical Metadata-Aware Document Categorization under Weak Supervision

    Full text link
    Categorizing documents into a given label hierarchy is intuitively appealing due to the ubiquity of hierarchical topic structures in massive text corpora. Although related studies have achieved satisfying performance in fully supervised hierarchical document classification, they usually require massive human-annotated training data and only utilize text information. However, in many domains, (1) annotations are quite expensive where very few training samples can be acquired; (2) documents are accompanied by metadata information. Hence, this paper studies how to integrate the label hierarchy, metadata, and text signals for document categorization under weak supervision. We develop HiMeCat, an embedding-based generative framework for our task. Specifically, we propose a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata information and textual semantics, and we introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set. Our experiments demonstrate a consistent improvement of HiMeCat over competitive baselines and validate the contribution of our representation learning and data augmentation modules.Comment: 9 pages; Accepted to WSDM 202

    Final Presentation to the Library of Congress on Digital Libraries, Intelligent Data Analytics, and Augmented Description

    Get PDF
    This presentation to Library of Congress staff, delivered onsite on January 10, 2020, presents a tour through the demonstration project pursued by the Aida digital libraries research team with the Library of Congress in 2019-2020. In addition to providing an overview and analysis of the specific machine learning projects scoped and explored, this presentation includes a number of high-level take-aways and recommendations designed to influence and inform the Library of Congress\u27s machine learning efforts going forward

    Text Mining for Information Systems Researchers: An Annotated Topic Modeling Tutorial

    Get PDF
    Analysts have estimated that more than 80 percent of today’s data is stored in unstructured form (e.g., text, audio, image, video)—much of it expressed in rich and ambiguous natural language. Traditionally, to analyze natural language, one has used qualitative data-analysis approaches, such as manual coding. Yet, the size of text data sets obtained from the Internet makes manual analysis virtually impossible. In this tutorial, we discuss the challenges encountered when applying automated text-mining techniques in information systems research. In particular, we showcase how to use probabilistic topic modeling via Latent Dirichlet allocation, an unsupervised text-mining technique, with a LASSO multinomial logistic regression to explain user satisfaction with an IT artifact by automatically analyzing more than 12,000 online customer reviews. For fellow information systems researchers, this tutorial provides guidance for conducting text-mining studies on their own and for evaluating the quality of others

    Towards Adversarial Malware Detection: Lessons Learned from PDF-based Attacks

    Full text link
    Malware still constitutes a major threat in the cybersecurity landscape, also due to the widespread use of infection vectors such as documents. These infection vectors hide embedded malicious code to the victim users, facilitating the use of social engineering techniques to infect their machines. Research showed that machine-learning algorithms provide effective detection mechanisms against such threats, but the existence of an arms race in adversarial settings has recently challenged such systems. In this work, we focus on malware embedded in PDF files as a representative case of such an arms race. We start by providing a comprehensive taxonomy of the different approaches used to generate PDF malware, and of the corresponding learning-based detection systems. We then categorize threats specifically targeted against learning-based PDF malware detectors, using a well-established framework in the field of adversarial machine learning. This framework allows us to categorize known vulnerabilities of learning-based PDF malware detectors and to identify novel attacks that may threaten such systems, along with the potential defense mechanisms that can mitigate the impact of such threats. We conclude the paper by discussing how such findings highlight promising research directions towards tackling the more general challenge of designing robust malware detectors in adversarial settings

    Enriching product ads with Metadata from HTML annotations

    Full text link

    Computational Linguistics and Natural Language Processing

    Get PDF
    This chapter provides an introduction to computational linguistics methods, with focus on their applications to the practice and study of translation. It covers computational models, methods and tools for collection, storage, indexing and analysis of linguistic data in the context of translation, and discusses the main methodological issues and challenges in this field. While an exhaustive review of existing computational linguistics methods and tools is beyond the scope of this chapter, we describe the most representative approaches, and illustrate them with descriptions of typical applications.Comment: This is the unedited author's copy of a text which appeared as a chapter in "The Routledge Handbook of Translation and Methodology'', edited by F Zanettin and C Rundle (2022

    Do we still need gold standard for evaluation ?

    No full text
    Cet article traite de l'évaluation des ressources lexicales
    • …
    corecore