960 research outputs found

Minimally Supervised Categorization of Text with Metadata

Author: Chang Ming-Wei
Chen Xingyuan
Devlin Jacob
Gopal Siddharth
Kim Yoon
Lee Wang-Chien
Mekala Dheeraj
Mikolov Tomas
Rosen-Zvi Michal
Shi Chuan
Xiao Huiru
Zhang Yu
Publication date: 13/11/2021

Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite the success of existing studies in conventional supervised document classification, they are less concerned with two real problems: (1) the presence of metadata: in many domains, text is accompanied by various additional information such as authors and tags. Such metadata serve as compelling topic indicators and should be leveraged into the categorization framework; (2) label scarcity: labeled training samples are expensive to obtain in some cases, where categorization needs to be performed using only a small set of annotated data. In recognition of these two challenges, we propose MetaCat, a minimally supervised framework to categorize text with metadata. Specifically, we develop a generative process describing the relationships between words, documents, labels, and metadata. Guided by the generative model, we embed text and metadata into the same semantic space to encode heterogeneous signals. Then, based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity. We conduct a thorough evaluation on a wide range of datasets. Experimental results prove the effectiveness of MetaCat over many competitive baselines. Comment: 10 pages; Accepted to SIGIR 2020; Some typos fixed

arXiv.org e-Print Archive

Crossref

MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information

Author: Chen Xiusi
Garg Shweta
Han Jiawei
Meng Yu
Zhang Yu
Publication date: 11/01/2022

We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only and without any annotated training document provided. Most existing classifiers leverage textual information in each document. However, in many domains, documents are accompanied by various types of metadata (e.g., authors, venue, and year of a research paper). These metadata and their combinations may serve as strong category indicators in addition to textual contents. In this paper, we explore the potential of using metadata to help weakly supervised text classification. To be specific, we model the relationships between documents and metadata via a heterogeneous information network. To effectively capture higher-order structures in the network, we use motifs to describe metadata combinations. We propose a novel framework, named MotifClass, which (1) selects category-indicative motif instances, (2) retrieves and generates pseudo-labeled training samples based on category names and indicative motif instances, and (3) trains a text classifier using the pseudo training data. Extensive experiments on real-world datasets demonstrate the superior performance of MotifClass to existing weakly supervised text classification approaches. Further analysis shows the benefit of considering higher-order metadata information in our framework. Comment: 11 pages; Accepted to WSDM 202

arXiv.org e-Print Archive

Hierarchical Metadata-Aware Document Categorization under Weak Supervision

Author: Chen Xiusi
Han Jiawei
Meng Yu
Zhang Yu
Publication date: 19/12/2020

Categorizing documents into a given label hierarchy is intuitively appealing due to the ubiquity of hierarchical topic structures in massive text corpora. Although related studies have achieved satisfying performance in fully supervised hierarchical document classification, they usually require massive human-annotated training data and only utilize text information. However, in many domains, (1) annotations are quite expensive where very few training samples can be acquired; (2) documents are accompanied by metadata information. Hence, this paper studies how to integrate the label hierarchy, metadata, and text signals for document categorization under weak supervision. We develop HiMeCat, an embedding-based generative framework for our task. Specifically, we propose a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata information and textual semantics, and we introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set. Our experiments demonstrate a consistent improvement of HiMeCat over competitive baselines and validate the contribution of our representation learning and data augmentation modules. Comment: 9 pages; Accepted to WSDM 202

arXiv.org e-Print Archive

Final Presentation to the Library of Congress on Digital Libraries, Intelligent Data Analytics, and Augmented Description

Author: Liu Yi
Lorang Elizabeth
Pack Chulwoo
Soh Leen-Kiat
Publication venue: DigitalCommons@University of Nebraska - Lincoln
Publication date: 10/01/2020

This presentation to Library of Congress staff, delivered onsite on January 10, 2020, presents a tour through the demonstration project pursued by the Aida digital libraries research team with the Library of Congress in 2019-2020. In addition to providing an overview and analysis of the specific machine learning projects scoped and explored, this presentation includes a number of high-level take-aways and recommendations designed to influence and inform the Library of Congress's machine learning efforts going forward.

Text Mining for Information Systems Researchers: An Annotated Topic Modeling Tutorial

Author: Debortoli Stefan
Junglas Iris
Müller Oliver
vom Brocke Jan
Publication venue: Association for Information Systems
Publication date: 01/01/2016

Analysts have estimated that more than 80 percent of today’s data is stored in unstructured form (e.g., text, audio, image, video), much of it expressed in rich and ambiguous natural language. Traditionally, to analyze natural language, one has used qualitative data-analysis approaches, such as manual coding. Yet, the size of text data sets obtained from the Internet makes manual analysis virtually impossible. In this tutorial, we discuss the challenges encountered when applying automated text-mining techniques in information systems research. In particular, we showcase how to use probabilistic topic modeling via Latent Dirichlet allocation, an unsupervised text-mining technique, with a LASSO multinomial logistic regression to explain user satisfaction with an IT artifact by automatically analyzing more than 12,000 online customer reviews. For fellow information systems researchers, this tutorial provides guidance for conducting text-mining studies on their own and for evaluating the quality of others.

Crossref

The IT University of Copenhagen's Repository

AIS Electronic Library (AISeL)

Towards Adversarial Malware Detection: Lessons Learned from PDF-based Attacks

Author: Biggio Battista
Giacinto Giorgio
Maiorca Davide
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2019

Malware still constitutes a major threat in the cybersecurity landscape, partly due to the widespread use of infection vectors such as documents. These infection vectors hide embedded malicious code from victim users, facilitating the use of social engineering techniques to infect their machines. Research has shown that machine-learning algorithms provide effective detection mechanisms against such threats, but the existence of an arms race in adversarial settings has recently challenged such systems. In this work, we focus on malware embedded in PDF files as a representative case of such an arms race. We start by providing a comprehensive taxonomy of the different approaches used to generate PDF malware, and of the corresponding learning-based detection systems. We then categorize threats specifically targeted against learning-based PDF malware detectors, using a well-established framework in the field of adversarial machine learning. This framework allows us to categorize known vulnerabilities of learning-based PDF malware detectors and to identify novel attacks that may threaten such systems, along with the potential defense mechanisms that can mitigate the impact of such threats. We conclude the paper by discussing how such findings highlight promising research directions towards tackling the more general challenge of designing robust malware detectors in adversarial settings.

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Cagliari

Enriching product ads with Metadata from HTML annotations

Author: D Qiu
D Vandic
H Nguyen
M Bakker de
NV Chawla
R Ghani
R Meusel
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

Crossref

MAnnheim DOCument Server

Computational Linguistics and Natural Language Processing

Author: Luz Saturnino
Publication venue: Informa UK Limited
Publication date: 28/01/2022

This chapter provides an introduction to computational linguistics methods, with a focus on their applications to the practice and study of translation. It covers computational models, methods and tools for collection, storage, indexing and analysis of linguistic data in the context of translation, and discusses the main methodological issues and challenges in this field. While an exhaustive review of existing computational linguistics methods and tools is beyond the scope of this chapter, we describe the most representative approaches, and illustrate them with descriptions of typical applications. Comment: This is the unedited author's copy of a text which appeared as a chapter in "The Routledge Handbook of Translation and Methodology", edited by F Zanettin and C Rundle (2022)

arXiv.org e-Print Archive

Edinburgh Research Explorer

Do we still need gold standard for evaluation?

Author: Messiant Cédric
Poibeau Thierry
Publication venue: HAL CCSD
Publication date: 01/01/2008

This article deals with the evaluation of lexical resources.

HAL-Paris 13