5 research outputs found
Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding
Mining a set of meaningful topics organized into a hierarchy is intuitively
appealing since topic correlations are ubiquitous in massive text corpora. To
account for potential hierarchical topic structures, hierarchical topic models
generalize flat topic models by incorporating latent topic hierarchies into
their generative modeling process. However, due to their purely unsupervised
nature, the learned topic hierarchy often deviates from users' particular needs
or interests. To guide the hierarchical topic discovery process with minimal
user supervision, we propose a new task, Hierarchical Topic Mining, which takes
a category tree described by category names only, and aims to mine a set of
representative terms for each category from a text corpus to help a user
comprehend topics of interest. We develop a novel joint tree and text
embedding method along with a principled optimization procedure that allows
simultaneous modeling of the category tree structure and the corpus generative
process in the spherical space for effective category-representative term
discovery. Our comprehensive experiments show that our model, named JoSH, mines
a high-quality set of hierarchical topics with high efficiency and benefits
weakly-supervised hierarchical text classification tasks.
Comment: KDD 2020 Research Track. (Code: https://github.com/yumeng5/JoSH)
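To make "representative terms in the spherical space" concrete, here is a minimal sketch, not JoSH itself: it assumes category and term embeddings are already available, projects them onto the unit sphere, and ranks terms by cosine similarity to a category vector. The function names and the toy vectors are hypothetical.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_terms(category_vec, term_vecs, k=3):
    """Rank terms by cosine similarity to a category embedding.

    Assumption: embeddings were learned elsewhere; this only does retrieval.
    """
    c = normalize(category_vec)
    sims = {
        term: sum(a * b for a, b in zip(c, normalize(vec)))
        for term, vec in term_vecs.items()
    }
    # Sort by descending similarity; break ties alphabetically.
    return sorted(sims, key=lambda t: (-sims[t], t))[:k]

# Toy 2-D embeddings (made up for illustration only).
term_vecs = {"soccer": [2.0, 0.2], "tennis": [1.0, 1.0], "stock": [0.0, 3.0]}
sports_terms = top_terms([1.0, 0.0], term_vecs, k=2)
```

On the sphere only direction matters, so the longer "soccer" vector gains no advantage from its magnitude.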
Disentangled Contrastive Learning for Learning Robust Textual Representations
Although the self-supervised pre-training of transformer models has
revolutionized natural language processing (NLP) applications and achieved
state-of-the-art results on various benchmarks,
this process is still vulnerable to small and imperceptible perturbations
originating from legitimate inputs. Intuitively, the representations should be
similar in the feature space under subtle input perturbations, while large
variations occur with different meanings. This motivates us to investigate the
learning of robust textual representation in a contrastive manner. However, it
is non-trivial to obtain opposing semantic instances for textual samples. In
this study, we propose a disentangled contrastive learning method that
separately optimizes the uniformity and alignment of representations without
negative sampling. Specifically, we introduce the concept of momentum
representation consistency to align features and leverage power normalization
while conforming the uniformity. Our experimental results for the NLP
benchmarks demonstrate that our approach can obtain better results compared
with the baselines, as well as achieve promising improvements with invariance
tests and adversarial attacks. The code is available at
https://github.com/zjunlp/DCL.
Comment: Work in progress
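As a rough illustration of the two quantities named in the abstract, the sketch below computes alignment and uniformity in the form commonly used in the contrastive-learning literature. It is an assumption that these match the paper's objectives; momentum representation consistency and power normalization are not shown, and the function names and the temperature `t=2.0` are illustrative choices.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(views_a, views_b):
    """Mean squared distance between paired (positive) embeddings.

    Lower values mean two views of the same input map close together.
    """
    pairs = list(zip(views_a, views_b))
    return sum(sq_dist(a, b) for a, b in pairs) / len(pairs)

def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower values mean the embeddings are spread more evenly.
    """
    total, count = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            total += math.exp(-t * sq_dist(embeddings[i], embeddings[j]))
            count += 1
    return math.log(total / count)

# Two antipodal unit vectors are maximally spread; identical ones are collapsed.
spread = uniformity([[1.0, 0.0], [-1.0, 0.0]])
collapsed = uniformity([[1.0, 0.0], [1.0, 0.0]])
```

Neither quantity needs explicit negative samples: alignment uses only positive pairs, and uniformity uses all pairs in a batch.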
Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts
Instead of mining coherent topics from a given text corpus in a completely
unsupervised manner, seed-guided topic discovery methods leverage user-provided
seed words to extract distinctive and coherent topics so that the mined topics
can better cater to the user's interest. To model the semantic correlation
between words and seeds for discovering topic-indicative terms, existing
seed-guided approaches utilize different types of context signals, such as
document-level word co-occurrences, sliding window-based local contexts, and
generic linguistic knowledge brought by pre-trained language models. In this
work, we analyze and show empirically that each type of context information has
its value and limitation in modeling word semantics under seed guidance, but
combining three types of contexts (i.e., word embeddings learned from local
contexts, pre-trained language model representations obtained from
general-domain training, and topic-indicative sentences retrieved based on seed
information) allows them to complement each other for discovering quality
topics. We propose an iterative framework, SeedTopicMine, which jointly learns
from the three types of contexts and gradually fuses their context signals via
an ensemble ranking process. Under various sets of seeds and on multiple
datasets, SeedTopicMine consistently yields more coherent and accurate topics
than existing seed-guided topic discovery approaches.
Comment: 9 pages; Accepted to WSDM 202
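One simple way to fuse term rankings from several context types is mean reciprocal rank, sketched below. This illustrates ensemble ranking in general and is not claimed to be the exact procedure used by SeedTopicMine. The function name and the three toy rankings are hypothetical.

```python
def ensemble_rank(rankings, top_k=None):
    """Fuse several ranked term lists by mean reciprocal rank.

    A term missing from a list contributes 0 for that list.
    """
    scores = {}
    for ranking in rankings:
        for position, term in enumerate(ranking, start=1):
            scores[term] = scores.get(term, 0.0) + 1.0 / position
    n = len(rankings)
    # Sort by descending mean reciprocal rank; break ties alphabetically.
    fused = sorted(scores, key=lambda term: (-scores[term] / n, term))
    return fused if top_k is None else fused[:top_k]

# Toy rankings standing in for the three context types (made up).
local_context = ["goal", "match", "league"]
language_model = ["match", "goal", "coach"]
seed_sentences = ["goal", "coach", "match"]
fused = ensemble_rank([local_context, language_model, seed_sentences])
```

Terms ranked highly by several sources rise to the top, while a term favoured by only one source ("league") falls behind.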
Hierarchical Metadata-Aware Document Categorization under Weak Supervision
Categorizing documents into a given label hierarchy is intuitively appealing
due to the ubiquity of hierarchical topic structures in massive text corpora.
Although related studies have achieved satisfying performance in fully
supervised hierarchical document classification, they usually require massive
human-annotated training data and only utilize text information. However, in
many domains, (1) annotations are expensive, so very few training
samples can be acquired; and (2) documents are accompanied by metadata information.
Hence, this paper studies how to integrate the label hierarchy, metadata, and
text signals for document categorization under weak supervision. We develop
HiMeCat, an embedding-based generative framework for our task. Specifically, we
propose a novel joint representation learning module that allows simultaneous
modeling of category dependencies, metadata information and textual semantics,
and we introduce a data augmentation module that hierarchically synthesizes
training documents to complement the original, small-scale training set. Our
experiments demonstrate a consistent improvement of HiMeCat over competitive
baselines and validate the contribution of our representation learning and data
augmentation modules.
Comment: 9 pages; Accepted to WSDM 202
Near Real-Time Sentiment and Topic Analysis of Sport Events
Sport events’ media consumption patterns have started transitioning to a multi-screen paradigm, where, through multitasking, viewers are able to search for additional information about the event they are watching live, as well as share their perspective of the event with other viewers. The audiovisual and multimedia industries, however, are failing to capitalize on this by not providing the sports teams and those in charge of the audiovisual production with insights on the final consumers’ perspective of sport events. In response to this opportunity, this document presents the development of a near real-time sentiment analysis tool and a near real-time topic analysis tool for sports-event-related social media content published during the transmission of the respective events, thus enabling, in near real-time, an understanding of the viewers’ sentiment and the topics being discussed during each event.