5 research outputs found

    Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding

    Mining a set of meaningful topics organized into a hierarchy is intuitively appealing since topic correlations are ubiquitous in massive text corpora. To account for potential hierarchical topic structures, hierarchical topic models generalize flat topic models by incorporating latent topic hierarchies into their generative modeling process. However, due to their purely unsupervised nature, the learned topic hierarchy often deviates from users' particular needs or interests. To guide the hierarchical topic discovery process with minimal user supervision, we propose a new task, Hierarchical Topic Mining, which takes a category tree described by category names only and aims to mine a set of representative terms for each category from a text corpus, helping users comprehend their topics of interest. We develop a novel joint tree and text embedding method along with a principled optimization procedure that allows simultaneous modeling of the category tree structure and the corpus generative process in the spherical space for effective category-representative term discovery. Our comprehensive experiments show that our model, named JoSH, mines a high-quality set of hierarchical topics with high efficiency and benefits weakly-supervised hierarchical text classification tasks. Comment: KDD 2020 Research Track. (Code: https://github.com/yumeng5/JoSH)

    Disentangled Contrastive Learning for Learning Robust Textual Representations

    Although the self-supervised pre-training of transformer models has revolutionized natural language processing (NLP) applications and achieved state-of-the-art results on various benchmarks, this process is still vulnerable to small and imperceptible perturbations of legitimate inputs. Intuitively, representations should remain similar in the feature space under subtle input perturbations, while large variations should occur only when the meaning differs. This motivates us to investigate the learning of robust textual representations in a contrastive manner. However, it is non-trivial to obtain opposing semantic instances for textual samples. In this study, we propose a disentangled contrastive learning method that separately optimizes the uniformity and alignment of representations without negative sampling. Specifically, we introduce the concept of momentum representation consistency to align features and leverage power normalization to enforce uniformity. Our experimental results on NLP benchmarks demonstrate that our approach obtains better results than the baselines and achieves promising improvements on invariance tests and adversarial attacks. The code is available at https://github.com/zjunlp/DCL. Comment: Work in progress
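
    The alignment and uniformity objectives named in this abstract can be illustrated with a minimal sketch. It assumes the generic definitions commonly used for embeddings on the unit hypersphere (mean squared distance between paired views, and the log of the mean Gaussian potential over all pairs); it is not the DCL implementation, and it omits momentum representation consistency and power normalization. The function names and the temperature `t` are illustrative assumptions.

```python
import math
from itertools import combinations


def l2_normalize(vector):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(anchors, positives):
    """Mean squared distance between each embedding and its perturbed view.

    Lower is better: a sentence and its slightly perturbed copy should map
    to nearby points.
    """
    pairs = list(zip(anchors, positives))
    return sum(squared_distance(a, p) for a, p in pairs) / len(pairs)


def uniformity_loss(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower is better: embeddings that spread out over the sphere have small
    pairwise potentials. The temperature t is an assumed hyperparameter.
    """
    potentials = [
        math.exp(-t * squared_distance(a, b))
        for a, b in combinations(embeddings, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


if __name__ == "__main__":
    clean = [l2_normalize(v) for v in ([1.0, 0.1], [-1.0, 0.2], [0.1, 1.0])]
    noisy = [l2_normalize(v) for v in ([1.0, 0.15], [-1.0, 0.25], [0.15, 1.0])]
    print("alignment:", alignment_loss(clean, noisy))
    print("uniformity:", uniformity_loss(clean))
```

    Both quantities are computed from the embeddings alone, so neither needs explicitly sampled negative examples, which matches the motivation stated in the abstract.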

    Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts

    Instead of mining coherent topics from a given text corpus in a completely unsupervised manner, seed-guided topic discovery methods leverage user-provided seed words to extract distinctive and coherent topics so that the mined topics can better cater to the user's interest. To model the semantic correlation between words and seeds for discovering topic-indicative terms, existing seed-guided approaches utilize different types of context signals, such as document-level word co-occurrences, sliding window-based local contexts, and generic linguistic knowledge brought by pre-trained language models. In this work, we analyze and show empirically that each type of context information has its own value and limitations in modeling word semantics under seed guidance, but combining three types of contexts (i.e., word embeddings learned from local contexts, pre-trained language model representations obtained from general-domain training, and topic-indicative sentences retrieved based on seed information) allows them to complement each other for discovering quality topics. We propose an iterative framework, SeedTopicMine, which jointly learns from the three types of contexts and gradually fuses their context signals via an ensemble ranking process. Under various sets of seeds and on multiple datasets, SeedTopicMine consistently yields more coherent and accurate topics than existing seed-guided topic discovery approaches. Comment: 9 pages; Accepted to WSDM 202
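
    One way to picture the ensemble ranking step mentioned in this abstract is a mean reciprocal rank fusion over per-context term rankings. The sketch below is an assumption made for illustration, not the SeedTopicMine algorithm: the function name `ensemble_rank`, the scoring rule, and the example terms are all hypothetical.

```python
def ensemble_rank(rankings, top_k=None):
    """Fuse several ranked term lists by mean reciprocal rank.

    Each input list is ordered from most to least topic-indicative according
    to one context type. A term missing from a list contributes zero for
    that list. Ties are broken alphabetically so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for position, term in enumerate(ranking, start=1):
            scores[term] = scores.get(term, 0.0) + 1.0 / position

    count = len(rankings)
    fused = sorted(scores, key=lambda term: (-scores[term] / count, term))
    return fused if top_k is None else fused[:top_k]


if __name__ == "__main__":
    # Hypothetical rankings for one seed, one list per context type.
    local_embedding = ["goal", "match", "team"]
    language_model = ["match", "goal", "league"]
    retrieved_sentences = ["match", "team", "goal"]
    print(ensemble_rank([local_embedding, language_model, retrieved_sentences]))
```

    Terms ranked highly by several context types rise to the top, while a term favored by only one context type is discounted, which is the complementary effect the abstract describes.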

    Hierarchical Metadata-Aware Document Categorization under Weak Supervision

    Categorizing documents into a given label hierarchy is intuitively appealing due to the ubiquity of hierarchical topic structures in massive text corpora. Although related studies have achieved satisfactory performance in fully supervised hierarchical document classification, they usually require massive human-annotated training data and only utilize text information. However, in many domains, (1) annotations are expensive, so very few training samples can be acquired; and (2) documents are accompanied by metadata information. Hence, this paper studies how to integrate the label hierarchy, metadata, and text signals for document categorization under weak supervision. We develop HiMeCat, an embedding-based generative framework for our task. Specifically, we propose a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata information, and textual semantics, and we introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set. Our experiments demonstrate a consistent improvement of HiMeCat over competitive baselines and validate the contribution of our representation learning and data augmentation modules. Comment: 9 pages; Accepted to WSDM 202

    Near Real-Time Sentiment and Topic Analysis of Sport Events

    Media consumption patterns for sports events have started transitioning to a multi-screen paradigm in which, through multitasking, viewers can search for additional information about the event they are watching live and share their perspective on the event with other viewers. The audiovisual and multimedia industries, however, are failing to capitalize on this: they do not provide sports teams and those in charge of audiovisual production with insights into the final consumers' perspective on sports events. In response to this opportunity, this document presents the development of a near real-time sentiment analysis tool and a near real-time topic analysis tool for social media content related to sports events and published during the transmission of those events, thus enabling, in near real time, an understanding of viewers' sentiment and of the topics being discussed throughout each event.