
    Transfer Topic Labeling with Domain-Specific Knowledge Base: An Analysis of UK House of Commons Speeches 1935-2014

    Topic models are widely used in natural language processing, allowing researchers to estimate the underlying themes in a collection of documents. Most topic models use unsupervised methods and hence require the additional step of attaching meaningful labels to estimated topics. This process of manual labeling is not scalable and suffers from human bias. We present a semi-automatic transfer topic labeling method that seeks to remedy these problems. Domain-specific codebooks form the knowledge base for automated topic labeling. We demonstrate our approach with a dynamic topic model analysis of the complete corpus of UK House of Commons speeches 1935-2014, using the coding instructions of the Comparative Agendas Project to label topics. We show that our method works well for a majority of the topics we estimate, but we also find that institution-specific topics, in particular on subnational governance, require manual input. We validate our results using human expert coding.
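The core idea of transfer topic labeling can be sketched as matching a topic's top words against keyword sets derived from a codebook. A minimal illustration follows; the labels and keywords here are made-up placeholders, not the actual Comparative Agendas Project coding instructions, and real systems would use weighted overlap rather than raw counts.

```python
# Hypothetical codebook: each policy area maps to keywords drawn from its
# coding instructions. Labels and terms are illustrative only.
CODEBOOK = {
    "Health": {"nhs", "hospital", "patients", "doctors", "treatment"},
    "Defence": {"army", "navy", "troops", "weapons", "war"},
    "Education": {"schools", "teachers", "pupils", "curriculum", "university"},
}

def label_topic(top_words, codebook):
    """Assign the codebook label whose keyword set overlaps most with the
    topic's top words; return None when nothing overlaps, flagging the
    topic (e.g. subnational governance) for manual labeling."""
    scores = {label: len(set(top_words) & kws) for label, kws in codebook.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(label_topic(["nhs", "hospital", "patients", "waiting", "treatment"], CODEBOOK))
```

The fallback to `None` mirrors the paper's finding that institution-specific topics need human input when the codebook provides no match.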

    Reducing the effort for systematic reviews in software engineering

    Context: Systematic Reviews (SRs) are means for collecting and synthesizing evidence through the identification and analysis of relevant studies from multiple sources. To this aim, they use a well-defined methodology meant to mitigate the risk of bias and ensure repeatability for later updates. SRs, however, involve significant effort. Goal: The goal of this paper is to introduce a novel methodology that reduces the amount of tedious manual work involved in SRs while taking advantage of the value provided by human expertise. Method: Starting from current methodologies for SRs, we replaced the steps of keywording and data extraction with an automatic methodology for generating a domain ontology and classifying the primary studies. This methodology has been applied in the Software Engineering sub-area of Software Architecture and evaluated by human annotators. Results: The result is a novel Expert-Driven Automatic Methodology, EDAM, for assisting researchers in performing SRs. EDAM combines ontology-learning techniques and semantic technologies with the human-in-the-loop. The first (thanks to automation) fosters scalability, objectivity, reproducibility and granularity of the studies; the second allows tailoring to the specific focus of the study at hand and knowledge reuse from domain experts. We evaluated EDAM in the field of Software Architecture against six senior researchers. We found that the performance of the senior researchers in classifying papers was not statistically significantly different from EDAM's. Conclusions: Thanks to automation of the less-creative steps in SRs, our methodology allows researchers to skip the tedious tasks of keywording and manually classifying primary studies, thus freeing effort for the analysis and discussion.
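The classification step that replaces manual keywording can be sketched as matching each primary study's title or abstract against terms attached to ontology classes. This is a toy stand-in: the class names and term lists below are invented for illustration, not EDAM's learned ontology, and the real methodology derives these terms automatically via ontology learning.

```python
# Hypothetical mini-ontology for the Software Architecture sub-area.
# Classes and terms are placeholders, not EDAM's actual ontology.
ONTOLOGY = {
    "architecture styles": ["microservice", "layered", "client-server"],
    "architecture evaluation": ["atam", "scenario", "trade-off"],
}

def classify(text, ontology):
    """Assign a study to the ontology class whose terms occur most often
    in its title/abstract (case-insensitive substring counting)."""
    text = text.lower()
    hits = {cls: sum(text.count(t) for t in terms) for cls, terms in ontology.items()}
    return max(hits, key=hits.get)

print(classify("Scenario-based trade-off analysis of software architectures", ONTOLOGY))
```

In the human-in-the-loop setting the paper describes, experts would refine the ontology rather than re-classify every study by hand.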

    Topic modelling: Going beyond token outputs

    Topic modelling is a text mining technique for identifying salient themes across a collection of documents. The output is commonly a set of topics, each consisting of isolated tokens that often co-occur in those documents. Interpreting a topic's meaning from such tokens requires manual effort, and from a human's perspective the outputs often do not provide enough information to infer what a topic is about, limiting their interpretability. Although several studies have attempted to automatically extend topic descriptions as a means of enhancing the interpretation of topic models, they rely on external language sources that may become unavailable, must be kept up to date to generate relevant results, and raise privacy issues when training on or processing data. This paper presents a novel approach to extending the output of traditional topic modelling methods beyond a list of isolated tokens. The approach removes the dependence on external sources by using the textual data itself: high-scoring keywords are extracted and mapped to the topic model's token outputs. To benchmark the proposed method against the state of the art, a comparative analysis against results produced by Large Language Models (LLMs) is presented. The results show that the proposed method matches the thematic coverage found in LLMs and often surpasses such models by bridging the gap between broad thematic elements and granular details. To demonstrate and reinforce the generalisation of the proposed method, the approach was further evaluated using two other topic modelling methods as the underlying models and on a heterogeneous unseen dataset. To measure the interpretability of the proposed outputs against those of the traditional topic modelling approach, independent annotators manually scored each output on quality and usefulness, as well as the efficiency of the annotation task. The proposed approach demonstrated higher quality, usefulness, and annotation efficiency than the outputs of a traditional topic modelling method, indicating improved interpretability.
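The mapping step — extracting high-scoring keyphrases from the corpus itself and attaching them to a topic's tokens — can be illustrated with a simple frequency-based sketch. This uses plain bigram counts as a stand-in for the paper's keyword-scoring method; the documents and topic tokens are invented examples.

```python
from collections import Counter

# Toy corpus standing in for the documents the topic model was trained on.
docs = [
    "battery life battery charging phone",
    "screen resolution screen brightness phone",
    "battery drain overnight charging",
]

def extend_topic(topic_tokens, docs):
    """Extend a topic's isolated tokens with corpus bigrams that contain
    one of those tokens, ordered by bigram frequency. No external
    language source is consulted — only the corpus itself."""
    bigrams = Counter()
    for d in docs:
        words = d.split()
        bigrams.update(zip(words, words[1:]))
    return [" ".join(bg) for bg, _ in bigrams.most_common()
            if any(tok in bg for tok in topic_tokens)]

print(extend_topic(["battery"], docs))
```

A bare token like `battery` becomes phrases such as `battery life` and `battery charging`, which is the kind of broader-to-granular bridging the abstract describes.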

    Automatic Labeled Dialogue Generation for Nursing Record Systems

    The integration of digital voice assistants in nursing residences is becoming increasingly important for facilitating nursing documentation productivity. A key component of such a system is the natural language understanding (NLU) module, which classifies the purpose of a user utterance (intent) and extracts the valuable pieces of information it contains (entities). One of the main obstacles to building robust NLU is the lack of sufficient labeled data, which generally relies on human labeling. This process is cost-intensive and time-consuming, particularly in the high-level nursing care domain, which requires abstract knowledge. In this paper, we propose an automatic dialogue labeling framework for NLU tasks, specifically for nursing record systems. First, we apply data augmentation techniques to create a collection of variant sample utterances; individual evaluation shows strong results for both the fluency and accuracy of the utterances. We also investigate applying deep generative models to our augmented dataset: a preliminary character-based model using long short-term memory (LSTM) obtains an accuracy of 90% and generates varied, reasonable texts with BLEU scores of 0.76. Second, we introduce an approach to intent and entity labeling that uses feature embeddings and semantic similarity-based clustering, and we empirically evaluate different embedding methods to learn representations best suited to our data and clustering tasks. Experimental results show that fastText embeddings perform strongly for both intent and entity labeling, achieving f1-scores of 0.79 and 0.78 and silhouette scores of 0.67 and 0.61, respectively.
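The similarity-based labeling idea can be sketched as: embed each utterance, then assign the intent whose seed representation is closest in cosine similarity. The hand-made 3-d vectors and intent names below are illustrative placeholders for fastText embeddings and the paper's nursing-domain intents.

```python
import math

# Toy word vectors standing in for fastText embeddings (illustrative only).
EMB = {
    "record": [0.9, 0.1, 0.0], "vitals": [0.8, 0.2, 0.1],
    "schedule": [0.1, 0.9, 0.0], "visit": [0.2, 0.8, 0.1],
}

def embed(utterance):
    """Mean of the word vectors for the known words in the utterance."""
    vecs = [EMB[w] for w in utterance.split() if w in EMB]
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical intent seeds built from representative utterances.
INTENT_SEEDS = {"record_vitals": embed("record vitals"),
                "schedule_visit": embed("schedule visit")}

def label_intent(utterance):
    """Label an utterance with the most semantically similar intent."""
    v = embed(utterance)
    return max(INTENT_SEEDS, key=lambda intent: cosine(v, INTENT_SEEDS[intent]))

print(label_intent("record vitals"))
```

The paper clusters utterances rather than matching fixed seeds, but the underlying signal — semantic similarity in embedding space — is the same.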

    Aspect Extraction on Reviews Using a Combination of Latent Dirichlet Allocation and Global Vectors for Word Representation

    In today's digital era, information of all kinds can be found on social networks such as Facebook and Twitter, which have become venues for voicing opinions and posting reviews. Users typically do not review a product as a whole but comment on only some of its features, and each feature contains aspects that must be extracted, grouped into categories, and separated into different polarities. Aspect-based sentiment analysis can address this, and aspect extraction is a key task in that approach. This study focuses on aspect extraction and latent topic identification using unsupervised learning. LDA (Latent Dirichlet Allocation) is the most commonly used unsupervised approach and works well for discovering topics in large documents. However, LDA is less effective for aspect extraction on short texts such as reviews because of data sparsity, which produces incoherent and incompatible aspects and topics. To overcome this, we propose combining LDA with word embeddings. GloVe (Global Vectors for Word Representation) is a word embedding that combines the strengths of prediction-based and count-based embeddings. The proposed approach tests the influence of GloVe as the word embedding on LDA as the topic model. The data consist of education, e-commerce, and game reviews, processed with feature selection and clustered using LDA. The results show that the LDA-GloVe combination achieves higher coherence than other methods, with an improvement of up to 79.6%, indicating that word embeddings have a significant influence on LDA. Keywords: Aspect Extraction, Review, Latent Dirichlet Allocation, GloVe
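The coherence gain the abstract reports can be illustrated with embedding-based topic coherence: a topic whose words sit close together in GloVe-style vector space scores higher than a mixed topic. The 2-d vectors and words below are invented toy values, not real GloVe embeddings, and real coherence measures (e.g. over co-occurrence statistics) are more involved.

```python
import math

# Toy 2-d vectors standing in for GloVe embeddings (illustrative only).
VEC = {
    "price": [1.0, 0.1], "cheap": [0.9, 0.2], "discount": [0.95, 0.15],
    "teacher": [0.1, 1.0], "lecture": [0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def coherence(topic_words):
    """Mean pairwise cosine similarity of a topic's word vectors: higher
    means the topic is semantically tighter (more coherent)."""
    pairs = [(w1, w2) for i, w1 in enumerate(topic_words)
             for w2 in topic_words[i + 1:]]
    return sum(cosine(VEC[a], VEC[b]) for a, b in pairs) / len(pairs)

coherent = coherence(["price", "cheap", "discount"])   # tight e-commerce topic
mixed = coherence(["price", "teacher", "lecture"])     # incoherent mixture
print(coherent > mixed)
```

This is the effect the LDA-GloVe combination targets: pulling topic words toward semantically consistent groups despite the sparsity of short review texts.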