14 research outputs found

    Combining Thesaurus Knowledge and Probabilistic Topic Models

    Full text link
    In this paper we present an approach for introducing thesaurus knowledge into probabilistic topic models. The main idea is based on the assumption that the frequencies of semantically related words and phrases that occur in the same texts should be enhanced: this increases their contribution to the topics found in those texts. We have conducted experiments with several thesauri and found that, for improving topic models, it is useful to utilize domain-specific knowledge. If a general thesaurus, such as WordNet, is used, the thesaurus-based improvement of topic models can be achieved by excluding hyponymy relations in combined topic models. Comment: Accepted to the AIST-2017 conference (http://aistconf.ru/). The final publication will be available at link.springer.com
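
    The core mechanism is a pre-processing step on word counts. Below is a minimal sketch of that frequency-boosting idea, assuming the thesaurus is given as a map from each word to its set of related words; the boost rule and the weight alpha are illustrative assumptions, not the authors' exact scheme.

        # Minimal sketch: boost a word's count for every thesaurus-related
        # word that co-occurs in the same document (assumed scheme).
        from collections import Counter

        def boost_counts(doc_tokens, related, alpha=0.5):
            counts = Counter(doc_tokens)
            boosted = Counter(counts)
            vocab = set(counts)
            for word in counts:
                partners = related.get(word, set()) & vocab
                # each co-occurring related word adds alpha to this word's weight
                boosted[word] += alpha * len(partners)
            return boosted

        # toy thesaurus with symmetric relatedness sets
        related = {"car": {"automobile"}, "automobile": {"car"}}
        print(boost_counts(["car", "automobile", "road"], related))

    The boosted counts can then be fed to any count-based topic model in place of the raw term frequencies.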

    Social Media Mining in Drug Development Decision Making: Prioritizing Multiple Sclerosis Patients’ Unmet Medical Needs

    Get PDF
    Pharmaceutical companies increasingly must consider patients’ needs in drug development. Since patients’ needs are often difficult to measure, especially in rare diseases, the information available for drug development decision-making is limited. In the proposed study, we employ the opportunity algorithm to identify and prioritize unmet medical needs of multiple sclerosis patients shared in social media posts. The features of the opportunity algorithm are generated using topic modeling and sentiment analysis. The results imply that sensory problems, pain, mental health problems, fatigue, and sleep disturbances represent the highest unmet medical needs of the sample population. The present study suggests a promising potential of this method to provide relevant insights into rare disease populations and promote patient-centered drug development.
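
    The opportunity algorithm ranks needs with the score opportunity = importance + max(importance - satisfaction, 0), with both inputs on a 0-10 scale. A minimal sketch follows; mapping topic prevalence to importance and mean sentiment to satisfaction is an illustrative reading of the abstract, and the numbers are placeholders.

        # Hedged sketch of the opportunity algorithm's scoring rule.
        def opportunity_score(importance, satisfaction):
            # both inputs scaled to 0..10
            return importance + max(importance - satisfaction, 0.0)

        needs = {
            # need: (topic prevalence as importance, mean sentiment as satisfaction)
            "fatigue":        (8.2, 2.1),
            "sleep problems": (6.9, 3.4),
        }
        ranked = sorted(needs, key=lambda n: -opportunity_score(*needs[n]))
        for n in ranked:
            print(n, round(opportunity_score(*needs[n]), 1))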

    How can semantic annotation help us to analyse the discourse of climate change in online user comments?

    Get PDF
    User comments in response to newspaper articles published online offer a unique resource for studying online discourse. The number of comments that articles often elicit poses many methodological challenges, and analyses of online user comments have inevitably been cursory when limited to manual content or thematic analysis. Corpus analysis tools can systematically identify features such as keywords in large datasets. This article reports on the semantic annotation feature of the corpus analysis tool Wmatrix, which also allows us to identify key semantic domains. Building on this feature, I introduce a novel method of sampling key comments through an examination of user comment threads taken from The Guardian website on the topic of climate change.
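
    Keyness in tools of this kind is typically measured with the log-likelihood (G2) statistic, comparing the frequency of a word or semantic tag in the study corpus against a reference corpus. The sketch below is an assumption about that general method, not Wmatrix's internal code.

        # Log-likelihood (G2) keyness, the measure Wmatrix reports.
        import math

        def log_likelihood(a, b, c, d):
            """a, b: tag frequencies in study/reference corpus;
            c, d: total sizes of the two corpora."""
            e1 = c * (a + b) / (c + d)   # expected frequency, study corpus
            e2 = d * (a + b) / (c + d)   # expected frequency, reference corpus
            ll = 0.0
            if a > 0:
                ll += a * math.log(a / e1)
            if b > 0:
                ll += b * math.log(b / e2)
            return 2 * ll

        # e.g. a semantic tag seen 120 times in 50k comment tokens
        # vs. 40 times in a 100k-token reference corpus
        print(round(log_likelihood(120, 40, 50_000, 100_000), 2))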

    My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections

    Full text link
    Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary. We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences of multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In comparison to state-of-the-art cross-collection topic modeling, our model achieves up to 13% higher topic coherence, up to 4% lower perplexity, and up to 31% higher document classification accuracy. More importantly, our approach is the first topic model that ensures disjunct general and specific word distributions, resulting in clear-cut topic representations.
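
    The entropy criterion can be illustrated compactly: a word spread evenly across collections has high entropy and is treated as collection-independent, while a word concentrated in one collection has low entropy and is treated as collection-specific. The threshold in this sketch is an illustrative assumption.

        # Sketch of the information-entropy criterion for word status.
        import math

        def collection_entropy(counts):
            """counts: occurrences of one word in each collection."""
            total = sum(counts)
            probs = [c / total for c in counts if c > 0]
            return -sum(p * math.log2(p) for p in probs)

        max_h = math.log2(3)  # three collections in this toy example
        for word, counts in {"model": (40, 38, 41), "claim": (90, 2, 1)}.items():
            h = collection_entropy(counts)
            kind = "independent" if h > 0.8 * max_h else "specific"
            print(word, round(h, 2), kind)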

    Identifying the subjects of documents in text collections using compound terms

    Get PDF
    Unlike information retrieval problems, in which users know what they are looking for, users sometimes need a more general understanding of the subjects covered in a collection in order to explore the documents of interest. For each group or topic obtained, a set of descriptors is selected from among the collection's terms, and it is up to the user to identify the subject of each group from the presented list of descriptors. Normally, the set of descriptors is composed of single terms. However, many terms take on a meaning of their own when combined with one another. Producing a term list whose construction already takes compound terms into account can reduce the effort required to understand the identified subjects. This article proposes an approach for identifying subjects in document collections that combines association rule and data clustering techniques. Association rules are applied to extract compound terms, forming the local context of the relationship between terms. These rules are represented in a bag-of-words structure whose dimensions are the same as those of the bag-of-words produced from the document collection, and they are clustered, forming the general context of the relationships. The idea is that information about the neighborhood of the extracted compound terms helps to identify (a) different terms used in the same context or with the same meaning and (b) identical terms used in different contexts or with different meanings. The evaluation results indicate that using compound terms with the proposed approach improves subject identification in the evaluated document collections. CAPES (process DS-6345378/D); FAPESP (process 2014/08996-0)
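
    A minimal sketch of the compound-term extraction step, using the association-rule measures support and confidence on adjacent term pairs; the thresholds and the underscore-joined output format are illustrative assumptions, not the paper's settings.

        # Keep adjacent term pairs whose support and confidence clear a threshold.
        from collections import Counter

        def compound_terms(docs, min_support=2, min_conf=0.5):
            uni, bi = Counter(), Counter()
            for doc in docs:
                uni.update(doc)
                bi.update(zip(doc, doc[1:]))
            kept = []
            for (w1, w2), n in bi.items():
                # support: pair frequency; confidence: P(w2 follows | w1)
                if n >= min_support and n / uni[w1] >= min_conf:
                    kept.append(f"{w1}_{w2}")
            return kept

        docs = [["machine", "learning", "rocks"],
                ["machine", "learning", "models"],
                ["machine", "shop"]]
        print(compound_terms(docs))  # ['machine_learning'] with these thresholds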

    Feature Augmentation for Improved Topic Modeling of YouTube Lecture Videos using Latent Dirichlet Allocation

    Get PDF
    The application of topic models in text mining of educational data, and more specifically the text data obtained from lecture videos, is a largely unexplored area of research that holds great potential. This work seeks empirical evidence for an improvement in topic modeling from pre-extracting bigram tokens and adding them as additional features in the Latent Dirichlet Allocation (LDA) algorithm, a widely recognized topic modeling technique. The dataset considered for analysis is a collection of transcripts of video lectures on machine learning scraped from YouTube. Using the cosine similarity distance measure as a metric, the experiment showed a statistically significant improvement in topic model performance over the baseline topic model, which did not use extra features, thus confirming the hypothesis. By introducing explainable features before modeling and using deep learning-based text representation only at the post-modeling evaluation stage, the overall model interpretability is retained. This empowers educators and researchers alike not only to benefit from the LDA model in their own fields but also to play a substantial role in efforts to improve model performance. It also sets the direction for future work, which could use the feature-augmented topic model as input to other common text mining tasks such as document categorization and information retrieval.
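
    A minimal sketch of the bigram-augmentation step using gensim's Phrases model, appending detected bigram tokens alongside the original unigrams before fitting LDA; the toy corpus and hyperparameters are placeholders, not the paper's setup.

        # Detect bigrams, add them as extra tokens, then fit LDA.
        from gensim.models import LdaModel, Phrases
        from gensim.corpora import Dictionary

        docs = [["gradient", "descent", "updates", "weights"],
                ["stochastic", "gradient", "descent", "converges"]]

        bigram = Phrases(docs, min_count=1, threshold=1)
        # append detected bigram tokens (joined with "_") to each document
        augmented = [doc + [t for t in bigram[doc] if "_" in t] for doc in docs]

        dictionary = Dictionary(augmented)
        corpus = [dictionary.doc2bow(doc) for doc in augmented]
        lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
        print(lda.print_topics())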

    Understanding barriers to novel data linkages: topic modeling of the results of the LifeInfo survey

    Get PDF
    Novel consumer and lifestyle data, such as those collected by supermarket loyalty cards or mobile phone exercise tracking apps, offer numerous benefits for researchers seeking to understand diet- and exercise-related risk factors for diseases. However, limited research has addressed public attitudes toward linking these data with individual health records for research purposes. Data linkage, combining data from multiple sources, provides the opportunity to enhance preexisting data sets to gain new insights. The aim of this study is to identify key barriers to data linkage and recommend safeguards and procedures that would encourage individuals to share such data for potential future research. The LifeInfo Survey consulted the public on their attitudes toward sharing consumer and lifestyle data for research purposes. Where barriers to data sharing existed, participants provided unstructured survey responses detailing what would make them more likely to share data for linkage with their health records in the future. The topic modeling technique latent Dirichlet allocation was used to analyze these textual responses and uncover common thematic topics within the texts. Participants provided responses related to sharing their store loyalty card data (n=2338) and health and fitness app data (n=1531). Key barriers to data sharing identified through topic modeling included data safety and security, personal privacy, requirements for further information, fear of data being accessed by others, problems with data accuracy, not understanding the reason for data linkage, and not using services that produce these data. We provide recommendations for addressing these issues to establish best practice for future researchers interested in using these data. This study presents a large-scale consultation of public attitudes toward this kind of data linkage, which is an important first step in understanding and addressing barriers to participation in research using novel consumer and lifestyle data. [Abstract copyright: ©Holly Clarke, Stephen Clark, Mark Birkin, Heather Iles-Smith, Adam Glaser, Michelle A Morris. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 17.05.2021.]
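
    A minimal sketch of applying latent Dirichlet allocation to short free-text survey answers, here with scikit-learn; the example responses and topic count are placeholders, not the survey data.

        # Vectorize free-text answers and fit LDA, then list top terms per topic.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        responses = [
            "I worry about data security and who can access it",
            "more information about why my data is linked",
            "privacy is my main concern with loyalty card data",
        ]
        vec = CountVectorizer(stop_words="english")
        X = vec.fit_transform(responses)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

        terms = vec.get_feature_names_out()
        for k, comp in enumerate(lda.components_):
            top = [terms[i] for i in comp.argsort()[-3:][::-1]]
            print(f"topic {k}: {top}")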