
    Effective pattern discovery for text mining

    Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopt term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern-based (or phrase-based) approaches should perform better than term-based ones, but many experiments did not support this hypothesis. This paper presents an innovative technique, effective pattern discovery, which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.

    Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding

    Mining a set of meaningful topics organized into a hierarchy is intuitively appealing since topic correlations are ubiquitous in massive text corpora. To account for potential hierarchical topic structures, hierarchical topic models generalize flat topic models by incorporating latent topic hierarchies into their generative modeling process. However, due to their purely unsupervised nature, the learned topic hierarchy often deviates from users' particular needs or interests. To guide the hierarchical topic discovery process with minimal user supervision, we propose a new task, Hierarchical Topic Mining, which takes a category tree described by category names only, and aims to mine a set of representative terms for each category from a text corpus to help users comprehend the topics they are interested in. We develop a novel joint tree and text embedding method along with a principled optimization procedure that allows simultaneous modeling of the category tree structure and the corpus generative process in the spherical space for effective category-representative term discovery. Our comprehensive experiments show that our model, named JoSH, mines a high-quality set of hierarchical topics with high efficiency and benefits weakly-supervised hierarchical text classification tasks. Comment: KDD 2020 Research Track. (Code: https://github.com/yumeng5/JoSH)

    Mining Helpdesk Databases For Professional Development Topic Discovery

    This single-site, instrumental case study created and tested a methodological road map by which academic institutions can use text data mining techniques to derive technology skillset weaknesses and professional development topics from the site’s technical support helpdesk database. The methods employed were described in detail and applied to the helpdesk database of an independent, co-educational boarding high school in the northeastern United States. Standard text data mining procedures, including the formation of a wordlist (frequently occurring terms) and the creation and application of clustering (automated data grouping) and classification (automated data labeling) models, generated meaningful and revealing themes from the helpdesk database. The results of the text mining procedures were bolstered and analyzed using human interpretation and spreadsheet-based summaries. Major findings included the discovery of four prominent technologies that warranted professional development at the site and a universally applicable approach to undertaking successful helpdesk data mining endeavors. The case study’s conclusions included a call to action for researchers to leverage the methodology at other locations. Future data mining studies may yield practical and applicable knowledge at research sites. Shared methods, approaches, and findings from such studies will advance the field of helpdesk data mining used to glean professional development topics for the very people who have submitted technological support requests to helpdesk providers.
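
    The wordlist step mentioned in this abstract (counting frequently occurring terms) can be illustrated with a minimal Python sketch. The function name `build_wordlist` and the sample tickets are hypothetical and are not taken from the study, which does not state its tooling here.

```python
import re
from collections import Counter


def build_wordlist(tickets, top_n=3):
    """Return the top_n most frequent lowercase words across all tickets."""
    counts = Counter()
    for ticket in tickets:
        # Split each ticket into runs of letters after lowercasing.
        counts.update(re.findall(r"[a-z]+", ticket.lower()))
    return counts.most_common(top_n)


# Hypothetical helpdesk tickets, invented for illustration only.
sample_tickets = [
    "Projector will not turn on",
    "Cannot print to library printer",
    "Projector shows no signal",
]
top_terms = build_wordlist(sample_tickets, top_n=1)
```

    A real wordlist would normally also drop stop words before counting, so that words such as "to" and "on" do not crowd out the technology terms of interest.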

    Relevance of Health-Related Hashtags on Twitter: A Text Mining Approach

    BACKGROUND Social media platforms facilitate user interaction and impact decision making. Users prefer to use hashtags while sharing posts. Knowing the sentiment towards diabetes, blood pressure, and obesity is fundamental to understanding the impact of this information on patients and their families. The study seeks to determine the relevance of health-related hashtags on Twitter and analyze sentiments about diabetes, obesity, and blood pressure. METHOD Tweets were retrieved using synonyms for “diabetes”, “hypertension” and “obesity”. The extended knowledge discovery in data mining (KDDM) model guided our research with research objectives defined in the ‘research problem understanding’ phase. The ‘information seeking’ from Uses and Gratifications Theory (UGT) determined the success and text mining assessment criteria. Text pre-processing was done using tokenization, stop word removal, and stemming. The research objectives, text mining goals, and success criteria were answered using ‘Uses and Gratifications Theory’ (UGT). RESULTS A total of 6749 tweets were extracted using RStudio: 36.41% were about blood pressure, 0.25% about diabetes, 24.43% about obesity, and 6.99% about a combination of two or more terms. Additional topics such as cholesterol, chia seeds, postpartum, diet, and exercise were identified. Upcoming conferences like ‘#ipna’, ‘#review’, ‘#APCH2019’, ‘#cardiotwitter’ were identified. Increased user engagement about managing blood pressure, diabetes, and obesity across different age groups, as well as about the consequences of increased cardio exercise for obese and diabetic users, was encouraging. Tweets about advertisements specific to clothing for oversized individuals initiated conversation among users about monitoring their own health. CONCLUSIONS Sentiment analysis can thus increase our understanding of user engagement on such platforms and potentially help improve the strategic management of public health.
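
    The pre-processing steps named in this abstract (tokenization, stop word removal, and stemming) can be sketched as follows. This is an illustration in Python under stated assumptions, not the study's pipeline, which was run in RStudio: the stop word list is a tiny hypothetical one, and `naive_stem` is a simple suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real study would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "my", "of", "the", "to", "with"}

# Suffixes removed by the naive stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and '#'."""
    return re.findall(r"[a-z0-9#]+", text.lower())


def naive_stem(token):
    """Strip one common suffix if at least three characters remain (not Porter stemming)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


# Invented example sentence, not a tweet from the study.
tokens = preprocess("Managing my blood pressure with walking")
```

    Keeping `#` inside tokens is a design choice for hashtag analysis: it lets a hashtag such as `#cardiotwitter` survive tokenization as a single unit.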

    Data Science and Knowledge Discovery

    Data Science (DS) is gaining significant importance in the decision process due to a mix of various areas, including Computer Science, Machine Learning, Math and Statistics, domain/business knowledge, software development, and traditional research. In the business field, DS's application allows using scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data to support the decision process. After collecting the data, it is crucial to discover the knowledge. In this step, Knowledge Discovery (KD) tasks are used to create knowledge from structured and unstructured sources (e.g., text, data, and images). The output needs to be in a readable and interpretable format. It must represent knowledge in a manner that facilitates inferencing. KD is applied in several areas, such as education, health, accounting, energy, and public administration. This book includes fourteen excellent articles which discuss this trending topic and present innovative solutions to show the importance of Data Science and Knowledge Discovery to researchers, managers, industry, society, and other communities. The chapters address several topics such as Data Mining, Deep Learning, Data Visualization and Analytics, Semantic Data, Geospatial and Spatio-Temporal Data, Data Augmentation, and Text Mining.

    Using Perilog to Explore "Decision Making at NASA"

    Perilog, a context-intensive text mining system, is used as a discovery tool to explore topics and concerns in "Decision Making at NASA," chapter 6 of the Columbia Accident Investigation Board (CAIB) Report, Volume I. Two examples illustrate how Perilog can be used to discover highly significant safety-related information in the text without prior knowledge of the contents of the document. A third example illustrates how "if-then" statements found by Perilog can be used in logical analysis of decision making. In addition, to serve as a guide for future work, the technical details of preparing a PDF document for input to Perilog are included in an appendix.

    Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts

    Instead of mining coherent topics from a given text corpus in a completely unsupervised manner, seed-guided topic discovery methods leverage user-provided seed words to extract distinctive and coherent topics so that the mined topics can better cater to the user's interest. To model the semantic correlation between words and seeds for discovering topic-indicative terms, existing seed-guided approaches utilize different types of context signals, such as document-level word co-occurrences, sliding window-based local contexts, and generic linguistic knowledge brought by pre-trained language models. In this work, we analyze and show empirically that each type of context information has its value and limitation in modeling word semantics under seed guidance, but combining three types of contexts (i.e., word embeddings learned from local contexts, pre-trained language model representations obtained from general-domain training, and topic-indicative sentences retrieved based on seed information) allows them to complement each other for discovering quality topics. We propose an iterative framework, SeedTopicMine, which jointly learns from the three types of contexts and gradually fuses their context signals via an ensemble ranking process. Under various sets of seeds and on multiple datasets, SeedTopicMine consistently yields more coherent and accurate topics than existing seed-guided topic discovery approaches. Comment: 9 pages; Accepted to WSDM 202
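
    The idea of fusing several context signals through an ensemble ranking can be illustrated with a generic rank-fusion sketch. The abstract does not specify SeedTopicMine's fusion rule, so the mean reciprocal rank used below is an assumption chosen for illustration, and the function name `fuse_rankings` and the candidate terms are hypothetical.

```python
def fuse_rankings(rankings):
    """Order terms by mean reciprocal rank across all rankings.

    A term missing from a ranking contributes 0 for that ranking.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for position, term in enumerate(ranking, start=1):
            scores[term] = scores.get(term, 0.0) + 1.0 / position
    n = max(len(rankings), 1)
    return sorted(scores, key=lambda term: (-scores[term] / n, term))


# Hypothetical candidate terms for a "soccer" seed, ranked by three context types.
by_local_embeddings = ["goal", "match", "team"]
by_language_model = ["match", "goal", "league"]
by_retrieved_sentences = ["match", "team", "goal"]

fused = fuse_rankings([by_local_embeddings, by_language_model, by_retrieved_sentences])
```

    Reciprocal rank rewards terms that sit near the top of several lists, so a term ranked highly by only one context type does not dominate the fused result.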