Effective pattern discovery for text mining
Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase) based approaches should perform better than the term-based ones, but many experiments did not support this hypothesis. This paper presents an innovative technique, effective pattern discovery, which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.
Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding
Mining a set of meaningful topics organized into a hierarchy is intuitively
appealing since topic correlations are ubiquitous in massive text corpora. To
account for potential hierarchical topic structures, hierarchical topic models
generalize flat topic models by incorporating latent topic hierarchies into
their generative modeling process. However, due to their purely unsupervised
nature, the learned topic hierarchy often deviates from users' particular needs
or interests. To guide the hierarchical topic discovery process with minimal
user supervision, we propose a new task, Hierarchical Topic Mining, which takes
a category tree described by category names only, and aims to mine a set of
representative terms for each category from a text corpus to help a user
comprehend his or her topics of interest. We develop a novel joint tree and text
embedding method along with a principled optimization procedure that allows
simultaneous modeling of the category tree structure and the corpus generative
process in the spherical space for effective category-representative term
discovery. Our comprehensive experiments show that our model, named JoSH, mines
a high-quality set of hierarchical topics with high efficiency and benefits
weakly-supervised hierarchical text classification tasks. Comment: KDD 2020 Research Track. (Code: https://github.com/yumeng5/JoSH)
Mining Helpdesk Databases For Professional Development Topic Discovery
This single-site, instrumental case study created and tested a methodological road map by which academic institutions can use text data mining techniques to derive technology skillset weaknesses and professional development topics from the site’s technical support helpdesk database. The methods employed were described in detail and applied to the helpdesk database of an independent, co-educational boarding high school in the northeastern United States. Standard text data mining procedures, including the formation of a wordlist (frequently occurring terms), and the creation and application of clustering (automated data grouping) and classification (automated data labeling) models generated meaningful and revealing themes from the helpdesk database. The results of the text mining procedures were bolstered and analyzed using human interpretation and spreadsheet-based summaries. Major findings included the discovery of four prominent technologies that warranted professional development at the site and a universally applicable approach to undertaking successful helpdesk data mining endeavors. The case study’s conclusions included a call to action for researchers to leverage the methodology at other locations. Future data mining studies may yield practical and applicable knowledge at research sites. Shared methods, approaches, and findings from such studies will advance the field of helpdesk data mining used to glean professional development topics for the very people who have submitted technological support requests to helpdesk providers.
Relevance of Health-Related Hashtags on Twitter: A Text Mining Approach
BACKGROUND
Social media platforms facilitate user interaction and impact decision making. Users prefer to use hashtags while sharing posts. Knowing the sentiment towards diabetes, blood pressure, and obesity is fundamental to understanding the impact of this information on patients and their families. The study seeks to determine the relevance of health-related hashtags on Twitter and analyze sentiments about diabetes, obesity, and blood pressure.
METHOD
Tweets were retrieved using synonyms for “diabetes”, “hypertension”, and “obesity”. The extended knowledge discovery in data mining (KDDM) model guided our research, with research objectives defined in the ‘research problem understanding’ phase. The ‘information seeking’ from Uses and Gratifications Theory (UGT) determined the success and text mining assessment criteria. Text pre-processing was done using tokenization, stop word removal, and stemming. The research objectives, text mining goals, and success criteria were addressed using UGT.
RESULTS
A total of 6749 tweets were extracted using RStudio: 36.41% were about blood pressure, 0.25% about diabetes, 24.43% about obesity, and 6.99% about a combination of two or more terms. Additional topics such as cholesterol, chia seeds, postpartum, diet, and exercise were identified. Upcoming conferences like ‘#ipna’, ‘#review’, ‘#APCH2019’, and ‘#cardiotwitter’ were identified. Increased user engagement about managing blood pressure, diabetes, and obesity across different age groups, as well as about the consequences of increased cardio exercise for obese and diabetic users, was encouraging. Tweets about advertisements specific to clothing for oversized individuals initiated conversation among users about monitoring self-health.
CONCLUSIONS
Sentiment analysis can thus increase our understanding of user engagement on such platforms and potentially help improve managing public health strategically.
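The pre-processing steps named in the METHOD section of the abstract above (tokenization, stop word removal, and stemming) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the study's own code, which the abstract says ran in RStudio. The stop word list, the suffix rules, the function names, and the example tweet are all hypothetical, and the crude suffix stripping merely stands in for a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stop word list (an assumption; the study does not list its own).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "for", "with", "my", "in", "on"}

# Crude suffix stripping, standing in for a real stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into word and hashtag tokens."""
    return re.findall(r"#?[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    if token.startswith("#"):
        return token  # leave hashtags untouched so they stay searchable
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the three steps in order: tokenize, filter stop words, stem."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


# Hypothetical example tweet, not taken from the study's data.
example = "Managing my blood pressure with daily walks #hypertension"
print(preprocess(example))
```

Hashtags are kept as single tokens here because the study's interest is in hashtag relevance; a production pipeline would more likely rely on an established tokenizer and stemmer than on these hand-written rules.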
Data Science and Knowledge Discovery
Data Science (DS) is gaining significant importance in the decision process due to a mix of various areas, including Computer Science, Machine Learning, Math and Statistics, domain/business knowledge, software development, and traditional research. In the business field, DS's application allows using scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data to support the decision process. After collecting the data, it is crucial to discover the knowledge. In this step, Knowledge Discovery (KD) tasks are used to create knowledge from structured and unstructured sources (e.g., text, data, and images). The output needs to be in a readable and interpretable format. It must represent knowledge in a manner that facilitates inferencing. KD is applied in several areas, such as education, health, accounting, energy, and public administration. This book includes fourteen excellent articles which discuss this trending topic and present innovative solutions to show the importance of Data Science and Knowledge Discovery to researchers, managers, industry, society, and other communities. The chapters address several topics, such as Data Mining, Deep Learning, Data Visualization and Analytics, Semantic Data, Geospatial and Spatio-Temporal Data, Data Augmentation, and Text Mining.
Using Perilog to Explore "Decision Making at NASA"
Perilog, a context-intensive text mining system, is used as a discovery tool to explore topics and concerns in "Decision Making at NASA," chapter 6 of the Columbia Accident Investigation Board (CAIB) Report, Volume I. Two examples illustrate how Perilog can be used to discover highly significant safety-related information in the text without prior knowledge of the contents of the document. A third example illustrates how "if-then" statements found by Perilog can be used in logical analysis of decision making. In addition, to serve as a guide for future work, the technical details of preparing a PDF document for input to Perilog are included in an appendix.
Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts
Instead of mining coherent topics from a given text corpus in a completely
unsupervised manner, seed-guided topic discovery methods leverage user-provided
seed words to extract distinctive and coherent topics so that the mined topics
can better cater to the user's interest. To model the semantic correlation
between words and seeds for discovering topic-indicative terms, existing
seed-guided approaches utilize different types of context signals, such as
document-level word co-occurrences, sliding window-based local contexts, and
generic linguistic knowledge brought by pre-trained language models. In this
work, we analyze and show empirically that each type of context information has
its value and limitation in modeling word semantics under seed guidance, but
combining three types of contexts (i.e., word embeddings learned from local
contexts, pre-trained language model representations obtained from
general-domain training, and topic-indicative sentences retrieved based on seed
information) allows them to complement each other for discovering quality
topics. We propose an iterative framework, SeedTopicMine, which jointly learns
from the three types of contexts and gradually fuses their context signals via
an ensemble ranking process. Under various sets of seeds and on multiple
datasets, SeedTopicMine consistently yields more coherent and accurate topics
than existing seed-guided topic discovery approaches. Comment: 9 pages; Accepted to WSDM 202