71 research outputs found
Social media mining for toxicovigilance of prescription medications: End-to-end pipeline, challenges and future work
Substance use, substance use disorder, and overdoses related to substance use
are major public health problems globally and in the United States. A key
aspect of addressing these problems from a public health standpoint is improved
surveillance. Traditional surveillance systems are laggy, and social media are
potentially useful sources of timely data. However, mining knowledge from
social media is challenging, and requires the development of advanced
artificial intelligence, specifically natural language processing (NLP) and
machine learning methods. We developed a sophisticated end-to-end pipeline for
mining information about nonmedical prescription medication use from social
media, namely Twitter and Reddit. Our pipeline employs supervised machine
learning and NLP for filtering out noise and characterizing the chatter. In
this paper, we describe our end-to-end pipeline developed over four years. In
addition to describing our data mining infrastructure, we discuss existing
challenges in social media mining for toxicovigilance, and possible future
research directions
Enhancing Drug Overdose Mortality Surveillance through Natural Language Processing and Machine Learning
Epidemiological surveillance is key to monitoring and assessing the health of populations. Drug overdose surveillance has become an increasingly important part of public health practice as overdose morbidity and mortality has increased due in large part to the opioid crisis. Monitoring drug overdose mortality relies on death certificate data, which has several limitations including timeliness and the coding structure used to identify specific substances that caused death. These limitations stem from the need to analyze the free-text cause-of-death sections of the death certificate that are completed by the medical certifier during death investigation. Other fields, including clinical sciences, have utilized natural language processing (NLP) methods to gain insight from free-text data, but thus far, adoption of NLP methods in epidemiological surveillance has been limited. Through a narrative review of NLP methods currently used in public health surveillance and the integration of two NLP tasks, classification and named entity recognition, this dissertation enhances the capabilities of public health practitioners and researchers to perform drug overdose mortality surveillance. This dissertation advances both surveillance science and public health practice by integrating methods from bioinformatics into the surveillance pipeline which provides more timely and increased quality overdose mortality surveillance, which is essential to guiding effective public health response to the continuing drug overdose epidemic
Systematic review on the prevalence, frequency and comparative value of adverse events data in social media
Aim: The aim of this review was to summarize the prevalence, frequency and comparative value of information on the adverse events of healthcare interventions from user comments and videos in social media. Methods: A systematic review of assessments of the prevalence or type of information on adverse events in social media was undertaken. Sixteen databases and two internet search engines were searched in addition to handsearching, reference checking and contacting experts. The results were sifted independently by two researchers. Data extraction and quality assessment were carried out by one researcher and checked by a second. The quality assessment tool was devised in-house and a narrative synthesis of the results followed. Results: From 3064 records, 51 studies met the inclusion criteria. The studies assessed over 174 social media sites with discussion forums (71%) being the most popular. The overall prevalence of adverse events reports in social media varied from 0.2% to 8% of posts. Twenty-nine studies compared the results from searching social media with using other data sources to identify adverse events. There was general agreement that a higher frequency of adverse events was found in social media and that this was particularly true for ‘symptom’ related and ‘mild’ adverse events. Those adverse events that were under-represented in social media were laboratory-based and serious adverse events. Conclusions: Reports of adverse events are identifiable within social media. However, there is considerable heterogeneity in the frequency and type of events reported, and the reliability or validity of the data has not been thoroughly evaluated
Characterizing Information Seeking Events in Health-Related Social Discourse
Social media sites have become a popular platform for individuals to seek and
share health information. Despite the progress in natural language processing
for social media mining, a gap remains in analyzing health-related texts on
social discourse in the context of events. Event-driven analysis can offer
insights into different facets of healthcare at an individual and collective
level, including treatment options, misconceptions, knowledge gaps, etc. This
paper presents a paradigm to characterize health-related information-seeking in
social discourse through the lens of events. Events here are board categories
defined with domain experts that capture the trajectory of the
treatment/medication. To illustrate the value of this approach, we analyze
Reddit posts regarding medications for Opioid Use Disorder (OUD), a critical
global health concern. To the best of our knowledge, this is the first attempt
to define event categories for characterizing information-seeking in OUD social
discourse. Guided by domain experts, we develop TREAT-ISE, a novel multilabel
treatment information-seeking event dataset to analyze online discourse on an
event-based framework. This dataset contains Reddit posts on
information-seeking events related to recovery from OUD, where each post is
annotated based on the type of events. We also establish a strong performance
benchmark (77.4% F1 score) for the task by employing several machine learning
and deep learning classifiers. Finally, we thoroughly investigate the
performance and errors of ChatGPT on this task, providing valuable insights
into the LLM's capabilities and ongoing characterization efforts.Comment: Under review AAAI-2024. 10 pages, 6 tables, 2 figue
Data and systems for medication-related text classification and concept normalization from Twitter: insights from the Social Media Mining for Health (SMM4H)-2017 shared task
Objective: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data.Materials and Methods: We organized 3 independent subtasks: automatic classification of self-reports of 1) adverse drug reactions (ADRs) and 2) medication consumption, from medication-mentioning tweets, and 3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3); and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks.Results: Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems.Discussion: Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1).Conclusions: Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).</div
Front-Line Physicians' Satisfaction with Information Systems in Hospitals
Day-to-day operations management in hospital units is difficult due to continuously varying situations, several actors involved and a vast number of information systems in use. The aim of this study was to describe front-line physicians' satisfaction with existing information systems needed to support the day-to-day operations management in hospitals. A cross-sectional survey was used and data chosen with stratified random sampling were collected in nine hospitals. Data were analyzed with descriptive and inferential statistical methods. The response rate was 65 % (n = 111). The physicians reported that information systems support their decision making to some extent, but they do not improve access to information nor are they tailored for physicians. The respondents also reported that they need to use several information systems to support decision making and that they would prefer one information system to access important information. Improved information access would better support physicians' decision making and has the potential to improve the quality of decisions and speed up the decision making process.Peer reviewe
Biomedical Information Extraction Pipelines for Public Health in the Age of Deep Learning
abstract: Unstructured texts containing biomedical information from sources such as electronic health records, scientific literature, discussion forums, and social media offer an opportunity to extract information for a wide range of applications in biomedical informatics. Building scalable and efficient pipelines for natural language processing and extraction of biomedical information plays an important role in the implementation and adoption of applications in areas such as public health. Advancements in machine learning and deep learning techniques have enabled rapid development of such pipelines. This dissertation presents entity extraction pipelines for two public health applications: virus phylogeography and pharmacovigilance. For virus phylogeography, geographical locations are extracted from biomedical scientific texts for metadata enrichment in the GenBank database containing 2.9 million virus nucleotide sequences. For pharmacovigilance, tools are developed to extract adverse drug reactions from social media posts to open avenues for post-market drug surveillance from non-traditional sources. Across these pipelines, high variance is observed in extraction performance among the entities of interest while using state-of-the-art neural network architectures. To explain the variation, linguistic measures are proposed to serve as indicators for entity extraction performance and to provide deeper insight into the domain complexity and the challenges associated with entity extraction. For both the phylogeography and pharmacovigilance pipelines presented in this work the annotated datasets and applications are open source and freely available to the public to foster further research in public health.Dissertation/ThesisDoctoral Dissertation Biomedical Informatics 201
- …