12,357 research outputs found

    LCCT: a semisupervised model for sentiment classification

    Conference Theme: Human Language Technologies. Analyzing public opinions towards products, services and social events is an important but challenging task. An accurate sentiment analyzer should take both lexicon-level information and corpus-level information into account. It also needs to exploit domain-specific knowledge and utilize the common knowledge shared across domains. In addition, the algorithm should be able to deal with missing labels and learn from incomplete sentiment lexicons. This paper presents an LCCT (Lexicon-based and Corpus-based, Co-Training) model for semi-supervised sentiment classification. The proposed method combines the ideas of lexicon-based learning and corpus-based learning in a unified co-training framework. It is capable of incorporating both domain-specific and domain-independent knowledge. Extensive experiments show that it achieves very competitive classification accuracy, even with a small portion of labeled data. Compared to state-of-the-art sentiment classification methods, the LCCT approach exhibits significantly better performance on a variety of datasets in both English and Chinese. © 2015 Association for Computational Linguistics.

    Medical Image Data and Datasets in the Era of Machine Learning-Whitepaper from the 2016 C-MIMI Meeting Dataset Session.

    At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data-starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, and a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities.

    Health Misinformation in Search and Social Media

    People increasingly rely on the Internet in order to search for and share health-related information. Indeed, searching for and sharing information about medical treatments are among the most frequent uses of online data. While this is a convenient and fast method to collect information, online sources may contain incorrect information that has the potential to cause harm, especially if people believe what they read without further research or professional medical advice. The goal of this thesis is to address the misinformation problem in two of the most commonly used online services: search engines and social media platforms. We examined how people use these platforms to search for and share health information. To achieve this, we designed controlled laboratory user studies and employed large-scale social media data analysis tools. The solutions proposed in this thesis can be used to build systems that better support people's health-related decisions. The techniques described in this thesis addressed online searching and social media sharing in the following manner. First, with respect to search engines, we aimed to determine the extent to which people can be influenced by search engine results when trying to learn about the efficacy of various medical treatments. We conducted a controlled laboratory study wherein we biased the search results towards either correct or incorrect information. We then asked participants to determine the efficacy of different medical treatments. Results showed that people were significantly influenced both positively and negatively by search results bias. More importantly, when the subjects were exposed to incorrect information, they made more incorrect decisions than when they had no interaction with the search results. Following from this work, we extended the study to gain insights into strategies people use during this decision-making process, via the think-aloud method. 
We found that, even with verbalization, people were strongly influenced by the search results bias. We also noted that people paid attention to what the majority states, authoritativeness, and content quality when evaluating online content. Understanding the effects of cognitive biases that can arise during online search is a complex undertaking because of the presence of unconscious biases (such as the search results ranking) that the think-aloud method fails to show. Moving to social media, we first proposed a solution to detect and track misinformation in social media. Using Zika as a case study, we developed a tool for tracking misinformation on Twitter. We collected 13 million tweets regarding the Zika outbreak and tracked rumors outlined by the World Health Organization and the Snopes fact-checking website. We incorporated health professionals, crowdsourcing, and machine learning to capture health-related rumors as well as clarification communications. In this way, we illustrated the insights that the proposed tools provide into potentially harmful information on social media, allowing public health researchers and practitioners to respond with targeted and timely action. Moving on from identifying rumor-bearing tweets, we examined individuals on social media who post questionable health-related information, in particular those promoting cancer treatments that have been shown to be ineffective. Specifically, we studied 4,212 Twitter users who had posted about one of 139 ineffective "treatments" and compared them to a baseline of users generally interested in cancer. Considering features that capture user attributes, writing style, and sentiment, we built a classifier that is able to identify users prone to propagating such misinformation. This classifier achieved an accuracy of over 90%, providing a potential tool for public health officials to identify such individuals for preventive intervention.

    Interpretable Word-Level Sentiment Analysis With Attention-Based Multiple Instance Classification Models

    In this study, our main objective is to tackle the black-box nature of popular machine learning models in sentiment analysis and enhance model interpretability. We aim to gain more insight into the decision-making process of sentiment analysis models, which is often obscure in those complex models. To achieve this goal, we introduce two word-level sentiment analysis models. The first model is called the attention-based multiple instance classification (AMIC) model. It combines the transparent model structure of multiple instance classification and the self-attention mechanism in deep learning to incorporate the contextual information from documents. As demonstrated by a wine review dataset application, AMIC can achieve state-of-the-art performance compared to a number of machine learning methods, while providing much improved interpretability. The second model, AMIC 2.0, improves AMIC in two key aspects. First, AMIC is limited in integrating positional information in text because it ignores the order of words in documents. AMIC 2.0 introduces a novel approach to incorporate relative positional information in the self-attention mechanism, enabling the model to capture position-sensitive sentiment more accurately. This modification enables the model to better understand how word order and proximity influence sentiment expressions. Second, AMIC 2.0 takes a step further by decomposing the sentiment score in AMIC into a context-independent score and a context-dependent score. This decomposition, along with the incorporation of two sentiment shifters linking these scores in a global environment and a local environment of text respectively, elucidates how the context of a document influences the sentiment of words, leading to more interpretable results in sentiment analysis. The utility of AMIC 2.0 is demonstrated by an application to a Twitter dataset. 
AMIC 2.0 improves on the overall performance of AMIC, with the additional capability of handling more intricate language subtleties, such as different types of negation. Both AMIC and AMIC 2.0 are trained without a pre-trained sentiment word dictionary or seeded sentiment words. Compared to some large language models, their computational cost is relatively low, and they can use conventional datasets to generate domain-specific sentiment dictionaries and provide interpretable sentiment analysis results.

    Classification of socially generated medical data

    The growth of online health communities, particularly those involving socially generated content, can provide considerable value for society. Participants can gain knowledge of medical information or interact with peers on medical forum platforms. However, the sheer volume of information so generated – and the consequent ‘noise’ associated with large data volumes – can create difficulties for information consumers. We propose a solution to this problem by applying high-level analytics to the data – primarily sentiment analysis, but also content and topic analysis – for accurate classification. We believe that such analysis can be of significant value to data users, for example by identifying a particular aspect of an information space, determining the themes that predominate in a large dataset, and allowing people to summarize topics within a big dataset. In this thesis, we apply machine learning strategies to identify sentiments expressed in online medical forums that discuss Lyme Disease. As part of this process, we distinguish a complete and relevant set of categories that can be used to characterize Lyme Disease discourse. We present a feature-based model that employs supervised learning algorithms and assess the feasibility and accuracy of this sentiment classification model. We further evaluate our model by assessing its ability to adapt to an online medical forum discussing a disease with similar characteristics, Lupus. The experimental results demonstrate the effectiveness of our approach. In many sentiment analysis applications, labelled training datasets are expensive to obtain, whereas unlabelled datasets are readily available. Therefore, we present an adaptation of a well-known semi-supervised learning technique, in which co-training is implemented by combining labelled and unlabelled data. Our results suggest that the model can learn even with limited labelled data. 
In addition, we investigate complementary analytic techniques – content and topic analysis – to make the best use of the data for various consumer groups. Within the work described in this thesis, several research issues are addressed, specifically as they apply to socially generated medical/health datasets:
• When applying binary sentiment analysis to short-form text data (e.g. Twitter), could meta-level features improve classification performance?
• When applying more complex multi-class sentiment analysis to the classification of long-form, content-rich text data, would meta-level features be a useful addition to more conventional features?
• Can this multi-class analysis approach be generalised to other medical/health domains?
• How would alternative classification strategies benefit different groups of information consumers?

    CREATE: Concept Representation and Extraction from Heterogeneous Evidence

    Traditional information retrieval methodology is guided by the document retrieval paradigm, where relevant documents are returned in response to user queries. This paradigm faces a serious drawback if the desired result is not explicitly present in a single document. The problem becomes more obvious when a user tries to obtain complete information about a real-world entity, such as a person, company, or location. In such cases, various facts about the target entity or concept need to be gathered from multiple document sources. In this work, we present a method to extract information about a target entity based on the concept retrieval paradigm, which focuses on extracting and, if necessary, blending information related to a concept from multiple sources. The paradigm is built around a generic notion of a concept, defined as any item that can be thought of as a topic of interest. Concepts may correspond to any real-world entity, such as a restaurant, person, city, or organization, or to any abstract item, such as a news topic, event, or theory. The Web is a heterogeneous collection of data in different forms, such as facts, news, and opinions. We propose different models for different forms of data, all of which work towards the same goal of concept-centric retrieval. We motivate our work based on studies of current trends and demands in information seeking. The framework helps in understanding the intent of content, i.e. opinion versus fact. Our work has been conducted on free-text data in English. Nevertheless, our framework can be easily transferred to other languages.

    INFORMATIONAL SUPPORT OR EMOTIONAL SUPPORT: PRELIMINARY STUDY OF AN AUTOMATED APPROACH TO ANALYZE ONLINE SUPPORT COMMUNITY CONTENTS

    Recognizing the need to analyze large amounts of data in the study of online support communities, this article introduces an automated content analysis method. By adopting machine learning techniques and tools, this method requires minimal manual intervention while being capable of analyzing large amounts of data automatically. Through this method, the contents of messages from online support communities spanning years are categorized as either informational support or emotional support. A case study on the analysis of online breast cancer and prostate cancer message boards is presented to demonstrate that the proposed method generates results comparable to those of traditional manual qualitative content analysis methods.