353 research outputs found

    Social analytics for health integration, intelligence, and monitoring

    Patient-generated social health data are now abundant, and healthcare is shifting from an authoritative, provider-centric model to collaborative, patient-oriented care. The aim of this dissertation is to provide a Social Health Analytics framework that uses social data to address interdisciplinary research challenges in Big Data Science and Health Informatics. Specific research issues and objectives are described below. The first objective is the semantic integration of heterogeneous health data sources, which range from structured to unstructured and include patient-generated social data as well as authoritative data. An information seeker has to spend time selecting information from many websites and integrating it into a coherent mental model. An integrated health data model is designed to accommodate data features from different sources. The model uses semantic linked data for lightweight integration and supports a set of analytics and inferences over the data sources. A prototype analytical and reasoning tool called “Social InfoButtons”, which can be linked from existing EHR systems, is developed so that doctors can understand and take into consideration the behaviors, patterns, or trends of patients’ healthcare practices during a patient’s care. The tool can also offer insights that help public health officials make better-informed policy decisions. The second objective is near-real-time monitoring of disease outbreaks using social media. Research on epidemic detection based on search query terms entered by millions of users is limited by the fact that query terms are not easily accessible to non-affiliated researchers. Publicly available Twitter data are exploited to develop the Epidemics Outbreak and Spread Detection System (EOSDS). EOSDS provides four visual analytics tools for monitoring epidemics, i.e., Instance Map, Distribution Map, Filter Map, and Sentiment Trend, to investigate public health threats in space and time. 
The third objective is to capture, analyze, and quantify public health concerns through sentiment classification of Twitter data. Traditional public health surveillance systems struggle to detect and monitor health-related concerns and changes in public attitudes to health-related issues because of their cost and significant time delays. A two-step sentiment classification model is built to measure the concern. In the first step, Personal tweets are distinguished from Non-Personal tweets. In the second step, Personal Negative tweets are further separated from Personal Non-Negative tweets. In the proposed classification, training data are labeled by an emotion-oriented, clue-based method, and three Machine Learning models are trained and tested. The Measure of Concern (MOC) is computed based on the number of Personal Negative sentiment tweets. A timeline trend of the MOC is also generated to monitor public concern levels, which is important for health emergency resource allocation and policy making. The fourth objective is predicting medical condition incidence and progression trajectories using patients’ self-reported data on PatientsLikeMe. Some medical conditions are correlated with each other to a measurable degree (“comorbidities”). A prediction model is provided to predict comorbidities, rank future conditions by their likelihood, and predict the possible progression trajectories given an observed medical condition. The novel models for trajectory prediction of medical conditions are validated to cover the comorbidities reported in the medical literature.
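The two-step classification and the MOC timeline described above can be illustrated with a minimal sketch. The abstract does not give the exact MOC formula or the trained models, so this sketch assumes a raw daily count of Personal Negative tweets and uses hypothetical keyword rules (`toy_is_personal`, `toy_is_negative`) as stand-ins for the trained classifiers.

```python
from collections import Counter
from datetime import date

def classify_two_step(text, is_personal, is_negative):
    """Step 1: Personal vs Non-Personal. Step 2: Negative vs Non-Negative."""
    if not is_personal(text):
        return "non_personal"
    return "personal_negative" if is_negative(text) else "personal_non_negative"

def measure_of_concern(tweets, is_personal, is_negative):
    """Count Personal Negative tweets per day.

    Assumption: a raw daily count; the abstract only says the MOC is
    'based on the number of Personal Negative sentiment tweets'.
    """
    counts = Counter()
    for day, text in tweets:
        if classify_two_step(text, is_personal, is_negative) == "personal_negative":
            counts[day] += 1
    return dict(sorted(counts.items()))

# Hypothetical keyword rules standing in for the trained classifiers.
def toy_is_personal(text):
    return any(w in text.lower().split() for w in ("i", "my", "me"))

def toy_is_negative(text):
    return any(w in text.lower().split() for w in ("worried", "scared", "afraid"))

tweets = [
    (date(2020, 1, 1), "I am worried about the flu"),
    (date(2020, 1, 1), "Flu cases reported in the city"),
    (date(2020, 1, 2), "My kids are scared of the outbreak"),
    (date(2020, 1, 2), "I got my flu shot today"),
    (date(2020, 1, 2), "I am afraid this is spreading"),
]
moc_timeline = measure_of_concern(tweets, toy_is_personal, toy_is_negative)
```

Passing the two classifiers in as functions keeps the two-step structure visible and lets any trained model replace the toy rules.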

    Image Sentiment Analysis of Social Media Data

    A picture is often said to be worth a thousand words, a saying that captures one of the biggest challenges in image sentiment analysis. The main theme of this dissertation is the sentiment analysis of social media images, mainly from Twitter, in order to identify situations that represent risks (identification of negative situations) or that may become risks (prediction of negative situations). Despite the diversity of work in image sentiment analysis, it remains a challenging task. Several factors contribute to the difficulty: global factors such as sociocultural issues; issues specific to image sentiment analysis, such as the difficulty of finding reliable and properly labeled data; and factors encountered during classification. For example, images with darker colors and low brightness are normally associated with negative feelings, and most such images are indeed negative, but some cases escape this rule, and these cases reduce the accuracy of the developed models. To overcome these classification problems, a multitask model was developed that considers the whole image, the salient areas of the image, the facial expressions of faces in the image, and textual information, so that the components complement one another during classification. The experiments showed that the proposed models can benefit image sentiment classification and even work around some problems observed in existing work, such as irony in the text. This work therefore presents the state of the art and the study carried out, followed by the presentation and implementation of the proposed model, the experiments, and a discussion of the results, in order to verify the effectiveness of the proposal. 
Finally, conclusions about the work done and future work will be presented.

    Perceiving University Student's Opinions from Google App Reviews

    The Google app market captures the views of users from around the globe through ratings and text reviews in a multilingual arena. The information in these reviews cannot be extracted manually because of their exponential growth. Sentiment analysis, using machine learning and deep learning algorithms with NLP, uncovers and interprets the emotions they express. This study performs sentiment classification of app reviews and identifies university students' behavior towards the app market via exploratory analysis. We applied machine learning algorithms using the TP, TF, and TF-IDF text representation schemes and evaluated their performance with Bagging, an ensemble learning method. We used GloVe word embeddings with the deep learning models. Our model was trained on Google app reviews and tested on Student's App Reviews (SAR). The combinations of these algorithms were compared using F score and accuracy, and the inferences were presented graphically. Among the classifiers, SVM achieved an accuracy of 93.41% and an F score of 89% with bigrams and the TF-IDF scheme. Bagging enhanced the performance of LR and NB, with accuracies of 87.88% and 86.69% and F scores of 86% and 78%, respectively. Overall, LSTM on GloVe embeddings recorded the highest accuracy (95.2%), with an F score of 88%. Comment: Accepted in Concurrency and Computation: Practice and Experience.
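As an illustration of the bigram TF-IDF representation mentioned above, here is a minimal standard-library sketch. It assumes one common TF-IDF variant (relative term frequency times log(N/df)) and whitespace tokenization; the study's exact weighting, preprocessing, and classifier setup are not given in the abstract, and the classifier step is omitted here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(text):
    """Unigrams plus bigrams from a lowercased, whitespace-split review."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)

def tfidf(docs):
    """Map each document to {term: tf * idf}.

    Assumption: tf = count / number of terms in the document,
    idf = log(N / document frequency). Other variants exist.
    """
    term_lists = [featurize(d) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))  # count each term once per document
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        total = len(terms)
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors

# Hypothetical toy reviews, not data from the study.
reviews = ["great app", "bad app", "great great update"]
vectors = tfidf(reviews)
```

A bigram such as "great app" that occurs in only one review receives a higher weight than the unigram "great", which occurs in two; this is the kind of signal a downstream classifier such as an SVM could exploit.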

    The text classification pipeline: Starting shallow, going deeper

    Text Classification (TC) is an increasingly relevant and crucial subfield of Natural Language Processing (NLP), tackled in this PhD thesis from a computer science and engineering perspective. In this field too, the exceptional success of deep learning has sparked a boom over the past ten years. Text retrieval and categorization, information extraction, and summarization all rely heavily on TC. The literature has presented numerous datasets, models, and evaluation criteria. Although languages such as Arabic, Chinese, Hindi, and others are used in several works, English is the language most used and referred to in the TC literature from a computer science perspective, and it is also the main language referenced in the rest of this thesis. Although numerous machine learning techniques have shown outstanding results, classifier effectiveness depends on the capability to capture intricate relations and non-linear correlations in texts. Achieving this level of understanding requires attention not only to the architecture of a model but also to the other stages of the TC pipeline. Within NLP, a range of text representation techniques and model designs have emerged, including large language models. These models can turn massive amounts of text into useful vector representations that capture semantically significant information. A point of particular interest is that this field has been investigated by numerous communities, including data mining, linguistics, and information retrieval. These communities overlap somewhat but mostly conduct their research separately. Bringing researchers from different communities together to improve the multidisciplinary understanding of this field is one of the objectives of this dissertation. Additionally, this dissertation examines text mining from both a traditional and a modern perspective. 
This thesis covers the whole TC pipeline in detail. Its main contribution, however, is to investigate how each element of the TC pipeline affects the final performance of a TC model. The TC pipeline is discussed, including both traditional and the most recent deep learning-based models. The pipeline consists of the state-of-the-art (SOTA) datasets used in the literature as benchmarks, text preprocessing, text representation, machine learning models for TC, evaluation metrics, and current SOTA results. Each chapter of this dissertation goes over one of these steps, covering both the technical advancements and my most significant and recent findings from performing experiments and introducing novel models. The advantages and disadvantages of the various options are also listed, along with a thorough comparison of the approaches. Each chapter ends with my contributions, with experimental evaluations and discussions of the results obtained during my three-year PhD course. The experiments and analysis related to each chapter (i.e., each element of the TC pipeline) are the main contributions I provide, extending the basic knowledge of a regular survey on TC.

    Just-in-time Sentiment Analysis for Multilingual Streams

    The growth of social-media platforms has been remarkable in terms of both the number of users and the volume of content generated. As citizens tend to freely express their sentiments on social platforms, Twitter has become an indispensable source for public discourse on a wide variety of topics. Carrying out sentiment analysis in a timely manner on streamed tweets is a demanding endeavor. In this thesis, we propose a Spark-based Twitter sentiment analysis software architecture that receives online multilingual streamed messages and compiles analytics. We outline the main elements of our proposal and discuss how they collectively help address the challenges involved in this big-data processing task. In particular, our framework: i) exploits the Spark machine-learning library to classify Greek, French, and English tweets in a timely manner, ii) manages streamed tweets in synergy with contemporary queuing and in-memory data systems, and iii) determines with high accuracy whether a sentiment is expressed by a genuine account.

    Doctor of Philosophy in Computer Science

    Over the last decade, social media has emerged as a revolutionary platform for informal communication and social interactions among people. Publicly expressing thoughts, opinions, and feelings is one of the key characteristics of social media. In this dissertation, I present research on automatically acquiring knowledge from social media that can be used to recognize people's affective state (i.e., what someone feels at a given time) in text. This research addresses two types of affective knowledge: 1) hashtag indicators of emotion consisting of emotion hashtags and emotion hashtag patterns, and 2) affective understanding of similes (a form of figurative comparison). My research introduces a bootstrapped learning algorithm for learning hashtag indicators of emotions from tweets with respect to five emotion categories: Affection, Anger/Rage, Fear/Anxiety, Joy, and Sadness/Disappointment. With a few seed emotion hashtags per emotion category, the bootstrapping algorithm iteratively learns new hashtags and more generalized hashtag patterns by analyzing emotion in tweets that contain these indicators. Emotion phrases are also harvested from the learned indicators to train additional classifiers that use the surrounding word context of the phrases as features. This is the first work to learn hashtag indicators of emotions. My research also presents a supervised classification method for classifying affective polarity of similes in Twitter. Using lexical, semantic, and sentiment properties of different simile components as features, supervised classifiers are trained to classify a simile into a positive or negative affective polarity class. The property of comparison is also fundamental to the affective understanding of similes. 
My research introduces a novel framework for inferring implicit properties that 1) uses syntactic constructions, statistical association, dictionary definitions and word embedding vector similarity to generate and rank candidate properties, 2) re-ranks the top properties using influence from multiple simile components, and 3) aggregates the ranks of each property from different methods to create a final ranked list of properties. The inferred properties are used to derive additional features for the supervised classifiers to further improve affective polarity recognition. Experimental results show substantial improvements in affective understanding of similes over the use of existing sentiment resources
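The final aggregation step of the property-inference framework, combining ranks from several methods into one list, might be sketched as follows. The abstract does not state the aggregation rule, so this sketch assumes mean rank (lower is better), with a candidate missing from a list given the rank one past that list's end; the method names and candidate properties are hypothetical.

```python
def aggregate_ranks(rankings):
    """Combine several ranked lists into one list ordered by mean rank.

    Assumptions: rank 1 is best; a candidate absent from a list is
    assigned rank len(list) + 1; ties are broken alphabetically.
    """
    candidates = set().union(*rankings)

    def mean_rank(candidate):
        ranks = [r.index(candidate) + 1 if candidate in r else len(r) + 1
                 for r in rankings]
        return sum(ranks) / len(ranks)

    return sorted(candidates, key=lambda c: (mean_rank(c), c))

# Hypothetical candidate properties produced by three ranking methods.
by_syntax = ["cold", "icy", "dark"]
by_association = ["icy", "cold", "small"]
by_embedding = ["cold", "small", "icy"]

final_ranking = aggregate_ranks([by_syntax, by_association, by_embedding])
```

Here "cold" has mean rank 4/3 and comes first, while "dark", which only one method proposes, falls to the end.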

    EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

    Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and it is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it)