
    Assessing the Effects of Lemmatisation and Spell Checking on Sentiment Analysis of Online Reviews

    With many options for text preprocessing techniques, choosing the most efficient methodology is important for both accuracy and computational expense. Online text often contains non-standard English: spelling errors, colloquialisms, emojis, slang and many other variations that affect current natural language processing tools, and there are no clear guidelines for preprocessing this type of text. In this work we analyse text preprocessing techniques using a data set of online reviews scraped from the iTunes and Google Play stores. The objective is to measure the efficacy of different combinations of these techniques at maximising the amount of detected sentiment in a dataset of 438,157 reviews. Sentiment detection was performed by two state-of-the-art sentiment analysers (RoBERTa and VADER). Statistical analysis of the results suggests preprocessing strategies for maximising the sentiment detected within mental health app reviews and similar text formats.
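    A toy sketch of why spell checking matters for dictionary-based sentiment detection. This is not the paper's RoBERTa/VADER pipeline; the tiny lexicon, spelling map, and example review below are illustrative assumptions.

    ```python
    import re

    # Hypothetical miniature sentiment lexicon and spelling corrections.
    LEXICON = {"great": 1.0, "love": 1.0, "bad": -1.0, "crash": -1.0}
    SPELL_FIX = {"luv": "love", "grate": "great"}

    def preprocess(text, spell_check=True):
        tokens = re.findall(r"[a-z']+", text.lower())
        if spell_check:
            tokens = [SPELL_FIX.get(t, t) for t in tokens]
        return tokens

    def sentiment(tokens):
        # Tokens missing from the lexicon contribute 0, which is how
        # misspellings "hide" sentiment from a dictionary analyser.
        return sum(LEXICON.get(t, 0.0) for t in tokens)

    review = "I luv this app, its grate"
    print(sentiment(preprocess(review, spell_check=False)))  # 0.0
    print(sentiment(preprocess(review, spell_check=True)))   # 2.0
    ```

    The gap between the two scores is the kind of effect the study measures at scale across preprocessing combinations.
    
    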

    Methods for improving entity linking and exploiting social media messages across crises

    Entity Linking (EL) is the task of automatically identifying entity mentions in texts and resolving them to a corresponding entity in a reference knowledge base (KB). A large number of tools are available for different types of documents and domains; however, the entity linking literature has shown that a tool's quality varies across corpora and depends on the specific characteristics of the corpus it is applied to. Moreover, the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real-world applications. In the first part of this thesis I explore an approximation of the difficulty of linking entity mentions and frame it as a supervised classification task. Classifying difficult-to-disambiguate entity mentions can facilitate identifying critical cases as part of a semi-automated system, while detecting latent corpus characteristics that affect entity linking performance. Moreover, despite the large number of entity linking tools proposed over the past years, some tools work better on short mentions while others perform better when there is more contextual information. To this end, I proposed a solution that exploits the results of distinct entity linking tools on the same corpus by leveraging their individual strengths on a per-mention basis. The proposed solution proved effective and outperformed the individual entity linking systems employed in a series of experiments. An important component of most entity linking tools is the probability that a mention links to a given entity in a reference knowledge base, and this probability is usually computed over a static snapshot of a reference KB. However, an entity's popularity is temporally sensitive and may change due to short-term events. These changes may then be reflected in a KB, so EL tools can produce different results for a given mention at different times.
    I investigated how this prior probability changes over time and the overall disambiguation performance using KBs from different time periods. The second part of this thesis is mainly concerned with short texts. Social media has become an integral part of modern society. Twitter, for instance, is one of the most popular social media platforms in the world, enabling people to share their opinions and post short messages about any subject on a daily basis. First, I presented an approach to identifying informative messages during catastrophic events using deep learning techniques. Automatically detecting informative messages posted by users during major events enables professionals involved in crisis management to better estimate damage from only the relevant information posted on social media channels, and to act immediately. I also performed an analysis of Twitter messages posted during the Covid-19 pandemic: I collected 4 million tweets posted in Portuguese since the beginning of the pandemic and provided an analysis of the debate around it, using topic modelling, sentiment analysis and hashtag recommendation techniques to provide insights into the online discussion of the Covid-19 pandemic.
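    A minimal sketch of the per-mention combination idea: route each mention to whichever tool is expected to be stronger for it. The thesis learns this routing; here a hand-written rule and two stub "tools" stand in, and all names and KB entries below are illustrative assumptions.

    ```python
    # Stub standing in for a tool that is strong on short, common mentions.
    def tool_short(mention, context):
        kb = {"NYC": "New_York_City", "UK": "United_Kingdom"}
        return kb.get(mention)

    # Stub standing in for a tool that exploits surrounding context.
    def tool_context(mention, context):
        if "river" in context:
            return "Amazon_River"
        return "Amazon_(company)"

    def link(mention, context):
        # Per-mention routing: short mentions go to one tool,
        # context-rich mentions to the other.
        if len(mention) <= 3:
            return tool_short(mention, context)
        return tool_context(mention, context)

    print(link("NYC", "flights to NYC"))                 # New_York_City
    print(link("Amazon", "the river basin of the Amazon"))  # Amazon_River
    ```

    In the thesis the router is a trained model over mention and context features rather than a length threshold, but the interface is the same: one linking decision per mention, delegated to the strongest tool.
    
    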

    On the development of an information system for monitoring user opinion and its role for the public

    Social media services and analytics platforms are rapidly growing. A large number of events happen almost every day, and the role of social media monitoring tools is also increasing. Social networks are widely used for managing and promoting brands and different services. Thus, most popular social analytics platforms aim at business purposes, while monitoring various social, economic, and political problems remains underrepresented and not covered by thorough research. Moreover, most of them focus on resource-rich languages such as English, whereas texts and comments in low-resource languages in social media, such as Russian and Kazakh, are not represented well enough. This work is therefore devoted to developing and applying an information system called the OMSystem for analyzing users' opinions on news portals, blogs, and social networks in Kazakhstan. The system uses sentiment dictionaries of the Russian and Kazakh languages and machine learning algorithms to determine the sentiment of social media texts. The whole structure and functionalities of the system are also presented. The experimental part is devoted to building machine learning models for sentiment analysis on the Russian and Kazakh datasets. The performance of the models is then evaluated with accuracy, precision, recall, and F1-score metrics, and the models with the highest scores are selected for implementation in the OMSystem. The OMSystem's social analytics module is then used to thoroughly analyze the healthcare, political and social aspects of the most relevant topics connected with vaccination against the coronavirus disease. The analysis allowed us to discover the public social mood in the cities of Almaty and Nur-Sultan and other large regional cities of Kazakhstan. The system's study included two extensive periods: 10-01-2021 to 30-05-2021 and 01-07-2021 to 12-08-2021.
    In the obtained results, people's moods and attitudes to the Government's policies and actions were studied through such social network indicators as the level of topic discussion activity in society, the level of interest in the topic in society, and the mood level of society. These indicators, calculated by the OMSystem, allowed careful identification of alarming factors for the public (negative attitudes to government regulations, vaccination policies, trust in vaccination, etc.) and assessment of the social mood.
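    A sketch of the model-selection metrics named above (accuracy, precision, recall, F1) for a binary positive/negative sentiment task. The toy labels and predictions are assumptions, not OMSystem outputs.

    ```python
    def scores(y_true, y_pred, positive="pos"):
        # Counts over the positive class only (binary setting).
        tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
        fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
        fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
        acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

    y_true = ["pos", "pos", "neg", "neg", "pos"]
    y_pred = ["pos", "neg", "neg", "pos", "pos"]
    print(scores(y_true, y_pred))  # accuracy 0.6; precision, recall, F1 all 2/3
    ```

    Ranking candidate models by these four numbers on a held-out set is the selection step the abstract describes.
    
    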

    3rd International Conference on Advanced Research Methods and Analytics (CARMA 2020)

    Research methods in economics and social sciences are evolving with the increasing availability of Internet and Big Data sources of information. As these sources, methods, and applications become more interdisciplinary, the 3rd International Conference on Advanced Research Methods and Analytics (CARMA) is an excellent forum for researchers and practitioners to exchange ideas and advances on how emerging research methods and sources are applied to different fields of social sciences, as well as to discuss current and future challenges. Doménech I De Soria, J.; Vicente Cuervo, MR. (2020). 3rd International Conference on Advanced Research Methods and Analytics (CARMA 2020). Editorial Universitat Politècnica de València. http://hdl.handle.net/10251/149510

    Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

    Peer reviewed.

    EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

    Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and it is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it).

    Proceedings of the Making Sense of Microposts Workshop (#Microposts2015) at the World Wide Web Conference


    Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

    Peer reviewed.

    Real Time Crime Prediction Using Social Media

    There is no doubt that crime is on the increase and has a detrimental influence on a nation's economy, despite several studies on crime prediction aimed at minimising crime rates. Historically, data mining techniques for crime prediction models rely on historical information and are mostly country-specific. In fact, only a few of the earlier studies on crime prediction follow standard data mining procedure. Hence, considering the current worldwide crime trend, in which criminals routinely publish their criminal intent on social media and ask others to watch and/or engage in different crimes, an alternative, more dynamic strategy is needed. The goal of this research is to improve the performance of crime prediction models. Thus, this thesis explores the potential of using information from social media (Twitter) for crime prediction in combination with historical crime data. It also identifies, using data mining techniques, the most relevant feature engineering needed for a United Kingdom dataset to improve crime prediction model performance. Additionally, this study presents a function that could be used by every state in the United Kingdom for data cleansing, pre-processing and feature engineering. A Shiny app was also used to display tweet sentiment trends to help prevent crime in near-real time.
    Exploratory analysis is essential for revealing the data pre-processing and feature engineering needed before feeding the data into a machine learning model for efficient results. Based on earlier documented studies, this is the first research to conduct a full exploratory analysis of historical British crime statistics using the stop-and-search historical dataset. Based on the findings from the exploratory study, an algorithm was created to clean the data and prepare it for further analysis and model creation.
    This is an enormous success because it provides a ready dataset for future research, particularly for non-experts to use in constructing models to forecast crime or conducting investigations across around 32 police districts of the United Kingdom. Moreover, this is the first study to present a complete collection of geo-spatial parameters for training a crime prediction model by combining demographic data from the same source in the United Kingdom with hourly sentiment polarity that was not restricted to a Twitter keyword search. Six unique base models frequently mentioned in the previous literature were selected, trained on the stop-and-search historical crime dataset, evaluated on test data, and finally validated on the London and Kent crime datasets. Two different datasets were created from Twitter and historical data (historical crime data with Twitter sentiment scores and historical data without Twitter sentiment scores). Six of the most prevalent machine learning classifiers (Random Forest, Decision Tree, K-nearest neighbour, support vector machine, neural network and naïve Bayes) were trained and tested on these datasets. Additionally, the hyperparameters of each of the six models were tuned using random grid search. Voting classifiers and a logistic regression stacked ensemble of different models were also trained and tested on the same datasets to enhance individual model performance. In addition, two combinations of stacked ensembles of multiple models were constructed to enhance and choose the most suitable models for crime prediction; based on their performance, the appropriate prediction model for the UK dataset would be selected.
    In terms of how the research may be interpreted, it differs from most earlier studies that employed Twitter data in that several methodologies were used to show how each attribute contributed to the construction of the model, and the findings were discussed and interpreted in the context of the study. Further, a Shiny app visualisation tool was designed to display each tweet's sentiment score, text, user screen name, and vicinity, which allows the investigation of criminal actions in near-real time. The evaluation of the models revealed that Random Forest, Decision Tree, and K-nearest neighbour outperformed the other models; however, Decision Tree and Random Forest performed better consistently when evaluated on test data.
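    A sketch of the hard-voting ensemble idea described above: each base classifier votes and the majority label wins. The three rule-based "classifiers" and the sample record below are illustrative assumptions, not the trained Random Forest / Decision Tree / KNN models.

    ```python
    from collections import Counter

    def clf_hour(x):       # crude rule: late-night records flagged high risk
        return "high" if x["hour"] >= 22 or x["hour"] <= 4 else "low"

    def clf_sentiment(x):  # crude rule: very negative tweets flag high risk
        return "high" if x["sentiment"] < -0.5 else "low"

    def clf_history(x):    # crude rule: many past incidents flag high risk
        return "high" if x["past_incidents"] > 10 else "low"

    def vote(x, classifiers=(clf_hour, clf_sentiment, clf_history)):
        # Hard voting: count each classifier's ballot, majority wins.
        ballots = Counter(c(x) for c in classifiers)
        return ballots.most_common(1)[0][0]

    record = {"hour": 23, "sentiment": -0.8, "past_incidents": 3}
    print(vote(record))  # two of three vote "high", so: high
    ```

    A stacked ensemble replaces the majority count with a meta-learner (here, logistic regression in the thesis) trained on the base models' outputs, but the fan-in shape is the same.
    
    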