7,041 research outputs found

    A survey of data mining techniques for social media analysis

    Get PDF
    Social network has gained remarkable attention in the last decade. Accessing social network sites such as Twitter, Facebook LinkedIn and Google+ through the internet and the web 2.0 technologies has become more affordable. People are becoming more interested in and relying on social network for information, news and opinion of other users on diverse subject matters. The heavy reliance on social network sites causes them to generate massive data characterised by three computational issues namely; size, noise and dynamism. These issues often make social network data very complex to analyse manually, resulting in the pertinent use of computational means of analysing them. Data mining provides a wide range of techniques for detecting useful knowledge from massive datasets like trends, patterns and rules [44]. Data mining techniques are used for information retrieval, statistical modelling and machine learning. These techniques employ data pre-processing, data analysis, and data interpretation processes in the course of data analysis. This survey discusses different data mining techniques used in mining diverse aspects of the social network over decades going from the historical techniques to the up-to-date models, including our novel technique named TRCM. All the techniques covered in this survey are listed in the Table.1 including the tools employed as well as names of their authors

    Sentiment analysis of electronic word of mouth (E-WoM) on e-learning.

    Get PDF
    The proliferation of social media and the internet has given people many opportunities to air their views and to be at liberty to say what they feel without hindrance. This is beneficial to commercial organizations and the general well-being of the populace. However, the cost of this freedom is that spamming is practiced with little or no control. This chapter focuses on the electronic word of mouth (eWOM) of opinion holders and the sentiments expressed in eWOM. One of the areas of life impacted by sentiment is electronic learning because it has become a prevalent mode of learning. The study aims to analyze eWOM on e-learning which can help in identifying learners' sentiments. Findings from three thousand tweets show more neutral sentiments, followed by positive sentiments. Suggestions and recommendations as well as the future directions for sentiment analysis of eWOM on e-learning are also discussed in this chapter

    Utilizing Multi-modal Weak Signals to Improve User Stance Inference in Social Media

    Get PDF
    Social media has become an integral component of the daily life. There are millions of various types of content being released into social networks daily. This allows for an interesting view into a users\u27 view on everyday life. Exploring the opinions of users in social media networks has always been an interesting subject for the Natural Language Processing researchers. Knowing the social opinions of a mass will allow anyone to make informed policy or marketing related decisions. This is exactly why it is desirable to find comprehensive social opinions. The nature of social media is complex and therefore obtaining the social opinion becomes a challenging task. Because of how diverse and complex social media networks are, they typically resonate with the actual social connections but in a digital platform. Similar to how users make friends and companions in the real world, the digital platforms enable users to mimic similar social connections. This work mainly looks at how to obtain a comprehensive social opinion out of social media network. Typical social opinion quantifiers will look at text contributions made by users to find the opinions. Currently, it is challenging because the majority of users on social media will be consuming content rather than expressing their opinions out into the world. This makes natural language processing based methods impractical due to not having linguistic features. In our work we look to improve a method named stance inference which can utilize multi-domain features to extract the social opinion. We also introduce a method which can expose users opinions even though they do not have on-topical content. We also note how by introducing weak supervision to an unsupervised task of stance inference we can improve the performance. The weak supervision we bring into the pipeline is through hashtags. We show how hashtags are contextual indicators added by humans which will be much likelier to be related than a topic model. Lastly we introduce disentanglement methods for chronological social media networks which allows one to utilize the methods we introduce above to be applied in these type of platforms

    Data analytics 2016: proceedings of the fifth international conference on data analytics

    Get PDF

    Mining Behavior of Citizen Sensor Communities to Improve Cooperation with Organizational Actors

    Get PDF
    Web 2.0 (social media) provides a natural platform for dynamic emergence of citizen (as) sensor communities, where the citizens generate content for sharing information and engaging in discussions. Such a citizen sensor community (CSC) has stated or implied goals that are helpful in the work of formal organizations, such as an emergency management unit, for prioritizing their response needs. This research addresses questions related to design of a cooperative system of organizations and citizens in CSC. Prior research by social scientists in a limited offline and online environment has provided a foundation for research on cooperative behavior challenges, including \u27articulation\u27 and \u27awareness\u27, but Web 2.0 supported CSC offers new challenges as well as opportunities. A CSC presents information overload for the organizational actors, especially in finding reliable information providers (for awareness), and finding actionable information from the data generated by citizens (for articulation). Also, we note three data level challenges: ambiguity in interpreting unconstrained natural language text, sparsity of user behaviors, and diversity of user demographics. Interdisciplinary research involving social and computer sciences is essential to address these socio-technical issues. I present a novel web information-processing framework, called the Identify-Match- Engage (IME) framework. IME allows operationalizing computation in design problems of awareness and articulation of the cooperative system between citizens and organizations, by addressing data problems of group engagement modeling and intent mining. The IME framework includes: a.) Identification of cooperation-assistive intent (seeking-offering) from short, unstructured messages using a classification model with declarative, social and contrast pattern knowledge, b.) Facilitation of coordination modeling using bipartite matching of complementary intent (seeking-offering), and c.) Identification of user groups to prioritize for engagement by defining a content-driven measure of \u27group discussion divergence\u27. The use of prior knowledge and interplay of features of users, content, and network structures efficiently captures context for computing cooperation-assistive behavior (intent and engagement) from unstructured social data in the online socio-technical systems. Our evaluation of a use-case of the crisis response domain shows improvement in performance for both intent classification and group engagement prioritization. Real world applications of this work include use of the engagement interface tool during various recent crises including the 2014 Jammu and Kashmir floods, and intent classification as a service integrated by the crisis mapping pioneer Ushahidi\u27s CrisisNET project for broader impact

    Themes and Participants’ Role in Online Health Discussion: Evidence From Reddit

    Get PDF
    Health-related topics are discussed widely on different social networking sites. These discussions and their related aspects can reveal significant insights and patterns that are worth studying and understanding. In this dissertation, we explore the patterns of mandatory and voluntary vaccine online discussions including the topics discussed, the words correlated with each of them, and the sentiment expressed. Moreover, we explore the role opinion leaders play in the health discussion and their impact on participation in a particular discussion. Opinion leaders are determined, and their impact on discussion participation is differentiated based on their different characteristics such as their connections and locations in the social network, their content, and their sentiment. We apply social network analysis, topic modeling, sentiment analysis, machine learning, econometric analysis, and other techniques to analyze the collected data from Reddit. The results of our analyses show that sentiment is an important factor in health discussion, and it varies between different types of discussions. In addition, we identified the main topics discussed for each vaccine. Furthermore, the results of our study found that global opinion leaders have more influence compared to local opinion leaders in elevating the health discussion. Our study has important theoretical and practical implications

    A Framework to Categorize Shill and Normal Reviews by Measuring it’s Linguistic Features

    Get PDF
    Shill reviews detection has attracted significant attention from both business and research communities. Shill reviews are increasingly used to influence the reputation of products sold on websites in positive or negative manner. The spammers may create shill reviews which mislead readers to artificially promote or devalue some target products or services. Different methods which work according to linguistic features have been adopted and implemented effectively. Surprisingly, review manipulation was found on reputable e-commerce websites also. This is the reason why linguistic-feature based methods have gained more and more popularity. Lingual features of shill reviews are examined in this study and then a tool has been developed for extracting product features from the text used in the product review under analysis. Fake reviews, fake comments, fake blogs, fake social network postings and deceptive texts are some forms of shill reviews. By extracting linguistic features like informativeness, subjectivity and readability, an attempt is made to find difference between shill and normal reviews. On the basis of these three characteristics, hypotheses are formed and generalized. These hypotheses help to compare shill and normal reviews in analytical terms. Proposed work is for based on polarity of the text (positive or negative), as shill reviewer tend to use a definite polarity based on their intention, positive or negative

    Exploring the value of big data analysis of Twitter tweets and share prices

    Get PDF
    Over the past decade, the use of social media (SM) such as Facebook, Twitter, Pinterest and Tumblr has dramatically increased. Using SM, millions of users are creating large amounts of data every day. According to some estimates ninety per cent of the content on the Internet is now user generated. Social Media (SM) can be seen as a distributed content creation and sharing platform based on Web 2.0 technologies. SM sites make it very easy for its users to publish text, pictures, links, messages or videos without the need to be able to program. Users post reviews on products and services they bought, write about their interests and intentions or give their opinions and views on political subjects. SM has also been a key factor in mass movements such as the Arab Spring and the Occupy Wall Street protests and is used for human aid and disaster relief (HADR). There is a growing interest in SM analysis from organisations for detecting new trends, getting user opinions on their products and services or finding out about their online reputation. Companies such as Amazon or eBay use SM data for their recommendation engines and to generate more business. TV stations buy data about opinions on their TV programs from Facebook to find out what the popularity of a certain TV show is. Companies such as Topsy, Gnip, DataSift and Zoomph have built their entire business models around SM analysis. The purpose of this thesis is to explore the economic value of Twitter tweets. The economic value is determined by trying to predict the share price of a company. If the share price of a company can be predicted using SM data, it should be possible to deduce a monetary value. There is limited research on determining the economic value of SM data for “nowcasting”, predicting the present, and for forecasting. This study aims to determine the monetary value of Twitter by correlating the daily frequencies of positive and negative Tweets about the Apple company and some of its most popular products with the development of the Apple Inc. share price. If the number of positive tweets about Apple increases and the share price follows this development, the tweets have predictive information about the share price. A literature review has found that there is a growing interest in analysing SM data from different industries. A lot of research is conducted studying SM from various perspectives. Many studies try to determine the impact of online marketing campaigns or try to quantify the value of social capital. Others, in the area of behavioural economics, focus on the influence of SM on decision-making. There are studies trying to predict financial indicators such as the Dow Jones Industrial Average (DJIA). However, the literature review has indicated that there is no study correlating sentiment polarity on products and companies in tweets with the share price of the company. The theoretical framework used in this study is based on Computational Social Science (CSS) and Big Data. Supporting theories of CSS are Social Media Mining (SMM) and sentiment analysis. Supporting theories of Big Data are Data Mining (DM) and Predictive Analysis (PA). Machine learning (ML) techniques have been adopted to analyse and classify the tweets. In the first stage of the study, a body of tweets was collected and pre-processed, and then analysed for their sentiment polarity towards Apple Inc., the iPad and the iPhone. Several datasets were created using different pre-processing and analysis methods. The tweet frequencies were then represented as time series. The time series were analysed against the share price time series using the Granger causality test to determine if one time series has predictive information about the share price time series over the same period of time. For this study, several Predictive Analytics (PA) techniques on tweets were evaluated to predict the Apple share price. To collect and analyse the data, a framework has been developed based on the LingPipe (LingPipe 2015) Natural Language Processing (NLP) tool kit for sentiment analysis, and using R, the functional language and environment for statistical computing, for correlation analysis. Twitter provides an API (Application Programming Interface) to access and collect its data programmatically. Whereas no clear correlation could be determined, at least one dataset was showed to have some predictive information on the development of the Apple share price. The other datasets did not show to have any predictive capabilities. There are many data analysis and PA techniques. The techniques applied in this study did not indicate a direct correlation. However, some results suggest that this is due to noise or asymmetric distributions in the datasets. The study contributes to the literature by providing a quantitative analysis of SM data, for example tweets about Apple and its most popular products, the iPad and iPhone. It shows how SM data can be used for PA. It contributes to the literature on Big Data and SMM by showing how SM data can be collected, analysed and classified and explore if the share price of a company can be determined based on sentiment time series. It may ultimately lead to better decision making, for instance for investments or share buyback

    Breadth analysis of Online Social Networks

    Get PDF
    This thesis is mainly motivated by the analysis, understanding, and prediction of human behaviour by means of the study of their digital fingeprints. Unlike a classical PhD thesis, where you choose a topic and go further on a deep analysis on a research topic, we carried out a breadth analysis on the research topic of complex networks, such as those that humans create themselves with their relationships and interactions. These kinds of digital communities where humans interact and create relationships are commonly called Online Social Networks. Then, (i) we have collected their interactions, as text messages they share among each other, in order to analyze the sentiment and topic of such messages. We have basically applied the state-of-the-art techniques for Natural Language Processing, widely developed and tested on English texts, in a collection of Spanish Tweets and we compare the results. Next, (ii) we focused on Topic Detection, creating our own classifier and applying it to the former Tweets dataset. The breakthroughs are two: our classifier relies on text-graphs from the input text and we achieved a figure of 70% accuracy, outperforming previous results. After that, (iii) we moved to analyze the network structure (or topology) and their data values to detect outliers. We hypothesize that in social networks there is a large mass of users that behaves similarly, while a reduced set of them behave in a different way. However, specially among this last group, we try to separate those with high activity, or low activity, or any other paramater/feature that make them belong to different kind of outliers. We aim to detect influential users in one of these outliers set. We propose a new unsupervised method, Massive Unsupervised Outlier Detection (MUOD), labeling the outliers detected os of shape, magnitude, amplitude or combination of those. We applied this method to a subset of roughly 400 million Google+ users, identifying and discriminating automatically sets of outlier users. Finally, (iv) we find interesting to address the monitorization of real complex networks. We created a framework to dynamically adapt the temporality of large-scale dynamic networks, reducing compute overhead by at least 76%, data volume by 60% and overall cloud costs by at least 54%, while always maintaining accuracy above 88%.PublicadoPrograma de Doctorado en Ingeniería Matemåtica por la Universidad Carlos III de MadridPresidente: Rosa María Benito Zafrilla.- Secretario: Ángel Cuevas Rumín.- Vocal: José Ernesto Jiménez Merin

    A Survey on Data-Driven Evaluation of Competencies and Capabilities Across Multimedia Environments

    Get PDF
    The rapid evolution of technology directly impacts the skills and jobs needed in the next decade. Users can, intentionally or unintentionally, develop different skills by creating, interacting with, and consuming the content from online environments and portals where informal learning can emerge. These environments generate large amounts of data; therefore, big data can have a significant impact on education. Moreover, the educational landscape has been shifting from a focus on contents to a focus on competencies and capabilities that will prepare our society for an unknown future during the 21st century. Therefore, the main goal of this literature survey is to examine diverse technology-mediated environments that can generate rich data sets through the users’ interaction and where data can be used to explicitly or implicitly perform a data-driven evaluation of different competencies and capabilities. We thoroughly and comprehensively surveyed the state of the art to identify and analyse digital environments, the data they are producing and the capabilities they can measure and/or develop. Our survey revealed four key multimedia environments that include sites for content sharing & consumption, video games, online learning and social networks that fulfilled our goal. Moreover, different methods were used to measure a large array of diverse capabilities such as expertise, language proficiency and soft skills. Our results prove the potential of the data from diverse digital environments to support the development of lifelong and lifewide 21st-century capabilities for the future society
    • 

    corecore