Bridging the Gap Between the Least and the Most Influential Twitter Users
Social networks play an increasingly important role in shaping the behaviour of Web users. Arguably, Twitter stands out from the others, not only for the platform's simplicity but also for the great influence that messages sent over the network can have. The impact of such messages determines the influence of a Twitter user and is what tools such as Klout, PeerIndex or TwitterGrader aim to calculate. Reducing all the factors that make a person influential to a single number is not an easy task, and the effort involved could become useless if Twitter users do not know how to improve that number. In this paper we identify which specific actions a Twitterer should carry out to increase their influence in each of the above-mentioned tools, applying, for this purpose, data mining techniques based on classification and regression algorithms to information collected from a set of Twitter users. This work has been partially funded by the European Commission project "SiSOB: An Observatorium for Science in Society based in Social Models" (http://sisob.lcc.uma.es) (contract no. FP7 266588), "Sistemas Inalámbricos de Gestión de Información Crítica" (code TIN2011-23795, granted by the MEC, Spain) and "3DTUTOR: Sistema Interoperable de Asistencia y Tutoría Virtual e Inteligente 3D" (code IPT-2011-0889-900000, granted by the MINECO, Spain).
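The core of the approach above is fitting regression models to activity features and inspecting what drives the predicted influence score. A minimal sketch of that idea, with invented feature names and synthetic data (not the paper's actual features or any real Klout/PeerIndex data), fits an ordinary-least-squares model and reads off the coefficients:

```python
# Hypothetical sketch: fit a linear regression on user-activity features and
# inspect the coefficients to see which actions raise a (synthetic) influence
# score most. Features and data are invented for illustration.

def fit_ols(X, y):
    """Ordinary least squares via the normal equations (pure Python)."""
    rows = [[1.0] + list(x) for x in X]          # prepend a bias column
    n = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(n)]
    # Gauss-Jordan elimination (X^T X is positive definite here, no pivoting needed)
    for i in range(n):
        pivot = xtx[i][i]
        xtx[i] = [v / pivot for v in xtx[i]]
        xty[i] /= pivot
        for k in range(n):
            if k != i:
                f = xtx[k][i]
                xtx[k] = [a - f * b for a, b in zip(xtx[k], xtx[i])]
                xty[k] -= f * xty[i]
    return xty  # [bias, coef_followers, coef_retweets]

# Synthetic users: (followers in thousands, retweets per week) -> influence score
X = [(1, 2), (5, 1), (10, 8), (2, 6), (8, 3)]
y = [9.0, 19.0, 48.0, 20.0, 32.0]                # generated as 2 + 3*f + 2*r

coef = fit_ols(X, y)
# The larger coefficient identifies the action with more leverage on the score.
```

In the paper's setting the coefficients (or feature importances from a classifier) play the same role: they point to the concrete actions a user can take to improve each tool's score.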
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.
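The programming model the essay calls "good enough" can be captured in a few lines: a mapper emits (key, value) pairs, a shuffle groups values by key, and a reducer aggregates each group. The in-process sketch below (a toy, not Hadoop) shows the canonical "nail", word count:

```python
# Minimal in-process sketch of the MapReduce programming model: map, shuffle
# (group by key), reduce. Word count is the canonical fit for this model.
from collections import defaultdict

def map_phase(records, mapper):
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    return {k: reducer(k, vs) for k, vs in groups.items()}

docs = ["big data is big", "data analytics"]
mapper = lambda line: ((w, 1) for w in line.split())
reducer = lambda key, values: sum(values)

counts = reduce_phase(shuffle(map_phase(docs, mapper)), reducer)
# counts == {"big": 2, "data": 2, "is": 1, "analytics": 1}
```

Algorithms that fit this single map-shuffle-reduce shape are "nails"; iterative algorithms need this pipeline re-run per iteration, which is exactly the mismatch the essay proposes to sidestep with non-iterative alternatives.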
Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
We present the architecture behind Twitter's real-time related query
suggestion and spelling correction service. Although these tasks have received
much attention in the web search literature, the Twitter context introduces a
real-time "twist": after significant breaking news events, we aim to provide
relevant results within minutes. This paper provides a case study illustrating
the challenges of real-time data processing in the era of "big data". We tell
the story of how our system was built twice: our first implementation was built
on a typical Hadoop-based analytics stack, but was later replaced because it
did not meet the latency requirements necessary to generate meaningful
real-time results. The second implementation, which is the system deployed in
production, is a custom in-memory processing engine specifically designed for
the task. This experience taught us that the current typical usage of Hadoop as
a "big data" platform, while great for experimentation, is not well suited to
low-latency processing, and points the way to future work on data analytics
platforms that can handle "big" as well as "fast" data.
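The latency gap the paper describes comes from batch recomputation versus incremental in-memory state. A hypothetical sketch (not Twitter's actual engine) of the in-memory approach keeps query co-occurrence counts over a short sliding window, so suggestions after a breaking-news event surface within the window rather than after the next batch job:

```python
# Hypothetical in-memory related-query counter (illustrative only, not the
# production system described in the paper): co-occurrence counts are kept
# over a sliding time window so suggestions track breaking news quickly.
from collections import defaultdict, deque

class RelatedQueryCounter:
    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.events = deque()               # (timestamp, query_a, query_b)
        self.counts = defaultdict(int)      # (query_a, query_b) -> count

    def observe(self, ts, query_a, query_b):
        """Record that one session issued query_a and then query_b."""
        self.events.append((ts, query_a, query_b))
        self.counts[(query_a, query_b)] += 1
        self._expire(ts)

    def _expire(self, now):
        # Drop events that fell out of the window and decrement their counts.
        while self.events and self.events[0][0] < now - self.window:
            _, a, b = self.events.popleft()
            self.counts[(a, b)] -= 1
            if self.counts[(a, b)] == 0:
                del self.counts[(a, b)]

    def suggest(self, query, k=3):
        """Top-k queries co-occurring with `query` inside the window."""
        related = [(c, b) for (a, b), c in self.counts.items() if a == query]
        return [b for c, b in sorted(related, reverse=True)[:k]]

rqc = RelatedQueryCounter(window_seconds=600)
rqc.observe(0, "earthquake", "earthquake map")
rqc.observe(1, "earthquake", "earthquake map")
rqc.observe(2, "earthquake", "tsunami warning")
rqc.observe(700, "earthquake", "aftershock")   # earlier events have expired
```

A Hadoop-style batch pipeline would recompute these counts from logs on a schedule measured in hours; keeping the window in memory is what makes minute-level freshness possible.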
Tripartite Graph Clustering for Dynamic Sentiment Analysis on Social Media
The growing popularity of social media (e.g., Twitter) allows users to easily
share information with each other and influence others by expressing their own
sentiments on various subjects. In this work, we propose an unsupervised
\emph{tri-clustering} framework, which analyzes both user-level and tweet-level
sentiments through co-clustering of a tripartite graph. A compelling feature of
the proposed framework is that the quality of sentiment clustering of tweets,
users, and features can be mutually improved by joint clustering. We further
investigate the evolution of user-level sentiments and latent feature vectors
in an online framework and devise an efficient online algorithm to sequentially
update the clustering of tweets, users and features with newly arrived data.
The online framework not only provides better quality of both dynamic
user-level and tweet-level sentiment analysis, but also improves the
computational and storage efficiency. We verified the effectiveness and
efficiency of the proposed approaches on the November 2012 California ballot
Twitter data. Comment: A short version appears in the Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.
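The online part of the framework rests on sequential cluster updates as new data arrives. The generic sketch below (an online running-mean centroid update, not the authors' actual tri-clustering algorithm) illustrates the idea of revising clusters incrementally instead of re-clustering from scratch:

```python
# Generic online-clustering sketch (NOT the paper's tri-clustering update):
# each arriving item is assigned to the nearest centroid, and that centroid
# is nudged toward it with a running-mean step, so clusters track the stream.

def nearest(centroids, x):
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centroids[i], x)))

def online_update(centroids, counts, x):
    """Assign x to its closest centroid and update that centroid incrementally."""
    i = nearest(centroids, x)
    counts[i] += 1
    eta = 1.0 / counts[i]                       # running-mean step size
    centroids[i] = [c + eta * (xi - c) for c, xi in zip(centroids[i], x)]
    return i

# Toy 2-D "sentiment" space: one negative-ish and one positive-ish centroid.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
stream = [[0.5, 0.2], [9.0, 9.5], [0.1, 0.4]]
assignments = [online_update(centroids, counts, x) for x in stream]
```

The paper's algorithm does the analogous sequential update jointly over tweets, users, and features on the tripartite graph, which is what yields the computational and storage savings it reports.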
Preparation of Improved Turkish DataSet for Sentiment Analysis in Social Media
A public dataset, with a variety of properties suitable for sentiment
analysis [1], event prediction, trend detection and other text mining
applications, is needed in order to be able to successfully perform analysis
studies. The vast majority of data on social media is text-based, and it is not possible to apply machine learning directly to these raw data, since several preprocessing steps are required before the algorithms can be run. For example, different misspellings of the same word enlarge the word vector space unnecessarily, thereby reducing the success of the algorithm and increasing the computational power required.
This paper presents an improved Turkish dataset with an effective spelling
correction algorithm based on Hadoop [2]. The collected data is recorded on the
Hadoop Distributed File System and the text based data is processed by
MapReduce programming model. This method is suitable for the storage and
processing of large sized text based social media data. In this study, movie
reviews have been automatically recorded with Apache ManifoldCF (MCF) [3] and
data clusters have been created. Various methods, such as Levenshtein distance and fuzzy string matching, have been compared in order to create a public dataset from the collected data. Experimental results show that the proposed algorithm, whose output can be used as an open-source dataset in sentiment analysis studies, successfully detects and corrects spelling errors. Comment: Presented at CMES201
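Both methods compared above rest on edit distance. A minimal Levenshtein implementation (standard dynamic programming) and a naive nearest-word corrector look like this; the toy vocabulary and distance threshold are illustrative, not the paper's actual configuration:

```python
# Levenshtein edit distance (dynamic programming, two-row variant) plus a
# naive corrector. Vocabulary and threshold are illustrative only.

def levenshtein(a, b):
    """Minimum number of insertions, deletions, substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(word, vocabulary, max_dist=2):
    """Return the closest vocabulary word within max_dist edits, else the word."""
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_dist else word

vocab = ["film", "güzel", "harika"]   # toy Turkish vocabulary
fixed = correct("guzel", vocab)       # a common de-accented misspelling
```

Scanning the whole vocabulary per word is quadratic in practice, which is one reason the paper distributes the work over HDFS with MapReduce rather than running it on a single machine.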
A Scalable Supervised Subsemble Prediction Algorithm
Subsemble is a flexible ensemble method that partitions a full dataset into subsets of observations, fits the same algorithm on each subset, and uses a tailored form of V-fold cross-validation to construct a prediction function that combines the subset-specific fits with a second metalearner algorithm. Previous work studied the performance of Subsemble with subsets created randomly, and showed that these types of Subsembles often result in better prediction performance than the underlying algorithm fit just once on the full dataset. Since the final Subsemble estimator varies depending on the data used to create the subset-specific fits, different strategies for creating the subsets used in Subsemble result in different Subsembles. We propose supervised partitioning of the covariate space to create the subsets used in Subsemble, and using a form of histogram regression as the metalearner used to combine the subset-specific fits. We discuss applications to large-scale datasets, and develop a practical Supervised Subsemble method using regression trees to both create the covariate space partitioning and select the number of subsets used in Subsemble. Through simulations and real data analysis, we show that this subset creation method can have better prediction performance than the random-subset version.
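The Subsemble structure (partition, fit the same base learner per subset, combine the fits) can be sketched in miniature. The sketch below is a heavy simplification, not the authors' implementation: the base learner is a 1-D least-squares line, the partition is round-robin rather than supervised, and a plain average stands in for the V-fold cross-validated histogram-regression metalearner:

```python
# Much-simplified Subsemble sketch (illustrative only): round-robin partition,
# a least-squares line as the base learner fit on each subset, and averaging
# in place of the paper's cross-validated metalearner.

def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ m*x + c."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx

def subsemble_fit(xs, ys, n_subsets):
    # Round-robin partition; the paper instead proposes supervised partitions
    # of the covariate space built with regression trees.
    return [fit_line(xs[s::n_subsets], ys[s::n_subsets])
            for s in range(n_subsets)]

def subsemble_predict(fits, x):
    preds = [m * x + c for m, c in fits]
    return sum(preds) / len(preds)     # averaging metalearner (simplification)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]   # exactly y = 2x + 1
fits = subsemble_fit(xs, ys, n_subsets=2)
pred = subsemble_predict(fits, 10.0)
```

The paper's contribution is precisely in replacing the two simplifications here: supervised (tree-based) subset creation instead of arbitrary partitions, and a learned histogram-regression combiner instead of a fixed average.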