2,050 research outputs found
Seminar Users in the Arabic Twitter Sphere
We introduce the notion of "seminar users", who are social media users
engaged in propaganda in support of a political entity. We develop a framework
that can identify such users with 84.4% precision and 76.1% recall. While our
dataset is from the Arab region, omitting language-specific features has only a
minor impact on classification performance, and thus, our approach could work
for detecting seminar users in other parts of the world and in other languages.
We further explored a controversial political topic to observe the prevalence
and potential potency of such users. In our case study, we found that 25% of
the users engaged in the topic are in fact seminar users and their tweets make
nearly a third of the on-topic tweets. Moreover, they are often successful in
affecting mainstream discourse with coordinated hashtag campaigns.Comment: to appear in SocInfo 201
Simulated evaluation of faceted browsing based on feature selection
In this paper we explore the limitations of facet based browsing which uses sub-needs of an information need for querying and organising the search process in video retrieval. The underlying assumption of this approach is that the search effectiveness will be enhanced if such an approach is employed for interactive video retrieval using textual and visual features. We explore the performance bounds of a faceted system by carrying out a simulated user evaluation on TRECVid data sets, and also on the logs of a prior user experiment with the system. We first present a methodology to reduce the dimensionality of features by selecting the most important ones. Then, we discuss the simulated evaluation strategies employed in our evaluation and the effect on the use of both textual and visual features. Facets created by users are simulated by clustering video shots using textual and visual features. The experimental results of our study demonstrate that the faceted browser can potentially improve the search effectiveness
Classifying the suras by their lexical semantics :an exploratory multivariate analysis approach to understanding the Qur'an
PhD ThesisThe Qur'an is at the heart of Islamic culture. Careful, well-informed interpretation of
it is fundamental both to the faith of millions of Muslims throughout the world, and
also to the non-Islamic world's understanding of their religion. There is a long and
venerable tradition of Qur'anic interpretation, and it has necessarily been based on
literary-historical methods for exegesis of hand-written and printed text.
Developments in electronic text representation and analysis since the second half of
the twentieth century now offer the opportunity to supplement traditional techniques
by applying the newly-emergent computational technology of exploratory
multivariate analysis to interpretation of the Qur'an. The general aim of the present
discussion is to take up that opportunity.
Specifically, the discussion develops and applies a methodology for discovering the
thematic structure of the Qur'an based on a fundamental idea in a range of
computationally oriented disciplines: that, with respect to some collection of texts, the
lexical frequency profiles of the individual texts are a good indicator of their semantic
content, and thus provide a reliable criterion for their conceptual categorization
relative to one another. This idea is applied to the discovery of thematic
interrelationships among the suras that constitute the Qur'an by abstracting lexical
frequency data from them and then analyzing that data using exploratory multivariate
methods in the hope that this will generate hypotheses about the thematic structure of
the Qur'an.
The discussion is in eight main parts. The first part introduces the discussion. The
second gives an overview of the structure and thematic content of the Qur'an and of
the tradition of Qur'anic scholarship devoted to its interpretation. The third part
xvi
defines the research question to be addressed together with a methodology for doing
so. The fourth reviews the existing literature on the research question. The fifth
outlines general principles of data creation and applies them to creation of the data on
which the analysis of the Qur'an in this study is based. The sixth outlines general
principles of exploratory multivariate analysis, describes in detail the analytical
methods selected for use, and applies them to the data created in part five. The
seventh part interprets the results of the analyses conducted in part six with reference
to the existing results in Qur'anic interpretation described in part two. And, finally, the
eighth part draws conclusions relative to the research question and identifies
directions along which the work presented in this study can be developed
mARC: Memory by Association and Reinforcement of Contexts
This paper introduces the memory by Association and Reinforcement of Contexts
(mARC). mARC is a novel data modeling technology rooted in the second
quantization formulation of quantum mechanics. It is an all-purpose incremental
and unsupervised data storage and retrieval system which can be applied to all
types of signal or data, structured or unstructured, textual or not. mARC can
be applied to a wide range of information clas-sification and retrieval
problems like e-Discovery or contextual navigation. It can also for-mulated in
the artificial life framework a.k.a Conway "Game Of Life" Theory. In contrast
to Conway approach, the objects evolve in a massively multidimensional space.
In order to start evaluating the potential of mARC we have built a mARC-based
Internet search en-gine demonstrator with contextual functionality. We compare
the behavior of the mARC demonstrator with Google search both in terms of
performance and relevance. In the study we find that the mARC search engine
demonstrator outperforms Google search by an order of magnitude in response
time while providing more relevant results for some classes of queries
Theory and Applications for Advanced Text Mining
Due to the growth of computer technologies and web technologies, we can easily collect and store large amounts of text data. We can believe that the data include useful knowledge. Text mining techniques have been studied aggressively in order to extract the knowledge from the data since late 1990s. Even if many important techniques have been developed, the text mining research field continues to expand for the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques. They are various techniques from relation extraction to under or less resourced language. I believe that this book will give new knowledge in the text mining field and help many readers open their new research fields
Emotional Tendency Analysis of Twitter Data Streams
The web now seems to be an alive and dynamic arena in which billions of people across the globe connect, share, publish, and engage in a broad range of everyday activities. Using social media, individuals may connect and communicate with each other at any time and from any location. More than 500 million individuals across the globe post their thoughts and opinions on the internet every day. There is a huge amount of information created from a variety of social media platforms in a variety of formats and languages throughout the globe. Individuals define emotions as powerful feelings directed toward something or someone as a result of internal or external events that have a personal meaning. Emotional recognition in text has several applications in human-computer interface and natural language processing (NLP). Emotion classification has previously been studied using bag-of words classifiers or deep learning methods on static Twitter data. For real-time textual emotion identification, the proposed model combines a mix of keyword-based and learning-based models, as well as a real-time Emotional Tendency Analysi
A systematic survey of online data mining technology intended for law enforcement
As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment generating huge volumes of data. With manual inspections becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists which examines their techniques, applications and rigour. This article remedies this gap through a systematic mapping study describing online data-mining literature which visibly targets law enforcement applications, using evidence-based practices in survey making to produce a replicable analysis which can be methodologically examined for deficiencies
A COMPARISON OF MACHINE LEARNING TECHNIQUES: E-MAIL SPAM FILTERING FROM COMBINED SWAHILI AND ENGLISH EMAIL MESSAGES
The speed of technology change is faster now compared to the past ten to fifteen years. It changes the way people live and force them to use the latest devices to match with the speed. In communication perspectives nowadays, use of electronic mail (e-mail) for people who want to communicate with friends, companies or even the universities cannot be avoided. This makes it to be the most targeted by the spammer and hackers and other bad people who want to get the benefit by sending spam emails. The report shows that the amount of emails sent through the internet in a day can be more than 10 billion among these 45% are spams. The amount is not constant as sometimes it goes higher than what is noted here. This indicates clearly the magnitude of the problem and calls for the need for more efforts to be applied to reduce this amount and also minimize the effects from the spam messages.
Various measures have been taken to eliminate this problem. Once people used social methods, that is legislative means of control and now they are using technological methods which are more effective and timely in catching spams as these work by analyzing the messages content. In this paper we compare the performance of machine learning algorithms by doing the experiment for testing English language dataset, Swahili language dataset individual and combined two dataset to form one, and results from combined dataset compared them with the Gmail classifier. The classifiers which the researcher used are Naïve Bayes (NB), Sequential Minimal Optimization (SMO) and k-Nearest Neighbour (k-NN).
The results for combined dataset shows that SMO classifier lead the others by achieve 98.60% of accuracy, followed by k-NN classifier which has 97.20% accuracy, and Naïve Bayes classifier has 92.89% accuracy. From this result the researcher concludes that SMO classifier can work better in dataset that combined English and Swahili languages. In English dataset shows that SMO classifier leads other algorism, it achieved 97.51% of accuracy, followed by k-NN with average accuracy of 93.52% and the last but also good accuracy is Naïve Bayes that come with 87.78%. Swahili dataset Naïve Bayes lead others by getting 99.12% accuracy followed by SMO which has 98.69% and the last was k-NN which has 98.47%
- …