363 research outputs found
A large multilingual and multi-domain dataset for recommender systems
This paper presents a multi-domain interests dataset to train and test Recommender Systems, and the methodology to create the dataset
from Twitter messages in English and Italian. The English dataset includes an average of 90 preferences per user on music, books,
movies, celebrities, sport, politics and much more, for about half million users. Preferences are either extracted from messages of
users who use Spotify, Goodreads and other similar content sharing platforms, or induced from their ”topical” friends, i.e., followees
representing an interest rather than a social relation between peers. In addition, preferred items are matched with Wikipedia articles
describing them. This unique feature of our dataset provides a mean to derive a semantic categorization of the preferred items, exploiting
available semantic resources linked to Wikipedia such as the Wikipedia Category Graph, DBpedia, BabelNet and others
Inferring user interests in microblogging social networks: a survey
With the growing popularity of microblogging services such as Twitter in recent years,
an increasing number of users are using these services in their daily lives. The huge volume of information generated by users raises new opportunities in various applications
and areas. Inferring user interests plays a significant role in providing personalized
recommendations on microblogging services, and also on third-party applications
providing social logins via these services, especially in cold-start situations. In this
survey, we review user modeling strategies with respect to inferring user interests
from previous studies. To this end, we focus on four dimensions of inferring user
interest profiles: (1) data collection, (2) representation of user interest profiles, (3)
construction and enhancement of user interest profiles, and (4) the evaluation of the
constructed profiles. Through this survey, we aim to provide an overview of state-of-the-art user modeling strategies for inferring user interest profiles on microblogging
social networks with respect to the four dimensions. For each dimension, we review
and summarize previous studies based on specified criteria. Finally, we discuss some
challenges and opportunities for future work in this research domain
Harnessing the power of the general public for crowdsourced business intelligence: a survey
International audienceCrowdsourced business intelligence (CrowdBI), which leverages the crowdsourced user-generated data to extract useful knowledge about business and create marketing intelligence to excel in the business environment, has become a surging research topic in recent years. Compared with the traditional business intelligence that is based on the firm-owned data and survey data, CrowdBI faces numerous unique issues, such as customer behavior analysis, brand tracking, and product improvement, demand forecasting and trend analysis, competitive intelligence, business popularity analysis and site recommendation, and urban commercial analysis. This paper first characterizes the concept model and unique features and presents a generic framework for CrowdBI. It also investigates novel application areas as well as the key challenges and techniques of CrowdBI. Furthermore, we make discussions about the future research directions of CrowdBI
Trustworthiness in Social Big Data Incorporating Semantic Analysis, Machine Learning and Distributed Data Processing
This thesis presents several state-of-the-art approaches constructed for the purpose of (i) studying the trustworthiness of users in Online Social Network platforms, (ii) deriving concealed knowledge from their textual content, and (iii) classifying and predicting the domain knowledge of users and their content. The developed approaches are refined through proof-of-concept experiments, several benchmark comparisons, and appropriate and rigorous evaluation metrics to verify and validate their effectiveness and efficiency, and hence, those of the applied frameworks
Data Mining Algorithms for Internet Data: from Transport to Application Layer
Nowadays we live in a data-driven world. Advances in data generation, collection and storage technology have enabled organizations to gather data sets of massive size. Data mining is a discipline that blends traditional data analysis methods with sophisticated algorithms to handle the challenges posed by these new types of data sets.
The Internet is a complex and dynamic system with new protocols and applications that arise at a constant pace. All these characteristics designate the Internet a valuable and challenging data source and application domain for a research activity, both looking at Transport layer, analyzing network tra c flows, and going up to Application layer, focusing on the ever-growing next generation web services: blogs, micro-blogs, on-line social networks, photo sharing services and many other applications (e.g., Twitter, Facebook, Flickr, etc.).
In this thesis work we focus on the study, design and development of novel algorithms and frameworks to support large scale data mining activities over huge and heterogeneous data volumes, with a particular focus on Internet data as data source and targeting network tra c classification, on-line social network analysis, recommendation systems and cloud services and Big data
An integrated semantic-based framework for intelligent similarity measurement and clustering of microblogging posts
Twitter, the most popular microblogging platform, is gaining rapid prominence as a source of
information sharing and social awareness due to its popularity and massive user generated
content. These include applications such as tailoring advertisement campaigns, event
detection, trends analysis, and prediction of micro-populations. The aforementioned
applications are generally conducted through cluster analysis of tweets to generate a more
concise and organized representation of the massive raw tweets. However, current approaches
perform traditional cluster analysis using conventional proximity measures, such as Euclidean
distance. However, the sheer volume, noise, and dynamism of Twitter, impose challenges that
hinder the efficacy of traditional clustering algorithms in detecting meaningful clusters within
microblogging posts. The research presented in this thesis sets out to design and develop a
novel short text semantic similarity (STSS) measure, named TREASURE, which captures the
semantic and structural features of microblogging posts for intelligently predicting the
similarities. TREASURE is utilised in the development of an innovative semantic-based
cluster analysis algorithm (SBCA) that contributes in generating more accurate and
meaningful granularities within microblogging posts. The integrated semantic-based
framework incorporating TREASURE and the SBCA algorithm tackles both the problem of
microblogging cluster analysis and contributes to the success of a variety of natural language
processing (NLP) and computational intelligence research.
TREASURE utilises word embedding neural network (NN) models to capture the semantic
relationships between words based on their co-occurrences in a corpus. Moreover,
TREASURE analyses the morphological and lexical structure of tweets to predict the syntactic
similarities. An intrinsic evaluation of TREASURE was performed with reference to a reliable
similarity benchmark generated through an experiment to gather human ratings on a Twitter
political dataset. A further evaluation was performed with reference to the SemEval-2014
similarity benchmark in order to validate the generalizability of TREASURE. The intrinsic
evaluation and statistical analysis demonstrated a strong positive linear correlation between
TREASURE and human ratings for both benchmarks. Furthermore, TREASURE achieved a
significantly higher correlation coefficient compared to existing state-of-the-art STSS
measures.
The SBCA algorithm incorporates TREASURE as the proximity measure. Unlike
conventional partition-based clustering algorithms, the SBCA algorithm is fully unsupervised
and dynamically determine the number of clusters beforehand. Subjective evaluation criteria
were employed to evaluate the SBCA algorithm with reference to the SemEval-2014 similarity
benchmark. Furthermore, an experiment was conducted to produce a reliable multi-class
benchmark on the European Referendum political domain, which was also utilised to evaluate
the SBCA algorithm. The evaluation results provide evidence that the SBCA algorithm
undertakes highly accurate combining and separation decisions and can generate pure clusters
from microblogging posts.
The contributions of this thesis to knowledge are mainly demonstrated as: 1) Development
of a novel STSS measure for microblogging posts (TREASURE). 2) Development of a new
SBCA algorithm that incorporates TREASURE to detect semantic themes in microblogs. 3)
Generating a word embedding pre-trained model learned from a large corpus of political
tweets. 4) Production of a reliable similarity-annotated benchmark and a reliable multi-class
benchmark in the domain of politics
- …