EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets
This article introduces a new language-independent approach for creating a
large-scale high-quality test collection of tweets that supports multiple
information retrieval (IR) tasks without running a shared-task campaign. The
adopted approach (demonstrated over Arabic tweets) designs the collection
around significant (i.e., popular) events, which enables the development of
topics that represent frequent information needs of Twitter users for which
rich content exists. That inherently facilitates the support of multiple tasks
that generally revolve around events, namely event detection, ad-hoc search,
timeline generation, and real-time summarization. The key highlights of the
approach include diversifying the judgment pool via interactive search and
multiple manually-crafted queries per topic, collecting high-quality
annotations via crowd-workers for relevancy and in-house annotators for
novelty, filtering out low-agreement topics and inaccessible tweets, and
providing multiple subsets of the collection for better availability. Applying
our methodology on Arabic tweets resulted in EveTAR, the first
freely-available tweet test collection for multiple IR tasks. EveTAR includes a
crawl of 355M Arabic tweets and covers 50 significant events for which about
62K tweets were judged with substantial average inter-annotator agreement
(Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating
existing algorithms in the respective tasks. Results indicate that the new
collection can support reliable ranking of IR systems that is comparable to
similar TREC collections, while providing strong baseline results for future
studies over Arabic tweets.
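The substantial agreement reported above (Kappa of 0.71) can be checked for any pair of annotators with Cohen's kappa. A minimal sketch on made-up relevance labels (the collection's actual judgments are not reproduced here):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[l] * count_b[l]
              for l in set(labels_a) | set(labels_b)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary relevance judgments (1 = relevant, 0 = not).
a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
b = [1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
print(round(cohens_kappa(a, b), 2))  # 0.58
```

Values above roughly 0.6 are conventionally read as "substantial" agreement, which is where EveTAR's 0.71 sits.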
Aggregating Content and Network Information to Curate Twitter User Lists
Twitter introduced user lists in late 2009, allowing users to be grouped
according to meaningful topics or themes. Lists have since been adopted by
media outlets as a means of organising content around news stories. Thus the
curation of these lists is important: they should contain the key information
gatekeepers and present a balanced perspective on a story. Here we address this
list curation process from a recommender systems perspective. We propose a
variety of criteria for generating user list recommendations, based on content
analysis, network analysis, and the "crowdsourcing" of existing user lists. We
demonstrate that these types of criteria are often only successful for datasets
with certain characteristics. To resolve this issue, we propose the aggregation
of these different "views" of a news story on Twitter to produce more accurate
user recommendations to support the curation process.
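One simple way to aggregate several ranked "views" of a story (content-based, network-based, crowdsourced) into a single recommendation list is a Borda count; this is an illustrative rank-aggregation sketch, not necessarily the combination scheme the paper uses:

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Aggregate ranked user lists by Borda count: a user at
    position i of a ranking of length n earns n - i points."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for i, user in enumerate(ranking):
            points[user] += n - i
    return sorted(points, key=points.get, reverse=True)

# Hypothetical per-criterion rankings of candidate users.
content = ["alice", "bob", "carol"]
network = ["bob", "alice", "dave"]
crowd = ["bob", "carol", "alice"]
print(borda_aggregate([content, network, crowd]))  # ['bob', 'alice', 'carol', 'dave']
```

Users ranked highly by several views rise to the top even when no single criterion puts them first.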
Extracting News Events from Microblogs
The Twitter stream has become a major source of information for many people, but
the magnitude of tweets and the noisy nature of its content have made
harvesting the knowledge from Twitter a challenging task for researchers for a
long time. Aiming at overcoming some of the main challenges of extracting the
hidden information from tweet streams, this work proposes a new approach for
real-time detection of news events from the Twitter stream. We divide our
approach into three steps. The first step is to use a deep neural network to
detect news-relevant tweets from the stream. The second step is to
apply a novel streaming data clustering algorithm to the detected news tweets
to form news events. The third and final step is to rank the detected events
based on the size of the event clusters and growth speed of the tweet
frequencies. We evaluate the proposed system on a large, publicly available
corpus of annotated news events from Twitter. As part of the evaluation, we
compare our approach with a related state-of-the-art solution. Overall, our
experiments and user-based evaluation show that our approach to detecting
current (real) news events delivers state-of-the-art performance.
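The three-step pipeline above can be sketched end to end. Note the heavy simplifications: the keyword filter stands in for the paper's neural classifier, the single-pass token-overlap clusterer stands in for its streaming clustering algorithm, and the scoring formula is an illustrative blend of cluster size and growth speed:

```python
# Step 1 stand-in: the paper uses a neural classifier; a keyword
# heuristic marks tweets as news-relevant here (purely illustrative).
NEWS_HINTS = {"breaking", "reports", "killed", "election", "earthquake"}

def is_news(tweet):
    return any(w in tweet["text"].lower() for w in NEWS_HINTS)

# Step 2 stand-in: trivial single-pass clustering on shared tokens,
# not the paper's streaming algorithm.
def cluster(tweets, min_overlap=2):
    clusters = []
    for t in tweets:
        tokens = set(t["text"].lower().split())
        for c in clusters:
            if len(tokens & c["tokens"]) >= min_overlap:
                c["tweets"].append(t)
                c["tokens"] |= tokens
                break
        else:
            clusters.append({"tokens": tokens, "tweets": [t]})
    return clusters

# Step 3: rank events by cluster size plus tweet-arrival growth speed.
def rank(clusters):
    def score(c):
        times = sorted(t["ts"] for t in c["tweets"])
        span = max(times[-1] - times[0], 1)  # seconds; avoid div by zero
        return len(c["tweets"]) + len(c["tweets"]) / span
    return sorted(clusters, key=score, reverse=True)

tweets = [
    {"text": "Breaking: earthquake hits the city", "ts": 0},
    {"text": "earthquake hits the city center reports say", "ts": 10},
    {"text": "my lunch was great", "ts": 5},
    {"text": "breaking election results announced", "ts": 0},
]
events = rank(cluster([t for t in tweets if is_news(t)]))
print(len(events), len(events[0]["tweets"]))
```

Each of the three stand-ins can be swapped for the real component without changing the overall pipeline shape.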
Extracting keywords from tweets
In recent years, an enormous amount of information has become available on the Internet. Social networks are among the largest contributors to this growth in data volume. Twitter, in particular, paved the way, as a social platform, for people and organisations to interact with one another, generating large volumes of data from which useful information can be extracted. Such a quantity of data can prove important, for example, if and when several individuals report symptoms of illness at the same time and in the same place. Automatically processing such a volume of information and deriving useful knowledge from it is, however, an impossible task for any human being. Keyword extractors emerge in this context as a valuable tool that aims to facilitate this work by quickly providing access to a set of terms that characterise a document.
In this work, we try to contribute to a better understanding of this problem by evaluating the effectiveness of YAKE! (an unsupervised keyword extraction algorithm) on a set of tweets, a type of text characterised not only by its short length but also by its unstructured nature. Although keyword extractors have been widely applied to generic texts such as reports and articles, their applicability to tweets is scarce, and no dataset had been formally made available so far. To address this problem, we chose to develop and make available a new data collection, an important contribution towards the scientific community fostering new solutions in this domain. KWTweet was annotated by 15 annotators, resulting in 7736 annotated tweets. Based on this information, we were then able to evaluate the effectiveness of YAKE! against 9 unsupervised keyword extraction baselines (TextRank, KP-Miner, SingleRank, PositionRank, TopicPageRank, MultipartiteRank, TopicRank, Rake and TF.IDF). The results obtained show that YAKE! outperforms its competitors, thus proving its effectiveness on this type of text. Finally, we provide a demo that shows YAKE! in action: on this web platform, users can search by user or hashtag and obtain the most relevant keywords as a word cloud.
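One of the baselines listed above, TF.IDF keyword extraction, fits in a few lines of stdlib Python; this is a generic sketch (tokenisation by whitespace, smoothed IDF), not the exact baseline configuration used in the evaluation:

```python
import math
from collections import Counter

def tfidf_keywords(doc, corpus, top_k=3):
    """Score terms of `doc` by TF-IDF against a background corpus of tweets."""
    docs_tokens = [set(d.lower().split()) for d in corpus]
    tokens = doc.lower().split()
    tf = Counter(tokens)
    scores = {}
    for term, count in tf.items():
        df = sum(term in d for d in docs_tokens)          # document frequency
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed IDF
        scores[term] = (count / len(tokens)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

corpus = ["the game was good", "the weather is nice",
          "the earthquake shook the city"]
print(tfidf_keywords("the earthquake shook buildings", corpus))
```

Terms frequent in the tweet but rare in the background corpus score highest, while function words like "the" sink to the bottom.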
Ranking of high-value social audiences on Twitter
Even though social media offers plenty of business opportunities, for a company to identify the right audience from the massive amount of social media data is highly challenging given finite resources and marketing budgets. In this paper, we present a ranking mechanism that is capable of identifying the top-k social audience members on Twitter based on an index. Data from three different Twitter business account owners were used in our experiments to validate this ranking mechanism. The results show that the index developed using a combination of semi-supervised and supervised learning methods is indeed generic enough to retrieve relevant audience members from the three different data sets. This approach of combining Fuzzy Match, Twitter Latent Dirichlet Allocation and Support Vector Machine Ensemble is able to leverage the content of account owners to construct seed words and training data sets with minimal annotation efforts. We conclude that this ranking mechanism has the potential to be adopted in real-world applications for differentiating prospective customers from the general audience and enabling market segmentation for better business decision making.
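The top-k retrieval step of such a ranking mechanism can be sketched as a weighted combination of per-member signals. The signal names and weights below are illustrative placeholders; the paper's actual index is built from Fuzzy Match, Twitter-LDA and an SVM ensemble, none of which are implemented here:

```python
import heapq

# Hypothetical signal weights; the real index combines Fuzzy Match,
# Twitter-LDA topic scores and SVM-ensemble outputs.
WEIGHTS = {"fuzzy": 0.2, "topic": 0.3, "classifier": 0.5}

def audience_index(member):
    """Combine a member's per-signal scores into one index value."""
    return sum(WEIGHTS[s] * member[s] for s in WEIGHTS)

def top_k_audience(members, k):
    """Return the k members with the highest index."""
    return heapq.nlargest(k, members, key=audience_index)

members = [
    {"name": "a", "fuzzy": 0.9, "topic": 0.1, "classifier": 0.2},
    {"name": "b", "fuzzy": 0.5, "topic": 0.9, "classifier": 0.9},
    {"name": "c", "fuzzy": 0.1, "topic": 0.2, "classifier": 0.1},
]
print([m["name"] for m in top_k_audience(members, 2)])  # ['b', 'a']
```

`heapq.nlargest` avoids sorting the full audience when only the top-k members are needed, which matters at Twitter scale.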
YouTube AV 50K: An Annotated Corpus for Comments in Autonomous Vehicles
With one billion monthly viewers and millions of users discussing and
sharing opinions, comments below YouTube videos are rich sources of data for
opinion mining and sentiment analysis. We introduce the YouTube AV 50K dataset,
a freely-available collection of more than 50,000 YouTube comments and
metadata below autonomous vehicle (AV)-related videos. We describe its creation
process, its content and data format, and discuss its possible uses. In
particular, we present a case study of the first self-driving car fatality to
evaluate the dataset, and show how we can use this dataset to better understand
public attitudes toward self-driving cars and public reactions to the accident.
Future developments of the dataset are also discussed.
Comment: in Proceedings of the Thirteenth International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2018)