
    EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets

    Full text link
    This article introduces a new language-independent approach for creating a large-scale, high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevance and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology on Arabic tweets resulted in EveTAR, the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets.
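The abstract reports inter-annotator agreement as a Kappa value of 0.71. For readers unfamiliar with the statistic, Cohen's kappa compares observed agreement between two annotators against the agreement expected by chance; a minimal sketch (not the paper's own evaluation code):

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is agreement expected if both annotators labelled at random
    according to their own label distributions.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal rates.
    categories = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0.71, as reported for EveTAR, is conventionally read as "substantial" agreement on the Landis-Koch scale.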

    Aggregating Content and Network Information to Curate Twitter User Lists

    Full text link
    Twitter introduced user lists in late 2009, allowing users to be grouped according to meaningful topics or themes. Lists have since been adopted by media outlets as a means of organising content around news stories. The curation of these lists is therefore important: they should contain the key information gatekeepers and present a balanced perspective on a story. Here we address this list curation process from a recommender systems perspective. We propose a variety of criteria for generating user list recommendations, based on content analysis, network analysis, and the "crowdsourcing" of existing user lists. We demonstrate that these types of criteria are often only successful for datasets with certain characteristics. To resolve this issue, we propose the aggregation of these different "views" of a news story on Twitter to produce more accurate user recommendations to support the curation process.
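The abstract proposes aggregating content-based, network-based, and crowdsourced "views" into a single recommendation list. The paper's actual aggregation scheme is not given in the abstract; a common rank-aggregation choice is a Borda count, sketched here for illustration:

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Combine several ranked candidate lists (e.g. content, network,
    and crowdsourced views of a story) into one consensus ranking.

    Borda count: a candidate at position p in a list of length L
    receives L - p points; candidates are sorted by total points.
    This is an illustrative scheme, not necessarily the paper's.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        for pos, user in enumerate(ranking):
            scores[user] += len(ranking) - pos
    return sorted(scores, key=scores.get, reverse=True)
```

A user ranked highly by several views accumulates points from each, so the aggregate favours candidates on which the views agree.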

    Extracting News Events from Microblogs

    Full text link
    The Twitter stream has become a large source of information for many people, but the magnitude of tweets and the noisy nature of its content have long made harvesting knowledge from Twitter a challenging task for researchers. Aiming to overcome some of the main challenges of extracting hidden information from tweet streams, this work proposes a new approach for real-time detection of news events from the Twitter stream. We divide our approach into three steps. The first step is to use a deep neural network to detect news-relevant tweets from the stream. The second step is to apply a novel streaming data clustering algorithm to the detected news tweets to form news events. The third and final step is to rank the detected events based on the size of the event clusters and the growth speed of the tweet frequencies. We evaluate the proposed system on a large, publicly available corpus of annotated news events from Twitter. As part of the evaluation, we compare our approach with a related state-of-the-art solution. Overall, our experiments and user-based evaluation show that our approach to detecting current (real) news events delivers state-of-the-art performance.
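The third step above ranks detected events by cluster size and growth speed of tweet frequencies. The abstract does not specify how the two signals are combined; a simple product is assumed in this sketch:

```python
def rank_events(events):
    """Rank detected event clusters: larger clusters with faster-growing
    tweet frequencies rank higher.

    Each event is a dict with 'size' (tweets in the cluster) and
    'growth' (e.g. new tweets per minute). Scoring by their product
    is an assumption for illustration; the paper's exact scoring
    function is not given in the abstract.
    """
    return sorted(events, key=lambda e: e["size"] * e["growth"], reverse=True)
```

With this scoring, a small but rapidly growing cluster can outrank a large stagnant one, which matches the goal of surfacing *current* news events.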

    Extracting keywords from tweets

    Get PDF
    In recent years, an enormous amount of information has become available on the Internet. Social networks are among the largest contributors to this growth in data volume. Twitter, in particular, has paved the way, as a social platform, for people and organisations to interact with each other, generating large volumes of data from which useful information can be extracted. Such a quantity of data can prove important, for example, if and when several individuals report symptoms of illness at the same time and in the same place. Automatically processing such a volume of information and deriving useful knowledge from it is, however, an impossible task for any human being. Keyword extractors emerge in this context as a valuable tool that aims to ease this work by providing quick access to a set of terms that characterise a document. In this work, we try to contribute to a better understanding of this problem by evaluating the effectiveness of YAKE! (an unsupervised keyword extraction algorithm) on a collection of tweets, a type of text characterised not only by its reduced length but also by its unstructured nature. Although keyword extractors have been widely applied to generic texts such as reports and articles, their applicability to tweets is scarce, and to date no dataset has been formally made available. In this work, to address this problem, we chose to develop and release a new data collection, an important contribution towards the scientific community developing new solutions in this domain. KWTweet was annotated by 15 annotators, resulting in 7736 annotated tweets. Based on this information, we were then able to evaluate the effectiveness of YAKE! against 9 unsupervised keyword extraction baselines (TextRank, KP-Miner, SingleRank, PositionRank, TopicPageRank, MultipartiteRank, TopicRank, Rake and TF.IDF). The results obtained show that YAKE! outperforms its competitors, demonstrating its effectiveness on this type of text. Finally, we provide a demo that shows YAKE! in action: on this web platform, users can search by user or hashtag and obtain the most relevant keywords as a word cloud.
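Among the unsupervised baselines compared against YAKE! above is TF.IDF, which scores a term by its frequency in the document, discounted by how many documents in the collection contain it. A minimal sketch of that baseline (illustrative only, not the evaluation code from the work):

```python
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=5):
    """Rank the words of docs[doc_index] by TF.IDF against the collection.

    tf  = count of the word in the target document / document length
    idf = log(total documents / documents containing the word)
    Simple whitespace tokenisation; real systems would normalise text.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for doc in tokenized for w in set(doc))
    tf = Counter(tokenized[doc_index])
    doc_len = len(tokenized[doc_index])
    scores = {w: (c / doc_len) * math.log(n_docs / df[w])
              for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

On tweets, such frequency-based baselines suffer from the short, unstructured text the abstract describes, which is part of why a statistical extractor like YAKE! was evaluated against them.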

    Ranking of high-value social audiences on Twitter

    Get PDF
    Even though social media offers plenty of business opportunities, identifying the right audience in the massive amount of social media data is highly challenging for a company with finite resources and marketing budgets. In this paper, we present a ranking mechanism that is capable of identifying the top-k social audience members on Twitter based on an index. Data from three different Twitter business account owners were used in our experiments to validate this ranking mechanism. The results show that the index, developed using a combination of semi-supervised and supervised learning methods, is indeed generic enough to retrieve relevant audience members from the three different data sets. This approach of combining Fuzzy Match, Twitter Latent Dirichlet Allocation and a Support Vector Machine Ensemble is able to leverage the content of account owners to construct seed words and training data sets with minimal annotation effort. We conclude that this ranking mechanism has the potential to be adopted in real-world applications for differentiating prospective customers from the general audience and enabling market segmentation for better business decision making.
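Once each audience member has an index score from the learning pipeline described above, retrieving the top-k members is a standard selection problem. A small sketch, with the scores assumed precomputed (the index construction itself is the paper's contribution and is not reproduced here):

```python
import heapq

def top_k_audience(members, k):
    """Return the k audience members with the highest index scores.

    members: iterable of (user_id, index_score) pairs, where the score
    is assumed to come from the paper's semi-supervised/supervised
    pipeline. heapq.nlargest avoids sorting the full list when k is
    small relative to the audience size.
    """
    return heapq.nlargest(k, members, key=lambda m: m[1])
```

For large audiences this runs in O(n log k) rather than the O(n log n) of a full sort, which matters when only a short list of prospective customers is needed.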

    YouTube AV 50K: An Annotated Corpus for Comments in Autonomous Vehicles

    Full text link
    With one billion monthly viewers, and millions of users discussing and sharing opinions, comments below YouTube videos are rich sources of data for opinion mining and sentiment analysis. We introduce the YouTube AV 50K dataset, a freely available collection of more than 50,000 YouTube comments and metadata from autonomous vehicle (AV)-related videos. We describe its creation process, its content and data format, and discuss its possible usages. In particular, we present a case study of the first self-driving car fatality to evaluate the dataset, and show how it can be used to better understand public attitudes toward self-driving cars and public reactions to the accident. Future developments of the dataset are also discussed.
    Comment: In Proceedings of the Thirteenth International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2018).
    • 

    corecore