4 research outputs found

    On the Impact of Entity Linking in Microblog Real-Time Filtering

    Full text link
    Microblogging is a model of content sharing in which the temporal locality of posts with respect to important events, either of foreseeable or unforeseeable nature, makes applica- tions of real-time filtering of great practical interest. We propose the use of Entity Linking (EL) in order to improve the retrieval effectiveness, by enriching the representation of microblog posts and filtering queries. EL is the process of recognizing in an unstructured text the mention of relevant entities described in a knowledge base. EL of short pieces of text is a difficult task, but it is also a scenario in which the information EL adds to the text can have a substantial impact on the retrieval process. We implement a start-of-the-art filtering method, based on the best systems from the TREC Microblog track realtime adhoc retrieval and filtering tasks , and extend it with a Wikipedia-based EL method. Results show that the use of EL significantly improves over non-EL based versions of the filtering methods.Comment: 6 pages, 1 figure, 1 table. SAC 2015, Salamanca, Spain - April 13 - 17, 201

    Adaptive Method for Following Dynamic Topics on Twitter

    Get PDF
    Many research social studies of public response on social media require following (i.e., tracking) topics on Twitter for long periods of time. The current approaches rely on streaming tweets based on some hashtags or keywords, or following some Twitter accounts. Such approaches lead to limited coverage of on-topic tweets. In this paper, we introduce a novel technique for following such topics in a more effective way. A topic is defined as a set of well-prepared queries that cover the static side of the topic. We propose an automatic approach that adapts to emerging aspects of a tracked broad topic over time. We tested our tracking approach on three broad dynamic topics that are hot in different categories: Egyptian politics, Syrian conflict, and international sports. We measured the effectiveness of our approach over four full days spanning a period of four months to ensure consistency in effectiveness. Experimental results showed that, on average, our approach achieved over 100 % increase in recall relative to the baseline Boolean approach, while maintaining an acceptable precision of 83%

    Hyperlink-extended pseudo relevance feedback for improved microblog retrieval

    Get PDF
    Microblog retrieval has received much attention in recent years due to the wide spread of social microblogging platforms such as Twitter. The main motive behind microblog retrieval is to serve users searching a big collection of microblogs a list of relevant documents (microblogs) matching their search needs. What makes microblog retrieval different from normal web retrieval is the short length of the user queries and the documents that you search in, which leads to a big vocabulary mismatch problem. Many research studies investigated different approaches for microblog retrieval. Query expansion is one of the approaches that showed stable performance for improving microblog retrieval effectiveness. Query expansion is used mainly to overcome the vocabulary mismatch problem between user queries and short relevant documents. In our work, we investigate existing query expansion method (Pseudo Relevance Feedback - PRF) comprehensively, and propose an extension using the information from hyperlinks attached to the top relevant documents. Our experimental results on TREC microblog data showed that Pseudo Relevance Feedback (PRF) alone could outperform many retrieval approaches if configured properly. We showed that combining the expansion terms with the original query by a weight, not to dilute the effect of the original query, could lead to superior results. The weighted combine of the expansion terms is different than what is commonly used in the literature by appending the expansion terms to the original query without weighting. We experimented using different weighting schemes, and empirically found that assigning a small weight for the expansion terms 0.2, and 0.8 for the original query performs the best for the three evaluation sets 2011, 2012, and 2013. We applied the previous weighting scheme to the most reported PRF configuration used in the literature and measured the retrieval performance. The P@30 performance achieved using our weighting scheme was 0.485, 0.4136, and 0.4811 compared to 0.4585, 0.3548, and 0.3861 without applying weighting for the three evaluation sets 2011, 2012 and 2013 respectively. The MAP performance achieved using our weighting scheme was 0.4386, 0.2845, and 0.3262 compared to 0.3592, 0.2074, and 0.2256 without applying weighting for the three evaluation sets 2011, 2012 and 2013 respectively. Results also showed that utilizing hyperlinked documents attached to the top relevant tweets in query expansion improves the results over traditional PRF. By utilizing hyperlinked documents in the query expansion our best runs achieved 0.5000, 0.4339, and 0.5546 P@30 compared to 0.4864, 0.4203, and 0.5322 when applying traditional PRF, and 0.4587, 0.3044, and 0.3584 MAP when applying traditional PRF compared to 0.4405, 0.2850, and 0.3492 when utilizing the hyperlinked document contents (using web page titles, and meta-descriptions) for the three evaluation sets 2011, 2012 and 2013 respectively. We explored different types of information extracted from the hyperlinked documents; we show that using the document titles and meta-descriptions helps in improving the retrieval performance the most. On the other hand, using the meta- keywords degraded the retrieval performance. For the test set released in 2013, using our hyperlinked-extended approach achieved the best improvement over the PRF baseline, 0.5546 P@30 compared to 0.5322 and 0.3584 MAP compared to 0.3492. For the test sets released in 2011 and 2012 we got less improvements over PRF, 0.5000, 0.4339 P@30 compared to 0.4864, 0.4203, and 0.4587, 0.3044 MAP compared to 0.4405, 0.2850. We showed that this behavior was due to the age of the collection, where a lot of hyperlinked documents were taken down or moved and we couldn\u27t get their information. Our best results achieved using hyperlink-extended PRF achieved statistically significant improvements over the traditional PRF for the test sets released in 2011, and 2013 using paired t-test with p-value \u3c 0.05. Moreover, our proposed approach outperformed the best results reported at TREC microblog track for the years 2011, and 2013, which applied more sophisticated algorithms. Our proposed approach achieved 0.5000, 0.5546 P@30 compared to 0.4551, 0.5528 achieved by the best runs in TREC, and 0.4587, 0.3584 MAP compared to 0.3350, 0.3524 for the evaluation sets of 2011 and 2013 respectively. The main contributions of our work can be listed as follows: 1. Providing a comprehensive study for the usage of traditional PRF with microblog retrieval using various configurations. 2. Introducing a hyperlink-based PRF approach for microblog retrieval by utilizing hyperlinks embedded in initially retrieved tweets, which showed a significant improvement to retrieval effectiveness

    Microblogging Temporal Summarization: Filtering Important Twitter Updates for Breaking News

    Get PDF
    While news stories are an important traditional medium to broadcast and consume news, microblogging has recently emerged as a place where people can dis- cuss, disseminate, collect or report information about news. However, the massive information in the microblogosphere makes it hard for readers to keep up with these real-time updates. This is especially a problem when it comes to breaking news, where people are more eager to know “what is happening”. Therefore, this dis- sertation is intended as an exploratory effort to investigate computational methods to augment human effort when monitoring the development of breaking news on a given topic from a microblog stream by extractively summarizing the updates in a timely manner. More specifically, given an interest in a topic, either entered as a query or presented as an initial news report, a microblog temporal summarization system is proposed to filter microblog posts from a stream with three primary concerns: topical relevance, novelty, and salience. Considering the relatively high arrival rate of microblog streams, a cascade framework consisting of three stages is proposed to progressively reduce quantity of posts. For each step in the cascade, this dissertation studies methods that improve over current baselines. In the relevance filtering stage, query and document expansion techniques are applied to mitigate sparsity and vocabulary mismatch issues. The use of word embedding as a basis for filtering is also explored, using unsupervised and supervised modeling to characterize lexical and semantic similarity. In the novelty filtering stage, several statistical ways of characterizing novelty are investigated and ensemble learning techniques are used to integrate results from these diverse techniques. These results are compared with a baseline clustering approach using both standard and delay-discounted measures. In the salience filtering stage, because of the real-time prediction requirement a method of learning verb phrase usage from past relevant news reports is used in conjunction with some standard measures for characterizing writing quality. Following a Cranfield-like evaluation paradigm, this dissertation includes a se- ries of experiments to evaluate the proposed methods for each step, and for the end- to-end system. New microblog novelty and salience judgments are created, building on existing relevance judgments from the TREC Microblog track. The results point to future research directions at the intersection of social media, computational jour- nalism, information retrieval, automatic summarization, and machine learning
    corecore