331 research outputs found
An in-Browser Microblog Ranking Engine
International audienceMicroblogs, although extremely peculiar pieces of data, constitute a very rich source of information, which has been widely exploited recently, thanks to the liberal access Twitter offers through its API. Nevertheless, computing relevant answers to general queries is still a very challenging task. We propose a new engine, the Twittering Machine, which evaluates SQL like queries on streams of tweets, using ranking techniques computed at query time. Our algorithm is real time, it produces streams of results which are refined progressively, adaptive, the queries continuously adapt to new trends, invasive, it interacts with Twitter by suggesting relevant users to follow, and query results to publish as tweets. Moreover it works in a decentralized environment, directly in the browser on the client side, making it easy to use, and server independent
Interactive Search and Exploration in Online Discussion Forums Using Multimodal Embeddings
In this paper we present a novel interactive multimodal learning system,
which facilitates search and exploration in large networks of social multimedia
users. It allows the analyst to identify and select users of interest, and to
find similar users in an interactive learning setting. Our approach is based on
novel multimodal representations of users, words and concepts, which we
simultaneously learn by deploying a general-purpose neural embedding model. We
show these representations to be useful not only for categorizing users, but
also for automatically generating user and community profiles. Inspired by
traditional summarization approaches, we create the profiles by selecting
diverse and representative content from all available modalities, i.e. the
text, image and user modality. The usefulness of the approach is evaluated
using artificial actors, which simulate user behavior in a relevance feedback
scenario. Multiple experiments were conducted in order to evaluate the quality
of our multimodal representations, to compare different embedding strategies,
and to determine the importance of different modalities. We demonstrate the
capabilities of the proposed approach on two different multimedia collections
originating from the violent online extremism forum Stormfront and the
microblogging platform Twitter, which are particularly interesting due to the
high semantic level of the discussions they feature
Re-ranking Real-time Web Tweets to Find Reliable and Influential Twitterers
Twitter is a powerful social media tool to share information on different topics around the world. Following different users/accounts is the most effective way to get information propagated in Twitter. Due to Twitter's limited searching and lack of navigation support, searching Twitter is not easy and requires effort to find reliable information. This thesis proposed a new methodology to rank tweets based on their authority with the goal of aiding users identifying influential Twitterers. This methodology, HIRKM rank, is influenced by PageRank, Alexa Rank, original tweet or a retweet and the use of hash tags to determine the authorisation of each tweet. This method is applied to rank TREC 2011 microblogging dataset which contains over 16 million tweets based on 50 predefined topics. The results are a list of tweets presented in a descending order based on their authorities which are relevant to the users search queries and will be evaluated using TREC’s official golden standard for the microblogging dataset
Hyperlink-extended pseudo relevance feedback for improved microblog retrieval
Microblog retrieval has received much attention in recent years due to the wide spread of social microblogging platforms such as Twitter. The main motive behind microblog retrieval is to serve users searching a big collection of microblogs a list of relevant documents (microblogs) matching their search needs. What makes microblog retrieval different from normal web retrieval is the short length of the user queries and the documents that you search in, which leads to a big vocabulary mismatch problem. Many research studies investigated different approaches for microblog retrieval. Query expansion is one of the approaches that showed stable performance for improving microblog retrieval effectiveness. Query expansion is used mainly to overcome the vocabulary mismatch problem between user queries and short relevant documents. In our work, we investigate existing query expansion method (Pseudo Relevance Feedback - PRF) comprehensively, and propose an extension using the information from hyperlinks attached to the top relevant documents. Our experimental results on TREC microblog data showed that Pseudo Relevance Feedback (PRF) alone could outperform many retrieval approaches if configured properly. We showed that combining the expansion terms with the original query by a weight, not to dilute the effect of the original query, could lead to superior results. The weighted combine of the expansion terms is different than what is commonly used in the literature by appending the expansion terms to the original query without weighting. We experimented using different weighting schemes, and empirically found that assigning a small weight for the expansion terms 0.2, and 0.8 for the original query performs the best for the three evaluation sets 2011, 2012, and 2013. We applied the previous weighting scheme to the most reported PRF configuration used in the literature and measured the retrieval performance. The P@30 performance achieved using our weighting scheme was 0.485, 0.4136, and 0.4811 compared to 0.4585, 0.3548, and 0.3861 without applying weighting for the three evaluation sets 2011, 2012 and 2013 respectively. The MAP performance achieved using our weighting scheme was 0.4386, 0.2845, and 0.3262 compared to 0.3592, 0.2074, and 0.2256 without applying weighting for the three evaluation sets 2011, 2012 and 2013 respectively. Results also showed that utilizing hyperlinked documents attached to the top relevant tweets in query expansion improves the results over traditional PRF. By utilizing hyperlinked documents in the query expansion our best runs achieved 0.5000, 0.4339, and 0.5546 P@30 compared to 0.4864, 0.4203, and 0.5322 when applying traditional PRF, and 0.4587, 0.3044, and 0.3584 MAP when applying traditional PRF compared to 0.4405, 0.2850, and 0.3492 when utilizing the hyperlinked document contents (using web page titles, and meta-descriptions) for the three evaluation sets 2011, 2012 and 2013 respectively. We explored different types of information extracted from the hyperlinked documents; we show that using the document titles and meta-descriptions helps in improving the retrieval performance the most. On the other hand, using the meta- keywords degraded the retrieval performance. For the test set released in 2013, using our hyperlinked-extended approach achieved the best improvement over the PRF baseline, 0.5546 P@30 compared to 0.5322 and 0.3584 MAP compared to 0.3492. For the test sets released in 2011 and 2012 we got less improvements over PRF, 0.5000, 0.4339 P@30 compared to 0.4864, 0.4203, and 0.4587, 0.3044 MAP compared to 0.4405, 0.2850. We showed that this behavior was due to the age of the collection, where a lot of hyperlinked documents were taken down or moved and we couldn\u27t get their information. Our best results achieved using hyperlink-extended PRF achieved statistically significant improvements over the traditional PRF for the test sets released in 2011, and 2013 using paired t-test with p-value \u3c 0.05. Moreover, our proposed approach outperformed the best results reported at TREC microblog track for the years 2011, and 2013, which applied more sophisticated algorithms. Our proposed approach achieved 0.5000, 0.5546 P@30 compared to 0.4551, 0.5528 achieved by the best runs in TREC, and 0.4587, 0.3584 MAP compared to 0.3350, 0.3524 for the evaluation sets of 2011 and 2013 respectively. The main contributions of our work can be listed as follows: 1. Providing a comprehensive study for the usage of traditional PRF with microblog retrieval using various configurations. 2. Introducing a hyperlink-based PRF approach for microblog retrieval by utilizing hyperlinks embedded in initially retrieved tweets, which showed a significant improvement to retrieval effectiveness
Building a Self-Contained Search Engine in the Browser
JavaScript engines inside modern web browsers are capa-ble of running sophisticated multi-player games, rendering impressive 3D scenes, and supporting complex, interactive visualizations. Can this processing power be harnessed for information retrieval? This paper explores the feasibility of building a JavaScript search engine that runs completely self-contained on the client side within the browser—this in-cludes building the inverted index, gathering terms statistics for scoring, and performing query evaluation. The design takes advantage of the IndexDB API, which is implemented by the LevelDB key–value store inside Google’s Chrome browser. Experiments show that although the performance of the JavaScript prototype falls far short of the open-source Lucene search engine, it is sufficiently responsive for interac-tive applications. This feasibility demonstration opens the door to interesting applications and architectures
Exploratory Analysis of Highly Heterogeneous Document Collections
We present an effective multifaceted system for exploratory analysis of
highly heterogeneous document collections. Our system is based on intelligently
tagging individual documents in a purely automated fashion and exploiting these
tags in a powerful faceted browsing framework. Tagging strategies employed
include both unsupervised and supervised approaches based on machine learning
and natural language processing. As one of our key tagging strategies, we
introduce the KERA algorithm (Keyword Extraction for Reports and Articles).
KERA extracts topic-representative terms from individual documents in a purely
unsupervised fashion and is revealed to be significantly more effective than
state-of-the-art methods. Finally, we evaluate our system in its ability to
help users locate documents pertaining to military critical technologies buried
deep in a large heterogeneous sea of information.Comment: 9 pages; KDD 2013: 19th ACM SIGKDD Conference on Knowledge Discovery
and Data Minin
Evaluation Measures for Relevance and Credibility in Ranked Lists
Recent discussions on alternative facts, fake news, and post truth politics
have motivated research on creating technologies that allow people not only to
access information, but also to assess the credibility of the information
presented to them by information retrieval systems. Whereas technology is in
place for filtering information according to relevance and/or credibility, no
single measure currently exists for evaluating the accuracy or precision (and
more generally effectiveness) of both the relevance and the credibility of
retrieved results. One obvious way of doing so is to measure relevance and
credibility effectiveness separately, and then consolidate the two measures
into one. There at least two problems with such an approach: (I) it is not
certain that the same criteria are applied to the evaluation of both relevance
and credibility (and applying different criteria introduces bias to the
evaluation); (II) many more and richer measures exist for assessing relevance
effectiveness than for assessing credibility effectiveness (hence risking
further bias).
Motivated by the above, we present two novel types of evaluation measures
that are designed to measure the effectiveness of both relevance and
credibility in ranked lists of retrieval results. Experimental evaluation on a
small human-annotated dataset (that we make freely available to the research
community) shows that our measures are expressive and intuitive in their
interpretation
- …