Multi-Perspective Relevance Matching with Hierarchical ConvNets for Social Media Search
Despite substantial interest in applications of neural networks to
information retrieval, neural ranking models have only been applied to standard
ad hoc retrieval tasks over web pages and newswire documents. This paper
proposes MP-HCNN (Multi-Perspective Hierarchical Convolutional Neural Network),
a novel neural ranking model specifically designed for ranking short social
media posts. We identify document length, informal language, and heterogeneous
relevance signals as features that distinguish documents in our domain, and
present a model specifically designed with these characteristics in mind. Our
model uses hierarchical convolutional layers to learn latent semantic
soft-match relevance signals at the character, word, and phrase levels. A
pooling-based similarity measurement layer integrates evidence from multiple
types of matches between the query, the social media post, and URLs
contained in the post. Extensive experiments using Twitter data from the TREC
Microblog Tracks 2011--2014 show that our model significantly outperforms prior
feature-based as well as existing neural ranking models. To the best of our
knowledge, this paper presents the first substantial work tackling search over
social media posts using neural ranking models.
Comment: AAAI 2019, 10 pages
BERT-Embedding and Citation Network Analysis based Query Expansion Technique for Scholarly Search
The enormous growth of research publications has made it challenging for
academic search engines to retrieve the most relevant papers for a given
search query. Numerous solutions have been proposed over the years to improve
the effectiveness of academic search, including exploiting query expansion and
citation analysis. Query expansion techniques mitigate the mismatch between the
language used in a query and indexed documents. However, these techniques can
suffer from introducing non-relevant information while expanding the original
query. Recently, applying the contextualized model BERT to document retrieval
has proven quite successful for query expansion. Motivated by such issues and inspired by the
success of BERT, this paper proposes a novel approach called QeBERT. QeBERT
exploits BERT-based embedding and Citation Network Analysis (CNA) in query
expansion for improving scholarly search. Specifically, we use the
context-aware BERT-embedding and CNA for query expansion in Pseudo-Relevance
Feedback (PRF) fashion. Initial experimental results on the ACL dataset show
that BERT-embedding can provide a valuable augmentation to query expansion and
improve search relevance when combined with CNA.
Comment: 1
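The embedding-based expansion step described above can be illustrated with a minimal sketch. It is not the QeBERT implementation: the two-dimensional vectors in TOY_EMBEDDINGS stand in for BERT-derived embeddings, the function names are hypothetical, and the citation network analysis component is left out. Candidate terms come from pseudo-relevant feedback documents and are ranked by cosine similarity to the centroid of the query terms.

```python
import math

# Hypothetical 2-d vectors standing in for BERT-derived term embeddings.
TOY_EMBEDDINGS = {
    "neural": (0.9, 0.1),
    "ranking": (0.8, 0.3),
    "retrieval": (0.7, 0.4),
    "search": (0.6, 0.5),
    "cooking": (0.0, 1.0),
    "recipe": (0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(terms, emb):
    """Mean vector of the terms that have an embedding, or None if none do."""
    vecs = [emb[t] for t in terms if t in emb]
    if not vecs:
        return None
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0])))


def expand_query(query_terms, feedback_docs, emb, n_terms=2):
    """Append the n_terms feedback terms closest to the query centroid."""
    q_vec = centroid(query_terms, emb)
    if q_vec is None:
        return list(query_terms)
    candidates = {
        t for doc in feedback_docs for t in doc
        if t in emb and t not in query_terms
    }
    # Sort by descending similarity; break ties alphabetically for determinism.
    ranked = sorted(candidates, key=lambda t: (-cosine(q_vec, emb[t]), t))
    return list(query_terms) + ranked[:n_terms]


# Feedback documents are assumed to be the top results of an initial retrieval run.
feedback = [["neural", "retrieval", "cooking"], ["search", "ranking", "recipe"]]
expanded = expand_query(["neural", "ranking"], feedback, TOY_EMBEDDINGS)
```

In this toy setup the off-topic feedback terms ("cooking", "recipe") score low against the query centroid, which is the kind of filtering of non-relevant expansion terms the abstract motivates.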
ON RELEVANCE FILTERING FOR REAL-TIME TWEET SUMMARIZATION
Real-time tweet summarization systems (RTS) require mechanisms for capturing relevant tweets, identifying novel tweets, and capturing timely tweets. In this thesis, we tackle the RTS problem with a main focus on relevance filtering. We experimented with different traditional retrieval models.
Additionally, we propose two extensions to alleviate the sparsity and topic drift challenges that affect relevance filtering. For sparsity, we propose leveraging word embeddings in Vector Space Model (VSM) term weighting so that the system can use semantic similarity alongside lexical matching. To mitigate the effect of topic drift, we exploit explicit relevance feedback to enhance the profile representation to cope with its development in the stream over time.
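One way to picture semantic similarity alongside lexical matching in VSM scoring is the sketch below. It is an assumption-laden illustration, not the thesis's weighting scheme: the toy vectors, the 0.8 similarity threshold, the max-over-tweet-terms aggregation, and the function names are all hypothetical.

```python
import math

# Hypothetical 2-d word vectors; a real system would load trained embeddings.
TOY_EMBEDDINGS = {
    "quake": (0.9, 0.2),
    "earthquake": (0.95, 0.1),
    "football": (0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_match(profile_term, tweet_term, emb, threshold=0.8):
    """1.0 for an exact lexical match, else the cosine if it clears the threshold."""
    if profile_term == tweet_term:
        return 1.0
    if profile_term in emb and tweet_term in emb:
        sim = cosine(emb[profile_term], emb[tweet_term])
        return sim if sim >= threshold else 0.0
    return 0.0


def relevance(profile_terms, tweet_terms, idf, emb):
    """Sum, over profile terms, of IDF weight times the best match in the tweet."""
    score = 0.0
    for p in profile_terms:
        best = max((soft_match(p, t, emb) for t in tweet_terms), default=0.0)
        score += idf.get(p, 1.0) * best  # unseen terms default to weight 1.0
    return score


idf = {"earthquake": 2.0}  # assumed IDF weight for illustration
related = relevance(["earthquake"], ["quake", "today"], idf, TOY_EMBEDDINGS)
unrelated = relevance(["earthquake"], ["football", "today"], idf, TOY_EMBEDDINGS)
```

A purely lexical VSM would score both tweets zero for the profile term "earthquake"; with the soft match, the tweet containing "quake" receives credit while the unrelated one does not.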
We conducted extensive experiments over three standard English TREC test collections that were built specifically for RTS. Although the extensions do not generally exhibit better performance, they are comparable to the baselines used.
Moreover, we extended an Arabic tweet test collection for event detection, called EveTAR, to support tasks that require novelty in the system's output. We collected novelty judgments using in-house annotators and used the collection to test our RTS system. We report preliminary results on EveTAR using different models of the RTS system.
This work was made possible by NPRP grants # NPRP 7-1313-1-245 and # NPRP 7-1330-2-483 from the Qatar National Research Fund (a member of Qatar Foundation).
End-to-end Neural Information Retrieval
In recent years we have witnessed many successes of neural networks in the information
retrieval community with lots of labeled data. Yet it remains unknown whether the same
techniques can be easily adapted to search social media posts where the text is much
shorter. In addition, we find that most neural information retrieval models are compared
against weak baselines. In this thesis, we build an end-to-end neural information retrieval
system using two toolkits: Anserini and MatchZoo. In addition, we also propose a novel
neural model to capture the relevance of short and varied tweet text, named MP-HCNN.
With the information retrieval toolkit Anserini, we build a reranking architecture based
on various traditional information retrieval models (QL, QL+RM3, BM25, BM25+RM3),
including a strong pseudo-relevance feedback baseline: RM3. With the neural network
toolkit MatchZoo, we offer an empirical study of a number of popular neural network
ranking models (DSSM, CDSSM, KNRM, DUET, DRMM). Experiments on datasets from
the TREC Microblog Tracks and the TREC Robust Retrieval Track show that most
existing neural network models cannot beat a simple language model baseline.
However, DRMM provides a significant improvement over the pseudo-relevance feedback baseline
(BM25+RM3) on the Robust04 dataset and DUET, DRMM and MP-HCNN can provide
significant improvements over the baseline (QL+RM3) on the microblog datasets. Further
detailed analyses suggest that searching social media and searching news articles exhibit
several different characteristics that require customized model design, shedding light on
future directions.
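The reranking architecture described above (a traditional first-stage ranker followed by a neural reranker) can be sketched in a few lines. This is a minimal illustration that does not use the Anserini or MatchZoo APIs: bm25_scores is a plain BM25 implementation with assumed parameter values, and the reranker argument is a hypothetical stand-in for any neural scoring function.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document against the query with BM25.

    k1 and b are assumed values chosen for illustration.
    """
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
        scores.append(score)
    return scores


def retrieve_then_rerank(query, docs, reranker, k=2):
    """Keep the top-k documents by BM25, then reorder them by the reranker score."""
    first_stage = bm25_scores(query, docs)
    top_k = sorted(range(len(docs)), key=lambda i: -first_stage[i])[:k]
    return sorted(top_k, key=lambda i: -reranker(query, docs[i]))


docs = [
    ["neural", "ranking", "tweets"],
    ["neural", "neural", "ranking", "models", "for", "tweets"],
    ["cooking", "recipes"],
]
query = ["neural", "ranking"]
# Hypothetical reranker that simply prefers shorter documents.
order = retrieve_then_rerank(query, docs, lambda q, d: -len(d))
```

The function returns document indices, so the first stage bounds which documents the (usually more expensive) reranker ever sees; a weak first stage therefore caps the quality of the final ranking.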
Social Media Text Processing and Semantic Analysis for Smart Cities
With the rise of Social Media, people obtain and share information almost
instantly on a 24/7 basis. Many research areas have tried to gain valuable
insights from these large volumes of freely available user generated content.
With the goal of extracting knowledge from social media streams that might be
useful in the context of intelligent transportation systems and smart cities,
we designed and developed a framework that provides functionalities for
parallel collection of geo-located tweets from multiple pre-defined bounding
boxes (cities or regions), including filtering of non-complying tweets, text
pre-processing for the Portuguese and English languages, topic modeling, and
transportation-specific text classifiers, as well as aggregation and data
visualization.
We performed an exploratory data analysis of geo-located tweets in 5
different cities: Rio de Janeiro, São Paulo, New York City, London and
Melbourne, comprising a total of more than 43 million tweets in a period of 3
months. Furthermore, we performed a large scale topic modelling comparison
between Rio de Janeiro and São Paulo. Interestingly, most of the topics are
shared between both cities, which, despite being in the same country, are
considered very different regarding population, economy and lifestyle.
We take advantage of recent developments in word embeddings and train such
representations from the collections of geo-located tweets. We then use a
combination of bag-of-embeddings and traditional bag-of-words to train
travel-related classifiers in both Portuguese and English to separate
travel-related content from unrelated content. We created specific gold-standard data
to perform empirical evaluation of the resulting classifiers. Results are in
line with research work in other application areas by showing the robustness of
using word embeddings to learn word similarities that bag-of-words is not able
to capture.
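The combination of bag-of-embeddings and bag-of-words features mentioned above might look like the following sketch. It assumes that "bag-of-embeddings" means the average of the word vectors in a tweet, which the abstract does not specify; the vocabulary, the toy vectors, and the function names are hypothetical.

```python
from collections import Counter


def bow_vector(tokens, vocab):
    """Term counts over a fixed, ordered vocabulary."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]


def boe_vector(tokens, emb, dim):
    """Average of the embeddings of known tokens (zeros if none are known)."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def features(tokens, vocab, emb, dim):
    """Concatenate bag-of-words counts with the averaged embedding."""
    return bow_vector(tokens, vocab) + boe_vector(tokens, emb, dim)


vocab = ["bus", "late"]                             # assumed toy vocabulary
emb = {"bus": (1.0, 0.0), "metro": (0.0, 1.0)}      # assumed toy embeddings
x = features(["bus", "metro", "late"], vocab, emb, dim=2)
```

The resulting vector can be fed to any standard classifier. The embedding half lets a word outside the bag-of-words vocabulary, such as "metro" here, still contribute to the representation.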