491 research outputs found
Query Expansion for Survey Question Retrieval in the Social Sciences
In recent years, the importance of research data and the need to archive and
to share it in the scientific community have increased enormously. This
introduces a whole new set of challenges for digital libraries. In the social
sciences typical research data sets consist of surveys and questionnaires. In
this paper we focus on the use case of social science survey question reuse and
on mechanisms to support users in the query formulation for data sets. We
describe and evaluate thesaurus- and co-occurrence-based approaches for query
expansion to improve retrieval quality in digital libraries and research data
archives. The challenge here is to translate the information need and the
underlying sociological phenomena into proper queries. As we can show retrieval
quality can be improved by adding related terms to the queries. In a direct
comparison automatically expanded queries using extracted co-occurring terms
can provide better results than queries manually reformulated by a domain
expert and better results than a keyword-based BM25 baseline.Comment: to appear in Proceedings of 19th International Conference on Theory
and Practice of Digital Libraries 2015 (TPDL 2015
Duplicate Question Retrieval and Confirmation Time Prediction in Software Communities
Community Question Answering (CQA) in different domains is growing at a large
scale because of the availability of several platforms and huge shareable
information among users. With the rapid growth of such online platforms, a
massive amount of archived data makes it difficult for moderators to retrieve
possible duplicates for a new question and identify and confirm existing
question pairs as duplicates at the right time. This problem is even more
critical in CQAs corresponding to large software systems like askubuntu where
moderators need to be experts to comprehend something as a duplicate. Note that
the prime challenge in such CQA platforms is that the moderators are themselves
experts and are therefore usually extremely busy with their time being
extraordinarily expensive. To facilitate the task of the moderators, in this
work, we have tackled two significant issues for the askubuntu CQA platform:
(1) retrieval of duplicate questions given a new question and (2) duplicate
question confirmation time prediction. In the first task, we focus on
retrieving duplicate questions from a question pool for a particular newly
posted question. In the second task, we solve a regression problem to rank a
pair of questions that could potentially take a long time to get confirmed as
duplicates. For duplicate question retrieval, we propose a Siamese neural
network based approach by exploiting both text and network-based features,
which outperforms several state-of-the-art baseline techniques. Our method
outperforms DupPredictor and DUPE by 5% and 7% respectively. For duplicate
confirmation time prediction, we have used both the standard machine learning
models and neural network along with the text and graph-based features. We
obtain Spearman's rank correlation of 0.20 and 0.213 (statistically
significant) for text and graph based features respectively.Comment: Full paper accepted at ASONAM 2023: The 2023 IEEE/ACM International
Conference on Advances in Social Networks Analysis and Minin
Soft Seeded SSL Graphs for Unsupervised Semantic Similarity-based Retrieval
Semantic similarity based retrieval is playing an increasingly important role
in many IR systems such as modern web search, question-answering, similar
document retrieval etc. Improvements in retrieval of semantically similar
content are very significant to applications like Quora, Stack Overflow, Siri
etc. We propose a novel unsupervised model for semantic similarity based
content retrieval, where we construct semantic flow graphs for each query, and
introduce the concept of "soft seeding" in graph based semi-supervised learning
(SSL) to convert this into an unsupervised model.
We demonstrate the effectiveness of our model on an equivalent question
retrieval problem on the Stack Exchange QA dataset, where our unsupervised
approach significantly outperforms the state-of-the-art unsupervised models,
and produces comparable results to the best supervised models. Our research
provides a method to tackle semantic similarity based retrieval without any
training data, and allows seamless extension to different domain QA
communities, as well as to other semantic equivalence tasks.Comment: Published in Proceedings of the 2017 ACM Conference on Information
and Knowledge Management (CIKM '17
- …