Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings

Authors: Anastasiou Lucas; Gyawali Bikash; Knoth Petr
Publication venue: European Language Resources Association
Publication date: 11/05/2020

Deduplication is the task of identifying near and exact duplicate data items in a collection. In this paper, we present a novel method for deduplication of scholarly documents. We develop a hybrid model which uses structural similarity (locality sensitive hashing) and meaning representation (word embeddings) of document texts to determine (near) duplicates. Our collection constitutes a subset of multidisciplinary scholarly documents aggregated from research repositories. We identify several issues causing data inaccuracies in such collections and motivate the need for deduplication. In the absence of an existing dataset suitable for studying deduplication of scholarly documents, we create a ground truth dataset of 100K scholarly documents and conduct a series of experiments to empirically establish optimal values for the parameters of our deduplication method. Experimental evaluation shows that our method achieves a macro F1-score of 0.90. We productionise our method as a publicly accessible web API service serving deduplication of scholarly documents in real time.
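A minimal sketch of the hybrid idea described in this abstract, not the authors' implementation. It pairs a structural signal (MinHash signatures over word shingles, the building block of a common locality sensitive hashing scheme) with a meaning signal. Everything here is an assumption made for illustration: the function names, the shingle size, both thresholds, the rule that both signals must agree, and the bag-of-words vector that stands in for real word embeddings. A full LSH index would also bucket signatures by band to avoid comparing every pair; that step is omitted.

```python
import hashlib
import math
import re
from collections import Counter


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed shingle size)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_perm=64):
    """For each of num_perm salted hashes, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_perm):
        salt = str(seed).encode()
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def text_vector(text):
    """Stand-in for an embedding: a sparse term-count vector (assumption)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def is_near_duplicate(a, b, structural_threshold=0.5, semantic_threshold=0.8):
    """Flag a pair only when both signals agree (hypothetical decision rule)."""
    structural = estimated_jaccard(minhash_signature(shingles(a)),
                                   minhash_signature(shingles(b)))
    semantic = cosine(text_vector(a), text_vector(b))
    return structural >= structural_threshold and semantic >= semantic_threshold
```

In practice the thresholds would be tuned on labelled pairs, in the spirit of the parameter experiments the abstract mentions.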

Open Research Online (The Open University)

An Approach of Semantic Similarity Measure between Documents Based on Big Data

Authors: Beni-Hssane Abderrahim; Birjali Marouane; Erritali Mohammed; Madani Youness
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/10/2016

Semantic indexing and document similarity are important information retrieval problems in Big Data with broad applications. In this paper, we investigate the MapReduce programming model as a specific framework for managing distributed processing of a large number of documents. Then we study the state of the art of different approaches for computing the similarity of documents. Finally, we propose our approach to semantic similarity measurement, using WordNet as an external semantic network resource. For evaluation, we compare the proposed approach with previously presented approaches using our new MapReduce algorithm. Experimental results reveal that our proposed approach outperforms the state-of-the-art ones in running time and improves the measurement of semantic similarity.
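The abstract does not spell out the algorithm, so the following is only a single-process imitation of the map, shuffle and reduce pattern applied to document similarity, not the authors' method. The `CONCEPTS` table is a hypothetical toy stand-in for WordNet lookups, and all function names are illustrative. Tokens are mapped to shared concepts, grouped by concept, reduced to per-document concept counts, and then compared with cosine similarity.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical toy synonym table standing in for WordNet synsets.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "quick": "fast", "rapid": "fast",
}


def map_phase(doc_id, text):
    """Map: emit one (concept, doc_id) pair per token."""
    for token in text.lower().split():
        yield CONCEPTS.get(token, token), doc_id


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: count occurrences of each concept per document."""
    vectors = defaultdict(dict)
    for concept, doc_ids in groups.items():
        for doc_id in doc_ids:
            vectors[doc_id][concept] = vectors[doc_id].get(concept, 0) + 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def pairwise_similarity(docs):
    """Run map, shuffle and reduce, then score every pair of documents."""
    pairs = (kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text))
    vectors = reduce_phase(shuffle(pairs))
    return {(a, b): cosine(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)}
```

Because "quick car" and "rapid automobile" map to the same concepts, the pair scores close to 1 even though the two texts share no surface words, which is the effect a semantic resource is meant to provide.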

IAES journal

Crossref

Institute of Advanced Engineering and Science

Soft Seeded SSL Graphs for Unsupervised Semantic Similarity-based Retrieval

Authors: Datt Madhav; Srivastava Avikalp
Publication venue: Association for Computing Machinery (ACM)
Publication date: 15/12/2017

Semantic similarity based retrieval is playing an increasingly important role in many IR systems such as modern web search, question-answering, and similar document retrieval. Improvements in retrieval of semantically similar content are very significant to applications like Quora, Stack Overflow, and Siri. We propose a novel unsupervised model for semantic similarity based content retrieval, where we construct semantic flow graphs for each query, and introduce the concept of "soft seeding" in graph based semi-supervised learning (SSL) to convert this into an unsupervised model. We demonstrate the effectiveness of our model on an equivalent question retrieval problem on the Stack Exchange QA dataset, where our unsupervised approach significantly outperforms the state-of-the-art unsupervised models, and produces comparable results to the best supervised models. Our research provides a method to tackle semantic similarity based retrieval without any training data, and allows seamless extension to different domain QA communities, as well as to other semantic equivalence tasks. Comment: Published in Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM '17).

arXiv.org e-Print Archive

Crossref