Distributed web-scale infrastructure for crawling, indexing and search with semantic support

Dlugolinsky, S.; Hluchy, L.; Laclavik, M.; Seleng, M.

Authors: S. Dlugolinsky, L. Hluchy, M. Laclavik, M. Seleng
Publication date: 1 January 2012
Publisher: Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie. Wydawnictwo AGH

Abstract

In this paper, we describe our work in progress on web-scale information extraction and information retrieval using distributed computing. We present a distributed architecture built on top of the MapReduce paradigm for information retrieval, information processing and intelligent search supported by spatial capabilities. The proposed architecture focuses on crawling documents in several different formats, information extraction, lightweight semantic annotation of the extracted information, indexing of the extracted information and, finally, indexing of documents based on the geo-spatial information found in a document. We demonstrate the architecture on two use cases: the first is search in job offers retrieved from the LinkedIn portal, and the second is search in BBC news feeds. We discuss several problems we faced during the implementation. We also discuss spatial search applications for both cases, because both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process.

Available Versions

Biblioteka Nauki - repozytorium artykułów

oai:bibliotekanauki.pl:305377

Last time updated on 20/05/2022
