The iCrawl Wizard -- Supporting Interactive Focused Crawl Specification
Collections of Web documents about specific topics are needed for many areas
of current research. Focused crawling enables the creation of such collections
on demand. Current focused crawlers require the user to manually specify
starting points for the crawl (seed URLs). These are also used to describe the
expected topic of the collection. The choice of seed URLs influences the
quality of the resulting collection and requires a lot of expertise. In this
demonstration we present the iCrawl Wizard, a tool that assists users in
defining focused crawls efficiently and semi-automatically. Our tool uses major
search engines and Social Media APIs as well as information extraction
techniques to find seed URLs and a semantic description of the crawl intent.
Using the iCrawl Wizard, even non-expert users can create semantic
specifications for focused crawlers interactively and efficiently.
Comment: Published in the Proceedings of the European Conference on
Information Retrieval (ECIR) 201
Towards Better Understanding Researcher Strategies in Cross-Lingual Event Analytics
With an increasing amount of information on globally important events, there
is a growing demand for efficient analytics of multilingual event-centric
information. Such analytics is particularly challenging due to the large amount
of content, the event dynamics and the language barrier. Although memory
institutions increasingly collect event-centric Web content in different
languages, very little is known about the strategies of researchers who conduct
analytics of such content. In this paper we present researchers' strategies for
the content, method and feature selection in the context of cross-lingual
event-centric analytics observed in two case studies on multilingual Wikipedia.
We discuss the influence factors for these strategies, the findings enabled by
the adopted methods along with the current limitations and provide
recommendations for services supporting researchers in cross-lingual
event-centric analytics.
Comment: In Proceedings of the International Conference on Theory and Practice
of Digital Libraries 201
MultiWiki: interlingual text passage alignment in Wikipedia
In this article we address the problem of text passage alignment across interlingual article pairs in Wikipedia. We develop methods that enable the identification and interlinking of text passages written in different languages and containing overlapping information. Interlingual text passage alignment can enable Wikipedia editors and readers to better understand the language-specific context of entities, provide valuable insights into cultural differences and build a basis for qualitative analysis of the articles. An important challenge in this context is the trade-off between the granularity of the extracted text passages and the precision of the alignment. Whereas short text passages can result in more precise alignment, longer text passages can facilitate a better overview of the differences in an article pair. To better understand these aspects from the user perspective, we conduct a user study using the German, Russian and English Wikipedia as an example and collect a user-annotated benchmark. Then we propose MultiWiki, a method that adopts an integrated approach to text passage alignment using semantic similarity measures and greedy algorithms and achieves precise results with respect to the user-defined alignment. The MultiWiki demonstration is publicly available and currently supports four language pairs.