398 research outputs found

    Query-related data extraction of hidden web documents

    Get PDF
    The larger amount of information on the Web is stored in document databases and is not indexed by general-purpose search engines (i.e., Google and Yahoo). Such information is dynamically generated through querying databases — which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using templategenerated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision

    Application and evaluation of multi-dimensional diversity

    Get PDF
    Traditional information retrieval (IR) systems mostly focus on finding documents relevant to queries without considering other documents in the search results. This approach works quite well in general cases; however, this also means that the set of returned documents in a result list can be very similar to each other. This can be an undesired system property from a user's perspective. The creation of IR systems that support the search result diversification present many challenges, indeed current evaluation measures and methodologies are still unclear with regards to specific search domains and dimensions of diversity. In this paper, we highlight various issues in relation to image search diversification for the ImageClef 2009 collection and tasks. Furthermore, we discuss the problem of defining clusters/subtopics by mixing diversity dimensions regardless of which dimension is important in relation to information need or circumstances. We also introduce possible applications and evaluation metrics for diversity based retrieval

    Spatio-textual indexing for geographical search on the web

    Get PDF
    Many web documents refer to specific geographic localities and many people include geographic context in queries to web search engines. Standard web search engines treat the geographical terms in the same way as other terms. This can result in failure to find relevant documents that refer to the place of interest using alternative related names, such as those of included or nearby places. This can be overcome by associating text indexing with spatial indexing methods that exploit geo-tagging procedures to categorise documents with respect to geographic space. We describe three methods for spatio-textual indexing based on multiple spatially indexed text indexes, attaching spatial indexes to the document occurrences of a text index, and merging text index access results with results of access to a spatial index of documents. These schemes are compared experimentally with a conventional text index search engine, using a collection of geo-tagged web documents, and are shown to be able to compete in speed and storage performance with pure text indexing

    Evaluation Measures for Relevance and Credibility in Ranked Lists

    Full text link
    Recent discussions on alternative facts, fake news, and post truth politics have motivated research on creating technologies that allow people not only to access information, but also to assess the credibility of the information presented to them by information retrieval systems. Whereas technology is in place for filtering information according to relevance and/or credibility, no single measure currently exists for evaluating the accuracy or precision (and more generally effectiveness) of both the relevance and the credibility of retrieved results. One obvious way of doing so is to measure relevance and credibility effectiveness separately, and then consolidate the two measures into one. There at least two problems with such an approach: (I) it is not certain that the same criteria are applied to the evaluation of both relevance and credibility (and applying different criteria introduces bias to the evaluation); (II) many more and richer measures exist for assessing relevance effectiveness than for assessing credibility effectiveness (hence risking further bias). Motivated by the above, we present two novel types of evaluation measures that are designed to measure the effectiveness of both relevance and credibility in ranked lists of retrieval results. Experimental evaluation on a small human-annotated dataset (that we make freely available to the research community) shows that our measures are expressive and intuitive in their interpretation
    • …
    corecore