58,634 research outputs found

    Selective web information retrieval

    Get PDF
    This thesis proposes selective Web information retrieval, a framework formulated in terms of statistical decision theory, with the aim to apply an appropriate retrieval approach on a per-query basis. The main component of the framework is a decision mechanism that selects an appropriate retrieval approach on a per-query basis. The selection of a particular retrieval approach is based on the outcome of an experiment, which is performed before the final ranking of the retrieved documents. The experiment is a process that extracts features from a sample of the set of retrieved documents. This thesis investigates three broad types of experiments. The first one counts the occurrences of query terms in the retrieved documents, indicating the extent to which the query topic is covered in the document collection. The second type of experiments considers information from the distribution of retrieved documents in larger aggregates of related Web documents, such as whole Web sites, or directories within Web sites. The third type of experiments estimates the usefulness of the hyperlink structure among a sample of the set of retrieved Web documents. The proposed experiments are evaluated in the context of both informational and navigational search tasks with an optimal Bayesian decision mechanism, where it is assumed that relevance information exists. This thesis further investigates the implications of applying selective Web information retrieval in an operational setting, where the tuning of a decision mechanism is based on limited existing relevance information and the information retrieval system’s input is a stream of queries related to mixed informational and navigational search tasks. First, the experiments are evaluated using different training and testing query sets, as well as a mixture of different types of queries. Second, query sampling is introduced, in order to approximate the queries that a retrieval system receives, and to tune an ad-hoc decision mechanism with a broad set of automatically sampled queries

    Development of Digital Repository and Retrieval System for Rose Germplasm Management

    Get PDF
    Live repository of rose consisting of different genotypes and species of roses available across the globe has been established at ICAR-IIHR. All these genotypes have been characterized for 60 morphological characters for description of these varieties. Along with the live repository of plants, efforts have been made to develop digital repository of all these genotypes. The digital repository consists of description of characters, quantitative measurement for selected important characters and images for all the descriptors. A web-enabled interface has been developed for the selective retrieval of accessions with desired characters, and also for retrieval of all the information for the selected genotype. The information system will be useful across the germplasm collection centers, for the breeders and other end users by enabling them to select the appropriate germplasm andavoid duplicates

    Efficient & Effective Selective Query Rewriting with Efficiency Predictions

    Get PDF
    To enhance effectiveness, a user's query can be rewritten internally by the search engine in many ways, for example by applying proximity, or by expanding the query with related terms. However, approaches that benefit effectiveness often have a negative impact on efficiency, which has impacts upon the user satisfaction, if the query is excessively slow. In this paper, we propose a novel framework for using the predicted execution time of various query rewritings to select between alternatives on a per-query basis, in a manner that ensures both effectiveness and efficiency. In particular, we propose the prediction of the execution time of ephemeral (e.g., proximity) posting lists generated from uni-gram inverted index posting lists, which are used in establishing the permissible query rewriting alternatives that may execute in the allowed time. Experiments examining both the effectiveness and efficiency of the proposed approach demonstrate that a 49% decrease in mean response time (and 62% decrease in 95th-percentile response time) can be attained without significantly hindering the effectiveness of the search engine

    Adversarial Sampling and Training for Semi-Supervised Information Retrieval

    Full text link
    Ad-hoc retrieval models with implicit feedback often have problems, e.g., the imbalanced classes in the data set. Too few clicked documents may hurt generalization ability of the models, whereas too many non-clicked documents may harm effectiveness of the models and efficiency of training. In addition, recent neural network-based models are vulnerable to adversarial examples due to the linear nature in them. To solve the problems at the same time, we propose an adversarial sampling and training framework to learn ad-hoc retrieval models with implicit feedback. Our key idea is (i) to augment clicked examples by adversarial training for better generalization and (ii) to obtain very informational non-clicked examples by adversarial sampling and training. Experiments are performed on benchmark data sets for common ad-hoc retrieval tasks such as Web search, item recommendation, and question answering. Experimental results indicate that the proposed approaches significantly outperform strong baselines especially for high-ranked documents, and they outperform IRGAN in NDCG@5 using only 5% of labeled data for the Web search task.Comment: Published in WWW 201

    Content-Aware DataGuides for Indexing Large Collections of XML Documents

    Get PDF
    XML is well-suited for modelling structured data with textual content. However, most indexing approaches perform structure and content matching independently, combining the retrieved path and keyword occurrences in a third step. This paper shows that retrieval in XML documents can be accelerated significantly by processing text and structure simultaneously during all retrieval phases. To this end, the Content-Aware DataGuide (CADG) enhances the wellknown DataGuide with (1) simultaneous keyword and path matching and (2) a precomputed content/structure join. Extensive experiments prove the CADG to be 50-90% faster than the DataGuide for various sorts of query and document, including difficult cases such as poorly structured queries and recursive document paths. A new query classification scheme identifies precise query characteristics with a predominant influence on the performance of the individual indices. The experiments show that the CADG is applicable to many real-world applications, in particular large collections of heterogeneously structured XML documents

    Development of a Novel Media-independent Communication Theology for Accessing Local & Web-based Data: Case Study with Robotic Subsystems

    Get PDF
    Realizing media independence in today’s communication system remains an open problem by and large. Information retrieval, mostly through the Internet, is becoming the most demanding feature in technological progress and this web-based data access should ideally be in user-selective form. While blind-folded access of data through the World Wide Web is quite streamlined, the counter-half of the facet, namely, seamless access of information database pertaining to a specific end-device, e.g. robotic systems, is still in a formative stage. This paradigm of access as well as systematic query-based retrieval of data, related to the physical enddevice is very crucial in designing the Internet-based network control of the same in real-time. Moreover, this control of the end-device is directly linked up to the characteristics of three coupled metrics, namely, ‘multiple databases’, ‘multiple servers’ and ‘multiple inputs’ (to each server). This triad, viz. database-input-server (DIS) plays a significant role in overall performance of the system, the background details of which is still very sketchy in global research community. This work addresses the technical issues associated with this theology, with specific reference to formalism of a customized DIS considering real-time delay analysis. The present paper delineates the developmental paradigms of novel multi-input multioutput communication semantics for retrieving web-based information from physical devices, namely, two representative robotic sub-systems in a coherent and homogeneous mode. The developed protocol can be entrusted for use in real-time in a complete user-friendly manner

    The State-of-the-arts in Focused Search

    Get PDF
    The continuous influx of various text data on the Web requires search engines to improve their retrieval abilities for more specific information. The need for relevant results to a user’s topic of interest has gone beyond search for domain or type specific documents to more focused result (e.g. document fragments or answers to a query). The introduction of XML provides a format standard for data representation, storage, and exchange. It helps focused search to be carried out at different granularities of a structured document with XML markups. This report aims at reviewing the state-of-the-arts in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It is concluded with highlight of open problems
    corecore