23,489 research outputs found

    A Benchmark for Image Retrieval using Distributed Systems over the Internet: BIRDS-I

    Full text link
    The performance of CBIR algorithms is usually measured on an isolated workstation. In a real-world environment the algorithms would only constitute a minor component among the many interacting components. The Internet dramati-cally changes many of the usual assumptions about measuring CBIR performance. Any CBIR benchmark should be designed from a networked systems standpoint. These benchmarks typically introduce communication overhead because the real systems they model are distributed applications. We present our implementation of a client/server benchmark called BIRDS-I to measure image retrieval performance over the Internet. It has been designed with the trend toward the use of small personalized wireless systems in mind. Web-based CBIR implies the use of heteroge-neous image sets, imposing certain constraints on how the images are organized and the type of performance metrics applicable. BIRDS-I only requires controlled human intervention for the compilation of the image collection and none for the generation of ground truth in the measurement of retrieval accuracy. Benchmark image collections need to be evolved incrementally toward the storage of millions of images and that scaleup can only be achieved through the use of computer-aided compilation. Finally, our scoring metric introduces a tightly optimized image-ranking window.Comment: 24 pages, To appear in the Proc. SPIE Internet Imaging Conference 200

    Evaluating the benefits of key-value databases for scientific applications

    Get PDF
    The convergence of Big Data applications with High-Performance Computing requires new methodologies to store, manage and process large amounts of information. Traditional storage solutions are unable to scale and that results in complex coding strategies. For example, the brain atlas of the Human Brain Project has the challenge to process large amounts of high-resolution brain images. Given the computing needs, we study the effects of replacing a traditional storage system with a distributed Key-Value database on a cell segmentation application. The original code uses HDF5 files on GPFS through an intricate interface, imposing synchronizations. On the other hand, by using Apache Cassandra or ScyllaDB through Hecuba, the application code is greatly simplified. Thanks to the Key-Value data model, the number of synchronizations is reduced and the time dedicated to I/O scales when increasing the number of nodes.This project/research has received funding from the European Unions Horizon 2020 Framework Programme for Research and Innovation under the Speci c Grant Agreement No. 720270 (Human Brain Project SGA1) and the Speci c Grant Agreement No. 785907 (Human Brain Project SGA2). This work has also been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), and by Generalitat de Catalunya (contract 2017-SGR-1414).Postprint (author's final draft

    The study of probability model for compound similarity searching

    Get PDF
    Information Retrieval or IR system main task is to retrieve relevant documents according to the users query. One of IR most popular retrieval model is the Vector Space Model. This model assumes relevance based on similarity, which is defined as the distance between query and document in the concept space. All currently existing chemical compound database systems have adapt the vector space model to calculate the similarity of a database entry to a query compound. However, it assumes that fragments represented by the bits are independent of one another, which is not necessarily true. Hence, the possibility of applying another IR model is explored, which is the Probabilistic Model, for chemical compound searching. This model estimates the probabilities of a chemical structure to have the same bioactivity as a target compound. It is envisioned that by ranking chemical structures in decreasing order of their probability of relevance to the query structure, the effectiveness of a molecular similarity searching system can be increased. Both fragment dependencies and independencies assumption are taken into consideration in achieving improvement towards compound similarity searching system. After conducting a series of simulated similarity searching, it is concluded that PM approaches really did perform better than the existing similarity searching. It gave better result in all evaluation criteria to confirm this statement. In terms of which probability model performs better, the BD model shown improvement over the BIR model

    Improving average ranking precision in user searches for biomedical research datasets

    Full text link
    Availability of research datasets is keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorisation method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries. Our system provides competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP among the participants, being +22.3% higher than the median infAP of the participant's best submissions. Overall, it is ranked at top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed positive impact on the system's performance increasing our baseline up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. Our similarity measure algorithm seems to be robust, in particular compared to Divergence From Randomness framework, having smaller performance variations under different training conditions. Finally, the result categorization did not have significant impact on the system's performance. We believe that our solution could be used to enhance biomedical dataset management systems. In particular, the use of data driven query expansion methods could be an alternative to the complexity of biomedical terminologies

    An analysis of the use of graphics for information retrieval

    Get PDF
    Several research groups have addressed the problem of retrieving vector graphics. This work has, however, focused either on domain-dependent areas or was based on very simple graphics languages. Here we take a fresh look at the issue of graphics retrieval in general and in particular at the tasks which retrieval systems must support. The paper presents a series of case studies which explored the needs of professionals in the hope that these needs can help direct future graphics IR research. Suggested modelling techniques for some of the graphic collections are also presented
    corecore