
    The Lucene for Information Access and Retrieval Research (LIARR) Workshop at SIGIR 2017

    As an empirical discipline, information access and retrieval research requires substantial software infrastructure to index and search large collections. This workshop is motivated by the desire to better align information retrieval research with the practice of building search applications from the perspective of open-source information retrieval systems. Our goal is to promote the use of Lucene for information access and retrieval research.

    HDIdx: High-Dimensional Indexing for Efficient Approximate Nearest Neighbor Search

    Fast Nearest Neighbor (NN) search is a fundamental challenge in large-scale data processing and analytics, particularly for analyzing multimedia content, which is often of high dimensionality. Instead of using exact NN search, extensive research efforts have focused on approximate NN search algorithms. In this work, we present "HDIdx", an efficient high-dimensional indexing library for fast approximate NN search, which is open-source and written in Python. It offers a family of state-of-the-art algorithms that convert input high-dimensional vectors into compact binary codes, making them very efficient and scalable for NN search with very low space complexity.
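
    The idea of searching over compact binary codes can be illustrated with a minimal sketch. It does not use HDIdx or reproduce its API or algorithms; it substitutes plain random-hyperplane hashing, and every function name below (`make_hyperplanes`, `encode`, `hamming`, `search`) is hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes of dimension dim (illustrative helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def encode(vec, planes):
    """Pack the sign of each projection into one integer, one bit per hyperplane."""
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Count the bits in which two codes differ."""
    return bin(a ^ b).count("1")


def search(query, codes, planes, k=3):
    """Return the indices of the k stored codes nearest to the query in Hamming distance."""
    q = encode(query, planes)
    return sorted(range(len(codes)), key=lambda i: hamming(q, codes[i]))[:k]


if __name__ == "__main__":
    planes = make_hyperplanes(dim=8, n_bits=16)
    rng = random.Random(42)
    database = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(100)]
    codes = [encode(v, planes) for v in database]
    print(search(database[0], codes, planes))
```

    Each vector is stored as a 16-bit integer here, so the distance computation reduces to an XOR and a bit count; the result is approximate because vectors that are far apart can still share a code.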

    Investigating people: a qualitative analysis of the search behaviours of open-source intelligence analysts

    The Internet and the World Wide Web have become integral parts of the lives of many modern individuals, enabling almost instantaneous communication, sharing and broadcasting of thoughts, feelings and opinions. Much of this information is publicly facing, and as such, it can be utilised in a multitude of online investigations, ranging from employee vetting and credit checking to counter-terrorism and fraud prevention/detection. However, the search needs and behaviours of these investigators are not well documented in the literature. In order to address this gap, an in-depth qualitative study was carried out in cooperation with a leading investigation company. The research contribution is an initial identification of Open-Source Intelligence investigator search behaviours, the procedures and practices that they undertake, along with an overview of the difficulties and challenges that they encounter as part of their domain. This lays the foundation for future research into the varied domain of Open-Source Intelligence gathering.

    Managing Search Strategies for Open Innovation: The Role of Environmental Munificence as well as Internal and External R&D

    Firms increasingly compete in an open innovation environment. Search strategies for external knowledge therefore become crucial for firm success. Existing research differentiates between the breadth (diversity) and depth (intensity) with which firms pursue external knowledge sources. A consensus exists that resource constraints force firms to balance both dimensions. However, relatively little is known about how managers can selectively strengthen one of these dimensions. We argue conceptually that the breadth and depth of a search strategy depend upon the nature of a firm's absorptive capacity (i.e. whether it is built through internal or external R&D activities) and the munificence of its innovation environment. We test these hypotheses empirically for a large sample of more than 8,300 firms from 12 European countries. Our empirical results show that in-house R&D strengthens the depth of a firm's search strategy while external R&D activities (e.g. contract research) increase its breadth. Moreover, we find that scarce innovation environments favor deep search strategies while breadth is more prevalent in munificent environments. We develop targeted management recommendations based on these results. Keywords: open innovation, absorptive capacity, search strategies

    Development of 2MASS Catalog Server Kit

    We develop a software kit called "2MASS Catalog Server Kit" to easily construct a high-performance database server for the 2MASS Point Source Catalog (which includes 470,992,970 objects) and several all-sky catalogs. Users can perform fast radial and rectangular searches using the provided SQL stored functions, similar to SDSS SkyServer. Our software kit uses an open-source RDBMS, so any astronomer or developer can install it on a personal computer for research, observation, etc. Our kit is tuned for optimal coordinate search performance. We implement an effective radial search using an orthogonal coordinate system, which does not need any techniques that depend on HTM or HEALPix. Applying the xyz coordinate system to the database index, we can easily implement a system of fast radial search for relatively small (less than several million rows) catalogs. To enable high-speed search of huge catalogs on an RDBMS, we apply three additional techniques: table partitioning, composite expression index, and optimization in stored functions. As a result, we obtain satisfactory performance of radial search for the 2MASS catalog. Our system can also perform fast rectangular search, which is implemented using techniques similar to those applied for radial search. Our implementation approach enables a compact system and offers useful hints for low-cost development of other huge catalog databases. Comment: 2011 PASP accepted
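
    The radial search on an orthogonal (xyz) coordinate system can be sketched in a few lines. This is only an illustration under stated assumptions, not the kit's SQL stored functions: the function names, the in-memory list standing in for a database table, and the bounding-box pre-filter (which stands in for an index range scan on x, y, z) are all hypothetical.

```python
import math


def radec_to_xyz(ra_deg, dec_deg):
    """Convert right ascension and declination (degrees) to a unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def radial_search(catalog, ra_deg, dec_deg, radius_deg):
    """Return ids of catalog rows (id, ra, dec) within radius_deg of the center."""
    cx, cy, cz = radec_to_xyz(ra_deg, dec_deg)
    # An angular radius r on the unit sphere corresponds to a chord of 2*sin(r/2).
    chord = 2.0 * math.sin(math.radians(radius_deg) / 2.0)
    hits = []
    for obj_id, ra, dec in catalog:
        x, y, z = radec_to_xyz(ra, dec)
        # Cheap box test; in a database this would be a range condition on indexed x, y, z.
        if abs(x - cx) > chord or abs(y - cy) > chord or abs(z - cz) > chord:
            continue
        # Exact test on the squared chord length.
        if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= chord ** 2:
            hits.append(obj_id)
    return hits


if __name__ == "__main__":
    toy_catalog = [("a", 10.0, 20.0), ("b", 10.05, 20.0), ("c", 190.0, -20.0)]
    print(radial_search(toy_catalog, 10.0, 20.0, 0.1))
```

    Because distances are compared in Cartesian space, the sketch needs no special handling for the wrap-around at RA = 0/360 degrees or for the poles.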

    Comprehensive characterization of an open source document search engine

    This work performs a thorough characterization and analysis of the open source Lucene search library. The article describes in detail the architecture, functionality, and micro-architectural behavior of the search engine, and investigates prominent online document search research issues. In particular, we study how intra-server index partitioning affects the response time and throughput, explore the potential use of low power servers for document search, and examine the sources of performance degradation and the causes of tail latencies. Some of our main conclusions are the following: (a) intra-server index partitioning can reduce tail latencies but with diminishing benefits as incoming query traffic increases, (b) low power servers given enough partitioning can provide the same average and tail response times as conventional high performance servers, (c) index search is a CPU-intensive cache-friendly application, and (d) C-states are the main culprits for performance degradation in document search. Web of Science 162 art. no. 1

    Facilitating Wiki/Repository Communication with Metadata

    4th International Conference on Open Repositories. This presentation was part of the session: Fedora User Group Presentations. Date: 2009-05-20, 01:30 PM – 03:00 PM. The National Science Digital Library (NSDL) Materials Digital Library Pathway (MatDL) has implemented an information infrastructure to disseminate government funded research results and to provide content as well as services to support the integration of research and education in materials. This paper describes how we are enabling two-way communication between a digital repository and open-source collaborative tools, such as wikis, to support users in materials research and education in the creation and re-use of compelling learning resources. A search results plug-in for MediaWiki has been developed to display relevant search results from the Fedora-based MatDL repository in the Soft Matter Wiki established and developed by MatDL and its partners. Wiki-to-repository information transfer has also been facilitated by mapping the metadata associated with resources originating in the wiki onto Dublin Core (DC) metadata elements and making the metadata and resources available in the repository. The Materials Digital Library Pathway (DUE-0532831) is supported by the National Science Foundation.

    Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising

    Sponsored search represents a major source of revenue for web search engines. This popular advertising model brings a unique possibility for advertisers to target users' immediate intent communicated through a search query, usually by displaying their ads alongside organic search results for queries deemed relevant to their products or services. However, due to the large number of unique queries, it is challenging for advertisers to identify all such relevant queries. For this reason, search engines often provide a service of advanced matching, which automatically finds additional relevant queries for advertisers to bid on. We present a novel advanced matching approach based on the idea of semantic embeddings of queries and ads. The embeddings were learned using a large data set of user search sessions, consisting of search queries, clicked ads and search links, while utilizing contextual information such as dwell time and skipped ads. To address the large-scale nature of our problem, both in terms of data and vocabulary size, we propose a novel distributed algorithm for training the embeddings. Finally, we present an approach for overcoming the cold-start problem associated with new ads and queries. We report results of editorial evaluation and online tests on actual search traffic. The results show that our approach significantly outperforms baselines in terms of relevance, coverage, and incremental revenue. Lastly, we open-source learned query embeddings to be used by researchers in computational advertising and related fields. Comment: 10 pages, 4 figures, 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy
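
    Once queries and ads share an embedding space, matching can be sketched as a nearest-neighbor ranking. The example below is a toy illustration only: it assumes cosine similarity as the scoring function, which the abstract does not state, and the function names and two-dimensional vectors are made up rather than taken from the released embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_ads(query_vec, ad_vecs, top_k=2, min_sim=0.0):
    """Rank ads by similarity to the query and keep the top_k scoring at least min_sim."""
    scored = [(cosine(query_vec, vec), ad_id) for ad_id, vec in ad_vecs.items()]
    scored = [(sim, ad_id) for sim, ad_id in scored if sim >= min_sim]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(ad_id, sim) for sim, ad_id in scored[:top_k]]


if __name__ == "__main__":
    # Made-up 2-D vectors; real embeddings would have many more dimensions.
    toy_ads = {"ad_shoes": [1.0, 0.0], "ad_boots": [0.8, 0.6], "ad_cars": [0.0, 1.0]}
    for ad_id, sim in match_ads([1.0, 0.1], toy_ads):
        print(ad_id, round(sim, 3))
```

    The `min_sim` threshold is one simple way to trade coverage for relevance: raising it returns fewer but closer matches.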

    The White Rose Consortium ePrints Repository: creating a shared institutional repository for the Universities of Leeds, Sheffield and York

    The White Rose Consortium ePrints Repository was created as part of the JISC-funded SHERPA project. The Consortium is a partnership between the Universities of Leeds, Sheffield and York. The three universities share a single installation of the open source EPrints software (developed by Southampton University). The repository houses published research output from across the consortium – primarily peer-reviewed journal papers – and can be viewed at http://eprints.whiterose.ac.uk/. Currently, all the repository content is openly accessible and our access statistics suggest a good level of usage, with many users coming into the system through Google and other search engines.