95,985 research outputs found

    The Effects of Index Storage on Ranked Information Retrieval

    Information retrieval is the process of recalling and ordering all relevant documents based on a user's search query. Examples of information retrieval systems are Google, Bing, and Yahoo search. To perform an effective search, these systems use an inverted index to map content, such as words, to the documents that contain it. It is widely believed that there are two options for implementing an inverted index: in memory or as a file. This investigation compares a third option, implementing the inverted index as a table in a database, with the other two. It also looks for the best pairing of inverted index implementation with retrieval algorithms such as TF-IDF, Best Match 25 (BM25), and a unigram language model with Jelinek-Mercer smoothing. This is determined by designing and developing a system that indexes and searches three collections differing in data, size, and complexity. The results show that an inverted index implemented in a database is a viable option for information retrieval. It is also noteworthy that Best Match 25 or a unigram language model consistently outperforms TF-IDF. In conclusion, if the collection cannot be indexed in memory, a database-implemented index is a sufficient second option.
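    As a rough illustration of the in-memory option paired with Best Match 25, the sketch below builds a term-to-postings dictionary and ranks documents with the standard Okapi BM25 formula. It is not the system from the investigation: the function names, the sample documents, whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency}; also return document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25_search(query, index, lengths, k1=1.2, b=0.75):
    """Rank documents for a query with the Okapi BM25 formula."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        # IDF variant with +1 inside the log so the weight is never negative.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy collection, for illustration only.
docs = {
    1: "inverted index stored in memory",
    2: "inverted index stored as a database table",
    3: "ranking documents with a language model",
}
index, lengths = build_index(docs)
results = bm25_search("database index", index, lengths)
```

    Here `results` is a list of `(doc_id, score)` pairs in descending score order. A database-backed variant would keep the same scoring loop and fetch the postings for each term from a table instead of a dictionary.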

    Efficient and Effective Query Auto-Completion

    Query Auto-Completion (QAC) is a ubiquitous feature of modern textual search systems, suggesting possible ways of completing the query being typed by the user. Efficiency is crucial for real-time responsiveness when operating in a million-scale search space. Prior work has extensively advocated the use of a trie data structure for fast prefix-search operations in compact space. However, searching by prefix has little discovery power, in that only completions that are prefixed by the query are returned. This may negatively impact the effectiveness of the QAC system, with a consequent monetary loss for real applications like Web search engines and eCommerce. In this work we describe the implementation that powers a new QAC system at eBay, and discuss its efficiency and effectiveness in relation to other state-of-the-art approaches. The solution is based on the combination of an inverted index with succinct data structures, a much less explored direction in the literature. This system is replacing the previous implementation based on Apache SOLR, which was not always able to meet the required service-level agreement. Comment: Published in SIGIR 202
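    To make the contrast between prefix search and inverted-index search concrete, here is a minimal sketch. It is not the eBay implementation: the function names and sample completions are hypothetical, and a plain dictionary with a linear scan over the vocabulary stands in for the succinct data structures the abstract mentions.

```python
from collections import defaultdict


def build_term_index(completions):
    """Map each term to the set of ids of completions that contain it."""
    index = defaultdict(set)
    for cid, text in enumerate(completions):
        for term in text.split():
            index[term].add(cid)
    return index


def prefix_search(completions, query):
    """Baseline: return only completions that start with the typed query."""
    return [c for c in completions if c.startswith(query)]


def conjunctive_search(completions, index, query):
    """Return completions that contain every finished query term and at
    least one term extending the last, possibly unfinished, token."""
    tokens = query.split()
    if not tokens:
        return []
    *done, last = tokens
    # Union of postings for every vocabulary term that extends the last token.
    # A real system would range-search a sorted dictionary instead of scanning.
    candidates = set()
    for term, ids in index.items():
        if term.startswith(last):
            candidates |= ids
    # Intersect with the postings of each fully typed term.
    for term in done:
        candidates &= index.get(term, set())
    return [completions[cid] for cid in sorted(candidates)]


# Hypothetical completions, for illustration only.
completions = [
    "iphone case",
    "samsung phone case",
    "phone charger",
    "case for iphone",
]
index = build_term_index(completions)
by_prefix = prefix_search(completions, "case ip")
by_index = conjunctive_search(completions, index, "case ip")
```

    With the query `case ip`, the prefix baseline finds nothing because no completion starts with that string, while the conjunctive search still returns both completions that mention a case and an iPhone, regardless of term order.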

    INVERTED INDEX COMPRESSION DENGAN METODE GAMMA CODE PADA INFORMATION RETRIEVAL SYSTEM

    ABSTRACT: In an information retrieval system, an inverted index is used to evaluate queries. The more documents that must be stored, the larger the inverted index becomes, and the more queries must be processed when searching those documents. An optimization is therefore needed to cope with the growing index storage and query load, and one such optimization is inverted index compression. Compressing the inverted index is expected to reduce its storage requirements and increase use of the memory cache, thus avoiding full disk access during query evaluation. One inverted index compression method is the Gamma code, which turns an integer into a binary codeword. The compressed data are document IDs and term frequencies. This final project tests inverted index compression in an information retrieval system on a small document collection and a large document collection. From the analysis of the test results, we conclude that the Gamma code performs well on the large document collection in both inverted index size and query processing time. For index size, the terms of the large collection are spread across many documents, so each posting gets a shorter Gamma encoding. For query processing, the average query processing time ratio of the large collection is lower than that of the small collection. Keywords: compression, inverted index, Gamma code, integer
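    The Gamma code mentioned above can be sketched in a few lines. The version below follows the common textbook convention (the length of the binary offset in unary, then the offset itself) and works on strings of '0' and '1' for readability rather than packed bits. Storing document IDs as gaps between consecutive IDs is an assumption made here because it is standard practice; the abstract does not say whether the thesis does so, and all function names are hypothetical.

```python
def gamma_encode(n):
    """Gamma-encode a positive integer as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("the Gamma code is defined for integers >= 1")
    offset = bin(n)[3:]  # binary digits of n without the leading 1
    # Unary length of the offset (that many 1s, then a 0), followed by the offset.
    return "1" * len(offset) + "0" + offset


def gamma_decode_stream(bits):
    """Decode a concatenation of Gamma codewords back into integers."""
    values = []
    i = 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # read the unary length
            length += 1
            i += 1
        i += 1  # skip the terminating 0
        values.append(int("1" + bits[i:i + length], 2))
        i += length
    return values


def encode_postings(doc_ids):
    """Gap-encode sorted, positive document IDs, then Gamma-encode each gap."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(gamma_encode(g) for g in gaps)


def decode_postings(bits):
    """Invert encode_postings by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in gamma_decode_stream(bits):
        total += gap
        doc_ids.append(total)
    return doc_ids


# Hypothetical postings list, for illustration only.
encoded = encode_postings([3, 7, 8, 20])
decoded = decode_postings(encoded)
```

    Small integers get short codewords (the value 1 takes a single bit), which is consistent with the abstract's observation that terms spread across many documents, and hence small gaps, compress well.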

    CONTEXT-BASED AUTOSUGGEST ON GRAPH DATA

    Autosuggest is an important feature in any search application. Currently, most applications suggest only a single term, based on how frequently that term appears in the indexed documents or how often it is searched for. These approaches might not provide the most relevant suggestions, because users often enter a series of related query terms to answer a question they have in mind. In this project, we implemented the Smart Solr Suggester plugin using a context-based approach that takes into account the relationships among search keywords. In particular, we used the keywords that the user has chosen so far in the search text box as the context to autosuggest their next incomplete keyword. This context-based approach uses the relationships between entities in the graph data that the user is searching and therefore would provide more meaningful suggestions.
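    One simple way to picture the context-based idea is to rank candidate keywords by how many of the already chosen keywords they are linked to in the graph. The sketch below does exactly that and nothing more. It is not the Smart Solr Suggester plugin and uses no Solr API; the `suggest` function, the adjacency-dictionary representation, and the sample graph are all hypothetical.

```python
def suggest(graph, context, prefix, limit=5):
    """Suggest entities that start with `prefix`, ranked by how many
    context keywords they are directly linked to in the graph.

    graph: dict mapping each entity to the set of entities related to it.
    context: keywords the user has already chosen.
    """
    scored = []
    for entity in graph:
        if not entity.startswith(prefix) or entity in context:
            continue
        # Count the context keywords that have an edge to this candidate.
        links = sum(1 for c in context if entity in graph.get(c, set()))
        scored.append((-links, entity))
    scored.sort()  # most links first, ties broken alphabetically
    return [entity for _, entity in scored[:limit]]


# Hypothetical graph data, for illustration only.
graph = {
    "python": {"pandas", "parsing"},
    "pandas": {"python", "dataframe"},
    "parsing": {"python"},
    "paris": {"france"},
    "france": {"paris"},
    "dataframe": {"pandas"},
}
after_python = suggest(graph, ["python"], "pa")
after_france = suggest(graph, ["france"], "pa")
```

    The same prefix `pa` produces a different ordering depending on the context: entities related to `python` come first in one case, and `paris` comes first when the context is `france`.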

    The NASA Astrophysics Data System: Architecture

    The powerful discovery capabilities available in the ADS bibliographic services are possible thanks to the design of a flexible search and retrieval system based on a relational database model. Bibliographic records are stored as a corpus of structured documents containing fielded data and metadata, while discipline-specific knowledge is segregated in a set of files independent of the bibliographic data itself. The creation and management of links to both internal and external resources associated with each bibliography in the database is made possible by representing them as a set of document properties and their attributes. To improve global access to the ADS data holdings, a number of mirror sites have been created by cloning the database contents and software on a variety of hardware and software platforms. The procedures used to create and manage the database and its mirrors have been written as a set of scripts that can be run in either an interactive or unsupervised fashion. The ADS can be accessed at http://adswww.harvard.edu. Comment: 25 pages, 8 figures, 3 tables