The Effects of Index Storage on Ranked Information Retrieval

Author: Mantheiy James E., Jr.
Publication venue: The Research Repository @ WVU
Publication date: 01/12/2012

Information retrieval is the process of recalling and ordering all relevant documents based on a user's search query. Examples of information retrieval systems are Google, Bing, and Yahoo search. To perform an effective search, these systems use an inverted index that maps content, such as words, to the documents that contain it. It is widely believed that there are two options for implementing an inverted index: in memory or as a file. This investigation looks at implementing an inverted index as a table in a database and compares it with the other two options. In addition, it looks at the optimal combination of inverted index implementation and retrieval algorithm, considering TF-IDF, Best Match 25 (BM25), and a unigram language model with Jelinek-Mercer smoothing. This is determined by designing and developing a system that indexes and searches three collections that differ in data, size, and complexity. The results show that an inverted index implemented in a database is a viable option for information retrieval. It is also noteworthy that Best Match 25 or a unigram language model consistently outperforms TF-IDF. In conclusion, if the collection cannot be indexed in memory, then a database-implemented index is a sufficient second option.
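
As a rough illustration of the ideas named in this abstract, the following sketch builds an in-memory inverted index and ranks documents with Okapi BM25. It is not the thesis's system: the function names `build_index` and `bm25`, the toy documents, whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to {doc_id: term frequency}; also return document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        terms = text.lower().split()          # naive whitespace tokenizer (assumption)
        lengths[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            index[term][doc_id] = tf
    return index, lengths

def bm25(query, index, lengths, k1=1.2, b=0.75):
    """Rank documents for a query with Okapi BM25; k1 and b are illustrative defaults."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue                           # term absent from the collection
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    # Highest score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy collection
docs = {
    1: "inverted index stored in memory",
    2: "inverted index stored in a database table",
    3: "query auto completion with a trie",
}
index, lengths = build_index(docs)
ranking = bm25("database index", index, lengths)
```

Storing the same `term -> (doc_id, tf)` postings as rows of a database table, as the abstract proposes, would change only where `index` lives; the scoring loop would read postings from a query result instead of a dictionary.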

The Research Repository @ WVU (West Virginia University)

Efficient and Effective Query Auto-Completion

Author: Fano R. M.
Krishnan U.
Martinez-Prieto M. A.
Pibiri G. E.
Plaisance J.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 10/06/2020

Query Auto-Completion (QAC) is a ubiquitous feature of modern textual search systems, suggesting possible ways of completing the query being typed by the user. Efficiency is crucial for real-time responsiveness when operating in a million-scale search space. Prior work has extensively advocated the use of a trie data structure for fast prefix-search operations in compact space. However, searching by prefix has little discovery power, in that only completions that are prefixed by the query are returned. This may negatively impact the effectiveness of the QAC system, with a consequent monetary loss for real applications like Web search engines and eCommerce. In this work we describe the implementation that powers a new QAC system at eBay, and discuss its efficiency and effectiveness in relation to other state-of-the-art approaches. The solution is based on the combination of an inverted index with succinct data structures, a much less explored direction in the literature. This system is replacing the previous implementation based on Apache SOLR, which was not always able to meet the required service-level agreement. Comment: Published in SIGIR 202
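
The difference in discovery power between prefix search and an inverted-index approach can be sketched as follows. This is only a toy illustration, not the eBay implementation: the completion strings, the function names `prefix_search` and `conjunctive_search`, and the use of plain Python sets instead of succinct data structures are assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical completion log, kept sorted so prefix search can use binary search
completions = sorted([
    "iphone case", "iphone charger", "case for iphone", "samsung case",
])

def prefix_search(query):
    """Return completions that start with the query (what a trie lookup would return)."""
    start = bisect_left(completions, query)
    out = []
    for c in completions[start:]:
        if not c.startswith(query):
            break
        out.append(c)
    return out

# Inverted index: term -> ids of the completions containing that term
index = defaultdict(set)
for cid, c in enumerate(completions):
    for term in c.split():
        index[term].add(cid)

def conjunctive_search(query):
    """Return completions that contain every fully typed term, in any position,
    plus at least one term starting with the last (possibly partial) token."""
    tokens = query.split()
    if not tokens:
        return []
    *done, last = tokens
    candidates = set(range(len(completions)))
    for term in done:
        candidates &= index.get(term, set())
    suffix_ids = set()
    for term, ids in index.items():            # linear scan over terms, fine for a sketch
        if term.startswith(last):
            suffix_ids |= ids
    return [completions[i] for i in sorted(candidates & suffix_ids)]
```

For the query `"iphone c"`, `prefix_search` finds only the two completions that begin with those characters, while `conjunctive_search` also finds `"case for iphone"`, where the typed terms appear in a different order.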

arXiv.org e-Print Archive

Crossref

INVERTED INDEX COMPRESSION DENGAN METODE GAMMA CODE PADA INFORMATION RETRIEVAL SYSTEM (Inverted Index Compression with the Gamma Code Method in an Information Retrieval System)

Author: AULIA PUTRI RAHMADANIA
Publication venue: Universitas Telkom
Publication date: 01/01/2010

ABSTRACT: In an information retrieval system, an inverted index is used to evaluate queries. The more documents that must be stored, the larger the inverted index becomes, and the more queries must be processed when searching those documents. An optimization is therefore needed to cope with growing index storage and query load, and one such optimization is inverted index compression. Inverted index compression is expected to reduce storage space requirements and increase the use of the memory cache, thus avoiding full disk access during query evaluation. One inverted index compression method is the Gamma code, which turns an integer into a binary codeword. The compressed data are document IDs and term frequencies. This final project tests inverted index compression in an information retrieval system on a small document collection and on a large document collection. From the analysis of the test results, we conclude that the Gamma code improves both inverted index size and query processing time on the large document collection. For index size, the terms of the large collection are distributed across many documents, which results in a shorter Gamma encoding for each posting. For query processing, the average query processing time ratio of the large collection is lower than that of the small collection. Keywords: compression, inverted index, Gamma code, integer
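
To make the compression idea concrete, here is a minimal sketch of Elias gamma coding applied to a postings list, assuming document IDs are positive and sorted, and that gaps between consecutive IDs are encoded rather than the IDs themselves. The function names are hypothetical, bits are kept as a text string for readability, and the zero-prefixed codeword layout shown is one common convention (some textbooks use an equivalent unary prefix of ones instead); the thesis may differ in these details.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    binary = bin(n)[2:]                          # e.g. 9 -> "1001"
    return "0" * (len(binary) - 1) + binary      # length prefix in unary, then the binary digits

def gamma_decode(bits):
    """Decode a concatenation of gamma codewords back into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                    # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

def encode_postings(doc_ids):
    """Gamma-encode a sorted list of positive document IDs as gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return "".join(gamma_encode(g) for g in gaps)

def decode_postings(bits):
    """Rebuild document IDs by summing the decoded gaps."""
    ids, total = [], 0
    for gap in gamma_decode(bits):
        total += gap
        ids.append(total)
    return ids

# Hypothetical postings list for one term
encoded = encode_postings([3, 7, 8, 20])         # gaps: 3, 4, 1, 12
```

A gap `g` costs `2 * floor(log2(g)) + 1` bits, so a term that occurs in many documents has small gaps and short codewords, which matches the abstract's explanation of why the large collection compresses well.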

Open Library

ImageTerrier: an extensible platform for scalable high-performance image retrieval

Author: Dupplaw David
Hare Jonathon
Lewis Paul H.
Samangooei Sina
Publication date: 05/06/2012

Southampton (e-Prints Soton)

CONTEXT-BASED AUTOSUGGEST ON GRAPH DATA

Author: Nguyen Hai
Publication venue: SJSU ScholarWorks
Publication date: 21/05/2015

Autosuggest is an important feature in any search application. Currently, most applications only suggest a single term based on how frequently that term appears in the indexed documents or how often it is searched for. These approaches might not provide the most relevant suggestions, because users often enter a series of related query terms to answer a question they have in mind. In this project, we implemented the Smart Solr Suggester plugin using a context-based approach that takes into account the relationships among search keywords. In particular, we used the keywords that the user has chosen so far in the search text box as the context to autosuggest their next incomplete keyword. This context-based approach uses the relationships between entities in the graph data that the user is searching and therefore provides more meaningful suggestions.

SJSU ScholarWorks

The NASA Astrophysics Data System: Architecture

Author: Accomazzi A.
Eichhorn G.
Grant C. S.
Kurtz M. J.
Murray S. S.
Publication venue: 'EDP Sciences'
Publication date: 04/02/2000

The powerful discovery capabilities available in the ADS bibliographic services are possible thanks to the design of a flexible search and retrieval system based on a relational database model. Bibliographic records are stored as a corpus of structured documents containing fielded data and metadata, while discipline-specific knowledge is segregated in a set of files independent of the bibliographic data itself. The creation and management of links to both internal and external resources associated with each bibliography in the database is made possible by representing them as a set of document properties and their attributes. To improve global access to the ADS data holdings, a number of mirror sites have been created by cloning the database contents and software on a variety of hardware and software platforms. The procedures used to create and manage the database and its mirrors have been written as a set of scripts that can be run in either an interactive or unsupervised fashion. The ADS can be accessed at http://adswww.harvard.edu. Comment: 25 pages, 8 figures, 3 tables

arXiv.org e-Print Archive

Crossref

EDP Sciences OAI-PMH repository (1.2.0)