The Effects of Index Storage on Ranked Information Retrieval
Information retrieval is the process of recalling and ordering all relevant documents based on a user's search query. Examples of information retrieval systems are Google, Bing, and Yahoo search. To perform an effective search, these systems use an inverted index that maps content, such as words, to the original documents. It is widely believed that there are two options for implementing an inverted index: in memory or as a file. This investigation examines a third option, implementing the inverted index as a table in a database, and compares it with the other two. It also looks for the optimal combination of inverted index implementation and retrieval algorithm, considering TF-IDF, Best Match 25, and a unigram language model with Jelinek-Mercer smoothing. This is determined by designing and developing a system that indexes and searches three collections that differ in data, size, and complexity. The results show that an inverted index implemented in a database is a viable option for information retrieval. It is also noteworthy that Best Match 25 or a unigram language model consistently outperforms TF-IDF. In conclusion, if the collection cannot be indexed in memory, a database-implemented index is a sufficient second option.
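To make the idea of ranked retrieval over an inverted index concrete, here is a minimal in-memory sketch in Python. It is not the system described in the abstract: the function names, the whitespace tokenizer, the parameter defaults (k1=1.2, b=0.75), and the particular Best Match 25 variant with a smoothed IDF are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term_frequency}; also record document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25(query, index, lengths, k1=1.2, b=0.75):
    """Score documents with one common BM25 variant; return (doc_id, score) pairs, best first."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        df = len(postings)
        if df == 0:
            continue  # term does not occur in the collection
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer documents need more occurrences to score the same.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    1: "inverted index in memory",
    2: "database table index",
    3: "ranking with bm25",
}
index, lengths = build_index(docs)
ranking = bm25("inverted index", index, lengths)
```

Only documents that share at least one term with the query appear in `ranking`. The same postings could equally be stored as rows of a database table keyed by term, which is the storage option the abstract investigates.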
Efficient and Effective Query Auto-Completion
Query Auto-Completion (QAC) is a ubiquitous feature of modern textual search
systems, suggesting possible ways of completing the query being typed by the
user. Efficiency is crucial for real-time responsiveness when the system
operates in a million-scale search space. Prior work has extensively
advocated the use of a trie data structure for fast prefix-search operations in
compact space. However, searching by prefix has little discovery power in that
only completions that are prefixed by the query are returned. This may
negatively impact the effectiveness of the QAC system, with a consequent monetary loss
for real applications such as Web search engines and eCommerce. In this work we
describe the implementation that powers a new QAC system at eBay, and discuss
its efficiency and effectiveness in relation to other state-of-the-art
approaches. The solution is based on the combination of an inverted index
with succinct data structures, a much less explored direction in the
literature. This system is replacing the previous implementation based on
Apache SOLR, which was not always able to meet the required
service-level agreement.
Comment: Published in SIGIR 202
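The contrast between prefix search and inverted-index search for auto-completion can be sketched as follows. This is an illustrative toy in Python, not the eBay implementation: it uses plain dictionaries and sets rather than succinct data structures, and the function names, the sample completions, and the rule that only the last query token is treated as a prefix are assumptions.

```python
from collections import defaultdict


def build_term_index(completions):
    """Inverted index: term -> set of ids of the completions that contain it."""
    index = defaultdict(set)
    for cid, text in enumerate(completions):
        for term in text.split():
            index[term].add(cid)
    return index


def prefix_search(completions, query):
    """Classic prefix search: keep completions that start with the query string."""
    return [c for c in completions if c.startswith(query)]


def conjunctive_search(completions, index, query):
    """Keep completions that contain every complete query term and
    some term starting with the last (possibly partial) token."""
    tokens = query.split()
    if not tokens:
        return []
    *complete, last = tokens
    candidates = set()
    for term, ids in index.items():  # linear scan over the vocabulary, for brevity
        if term.startswith(last):
            candidates |= ids
    for term in complete:
        candidates &= index.get(term, set())
    return [completions[cid] for cid in sorted(candidates)]


completions = [
    "nike running shoes",
    "running shoes for men",
    "shoes nike air",
    "adidas running shorts",
]
index = build_term_index(completions)
by_prefix = prefix_search(completions, "running sh")
by_terms = conjunctive_search(completions, index, "running sh")
```

Here `by_prefix` holds only the completion that literally begins with the query, while `by_terms` also surfaces completions in which the query terms appear in other positions, which is the extra discovery power the abstract refers to.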
INVERTED INDEX COMPRESSION USING THE GAMMA CODE METHOD IN AN INFORMATION RETRIEVAL SYSTEM
ABSTRACT: In an information retrieval system, an inverted index is used to evaluate queries. The more documents that must be stored, the larger the inverted index becomes, and the more queries the search system must process. Performance optimization is therefore needed to cope with growing index storage and query load, and one such optimization is inverted index compression. Compressing the inverted index is expected to reduce storage space requirements and increase memory cache usage, thus avoiding full disk access during query evaluation. One inverted index compression method is the Gamma code, which turns an integer into a binary codeword. The compressed data are document IDs and term frequencies. This final project tests inverted index compression in an information retrieval system on a small document collection and a large document collection. From the analysis of the test results, we conclude that Gamma code performs well in both inverted index size and query processing time on a large document collection. For index size, the terms of a large collection are distributed across many documents, which results in a shorter Gamma encoding for each posting. For query processing, the average query processing time ratio of the large collection is lower than that of the small collection. Keywords: compression, inverted index, Gamma code, integer
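As an illustration of how a Gamma code turns integers into binary codewords, the Python sketch below encodes a postings list. It uses the Elias gamma convention of leading zeros followed by the binary form of the number, and it assumes that document IDs are positive, sorted, and stored as gaps between consecutive IDs. The gap encoding and all function names are assumptions, since the abstract does not spell out these details.

```python
def gamma_encode(n):
    """Elias gamma codeword for an integer n >= 1, as a string of '0' and '1'."""
    if n < 1:
        raise ValueError("gamma code is defined for integers >= 1")
    binary = bin(n)[2:]  # e.g. 9 -> '1001'
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of gamma codewords into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_postings(doc_ids):
    """Encode strictly increasing positive doc IDs as gamma-coded gaps (assumed scheme)."""
    prev, out = 0, []
    for doc_id in doc_ids:
        out.append(gamma_encode(doc_id - prev))
        prev = doc_id
    return "".join(out)


def decode_postings(bits):
    """Rebuild the doc IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in gamma_decode(bits):
        total += gap
        doc_ids.append(total)
    return doc_ids


encoded = encode_postings([3, 7, 8, 20])
decoded = decode_postings(encoded)
```

Small gaps get short codewords (a gap of 1 costs a single bit), which is consistent with the abstract's observation that terms spread across many documents yield a shorter encoding per posting.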
CONTEXT-BASED AUTOSUGGEST ON GRAPH DATA
Autosuggest is an important feature in any search application. Currently, most applications suggest only a single term, based on how frequently that term appears in the indexed documents or how often it is searched for. These approaches might not provide the most relevant suggestions, because users often enter a series of related query terms to answer a question they have in mind. In this project, we implemented the Smart Solr Suggester plugin using a context-based approach that takes into account the relationships among search keywords. In particular, we used the keywords that the user has chosen so far in the search text box as the context to autosuggest their next incomplete keyword. This context-based approach uses the relationships between entities in the graph data that the user is searching and therefore provides more meaningful suggestions.
The NASA Astrophysics Data System: Architecture
The powerful discovery capabilities available in the ADS bibliographic
services are possible thanks to the design of a flexible search and retrieval
system based on a relational database model. Bibliographic records are stored
as a corpus of structured documents containing fielded data and metadata, while
discipline-specific knowledge is segregated in a set of files independent of
the bibliographic data itself.
The creation and management of links to both internal and external resources
associated with each bibliography in the database is made possible by
representing them as a set of document properties and their attributes.
To improve global access to the ADS data holdings, a number of mirror sites
have been created by cloning the database contents and software on a variety of
hardware and software platforms.
The procedures used to create and manage the database and its mirrors have
been written as a set of scripts that can be run in either an interactive or
unsupervised fashion.
The ADS can be accessed at http://adswww.harvard.edu
Comment: 25 pages, 8 figures, 3 table