346 research outputs found

    Inverted index compression based on term and document identifier reassignment

    Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2008. Thesis (Master's) -- Bilkent University, 2008. Includes bibliographical references leaves 43-46. Compression of inverted indexes has received great attention in recent years. An inverted index consists of lists of document identifiers, also referred to as posting lists, for each term. Compressing an inverted index reduces the size of the index, which also improves query performance by reducing disk access times. Recent studies have shown that reassigning document identifiers has a great effect on the compression of an inverted index. In this work, we propose a novel technique that reassigns both term and document identifiers of an inverted index by transforming the matrix representation of the index into a block-diagonal form, which improves the compression ratio dramatically. We adapted the row-net hypergraph-partitioning model for the transformation into block-diagonal form, which improves the compression ratio by as much as 50%. To the best of our knowledge, this method performs more effectively than previous inverted index compression techniques. Baykan, İzzet Çağrı. M.S.
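
To illustrate why identifier reassignment can matter, here is a minimal sketch that stores a posting list as variable-byte-coded gaps between consecutive document identifiers. This is a common textbook scheme chosen for illustration, not necessarily the coding used in the thesis, and the function names and the example lists are hypothetical. When the same five documents receive nearby identifiers, the gaps shrink and so does the encoded size.

```python
def vbyte_encode(n: int) -> bytes:
    """Variable-byte code: 7 data bits per byte, high bit set on the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def compressed_size(posting_list):
    """Bytes needed to store a sorted posting list as vbyte-coded gaps."""
    size = 0
    prev = 0
    for doc_id in posting_list:
        size += len(vbyte_encode(doc_id - prev))
        prev = doc_id
    return size


# Hypothetical posting list for one term before and after reassignment:
# the same five documents, first with scattered identifiers, then with
# identifiers that place them next to each other.
scattered = [5, 900, 40_000, 41_000, 250_000]
clustered = [1, 2, 3, 4, 5]

print(compressed_size(scattered), compressed_size(clustered))
```

The sketch only covers the effect of document identifier order on one list; the block-diagonal transformation described in the abstract also reorders terms, which this example does not model.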

    Iterative preprocessing of event logs


    Incremental cluster-based retrieval using compressed cluster-skipping inverted files

    We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size. © 2008 ACM
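
The skipping idea can be sketched as follows, under simplifying assumptions: each posting list is grouped into per-cluster blocks, and blocks whose cluster is not among the best clusters are skipped during scoring. The layout, the `score` function, the toy index, and the use of raw term frequency as the score are all hypothetical. The paper's structure also stores centroid information and recomputes the best clusters from term to term, which this sketch takes as a given input instead.

```python
from collections import defaultdict

# Hypothetical layout: term -> list of (cluster_id, [(doc_id, term_frequency), ...])
index = {
    "index": [(0, [(1, 2), (4, 1)]), (2, [(9, 3)])],
    "compress": [(0, [(1, 1)]), (1, [(6, 2)]), (2, [(9, 1)])],
}


def score(query_terms, best_clusters, index):
    """Accumulate term-frequency scores, visiting only blocks of the best clusters."""
    scores = defaultdict(int)
    for term in query_terms:
        for cluster_id, postings in index.get(term, []):
            if cluster_id not in best_clusters:
                continue  # skip the whole block for this cluster
            for doc_id, tf in postings:
                scores[doc_id] += tf
    return dict(scores)


print(score(["index", "compress"], {0}, index))
```

In a compressed index the benefit of skipping comes from not decoding the skipped blocks at all; the plain Python lists here only show the control flow.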

    Bridging Dense and Sparse Maximum Inner Product Search

    Maximum inner product search (MIPS) over dense and sparse vectors has progressed independently in a bifurcated literature for decades; the latter is better known as top-k retrieval in Information Retrieval. This duality exists because sparse and dense vectors serve different end goals. That is despite the fact that they are manifestations of the same mathematical problem. In this work, we ask if algorithms for dense vectors could be applied effectively to sparse vectors, particularly those that violate the assumptions underlying top-k retrieval methods. We study IVF-based retrieval where vectors are partitioned into clusters and only a fraction of clusters are searched during retrieval. We conduct a comprehensive analysis of dimensionality reduction for sparse vectors, and examine standard and spherical KMeans for partitioning. Our experiments demonstrate that IVF serves as an efficient solution for sparse MIPS. As byproducts, we identify two research opportunities and demonstrate their potential. First, we cast the IVF paradigm as a dynamic pruning technique and turn that insight into a novel organization of the inverted index for approximate MIPS for general sparse vectors. Second, we offer a unified regime for MIPS over vectors that have dense and sparse subspaces, and show its robustness to query distributions.
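
A rough sketch of IVF-style retrieval over sparse vectors is given below, assuming the partition is already known. Sparse vectors are represented as dictionaries from dimension to weight, cluster centroids are plain means, and the names `build_ivf`, `search`, `nprobe`, and the toy data are hypothetical. The paper's dimensionality reduction and KMeans steps are not reproduced here.

```python
def dot(u, v):
    """Inner product of two sparse vectors stored as {dimension: weight}."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(i, 0.0) for i, w in u.items())


def centroid(vecs):
    """Mean of a non-empty list of sparse vectors."""
    acc = {}
    for vec in vecs:
        for i, w in vec.items():
            acc[i] = acc.get(i, 0.0) + w
    return {i: w / len(vecs) for i, w in acc.items()}


def build_ivf(vectors, assignment):
    """Group vector ids by cluster and compute one centroid per cluster."""
    clusters = {}
    for vec_id, cluster_id in assignment.items():
        clusters.setdefault(cluster_id, []).append(vec_id)
    centroids = {c: centroid([vectors[v] for v in ids]) for c, ids in clusters.items()}
    return clusters, centroids


def search(query, vectors, clusters, centroids, nprobe, k):
    """Score only the nprobe clusters whose centroids best match the query."""
    probe = sorted(centroids, key=lambda c: dot(query, centroids[c]), reverse=True)[:nprobe]
    candidates = [v for c in probe for v in clusters[c]]
    return sorted(candidates, key=lambda v: dot(query, vectors[v]), reverse=True)[:k]


# Toy data; the cluster assignment is assumed to come from a prior clustering step.
vectors = {"a": {0: 1.0, 1: 0.5}, "b": {0: 0.8}, "c": {5: 1.0}, "d": {5: 0.7, 6: 0.2}}
assignment = {"a": 0, "b": 0, "c": 1, "d": 1}
clusters, centroids = build_ivf(vectors, assignment)
print(search({0: 1.0}, vectors, clusters, centroids, nprobe=1, k=1))
```

Because only `nprobe` clusters are scored, the result is approximate: a vector with a high inner product that sits in an unprobed cluster is missed.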

    Improving the efficiency of search engines : strategies for focused crawling, searching, and index pruning

    Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009. Thesis (Ph. D.) -- Bilkent University, 2009. Includes bibliographical references leaves 157-169. Search engines are the primary means of retrieval for text data that is abundantly available on the Web. A standard search engine should carry out three fundamental tasks, namely crawling the Web, indexing the crawled content, and finally processing the queries using the index. Devising efficient methods for these tasks is an important research topic. In this thesis, we introduce efficient strategies related to all three tasks involved in a search engine. Most of the proposed strategies are essentially applicable when a grouping of documents in its broadest sense (i.e., in terms of automatically obtained classes/clusters, or manually edited categories) is readily available or can be constructed in a feasible manner. Additionally, we introduce static index pruning strategies that are based on the query views. For the crawling task, we propose a rule-based focused crawling strategy that exploits interclass rules among the document classes in a topic taxonomy. These rules capture the probability of having hyperlinks between two classes. The rule-based crawler can tunnel toward the on-topic pages by following a path of off-topic pages, and thus yields a higher harvest rate for crawling on-topic pages. In the context of indexing and query processing tasks, we concentrate on conducting efficient search, again, using document groups; i.e., clusters or categories. In typical cluster-based retrieval (CBR), first, clusters that are most similar to a given free-text query are determined, and then documents from these clusters are selected to form the final ranked output. For efficient CBR, we first identify and evaluate some alternative query processing strategies. Next, we introduce a new index organization, the so-called cluster-skipping inverted index structure (CS-IIS). It is shown that typical-CBR with CS-IIS outperforms previous CBR strategies (with an ordinary index) for a number of datasets and under varying search parameters. In this thesis, an enhanced version of CS-IIS is further proposed, in which all information to compute query-cluster similarities during query evaluation is stored. We introduce an incremental-CBR strategy that operates on top of this latter index structure, and demonstrate its search efficiency for different scenarios. Finally, we exploit query views that are obtained from the search engine query logs to tailor more effective static pruning techniques. This is also related to the indexing task involved in a search engine. In particular, the query view approach is incorporated into a set of existing pruning strategies, as well as some new variants proposed by us. We show that query view based strategies significantly outperform the existing approaches in terms of the query output quality, for both disjunctive and conjunctive evaluation of queries. Altıngövde, İsmail Sengör. Ph.D.

    Essays on Building Growth From Ideas

    In my dissertation, I study the interplay between innovation and economic growth with an emphasis on the misallocation of ideas and talent. The first chapter focuses on the allocation of the innovations themselves, considering the optimal allocation of ideas across firms, and how it can be achieved via a market for ideas. This is done by first creating an operational measure of technological propinquity between firms and inventions. Using this empirical measure, new stylized facts are documented which suggest that ideas may indeed be initially misallocated across firms, but the market for patents can alleviate the initial misallocation. A quantitative model is built to perform some thought experiments which suggest that the economic growth rate can be increased by reducing the frictions in this market. The second chapter investigates why some organizations and societies are more prolific than others in coming up with radical, path-breaking innovations, and marshals evidence showing that the cause might be the openness to disruption in their culture. In order to investigate this hypothesis, it is posited that firms and societies that are more open to disruption would allow the young to rise faster in the hierarchy. This theory allows one to use the age of the top management in organizations as a proxy for the intangible concept. It is documented that many measures of radical innovations are positively correlated with openness to disruption as captured by manager age. The third chapter analyzes the link between the misallocation of talent and innovation, asking the question whether the talented people in the society are given the chance to create new innovations that enhance social welfare. This is investigated empirically using a novel approach that utilizes the power of surnames in capturing socioeconomic status persistence across time. The results show that the social allocation favors the wealthy over the talented in determining who become inventors. 
The rest of the chapter develops a quantitative model of allocation of talent that can replicate the observed correlation patterns, and shows that progressive bequest taxation can reduce this misallocation.

    Four essays in empirical economics

    This thesis studies four topics in empirical economics as summarized below. Chapter 1 documents the roles of heterogeneity, sorting, and complementarity in a framework where workers, managers, and firms interact to shape productivity. I show that the source of heterogeneity in the form of manager ability is an important driver of differences in firm productivity. I empirically identify complementarities between workers, managers, and firms using my estimation methodology. Counterfactual results show that reallocating workers by applying a positive assortative sorting rule can increase police department productivity by 10%. Chapter 2 documents that growth of Airbnb is likely to affect the local housing rental market by reducing the supply of properties. I combine data from Airbnb and Zoopla and examine how the price of individual houses evolves over time, as Airbnb penetrates the market in the area of Greater London. Leveraging the fact that properties with more than three bedrooms are less exposed to Airbnb, I use a difference-in-differences strategy by year and house type. I find that a 10-percent increase in the number of Airbnb properties in a ward increases real rents by 0.1 percent. Chapter 3: Religious groups sometimes resist modern welfare-enhancing interventions, adversely affecting the group's human capital levels. In this context, we study whether the two largest religious groups in India (Hindus and Muslims) resisted western education because they shared religious identity with the rulers deposed by the British colonisers. We find that Muslim literacy in an Indian district under the British is lower where the deposed ruler was a Muslim, while Hindu literacy is lower where the deposed ruler was a Hindu. Chapter 4: We digitize the financial disclosure of elite bureaucrats from India and combine this novel data with web-scraped career histories to study the private wealth accumulation of public servants. 
Employing a difference-in-differences event-study approach, we find that the annual growth rate is 10% higher for the value of assets and 4.4% higher for the number of assets after bureaucrats are transferred to an important post with the power to make influential policies. We document that the results are consistent with a rent-seeking explanation.