Inverted index compression based on term and document identifier reassignment

Author: Baykan İzzet Çağrı
Publication venue: Bilkent University
Publication date: 01/01/2008

Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2008. Thesis (Master's) -- Bilkent University, 2008. Includes bibliographical references (leaves 43-46).

Compression of inverted indexes has received great attention in recent years. An inverted index consists of lists of document identifiers, also referred to as posting lists, for each term. Compressing an inverted index reduces the size of the index, which also improves query performance by reducing disk access times. Recent studies have shown that reassigning document identifiers has a great effect on the compression of an inverted index. In this work, we propose a novel technique that reassigns both term and document identifiers of an inverted index by transforming the matrix representation of the index into a block-diagonal form, which improves the compression ratio dramatically. We adapted the row-net hypergraph-partitioning model for the transformation into block-diagonal form, which improves the compression ratio by as much as 50%. To the best of our knowledge, this method performs more effectively than previous inverted index compression techniques.

Bilkent University Institutional Repository
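
The effect of identifier reassignment described in the abstract above can be illustrated with a minimal sketch. It assumes posting lists are stored as d-gaps under a variable-byte code, which is a common scheme but not necessarily the one used in the thesis. The `mapping` below is a hand-picked, hypothetical reassignment; it does not implement the hypergraph-partitioning model the thesis proposes.

```python
def gaps(postings):
    """Convert a sorted posting list into d-gaps (the first ID is kept as-is)."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def vbyte_len(n):
    """Bytes needed to store a non-negative integer with 7 payload bits per byte."""
    size = 1
    while n >= 128:
        n >>= 7
        size += 1
    return size

def compressed_size(postings):
    """Total bytes for one posting list after gap encoding."""
    return sum(vbyte_len(g) for g in gaps(postings))

def reassign(postings, mapping):
    """Apply a document-ID mapping and restore sorted order."""
    return sorted(mapping[d] for d in postings)

# Toy posting list whose documents are scattered across the ID space.
scattered = [5, 3000, 70000, 900000]
# Hypothetical reassignment that gives these documents neighbouring IDs.
mapping = {5: 1, 3000: 2, 70000: 3, 900000: 4}
clustered = reassign(scattered, mapping)

print(compressed_size(scattered), compressed_size(clustered))
```

Smaller gaps need fewer bytes, so any reassignment that places documents sharing many terms next to each other shrinks the encoded lists.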

Bisimulation partitioning and partition maintenance: on very large directed acyclic graphs

Author: Hellings J.A.J.
Publication date: 31/08/2011

Pure OAI Repository

Iterative preprocessing of event logs

Author: van Heumen P.J.
Publication date: 01/01/2011

Repository TU/e

Pure OAI Repository

Incremental cluster-based retrieval using compressed cluster-skipping inverted files

Authors: Altingovde I.S.; Can F.; Demir E.; Ulusoy Ö.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2008

We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size. © 2008 ACM

Bilkent University Institutional Repository
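
The skipping idea in the abstract above can be sketched roughly as follows. This is a simplified illustration: the index layout (posting lists split into per-cluster blocks), the term-frequency scoring, and the fixed `best_clusters` set are all assumptions made for the example. The paper's incremental strategy recomputes the best clusters from centroid information as it moves from term to term, which this sketch does not do.

```python
from collections import defaultdict

# Hypothetical layout: term -> list of (cluster_id, [(doc_id, term_freq), ...]).
index = {
    "index": [(0, [(1, 2), (4, 1)]), (1, [(7, 3)]), (2, [(9, 1)])],
    "compression": [(0, [(1, 1)]), (2, [(9, 4), (11, 2)])],
}

def score(index, query_terms, best_clusters):
    """Accumulate scores only from posting blocks of the selected clusters."""
    scores = defaultdict(int)
    visited = 0  # number of postings actually read
    for term in query_terms:
        for cluster_id, block in index.get(term, []):
            if cluster_id not in best_clusters:
                continue  # skip the whole block for this cluster
            for doc_id, tf in block:
                scores[doc_id] += tf
                visited += 1
    return dict(scores), visited

scores, visited = score(index, ["index", "compression"], best_clusters={0})
print(scores, visited)
```

In this toy index the two posting lists hold seven postings in total, and restricting evaluation to cluster 0 reads only three of them.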

Bridging Dense and Sparse Maximum Inner Product Search

Authors: Bruch Sebastian; Ingber Amir; Liberty Edo; Nardini Franco Maria
Publication date: 16/09/2023

Maximum inner product search (MIPS) over dense and sparse vectors has progressed independently in a bifurcated literature for decades; the latter is better known as top-k retrieval in Information Retrieval. This duality exists because sparse and dense vectors serve different end goals. That is despite the fact that they are manifestations of the same mathematical problem. In this work, we ask if algorithms for dense vectors could be applied effectively to sparse vectors, particularly those that violate the assumptions underlying top-k retrieval methods. We study IVF-based retrieval where vectors are partitioned into clusters and only a fraction of clusters are searched during retrieval. We conduct a comprehensive analysis of dimensionality reduction for sparse vectors, and examine standard and spherical KMeans for partitioning. Our experiments demonstrate that IVF serves as an efficient solution for sparse MIPS. As byproducts, we identify two research opportunities and demonstrate their potential. First, we cast the IVF paradigm as a dynamic pruning technique and turn that insight into a novel organization of the inverted index for approximate MIPS for general sparse vectors. Second, we offer a unified regime for MIPS over vectors that have dense and sparse subspaces, and show its robustness to query distributions.

arXiv.org e-Print Archive
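
The IVF-based retrieval mentioned in the abstract above (partition the vectors into clusters, then search only a fraction of the clusters) can be sketched as below. This is a toy illustration under stated assumptions: the centroids are fixed by hand rather than learned with standard or spherical KMeans, vectors are assigned to the centroid with the largest inner product, and the names `build_ivf`, `search`, and `nprobe` are illustrative rather than taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def build_ivf(vectors, centroids):
    """Assign each vector to the centroid with the largest inner product."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(v, centroids[c]))
        lists[best].append(i)
    return lists

def search(query, vectors, centroids, lists, nprobe=1, k=1):
    """Probe the nprobe most promising clusters, then rank their members exactly."""
    order = sorted(range(len(centroids)),
                   key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    ranked = sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)
    return ranked[:k]

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
centroids = [[1.0, 0.0], [0.0, 1.0]]  # hand-picked, not learned
lists = build_ivf(vectors, centroids)

print(search([0.2, 1.0], vectors, centroids, lists, nprobe=1, k=1))
```

Raising `nprobe` trades speed for recall: with `nprobe` equal to the number of clusters the search becomes exhaustive and therefore exact.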

Improving the efficiency of search engines : strategies for focused crawling, searching, and index pruning

Author: Altıngövde İsmail Sengör
Publication venue: Bilkent University
Publication date: 01/01/2009

Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009. Thesis (Ph. D.) -- Bilkent University, 2009. Includes bibliographical references (leaves 157-169).

Search engines are the primary means of retrieval for text data that is abundantly available on the Web. A standard search engine should carry out three fundamental tasks, namely crawling the Web, indexing the crawled content, and finally processing the queries using the index. Devising efficient methods for these tasks is an important research topic. In this thesis, we introduce efficient strategies related to all three tasks involved in a search engine. Most of the proposed strategies are essentially applicable when a grouping of documents in its broadest sense (i.e., in terms of automatically obtained classes/clusters, or manually edited categories) is readily available or can be constructed in a feasible manner. Additionally, we introduce static index pruning strategies that are based on the query views. For the crawling task, we propose a rule-based focused crawling strategy that exploits interclass rules among the document classes in a topic taxonomy. These rules capture the probability of having hyperlinks between two classes. The rule-based crawler can tunnel toward the on-topic pages by following a path of off-topic pages, and thus yields a higher harvest rate for crawling on-topic pages. In the context of indexing and query processing tasks, we concentrate on conducting efficient search, again, using document groups; i.e., clusters or categories. In typical cluster-based retrieval (CBR), first, clusters that are most similar to a given free-text query are determined, and then documents from these clusters are selected to form the final ranked output. For efficient CBR, we first identify and evaluate some alternative query processing strategies. Next, we introduce a new index organization, the so-called cluster-skipping inverted index structure (CS-IIS). It is shown that typical-CBR with CS-IIS outperforms previous CBR strategies (with an ordinary index) for a number of datasets and under varying search parameters. In this thesis, an enhanced version of CS-IIS is further proposed, in which all information to compute query-cluster similarities during query evaluation is stored. We introduce an incremental-CBR strategy that operates on top of this latter index structure, and demonstrate its search efficiency for different scenarios. Finally, we exploit query views that are obtained from the search engine query logs to tailor more effective static pruning techniques. This is also related to the indexing task involved in a search engine. In particular, the query view approach is incorporated into a set of existing pruning strategies, as well as some new variants proposed by us. We show that query view based strategies significantly outperform the existing approaches in terms of the query output quality, for both disjunctive and conjunctive evaluation of queries.

Bilkent University Institutional Repository

Essays on Building Growth From Ideas

Author: Celik Murat Alp
Publication venue: ScholarlyCommons
Publication date: 01/01/2016

In my dissertation, I study the interplay between innovation and economic growth with an emphasis on the misallocation of ideas and talent. The first chapter focuses on the allocation of the innovations themselves, considering the optimal allocation of ideas across firms, and how it can be achieved via a market for ideas. This is done by first creating an operational measure of technological propinquity between firms and inventions. Using this empirical measure, new stylized facts are documented which suggest that ideas may indeed be initially misallocated across firms, but the market for patents can alleviate the initial misallocation. A quantitative model is built to perform some thought experiments which suggest that the economic growth rate can be increased by reducing the frictions in this market. The second chapter investigates why some organizations and societies are more prolific than others in coming up with radical, path-breaking innovations, and marshals evidence showing that the cause might be the openness to disruption in their culture. In order to investigate this hypothesis, it is posited that firms and societies that are more open to disruption would allow the young to rise faster in the hierarchy. This theory allows one to use the age of the top management in organizations as a proxy for the intangible concept. It is documented that many measures of radical innovations are positively correlated with openness to disruption as captured by manager age. The third chapter analyzes the link between the misallocation of talent and innovation, asking the question whether the talented people in the society are given the chance to create new innovations that enhance social welfare. This is investigated empirically using a novel approach that utilizes the power of surnames in capturing socioeconomic status persistence across time. The results show that the social allocation favors the wealthy over the talented in determining who become inventors. 
The rest of the chapter develops a quantitative model of allocation of talent that can replicate the observed correlation patterns, and shows that progressive bequest taxation can reduce this misallocation.

ScholarlyCommons@Penn

Four essays in empirical economics

Author: Chaudhary Amit

This thesis studies four topics in empirical economics as summarized below. Chapter 1 documents the roles of heterogeneity, sorting, and complementarity in a framework where workers, managers, and firms interact to shape productivity. I show that the source of heterogeneity in the form of manager ability is an important driver of differences in firm productivity. I empirically identify complementarities between workers, managers, and firms using my estimation methodology. Counterfactual results show that reallocating workers by applying a positive assortative sorting rule can increase police department productivity by 10%. Chapter 2 documents that growth of Airbnb is likely to affect the local housing rental market by reducing the supply of properties. I combine data from Airbnb and Zoopla and examine how the price of individual houses evolves over time, as Airbnb penetrates the market in the area of Greater London. Leveraging the fact that properties with more than three bedrooms are less exposed to Airbnb, I use a difference-in-differences strategy by year and house type. I find that a 10-percent increase in the number of Airbnb properties in a ward increases real rents by 0.1 percent. Chapter 3: Religious groups sometimes resist modern welfare-enhancing interventions, adversely affecting the group's human capital levels. In this context, we study whether the two largest religious groups in India (Hindus and Muslims) resisted western education because they shared religious identity with the rulers deposed by the British colonisers. We find that Muslim literacy in an Indian district under the British is lower where the deposed ruler was a Muslim, while Hindu literacy is lower where the deposed ruler was a Hindu. Chapter 4: We digitize the financial disclosure of elite bureaucrats from India and combine this novel data with web-scraped career histories to study the private wealth accumulation of public servants. 
Employing a difference-in-differences event study approach, we find that the annual growth rate is 10% higher for the value of assets and 4.4% higher for the number of assets after bureaucrats are transferred to an important post with the power to make influential policies. We document that the results are consistent with a rent-seeking explanation.

Warwick Research Archives Portal Repository

Automated analysis and validation of open chemical data

Author: Day Nicholas E
Publication venue: University of Cambridge
Publication date: 01/01/2009

Methods to automatically extract Open Data from the chemical literature, validate it, and use it to validate theory are examined. Chemical identifiers which assist the automatic location of chemical structures using commercial Web search engines are investigated. The IUPAC International Chemical Identifier (InChI) gives almost 100% recall and precision, though it is shown to be too long for present search engines. A combination of InChI and InChIKey, a shorter, fixed-length hash of the InChI string, is concluded to be the best current method of identifying structures. The proportion of published, Open Crystallographic Information Files (CIFs) that are valid with respect to the specification is shown to be improving, and is around 99% in 2007. The error rate in the conversion of valid CIFs to Chemical Markup Language (CML) is less than 0.2%. The machine generation of connection tables from CIFs requires many heuristics, and in some cases it is impossible to deduce the exact connection table. CrystalEye, a fully automated system for the reformulation of the fragmented crystallographic Web into a structured XML-based repository, is described. Published, Open CIFs can be located and aggregated programmatically with almost 100% recall. It is shown that, by converting CIF data to CML, software can be created to use the latest Web standards and technologies to enhance the ability of Web users to browse, find, keep updated, download and reuse the latest published crystallography. A workflow for the high-throughput calculation of solid-state geometry using a semi-empirical method is described. A wide range of organic and inorganic systems provided by CrystalEye are used to test both the data and the method. Several errors in the method are discovered, many of which can be attributed to the parameterization process. An Open NMR experiment to perform high-throughput prediction of 13C chemical shifts using a GIAO protocol is described. The data and analysis were provided on publicly available webpages to enable crowdsourcing, which assisted in discovering an error rate of 6.1% in the starting data. The protocol was refined during the work and shown to have an average unsigned error of 2.24 ppm for 13C nuclei of small, rigid molecules, comparable to the errors observed elsewhere for general structures using HOSE and Neural Network methods.

Apollo (Cambridge)

OpenGrey Repository