Inverted index compression based on term and document identifier reassignment
Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2008. Thesis (Master's) -- Bilkent University, 2008. Includes bibliographical references, leaves 43-46.
Compression of inverted indexes has received great attention in recent years. An
inverted index consists of lists of document identifiers, also referred to as posting
lists, for each term. Compressing an inverted index reduces the size of the index,
which also improves query performance by reducing disk access
times.
Recent studies have shown that reassigning document identifiers has a great
effect on the compression of an inverted index. In this work, we propose a novel
technique that reassigns both term and document identifiers of an inverted index
by transforming the matrix representation of the index into a block-diagonal
form, which improves the compression ratio dramatically. We adapt the row-net
hypergraph-partitioning model for the transformation into block-diagonal form,
which improves the compression ratio by as much as 50%. To the best of our
knowledge, this method performs more effectively than previous inverted index
compression techniques.
Baykan, İzzet Çağrı. M.S.
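The link between identifier reassignment and compression can be illustrated with a small sketch. It is not the thesis's hypergraph-partitioning method: it only assumes the common practice of storing a posting list as gaps between consecutive document identifiers and encoding each gap with a variable-byte code. The function names and the toy posting lists are hypothetical.

```python
def gaps(postings):
    """Turn a sorted posting list into d-gaps: first ID, then differences."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def vbyte_len(n):
    """Bytes needed by a variable-byte code (7 payload bits per byte)."""
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def compressed_size(postings):
    """Total encoded size in bytes of a posting list stored as d-gaps."""
    return sum(vbyte_len(g) for g in gaps(postings))


# Hypothetical posting list: four documents scattered across the ID space.
scattered = [3, 4000, 90000, 2000000]
# The same four documents after a hypothetical reassignment that gives
# them adjacent identifiers.
clustered = [10, 11, 12, 13]

print(compressed_size(scattered), compressed_size(clustered))
```

Smaller gaps need fewer bytes, which is why an ordering that places documents sharing many terms next to each other tends to shrink the index.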
Incremental cluster-based retrieval using compressed cluster-skipping inverted files
We propose a unique cluster-based retrieval (CBR) strategy using a new cluster-skipping inverted file for improving query processing efficiency. The new inverted file incorporates cluster membership and centroid information along with the usual document information into a single structure. In our incremental-CBR strategy, during query evaluation, both best(-matching) clusters and the best(-matching) documents of such clusters are computed together with a single posting-list access per query term. As we switch from term to term, the best clusters are recomputed and can dynamically change. During query-document matching, only relevant portions of the posting lists corresponding to the best clusters are considered and the rest are skipped. The proposed approach is essentially tailored for environments where inverted files are compressed, and provides substantial efficiency improvement while yielding comparable, or sometimes better, effectiveness figures. Our experiments with various collections show that the incremental-CBR strategy using a compressed cluster-skipping inverted file significantly improves CPU time efficiency, regardless of query length. The new compressed inverted file imposes an acceptable storage overhead in comparison to a typical inverted file. We also show that our approach scales well with the collection size. © 2008 ACM
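The skipping idea in this abstract can be sketched as follows. This is a simplified illustration, not the authors' structure: it assumes each posting list is grouped by cluster, that the set of best clusters is already known (the paper recomputes it incrementally from centroid information stored in the index), and that the score is a plain sum of term frequencies. The index layout and all names are hypothetical.

```python
from collections import defaultdict

# Hypothetical index layout:
# term -> list of (cluster_id, [(doc_id, term_frequency), ...])
index = {
    "inverted": [(0, [(1, 2), (4, 1)]), (2, [(9, 3)])],
    "index": [(0, [(1, 1)]), (1, [(6, 2)]), (2, [(9, 1)])],
}


def score(query_terms, best_clusters, index):
    """Accumulate scores only for documents in the best clusters."""
    scores = defaultdict(int)
    for term in query_terms:
        for cluster_id, postings in index.get(term, []):
            if cluster_id not in best_clusters:
                continue  # skip this whole portion of the posting list
            for doc_id, tf in postings:
                scores[doc_id] += tf
    return dict(scores)


print(score(["inverted", "index"], {0}, index))
```

In a compressed index the benefit comes from not decoding the skipped portions at all; the sketch only models which postings are visited.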
Bridging Dense and Sparse Maximum Inner Product Search
Maximum inner product search (MIPS) over dense and sparse vectors has
progressed independently in a bifurcated literature for decades; the latter is
better known as top-k retrieval in Information Retrieval. This duality exists
because sparse and dense vectors serve different end goals. That is despite the
fact that they are manifestations of the same mathematical problem. In this
work, we ask if algorithms for dense vectors could be applied effectively to
sparse vectors, particularly those that violate the assumptions underlying
top-k retrieval methods. We study IVF-based retrieval where vectors are
partitioned into clusters and only a fraction of clusters are searched during
retrieval. We conduct a comprehensive analysis of dimensionality reduction for
sparse vectors, and examine standard and spherical KMeans for partitioning. Our
experiments demonstrate that IVF serves as an efficient solution for sparse
MIPS. As byproducts, we identify two research opportunities and demonstrate
their potential. First, we cast the IVF paradigm as a dynamic pruning technique
and turn that insight into a novel organization of the inverted index for
approximate MIPS for general sparse vectors. Second, we offer a unified regime
for MIPS over vectors that have dense and sparse subspaces, and show its
robustness to query distributions.
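A minimal sketch of the IVF idea described above, under stated assumptions: the centroids are fixed, hypothetical values rather than the output of standard or spherical KMeans, vectors are assigned to the centroid with the largest inner product, and the vectors are small dense lists rather than real sparse data. The function names and the `nprobe` parameter are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_ivf(vectors, centroids):
    """Assign each vector index to the centroid with the largest inner product."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(v, centroids[c]))
        lists[best].append(i)
    return lists


def search(query, vectors, centroids, lists, nprobe=1, k=1):
    """Score only the vectors in the nprobe clusters closest to the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [i for c in ranked[:nprobe] for i in lists[c]]
    return sorted(candidates,
                  key=lambda i: dot(query, vectors[i]), reverse=True)[:k]


# Hypothetical toy data: four vectors and two unit-norm centroids.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
lists = build_ivf(vectors, centroids)

print(search([0.2, 1.0], vectors, centroids, lists, nprobe=1, k=1))
```

Raising `nprobe` trades speed for recall: with `nprobe` equal to the number of clusters the search is exhaustive and exact.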
Improving the efficiency of search engines : strategies for focused crawling, searching, and index pruning
Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009. Thesis (Ph.D.) -- Bilkent University, 2009. Includes bibliographical references, leaves 157-169.
Search engines are the primary means of retrieval for text data that is abundantly
available on the Web. A standard search engine should carry out three
fundamental tasks, namely crawling the Web, indexing the crawled content, and
finally processing the queries using the index. Devising efficient methods for these
tasks is an important research topic. In this thesis, we introduce efficient strategies
related to all three tasks involved in a search engine. Most of the proposed
strategies are essentially applicable when a grouping of documents in its broadest
sense (i.e., in terms of automatically obtained classes/clusters, or manually
edited categories) is readily available or can be constructed in a feasible manner.
Additionally, we also introduce static index pruning strategies that are based on
the query views.
For the crawling task, we propose a rule-based focused crawling strategy that
exploits interclass rules among the document classes in a topic taxonomy. These
rules capture the probability of having hyperlinks between two classes. The rule-based
crawler can tunnel toward the on-topic pages by following a path of off-topic
pages, and thus yields a higher harvest rate for crawling on-topic pages.
In the context of indexing and query processing tasks, we concentrate on conducting
efficient search, again, using document groups; i.e., clusters or categories.
In typical cluster-based retrieval (CBR), first, clusters that are most similar to a
given free-text query are determined, and then documents from these clusters are
selected to form the final ranked output. For efficient CBR, we first identify and
evaluate some alternative query processing strategies. Next, we introduce a new
index organization, so-called cluster-skipping inverted index structure (CS-IIS).
It is shown that typical-CBR with CS-IIS outperforms previous CBR strategies
(with an ordinary index) for a number of datasets and under varying search parameters.
In this thesis, an enhanced version of CS-IIS is further proposed, in
which all information to compute query-cluster similarities during query evaluation
is stored. We introduce an incremental-CBR strategy that operates on top
of this latter index structure, and demonstrate its search efficiency for different
scenarios.
Finally, we exploit query views that are obtained from the search engine query
logs to tailor more effective static pruning techniques. This is also related to the
indexing task involved in a search engine. In particular, the query view approach
is incorporated into a set of existing pruning strategies, as well as some new
variants proposed by us. We show that query-view-based strategies significantly
outperform the existing approaches in terms of the query output quality, for both
disjunctive and conjunctive evaluation of queries.
Altıngövde, İsmail Sengör. Ph.D.
Essays on Building Growth From Ideas
In my dissertation, I study the interplay between innovation and economic growth with an emphasis on the misallocation of ideas and talent.
The first chapter focuses on the allocation of the innovations themselves, considering the optimal allocation of ideas across firms, and how it can be achieved via a market for ideas. This is done by first creating an operational measure of technological propinquity between firms and inventions. Using this empirical measure, new stylized facts are documented which suggest that ideas may indeed be initially misallocated across firms, but the market for patents can alleviate the initial misallocation. A quantitative model is built to perform some thought experiments which suggest that the economic growth rate can be increased by reducing the frictions in this market.
The second chapter investigates why some organizations and societies are more prolific than others in coming up with radical, path-breaking innovations, and marshals evidence showing that the cause might be the openness to disruption in their culture. In order to investigate this hypothesis, it is posited that firms and societies that are more open to disruption would allow the young to rise faster in the hierarchy. This theory allows one to use the age of the top management in organizations as a proxy for the intangible concept. It is documented that many measures of radical innovations are positively correlated with openness to disruption as captured by manager age.
The third chapter analyzes the link between the misallocation of talent and innovation, asking the question whether the talented people in the society are given the chance to create new innovations that enhance social welfare. This is investigated empirically using a novel approach that utilizes the power of surnames in capturing socioeconomic status persistence across time. The results show that the social allocation favors the wealthy over the talented in determining who become inventors. The rest of the chapter develops a quantitative model of allocation of talent that can replicate the observed correlation patterns, and shows that progressive bequest taxation can reduce this misallocation
Four essays in empirical economics
This thesis studies four topics in empirical economics as summarized below.
Chapter 1 documents the roles of heterogeneity, sorting, and complementarity in a framework where workers, managers, and firms interact to shape productivity. I show that the source of heterogeneity in the form of manager ability is an important driver of differences in firm productivity. I empirically identify complementarities between workers, managers, and firms using my estimation methodology. Counterfactual results show that reallocating workers by applying a positive assortative sorting rule can increase police department productivity by 10%.
Chapter 2 documents that the growth of Airbnb is likely to affect the local housing rental market by reducing the supply of properties. I combine data from Airbnb and Zoopla and examine how the price of individual houses evolves over time as Airbnb penetrates the market in the area of Greater London. Leveraging the fact that properties with more than three bedrooms are less exposed to Airbnb, I use a difference-in-differences strategy by year and house type. I find that a 10-percent increase in the number of Airbnb properties in a ward increases real rents by 0.1 percent.
Chapter 3: Religious groups sometimes resist modern welfare-enhancing interventions,
adversely affecting the group's human capital levels. In this context, we study whether the two largest religious groups in India (Hindus and Muslims) resisted western education because they shared religious identity with the rulers deposed by the British colonisers. We find that Muslim literacy in an Indian district under the British is lower where the deposed ruler was a Muslim, while Hindu literacy is lower where the deposed ruler was a Hindu.
Chapter 4: We digitize the financial disclosures of elite bureaucrats from India and combine this novel data with web-scraped career histories to study the private wealth accumulation of public servants. Employing a difference-in-differences event-study approach, we find that the annual growth rate is 10% higher for the value of assets and 4.4% higher for the number of assets after bureaucrats are transferred to an important post with the power to make influential policies. We document that the results are consistent with a rent-seeking explanation.
Automated analysis and validation of open chemical data
Methods to automatically extract Open Data from the chemical literature,
validate it, and use it to validate theory are examined.
Chemical identifiers which assist the automatic location of chemical structures
using commercial Web search engines are investigated. The IUPAC
International Chemical Identifier (InChI) gives almost 100% recall and precision,
though it is shown to be too long for present search engines. A combination
of InChI and InChIKey, a shorter, fixed-length hash of the InChI
string, is concluded to be the best current method of identifying structures.
The proportion of published, Open Crystallographic Information Files
(CIFs) that are valid with respect to the specification is shown to be improving,
and is around 99% in 2007. The error rate in the conversion of valid
CIFs to Chemical Markup Language (CML) is less than 0.2%. The machine
generation of connection tables from CIFs requires many heuristics, and in
some cases it is impossible to deduce the exact connection table.
CrystalEye, a fully-automated system for the reformulation of the fragmented
crystallographic Web into a structured XML-based repository is described.
Published, Open CIFs can be located and aggregated programmatically
with almost 100% recall. It is shown that, by converting CIF data
to CML, software can be created to use the latest Web standards and technologies
to enhance the ability of Web users to browse, find, keep updated,
download and reuse the latest published crystallography.
A workflow for the high-throughput calculation of solid-state geometry
using a semi-empirical method is described. A wide range of organic and
inorganic systems provided by CrystalEye are used to test both the data and
the method. Several errors in the method are discovered, many of which can
be attributed to the parameterization process.
An Open NMR experiment to perform high-throughput prediction of 13C
chemical shifts using a GIAO protocol is described. The data and analysis
were provided on publicly-available webpages to enable crowdsourcing, which
assisted in discovering an error rate of 6.1% in the starting data. The protocol
was refined during the work and shown to have an average unsigned error
of 2.24 ppm for 13C nuclei of small, rigid molecules, comparable to the errors
observed elsewhere for general structures using HOSE and Neural Network
methods.
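The average unsigned error quoted above is the mean of the absolute differences between predicted and observed shifts. The sketch below shows that arithmetic only; the shift values are hypothetical and are not taken from the thesis data.

```python
def mean_unsigned_error(predicted, observed):
    """Mean of |predicted - observed| over paired values, in the input units."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical 13C chemical shifts in ppm for three nuclei.
predicted = [130.0, 25.5, 77.0]
observed = [128.0, 27.0, 77.5]

print(mean_unsigned_error(predicted, observed))
```

Because the errors are taken as absolute values, over- and under-predictions do not cancel out.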