TopSig: Topology Preserving Document Signatures
Performance comparisons between File Signatures and Inverted Files for text
retrieval have previously shown several significant shortcomings of file
signatures relative to inverted files. The inverted file approach underpins
most state-of-the-art search engine algorithms, such as Language and
Probabilistic models. It has been widely accepted that traditional file
signatures are inferior alternatives to inverted files. This paper describes
TopSig, a new approach to the construction of file signatures. Many advances in
semantic hashing and dimensionality reduction have been made in recent times,
but these have not so far been linked to general-purpose, signature-file-based
search engines. This paper introduces a different signature file approach that
builds upon and extends these recent advances. We are able to demonstrate
significant improvements in the performance of signature file based indexing
and retrieval, performance that is comparable to that of state of the art
inverted file based systems, including Language models and BM25. These findings
suggest that file signatures offer a viable alternative to inverted files in
suitable settings. From a theoretical perspective, they position the file
signature model in the class of Vector Space retrieval models.
Comment: 12 pages, 8 figures, CIKM 201
INFORMATION RETRIEVAL USING LATENT SEMANTIC INDEXING
Our capabilities for collecting and storing data of all kinds are greater than ever. On the other hand, analyzing, summarizing and extracting information from this data is harder than ever. This is why there is a growing need for fast and efficient algorithms for information retrieval. In this paper we present some mathematical models based on linear algebra that are used to extract the documents relevant to a given subject from a large set of text documents. This is a typical problem for a search engine on the World Wide Web. We use the vector space model, which is based on literal matching of terms in the documents and the queries. The vector space model is implemented by creating the term-document matrix. Literal matching of terms does not necessarily retrieve all relevant documents. Synonymy (multiple words having the same meaning) and polysemy (words having multiple meanings) are two major obstacles to efficient information retrieval. Latent Semantic Indexing represents documents by approximations and tends to cluster documents on similar topics even if their term profiles are somewhat different. This approximate representation is accomplished using a low-rank singular value decomposition (SVD) approximation of the term-document matrix. In this paper we compare the precision of information retrieval for different ranks of the SVD representation of the term-document matrix.
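The low-rank SVD idea in the abstract above can be illustrated with a minimal sketch. It assumes NumPy and a toy five-term, three-document matrix; the vocabulary, the function names `cosine` and `lsi_project`, and the choice of rank k = 2 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# Hypothetical data: doc 0 = {car, engine}, doc 1 = {automobile, engine},
# doc 2 = {flower, petal}.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def lsi_project(term_doc, k):
    """Return the top-k left singular vectors and the documents' k-dim coordinates."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    u_k = u[:, :k]                 # basis of the k-dimensional latent space
    doc_coords = u_k.T @ term_doc  # shape (k, n_docs); equals S_k V_k^T
    return u_k, doc_coords


# Query containing only the term "car".
query = np.array([1, 0, 0, 0, 0], dtype=float)

# Literal matching: cosine between the raw query and raw document columns.
literal_scores = [cosine(query, A[:, j]) for j in range(A.shape[1])]

# Rank-2 latent space: project query and documents with the same basis.
u_k, doc_coords = lsi_project(A, k=2)
query_coords = u_k.T @ query
lsi_scores = [cosine(query_coords, doc_coords[:, j]) for j in range(A.shape[1])]
```

In this toy case literal matching gives document 1 a score of zero for the query "car", because it uses "automobile" instead. The rank-2 projection places it next to document 0, since the two documents share the term "engine", while the unrelated document 2 stays orthogonal to the query.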
Random Indexing K-tree
Random Indexing (RI) K-tree is the combination of two algorithms for
clustering. Many large scale problems exist in document clustering. RI K-tree
scales well with large inputs due to its low complexity. It also exhibits
features that are useful for managing a changing collection. Furthermore, it
solves previous issues with sparse document vectors when using K-tree. The
algorithms and data structures are defined, explained and motivated. Specific
modifications to K-tree are made for use with RI. Experiments have been
executed to measure quality. The results indicate that RI K-tree improves
document cluster quality over the original K-tree algorithm.
Comment: 8 pages, ADCS 2009; Hyperref and cleveref LaTeX packages conflicted.
Removed clevere
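The abstract above does not spell out how Random Indexing builds its vectors. As a rough sketch of the general technique only, assuming NumPy: each term receives a sparse random vector of +1/-1 entries, and a document vector is the weighted sum of the vectors of its terms. The function names, the dimension of 8, and the two non-zero entries per term are hypothetical choices for illustration, not the paper's settings.

```python
import numpy as np


def make_index_vectors(n_terms, dim, nonzeros, seed=0):
    """Assign each term a sparse ternary vector with `nonzeros` entries of +1 or -1."""
    rng = np.random.default_rng(seed)
    index = np.zeros((n_terms, dim))
    for t in range(n_terms):
        positions = rng.choice(dim, size=nonzeros, replace=False)
        index[t, positions] = rng.choice([-1.0, 1.0], size=nonzeros)
    return index


def random_index_documents(doc_term, index):
    """Project documents (rows of doc_term) by summing their terms' index vectors."""
    return doc_term @ index


# Hypothetical document-term counts: 3 documents over 5 terms.
doc_term = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
], dtype=float)

index = make_index_vectors(n_terms=5, dim=8, nonzeros=2)
docs = random_index_documents(doc_term, index)
```

Each document is projected independently of the others, so a new document can be added without recomputing existing vectors. That is one plausible reading of why the approach suits a changing collection, and the projected vectors are low-dimensional and dense, unlike the original term vectors.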
The relationship between IR and multimedia databases
Modern extensible database systems support multimedia data through ADTs. However, because of the problems with multimedia query formulation, this support is not sufficient.

Multimedia querying requires an iterative search process involving many different representations of the objects in the database. The support that is needed is very similar to the processes in information retrieval.

Based on this observation, we develop the miRRor architecture for multimedia query processing. We design a layered framework based on information retrieval techniques, to provide a usable query interface to the multimedia database.

First, we introduce a concept layer to enable reasoning over low-level concepts in the database.

Second, we add an evidential reasoning layer as an intermediate between the user and the concept layer.

Third, we add the functionality to process the users' relevance feedback.

We then adapt the inference network model from text retrieval to an evidential reasoning model for multimedia query processing.

We conclude with an outline for implementation of miRRor on top of the Monet extensible database system.
Search Result Clustering via Randomized Partitioning of Query-Induced Subgraphs
In this paper, we present an approach to search result clustering based on
partitioning of the underlying link graph. We define the notion of a "query-induced
subgraph" and formulate search result clustering as the problem of
efficiently partitioning a given subgraph into topic-related clusters. We also
propose a novel algorithm for approximate partitioning of such a graph, which
yields cluster quality comparable to that obtained by deterministic
algorithms while requiring less computation time, making it suitable for
practical implementations. Finally, we present a practical clustering search
engine developed as part of this research and use it to measure the
real-world performance of the proposed concepts.
Comment: 16th Telecommunications Forum TELFOR 200
Finding Your Literature Match -- A Recommender System
The universe of potentially interesting, searchable literature is expanding
continuously. Besides the normal expansion, there is an additional influx of
literature because of interdisciplinary boundaries becoming more and more
diffuse. Hence, the need for accurate, efficient and intelligent search tools
is bigger than ever. Even with a sophisticated search engine, looking for
information can still result in overwhelming results. An overload of
information has the intrinsic danger of scaring visitors away, and any
organization, for-profit or not-for-profit, in the business of providing
scholarly information wants to capture and keep the attention of its target
audience. Publishers and search engine engineers alike will benefit from a
service that is able to provide visitors with recommendations that closely meet
their interests. Providing visitors with special deals, new options and
highlights may be interesting to a certain degree, but what makes more sense
(especially from a commercial point of view) than to let visitors do most of
the work by the mere action of making choices? Hiring psychics is not an
option, so a technological solution is needed to recommend items that a visitor
is likely to be looking for. In this presentation we will introduce such a
solution and argue that it is practically feasible to incorporate this approach
into a useful addition to any information retrieval system with enough usage.
Comment: Contribution to the proceedings of the colloquium Future Professional
Communication in Astronomy II, 13-14 April 2010, Cambridge, Massachusetts. 11
pages, 4 figures