Challenging Ubiquitous Inverted Files
Stand-alone ranking systems based on highly optimized inverted file structures are generally considered "the" solution for building search engines. Observing various developments in software and hardware, we argue however that IR research faces a complex engineering problem in the quest for more flexible yet efficient retrieval systems. We propose to base the development of retrieval systems on "the database approach": mapping high-level declarative specifications of the retrieval process into efficient query plans. We present the Mirror DBMS as a prototype implementation of a retrieval system based on this approach
Implementation of an efficient Fuzzy Logic based Information Retrieval System
This paper describes the implementation of an efficient Information
Retrieval (IR) system that computes the similarity between a dataset and a
query using fuzzy logic. The TREC dataset is used for this purpose. The dataset
is parsed to generate a keyword index, which is used for the similarity
comparison with the user query. Each query is assigned a score value based on
its fuzzy similarity with the index keywords. The relevant documents are
retrieved based on the score value. The performance and accuracy of the
proposed fuzzy similarity model are compared with those of the cosine
similarity model using Precision-Recall curves. The results prove the dominance
of the fuzzy-similarity-based IR system. Comment: arXiv admin note: substantial text overlap with
http://ntz-develop.blogspot.in/ ,
http://www.micsymposium.org/mics2012/submissions/mics2012_submission_8.pdf ,
http://www.slideshare.net/JeffreyStricklandPhD/predictive-modeling-and-analytics-selectchapters-41304405
by other author
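The comparison described in this abstract can be illustrated with a minimal sketch. The abstract does not state which fuzzy similarity measure the paper uses, so the fuzzy Jaccard measure below (sum of minima over sum of maxima of membership degrees) is an assumption, as are the helper names `term_weights`, `cosine_similarity`, and `fuzzy_similarity`.

```python
import math
from collections import Counter


def term_weights(text):
    """Term frequencies scaled to [0, 1], read here as fuzzy membership degrees.

    Scaling by the maximum frequency is an assumption made for this sketch.
    """
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    top = max(counts.values())
    return {term: count / top for term, count in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuzzy_similarity(a, b):
    """Fuzzy Jaccard similarity: sum of minima divided by sum of maxima."""
    terms = set(a) | set(b)
    intersection = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    union = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return intersection / union if union else 0.0


if __name__ == "__main__":
    query = term_weights("fuzzy retrieval")
    document = term_weights("fuzzy logic for information retrieval")
    print(cosine_similarity(query, document))
    print(fuzzy_similarity(query, document))
```

Both functions score a document against a query from the same weights, so either score could be used to rank documents before plotting a precision-recall curve.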
Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents
Important legacy paper documents are digitized and collected in online accessible archives. This enables the preservation, sharing, and, significantly, the searching of these documents. The text contents of these document images can be transcribed automatically using OCR systems and then stored in an information retrieval system. However, OCR systems make errors in character recognition which have previously been shown to impact on document retrieval behaviour. In particular, relevance feedback query-expansion methods, which are often effective for improving electronic text retrieval, are observed to be less reliable for retrieval of scanned document images. Our experimental examination of the effects of character recognition errors on an ad hoc OCR retrieval task demonstrates that, while baseline information retrieval can remain relatively unaffected by transcription errors, relevance feedback via query expansion becomes highly unstable. This paper examines the reason for this behaviour, and introduces novel modifications to standard relevance feedback methods. These methods are shown experimentally to improve the effectiveness of relevance feedback for errorful OCR transcriptions. The new methods combine similar recognised character strings based on term collection frequency and a string edit-distance measure. The techniques are domain independent and make no use of external resources such as dictionaries or training data
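The idea of combining similar recognised strings by collection frequency and edit distance can be sketched as follows. This is a minimal illustration, not the paper's method: the abstract gives no combination rule, so the greedy strategy, the `max_distance` threshold, and the names `edit_distance` and `merge_variants` are all assumptions.

```python
from collections import Counter


def edit_distance(s, t):
    """Levenshtein distance between two strings (row-by-row dynamic programme)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


def merge_variants(collection_freq, max_distance=1):
    """Map each term to a more frequent term within max_distance edits.

    Terms are visited from most to least frequent, so a rare string that is
    likely an OCR error is folded into its frequent neighbour.
    """
    heads = []      # terms kept as canonical forms
    canonical = {}
    ordered = sorted(collection_freq, key=lambda term: (-collection_freq[term], term))
    for term in ordered:
        target = next((h for h in heads if edit_distance(term, h) <= max_distance), None)
        if target is None:
            heads.append(term)
            target = term
        canonical[term] = target
    return canonical


if __name__ == "__main__":
    freq = Counter({"retrieval": 50, "document": 40, "retrieva1": 3, "retneval": 2})
    print(merge_variants(freq, max_distance=1))
```

A feedback step could then count expansion terms by their canonical form, so that variants of one word pool their evidence. The approach uses only statistics of the collection itself, in line with the abstract's remark that no dictionaries or training data are needed.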
Formal models, usability and related work in IR (editorial for special edition)
The Glasgow IR group has carried out both theoretical and empirical work, aimed at giving end users efficient and effective access to large collections of multimedia data
Entity Query Feature Expansion Using Knowledge Base Links
Recent advances in automatic entity linking and knowledge base
construction have resulted in entity annotations for document and
query collections. For example, annotations of entities from large
general purpose knowledge bases, such as Freebase and the Google
Knowledge Graph. Understanding how to leverage these entity
annotations of text to improve ad hoc document retrieval is an open
research area. Query expansion is a commonly used technique to
improve retrieval effectiveness. Most previous query expansion
approaches focus on text, mainly using unigram concepts. In this
paper, we propose a new technique, called entity query feature
expansion (EQFE) which enriches the query with features from
entities and their links to knowledge bases, including structured
attributes and text. We experiment using both explicit query entity
annotations and latent entities. We evaluate our technique on TREC
text collections automatically annotated with knowledge base entity
links, including the Google Freebase Annotations (FACC1) data.
We find that entity-based feature expansion results in significant
improvements in retrieval effectiveness over state-of-the-art text
expansion approaches
The State-of-the-arts in Focused Search
The continuous influx of various text data on the Web requires search engines to improve their retrieval abilities for more specific information. The need for results relevant to a user's topic of interest has gone beyond search for domain- or type-specific documents to more focused results (e.g. document fragments or answers to a query). The introduction of XML provides a format standard for data representation, storage, and exchange. It helps focused search to be carried out at different granularities of a structured document with XML markup. This report reviews the state of the art in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It concludes by highlighting open problems
Content And Multimedia Database Management Systems
A database management system is a general-purpose software system that facilitates the processes of defining, constructing, and manipulating databases for various applications. The main characteristic of the "database approach" is that it increases the value of data by its emphasis on data independence. DBMSs, and in particular those based on the relational data model, have been very successful at the management of administrative data in the business domain. This thesis has investigated data management in multimedia digital libraries, and its implications on the design of database management systems. The main problem of multimedia data management is providing access to the stored objects. The content structure of administrative data is easily represented in alphanumeric values. Thus, database technology has primarily focused on handling the objects' logical structure. In the case of multimedia data, representation of content is far from trivial though, and not supported by current database management systems
Aggregated search: a new information retrieval paradigm
Traditional search engines return ranked lists of search results. It is up to the user to scroll through this list, scan within different documents and assemble information that fulfills his/her information need. Aggregated search represents a new class of approaches where the information is not only retrieved but also assembled. This is the current evolution in Web search, where diverse content (images, videos, ...) and relational content (similar entities, features) are included in search results. In this survey, we propose a simple analysis framework for aggregated search and an overview of existing work. We start with related work in related domains such as federated search, natural language generation and question answering. Then we focus on more recent trends, namely cross vertical aggregated search and relational aggregated search, which are already present in current Web search
Probabilistic retrieval models - relationships, context-specific application, selection and implementation
Retrieval models are the core components of information retrieval systems, which guide the document
and query representations, as well as the document ranking schemes. TF-IDF, binary
independence retrieval (BIR) model and language modelling (LM) are three of the most influential
contemporary models due to their stability and performance. The BIR model and LM
have probabilistic theory as their basis, whereas TF-IDF is viewed as a heuristic model, whose
theoretical justification always fascinates researchers.
This thesis first investigates the parallel derivation of the BIR model, LM and the Poisson model,
with respect to event spaces, relevance assumptions and ranking rationales. It establishes a bridge between
the BIR model and LM, and derives TF-IDF from the probabilistic framework.
Then, the thesis presents the probabilistic logical modelling of the retrieval models. Various
ways to estimate and aggregate probabilities, and alternative implementations of non-probabilistic
operators, are demonstrated. Typical models have been implemented.
The next contribution concerns the usage of context-specific frequencies, i.e., the frequencies
counted based on assorted element types or within different text scopes. The hypothesis
is that they can help to rank the elements in structured document retrieval. The thesis applies
context-specific frequencies on term weighting schemes in these models, and the outcome is a
generalised retrieval model with regard to both element and document ranking.
The retrieval models behave differently on the same query set: for some queries, one model
performs better, for other queries, another model is superior. Therefore, one idea to improve the
overall performance of a retrieval system is to choose for each query the model that is likely
to perform the best. This thesis proposes and empirically explores the model selection method
according to the correlation of query feature and query performance, which contributes to the
methodology of dynamically choosing a model.
In summary, this thesis contributes a study of probabilistic models and their relationships,
the probabilistic logical modelling of retrieval models, the usage and effect of context-specific
frequencies in models, and the selection of retrieval models
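The three models named in this abstract can be compared on a toy collection with the sketch below. It uses common textbook formulations (raw TF times log IDF, BIR term weights without relevance information, and a query-likelihood language model with Jelinek-Mercer smoothing), which are assumptions and not the thesis's own derivations. The documents, the smoothing parameter `lam`, and all function names are hypothetical.

```python
import math
from collections import Counter

# Toy collection, invented for illustration only.
DOCS = {
    "d1": "probabilistic retrieval models rank documents",
    "d2": "language models for retrieval",
    "d3": "database systems store documents",
}
TOKENS = {doc_id: text.split() for doc_id, text in DOCS.items()}
N = len(TOKENS)                                                    # number of documents
DF = Counter(t for toks in TOKENS.values() for t in set(toks))    # document frequency
CF = Counter(t for toks in TOKENS.values() for t in toks)         # collection frequency
COLLECTION_LEN = sum(CF.values())


def tf_idf(query, doc_id):
    """Sum of term frequency times log inverse document frequency."""
    tf = Counter(TOKENS[doc_id])
    return sum(tf[t] * math.log(N / DF[t]) for t in query if DF[t])


def bir(query, doc_id):
    """BIR-style score: smoothed log-odds weights of query terms present in the document."""
    present = set(TOKENS[doc_id])
    return sum(math.log((N - DF[t] + 0.5) / (DF[t] + 0.5))
               for t in query if t in present)


def lm_jm(query, doc_id, lam=0.5):
    """Query log-likelihood under a document model mixed with the collection model."""
    tf = Counter(TOKENS[doc_id])
    doc_len = len(TOKENS[doc_id])
    score = 0.0
    for t in query:
        p = lam * tf[t] / doc_len + (1 - lam) * CF[t] / COLLECTION_LEN
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


def rank(query, scorer):
    """Document ids ordered by descending score under the given model."""
    return sorted(TOKENS, key=lambda doc_id: scorer(query, doc_id), reverse=True)


if __name__ == "__main__":
    for scorer in (tf_idf, bir, lm_jm):
        print(scorer.__name__, rank(["language"], scorer))
```

Because each scorer shares the signature `(query, doc_id)`, a per-query selection step of the kind the abstract describes could choose among them at run time. How to predict the best model from query features is not sketched here.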