78,037 research outputs found
Using Language Models for Information Retrieval
Because of the world wide web, information retrieval systems are now used by millions of untrained users all over the world. The search engines that perform the information retrieval tasks, often retrieve thousands of potentially interesting documents to a query. The documents should be ranked in decreasing order of relevance in order to be useful to the user. This book describes a mathematical model of information retrieval based on the use of statistical language models. The approach uses simple document-based unigram models to compute for each document the probability that it generates the query. This probability is used to rank the documents. The study makes the following research contributions. * The development of a model that integrates term weighting, relevance feedback and structured queries. * The development of a model that supports multiple representations of a request or information need by integrating a statistical translation model. * The development of a model that supports multiple representations of a document, for instance by allowing proximity searches or searches for terms from a particular record field (e.g. a search for terms from the title). * A mathematical interpretation of stop word removal and stemming. * A mathematical interpretation of operators for mandatory terms, wildcards and synonyms. * A practical comparison of a language model-based retrieval system with similar systems that are based on well-established models and term weighting algorithms in a controlled experiment. * The application of the model to cross-language information retrieval and adaptive information filtering, and the evaluation of two prototype systems in a controlled experiment. Experimental results on three standard tasks show that the language model-based algorithms work as well as, or better than, today's top-performing retrieval algorithms. The standard tasks investigated are ad-hoc retrieval (when there are no previously retrieved documents to guide the search), retrospective relevance weighting (find the optimum model for a given set of relevant documents), and ad-hoc retrieval using manually formulated Boolean queries. The application to cross-language retrieval and adaptive filtering shows the practical use of respectively structured queries, and relevance feedback
An Expressive Language and Efficient Execution System for Software Agents
Software agents can be used to automate many of the tedious, time-consuming
information processing tasks that humans currently have to complete manually.
However, to do so, agent plans must be capable of representing the myriad of
actions and control flows required to perform those tasks. In addition, since
these tasks can require integrating multiple sources of remote information ?
typically, a slow, I/O-bound process ? it is desirable to make execution as
efficient as possible. To address both of these needs, we present a flexible
software agent plan language and a highly parallel execution system that enable
the efficient execution of expressive agent plans. The plan language allows
complex tasks to be more easily expressed by providing a variety of operators
for flexibly processing the data as well as supporting subplans (for
modularity) and recursion (for indeterminate looping). The executor is based on
a streaming dataflow model of execution to maximize the amount of operator and
data parallelism possible at runtime. We have implemented both the language and
executor in a system called THESEUS. Our results from testing THESEUS show that
streaming dataflow execution can yield significant speedups over both
traditional serial (von Neumann) as well as non-streaming dataflow-style
execution that existing software and robot agent execution systems currently
support. In addition, we show how plans written in the language we present can
represent certain types of subtasks that cannot be accomplished using the
languages supported by network query engines. Finally, we demonstrate that the
increased expressivity of our plan language does not hamper performance;
specifically, we show how data can be integrated from multiple remote sources
just as efficiently using our architecture as is possible with a
state-of-the-art streaming-dataflow network query engine
Escaping the Trap of too Precise Topic Queries
At the very center of digital mathematics libraries lie controlled
vocabularies which qualify the {\it topic} of the documents. These topics are
used when submitting a document to a digital mathematics library and to perform
searches in a library. The latter are refined by the use of these topics as
they allow a precise classification of the mathematics area this document
addresses. However, there is a major risk that users employ too precise topics
to specify their queries: they may be employing a topic that is only "close-by"
but missing to match the right resource. We call this the {\it topic trap}.
Indeed, since 2009, this issue has appeared frequently on the i2geo.net
platform. Other mathematics portals experience the same phenomenon. An approach
to solve this issue is to introduce tolerance in the way queries are understood
by the user. In particular, the approach of including fuzzy matches but this
introduces noise which may prevent the user of understanding the function of
the search engine.
In this paper, we propose a way to escape the topic trap by employing the
navigation between related topics and the count of search results for each
topic. This supports the user in that search for close-by topics is a click
away from a previous search. This approach was realized with the i2geo search
engine and is described in detail where the relation of being {\it related} is
computed by employing textual analysis of the definitions of the concepts
fetched from the Wikipedia encyclopedia.Comment: 12 pages, Conference on Intelligent Computer Mathematics 2013 Bath,
U
mRUBiS: An Exemplar for Model-Based Architectural Self-Healing and Self-Optimization
Self-adaptive software systems are often structured into an adaptation engine
that manages an adaptable software by operating on a runtime model that
represents the architecture of the software (model-based architectural
self-adaptation). Despite the popularity of such approaches, existing exemplars
provide application programming interfaces but no runtime model to develop
adaptation engines. Consequently, there does not exist any exemplar that
supports developing, evaluating, and comparing model-based self-adaptation off
the shelf. Therefore, we present mRUBiS, an extensible exemplar for model-based
architectural self-healing and self-optimization. mRUBiS simulates the
adaptable software and therefore provides and maintains an architectural
runtime model of the software, which can be directly used by adaptation engines
to realize and perform self-adaptation. Particularly, mRUBiS supports injecting
issues into the model, which should be handled by self-adaptation, and
validating the model to assess the self-adaptation. Finally, mRUBiS allows
developers to explore variants of adaptation engines (e.g., event-driven
self-adaptation) and to evaluate the effectiveness, efficiency, and scalability
of the engines
Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
Apache Calcite is a foundational software framework that provides query
processing, optimization, and query language support to many popular
open-source data processing systems such as Apache Hive, Apache Storm, Apache
Flink, Druid, and MapD. Calcite's architecture consists of a modular and
extensible query optimizer with hundreds of built-in optimization rules, a
query processor capable of processing a variety of query languages, an adapter
architecture designed for extensibility, and support for heterogeneous data
models and stores (relational, semi-structured, streaming, and geospatial).
This flexible, embeddable, and extensible architecture is what makes Calcite an
attractive choice for adoption in big-data frameworks. It is an active project
that continues to introduce support for the new types of data sources, query
languages, and approaches to query processing and optimization.Comment: SIGMOD'1
cphVB: A System for Automated Runtime Optimization and Parallelization of Vectorized Applications
Modern processor architectures, in addition to having still more cores, also
require still more consideration to memory-layout in order to run at full
capacity. The usefulness of most languages is deprecating as their
abstractions, structures or objects are hard to map onto modern processor
architectures efficiently.
The work in this paper introduces a new abstract machine framework, cphVB,
that enables vector oriented high-level programming languages to map onto a
broad range of architectures efficiently. The idea is to close the gap between
high-level languages and hardware optimized low-level implementations. By
translating high-level vector operations into an intermediate vector bytecode,
cphVB enables specialized vector engines to efficiently execute the vector
operations.
The primary success parameters are to maintain a complete abstraction from
low-level details and to provide efficient code execution across different,
modern, processors. We evaluate the presented design through a setup that
targets multi-core CPU architectures. We evaluate the performance of the
implementation using Python implementations of well-known algorithms: a jacobi
solver, a kNN search, a shallow water simulation and a synthetic stencil
simulation. All demonstrate good performance
- …