134,314 research outputs found

Structured Text Retrieval Models

Author: C.L.A. Clarke
G. Navarro
R.A. Baeza-Yates
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 03/04/2008

Structured text retrieval models provide a formal definition or mathematical framework for querying semistructured textual databases. A textual database contains both content and structure. The content is the text itself, and the structure divides the database into separate textual parts and relates those textual parts by some criterion. Often, textual databases can be represented as marked up text, for instance as XML, where the XML elements define the structure on the text content. Retrieval models for textual databases should comprise three parts: 1) a model of the text, 2) a model of the structure, and 3) a query language [4]: The model of the text defines a tokenization into words or other semantic units, as well as stop words, stemming, synonyms, etc. The model of the structure defines parts of the text, typically a contiguous portion of the text called element, region, or segment, which is defined on top of the text model’s word tokens. The query language typically defines a number of operators on content and structure such as set operators and operators like “containing” and “contained-by” to model relations between content and structure, as well as relations between the structural elements themselves. Using such a query language, the (expert) user can for instance formulate requests like “I want a paragraph discussing formal models near to a table discussing the differences between databases and information retrieval”. Here, “formal models” and “differences between databases and information retrieval” should match the content that needs to be retrieved from the database, whereas “paragraph” and “table” refer to structural constraints on the units to retrieve. The features, structuring power, and the expressiveness of the query languages of several models for structured text retrieval are discussed below.
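The three parts named in the abstract (text model, structure model, query operators) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from any of the models the entry surveys: regions are represented as inclusive (start, end) token offsets, and the names `Region`, `match_phrase`, `containing`, and `contained_by` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: a region is an inclusive (start, end)
# pair of token offsets into the tokenized text.
Region = Tuple[int, int]


def match_phrase(tokens: List[str], phrase: List[str]) -> List[Region]:
    """Text model: return every region where the phrase occurs verbatim."""
    n = len(phrase)
    return [(i, i + n - 1)
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == phrase]


def containing(a: List[Region], b: List[Region]) -> List[Region]:
    """Regions of a that contain at least one region of b."""
    return [x for x in a
            if any(x[0] <= y[0] and y[1] <= x[1] for y in b)]


def contained_by(a: List[Region], b: List[Region]) -> List[Region]:
    """Regions of a that lie inside at least one region of b."""
    return [x for x in a
            if any(y[0] <= x[0] and x[1] <= y[1] for y in b)]


# Toy document: 12 tokens, one "paragraph" region and one "table" region.
tokens = "we compare formal models here and list the differences in a table".split()
paragraphs: List[Region] = [(0, 4)]
tables: List[Region] = [(5, 11)]

hits = match_phrase(tokens, ["formal", "models"])
paragraphs_about_models = containing(paragraphs, hits)
tables_about_models = containing(tables, hits)
```

In this toy setup, `containing(paragraphs, hits)` plays the role of "a paragraph discussing formal models"; a proximity operator such as "near to" would need an additional distance test on the offsets, which this sketch leaves out.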
HISTORICAL BACKGROUND The STAIRS system (Storage and Information Retrieval System), which was developed at IBM already in the late 1950’s, allowed querying both content and structure. Much like today’s On-line Public Access Catalogues, it wa

CiteSeerX

Crossref

Radboud Repository

University of Twente Research Information

Semantic Technologies for Manuscript Descriptions — Concepts and Visions

Author: Kummer Robert
Publication venue: Books on Demand (BoD)
Publication date: 01/01/2011

The contribution at hand relates recent developments in the area of the World Wide Web to codicological research. In the last number of years, an informational extension of the internet has been discussed and extensively researched: the Semantic Web. It has already been applied in many areas, including digital information processing of cultural heritage data. The Semantic Web facilitates the organisation and linking of data across websites, according to a given semantic structure. Software can then process this structural and semantic information to extract further knowledge. In the area of codicological research, many institutions are making efforts to improve the online availability of handwritten codices. If these resources could also employ Semantic Web techniques, considerable research potential could be unleashed. However, data acquisition from less structured data sources will be problematic. In particular, data stemming from unstructured sources needs to be made accessible to Semantic Web tools through information extraction techniques. In the area of museum research, the CIDOC Conceptual Reference Model (CRM) has been widely examined and is being adopted successfully. The CRM translates well to Semantic Web research, and its concentration on contextualization of objects could support approaches in codicological research. Further concepts for the creation and management of bibliographic coherences and structured vocabularies related to the CRM will be considered in this chapter. Finally, a user scenario showing all processing steps in their context will be elaborated on.

Kölner UniversitätsPublikationsServer

Visual exploration and retrieval of XML document collections with the generic system X2

Author: Felix Weigel
François Bry
H Meuss
Holger Meuss
Klaus U. Schulz
S Ceri
S Mizzaro
Simone Leonardi
T Catarci
T Schlieder
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/03/2005

This article reports on the XML retrieval system X2 which has been developed at the University of Munich over the last five years. In a typical session with X2, the user first browses a structural summary of the XML database in order to select interesting elements and keywords occurring in documents. Using this intermediate result, queries combining structure and textual references are composed semiautomatically. After query evaluation, the full set of answers is presented in a visual and structured way. X2 largely exploits the structure found in documents, queries and answers to enable new interactive visualization and exploration techniques that support mixed IR and database-oriented querying, thus bridging the gap between these three views on the data to be retrieved. Another salient characteristic of X2 which distinguishes it from other visual query systems for XML is that it supports various levels of detail in the presentation of answers, as well as techniques for dynamically reordering and grouping retrieved elements once the complete answer set has been computed.

Crossref

Open Access LMU

Kolmogorov Complexity in perspective. Part II: Classification, Information Processing and Duality

Author: Ferbus-Zanda Marie
Publication date: 01/01/2010

We survey diverse approaches to the notion of information: from Shannon entropy to Kolmogorov complexity. Two of the main applications of Kolmogorov complexity are presented: randomness and classification. The survey is divided into two parts published in the same volume. Part II is dedicated to the relation between logic and information systems, within the scope of Kolmogorov algorithmic information theory. We present a recent application of Kolmogorov complexity: classification using compression, an idea with provocative implementation by authors such as Bennett, Vitanyi and Cilibrasi. This stresses how Kolmogorov complexity, besides being a foundation for randomness, is also related to classification. Another approach to classification is also considered: the so-called "Google classification". It uses another original and attractive idea which is connected to classification using compression and to Kolmogorov complexity from a conceptual point of view. We present and unify these different approaches to classification in terms of Bottom-Up versus Top-Down operational modes, of which we point out the fundamental principles and the underlying duality. We look at the way these two dual modes are used in different approaches to information systems, particularly the relational model for databases introduced by Codd in the 70's. This allows us to point out diverse forms of a fundamental duality. These operational modes are also reinterpreted in the context of the comprehension schema of axiomatic set theory ZF. This leads us to develop how Kolmogorov's complexity is linked to intensionality, abstraction, classification and information systems. Comment: 43 pages
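The idea of classification using compression can be sketched in a few lines of Python. The sketch below uses the normalized compression distance, a formulation commonly associated with Cilibrasi and Vitanyi; that the survey presents exactly this formula is an assumption, as are the choice of `zlib` as the compressor and the helper names `compressed_size` and `ncd`.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, used as a computable
    stand-in for Kolmogorov complexity (an assumption of this sketch)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their information;
    values near 1 suggest they share little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy inputs: a repetitive English phrase and an unrelated byte pattern.
text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = bytes(range(256)) * 4

same = ncd(text_a, text_a)       # a string compared with itself
different = ncd(text_a, text_b)  # two unrelated strings
```

A simple classifier built on this would assign an unlabeled item to the class of the labeled item with the smallest distance. Real compressors only approximate Kolmogorov complexity, so the distances are heuristic.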

arXiv.org e-Print Archive

Hal-Diderot

Infectious Disease Ontology

Technological developments have resulted in tremendous increases in the volume and diversity of the data and information that must be processed in the course of biomedical and clinical research and practice. Researchers are at the same time under ever greater pressure to share data and to take steps to ensure that data resources are interoperable. The use of ontologies to annotate data has proven successful in supporting these goals and in providing new possibilities for the automated processing of data and information. In this chapter, we describe different types of vocabulary resources and emphasize those features of formal ontologies that make them most useful for computational applications. We describe current uses of ontologies and discuss future goals for ontology-based computing, focusing on its use in the field of infectious diseases. We review the largest and most widely used vocabulary resources relevant to the study of infectious diseases and conclude with a description of the Infectious Disease Ontology (IDO) suite of interoperable ontology modules that together cover the entire infectious disease domain.

PhilPapers

CiteSeerX

Crossref

Distributed Information Retrieval using Keyword Auctions

Author: Hiemstra D.
Publication venue: Centre for Telematics and Information Technology, University of Twente
Publication date: 01/01/2008

This report motivates the need for large-scale distributed approaches to information retrieval, and proposes solutions based on keyword auctions.

CiteSeerX

Radboud Repository

University of Twente Research Information

Embedding machine-readable proteins interactions data in scientific articles for easy access and retrieval

Author: Alberto Termanini
Claudio Franceschi
Paolo Tieri
Piero Fariselli
Publication date: 29/09/2008

Extraction of protein-protein interaction data from scientific literature remains a hard, time- and resource-consuming task. This task would be greatly simplified by embedding in the source, i.e. research articles, a standardized, synthetic, machine-readable codification for describing protein-protein interaction data, to make the identification and retrieval of such valuable information easier, faster, and more reliable than it is now.
We briefly discuss how this information can be easily encoded and embedded in research papers with the collaboration of authors and scientific publishers, and propose an online demonstration tool that shows how authors can easily and quickly convert such valuable biological data into an embeddable, accessible, computer-readable codification.

Nature Precedings