15 research outputs found

    Sound ranking algorithms for XML search

    Ranking algorithms for XML should reflect the actual combined content and structure constraints of queries, while at the same time producing equal rankings for queries that are semantically equal. Ranking algorithms that produce different rankings for queries that are semantically equal are easily detected by tests on large databases: We call such algorithms not sound. We report the behavior of different approaches to ranking content-and-structure queries on pairs of queries for which we expect equal ranking results from the query semantics. We show that most of these approaches are not sound. Of the remaining approaches, only 3 adhere to the W3C XQuery Full-Text standard

    Structured Text Retrieval Models

    Structured text retrieval models provide a formal definition or mathematical framework for querying semistructured textual databases. A textual database contains both content and structure. The content is the text itself, and the structure divides the database into separate textual parts and relates those textual parts by some criterion. Often, textual databases can be represented as marked up text, for instance as XML, where the XML elements define the structure on the text content. Retrieval models for textual databases should comprise three parts: 1) a model of the text, 2) a model of the structure, and 3) a query language [4]: The model of the text defines a tokenization into words or other semantic units, as well as stop words, stemming, synonyms, etc. The model of the structure defines parts of the text, typically a contiguous portion of the text called element, region, or segment, which is defined on top of the text model’s word tokens. The query language typically defines a number of operators on content and structure such as set operators and operators like “containing” and “contained-by” to model relations between content and structure, as well as relations between the structural elements themselves. Using such a query language, the (expert) user can for instance formulate requests like “I want a paragraph discussing formal models near to a table discussing the differences between databases and information retrieval”. Here, “formal models” and “differences between databases and information retrieval” should match the content that needs to be retrieved from the database, whereas “paragraph” and “table” refer to structural constraints on the units to retrieve. The features, structuring power, and the expressiveness of the query languages of several models for structured text retrieval are discussed below.
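    As a rough illustration of the “containing” and “contained-by” operators described above, here is a minimal Python sketch. It assumes regions are inclusive spans of word-token positions; the names `Region`, `containing`, `contained_by`, and `phrase_matches`, as well as the sample text, are hypothetical and do not come from any of the models the entry discusses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A contiguous span of word tokens (assumed representation)."""
    name: str   # structural label, e.g. "paragraph", or "match" for content
    start: int  # position of the first word token
    end: int    # position of the last word token (inclusive)


def contains(outer: Region, inner: Region) -> bool:
    """True if inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def containing(regions, targets):
    """Regions that contain at least one of the target regions."""
    return [r for r in regions if any(contains(r, t) for t in targets)]


def contained_by(regions, containers):
    """Regions that lie inside at least one of the container regions."""
    return [r for r in regions if any(contains(c, r) for c in containers)]


def phrase_matches(tokens, phrase):
    """Content regions: every position where the phrase occurs verbatim."""
    words = phrase.split()
    n = len(words)
    return [
        Region("match", i, i + n - 1)
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == words
    ]


# Hypothetical eight-token text split into two paragraph regions.
tokens = "formal models of text differ from database models".split()
paragraphs = [Region("paragraph", 0, 3), Region("paragraph", 4, 7)]

# "paragraph containing 'formal models'"
hits = phrase_matches(tokens, "formal models")
result = containing(paragraphs, hits)
```

    Here `result` holds only the first paragraph, because the phrase match spans tokens 0 to 1. The nested loops are quadratic in the number of regions; a practical system would more likely keep regions sorted by position.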
HISTORICAL BACKGROUND The STAIRS system (Storage and Information Retrieval System), which was developed at IBM already in the late 1950’s allowed querying both content and structure. Much like today’s On-line Public Access Catalogues, it wa

    Configurable indexing and ranking for XML information retrieval

    Indexing and ranking are two key factors for efficient and effective XML information retrieval. Inappropriate indexing may result in false negatives and false positives, and improper ranking may lead to low precision. In this paper, we propose a configurable XML information retrieval system, in which users can configure appropriate index types for XML tags and text contents. Based on users’ index configurations, the system transforms XML structures into a compact tree representation, Ctree, and indexes XML text contents. To support XML ranking, we propose the concepts of “weighted term frequency” and “inverted element frequency,” where the weight of a term depends on its frequency and location within an XML element as well as its popularity among similar elements in an XML dataset. We evaluate the effectiveness of our system through extensive experiments on the INEX 03 dataset and 30 content and structure (CAS) topics. The experimental results reveal that our system has significantly high precision at low recall regions and achieves the highest average precision (0.3309) as compared with 38 official INEX 03 submissions using the strict evaluation metric

    ESTER: efficient search on text, entities, and relations

    We present ESTER, a modular and highly efficient system for combined full-text and ontology search. ESTER builds on a query engine that supports two basic operations: prefix search and join. Both of these can be implemented very efficiently with a compact index, yet in combination provide powerful querying capabilities. We show how ESTER can answer basic SPARQL graph-pattern queries on the ontology by reducing them to a small number of these two basic operations. ESTER further supports a natural blend of such semantic queries with ordinary full-text queries. Moreover, the prefix search operation allows for a fully interactive and proactive user interface, which after every keystroke suggests to the user possible semantic interpretations of his or her query, and speculatively executes the most likely of these interpretations. As a proof of concept, we applied ESTER to the English Wikipedia, which contains about 3 million documents, combined with the recent YAGO ontology, which contains about 2.5 million facts. For a variety of complex queries, ESTER achieves worst-case query processing times of a fraction of a second, on a single machine, with an index size of about 4 GB

    A scoring method of XML fragments considering query-oriented statistics

    Fuzzy Range Query in XML

    This writing project presents a new approach to implementing a fuzzy range query solution for retrieving Extensible Markup Language (XML) data. Since its introduction, XML has become a web standard for describing data on the Internet. The need to perform range queries against XML data is growing day by day, and many search service providers are eager to improve their range query solutions for XML data. The project studies and analyzes the limitations of current range query solutions. It also proposes a new solution that uses fuzzy semantic analysis to quantify XML data so that it can be represented within a range. This is accomplished by applying a fuzzy logic algorithm to classify and aggregate XML data based on semantic closeness. An intuitive web interface is also introduced to help the user enter fuzzy search criteria. Instead of specifying crisp values as in current solutions, the user can simply drag and drop to indicate fuzzy values, which makes the interface more user-friendly and better suited to fuzzy range queries

    Predicate-based indexing for desktop search

    Google and other products have revolutionized the way we search for information. There are, however, still a number of research challenges. One challenge that arises specifically in desktop search is to exploit the structure and semantics of documents, as defined by the application program that generated the data (e.g., Word, Excel, or Outlook). The current generation of search products does not understand these structures and therefore often returns wrong results. This paper shows how today's search technology can be extended in order to take the specific semantics of certain structures into account. The key idea is to extend inverted file index structures with predicates which encode the circumstances under which certain keywords of a document become visible to a user. This paper provides a framework for expressing the semantics of structures in documents, together with algorithms for constructing enhanced, predicate-based indexes. Furthermore, this paper shows how keyword and phrase queries can be processed efficiently on such enhanced indexes. It is shown that the proposed approach has superior retrieval performance with regard to both recall and precision and has tolerable space and query running time overhead

    Information retrieval from civil engineering repositories: the importance of context and granularity

    Information about the design and construction of buildings can be structured in a particular way. This is especially true given the increasing complexity of building product models and the emergence of building information models with project documents linked to them. In addition, engineers usually have distinct information needs. Research shows that engineers working with building information models place particular importance on understanding retrieved content before using or applying it, and that exploration of context is essential for this understanding. Both these factors (the nature of engineering content and the information needs of engineers) make general information retrieval techniques for computing relevance and visualizing search results less applicable in civil engineering information retrieval systems. This paper argues that granularity is a fundamental concept that needs to be considered when measuring relevance and visualizing search results in information retrieval systems for repositories of building design and construction content. It is hypothesized that the design of systems with careful regard for granularity would improve engineers’ relevance judgment behavior. To test this hypothesis, a prototype system, called CoMem-XML, was developed and evaluated in terms of the time needed for users to find relevant information, the accuracy of their relevance judgment, and their subjective satisfaction with the prototype. A user study was conducted in which test subjects were asked to complete tasks by using various forms of the prototype, to complete a satisfaction questionnaire, and to be interviewed. The findings show that users perform better and are more satisfied when the search result interface of the CoMem-XML system presents only relevant information in context. On the other hand, interfaces that present the retrieved information out of context (i.e., without highlighting its position in the parts hierarchy) are less effective for participants to judge relevance

    Semantic information retrieval (Semanttinen tiedonhaku)

    Traditional information retrieval methods do not always capture the meaning of texts well enough. Semantic information retrieval, the subject of this thesis, aims to get a better grasp of the meanings that words express. It does so by exploiting semantic metadata produced in the text itself or in its presentation and storage structures. The thesis takes a closer look at semantic retrieval methods belonging to two groups: systems that exploit the properties of XML text documents, and systems based on the possibilities of the Semantic Web. In addition, the thesis outlines an ideal semantic information retrieval system, against which the examined systems are compared. The comparison finds that almost all features of the ideal retrieval system are implemented in some form, although never all at once in any single system. The XML-based SphereSearch search engine turns out to be the most versatile in its semantic search features; for example, it allows concept searches and can assemble answer elements into units that cross document boundaries. On the other hand, all of the examined systems follow the basic principle of semantic information retrieval, according to which merely finding local occurrences of the search terms in the target material is not enough to reach the meaning being sought. Most typically, this principle is implemented by taking into account, when assessing the relevance of an information unit (an XML element or an instance node conforming to a Semantic Web ontology), the content of the units structurally connected to it and the nature of those connections