269 research outputs found

    Data mining framework

    The purpose of this document is to build a framework for working with clinical data. Vast amounts of clinical records, stored in health repositories, contain information that can be used to improve the quality of health care. However, the information generated from these records depends heavily on the manner in which the data is arranged. A number of factors need to be considered before information can be extracted from the patient records. This document deals with the preparation of a framework for the data before it can be mined.

    One of the issues to deal with is information contained in the clinical records that can be used to identify the patient. A means to create anonymous records is discussed in this document. Once the records have been de-identified, they can be used for data mining. In addition to storing patient records, the document also discusses the possibility of 'abstracting' information from these documents and storing it in the repository. Information generated from the combination of patient records and abstracted information could be used to improve the quality of health care.

    This document also discusses a means to query information from the data repository. A prototype application, which provides all these facilities in a form that can be accessed from any remote location, is presented. In addition, the prospect of using the Clinical Document Architecture format to store the clinical records is explored.
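The de-identification step the abstract describes can be sketched in a few lines; the field names, the salt, and the pseudonym scheme below are illustrative assumptions, not the framework's actual design.

```python
import hashlib

# Hypothetical set of directly identifying fields (assumption for illustration).
IDENTIFYING_FIELDS = {"name", "ssn", "address", "phone"}

def deidentify(record: dict, salt: str = "site-secret") -> dict:
    """Return a copy of a clinical record with identifiers removed and a
    stable pseudonymous key, so de-identified records can still be linked."""
    clean = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    # Stable pseudonym: salted hash of the original patient identifier.
    patient_id = str(record.get("ssn", ""))
    clean["pseudonym"] = hashlib.sha256((salt + patient_id).encode()).hexdigest()[:16]
    return clean

record = {"name": "Jane Doe", "ssn": "123-45-6789",
          "diagnosis": "hypertension", "age": 54}
anon = deidentify(record)
```

Because the pseudonym is a deterministic salted hash, repeated visits by the same patient map to the same key without exposing the identifier itself.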

    Survey over Existing Query and Transformation Languages

    A widely acknowledged obstacle to realizing the vision of the Semantic Web is the inability of many current Semantic Web approaches to cope with data available in such diverging representation formalisms as XML, RDF, or Topic Maps. A common query language is the first step toward transparent access to data in any of these formats. To further the understanding of the requirements and approaches proposed for query languages in the conventional Web as well as the Semantic Web, this report surveys a large number of query languages for accessing XML, RDF, or Topic Maps. This is the first systematic survey to consider query languages from all these areas. From the detailed survey of these query languages, a common classification scheme is derived that is useful for understanding and differentiating languages within and among all three areas.
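As a minimal illustration of the format-specific querying the survey motivates, the "find all titles" question can be asked of an XML document with (a small subset of) XPath, here via Python's standard library; the document and element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# A toy XML document; element names are illustrative only.
doc = ET.fromstring(
    "<catalog>"
    "<book><title>Semantic Web Primer</title></book>"
    "<book><title>XML in a Nutshell</title></book>"
    "</catalog>"
)

# ElementTree supports a limited XPath subset; './/title' selects all
# title elements anywhere under the root, in document order.
titles = [t.text for t in doc.findall(".//title")]
```

The same question over RDF data would need an entirely different language (e.g. SPARQL), which is precisely the fragmentation the survey examines.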

    Implementation of an XQuery engine for large documents in CanstoreX

    XML is a markup language used for storing documents that contain structured information. Its flexibility helps in storing, processing, and querying diverse and complex documents with any structure. While in theory XML could be used to handle any document, the currently available parsers require large amounts of main memory, resulting in severe restrictions on the size of XML documents. As a result, technologies have been developed to break XML documents into smaller chunks and allow the parsers to load only a specific portion of the document when needed.

    Two major but diametrically opposed approaches to storing an XML document on disk have emerged. The first breaks an XML document into parent-child pairs and stores them in relational storage. The second builds a native storage for XML that attempts to directly capture the XML hierarchy. Canonical Storage for XML (CanStoreX) is a native storage technology being developed by our group at Iowa State University that has been tested for pagination of XML documents up to 100 gigabytes in size. CanStoreX requires that every page be a self-contained XML document in its own right; thus the pages themselves form an XML-like hierarchy.

    XML can be used to encode a variety of data: system configuration, metadata, documents such as books, relational data, and object-oriented data. An array of technologies has been developed to process XML documents. Our major interest in XML lies in the view that an XML document can be considered a database which can then be queried. Several query engines for XML exist. Kweelt is an excellent early platform that supports the Quilt query language, a preliminary language that was subsequently extended into XQuery, a query language standardized by the W3 Consortium. The original Kweelt uses a DOM parser and can therefore only handle small documents.

    The main focus of this thesis is to deploy CanStoreX to query gigabyte-sized documents. The resulting platform has been extensively tested.
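The memory restriction that motivates pagination can be seen with standard parsers: DOM-style parsing materializes the whole tree in memory, while an event-driven parse can discard elements as it goes. A minimal sketch in Python (unrelated to CanStoreX's own page format):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Build a small document in memory; the real case is a multi-gigabyte file.
xml_bytes = b"<items>" + b"".join(
    b"<item>%d</item>" % i for i in range(1000)
) + b"</items>"

total = 0
# iterparse yields each element as its end tag is seen, so the tree can be
# pruned incrementally instead of being held whole, as a DOM parse would.
for event, elem in ET.iterparse(BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "item":
        total += int(elem.text)
        elem.clear()  # drop the subtree to keep memory bounded
```

With `elem.clear()` the resident set stays roughly constant regardless of document size, which is the property a DOM parser like the one in the original Kweelt lacks.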

    Fast in-memory XPath search using compressed indexes

    A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, one which optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search), such as the XPathMark queries, SXSI performs on par with or better than the fastest known systems, MonetDB and Qizx; (2) on queries that use text search, SXSI outperforms the existing systems by one to three orders of magnitude (depending on the size of the result set); and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.

    Peer reviewed.
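The bracket-plus-labels tree representation the abstract mentions can be sketched as follows; SXSI itself backs this encoding with succinct data structures and rank/select support, so this toy encoder only shows the shape of the idea.

```python
import xml.etree.ElementTree as ET

def encode(elem, bits, labels):
    """Emit 1 on entering an element, 0 on leaving, and record its tag.
    The resulting bit array is the balanced-parentheses shape of the tree."""
    bits.append(1)
    labels.append(elem.tag)
    for child in elem:
        encode(child, bits, labels)
    bits.append(0)

root = ET.fromstring("<a><b/><c><d/></c></a>")
bits, labels = [], []
encode(root, bits, labels)
```

For the tree `a(b, c(d))` this yields bits `1 1 0 1 1 0 0 0` with labels `a b c d`: the structure and the label sequence are stored separately, which is what lets each be compressed and indexed independently.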

    Vectorwise: Beyond Column Stores

    This paper tells the story of Vectorwise, a high-performance analytical database system, from multiple perspectives: its history from academic project to commercial product, the evolution of its technical architecture, customer reactions to the product, and its future research and development roadmap. One take-away from this story is that the novelty in Vectorwise is much more than just column storage: it boasts many query processing innovations in its vectorized execution model, and an adaptive mixed row/column data storage model with indexing support tailored to analytical workloads. Another is that there is a long road from research prototype to commercial product, though database research continues to exert a strong innovative influence on product development.
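The vectorized execution model mentioned above means each operator call processes a block (a "vector") of values rather than a single tuple, amortizing interpretation overhead across the block. A schematic contrast in plain Python (not Vectorwise code; the primitives are invented for illustration):

```python
def scalar_sum_filtered(rows):
    # Tuple-at-a-time: the full operator sequence is interpreted per row.
    total = 0
    for price, qty in rows:
        if qty > 10:              # selection
            total += price * qty  # projection + aggregation
    return total

def vectorized_sum_filtered(price_vec, qty_vec, vector_size=4):
    # Vector-at-a-time: each primitive loops over a whole block of values,
    # so interpretation cost is paid once per vector, not once per tuple.
    total = 0
    for start in range(0, len(price_vec), vector_size):
        p = price_vec[start:start + vector_size]
        q = qty_vec[start:start + vector_size]
        sel = [i for i, x in enumerate(q) if x > 10]  # selection primitive
        total += sum(p[i] * q[i] for i in sel)        # fused mul+sum primitive
    return total

rows = [(2, 5), (3, 20), (4, 11), (5, 1), (6, 30)]
prices = [r[0] for r in rows]
qtys = [r[1] for r in rows]
```

Both functions compute the same answer; the vectorized form also illustrates the columnar layout (separate `prices` and `qtys` arrays) that the paper's mixed row/column storage model serves.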

    Implementation of Web Query Languages Reconsidered

    Visions of the next-generation Web such as the "Semantic Web" or the "Web 2.0" have triggered the emergence of a multitude of data formats. These formats have different characteristics as far as the shape of data is concerned (for example, tree- vs. graph-shaped). They are accompanied by a puzzlingly large number of query languages, each limited to one data format. Thus a key feature of the Web, namely to make it possible to access anything published by anyone, is compromised. This thesis is devoted to versatile query languages capable of accessing data in a variety of Web formats. The issue is addressed from three angles: language design; a common, yet uniform semantics; and a common, yet uniform evaluation. The thesis is accordingly divided into three parts.

    First, we consider the query language Xcerpt as an example of the advocated class of versatile Web query languages. Using this concrete exemplar allows us to clarify and discuss the vision of versatility in detail. Second, a number of query languages (XPath, XQuery, SPARQL, and Xcerpt) are translated into a common intermediary language, CIQLog. This language has a purely logical semantics, which makes it easily amenable to optimizations. As a side effect, this provides, to the best of our knowledge, the first logical semantics for XQuery and SPARQL. It is a very useful tool for understanding the commonalities and differences of the considered languages. Third, the intermediate logical language is translated into a query algebra, CIQCAG. The core feature of CIQCAG is that it scales from tree- to graph-shaped data and queries without efficiency losses when tree data and queries are considered: it is shown that, in these cases, optimal complexities are achieved. CIQCAG is also shown to evaluate each of the aforementioned query languages with a complexity at least as good as the best known evaluation methods so far. For example, navigational XPath is evaluated with space complexity O(q d) and time complexity O(q n), where q is the query size, n the data size, and d the depth of the (tree-shaped) data.

    CIQCAG is further shown to provide linear-time and linear-space evaluation of tree-shaped queries for a larger class of graph-shaped data than any method previously proposed. This larger class of graph-shaped data, called continuous-image graphs (CIGs for short), is introduced for the first time in this thesis. A (directed) graph is a CIG if its nodes can be totally ordered in such a manner that, for this order, the children of any node form a continuous interval. CIQCAG achieves these properties by employing a novel data structure, called the sequence map, that allows an efficient evaluation of tree-shaped queries, or of tree-shaped cores of graph-shaped queries, on any graph-shaped data. While being ideally suited to trees and CIGs, the data structure gracefully degrades to unrestricted graphs. It yields a remarkably efficient evaluation on graph-shaped data that only a few edges prevent from being trees or CIGs.
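Given a candidate total order of the nodes, the CIG property is easy to check: every node's children must occupy a contiguous run of positions in that order. A small sketch of that check (finding such an order, and the sequence-map evaluation built on it, are the thesis's own contributions and are not shown):

```python
def is_cig_order(children, order):
    """Check whether, under the given total order of nodes, every node's
    children occupy a contiguous interval of positions (the CIG property)."""
    pos = {node: i for i, node in enumerate(order)}
    for node, kids in children.items():
        if not kids:
            continue
        positions = sorted(pos[k] for k in kids)
        # Contiguous interval: max - min spans exactly the child count.
        if positions[-1] - positions[0] != len(positions) - 1:
            return False
    return True

# A small DAG (not a tree: 'b' has two parents), given as adjacency lists.
children = {"r": ["a", "b"], "a": ["b", "c"], "b": [], "c": []}
```

Under the order `r, a, b, c` both child sets are contiguous, so this DAG is a CIG even though it is not a tree; under `r, b, a, c` the children of `a` (positions 1 and 3) are not, showing the property depends on the chosen order.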

    Incorporating Domain-Specific Information Quality Constraints into Database Queries

    The range of information now available in queryable repositories opens up a host of possibilities for new and valuable forms of data analysis. Database query languages such as SQL and XQuery offer a concise and high-level means by which such analyses can be implemented, facilitating the extraction of relevant data subsets into either generic or bespoke data analysis environments. Unfortunately, the quality of data in these repositories is often highly variable. The data is still useful, but only if the consumer is aware of the data quality problems and can work around them. Standard query languages offer little support for this aspect of data management. In principle, however, it should be possible to embed constraints describing the consumer's data quality requirements into the query directly, so that the query evaluator can take over responsibility for enforcing them during query processing. Most previous attempts to incorporate information quality constraints into database queries have been based around a small number of highly generic quality measures, which are defined and computed by the information provider. This is a useful approach in some application areas but, in practice, quality criteria are more commonly determined by the user of the information, not by the provider. In this paper, we explore an approach to incorporating quality constraints into database queries.
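The idea of embedding a consumer's quality constraint in the query itself can be mimicked in plain SQL by filtering on quality metadata stored alongside the data; the schema and the "completeness" measure below are illustrative assumptions, not the paper's actual mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE readings (
    site TEXT, value REAL, completeness REAL  -- quality metadata column
)""")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("A", 1.2, 0.95), ("B", 3.4, 0.40), ("C", 2.1, 0.88)],
)

# The consumer's quality requirement (completeness >= 0.8) is embedded
# directly in the query, so the evaluator enforces it during processing
# instead of leaving the consumer to filter low-quality rows afterwards.
rows = conn.execute(
    "SELECT site, value FROM readings WHERE completeness >= 0.8 ORDER BY site"
).fetchall()
```

The limitation the paper highlights is visible even here: the filter only works because the provider chose to publish a generic `completeness` column, whereas a consumer's own domain-specific quality criterion would need a richer mechanism.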