
    K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources

    The integration of heterogeneous data sources and software systems is a major issue in the biomedical community, and several approaches have been explored: linking databases, on-the-fly integration through views, and integration through warehousing. In this paper we report on our experiences with two systems that were developed at the University of Pennsylvania: an integration system called K2, which has primarily been used to provide views over multiple external data sources and software systems; and a data warehouse called GUS, which downloads, cleans, integrates and annotates data from multiple external data sources. Although the view and warehouse approaches each have their advantages, there is no clear winner. Therefore, to choose the best strategy for a particular application, users must consider how the data is to be used, what the performance guarantees must be, and how much programmer time and expertise is available.
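    The trade-off generalises beyond genomics. As a rough illustration only (the source names, fields and calls below are invented, not K2's or GUS's actual interfaces), this Python sketch contrasts answering a query through an on-the-fly view with answering it from a pre-built warehouse table:

        # Hypothetical sketch: two remote sources integrated two ways.
        # All names and data here are invented for illustration.
        import sqlite3

        def fetch_genbank():        # stand-in for one remote source
            return [("BRCA1", "NM_007294"), ("TP53", "NM_000546")]

        def fetch_swissprot():      # stand-in for a second remote source
            return [("BRCA1", "P38398"), ("TP53", "P04637")]

        # View-style (K2-like): integrate on the fly, at query time.
        def view_query(gene):
            nt = dict(fetch_genbank())      # fetched fresh on every call
            aa = dict(fetch_swissprot())
            return (gene, nt.get(gene), aa.get(gene))

        # Warehouse-style (GUS-like): download, clean and store once;
        # later queries hit only the local, integrated copy.
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE gene (symbol TEXT, transcript TEXT, protein TEXT)")
        proteins = dict(fetch_swissprot())
        for symbol, transcript in fetch_genbank():
            db.execute("INSERT INTO gene VALUES (?, ?, ?)",
                       (symbol, transcript, proteins.get(symbol)))

        print(view_query("BRCA1"))          # always fresh, slower per query
        print(db.execute("SELECT * FROM gene WHERE symbol = 'TP53'").fetchone())

    The view answer is always current but pays the fetch cost on every query; the warehouse answer is fast and can be cleaned offline but may go stale, which is exactly the choice the paper weighs.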

    Graph database management systems: storage, management and query processing

    The proliferation of graph data, generated from diverse sources, has given rise to many research efforts concerning graph analysis. Interactions in social networks, publication networks, protein networks, software code dependencies and transportation systems are all examples of graph-structured data originating from a variety of application domains and demonstrating different characteristics. In recent years, graph database management systems (GDBMS) have been introduced for the management and analysis of graph data. Motivated by the growing number of real-life applications making use of graph database systems, this thesis focuses on the effectiveness and efficiency aspects of such systems. Specifically, we study the following topics relevant to graph database systems: (i) modeling large-scale applications in GDBMS; (ii) storage and indexing issues in GDBMS; and (iii) efficient query processing in GDBMS.

    In this thesis, we adopt two different application scenarios to examine how graph database systems can model complex features and perform relevant queries on each of them. Motivated by the popular application of social network analytics, we selected Twitter, a microblogging platform, to conduct our detailed analysis. Addressing limitations of existing models, we propose a data model for the Twittersphere that proactively captures Twitter-specific interactions. We examine the feasibility of running analytical queries on GDBMS and offer an empirical analysis of the performance of the proposed approach. Next, we consider a use case of modeling software code dependencies in a graph database system, and investigate how these systems can support capturing the evolution of a codebase over time. We study a code comprehension tool that extracts software dependencies and stores them in a graph database. On a versioned graph built using a very large codebase, we demonstrate how existing code comprehension queries can be efficiently processed and also show the benefit of running queries across multiple versions.

    Another important aspect of this thesis is the study of the storage aspects of graph systems. The throughput of many graph queries can be significantly affected by disk I/O performance; therefore graph database systems need to focus on effective graph storage for optimising disk operations. We observe that the locality of edges plays an important role, and we address the edge-labeling problem, which aims to label both incoming and outgoing edges of a graph, maximizing the ‘edge-consecutiveness’ metric. By achieving a better layout and locality of edges on disk, we show that our proposed algorithms result in significantly improved disk I/O performance, leading to faster execution of neighbourhood queries.

    Some applications require the integrated processing of queries from the graph and textual domains within a graph database system. Aggregating these dimensions facilitates gaining key insights in several application scenarios. For example, in a social network setting, one may want to find the closest k users in the network (graph traversal) who talk about a particular topic A (textual search). Motivated by such practical use cases, in this thesis we study the top-k social-textual ranking query, which essentially requires the efficient combination of a keyword search query with a graph traversal. We propose algorithms that leverage graph partitioning techniques, based on the premise that socially close users will be placed within the same partition, allowing more localised computations. We show that our proposed approaches achieve significantly better results than standard baselines and demonstrate robust behaviour under changing parameters.
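    As a concrete, if naive, rendering of the top-k social-textual ranking query, the sketch below ranks the nearest users whose posts mention a given topic by combining a BFS traversal with a keyword test. The graph, the posts and the hop-count scoring are invented for illustration, and this baseline deliberately omits the partition-based optimisations the thesis proposes:

        # Minimal sketch: top-k social-textual ranking as BFS distance
        # plus a keyword filter. Data is invented for illustration.
        import heapq
        from collections import deque

        friends = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
        posts = {"b": "graph databases", "c": "image retrieval", "d": "graph storage"}

        def top_k_social_textual(start, topic, k):
            dist, seen, frontier = {}, {start}, deque([(start, 0)])
            while frontier:                      # plain BFS for hop distance
                u, d = frontier.popleft()
                dist[u] = d
                for v in friends.get(u, []):
                    if v not in seen:
                        seen.add(v)
                        frontier.append((v, d + 1))
            scored = [(dist[u], u) for u, text in posts.items()
                      if u in dist and u != start and topic in text]
            return heapq.nsmallest(k, scored)    # closest matching users first

        print(top_k_social_textual("a", "graph", 2))   # [(1, 'b'), (2, 'd')]

    A partition-aware variant would direct the traversal to the partitions holding socially close users first, allowing most of the graph to be pruned rather than visited.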

    Bridging the semantic gap in content-based image retrieval.

    To manage large image databases, Content-Based Image Retrieval (CBIR) emerged as a new research subject. CBIR involves the development of automated methods that use visual features in searching and retrieving. Unfortunately, the performance of most CBIR systems is inherently constrained by the low-level visual features, because they cannot adequately express the user's high-level concepts. This is known as the semantic gap problem. This dissertation introduces a new approach to CBIR that attempts to bridge the semantic gap. Our approach includes four components. The first one learns a multi-modal thesaurus that associates low-level visual profiles with high-level keywords. This is accomplished through image segmentation, feature extraction, and clustering of image regions. The second component uses the thesaurus to annotate images in an unsupervised way. This is accomplished through fuzzy membership functions that label new regions based on their proximity to the profiles in the thesaurus. The third component consists of an efficient and effective method for fusing the retrieval results from the multi-modal features. Our method is based on learning and adapting fuzzy membership functions to the distribution of the features' distances and assigning a degree of worthiness to each feature. The fourth component provides the user with the option to perform hybrid querying and query expansion. This allows the enrichment of a visual query with textual data extracted from the automatically labeled images in the database. The four components are integrated into a complete CBIR system that can run in three different and complementary modes. The first mode allows the user to query using an example image. The second mode allows the user to specify positive and/or negative sample regions that should or should not be included in the retrieved images. The third mode uses a Graphical Text Interface to allow the user to browse the database interactively using a combination of low-level features and high-level concepts. The proposed system and all of its components and modes are implemented and validated using a large data collection for accuracy, performance, and improvement over traditional CBIR techniques.
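    To make the second component concrete: the sketch below assigns a new region fuzzy memberships in each keyword's visual profile according to feature-space distance. The toy profiles and the inverse-distance membership form (a standard fuzzy-c-means-style formulation) are assumptions for illustration, not the dissertation's exact functions:

        # Minimal sketch: fuzzy membership of a region in keyword profiles,
        # using an FCM-style inverse-distance weighting. Data is invented.
        import math

        profiles = {                    # keyword -> visual-profile centroid (toy 2-D features)
            "sky":   (0.2, 0.9),
            "grass": (0.7, 0.4),
            "sand":  (0.9, 0.8),
        }

        def memberships(region, m=2.0):
            # distance to each profile; tiny floor avoids division by zero
            d = {kw: math.dist(region, c) or 1e-9 for kw, c in profiles.items()}
            # closer profiles receive higher membership; values sum to 1
            return {kw: 1.0 / sum((d[kw] / d[j]) ** (2 / (m - 1)) for j in d)
                    for kw in d}

        region = (0.25, 0.85)           # feature vector of a segmented region
        print(memberships(region))      # dominated by "sky" (~0.98)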

    Maximizing User Domain Expertise to Clarify Oblique Specifications of Relational Queries

    While there is abundant access to data management technology today, working with data is still challenging for the average user. One common means of manipulating data is with SQL on relational databases, but this requires knowledge of SQL as well as the database's schema and contents. Consequently, previous work has proposed oblique query specification (OQS) methods such as natural language or programming-by-example to allow users to imprecisely specify their query intent. These methods, however, suffer from either low precision or low expressivity and, in addition, produce a list of candidate SQL queries that makes it difficult for users to select their final target query. My thesis is that OQS systems should maximize user domain expertise to triangulate the user's desired query. First, I demonstrate how to leverage previously-issued SQL queries to improve the accuracy of natural language interfaces. Second, I propose a system allowing users to specify a query with both natural language and programming-by-example. Finally, I develop a system where users provide feedback on system-suggested tuples to select a SQL query from a set of candidate queries generated by an OQS system.
    PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies.
    https://deepblue.lib.umich.edu/bitstream/2027.42/155114/1/cjbaik_1.pd
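    The tuple-feedback step in the final contribution above can be illustrated with a small sketch: given candidate SQL queries from an OQS system, show the user a tuple on which the candidates disagree and prune by the answer. The schema, data and candidates below are invented, and this is one plausible reading of the interaction rather than the system's actual algorithm:

        # Minimal sketch: disambiguating candidate SQL queries by asking
        # the user about tuples the candidates disagree on. Data invented.
        import sqlite3

        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INT)")
        db.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                       [("ann", "cs", 90), ("bob", "cs", 60), ("eve", "ee", 95)])

        candidates = [
            "SELECT name FROM emp WHERE salary > 80",
            "SELECT name FROM emp WHERE dept = 'cs'",
        ]

        def results(q):
            return set(db.execute(q).fetchall())

        def disambiguate(candidates, user_wants):
            while len(candidates) > 1:
                a, b = (results(q) for q in candidates[:2])
                probe = next(iter(a ^ b))         # a distinguishing tuple
                keep = user_wants(probe)          # feedback: should it appear?
                candidates = [q for q in candidates
                              if (probe in results(q)) == keep]
            return candidates[0]

        # e.g. the user confirms ('eve',) belongs, ruling out the dept query
        print(disambiguate(candidates, lambda t: t == ("eve",)))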

    The development of a model for organising educational resources on an Intranet

    The twenty-first century has found education at the crossroads of change, and there are burgeoning challenges facing the modern educator. To meet these demands, educators find themselves turning to Information Technology for answers. The technologies used in attempts to overcome the challenges often include the Internet and electronic educational resources. Although the Internet has earned its name as the Information Highway, it is also fraught with misleading and incorrect information, and educators' arduous searches yield few good, usable resources. Storing, organising and efficiently retrieving the resources that are found is therefore a matter of saving time. The aim of the study was to develop a method to organise and retrieve educational resources in an efficient and personalised manner. To this end, an exploration of pedagogy and educational paradigms was undertaken. The current educational paradigm, constructivism, proposes that each learner is an individual with unique learning and personal needs. To develop a new model, the current models first need to be understood. The current solutions for organising educational resources are realised as several software packages, also called e-learning packages. A list of criteria describing the essential requirements for organising educational resources was established. These criteria were based on the pedagogical principles prescribed by educators and the practical technological frameworks necessary to fulfil the needs of the teaching/learning situation, and they were used to critique and explore the available solutions. It was found that although the available e-learning packages fulfil a need within their genre, they do not meet the core requirements of constructivism. The resource base model seeks to address these needs by focussing on the educational aspects of resource delivery over an Intranet.

    For the purposes of storing, organising and delivering the resources, a database had to be established. This database had to have numerous qualities, including the ability to search for and retrieve resources with great efficiency. Retrieving data efficiently is the forte of the star schema, while storing and organising data is the strength of a normalised schema. It is not standard practice to utilise both types of schema within the same database: a star schema is usually reserved for data warehouses because of its data retrieval abilities, while a normalised schema is customary for operational databases. The resource base model, however, needs both the storage facilities of an operational database and the efficient query facilities of a data warehouse. The resource base model therefore melds both schemas into one database with interlinking tables. This database forms the foundation (or back-end) of the resource base, while web browsers serve as its user interface (or front-end).

    The results of the study on the pedagogy, the current e-learning solutions and the resource base are written up within this dissertation. The contribution that this dissertation makes is the development of a technique to store, organise and retrieve educational resources efficiently, in such a manner that the requirements of both constructivism and outcomes-based education are fulfilled. To this end, a list of technological and pedagogical criteria on which to critique a resource delivery technique has been developed. This dissertation also elaborates on the schema designs chosen for the resource base, namely the normalised schema and the star schema. From these schemas, a prototype has been developed. The prototype's function was twofold: first, to determine the feasibility of the technique, and second, to determine the success of the technique in fulfilling the needs expressed in the list of criteria.
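    As a rough sketch of the hybrid design, assuming invented table and column names rather than the dissertation's actual schema, the normalised tables below store the resources while also serving as dimensions for a star-schema fact table, so that one database supports both maintenance and fast retrieval:

        # Minimal sketch: normalised tables interlinked with a star-schema
        # fact table in one database. Names and columns are invented.
        import sqlite3

        db = sqlite3.connect(":memory:")
        db.executescript("""
        -- normalised side: one row per entity, no redundancy
        CREATE TABLE resource (id INTEGER PRIMARY KEY, title TEXT, url TEXT);
        CREATE TABLE topic    (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE grade    (id INTEGER PRIMARY KEY, level TEXT);

        -- star side: a fact table interlinked with the tables above,
        -- which double as its dimensions, so lookups need simple joins
        CREATE TABLE usage_fact (
            resource_id INTEGER REFERENCES resource(id),
            topic_id    INTEGER REFERENCES topic(id),
            grade_id    INTEGER REFERENCES grade(id),
            retrievals  INTEGER
        );
        """)

        # a typical retrieval: resources on a topic for a grade, by popularity
        query = """
        SELECT r.title, f.retrievals
        FROM usage_fact f
        JOIN resource r ON r.id = f.resource_id
        JOIN topic t    ON t.id = f.topic_id
        JOIN grade g    ON g.id = f.grade_id
        WHERE t.name = 'fractions' AND g.level = 'grade 7'
        ORDER BY f.retrievals DESC;
        """
        print(db.execute(query).fetchall())   # empty until facts are loaded

    Under this split, delivery queries run against the fact table through simple joins, while inserts and updates touch only the normalised side.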