277 research outputs found

    Content-Aware DataGuides for Indexing Large Collections of XML Documents

    XML is well-suited for modelling structured data with textual content. However, most indexing approaches perform structure and content matching independently, combining the retrieved path and keyword occurrences in a third step. This paper shows that retrieval in XML documents can be accelerated significantly by processing text and structure simultaneously during all retrieval phases. To this end, the Content-Aware DataGuide (CADG) enhances the well-known DataGuide with (1) simultaneous keyword and path matching and (2) a precomputed content/structure join. Extensive experiments prove the CADG to be 50-90% faster than the DataGuide for various sorts of query and document, including difficult cases such as poorly structured queries and recursive document paths. A new query classification scheme identifies precise query characteristics with a predominant influence on the performance of the individual indices. The experiments show that the CADG is applicable to many real-world applications, in particular large collections of heterogeneously structured XML documents.

    Rule-Based Software Verification and Correction

    The increasing complexity of software systems has led to the development of sophisticated formal methodologies for verifying and correcting data and programs. In general, establishing whether a program behaves correctly w.r.t. the original programmer's intention, or checking the consistency and correctness of a large set of data, are not trivial tasks, as witnessed by the many case studies in the literature. In this dissertation, we face two challenging problems of verification and correction: the verification and correction of declarative programs, and the verification and correction of Web sites (i.e. large collections of semistructured data). Firstly, we propose a general correction scheme for automatically correcting declarative, rule-based programs which exploits a combination of bottom-up and top-down inductive learning techniques. Our hybrid methodology is able to infer program corrections that are hard, or even impossible, to obtain with a simpler, automatic top-down or bottom-up learner. Moreover, the scheme is also particularized to two well-known declarative programming paradigms: the functional logic and the functional programming paradigms. Secondly, we formalize a framework for the automated verification of Web sites which can be used to specify integrity conditions for a given Web site, and then automatically check whether these conditions are fulfilled. We provide a rule-based, formal specification language which allows us to define syntactic as well as semantic properties of the Web site. Then, we formalize a verification technique which detects both incorrect/forbidden patterns and lack of information, that is, incomplete/missing Web pages. Useful information is gathered during the verification process which can be used to repair the Web site. So, after a verification phase, one can also semi-automatically infer some possible corrections in order to fix the Web site.
The methodology is based on a novel rewrit…
Ballis, D. (2005). Rule-Based Software Verification and Correction [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/194

    User Feedback in Probabilistic XML

    Data integration is a challenging problem in many application areas. Approaches mostly attempt to resolve semantic uncertainty and conflicts between information sources as part of the data integration process. In some application areas, this is impractical or even prohibitive, for example, in an ambient environment where devices on an ad hoc basis have to exchange information autonomously. We have proposed a probabilistic XML approach that allows data integration without user involvement by storing semantic uncertainty and conflicts in the integrated XML data. As a consequence, the integrated information source represents all possible appearances of objects in the real world, the so-called possible worlds. In this paper, we show how user feedback on query results can resolve semantic uncertainty and conflicts in the integrated data. Hence, user involvement is effectively postponed to query time, when a user is already interacting actively with the system. The technique relates positive and negative statements on query answers to the possible worlds of the information source, thereby either reinforcing, penalizing, or eliminating possible worlds. We show that after repeated user feedback, an integrated information source better resembles the real world and may converge towards a non-probabilistic information source.

    XML documents schema design

    The eXtensible Markup Language (XML) is fast emerging as the dominant standard for storing, describing and interchanging data among various systems and databases on the Internet. It offers schema languages such as Document Type Definition (DTD) or XML Schema Definition (XSD) for defining the syntax and structure of XML documents. To enable efficient usage of XML documents in any application in a large-scale electronic environment, it is necessary to avoid data redundancies and update anomalies. Redundancy and anomalies in XML documents can lead not only to higher data storage cost but also to increased costs for data transfer and data manipulation. To overcome this problem, this thesis proposes to establish a formal framework of XML document schema design. To achieve this aim, we propose a method to improve and simplify XML schema design by incorporating a conceptual model of the DTD with a theory of database normalization. A conceptual diagram, Graph-Document Type Definition (G-DTD), is proposed to describe the structure of XML documents at the schema level. For G-DTD itself, we define a structure which incorporates attributes, simple elements, complex elements, and relationship types among them. Furthermore, semantic constraints are also precisely defined in order to capture semantic meanings among the defined XML objects. In addition, to provide a guideline to a well-designed schema for XML documents, we propose a set of normal forms for G-DTD on the basis of rules proposed by Arenas and Libkin and by Lv et al. The corresponding normalization rules to transform a G-DTD into a normal form schema are also discussed. A case study is given to illustrate the applicability of the concept. As a result, we found that the new normal forms are more concise and practical, in particular as they allow the user to find an 'optimal' structure of XML elements/attributes at the schema level.
To show that our approach is applicable for the database designer, we develop a prototype of XML document schema design using the Z formal specification language. Finally, using the same case study, this formal specification is tested to check the correctness and consistency of the specification. This gives confidence that our prototype can be implemented successfully to generate an XML schema design automatically.

    Management Strategies for Adopting Agile Methods of Software Development in Distributed Teams

    Between 2003 and 2015, more than 61% of U.S. software development teams failed to satisfy project requirements, budgets, or timelines. Failed projects cost the software industry an estimated 60 billion dollars. Lost opportunities and misused resources are often the result of software development leaders failing to implement appropriate methods for managing software projects. The purpose of this qualitative multiple case study was to explore strategies software development managers use in adopting Agile methodology in the context of distributed teams. The tenets of the Agile approach are individual interaction over tools, working software over documentation, and collaboration over a contract. The conceptual framework for the study was adapting Agile development methodologies. The targeted population was software development managers of U.S.-based companies located in Northern California who had successfully adopted Agile methods for distributed teams. Data were collected through face-to-face interviews with 5 managers and a review of project-tracking documentation and tools. Data analysis included inductive coding of transcribed interviews and evaluation of secondary data to identify themes through methodological triangulation. Findings indicated that coaching and training of teams, incremental implementation of Agile processes, and proactive management of communication effectiveness are effective strategies for adopting Agile methodology in the context of distributed teams. Improving the efficacy of Agile adoption may translate to increased financial stability for software engineers across the world as well as accelerate the successful development of information systems, thereby enriching human lives.

    Dagstuhl News January - December 2001

    "Dagstuhl News" is a publication edited especially for the members of the Foundation "Informatikzentrum Schloss Dagstuhl" to thank them for their support. The News gives a summary of the scientific work being done in Dagstuhl. Each Dagstuhl Seminar is presented by a short abstract describing the contents and scientific highlights of the seminar as well as the perspectives or challenges of the research topic.

    The 4th Conference of PhD Students in Computer Science


    Querying Large Collections of Semistructured Data

    An increasing amount of data is published as semistructured documents formatted with presentational markup. Examples include data objects such as mathematical expressions encoded with MathML or web pages encoded with XHTML. Our intention is to improve the state of the art in retrieving, manipulating, or mining such data. We focus first on mathematics retrieval, which is appealing in various domains, such as education, digital libraries, engineering, patent documents, and medical sciences. Capturing the similarity of mathematical expressions also greatly enhances document classification in such domains. Unlike text retrieval, where keywords carry enough semantics to distinguish text documents and rank them, math symbols do not contain much semantic information on their own. Unfortunately, considering the structure of mathematical expressions to calculate relevance scores of documents results in ranking algorithms that are computationally more expensive than the typical ranking algorithms employed for text documents. As a result, current math retrieval systems either limit themselves to exact matches, or they ignore the structure completely; they sacrifice either recall or precision for efficiency. We propose instead an efficient end-to-end math retrieval system based on a structural similarity ranking algorithm. We describe novel optimization techniques to reduce the index size and the query processing time. Thus, with the proposed optimizations, mathematical contents can be fully exploited to rank documents in response to mathematical queries. We demonstrate the effectiveness and the efficiency of our solution experimentally, using a special-purpose testbed that we developed for evaluating math retrieval systems. We finally extend our retrieval system to accommodate rich queries that consist of combinations of math expressions and textual keywords. As a second focal point, we address the problem of recognizing structural repetitions in typical web documents. 
Most web pages use presentational markup standards, in which the tags control the formatting of documents rather than semantically describing their contents. Hence, their structures typically contain more irregularities than descriptive (data-oriented) markup languages. Even though applications would greatly benefit from a grammar inference algorithm that captures structure to make it explicit, the existing algorithms for XML schema inference, which target data-oriented markup, are ineffective in inferring grammars for web documents with presentational markup. There is currently no general-purpose grammar inference framework that can handle irregularities commonly found in web documents and that can operate with only a few examples. Although inferring grammars for individual web pages has been partially addressed by data extraction tools, the existing solutions rely on simplifying assumptions that limit their application. Hence, we describe a principled approach to the problem by defining a class of grammars that can be inferred from very small sample sets and can capture the structure of most web documents. The effectiveness of this approach, together with a comparison against various classes of grammars including DTDs and XSDs, is demonstrated through extensive experiments on web documents. We finally use the proposed grammar inference framework to extend our math retrieval system and to optimize it further.

    Object Fusion in Geographic Information Systems
