9,731 research outputs found

    An Optimistic Approach for Clustering Multi-version XML Documents Using Compressed Delta

    Get PDF
    Today with Standardization of XML as an information exchange over web, huge amount of information is formatted in the XML document. XML documents are huge in size. The amount of information that has to be transmitted, processed, stored, and queried is often larger than that of other data formats. Also in real world applications XML documents are dynamic in nature. The versatile applicability of XML documents in different fields of information maintenance and management is increasing the demand to store different versions of XML documents with time. However, storage of all versions of an XML document may introduce the redundancy. Self describing nature of XML creates the problem of verbosity,in result documents are in huge size. This paper proposes optimistic approach to Re-cluster multi-version XML documents which change in time by reassessing distance between them by using knowledge from initial clustering solution and changes stored in compressed delta. Evolving size of XML document is reduced by applying homomorphic compression before clustering them which retains its original structure. Compressed delta stores the changes responsible for document versions, without decompressing them. Test results shows that our approach performs much better than using full pair-wise document comparison

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Ensuring Query Compatibility with Evolving XML Schemas

    Get PDF
    During the life cycle of an XML application, both schemas and queries may change from one version to another. Schema evolutions may affect query results and potentially the validity of produced data. Nowadays, a challenge is to assess and accommodate the impact of theses changes in rapidly evolving XML applications. This article proposes a logical framework and tool for verifying forward/backward compatibility issues involving schemas and queries. First, it allows analyzing relations between schemas. Second, it allows XML designers to identify queries that must be reformulated in order to produce the expected results across successive schema versions. Third, it allows examining more precisely the impact of schema changes over queries, therefore facilitating their reformulation

    The use of data-mining for the automatic formation of tactics

    Get PDF
    This paper discusses the usse of data-mining for the automatic formation of tactics. It was presented at the Workshop on Computer-Supported Mathematical Theory Development held at IJCAR in 2004. The aim of this project is to evaluate the applicability of data-mining techniques to the automatic formation of tactics from large corpuses of proofs. We data-mine information from large proof corpuses to find commonly occurring patterns. These patterns are then evolved into tactics using genetic programming techniques

    SOTXTSTREAM: Density-based self-organizing clustering of text streams

    Get PDF
    A streaming data clustering algorithm is presented building upon the density-based selforganizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach. In the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets

    Changing Trains at Wigan: Digital Preservation and the Future of Scholarship

    Get PDF
    This paper examines the impact of the emerging digital landscape on long term access to material created in digital form and its use for research; it examines challenges, risks and expectations.
    • …
    corecore