
    OSM-CAT: A software application to generate contribution summaries from OpenStreetMap XML

    OpenStreetMap (OSM) is currently the most extensive and widely used example of Volunteered Geographical Information (VGI) available on the Internet. The aim of the OSM project is to provide a free and openly accessible spatial database. The data is provided by volunteers, who collect and contribute it to the OSM database using a variety of techniques and methods. OSM data is then most commonly accessed via a user-friendly web-based map on www.openstreetmap.org. The spatial data corresponding to any OSM mapped area can be exported in a dedicated XML-based format, OSM-XML, which provides a convenient transport format matching the OSM database's model. Using these OSM-XML files, one should be able to extract information about contribution patterns and tagging summaries for the data. However, the simplicity of OSM-XML is also potentially its greatest disadvantage: processing OSM-XML files efficiently can be problematic, given that mapped areas can produce complex, large files. In this thesis we present the design and implementation of a new Java-based software application, the OpenStreetMap Contributor Analysis Tool (OSM-CAT), for computing contribution summaries from OSM-XML. OSM-CAT allows users to process OSM-XML data efficiently and automatically produces a detailed summary of the contents of the dataset. This analysis places specific emphasis on 'interesting' statistics, such as who contributed to the OSM data in a chosen area, what types of contributions were made, when these contributions were made, and the accuracy of map feature tagging. While similar tools exist that perform some of these tasks, OSM-CAT provides GIS researchers and interested individuals with a complete and integrated overview of contributions to OSM for the input OSM-XML datasets. We present a full analysis of OSM-CAT on a large set of OSM-XML datasets and discuss its usefulness to the OSM community and beyond. We close the thesis with some conclusions and set out a number of issues for consideration as future work. A comprehensive appendix provides additional information for those wishing to use OSM-CAT.
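    As a concrete illustration of the kind of streaming pass such a tool can make over OSM-XML, the following minimal Java sketch counts contributions per user from the user attribute carried by node, way and relation elements. The class name and the single-statistic scope are illustrative; the abstract does not disclose OSM-CAT's internals.

        import java.io.File;
        import java.util.HashMap;
        import java.util.Map;
        import javax.xml.parsers.SAXParser;
        import javax.xml.parsers.SAXParserFactory;
        import org.xml.sax.Attributes;
        import org.xml.sax.helpers.DefaultHandler;

        // Minimal sketch of one OSM-CAT-style summary: contributions per user.
        public class ContributionSummary {
            public static void main(String[] args) throws Exception {
                Map<String, Integer> perUser = new HashMap<>();
                SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
                parser.parse(new File(args[0]), new DefaultHandler() {  // args[0]: path to an OSM-XML file
                    @Override
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        // node, way and relation elements all carry user/timestamp attributes
                        if (qName.equals("node") || qName.equals("way") || qName.equals("relation")) {
                            String user = atts.getValue("user");
                            if (user != null) perUser.merge(user, 1, Integer::sum);
                        }
                    }
                });
                perUser.forEach((user, count) -> System.out.println(user + "\t" + count));
            }
        }

    Because SAX streams the document rather than building a DOM, a pass like this stays memory-bounded even for the large exports the thesis mentions.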

    Distributed and scalable parsing solution for telecom network data

    The growing usage of mobile devices and the introduction of 5G networks have increased the significance of network data for the telecom business. The success of telecom organizations can depend on employing efficient data engineering techniques for transforming raw network data into useful information through analytics and machine learning (ML). Elisa Oyj, a Finnish telecommunications company, receives massive amounts of network data from network equipment manufactured by various vendors, and the effectiveness of its data analytics depends on efficient data engineering processes. This thesis presents a scalable data parsing solution that leverages Spark, a distributed programming framework, to parallelize the parsing routines of an existing parsing solution. We design and deploy this solution as a component of the organization's data engineering pipeline to enable automation of data-centric operations. Experimental results indicate that the efficiency of the proposed solution is heavily dependent on the distribution of individual file sizes. The proposed parsing solution demonstrates reliability, scalability, and speed during empirical evaluation, processing 24 hours of network data within 3 hours. The main outcome of the project is an optimized setup with the minimum number of data partitions that ensures zero failures and thus minimum execution time. A smaller execution time leads to lower costs for the continuously running infrastructure provisioned on the cloud.
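    The abstract does not show the parsing code, but a minimal sketch of the pattern it describes, wrapping an existing per-file parser in Spark's Java API, might look as follows. The HDFS paths, the partition count of 512, and parseFile are illustrative placeholders, not Elisa's actual pipeline; the explicit minPartitions argument reflects the thesis's finding that the partition count is the key tuning knob.

        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;

        // Sketch: parallelize an existing per-file parsing routine with Spark.
        public class DistributedParse {
            public static void main(String[] args) {
                SparkConf conf = new SparkConf().setAppName("network-data-parser");
                try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                    // wholeTextFiles yields (path, content) pairs, one per raw input file;
                    // the second argument sets the minimum number of partitions
                    JavaRDD<String> parsed = sc.wholeTextFiles("hdfs:///raw/network/", 512)
                            .map(pathAndContent -> parseFile(pathAndContent._2()));
                    parsed.saveAsTextFile("hdfs:///parsed/network/");
                }
            }

            // Stand-in for the organization's existing single-file parser
            private static String parseFile(String raw) {
                return raw.trim();
            }
        }

    Since each file is parsed independently, skewed file sizes leave some partitions far heavier than others, which is consistent with the reported dependence on the file size distribution.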

    On the performance of markup language compression

    Data compression is used in our everyday life to improve computer interaction or simply for storage purposes. Lossless data compression refers to those techniques that are able to compress a file in such a way that the decompressed output is a replica of the original. These techniques, which differ from lossy data compression, are necessary and heavily used in order to reduce resource usage and improve storage and transmission speeds. Prior research has led to huge improvements in compression performance and efficiency for general-purpose tools, which are mainly based on statistical and dictionary encoding techniques. Extensible Markup Language (XML) contains highly redundant markup, which general-purpose compressors parse as ordinary text. Several tools for compressing XML data have been developed, yielding improvements in compression size and speed using different compression techniques. These tools are mostly based on algorithms that rely on variable-length encoding. XML Schema is a language used to define the structure and data types of an XML document; as a result, it provides XML compression tools with additional information that can be used to improve compression efficiency. In addition, XML Schema is also used for validating XML data. For document compression there is a need to generate the schema dynamically for each XML file; this can be applied to improve the efficiency of XML compressors. This research investigates a dynamic approach to compressing XML data using a hybrid compression tool. This model allows the compression of XML data using variable- and fixed-length encoding techniques, each triggered in its best use case. The aim of this research is to investigate the use of fixed-length encoding techniques to support general-purpose XML compressors. The results demonstrate the possibility of improving compression size when a fixed-length encoder is used to compress most XML data types.
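    As a sketch of the hybrid idea, assuming a schema that marks certain values as fixed-width integers, the following Java fragment writes those values as 4-byte binary and everything else as length-prefixed text, then gzips the stream. The wire format and method names are ours, not the thesis's tool.

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.GZIPOutputStream;

        // Sketch: mix fixed-length (schema-typed) and variable-length encoding,
        // then hand the result to a general-purpose compressor (gzip).
        public class HybridEncoder {
            public static byte[] encode(String[] values, boolean[] isInt) throws IOException {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buf))) {
                    for (int i = 0; i < values.length; i++) {
                        if (isInt[i]) {
                            out.writeByte(0);                          // marker: fixed-length int
                            out.writeInt(Integer.parseInt(values[i])); // always exactly 4 bytes
                        } else {
                            byte[] text = values[i].getBytes(StandardCharsets.UTF_8);
                            out.writeByte(1);                          // marker: variable-length text
                            out.writeInt(text.length);                 // length prefix
                            out.write(text);
                        }
                    }
                }
                return buf.toByteArray();
            }
        }

    The fixed-width path removes the per-character redundancy of numeric text, which is the kind of gain the results attribute to fixed-length encoding of typed XML data.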

    Discovering semantic aspects of socially constructed knowledge hierarchy to boost the relevance of Web searching

    The research intends to boost the relevance of Web search results by classifying Web snippets into socially constructed hierarchical search concepts, such as the most comprehensive human-edited knowledge structure, the Open Directory Project (ODP). The semantic aspects of the search concepts (categories) in the socially constructed hierarchical knowledge repositories are extracted from the associated textual information contributed by societies. The textual information is explored and analyzed to construct a category-document set, which is subsequently employed to represent the semantics of the socially constructed search concepts. Simple API for XML (SAX), a component of JAXP (Java API for XML Processing), is utilized to read in and analyze the two RDF-format ODP data files, structure.rdf and content.rdf. kNN, trained on the constructed category-document set, is used to categorize the Web search results. The categorized Web search results are then ontologically filtered based on the interactions of Web information seekers. Initial experimental results demonstrate that the proposed approach can improve precision by 23.5%.
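    Since the abstract names SAX explicitly, the following minimal Java sketch shows such a pass over structure.rdf, collecting category ids from Topic elements. The Topic element and r:id attribute names follow the public ODP RDF dump format; building the category-document set from content.rdf would proceed analogously.

        import java.io.File;
        import java.util.ArrayList;
        import java.util.List;
        import javax.xml.parsers.SAXParserFactory;
        import org.xml.sax.Attributes;
        import org.xml.sax.helpers.DefaultHandler;

        // Sketch: stream the ODP structure.rdf dump and collect category ids.
        public class OdpStructureReader {
            public static List<String> readCategories(File structureRdf) throws Exception {
                List<String> categories = new ArrayList<>();
                SAXParserFactory.newInstance().newSAXParser().parse(structureRdf, new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        if (qName.equals("Topic")) {
                            String id = atts.getValue("r:id");   // e.g. "Top/Computers/Internet"
                            if (id != null) categories.add(id);
                        }
                    }
                });
                return categories;
            }
        }

    SAX is the natural choice here because the ODP dumps are far too large to hold in memory as a DOM tree.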

    Dynamic Assembly for System Adaptability, Dependability, and Assurance

    (DASASA) Project

    SIQXC: Schema Independent Queryable XML Compression for Smartphones

    The explosive growth of XML use over the last decade has led to a lot of research on how best to store and access it. This growth has resulted in XML being described as a de facto standard for the storage and exchange of data over the web. However, XML is verbose because of the high redundancy of its self-describing markup, and this verbosity poses a storage problem that has led to much research devoted to XML compression. The problem has attracted further interest as the use of resource-constrained devices rises. These devices are limited in storage space and processing power and have finite energy, so they cannot cope with storing and processing large XML documents. Queryable XML compression methods could be a solution, but none of them has a query processor that runs on such devices. Currently, wireless connections are used to alleviate the problem, but they have adverse effects on battery life and are therefore not a sustainable solution. This thesis describes an attempt to address this problem by proposing a queryable compressor (SIQXC) with a query processor that runs in a resource-constrained environment, thereby lowering wireless connection dependency while alleviating the storage problem. It applies a novel, simple 2-tuple integer encoding scheme, clustering, and gzip. SIQXC achieves an average compression ratio of 70%, which is higher than most queryable XML compressors, and also supports a wide range of XPath operators, making it a competitive approach. It was tested through a practical implementation evaluated against real data commonly used for XML benchmarking. The evaluation covered compression ratio, compression time, query evaluation accuracy, and response time. SIQXC allows users, to some extent, to store and manipulate otherwise verbose XML locally on their Smartphones.
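    The abstract does not spell out the tuple layout, so the following Java sketch shows only one plausible reading of a 2-tuple integer encoding: each (element name, text value) pair is mapped to a pair of dictionary ids, producing a compact integer stream that clusters and gzips well. The exact scheme used by SIQXC may differ.

        import java.util.ArrayList;
        import java.util.LinkedHashMap;
        import java.util.List;
        import java.util.Map;

        // Sketch of a dictionary-based 2-tuple integer encoding.
        public class TupleEncoder {
            private final Map<String, Integer> tagDict = new LinkedHashMap<>();
            private final Map<String, Integer> valueDict = new LinkedHashMap<>();

            // First occurrence of a string assigns the next free id
            private static int idOf(Map<String, Integer> dict, String key) {
                return dict.computeIfAbsent(key, k -> dict.size());
            }

            // Each (element name, text value) pair becomes an (int, int) tuple
            public List<int[]> encode(List<String[]> tagValuePairs) {
                List<int[]> tuples = new ArrayList<>();
                for (String[] pair : tagValuePairs) {
                    tuples.add(new int[] { idOf(tagDict, pair[0]), idOf(valueDict, pair[1]) });
                }
                return tuples;
            }
        }

    Small integer ids repeat heavily across a document, which is exactly the redundancy gzip exploits after the clustering step.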

    Efficient Storage of XML - A Comparative Study

    The purpose of this study is to predict the performance of XML storage in various real-time scenarios. This study is a survey and comparative analysis of storing and retrieving XML using databases, using Java objects that represent XML, and using other storage mechanisms that may not yet have been explored. It also gives a high-level overview of how to use XML with databases or Java objects; describes how the differences between data-centric and document-centric XML affect their usage with databases and objects; and covers how XML is used with relational and object-oriented databases and Java objects, as well as the role of native XML databases (stand-alone XML databases). A detailed comparative study was conducted on storing XML using a relational DBMS, a native XML DBMS, and processing into Java objects using JAXB. Data models such as relational, hierarchical, and document-driven models were used as inputs to the study. There is no single tool that can manage all aspects of XML data used in an application; each technology provides interestingly unique features, and a tremendous amount of research and development is in progress on tools and technologies for using XML. It can be predicted that these technologies will eventually merge into one standard method of XML storage incorporating features such as faster searches, full-text search, preservation of original document order, the ability to maintain a collection of documents, the ability to query, store, or retrieve over the network using protocols such as HTTP and SOAP, integral support for casting of elements, and support for processing both valid and invalid XML documents, all in a single tool. This study concludes that the most efficient way to store XML data depends on the context of its usage.
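    The JAXB path compared in the study can be illustrated with a minimal sketch using the javax.xml.bind API (bundled with the JDK up to Java 10, a separate dependency afterwards). The Book class is an illustrative stand-in for the study's test data, not taken from the thesis.

        import java.io.File;
        import javax.xml.bind.JAXBContext;
        import javax.xml.bind.annotation.XmlRootElement;

        // Sketch: JAXB maps an XML document onto an annotated Java object.
        @XmlRootElement
        public class Book {
            public String title;
            public int pages;

            public static void main(String[] args) throws Exception {
                // Unmarshal e.g. <book><title>XML</title><pages>320</pages></book>
                Book b = (Book) JAXBContext.newInstance(Book.class)
                                           .createUnmarshaller()
                                           .unmarshal(new File(args[0]));
                System.out.println(b.title + " (" + b.pages + " pages)");
            }
        }

    This object-binding route suits data-centric XML with a stable structure; document-centric XML, with mixed content and variable order, tends to favor the native XML database route the study also examines.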

    Indexing collections of XML documents with arbitrary links

    In recent years, the popularity of XML has increased significantly. XML is the extensible markup language of the World Wide Web Consortium (W3C) and is used to represent data in many areas, such as traditional database management systems, e-business environments, and the World Wide Web. XML data, unlike relational and object-oriented data, has no fixed schema known in advance; the schema is stored separately from the data. XML data is self-describing and can model heterogeneity more naturally than relational or object-oriented data models. Moreover, XML data usually has XLinks or XPointers to data in other documents (global links). In addition to XLink or XPointer links, the XML standard allows internal links between different elements in the same XML document using ID/IDREF attributes. The rise in popularity of XML has generated much interest in query processing over graph-structured data. To facilitate efficient evaluation of path expressions, structured indexes have been proposed. However, most variants of structured indexes ignore global or interior document references: they assume a tree-like structure of XML documents that contains no such global and internal links. Extending these indexes to work with large XML graphs containing global or internal document links firstly requires a lot of computing power for the creation process, and secondly would require a great deal of space in which to store the indexes. As the latter demonstrates, the efficient evaluation of ancestor-descendant queries over arbitrary graphs with long paths is indeed a complex issue. This thesis proposes the HID index (2-Hop cover path Index based on DAG), which is based on the concept of a two-hop cover for a directed graph. The algorithms proposed for HID index creation, in effect, scale down the original graph size substantially; as a result, a directed acyclic graph (DAG) with a smaller number of nodes and edges emerges. This reduces the number of computing steps required for building the index, and computing time and space are reduced as well. The index also permits efficient evaluation of ancestor-descendant relationships. Moreover, the proposed index has an advantage over other comparable indexes: it is optimized for descendant-or-self queries on arbitrary graphs with link relationships, a task that would stress any index structure. Our experiments with real-life XML data show that the HID index provides better performance than other indexes.
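    The query side of a two-hop cover is compact enough to sketch: each node u stores a set Lout(u) of hop nodes it can reach and a set Lin(u) of hop nodes that can reach it, and u reaches v exactly when the two sets intersect. The label-construction step, which the HID index performs on the condensed DAG, is omitted here; the class and method names are illustrative.

        import java.util.HashMap;
        import java.util.Map;
        import java.util.Set;

        // Sketch: answering ancestor-descendant (reachability) queries
        // from precomputed 2-hop labels.
        public class TwoHopReachability {
            private final Map<String, Set<String>> lout = new HashMap<>();
            private final Map<String, Set<String>> lin = new HashMap<>();

            // Install precomputed labels; by the usual convention a node may
            // appear in its own sets so that direct edges are covered.
            public void label(String node, Set<String> out, Set<String> in) {
                lout.put(node, out);
                lin.put(node, in);
            }

            // u is an ancestor of v iff Lout(u) and Lin(v) share a hop node
            public boolean isAncestorOf(String u, String v) {
                for (String hop : lout.getOrDefault(u, Set.of())) {
                    if (lin.getOrDefault(v, Set.of()).contains(hop)) return true;
                }
                return false;
            }
        }

    The appeal of this scheme is that each query touches only two small label sets rather than traversing the graph, which is why shrinking the graph to a DAG before labeling pays off twice: smaller labels and faster construction.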