3,915 research outputs found

    XML documents clustering using a tensor space model

    Get PDF
    The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable for large-sized datasets; as well, the factorized matrices produced from the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information

    Random Indexing K-tree

    Get PDF
    Random Indexing (RI) K-tree is the combination of two algorithms for clustering. Many large scale problems exist in document clustering. RI K-tree scales well with large inputs due to its low complexity. It also exhibits features that are useful for managing a changing collection. Furthermore, it solves previous issues with sparse document vectors when using K-tree. The algorithms and data structures are defined, explained and motivated. Specific modifications to K-tree are made for use with RI. Experiments have been executed to measure quality. The results indicate that RI K-tree improves document cluster quality over the original K-tree algorithm.Comment: 8 pages, ADCS 2009; Hyperref and cleveref LaTeX packages conflicted. Removed clevere

    Adaptive content mapping for internet navigation

    Get PDF
    The Internet as the biggest human library ever assembled keeps on growing. Although all kinds of information carriers (e.g. audio/video/hybrid file formats) are available, text based documents dominate. It is estimated that about 80% of all information worldwide stored electronically exists in (or can be converted into) text form. More and more, all kinds of documents are generated by means of a text processing system and are therefore available electronically. Nowadays, many printed journals are also published online and may even discontinue to appear in print form tomorrow. This development has many convincing advantages: the documents are both available faster (cf. prepress services) and cheaper, they can be searched more easily, the physical storage only needs a fraction of the space previously necessary and the medium will not age. For most people, fast and easy access is the most interesting feature of the new age; computer-aided search for specific documents or Web pages becomes the basic tool for information-oriented work. But this tool has problems. The current keyword based search machines available on the Internet are not really appropriate for such a task; either there are (way) too many documents matching the specified keywords are presented or none at all. The problem lies in the fact that it is often very difficult to choose appropriate terms describing the desired topic in the first place. This contribution discusses the current state-of-the-art techniques in content-based searching (along with common visualization/browsing approaches) and proposes a particular adaptive solution for intuitive Internet document navigation, which not only enables the user to provide full texts instead of manually selected keywords (if available), but also allows him/her to explore the whole database

    The State-of-the-arts in Focused Search

    Get PDF
    The continuous influx of various text data on the Web requires search engines to improve their retrieval abilities for more specific information. The need for relevant results to a userā€™s topic of interest has gone beyond search for domain or type specific documents to more focused result (e.g. document fragments or answers to a query). The introduction of XML provides a format standard for data representation, storage, and exchange. It helps focused search to be carried out at different granularities of a structured document with XML markups. This report aims at reviewing the state-of-the-arts in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It is concluded with highlight of open problems

    Accelerating data retrieval steps in XML documents

    Get PDF

    Clustering-based Labelling Scheme - A Hybrid Approach for Efficient Querying and Updating XML Documents

    Get PDF
    Extensible Markup Language (XML) has become a dominant technology for transferring data through the worldwide web. The XML labelling schemes play a key role in handling XML data efficiently and robustly. Thus, many labelling schemes have been proposed. However, these labelling schemes have limitations and shortcomings. Thus, the aim of this research was to investigate the existing XML labelling schemes and their limitations in order to address the issue of efficiency of XML query performance. This thesis investigated the existing labelling schemes and classified them into three categories based on certain criteria, in order to identify the limitations and challenges of these labelling schemes. Based on the outcomes of this investigation, this thesis proposed a state-of-theart labelling scheme, called clustering-based labelling scheme, to resolve or improve the key limitations such as the efficiency of the XML query processing, labelling XML nodes, and XML updates cost. This thesis argued that using certain existing labelling schemes to label nodes, and using the clustering-based techniques can improve query and labelling nodes efficiency. Theoretically, the proposed scheme is based on dividing the nodes of an XML document into clusters. Two existing labelling schemes, which are the Dewey and LLS labelling schemes, were selected for labelling these clusters and their nodes. Subsequently, the proposed scheme was designed and implemented. In addition, the Dewey and LLS labelling scheme were implemented for the purpose of evaluating the proposed scheme. Subsequently, four experiments were designed in order to test the proposed scheme against the Dewey and LLS labelling schemes. The results of these experiments suggest that the proposed scheme achieved better results than the Dewey and LLS schemes. Consequently, the research hypothesis was accepted overall with few exceptions, and the proposed scheme showed an improvement in the performance and all the targeted features and aspects

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Full text link
    Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions involved in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies for optimized solution to a specific real world problem, big data system are not an exception to any such rule. As far as the storage aspect of any big data system is concerned, the primary facet in this regard is a storage infrastructure and NoSQL seems to be the right technology that fulfills its requirements. However, every big data application has variable data characteristics and thus, the corresponding data fits into a different data model. This paper presents feature and use case analysis and comparison of the four main data models namely document oriented, key value, graph and wide column. Moreover, a feature analysis of 80 NoSQL solutions has been provided, elaborating on the criteria and points that a developer must consider while making a possible choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings forth second facet of big data storage, big data file formats, into picture. The second half of the research paper compares the advantages, shortcomings and possible use cases of available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage and its challenges and future prospects have also been discussed
    • ā€¦
    corecore