986 research outputs found

    Content-Aware DataGuides for Indexing Large Collections of XML Documents

    XML is well-suited for modelling structured data with textual content. However, most indexing approaches perform structure and content matching independently, combining the retrieved path and keyword occurrences in a third step. This paper shows that retrieval in XML documents can be accelerated significantly by processing text and structure simultaneously during all retrieval phases. To this end, the Content-Aware DataGuide (CADG) enhances the well-known DataGuide with (1) simultaneous keyword and path matching and (2) a precomputed content/structure join. Extensive experiments prove the CADG to be 50-90% faster than the DataGuide for various sorts of query and document, including difficult cases such as poorly structured queries and recursive document paths. A new query classification scheme identifies precise query characteristics with a predominant influence on the performance of the individual indices. The experiments show that the CADG is applicable to many real-world applications, in particular large collections of heterogeneously structured XML documents.

    Sensitivity of Semantic Signatures in Text Mining

    The rapid development of the Internet and the ability to store data relatively inexpensively have contributed to an information explosion that did not exist a few years ago. Just a few keystrokes in a search engine on any given subject will return more web pages than at any time before. As the amount of data available to us is so overwhelming, the ability to extract relevant information from it remains a challenge. Since 80% of the available data stored worldwide is text, we need advanced techniques to process this textual data and extract useful information. Text mining is one such process to address the information explosion problem; it employs techniques such as natural language processing, information retrieval, machine learning algorithms and knowledge management. In text mining, the subject text undergoes a transformation in which essential attributes of the text are derived. The attributes that form interesting patterns are chosen, and machine learning algorithms are used to find similar patterns in desired corpora. At the end, the resulting texts are evaluated and interpreted. In this thesis we develop a new framework for the text mining process. An investigator chooses target content from training files, which is captured in semantic signatures. Semantic signatures characterize the target content, derived from training files, that we are looking for in testing files (whose content is unknown). The semantic signatures work as attributes to fetch and/or categorize the target content from a test corpus. A proof-of-concept software package, consisting of tools that aid an investigator in mining text data, is developed using Visual Studio, C# and the .NET Framework. Choosing keywords plays a major role in designing semantic signatures; careful selection of keywords leads to a more accurate analysis, especially in English, which is sensitive to semantics. It is interesting to note that when words appear in different contexts they carry a different meaning. We have incorporated stemming within the framework, and its effectiveness is demonstrated using a large corpus. We have conducted experiments to demonstrate the sensitivity of semantic signatures to subtle content differences between closely related documents. These experiments show that the newly developed framework can identify subtle semantic differences substantially.

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works after its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions. Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
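
As a rough illustration of the programming model the abstract describes, the following single-process sketch runs a word count through separate map, shuffle and reduce steps. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example and do not come from the surveyed systems; a real framework would distribute these steps across a cluster and handle scheduling and fault tolerance.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence within a single process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data on clusters"]))
```

The application code only supplies the map and reduce functions; everything else is what a framework would take over.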

    Containment queries on nested sets


    The Tree Inclusion Problem: In Linear Space and Faster

    Given two rooted, ordered, and labeled trees $P$ and $T$, the tree inclusion problem is to determine if $P$ can be obtained from $T$ by deleting nodes in $T$. This problem has recently been recognized as an important query primitive in XML databases. Kilpeläinen and Mannila [SIAM J. Comput. 1995] presented the first polynomial time algorithm using quadratic time and space. Since then several improved results have been obtained for special cases when $P$ and $T$ have a small number of leaves or small depth. However, in the worst case these algorithms still use quadratic time and space. Let $n_S$, $l_S$, and $d_S$ denote the number of nodes, the number of leaves, and the maximum depth of a tree $S \in \{P, T\}$. In this paper we show that the tree inclusion problem can be solved in space $O(n_T)$ and time $O(\min(l_P n_T,\; l_P l_T \log \log n_T + n_T,\; \frac{n_P n_T}{\log n_T} + n_T \log n_T))$. This improves or matches the best known time complexities while using only linear space instead of quadratic. This is particularly important in practical applications, such as XML databases, where the space is likely to be a bottleneck. Comment: Minor updates from last tim
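
To make the problem definition concrete, here is a naive memoized recursion that decides ordered tree inclusion on small inputs. It is only a sketch of the definition and is not the linear-space algorithm from the paper; the tuple-based tree representation and the names `tree`, `forest_included` and `tree_included` are assumptions made for this example. Deleting a node is taken to mean that its children are promoted, in order, to its parent.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an ordered labeled tree as a hashable (label, children) tuple."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_included(p, t):
    """Return True if forest p can be obtained from forest t by deleting nodes."""
    if not p:
        return True   # every remaining node of t can simply be deleted
    if not t:
        return False  # nodes of p remain, but nothing is left to match them
    (p_label, p_kids), (t_label, t_kids) = p[-1], t[-1]
    # Option 1: delete the root of the rightmost tree in t;
    # its children take its place in the forest.
    if forest_included(p, t[:-1] + t_kids):
        return True
    # Option 2: match the two rightmost roots, then match their
    # child forests and the remaining forests to their left.
    return (p_label == t_label
            and forest_included(p_kids, t_kids)
            and forest_included(p[:-1], t[:-1]))


def tree_included(p, t):
    """Decide ordered tree inclusion for two single trees."""
    return forest_included((p,), (t,))


if __name__ == "__main__":
    pattern = tree("a", tree("b"), tree("c"))
    target = tree("a", tree("x", tree("b")), tree("c"))
    print(tree_included(pattern, target))
```

In the usage above, deleting the node labeled `x` from the target yields the pattern. Swapping the pattern's children breaks inclusion, because sibling order must be preserved.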