308,778 research outputs found

    Streaming Weighted Sampling over Join Queries

    Get PDF
    Join queries are a fundamental database tool, capturing a range of tasks that involve linking heterogeneous data sources. However, with massive table sizes, it is often impractical to keep these in memory, and we can only take one or few streaming passes over them. Moreover, building out the full join result (e.g., linking heterogeneous data sources along quasi-identifiers) can lead to a combinatorial explosion of results due to many-to-many links. Random sampling is a natural tool to boil this oversized result down to a representative subset with well-understood statistical properties, but turns out to be a challenging task due to the combinatorial nature of the sampling domain. Existing techniques in the literature focus solely on the setting with tabular data residing in main memory, and do not address aspects such as stream operation, weighted sampling and more general join operators that are urgently needed in a modern data processing context. The main contribution of this work is to meet these needs with more lightweight practical approaches. First, a bijection between the sampling problem and a graph problem is introduced to support weighted sampling and common join operators. Second, the sampling techniques are refined to minimise the number of streaming passes. Third, techniques are presented to deal with very large tables under limited memory. Finally, the proposed techniques are compared to existing approaches that rely on database indices and the results indicate substantial memory savings, reduced runtimes for ad-hoc queries and competitive amortised runtimes

    Towards G2G: Systems of Technology Database Systems

    Get PDF
    We present an approach and methodology for developing Government-to-Government (G2G) Systems of Technology Database Systems. G2G will deliver technologies for distributed and remote integration of technology data for internal use in analysis and planning as well as for external communications. G2G enables NASA managers, engineers, operational teams and information systems to "compose" technology roadmaps and plans by selecting, combining, extending, specializing and modifying components of technology database systems. G2G will interoperate information and knowledge that is distributed across organizational entities involved that is ideal for NASA future Exploration Enterprise. Key contributions of the G2G system will include the creation of an integrated approach to sustain effective management of technology investments that supports the ability of various technology database systems to be independently managed. The integration technology will comply with emerging open standards. Applications can thus be customized for local needs while enabling an integrated management of technology approach that serves the global needs of NASA. The G2G capabilities will use NASA s breakthrough in database "composition" and integration technology, will use and advance emerging open standards, and will use commercial information technologies to enable effective System of Technology Database systems

    A Unified Approach for Indexed and Non-Indexed Spatial Joins

    Get PDF
    The original publication is available at www.springerlink.comL. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, J. Vahrenhold, and J. S. Vitter. “A Unified Approach for Indexed and Non-Indexed Spatial Joins,” Proceedings of the 7th International Conference on Extending Database Technology (EDBT ’00), Konstanz, Germany, March 2000, published in Lecture Notes in Computer Science, Springer, 1777, Berlin, Germany, 413–429

    DEEP CONVOLUTIONAL NEURAL NETWORK USING A NEW DATASET FOR BERBER LANGUAGE

    Get PDF
    Currently, Handwritten Character Recognition (HCR) technology has become an interesting and immensely useful technology. It has been explored with highperformance in many languages. However, a few HCR systems are proposed for the Amazigh (Berber) language. Furthermore, the validation of any Amazighhandwritten recognition system remains a major challenge due to no availability of a robust Amazigh database. To address this problem, we first created two new datasets for Tifinagh and Amazigh Latin characters, by extending the well-known EMNIST database with the Amazigh alphabet. And then, we have proposed a handwritten character recognition system, which is based on a deep convolutional neural network to validate the created datasets. The proposed CNN has been trained and tested on our created datasets, and the experimental tests show that it achieves satisfactory results in terms of accuracy and recognition efficiency

    Distribution of the Object Oriented Databases. A Viewpoint of the MVDB Model's Methodology and Architecture

    Get PDF
    In databases, much work has been done towards extending models with advanced tools such as view technology, schema evolution support, multiple classification, role modeling and viewpoints. Over the past years, most of the research dealing with the object multiple representation and evolution has proposed to enrich the monolithic vision of the classical object approach in which an object belongs to one hierarchy class. In particular, the integration of the viewpoint mechanism to the conventional object-oriented data model gives it flexibility and allows one to improve the modeling power of objects. The viewpoint paradigm refers to the multiple descriptions, the distribution, and the evolution of object. Also, it can be an undeniable contribution for a distributed design of complex databases. The motivation of this paper is to define an object data model integrating viewpoints in databases and to present a federated database architecture integrating multiple viewpoint sources following a local-as-extended-view data integration approach.object-oriented data model, OQL language, LAEV data integration approach, MVDB model, federated databases, Local-As-View Strategy.

    A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs

    Full text link
    Cyber security is one of the most significant technical challenges in current times. Detecting adversarial activities, prevention of theft of intellectual properties and customer data is a high priority for corporations and government agencies around the world. Cyber defenders need to analyze massive-scale, high-resolution network flows to identify, categorize, and mitigate attacks involving networks spanning institutional and national boundaries. Many of the cyber attacks can be described as subgraph patterns, with prominent examples being insider infiltrations (path queries), denial of service (parallel paths) and malicious spreads (tree queries). This motivates us to explore subgraph matching on streaming graphs in a continuous setting. The novelty of our work lies in using the subgraph distributional statistics collected from the streaming graph to determine the query processing strategy. We introduce a "Lazy Search" algorithm where the search strategy is decided on a vertex-to-vertex basis depending on the likelihood of a match in the vertex neighborhood. We also propose a metric named "Relative Selectivity" that is used to select between different query processing strategies. Our experiments performed on real online news, network traffic stream and a synthetic social network benchmark demonstrate 10-100x speedups over selectivity agnostic approaches.Comment: in 18th International Conference on Extending Database Technology (EDBT) (2015

    Extending, trimming and fusing WordNet for technical documents

    Get PDF
    This paper describes a tool for the automatic extension and trimming of a multilingual WordNet database for cross-lingual retrieval and multilingual ontology building in intranets and domain-specific document collections. Hierarchies, built from automatically extracted terms and combined with the WordNet relations, are trimmed with a disambiguation method based on the document salience of the words in the glosses. The disambiguation is tested in a cross-lingual retrieval task, showing considerable improvement (7%-11%). The condensed hierarchies can be used as browse-interfaces to the documents complementary to retrieval
    • …
    corecore