181 research outputs found

    P-LUPOSDATE: Using Precomputed Bloom Filters to Speed Up SPARQL Processing in the Cloud

    Get PDF
    Increasingly data on the Web is stored in the form of Semantic Web data. Because of today's information overload, it becomes very important to store and query these big datasets in a scalable way and hence in a distributed fashion. Cloud Computing offers such a distributed environment with dynamic reallocation of computing and storing resources based on needs. In this work we introduce a scalable distributed Semantic Web database in the Cloud. In order to reduce the number of (unnecessary) intermediate results early, we apply bloom filters. Instead of computing bloom filters, a time-consuming task during query processing as it has been done traditionally, we precompute the bloom filters as much as possible and store them in the indices besides the data. The experimental results with data sets up to 1 billion triples show that our approach speeds up query processing significantly and sometimes even reduces the processing time to less than half

    Efficient processing of similarity queries with applications

    Get PDF
    Today, a myriad of data sources, from the Internet to business operations to scientific instruments, produce large and different types of data. Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, call for identifying and processing similarities in big data. As a result, it is imperative to develop new similarity query processing approaches and systems that scale from low dimensional data to high dimensional data, from single machine to clusters of hundreds of machines, and from disk-based to memory-based processing. This dissertation introduces and studies several similarity-aware query operators, analyzes and optimizes their performance. The first contribution of this dissertation is an SQL-based Similarity Group-by operator (SGB, for short) that extends the semantics of the standard SQL Group-by operator to group data with similar but not necessarily equal values. We realize these SGB operators by extending the Standard SQL Group-by and introduce two new SGB operators for multi-dimensional data. We implement and test the new SGB operators and their algorithms inside an open-source centralized database server (PostgreSQL). In the second contribution of this dissertation, we study how to efficiently process Hamming-distance-based similarity queries (Hamming-distance select and Hamming-distance join) that are crucial to many applications. We introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries (namely, the selects and joins). In the third and last contribution of this dissertation, we develop a system for similarity query processing and optimization in an in-memory and distributed setup for big spatial data. We propose a query scheduler and a distributed query optimizer that use a new cost model to optimize the cost of similarity query processing in this in-memory distributed setup. The scheduler and query optimizer generates query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on bloom filters to forward queries to the appropriate local sites. The proposed query processing and optimization techniques are prototyped inside Spark, a distributed main-memory computation system

    Processing Rank-Aware Queries in Schema-Based P2P Systems

    Get PDF
    Effiziente Anfragebearbeitung in Datenintegrationssystemen sowie in P2P-Systemen ist bereits seit einigen Jahren ein Aspekt aktueller Forschung. Konventionelle Datenintegrationssysteme bestehen aus mehreren Datenquellen mit ggf. unterschiedlichen Schemata, sind hierarchisch aufgebaut und besitzen eine zentrale Komponente: den Mediator, der ein globales Schema verwaltet. Anfragen an das System werden auf diesem globalen Schema formuliert und vom Mediator bearbeitet, indem relevante Daten von den Datenquellen transparent für den Benutzer angefragt werden. Aufbauend auf diesen Systemen entstanden schließlich Peer-Daten-Management-Systeme (PDMSs) bzw. schemabasierte P2P-Systeme. An einem PDMS teilnehmende Knoten (Peers) können einerseits als Mediatoren agieren andererseits jedoch ebenso als Datenquellen. Darüber hinaus sind diese Peers autonom und können das Netzwerk jederzeit verlassen bzw. betreten. Die potentiell riesige Datenmenge, die in einem derartigen Netzwerk verfügbar ist, führt zudem in der Regel zu sehr großen Anfrageergebnissen, die nur schwer zu bewältigen sind. Daher ist das Bestimmen einer vollständigen Ergebnismenge in vielen Fällen äußerst aufwändig oder sogar unmöglich. In diesen Fällen bietet sich die Anwendung von Top-N- und Skyline-Operatoren, ggf. in Verbindung mit Approximationstechniken, an, da diese Operatoren lediglich diejenigen Datensätze als Ergebnis ausgeben, die aufgrund nutzerdefinierter Ranking-Funktionen am relevantesten für den Benutzer sind. Da durch die Anwendung dieser Operatoren zumeist nur ein kleiner Teil des Ergebnisses tatsächlich dem Benutzer ausgegeben wird, muss nicht zwangsläufig die vollständige Ergebnismenge berechnet werden sondern nur der Teil, der tatsächlich relevant für das Endergebnis ist. Die Frage ist nun, wie man derartige Anfragen durch die Ausnutzung dieser Erkenntnis effizient in PDMSs bearbeiten kann. Die Beantwortung dieser Frage ist das Hauptanliegen dieser Dissertation. Zur Lösung dieser Problemstellung stellen wir effiziente Anfragebearbeitungsstrategien in PDMSs vor, die die charakteristischen Eigenschaften ranking-basierter Operatoren sowie Approximationstechniken ausnutzen. Peers werden dabei sowohl auf Schema- als auch auf Datenebene hinsichtlich der Relevanz ihrer Daten geprüft und dementsprechend in die Anfragebearbeitung einbezogen oder ausgeschlossen. Durch die Heterogenität der Peers werden Techniken zum Umschreiben einer Anfrage von einem Schema in ein anderes nötig. Da existierende Techniken zum Umschreiben von Anfragen zumeist nur konjunktive Anfragen betrachten, stellen wir eine Erweiterung dieser Techniken vor, die Anfragen mit ranking-basierten Anfrageoperatoren berücksichtigt. Da PDMSs dynamische Systeme sind und teilnehmende Peers jederzeit ihre Daten ändern können, betrachten wir in dieser Dissertation nicht nur wie Routing-Indexe verwendet werden, um die Relevanz eines Peers auf Datenebene zu bestimmen, sondern auch wie sie gepflegt werden können. Schließlich stellen wir SmurfPDMS (SiMUlating enviRonment For Peer Data Management Systems) vor, ein System, welches im Rahmen dieser Dissertation entwickelt wurde und alle vorgestellten Techniken implementiert.In recent years, there has been considerable research with respect to query processing in data integration and P2P systems. Conventional data integration systems consist of multiple sources with possibly different schemas, adhere to a hierarchical structure, and have a central component (mediator) that manages a global schema. Queries are formulated against this global schema and the mediator processes them by retrieving relevant data from the sources transparently to the user. Arising from these systems, eventually Peer Data Management Systems (PDMSs), or schema-based P2P systems respectively, have attracted attention. Peers participating in a PDMS can act both as a mediator and as a data source, are autonomous, and might leave or join the network at will. Due to these reasons peers often hold incomplete or erroneous data sets and mappings. The possibly huge amount of data available in such a network often results in large query result sets that are hard to manage. Due to these reasons, retrieving the complete result set is in most cases difficult or even impossible. Applying rank-aware query operators such as top-N and skyline, possibly in conjunction with approximation techniques, is a remedy to these problems as these operators select only those result records that are most relevant to the user. Being aware that in most cases only a small fraction of the complete result set is actually output to the user, retrieving the complete set before evaluating such operators is obviously inefficient. Therefore, the questions we want to answer in this dissertation are how to compute such queries in PDMSs and how to do that efficiently. We propose strategies for efficient query processing in PDMSs that exploit the characteristics of rank-aware queries and optionally apply approximation techniques. A peer's relevance is determined on two levels: on schema-level and on data-level. According to its relevance a peer is either considered for query processing or not. Because of heterogeneity queries need to be rewritten, enabling cooperation between peers that use different schemas. As existing query rewriting techniques mostly consider conjunctive queries only, we present an extension that allows for rewriting queries involving rank-aware query operators. As PDMSs are dynamic systems and peers might update their local data, this dissertation addresses not only the problem of considering such structures within a query processing strategy but also the problem of keeping them up-to-date. Finally, we provide a system-level evaluation by presenting SmurfPDMS (SiMUlating enviRonment For Peer Data Management Systems) -- a system created in the context of this dissertation implementing all presented techniques

    Supporting Complex Queries in P2P Networks

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Congenial Web Search : A Conceptual Framework for Personalized, Collaborative, and Social Peer-to-Peer Retrieval

    Get PDF
    Traditional information retrieval methods fail to address the fact that information consumption and production are social activities. Most Web search engines do not consider the social-cultural environment of users' information needs and the collaboration between users. This dissertation addresses a new search paradigm for Web information retrieval denoted as Congenial Web Search. It emphasizes personalization, collaboration, and socialization methods in order to improve effectiveness. The client-server architecture of Web search engines only allows the consumption of information. A peer-to-peer system architecture has been developed in this research to improve information seeking. Each user is involved in an interactive process to produce meta-information. Based on a personalization strategy on each peer, the user is supported to give explicit feedback for relevant documents. His information need is expressed by a query that is stored in a Peer Search Memory. On one hand, query-document associations are incorporated in a personalized ranking method for repeated information needs. The performance is shown in a known-item retrieval setting. On the other hand, explicit feedback of each user is useful to discover collaborative information needs. A new method for a controlled grouping of query terms, links, and users was developed to maintain Virtual Knowledge Communities. The quality of this grouping represents the effectiveness of grouped terms and links. Both strategies, personalization and collaboration, tackle the problem of a missing socialization among searchers. Finally, a concept for integrated information seeking was developed. This incorporates an integrated representation to improve effectiveness of information retrieval and information filtering. An integrated information retrieval process explores a virtual search network of Peer Search Memories in order to accomplish a reputation-based ranking. In addition, the community structure is considered by an integrated information filtering process. Both concepts have been evaluated and shown to have a better performance than traditional techniques. The methods presented in this dissertation offer the potential towards more transparency, and control of Web search

    A Content-Addressable Network for Similarity Search in Metric Spaces

    Get PDF
    Because of the ongoing digital data explosion, more advanced search paradigms than the traditional exact match are needed for contentbased retrieval in huge and ever growing collections of data produced in application areas such as multimedia, molecular biology, marketing, computer-aided design and purchasing assistance. As the variety of data types is fast going towards creating a database utilized by people, the computer systems must be able to model human fundamental reasoning paradigms, which are naturally based on similarity. The ability to perceive similarities is crucial for recognition, classification, and learning, and it plays an important role in scientific discovery and creativity. Recently, the mathematical notion of metric space has become a useful abstraction of similarity and many similarity search indexes have been developed. In this thesis, we accept the metric space similarity paradigm and concentrate on the scalability issues. By exploiting computer networks and applying the Peer-to-Peer communication paradigms, we build a structured network of computers able to process similarity queries in parallel. Since no centralized entities are used, such architectures are fully scalable. Specifically, we propose a Peer-to-Peer system for similarity search in metric spaces called Metric Content-Addressable Network (MCAN) which is an extension of the well known Content-Addressable Network (CAN) used for hash lookup. A prototype implementation of MCAN was tested on real-life datasets of image features, protein symbols, and text — observed results are reported. We also compared the performance of MCAN with three other, recently proposed, distributed data structures for similarity search in metric spaces

    Cyber Security of Critical Infrastructures

    Get PDF
    Critical infrastructures are vital assets for public safety, economic welfare, and the national security of countries. The vulnerabilities of critical infrastructures have increased with the widespread use of information technologies. As Critical National Infrastructures are becoming more vulnerable to cyber-attacks, their protection becomes a significant issue for organizations as well as nations. The risks to continued operations, from failing to upgrade aging infrastructure or not meeting mandated regulatory regimes, are considered highly significant, given the demonstrable impact of such circumstances. Due to the rapid increase of sophisticated cyber threats targeting critical infrastructures with significant destructive effects, the cybersecurity of critical infrastructures has become an agenda item for academics, practitioners, and policy makers. A holistic view which covers technical, policy, human, and behavioural aspects is essential to handle cyber security of critical infrastructures effectively. Moreover, the ability to attribute crimes to criminals is a vital element of avoiding impunity in cyberspace. In this book, both research and practical aspects of cyber security considerations in critical infrastructures are presented. Aligned with the interdisciplinary nature of cyber security, authors from academia, government, and industry have contributed 13 chapters. The issues that are discussed and analysed include cybersecurity training, maturity assessment frameworks, malware analysis techniques, ransomware attacks, security solutions for industrial control systems, and privacy preservation methods

    Spatial Aspects of Metaphors for Information: Implications for Polycentric System Design

    Get PDF
    This dissertation presents three innovations that suggest an alternative approach to structuring information systems: a multidimensional heuristic workspace, a resonance metaphor for information, and a question-centered approach to structuring information relations. Motivated by the need for space to establish a question-centered learning environment, a heuristic workspace has been designed. Both the question-centered approach to information system design and the workspace have been conceived with the resonance metaphor in mind. This research stemmed from a set of questions aimed at learning how spatial concepts and related factors including geography may play a role in information sharing and public information access. In early stages of this work these concepts and relationships were explored through qualitative analysis of interviews centered on local small group and community users of geospatial data. Evaluation of the interviews led to the conclusion that spatial concepts are pervasive in our language, and they apply equally to phenomena that would be considered physical and geographic as they do to cognitive and social domains. Rather than deriving metaphorically from the physical world to the human, spatial concepts are native to all dimensions of human life. This revised view of the metaphors of space was accompanied by a critical evaluation of the prevailing metaphors for information processes, the conduit and pathway metaphors, which led to the emergence of an alternative, resonance metaphor. Whereas the dominant metaphors emphasized information as object and the movement of objects and people through networks and other limitless information spaces, the resonance metaphor suggests the existence of multiple centers in dynamic proximity relationships. This pointed toward the creation of a space for autonomous problem solving that might be related to other spaces through proximity relationships. It is suggested that a spatial approach involving discrete, discontinuous structures may serve as an alternative to approaches involving movement and transportation. The federation of multiple autonomous problem-solving spaces, toward goals such as establishing communities of questioners, has become an objective of this work. Future work will aim at accomplishing this federation, most likely by means of the IS0 Topic Maps standard or similar semantic networking strategies
    corecore