181 research outputs found
P-LUPOSDATE: Using Precomputed Bloom Filters to Speed Up SPARQL Processing in the Cloud
Increasingly data on the Web is stored in the form of Semantic Web data. Because of today's information overload, it becomes very important to store and query these big datasets in a scalable way and hence in a distributed fashion. Cloud Computing offers such a distributed environment with dynamic reallocation of computing and storing resources based on needs. In this work we introduce a scalable distributed Semantic Web database in the Cloud. In order to reduce the number of (unnecessary) intermediate results early, we apply bloom filters. Instead of computing bloom filters, a time-consuming task during query processing as it has been done traditionally, we precompute the bloom filters as much as possible and store them in the indices besides the data. The experimental results with data sets up to 1 billion triples show that our approach speeds up query processing significantly and sometimes even reduces the processing time to less than half
Efficient processing of similarity queries with applications
Today, a myriad of data sources, from the Internet to business operations to scientific instruments, produce large and different types of data. Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, call for identifying and processing similarities in big data. As a result, it is imperative to develop new similarity query processing approaches and systems that scale from low dimensional data to high dimensional data, from single machine to clusters of hundreds of machines, and from disk-based to memory-based processing. This dissertation introduces and studies several similarity-aware query operators, analyzes and optimizes their performance.
The first contribution of this dissertation is an SQL-based Similarity Group-by operator (SGB, for short) that extends the semantics of the standard SQL Group-by operator to group data with similar but not necessarily equal values. We realize these SGB operators by extending the Standard SQL Group-by and introduce two new SGB operators for multi-dimensional data. We implement and test the new SGB operators and their algorithms inside an open-source centralized database server (PostgreSQL).
In the second contribution of this dissertation, we study how to efficiently process Hamming-distance-based similarity queries (Hamming-distance select and Hamming-distance join) that are crucial to many applications. We introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries (namely, the selects and joins).
In the third and last contribution of this dissertation, we develop a system for similarity query processing and optimization in an in-memory and distributed setup for big spatial data. We propose a query scheduler and a distributed query optimizer that use a new cost model to optimize the cost of similarity query processing in this in-memory distributed setup. The scheduler and query optimizer generates query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on bloom filters to forward queries to the appropriate local sites. The proposed query processing and optimization techniques are prototyped inside Spark, a distributed main-memory computation system
Processing Rank-Aware Queries in Schema-Based P2P Systems
Effiziente Anfragebearbeitung in Datenintegrationssystemen sowie in
P2P-Systemen ist bereits seit einigen Jahren ein Aspekt aktueller
Forschung. Konventionelle Datenintegrationssysteme bestehen aus mehreren
Datenquellen mit ggf. unterschiedlichen Schemata, sind hierarchisch
aufgebaut und besitzen eine zentrale Komponente: den Mediator, der ein
globales Schema verwaltet. Anfragen an das System werden auf diesem
globalen Schema formuliert und vom Mediator bearbeitet, indem relevante
Daten von den Datenquellen transparent für den Benutzer angefragt werden.
Aufbauend auf diesen Systemen entstanden schließlich
Peer-Daten-Management-Systeme (PDMSs) bzw. schemabasierte P2P-Systeme. An
einem PDMS teilnehmende Knoten (Peers) können einerseits als Mediatoren
agieren andererseits jedoch ebenso als Datenquellen. Darüber hinaus sind
diese Peers autonom und können das Netzwerk jederzeit verlassen bzw.
betreten. Die potentiell riesige Datenmenge, die in einem derartigen
Netzwerk verfügbar ist, führt zudem in der Regel zu sehr großen
Anfrageergebnissen, die nur schwer zu bewältigen sind. Daher ist das
Bestimmen einer vollständigen Ergebnismenge in vielen Fällen äußerst
aufwändig oder sogar unmöglich. In diesen Fällen bietet sich die
Anwendung von Top-N- und Skyline-Operatoren, ggf. in Verbindung mit
Approximationstechniken, an, da diese Operatoren lediglich diejenigen
Datensätze als Ergebnis ausgeben, die aufgrund nutzerdefinierter
Ranking-Funktionen am relevantesten für den Benutzer sind. Da durch die
Anwendung dieser Operatoren zumeist nur ein kleiner Teil des Ergebnisses
tatsächlich dem Benutzer ausgegeben wird, muss nicht zwangsläufig die
vollständige Ergebnismenge berechnet werden sondern nur der Teil, der
tatsächlich relevant für das Endergebnis ist.
Die Frage ist nun, wie man derartige Anfragen durch die Ausnutzung dieser
Erkenntnis effizient in PDMSs bearbeiten kann. Die Beantwortung dieser
Frage ist das Hauptanliegen dieser Dissertation. Zur Lösung dieser
Problemstellung stellen wir effiziente Anfragebearbeitungsstrategien in
PDMSs vor, die die charakteristischen Eigenschaften ranking-basierter
Operatoren sowie Approximationstechniken ausnutzen. Peers werden dabei
sowohl auf Schema- als auch auf Datenebene hinsichtlich der Relevanz ihrer
Daten geprüft und dementsprechend in die Anfragebearbeitung einbezogen
oder ausgeschlossen. Durch die Heterogenität der Peers werden Techniken
zum Umschreiben einer Anfrage von einem Schema in ein anderes nötig. Da
existierende Techniken zum Umschreiben von Anfragen zumeist nur konjunktive
Anfragen betrachten, stellen wir eine Erweiterung dieser Techniken vor, die
Anfragen mit ranking-basierten Anfrageoperatoren berücksichtigt. Da PDMSs
dynamische Systeme sind und teilnehmende Peers jederzeit ihre Daten ändern
können, betrachten wir in dieser Dissertation nicht nur wie Routing-Indexe
verwendet werden, um die Relevanz eines Peers auf Datenebene zu bestimmen,
sondern auch wie sie gepflegt werden können. Schließlich stellen wir
SmurfPDMS (SiMUlating enviRonment For Peer Data Management Systems) vor,
ein System, welches im Rahmen dieser Dissertation entwickelt wurde und alle
vorgestellten Techniken implementiert.In recent years, there has been considerable research with respect to query
processing in data integration and P2P systems. Conventional data
integration systems consist of multiple sources with possibly different
schemas, adhere to a hierarchical structure, and have a central component
(mediator) that manages a global schema. Queries are formulated against
this global schema and the mediator processes them by retrieving relevant
data from the sources transparently to the user. Arising from these
systems, eventually Peer Data Management Systems (PDMSs), or schema-based
P2P systems respectively, have attracted attention. Peers participating in
a PDMS can act both as a mediator and as a data source, are autonomous, and
might leave or join the network at will. Due to these reasons peers often
hold incomplete or erroneous data sets and mappings. The possibly huge
amount of data available in such a network often results in large query
result sets that are hard to manage. Due to these reasons, retrieving the
complete result set is in most cases difficult or even impossible. Applying
rank-aware query operators such as top-N and skyline, possibly in
conjunction with approximation techniques, is a remedy to these problems as
these operators select only those result records that are most relevant to
the user. Being aware that in most cases only a small fraction of the
complete result set is actually output to the user, retrieving the complete
set before evaluating such operators is obviously inefficient.
Therefore, the questions we want to answer in this dissertation are how to
compute such queries in PDMSs and how to do that efficiently. We propose
strategies for efficient query processing in PDMSs that exploit the
characteristics of rank-aware queries and optionally apply approximation
techniques. A peer's relevance is determined on two levels: on schema-level
and on data-level. According to its relevance a peer is either considered
for query processing or not. Because of heterogeneity queries need to be
rewritten, enabling cooperation between peers that use different schemas.
As existing query rewriting techniques mostly consider conjunctive queries
only, we present an extension that allows for rewriting queries involving
rank-aware query operators. As PDMSs are dynamic systems and peers might
update their local data, this dissertation addresses not only the problem
of considering such structures within a query processing strategy but also
the problem of keeping them up-to-date. Finally, we provide a system-level
evaluation by presenting SmurfPDMS (SiMUlating enviRonment For Peer Data
Management Systems) -- a system created in the context of this dissertation
implementing all presented techniques
Supporting Complex Queries in P2P Networks
Ph.DDOCTOR OF PHILOSOPH
Congenial Web Search : A Conceptual Framework for Personalized, Collaborative, and Social Peer-to-Peer Retrieval
Traditional information retrieval methods fail to address the fact that information consumption and production are social activities. Most Web search engines do not consider the social-cultural environment of users' information needs and the collaboration between users. This dissertation addresses a new search paradigm for Web information retrieval denoted as Congenial Web Search. It emphasizes personalization, collaboration, and socialization methods in order to improve effectiveness. The client-server architecture of Web search engines only allows the consumption of information. A peer-to-peer system architecture has been developed in this research to improve information seeking. Each user is involved in an interactive process to produce meta-information. Based on a personalization strategy on each peer, the user is supported to give explicit feedback for relevant documents. His information need is expressed by a query that is stored in a Peer Search Memory. On one hand, query-document associations are incorporated in a personalized ranking method for repeated information needs. The performance is shown in a known-item retrieval setting. On the other hand, explicit feedback of each user is useful to discover collaborative information needs. A new method for a controlled grouping of query terms, links, and users was developed to maintain Virtual Knowledge Communities. The quality of this grouping represents the effectiveness of grouped terms and links. Both strategies, personalization and collaboration, tackle the problem of a missing socialization among searchers. Finally, a concept for integrated information seeking was developed. This incorporates an integrated representation to improve effectiveness of information retrieval and information filtering. An integrated information retrieval process explores a virtual search network of Peer Search Memories in order to accomplish a reputation-based ranking. In addition, the community structure is considered by an integrated information filtering process. Both concepts have been evaluated and shown to have a better performance than traditional techniques. The methods presented in this dissertation offer the potential towards more transparency, and control of Web search
Recommended from our members
Wireless mosaic eyes based robot path planning and control. Autonomous robot navigation using environment intelligence with distributed vision sensors.
As an attempt to steer away from developing an autonomous robot with complex centralised intelligence, this thesis proposes an intelligent environment infrastructure where intelligences are distributed in the environment through collaborative vision sensors mounted in a physical architecture, forming a wireless sensor network, to enable the navigation of unintelligent robots within that physical architecture. The aim is to avoid the bottleneck of centralised robot intelligence that hinders the application and exploitation of autonomous robot. A bio-mimetic snake algorithm is proposed to coordinate the distributed vision sensors for the generation of a collision free Reference-snake (R-snake) path during the path planning process. By following the R-snake path, a novel Accompanied snake (A-snake) method that complies with the robot's nonholonomic constraints for trajectory generation and motion control is introduced to generate real time robot motion commands to navigate the robot from its current position to the target position. A rolling window optimisation mechanism subject to control input saturation constraints is carried out for time-optimal control along the A-snake. A comprehensive simulation software and a practical distributed intelligent environment with vision sensors mounted on a building ceiling are developed. All the algorithms proposed in this thesis are first verified by the simulation and then implemented in the practical intelligent environment. A model car with less on-board intelligence is successfully controlled by the distributed vision sensors and demonstrated superior mobility
A Content-Addressable Network for Similarity Search in Metric Spaces
Because of the ongoing digital data explosion, more advanced search paradigms than the traditional exact match are needed for contentbased retrieval in huge and ever growing collections of data produced in application areas such as multimedia, molecular biology, marketing, computer-aided design and purchasing assistance. As the variety of data types is fast going towards creating a database utilized by people, the computer systems must be able to model human fundamental reasoning paradigms, which are naturally based on similarity. The ability to perceive similarities is crucial for recognition, classification, and learning, and it plays an important role in scientific discovery and creativity. Recently, the mathematical notion of metric space has become a useful abstraction of similarity and many similarity search indexes have been developed.
In this thesis, we accept the metric space similarity paradigm and concentrate on the scalability issues. By exploiting computer networks and applying the Peer-to-Peer communication paradigms, we build a structured network of computers able to process similarity queries in parallel. Since no centralized entities are used, such architectures are fully scalable. Specifically, we propose a Peer-to-Peer system for similarity search in metric spaces called Metric Content-Addressable Network (MCAN) which is an extension of the well known Content-Addressable Network (CAN) used for hash lookup. A prototype implementation of MCAN was tested on real-life datasets of image features, protein symbols, and text — observed results are reported. We also compared the performance of MCAN with three other, recently proposed, distributed data structures for similarity search in metric spaces
Cyber Security of Critical Infrastructures
Critical infrastructures are vital assets for public safety, economic welfare, and the national security of countries. The vulnerabilities of critical infrastructures have increased with the widespread use of information technologies. As Critical National Infrastructures are becoming more vulnerable to cyber-attacks, their protection becomes a significant issue for organizations as well as nations. The risks to continued operations, from failing to upgrade aging infrastructure or not meeting mandated regulatory regimes, are considered highly significant, given the demonstrable impact of such circumstances. Due to the rapid increase of sophisticated cyber threats targeting critical infrastructures with significant destructive effects, the cybersecurity of critical infrastructures has become an agenda item for academics, practitioners, and policy makers. A holistic view which covers technical, policy, human, and behavioural aspects is essential to handle cyber security of critical infrastructures effectively. Moreover, the ability to attribute crimes to criminals is a vital element of avoiding impunity in cyberspace. In this book, both research and practical aspects of cyber security considerations in critical infrastructures are presented. Aligned with the interdisciplinary nature of cyber security, authors from academia, government, and industry have contributed 13 chapters. The issues that are discussed and analysed include cybersecurity training, maturity assessment frameworks, malware analysis techniques, ransomware attacks, security solutions for industrial control systems, and privacy preservation methods
Spatial Aspects of Metaphors for Information: Implications for Polycentric System Design
This dissertation presents three innovations that suggest an alternative approach to structuring information systems: a multidimensional heuristic workspace, a resonance metaphor for information, and a question-centered approach to structuring information relations. Motivated by the need for space to establish a question-centered learning environment, a heuristic workspace has been designed. Both the question-centered approach to information system design and the workspace have been conceived with the resonance metaphor in mind. This research stemmed from a set of questions aimed at learning how spatial concepts and related factors including geography may play a role in information sharing and public information access. In early stages of this work these concepts and relationships were explored through qualitative analysis of interviews centered on local small group and community users of geospatial data. Evaluation of the interviews led to the conclusion that spatial concepts are pervasive in our language, and they apply equally to phenomena that would be considered physical and geographic as they do to cognitive and social domains. Rather than deriving metaphorically from the physical world to the human, spatial concepts are native to all dimensions of human life. This revised view of the metaphors of space was accompanied by a critical evaluation of the prevailing metaphors for information processes, the conduit and pathway metaphors, which led to the emergence of an alternative, resonance metaphor. Whereas the dominant metaphors emphasized information as object and the movement of objects and people through networks and other limitless information spaces, the resonance metaphor suggests the existence of multiple centers in dynamic proximity relationships. This pointed toward the creation of a space for autonomous problem solving that might be related to other spaces through proximity relationships. It is suggested that a spatial approach involving discrete, discontinuous structures may serve as an alternative to approaches involving movement and transportation. The federation of multiple autonomous problem-solving spaces, toward goals such as establishing communities of questioners, has become an objective of this work. Future work will aim at accomplishing this federation, most likely by means of the IS0 Topic Maps standard or similar semantic networking strategies
- …