48 research outputs found
Recent Advances in Industrial and Applied Mathematics
This open access book contains review papers authored by thirteen plenary invited speakers to the 9th International Congress on Industrial and Applied Mathematics (Valencia, July 15-19, 2019). Written by top-level scientists recognized worldwide, the scientific contributions cover a wide range of cutting-edge topics of industrial and applied mathematics: mathematical modeling, industrial and environmental mathematics, mathematical biology and medicine, reduced-order modeling and cryptography. The book also includes an introductory chapter summarizing the main features of the congress. This is the first volume of a thematic series dedicated to research results presented at ICIAM 2019-Valencia Congress
The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes
In the current context of Big Data, a multitude of new NoSQL solutions for
storing, managing, and extracting information and patterns from semi-structured
data have been proposed and implemented. These solutions were developed to
relieve the issue of rigid data structures present in relational databases, by
introducing semi-structured and flexible schema design. As current data
generated by different sources and devices, especially from IoT sensors and
actuators, use either XML or JSON format, depending on the application,
database technologies that store and query semi-structured data in XML format
are needed. Thus, Native XML Databases, which were initially designed to
manipulate XML data using standardized querying languages, i.e., XQuery and
XPath, were rebranded as NoSQL Document-Oriented Databases Systems. Currently,
the majority of these solutions have been replaced with the more modern JSON
based Database Management Systems. However, we believe that XML-based solutions
can still deliver performance in executing complex queries on heterogeneous
collections. Unfortunately nowadays, research lacks a clear comparison of the
scalability and performance for database technologies that store and query
documents in XML versus the more modern JSON format. Moreover, to the best of
our knowledge, there are no Big Data-compliant benchmarks for such database
technologies. In this paper, we present a comparison for selected
Document-Oriented Database Systems that either use the XML format to encode
documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB,
CouchDB, and Couchbase. To underline the performance differences we also
propose a benchmark that uses a heterogeneous complex schema on a large DBLP
corpus.Comment: 28 pages, 6 figures, 7 table
Efficient processing of large-scale spatio-temporal data
Millionen Geräte, wie z.B. Mobiltelefone, Autos und Umweltsensoren senden ihre Positionen zusammen mit einem Zeitstempel und weiteren Nutzdaten an einen Server zu verschiedenen Analysezwecken. Die Positionsinformationen und übertragenen Ereignisinformationen werden als Punkte oder Polygone dargestellt. Eine weitere Art räumlicher Daten sind Rasterdaten, die zum Beispiel von Kameras und Sensoren produziert werden. Diese großen räumlich-zeitlichen Datenmengen können nur auf skalierbaren Plattformen wie Hadoop und Apache Spark verarbeitet werden, die jedoch z.B. die Nachbarschaftsinformation nicht ausnutzen können - was die Ausführung bestimmter Anfragen praktisch unmöglich macht. Die wiederholten Ausführungen der Analyseprogramme während ihrer Entwicklung und durch verschiedene Nutzer resultieren in langen Ausführungszeiten und hohen Kosten für gemietete Ressourcen, die durch die Wiederverwendung von Zwischenergebnissen reduziert werden können. Diese Arbeit beschäftigt sich mit den beiden oben beschriebenen Herausforderungen. Wir präsentieren zunächst das STARK Framework für die Verarbeitung räumlich-zeitlicher Vektor- und Rasterdaten in Apache Spark. Wir identifizieren verschiedene Algorithmen für Operatoren und analysieren, wie diese von den Eigenschaften der zugrundeliegenden Plattform profitieren können. Weiterhin wird untersucht, wie Indexe in der verteilten und parallelen Umgebung realisiert werden können. Außerdem vergleichen wir Partitionierungsmethoden, die unterschiedlich gut mit ungleichmäßiger Datenverteilung und der Größe der Datenmenge umgehen können und präsentieren einen Ansatz um die auf Operatorebene zu verarbeitende Datenmenge frühzeitig zu reduzieren. Um die Ausführungszeit von Programmen zu verkürzen, stellen wir einen Ansatz zur transparenten Materialisierung von Zwischenergebnissen vor. Dieser Ansatz benutzt ein Entscheidungsmodell, welches auf den tatsächlichen Operatorkosten basiert. In der Evaluierung vergleichen wir die verschiedenen Implementierungs- sowie Konfigurationsmöglichkeiten in STARK und identifizieren Szenarien wann Partitionierung und Indexierung eingesetzt werden sollten. Außerdem vergleichen wir STARK mit verwandten Systemen. Im zweiten Teil der Evaluierung zeigen wir, dass die transparente Wiederverwendung der materialisierten Zwischenergebnisse die Ausführungszeit der Programme signifikant verringern kann.Millions of location-aware devices, such as mobile phones, cars, and environmental sensors constantly report their positions often in combination with a timestamp to a server for different kinds of analyses. While the location information of the devices and reported events is represented as points and polygons, raster data is another type of spatial data, which is for example produced by cameras and sensors. This Big spatio-temporal Data needs to be processed on scalable platforms, such as Hadoop and Apache Spark, which, however, are unaware of, e.g., spatial neighborhood, what makes them practically impossible to use for this kind of data.
The repeated executions of the programs during development and by different users result in long execution times and potentially high costs in rented clusters, which can be reduced by reusing commonly computed intermediate results.
Within this thesis, we tackle the two challenges described above. First, we present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack. For operators, we identify several possible algorithms and study how they can benefit from the underlying platform's properties. We further investigate how indexes can be realized in the distributed and parallel architecture of Big Data processing engines and compare methods for data partitioning, which perform differently well with respect to data skew and data set size. Furthermore, an approach to reduce the amount of data to process at operator level is presented. In order to reduce the execution times, we introduce an approach to transparently recycle intermediate results of dataflow programs, based on operator costs. To compute the costs, we instrument the programs with profiling code to gather the execution time and result size of the operators.
In the evaluation, we first compare the various implementation and configuration possibilities in STARK and identify scenarios when and how partitioning and indexing should be applied. We further compare STARK to related systems and show that we can achieve significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that with the transparent cost-based materialization and recycling of intermediate results, the execution times of programs can be reduced significantly
On the Application of Formal Techniques for Dependable Concurrent Systems
The pervasiveness of computer systems in virtually every aspect of daily life entails a growing dependence on them. These systems have become integral parts of our societies as we continue to use and rely on them on a daily basis. This trend of digitalization is set to carry on, bringing forth the question of how dependable these systems are. Our dependence on these systems is in acute need for a justification based on rigorous and systematic methods as recommended by internationally recognized safety standards. Ensuring that the systems we depend on meet these recommendations is further complicated by the increasingly widespread use of concurrent systems, which are notoriously hard to analyze due to the substantial increase in complexity that the interactions between different processing entities engenders.
In this thesis, we introduce improvements on existing formal analysis techniques to aid in the development of dependable concurrent systems. Applying formal analysis techniques can help us avoid incidents with catastrophic consequences by uncovering their triggering causes well in advance. This work focuses on three types of analyses: data-flow analysis, model checking and error propagation analysis. Data-flow analysis is a general static analysis technique aimed at predicting the values that variables can take at various points in a program. Model checking is a well-established formal analysis technique that verifies whether a program satisfies its specification. Error propagation analysis (EPA) is a dynamic analysis whose purpose is to assess a program's ability to withstand unexpected behaviors of external components. We leverage data-flow analysis to assist in the design of highly available distributed applications. Given an application, our analysis infers rules to distribute its workload across multiple machines, improving the availability of the overall system. Furthermore, we propose improvements to both explicit and bounded model checking techniques by exploiting the structure of the specification under consideration. The core idea behind these improvements lies in the ability to abstract away aspects of the program that are not relevant to the specification, effectively shortening the verification time. Finally, we present a novel approach to EPA based on symbolic modeling of execution traces. The symbolic scheme uses a dynamic sanitizing algorithm to eliminate effects of non-determinism in the execution traces of multi-threaded programs.The proposed approach is the first to achieve a 0% rate of false positives for multi-threaded programs.
The work in this thesis constitutes an improvement over existing formal analysis techniques that can aid in the development of dependable concurrent systems, particularly with respect to availability and safety
Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools
This dissertation focuses on two fundamental sorting problems: string sorting
and suffix sorting. The first part considers parallel string sorting on
shared-memory multi-core machines, the second part external memory suffix
sorting using the induced sorting principle, and the third part distributed
external memory suffix sorting with a new distributed algorithmic big data
framework named Thrill.Comment: 396 pages, dissertation, Karlsruher Instituts f\"ur Technologie
(2018). arXiv admin note: text overlap with arXiv:1101.3448 by other author