290 research outputs found
On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark
Querying very large RDF data sets in an efficient manner requires a
sophisticated distribution strategy. Several innovative solutions have recently
been proposed for optimizing data distribution with predefined query workloads.
This paper presents an in-depth analysis and experimental comparison of five
representative and complementary distribution approaches. For achieving fair
experimental results, we are using Apache Spark as a common parallel computing
framework by rewriting the concerned algorithms using the Spark API. Spark
provides guarantees in terms of fault tolerance, high availability and
scalability which are essential in such systems. Our different implementations
aim to highlight the fundamental implementation-independent characteristics of
each approach in terms of data preparation, load balancing, data replication
and to some extent to query answering cost and performance. The presented
measures are obtained by testing each system on one synthetic and one
real-world data set over query workloads with differing characteristics and
different partitioning constraints.Comment: 16 pages, 3 figure
Proceedings of the Automated Reasoning Workshop (ARW 2019)
Preface
This volume contains the proceedings of ARW 2019, the twenty sixths Workshop on Automated Rea-
soning (2nd{3d September 2019) hosted by the Department of Computer Science, Middlesex University,
England (UK). Traditionally, this annual workshop which brings together, for a two-day intensive pro-
gramme, researchers from different areas of automated reasoning, covers both traditional and emerging
topics, disseminates achieved results or work in progress. During informal discussions at workshop ses-
sions, the attendees, whether they are established in the Automated Reasoning community or are only at
their early stages of their research career, gain invaluable feedback from colleagues. ARW always looks
at the ways of strengthening links between academia, industry and government; between theoretical and
practical advances. The 26th ARW is affiliated with TABLEAUX 2019 conference.
These proceedings contain forteen extended abstracts contributed by the participants of the workshop
and assembled in order of their presentations at the workshop. The abstracts cover a wide range of topics
including the development of reasoning techniques for Agents, Model-Checking, Proof Search for classical
and non-classical logics, Description Logics, development of Intelligent Prediction Models, application of
Machine Learning to theorem proving, applications of AR in Cloud Computing and Networking.
I would like to thank the members of the ARW Organising Committee for their advice and assis-
tance. I would also like to thank the organisers of TABLEAUX/FroCoS 2019, and Andrei Popescu, the
TABLEAUX Conference Chair, in particular, for the enormous work related to the organisation of this
affiliation. I would also like to thank Natalia Yerashenia for helping in preparing these proceedings.
London Alexander Bolotov
September 201
Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data
Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post- query analysis is required.
To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems.
In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed necessary social context information for efficient queries.
The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN, and make novel use of customized indices in developing sophisticated analysis algorithms.
In the streaming analysis module, the high-dimensional data representation of social media streams poses special challenges to the problem of parallel stream clustering. Due to the sparsity of the high-dimensional data, traditional synchronization method becomes expensive and severely impacts the scalability of the algorithm. Therefore, we design a novel strategy that broadcasts the incremental changes rather than the whole centroids of the clusters to achieve scalable parallel stream clustering algorithms.
Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies
Study and optimization of the memory management in Memcached
Over the years the Internet has become more popular than ever and web applications
like Facebook and Twitter are gaining more users. This results in generation of more and
more data by the users which has to be efficiently managed, because access speed is an
important factor nowadays, a user will not wait no more than three seconds for a web
page to load before abandoning the site. In-memory key-value stores like Memcached
and Redis are used to speed up web applications by speeding up access to the data by
decreasing the number of accesses to the slower data storage’s. The first implementation
of Memcached, in the LiveJournal’s website, showed that by using 28 instances of Memcached
on ten unique hosts, caching the most popular 30GB of data can achieve a hit rate
around 92%, reducing the number of accesses to the database and reducing the response
time considerably.
Not all objects in cache take the same time to recompute, so this research is going to
study and present a new cost aware memory management that is easy to integrate in a
key-value store, with this approach being implemented in Memcached. The new memory
management and cache will give some priority to key-value pairs that take longer to be
recomputed. Instead of replacing Memcached’s replacement structure and its policy, we
simply add a new segment in each structure that is capable of storing the more costly
key-value pairs. Apart from this new segment in each replacement structure, we created
a new dynamic cost-aware rebalancing policy in Memcached, giving more memory to
store more costly key-value pairs.
With the implementations of our approaches, we were able to offer a prototype that
can be used to research the cost on the caching systems performance. In addition, we
were able to improve in certain scenarios the access latency of the user and the total
recomputation cost of the key-value stored in the system
On hierarchical clustering-based approach for RDDBS design
Distributed database system (DDBS) design is still an open challenge even after decades of research, especially in a dynamic network setting. Hence, to meet the demands of high-speed data gathering and for the management and preservation of huge systems, it is important to construct a distributed database for real-time data storage. Incidentally, some fragmentation schemes, such as horizontal, vertical, and hybrid, are widely used for DDBS design. At the same time, data allocation could not be done without first physically fragmenting the data because the fragmentation process is the foundation of the DDBS design. Extensive research have been conducted to develop effective solutions for DDBS design problems. But the great majority of them barely consider the RDDBS\u27s initial design. Therefore, this work aims at proposing a clustering-based horizontal fragmentation and allocation technique to handle both the early and late stages of the DDBS design. To ensure that each operation flows into the next without any increase in complexity, fragmentation and allocation are done simultaneously. With this approach, the main goals are to minimize communication expenses, response time, and irrelevant data access. Most importantly, it has been observed that the proposed approach may effectively expand RDDBS performance by simultaneously fragmenting and assigning various relations. Through simulations and experiments on synthetic and real databases, we demonstrate the viability of our strategy and how it considerably lowers communication costs for typical access patterns at both the early and late stages of design
- …