    Platforms for deployment of scalable on- and off-line data analytics.

    The ability to exploit the intelligence concealed in bulk data to generate actionable insights is increasingly providing competitive advantages to businesses, government agencies, and charitable organisations. The burgeoning field of Data Science, and its related applications in the field of Data Analytics, finds broader applicability with each passing year. This expansion of users and applications is matched by an explosion in tools, platforms, and techniques designed to exploit more types of data, in larger volumes, and at higher frequencies than ever before. This diversity in platforms and tools presents a new challenge for organisations aiming to integrate Data Science into their daily operations. Designing an analytic for a particular platform necessarily involves “lock-in” to that specific implementation, leaving few opportunities for algorithmic portability. It is also increasingly challenging to find engineers who have experience with the diverse suite of available tools and who also understand the precise details of the domain in which they work: the semantics of the data, the nature of the queries and analyses to be executed, and the interpretation and presentation of results. The work presented in this thesis addresses these challenges by introducing a number of techniques that facilitate the creation of analytics for equivalent deployment across a variety of runtime frameworks and capabilities. In the first instance, this capability is demonstrated using the first Domain Specific Language and associated runtime environments to target multiple best-in-class frameworks for data analysis from both the streaming and off-line paradigms. This capability is extended with a new approach to modelling analytics based around a semantically rich type system. An analytic planner using this model is detailed, empowering domain experts to build their own scalable analyses without any specific programming or distributed-systems knowledge. This planning technique is then used to assemble complex ensembles of hybrid analytics, automatically applying multiple frameworks within a single workflow. Finally, this thesis demonstrates a novel approach to the speculative construction, compilation, and deployment of analytic jobs based on the observation of user interactions with an analytic planning system.
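    As a purely illustrative sketch of this idea (not the thesis's actual DSL or planner), the following Python fragment declares an analytic once as data and has a toy "planner" interpret the same declaration on either a batch-style or a streaming-style backend; the names Analytic, plan_for, and the backend labels are invented for this example.

```python
# Hypothetical illustration only: a tiny declarative "analytic" that can be
# planned onto different execution styles. None of these names come from the thesis.
from dataclasses import dataclass, field

@dataclass
class Analytic:
    steps: list = field(default_factory=list)

    def map(self, fn):        # declare a transformation; nothing executes yet
        self.steps.append(("map", fn)); return self

    def filter(self, pred):   # declare a predicate; nothing executes yet
        self.steps.append(("filter", pred)); return self

def plan_for(analytic, backend, data):
    """Toy planner: interpret the same pipeline on a 'batch' or 'streaming' backend."""
    if backend == "batch":                       # bulk, collection-at-a-time execution
        out = list(data)
        for kind, fn in analytic.steps:
            out = list(map(fn, out)) if kind == "map" else list(filter(fn, out))
        return out
    if backend == "streaming":                   # record-at-a-time execution
        def run(stream):
            for rec in stream:
                keep = True
                for kind, fn in analytic.steps:
                    if kind == "map":
                        rec = fn(rec)
                    elif not fn(rec):
                        keep = False
                        break
                if keep:
                    yield rec
        return run(data)
    raise ValueError(f"unknown backend: {backend}")

# The same analytic, "deployed" on two backends, yields the same result.
a = Analytic().map(lambda x: x * 2).filter(lambda x: x > 4)
print(plan_for(a, "batch", range(5)))            # [6, 8]
print(list(plan_for(a, "streaming", range(5))))  # [6, 8]
```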

    Methodological approaches and techniques for designing ontologies in information systems requirements engineering

    Doctoral program in Information Systems and Technology.
    The way we interact with the world around us is changing as new challenges arise, embracing innovative business models, rethinking the organization and its processes to maximize results, and evolving change management. Currently, and considering the projects executed, the methodologies in use do not fully respond to companies' needs. On the one hand, organizations are not familiar with the languages used in Information Systems; on the other hand, they are often unable to validate requirements or business models. These are some of the difficulties that led us to formulate a new approach. Accordingly, the state of the art presented in this thesis includes a study of the models involved in the software development process, covering both traditional methods and their agile rivals. In addition, a survey of ontologies is presented, together with the methods that exist to conceive, transform, and represent them. After analyzing several of the possibilities currently available, we began the process of evolving a method and developing an approach for designing ontologies. The method we evolved and adapted derives terminologies from a specific domain and aggregates them to facilitate the construction of a catalog of terminologies. Next, the definition of an approach for designing ontologies enables the construction of a domain-specific ontology. In the first instance, this approach integrates and stores the data from the different information systems of a given organization; in the second, it defines the rules for mapping and building the ontology database. Finally, a technological architecture is also proposed that maps an ontology through the construction of complex networks, allowing terminologies to be mapped and related. This doctoral work encompasses numerous Research & Development (R&D) projects from different domains, such as the Software Industry, Textile Industry, Robotic Industry, and Smart Cities. Finally, a critical and descriptive analysis of the work carried out is performed, and perspectives for possible future work are pointed out.
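    As a hypothetical Python sketch of two notions from this abstract, a catalog of domain terminologies gathered from several information systems and simple mapping rules that relate those terms into ontology-like triples, the fragment below uses invented names that do not come from the thesis.

```python
# Hypothetical illustration only: a terminology catalog plus trivial "mapping rules"
# that relate known terms into subject-relation-object triples.
from dataclasses import dataclass, field

@dataclass
class TerminologyCatalog:
    terms: dict = field(default_factory=dict)      # term -> set of source systems
    relations: list = field(default_factory=list)  # (term_a, relation, term_b) triples

    def add_term(self, term, source):
        """Register a term observed in one of the organization's information systems."""
        self.terms.setdefault(term.lower(), set()).add(source)

    def apply_mapping_rule(self, a, relation, b):
        """A minimal mapping rule: if both terms are catalogued, record the relation."""
        a, b = a.lower(), b.lower()
        if a in self.terms and b in self.terms:
            self.relations.append((a, relation, b))

catalog = TerminologyCatalog()
catalog.add_term("Customer", source="CRM")
catalog.add_term("Client", source="Billing")
catalog.add_term("Invoice", source="Billing")
catalog.apply_mapping_rule("Customer", "sameAs", "Client")
catalog.apply_mapping_rule("Client", "receives", "Invoice")
print(catalog.relations)
# [('customer', 'sameAs', 'client'), ('client', 'receives', 'invoice')]
```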

    Systems and Algorithms for Dynamic Graph Processing

    Data generated from human and system interactions can be naturally represented as graph data. Several emerging applications rely on graph data, such as the semantic web, social networks, bioinformatics, finance, and trading, among others. These applications require graph querying capabilities, which are often implemented in graph database management systems (GDBMSs). Many GDBMSs can evaluate one-time versions of recursive or subgraph queries over static graphs, that is, graphs that do not change, or over a single snapshot of a changing graph. They generally do not support incrementally maintaining queries as graphs change. However, most applications that employ graphs are dynamic in nature, resulting in graphs that change over time, also known as dynamic graphs. This thesis investigates how to build a generic and scalable incremental computation solution that is oblivious to graph workloads. It focuses on two fundamental computations performed by many applications: recursive queries and subgraph queries. Specifically, for subgraph queries, this thesis presents the first approach that (i) performs joins with worst-case optimal computation and communication costs; and (ii) maintains a total memory footprint almost linear in the number of input edges. For recursive queries, this thesis studies optimizations for using differential computation (DC). DC is a general incremental computation technique that can maintain the output of a recursive dataflow computation upon changes. However, it requires a prohibitively large amount of memory because it maintains differences that track changes in the queries' inputs and outputs. The thesis proposes a suite of optimizations based on reducing the number of these differences and recomputing them when necessary. The techniques and optimizations in this thesis, for subgraph and recursive computations, represent a proposal for how to build a state-of-the-art, generic, and scalable GDBMS for dynamic graph data management.
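    As a minimal illustration of incrementally maintaining a subgraph query over a dynamic graph (a toy example, not the systems proposed in the thesis), the Python sketch below keeps a running triangle count up to date under edge insertions instead of recomputing it from scratch: each inserted edge adds one triangle per common neighbour of its endpoints.

```python
# Toy incremental maintenance of a subgraph query (triangle counting) under edge insertions.
from collections import defaultdict

class DynamicTriangleCount:
    def __init__(self):
        self.adj = defaultdict(set)   # adjacency sets of the undirected graph
        self.triangles = 0            # maintained query result

    def insert_edge(self, u, v):
        if u == v or v in self.adj[u]:
            return self.triangles     # ignore self-loops and duplicate edges
        # Every common neighbour of u and v closes exactly one new triangle.
        self.triangles += len(self.adj[u] & self.adj[v])
        self.adj[u].add(v)
        self.adj[v].add(u)
        return self.triangles

g = DynamicTriangleCount()
for edge in [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]:
    print(edge, "->", g.insert_edge(*edge))
# the maintained count becomes 1 after inserting (1, 3) and 2 after inserting (2, 4)
```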

    Communication-Efficient Probabilistic Algorithms: Selection, Sampling, and Checking

    This dissertation addresses three fundamental classes of problems in big-data systems, for which we develop communication-efficient probabilistic algorithms. The first part considers several selection problems, the second part weighted sampling, and the third part the probabilistic checking of the correctness of basic operations in big-data frameworks (checking). The work is motivated by a growing need for communication efficiency: the share of supercomputer acquisition costs and energy consumption attributable to the network and its use, as well as its share of the running time of distributed applications, keeps growing. Surprisingly few communication-efficient algorithms are known for fundamental big-data problems, and in this work we close some of these gaps.
    We first consider several selection problems, beginning with the distributed version of the classical selection problem, i.e., finding the element of rank k in a large distributed input. We show how this problem can be solved communication-efficiently without assuming that the input elements are randomly distributed; to this end, we replace the pivot-selection method in a long-known algorithm and show that this suffices. We then show that selection from locally sorted sequences (multisequence selection) can be solved substantially faster if the exact rank of the output element is allowed to vary within a certain range. We use this result to construct a distributed priority queue with bulk operations, which we later employ to draw weighted samples from data streams (reservoir sampling). Finally, we consider the problem of identifying, with a sampling-based approach, the globally most frequent objects as well as those whose associated values sum to the largest totals.
    In the chapter on weighted sampling, we first present new construction algorithms for a classical data structure for this problem, so-called alias tables. We begin with the first linear-time construction algorithm for this data structure that requires only constant auxiliary space, and we then parallelise it for shared memory, obtaining the first parallel construction algorithm for alias tables. We then show how the problem can be tackled on distributed systems with a two-level algorithm. Next, we present an output-sensitive algorithm for weighted sampling with replacement, where output-sensitive means that the running time depends on the number of unique elements in the output rather than on the size of the sample. This algorithm can be used sequentially, on shared-memory machines, and on distributed systems, and is the first such algorithm in all three settings. We then adapt it to weighted sampling without replacement by combining it with an estimator for the number of unique elements in a sample drawn with replacement. Poisson sampling, a generalisation of Bernoulli sampling to weighted elements, can be reduced to integer sorting, and we show how an existing approach can be parallelised. For sampling from data streams, we adapt a sequential algorithm and show how it can be parallelised in a mini-batch model using the bulk priority queue introduced in the selection chapter. The chapter concludes with an extensive evaluation of our alias-table construction algorithms, our output-sensitive algorithm for weighted sampling with replacement, and our algorithm for weighted reservoir sampling.
    To verify the correctness of distributed algorithms probabilistically, we propose checkers for basic operations of big-data frameworks. We show that checking numerous operations can be reduced to two “core” checkers: checking aggregations, and checking whether one sequence is a permutation of another. While several approaches to the latter problem have been known for some time and are easy to parallelise, our sum-aggregation checker is a novel application of the same data structure that underlies counting Bloom filters and the count-min sketch. We implemented both checkers in Thrill, a big-data framework. Experiments with deliberately injected errors confirm the detection accuracy predicted by our theoretical analysis, even when commonly used fast hash functions with theoretically suboptimal properties are employed. Scaling experiments on a supercomputer show that our checkers incur only a very small runtime overhead, in the range of 2%, while the correctness of the result is almost guaranteed.
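    As a small, self-contained illustration of the alias-table data structure discussed in the sampling chapter, the Python sketch below shows the classical Walker/Vose construction and O(1) sampling; it is not the dissertation's new constant-extra-memory, parallel, or distributed variants.

```python
# Classical alias-table construction (Walker/Vose) for O(1) weighted sampling.
import random

def build_alias_table(weights):
    n = len(weights)
    total = sum(weights)
    prob = [w * n / total for w in weights]   # scaled so the average weight is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                  # the remaining mass of bucket s is served by l
        prob[l] -= 1.0 - prob[s]      # l donated (1 - prob[s]) of its own mass
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def sample(prob, alias):
    """Draw one index in O(1): pick a bucket uniformly, then keep it or take its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias_table([1.0, 3.0, 6.0])
draws = [sample(prob, alias) for _ in range(100_000)]
print([draws.count(i) / len(draws) for i in range(3)])  # roughly [0.1, 0.3, 0.6]
```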

    Navigating Diverse Datasets in the Face of Uncertainty

    When exploring big volumes of data, one of the challenging aspects is their diversity of origin. Multiple files that have not yet been ingested into a database system may contain information of interest to a researcher, who must curate, understand, and sieve their content before being able to extract knowledge. Performance is one of the greatest difficulties in exploring these datasets. On the one hand, examining non-indexed, unprocessed files can be inefficient. On the other hand, any processing done before the data is understood introduces latency and potentially unnecessary work if the chosen schema poorly matches the data. We have surveyed the state of the art and, fortunately, there exist multiple proposed solutions for handling data in situ performantly. Another major difficulty is matching files from multiple origins, since their schema and layout may not be compatible or properly documented. Most surveyed solutions overlook this problem, especially for numeric, uncertain data, as is typical in fields like astronomy. The main objective of our research is to assist data scientists during the exploration of unprocessed, numerical, raw data distributed across multiple files, based solely on its intrinsic distribution. In this thesis, we first introduce the concept of Equally-Distributed Dependencies (EDDs), which provides the foundation for matching this kind of dataset. We then propose PresQ, a novel algorithm that finds quasi-cliques on hypergraphs based on their expected statistical properties. The probabilistic approach of PresQ can be successfully exploited to mine EDDs between diverse datasets when the underlying populations can be assumed to be the same. Finally, we propose a two-sample statistical test based on Self-Organizing Maps (SOM). This method can outperform, in terms of power, other classifier-based two-sample tests, being in some cases comparable to kernel-based methods, with the advantage of being interpretable. Both PresQ and the SOM-based statistical test can provide insights that drive serendipitous discoveries.
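    For context, the sketch below shows a generic classifier-based two-sample test of the kind the SOM-based test is compared against (it is not PresQ or the SOM method itself): the two samples are labelled 0 and 1, a classifier's cross-validated accuracy serves as the test statistic, and permuting the labels yields the null distribution.

```python
# Generic classifier two-sample test (illustrative baseline, not the thesis's SOM test).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def classifier_two_sample_test(a, b, n_perm=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.vstack([a, b])
    y = np.r_[np.zeros(len(a)), np.ones(len(b))]
    clf = LogisticRegression(max_iter=1000)
    stat = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    # Null distribution: shuffle the sample labels and recompute the accuracy.
    null = [cross_val_score(clf, X, rng.permutation(y), cv=5, scoring="accuracy").mean()
            for _ in range(n_perm)]
    p_value = (1 + sum(s >= stat for s in null)) / (n_perm + 1)
    return stat, p_value

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=(150, 3))
b = rng.normal(0.4, 1.0, size=(150, 3))
acc, p = classifier_two_sample_test(a, b)
print(acc, p)  # accuracy well above 0.5 and a small p-value suggest differing distributions
```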

    Technologies and Applications for Big Data Value

    This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications in the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions to major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to technology frameworks that are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is arranged in two parts. The first part, “Technologies and Methods”, contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part, “Processes and Applications”, details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the nucleus of the European data community, bringing together businesses and leading researchers to harness the value of data for the benefit of society, business, science, and industry. The book is of interest to two primary audiences: first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI; and second, practitioners and industry experts engaged in data-driven systems and software design and deployment projects who are interested in employing these advanced methods to address real-world problems.

    Design and Application of an Electronic Logbook for Space System Integration and Test Operations
