
    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and it is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
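    The simpler single-column statistics mentioned above (null counts, distinct counts, data types, frequent values) are straightforward to compute; the snippet below is a minimal illustrative sketch assuming pandas, and the profile_column helper is a hypothetical name rather than anything from the survey.

```python
# Minimal single-column profiling sketch (illustrative, assumes pandas).
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Return simple profiling statistics for one column."""
    return {
        "null_count": int(series.isna().sum()),
        "distinct_count": int(series.nunique(dropna=True)),
        "inferred_dtype": str(series.dtype),
        # Most frequent values as a crude stand-in for value-pattern analysis.
        "top_values": series.value_counts(dropna=True).head(3).to_dict(),
    }

if __name__ == "__main__":
    df = pd.DataFrame({"city": ["Berlin", "Berlin", None, "Potsdam"]})
    print(profile_column(df["city"]))
```

    Multi-column metadata such as functional or inclusion dependencies require considerably more machinery, which is what much of the surveyed work addresses.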

    Adapting a Stress Testing Framework to a Multi-module Security-oriented Spring Application

    A multi-component system is being built. Its three main components are the backend server (a Spring application), mobile applications (iOS and Android), and customer service web portals. The main concern is the backend server, because it is the destination of the majority of requests from the customer service web portals and mobile applications. It is a multi-module project in which all modules communicate with each other. The system will potentially be used by hundreds of thousands of users, with tens of thousands of simultaneous sessions. Therefore, extensive stress testing must be conducted. Unfortunately, stress-testing frameworks in their original state are not suitable for the given system, so a stress-testing framework must be configured and extended until it supports the system's specific protocols and can test all of the system's components together. Numerous stress-testing frameworks are available; some examples are Locust, Apache JMeter, and the Gatling Project. These frameworks differ in programming language, features, and core logic. As this is a commercial project, the chosen stress-testing framework must also comply with the client's functional and non-functional requirements. Because stress testing is conducted only on the backend server component, the selected framework must be configured and extended to simulate the other components and the required server protocols. The thesis provides a brief comparison of the available stress-testing frameworks based on their features and implementation language, selects the one to be adapted for stress testing within the project, and describes how the adaptation is done. The thesis also points out some limitations of stress-testing frameworks, together with techniques to overcome them. Finally, the system is tested using the selected framework, and the results are presented and validated.
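    To give a concrete flavour of a scripted load test against such a backend, the sketch below uses Locust, one of the frameworks named above; the endpoint paths, credentials, and wait times are hypothetical placeholders and do not reflect the thesis's actual configuration.

```python
# A minimal Locust user simulating clients of a Spring backend.
# The /api/login and /api/profile endpoints are hypothetical placeholders.
from locust import HttpUser, task, between

class MobileAppUser(HttpUser):
    wait_time = between(1, 3)  # seconds between simulated user actions

    def on_start(self):
        # Authenticate once per simulated user; the HTTP session keeps cookies/tokens.
        self.client.post("/api/login", json={"username": "demo", "password": "demo"})

    @task
    def view_profile(self):
        self.client.get("/api/profile")
```

    Running this file with the locust command-line tool against the backend host spawns the desired number of concurrent simulated users; the protocol-specific extensions and component simulation described above would be layered on top of such a baseline.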

    Cleaning Denial Constraint Violations through Relaxation

    Data cleaning is a time-consuming process that depends on the data analysis that users perform. Existing solutions treat data cleaning as a separate offline process that takes place before analysis begins. Applying data cleaning before analysis assumes a priori knowledge of the inconsistencies and the query workload, thereby requiring effort on understanding and cleaning the data that is unnecessary for the analysis. We propose an approach that performs probabilistic repair of denial constraint violations on demand, driven by the exploratory analysis that users perform. We introduce Daisy, a system that seamlessly integrates data cleaning into the analysis by relaxing query results. Daisy executes analytical query workloads over dirty data by weaving cleaning operators into the query plan. Our evaluation shows that Daisy adapts to the workload and outperforms traditional offline cleaning on both synthetic and real-world workloads. (To appear in the SIGMOD 2020 proceedings.)
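    To make the notion of a denial constraint concrete, the sketch below checks a classic example (no employee may earn more than another yet pay a lower tax rate) over a toy table; the schema and data are invented for illustration, and this is a plain violation check, not Daisy's probabilistic repair.

```python
# A small illustration of detecting denial constraint violations.
# Constraint: there must be no pair of tuples (t1, t2) with
#   t1.salary > t2.salary AND t1.tax_rate < t2.tax_rate.
from itertools import permutations

rows = [
    {"name": "ann",  "salary": 90_000, "tax_rate": 0.30},
    {"name": "bob",  "salary": 60_000, "tax_rate": 0.35},  # violates the constraint together with ann
    {"name": "carl", "salary": 50_000, "tax_rate": 0.25},
]

violations = [
    (t1["name"], t2["name"])
    for t1, t2 in permutations(rows, 2)
    if t1["salary"] > t2["salary"] and t1["tax_rate"] < t2["tax_rate"]
]
print(violations)  # [('ann', 'bob')]
```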

    Analysis and Optimization of Scientific Applications through Set and Relation Abstractions

    Writing high-performance code has steadily become more challenging as the design of computing systems has moved toward parallel processors in the form of multi- and many-core architectures. This trend has resulted in increasingly heterogeneous architectures and programming models. Moreover, the prevalence of distributed systems, especially in fields relying on supercomputers, has made programming such diverse environments even more difficult. To mitigate these challenges, an assortment of tools and programming models has been introduced over the past decade or so. Some efforts focus on the characteristics of the code, such as polyhedral compilers (e.g., Pluto and PPCG), while others take into consideration aspects of the application domain and propose domain-specific languages (DSLs). DSLs are developed either as stand-alone languages, like Halide for image processing, or embedded in a general-purpose language, like Firedrake, a DSL embedded in Python for solving PDEs using FEM. All these approaches attempt to provide the best input to common underlying programming models such as MPI and OpenMP for distributed and shared-memory systems, respectively. This dissertation introduces Kaashi, a high-level runtime system embedded in C++, designed to manage memory and execution order for programs with large input data and complex dependencies. Kaashi provides a uniform front-end to multiple back-ends focusing on distributed systems. Kaashi's abstractions allow the programmer to define the problem's data domain as a collection of sets and relations between pairs of such sets. This level of abstraction enables a series of optimizations that would otherwise be very expensive to detect, or not feasible at all. Furthermore, Kaashi's API helps novice programmers write their code in a more structured way without getting involved in the details of data management and communication.
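    As a purely conceptual illustration of describing a data domain as sets and relations between pairs of sets, consider the sketch below; it is written in Python for brevity and does not represent Kaashi's actual C++ API.

```python
# Conceptual sketch only: a data domain expressed as sets plus a relation
# between two of those sets. This is NOT Kaashi's API.
cells = set(range(6))   # a set of cell indices
nodes = set(range(8))   # a set of node indices

# A relation between the two sets: each cell maps to the nodes it touches.
cell_to_node = {c: {(c + k) % 8 for k in range(3)} for c in cells}

# Knowing the relation up front lets a runtime derive, for example, which nodes
# a given group of cells depends on -- the kind of dependency information a
# system like Kaashi can exploit to schedule execution and data movement.
needed_nodes = set().union(*(cell_to_node[c] for c in {0, 1}))
print(sorted(needed_nodes))  # [0, 1, 2, 3]
```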

    Contributions à l'Optimisation de Requêtes Multidimensionnelles (Contributions to the Optimization of Multidimensional Queries)

    Analyzing data consists in choosing a subset of the dimensions that describe it in order to extract useful information. However, it is rare to know a priori which dimensions are "interesting". The analysis therefore turns into an exploratory activity in which each pass translates into a query. It thus becomes essential to propose query optimization solutions that take a global view of the process rather than trying to optimize each query independently of the others. We present our contributions within this exploratory approach, focusing on three types of queries: (i) the computation of borders, (ii) OLAP (On-Line Analytical Processing) queries over data cubes, and (iii) skyline-style preference queries.
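    As a quick illustration of the third query type, the sketch below computes a skyline (the Pareto-optimal subset) of two-dimensional points where lower is better in both dimensions; the data and code are illustrative and not taken from the work itself.

```python
# A naive skyline (Pareto frontier) computation over 2-D points,
# where lower is better in both dimensions (e.g., price and distance).
points = [(50, 8), (40, 10), (60, 5), (55, 9), (70, 4)]

def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

skyline = [p for p in points if not any(dominates(q, p) for q in points if q != p)]
print(skyline)  # (55, 9) is excluded because it is dominated by (50, 8)
```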