    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher, and it is necessary for various use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
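
    The profiling tasks named in this abstract can be illustrated with a minimal Python sketch. It is not taken from the survey: the function names, the pattern encoding (digits as 9, letters as A), and the sample rows are assumptions chosen for illustration, and the checks are naive full scans rather than any of the algorithms the survey reviews.

```python
from collections import Counter
import re


def profile_column(values):
    """Single-column statistics: null count, distinct count, most frequent pattern."""
    non_null = [v for v in values if v is not None]
    # Assumed pattern encoding: every digit becomes 9, every ASCII letter becomes A.
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"[0-9]", "9", str(v))) for v in non_null
    )
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_pattern": patterns.most_common(1)[0][0] if patterns else None,
    }


def is_unique(rows, columns):
    """True if no two rows share the same values on the given column combination."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds in the given rows."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        # The first value seen for a key is stored; any later mismatch breaks the FD.
        if mapping.setdefault(key, val) != val:
            return False
    return True


# Hypothetical sample relation.
rows = [
    {"zip": "10115", "city": "Berlin", "name": "Ann"},
    {"zip": "10115", "city": "Berlin", "name": "Bob"},
    {"zip": "14482", "city": "Potsdam", "name": None},
]
```

    On this sample, zip -> city holds, while zip alone is not a unique column combination because two rows share the value 10115.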

    Adapting a Stress Testing Framework to a Multi-module Security-oriented Spring Application

    A multi-component system is being built. Its three main components are a backend server (a Spring application), mobile applications (iOS, Android), and customer service web portals. Our main concern is the backend server, because it is the destination of the majority of requests from the customer service web portals and mobile applications. It is a multi-module project in which all modules communicate with each other. The system may potentially be used by hundreds of thousands of users, with tens of thousands of simultaneous sessions. Therefore, extensive stress testing must be conducted. Unfortunately, stress-testing frameworks in their original state are not suitable for the given system. Thus, a stress-testing framework must be configured and extended until it supports the system’s specific protocols and can test all of the system’s components together. Numerous stress-testing frameworks are available, for example Locust, Apache JMeter, and Gatling Project. These frameworks differ in programming language, features, and core logic. As this is a commercial project, the chosen stress-testing framework must also comply with the client’s functional and non-functional requirements. Because stress testing is conducted only on the backend server component, the selected framework must be configured and extended to simulate the other components and the required server protocols. The thesis provides a brief comparison of the available stress-testing frameworks based on their features and implementation language, selects the one to be adapted for stress testing within the project, and describes how the adaptation is done. The thesis also points out some limitations of stress-testing frameworks, along with techniques to overcome them. Finally, the system is tested using the selected framework, and the results are presented and validated.

    Cleaning Denial Constraint Violations through Relaxation

    Data cleaning is a time-consuming process that depends on the data analysis that users perform. Existing solutions treat data cleaning as a separate offline process that takes place before analysis begins. Applying data cleaning before analysis assumes a priori knowledge of the inconsistencies and the query workload, and therefore requires effort to understand and clean data that is unnecessary for the analysis. We propose an approach that performs probabilistic repair of denial constraint violations on demand, driven by the exploratory analysis that users perform. We introduce Daisy, a system that seamlessly integrates data cleaning into the analysis by relaxing query results. Daisy executes analytical query workloads over dirty data by weaving cleaning operators into the query plan. Our evaluation shows that Daisy adapts to the workload and outperforms traditional offline cleaning on both synthetic and real-world workloads. Comment: To appear in SIGMOD 2020 proceedings.
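
    To make the notion of a denial constraint violation concrete, the following Python sketch detects violating tuple pairs with a naive pairwise scan. It is not Daisy's method: the function name, the predicate encoding, the example constraint, and the sample rows are all assumptions for illustration, and it covers only detection, not probabilistic repair or query relaxation.

```python
from itertools import combinations
import operator


def violations(rows, predicates):
    """Return index pairs (i, j), i < j, for which every predicate holds.

    A denial constraint forbids any pair of tuples that satisfies all of its
    predicates at once, so each returned pair is a violation. Each pair is
    checked in one order only, which assumes symmetric predicates such as
    equality and inequality.
    """
    found = []
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        if all(op(t1[a], t2[b]) for a, op, b in predicates):
            found.append((i, j))
    return found


# Hypothetical constraint: not (t1.zip == t2.zip and t1.city != t2.city).
dc = [("zip", operator.eq, "zip"), ("city", operator.ne, "city")]

# Hypothetical dirty data: rows 0 and 1 share a zip code but disagree on the city.
rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Bern"},
    {"zip": "14482", "city": "Potsdam"},
]

print(violations(rows, dc))
```

    The pairwise scan is quadratic in the number of rows, which hints at why cleaning an entire dataset up front can be costly compared with cleaning only the data a query touches.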

    Analysis and Optimization of Scientific Applications through Set and Relation Abstractions

    Writing high-performance code has steadily become more challenging as the design of computing systems has moved toward parallel processors in the form of multi- and many-core architectures. This trend has resulted in increasingly heterogeneous architectures and programming models. Moreover, the prevalence of distributed systems, especially in fields relying on supercomputers, has made programming such diverse environments more difficult. To mitigate these challenges, an assortment of tools and programming models has been introduced in the past decade or so. Some efforts focused on the characteristics of the code, such as polyhedral compilers (e.g., Pluto, PPCG), while others took into consideration aspects of the application domain and proposed domain-specific languages (DSLs). DSLs are developed either as stand-alone languages, like Halide for image processing, or as part of a general-purpose language, in which case they are called embedded (e.g., Firedrake, a DSL embedded in Python for solving PDEs using FEM). All these approaches attempt to provide the best input to the underlying common programming models, such as MPI and OpenMP for distributed and shared memory systems, respectively. This dissertation introduces Kaashi, a high-level run-time system embedded in the C++ language, designed to manage the memory and execution order of programs with large input data and complex dependencies. Kaashi provides a uniform front-end to multiple back-ends, focusing on distributed systems. Kaashi’s abstractions allow the programmer to define the problem’s data domain as a collection of sets and relations between pairs of such sets. This level of abstraction could enable a series of optimizations that are otherwise very expensive to detect or not feasible at all. Furthermore, Kaashi’s API helps novice programmers write their code in a more structured way without getting involved in the details of data management and communication.

    Contributions to the Optimization of Multidimensional Queries

    Analyzing data consists of choosing a subset of the dimensions that describe the data in order to extract useful information. However, the "interesting" dimensions are rarely known a priori. Analysis then becomes an exploratory activity in which each pass translates into a query. It therefore becomes essential to propose query optimization solutions that take a global view of the process rather than trying to optimize each query independently of the others. We present our contributions within this exploratory approach, focusing on three types of queries: (i) the computation of borders, (ii) OLAP (On Line Analytical Processing) queries in data cubes, and (iii) skyline-type preference queries.
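
    As an illustration of the skyline preference queries mentioned in this abstract, the Python sketch below computes a skyline with a naive quadratic scan. It is not the thesis's optimization technique: the function names, the "smaller is better" convention, and the sample data are assumptions for illustration.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly better in one.

    Assumption: smaller values are preferred in every dimension.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data: (price, distance to the beach) for five hotels.
hotels = [(50, 8), (60, 2), (80, 1), (70, 5), (90, 9)]

print(skyline(hotels))
```

    Here (70, 5) is dominated by (60, 2) and (90, 9) by (50, 8), so only the three remaining hotels form the skyline. Restricting the points to a different subset of dimensions generally yields a different skyline.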
