13,140 research outputs found

    Diagnostics and robust estimation in multivariate data transformations

    This paper presents a method for detecting multivariate outliers that might distort the estimation of a transformation to normality. A robust estimator of the transformation parameter is also proposed.
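    The abstract does not specify the estimator, but the general idea of robustifying a transformation-to-normality fit can be sketched with a Box-Cox profile likelihood maximized after trimming the most extreme observations. The function names, the grid search, and the trimming fraction below are illustrative assumptions, not the paper's method.

```python
import math

def boxcox(x, lam):
    """Box-Cox transform of a positive value x for parameter lam."""
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

def profile_loglik(data, lam):
    """Box-Cox profile log-likelihood of lam (up to an additive constant)."""
    n = len(data)
    z = [boxcox(x, lam) for x in data]
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n
    return -0.5 * n * math.log(var) + (lam - 1) * sum(math.log(x) for x in data)

def robust_lambda(data, trim=0.1, grid=None):
    """Estimate lambda after trimming the most extreme observations,
    so that a few outliers cannot dominate the likelihood."""
    grid = grid or [i / 10 for i in range(-20, 21)]
    k = int(len(data) * trim)
    core = sorted(data)[k:len(data) - k] if k else sorted(data)
    return max(grid, key=lambda lam: profile_loglik(core, lam))
```

    Trimming before maximizing is only one simple robustification; the paper's diagnostics for identifying which points distort the estimate are more refined.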

    Bounded RDF Data Transformations

    RDF data transformations are transformations of RDF graphs to RDF graphs which preserve, to varying degrees, the data content of the source in the target. These transformations therefore give special attention to the data elements in such graphs, under the assumption that data elements reside in the subjects and objects of RDF triples, and the peculiar fact that the set of vertices and the set of edges in an RDF graph are not necessarily disjoint. Bounded homomorphisms are used to define these transformations: they not only ensure that data from the source is structurally preserved in the target, but also require, in various ways, that the target data be related back to the source. The result of this paper is a theoretical toolkit of transformation characteristics with which detailed control over the transformation target may be exercised. We explore these characteristics in two different RDF graph representations, and give an algorithm for checking the existence of transformations.
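    The structure-preservation half of the idea can be sketched with plain triples: a mapping is a homomorphism if it sends every source triple to a target triple. Note that the mapping may also act on predicates, matching the abstract's remark that vertex and edge sets need not be disjoint. The graphs and mapping below are hypothetical examples, and this sketch omits the "bounded" conditions relating the target back to the source.

```python
def is_homomorphism(source, target, h):
    """Check that mapping h sends every triple (s, p, o) of the source
    graph to a triple of the target graph (structure preservation).
    Terms not mentioned in h are mapped to themselves."""
    tgt = set(target)
    return all((h.get(s, s), h.get(p, p), h.get(o, o)) in tgt
               for s, p, o in source)

# Hypothetical toy graphs.
src = [("alice", "knows", "bob")]
tgt = [("person1", "knows", "person2")]
h = {"alice": "person1", "bob": "person2"}
# is_homomorphism(src, tgt, h) -> True
```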

    Discrimination-aware data transformations

    A deep use of people-related data in automated decision processes might lead to an amplification of inequities already implicit in real-world data. The development of technological solutions satisfying nondiscrimination requirements is therefore one of the main challenges for the data management and data analytics communities. Nondiscrimination can be characterized in terms of different properties, like fairness, diversity, and coverage. Such properties should be achieved through a holistic approach that incrementally enforces nondiscrimination constraints along all stages of the data processing life-cycle, rather than through individually independent choices or as a constraint on the final result. In this respect, the design of discrimination-aware solutions for the initial phases of the data processing pipeline (like data preparation) is extremely relevant: the sooner a problem is spotted, the fewer issues reach the final analytical steps of the chain. In this PhD thesis, we are interested in nondiscrimination constraints defined in terms of coverage. Coverage aims at guaranteeing that the input dataset includes enough examples for each (protected) category of interest, thus increasing diversity to limit the introduction of bias during the next analytical steps. While coverage constraints have mainly been used for repairing raw datasets, we investigate their effects on data transformations, during data preparation, through query execution. To this aim, we propose coverage-based queries as a means to achieve coverage constraint satisfaction on the result of data transformations defined in terms of selection-based queries, together with specific algorithms for their processing. The proposed solutions rely on query rewriting, a key approach for enforcing specific constraints while guaranteeing transparency and avoiding disparate-treatment discrimination.
    As far as we know, and according to recent surveys in this domain, no other solutions addressing coverage-based rewriting during data transformations have been proposed so far. To guarantee a good compromise between efficiency and accuracy, both precise and approximate algorithms for coverage-based query processing are proposed. The results of an extensive experimental evaluation, carried out on both synthetic and real datasets, show the effectiveness and efficiency of the proposed approaches. Coverage-based queries can easily be integrated into relational machine learning data processing environments; to show their applicability, we integrate some of the designed algorithms into a machine learning data processing Python toolkit.
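    One simple form of coverage-based rewriting can be sketched as follows: relax the selection predicate of a query until every protected group contributes at least k rows to the result. The attribute names, the single-threshold relaxation, and the example data are illustrative assumptions; the thesis's precise and approximate algorithms are not reproduced here.

```python
def coverage_rewrite(rows, attr, threshold, group_attr, groups, k, step=1):
    """Relax the selection 'attr >= threshold' until every protected
    group contributes at least k rows (a coverage constraint), then
    return the relaxed threshold and the query result."""
    t = threshold
    floor = min(r[attr] for r in rows)
    while True:
        result = [r for r in rows if r[attr] >= t]
        counts = {g: sum(1 for r in result if r[group_attr] == g)
                  for g in groups}
        if all(counts[g] >= k for g in groups) or t <= floor:
            return t, result
        t -= step

# Hypothetical rows: group "B" is excluded by the original threshold of 80.
rows = [{"score": 90, "g": "A"}, {"score": 85, "g": "A"},
        {"score": 70, "g": "B"}, {"score": 60, "g": "B"}]
t, result = coverage_rewrite(rows, "score", 80, "g", ["A", "B"], k=1, step=5)
# t == 70: the threshold was relaxed until group "B" is covered.
```

    Rewriting the query, rather than post-filtering or editing the result, is what keeps the transformation transparent: the relaxed predicate is applied uniformly to all rows, avoiding disparate treatment.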

    Testing Data Transformations in MapReduce Programs

    MapReduce is a parallel data processing paradigm oriented to processing large volumes of information in data-intensive applications, such as Big Data environments. A characteristic of these applications is that they can have different data sources and data formats. For these reasons, the inputs could contain some poor-quality data that could produce a failure if the program functionality does not properly handle the variety of input data. The output of these programs is obtained from a number of input transformations that represent the program logic. This paper proposes a testing technique called MRFlow that is based on data flow test criteria and oriented to the analysis of transformations between the input and the output, in order to detect defects in MapReduce programs. MRFlow is applied over several MapReduce programs and detects several defects.
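    The kind of defect targeted here can be illustrated with a toy word-count job: a mapper that does not guard against malformed records (wrong type, empty string) crashes or emits garbage, and a test that feeds mixed-quality records through the map and reduce transformations exposes this. The code below is a minimal hand-rolled sketch, not MRFlow itself or a real Hadoop job.

```python
def mapper(record):
    """Emit (word, 1) pairs; skip malformed (non-string or empty)
    records instead of crashing, the kind of input-handling defect a
    data-flow analysis of the transformations aims to expose."""
    if not isinstance(record, str) or not record.strip():
        return []
    return [(w.lower(), 1) for w in record.split()]

def reducer(pairs):
    """Sum counts per key, as in a word-count MapReduce job."""
    out = {}
    for key, value in pairs:
        out[key] = out.get(key, 0) + value
    return out

# Mixed-quality input: two valid records, one None, one empty string.
records = ["Big Data", None, "", "data flow"]
pairs = [p for r in records for p in mapper(r)]
counts = reducer(pairs)
# counts["data"] == 2
```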

    A Guide to Integrate Plant Cover Data From Two Different Methods

    There is a lack of consensus on how to monitor (measure) plant cover in tidal marshes. Multiple methods exist to estimate plant cover, which can confound interpretation when making comparisons across methods. Here, we provide a novel and more accurate approach, building on traditional data transformations designed to integrate the two most common methods: Point Intercept and Ocular Cover.
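    For context, the standard conversion underlying Point Intercept data is the fraction of sample points at which a species is intercepted, expressed as percent cover; the paper's contribution is integrating such values with Ocular Cover estimates, which is not reproduced here. The function below is only this standard conversion.

```python
def point_intercept_cover(hits, points):
    """Percent cover from point-intercept sampling: the fraction of
    sample points at which the species was intercepted, times 100."""
    if points <= 0:
        raise ValueError("points must be positive")
    return 100.0 * hits / points

# point_intercept_cover(37, 100) -> 37.0
```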