13,140 research outputs found
Diagnostics and robust estimation in multivariate data transformations
This paper presents a method for detecting multivariate outliers which might be distorting the estimation of a transformation to normality. A robust estimator of the transformation parameter is also proposed.
Bounded RDF Data Transformations
RDF data transformations are transformations of RDF graphs to RDF graphs that preserve, to different degrees, the data content of the source in the target. These transformations therefore give special attention to the data elements in such graphs, under the assumption that data elements reside in the subjects and objects of RDF triples, and to the peculiar fact that the set of vertices and the set of edges in an RDF graph are not necessarily disjoint. Bounded homomorphisms are used to define these transformations, which not only ensure that data from the source is structurally preserved in the target, but also require, in various ways, the target data to be related back to the source. The result of this paper is a theoretical toolkit of transformation characteristics with which detailed control over the transformation target may be exercised. We explore these characteristics in two different RDF graph representations, and give an algorithm for checking the existence of transformations.
Discrimination-aware data transformations
The extensive use of people-related data in automated decision processes might amplify inequities already implicit in real-world data. The development of technological solutions satisfying nondiscrimination requirements is therefore one of the main challenges for the data management and data analytics communities.
Nondiscrimination can be characterized in terms of different properties, like fairness, diversity, and coverage. Such properties should be achieved through a holistic approach, incrementally enforcing nondiscrimination constraints along all the stages of the data processing life-cycle, through individually independent choices rather than as a constraint on the final result. In this respect, the design of discrimination-aware solutions for the initial phases of the data processing pipeline (like data preparation) is extremely relevant: the sooner a problem is spotted, the fewer problems arise in the final analytical steps of the chain.
In this PhD thesis, we are interested in nondiscrimination constraints defined in terms of coverage. Coverage aims at guaranteeing that the input dataset includes enough examples for each (protected) category of interest, thus increasing diversity to limit the introduction of bias during the next analytical steps. While coverage constraints have been mainly used for repairing raw datasets, we investigate their effects on data transformations, during data preparation, through query execution. To this aim, we propose coverage-based queries, as a means to achieve coverage constraint satisfaction on the result of data transformations defined in terms of selection-based queries, and specific algorithms for their processing.
The proposed solutions rely on query rewriting, a key approach for enforcing specific constraints while guaranteeing transparency and avoiding disparate treatment discrimination. As far as we know and according to recent surveys in this domain, no other solutions addressing coverage-based rewriting during data transformations have been proposed so far.
To guarantee a good compromise between efficiency and accuracy, both precise and approximate algorithms for coverage-based query processing are proposed. The results of an extensive experimental evaluation, carried out on both synthetic and real datasets, show the effectiveness and the efficiency of the proposed approaches.
Coverage-based queries can be easily integrated in relational machine learning data processing environments; to show their applicability, we integrate some of the designed algorithms in a machine learning data processing Python toolkit.
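The coverage idea described in this abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm: the row layout, the `gender` and `score` fields, the threshold of 2, and the helper `uncovered_groups` are all hypothetical, and the "rewriting" shown is simply a hand-relaxed selection predicate.

```python
from collections import Counter


def uncovered_groups(rows, attribute, groups, threshold):
    """Return the groups whose row count in `rows` is below `threshold`.

    `rows` is a list of dicts; `attribute` names the (protected) column.
    An empty result means the coverage constraint is satisfied.
    """
    counts = Counter(row[attribute] for row in rows)
    return {g: counts.get(g, 0) for g in groups if counts.get(g, 0) < threshold}


# Hypothetical input dataset.
applicants = [
    {"id": 1, "gender": "F", "score": 82},
    {"id": 2, "gender": "F", "score": 71},
    {"id": 3, "gender": "M", "score": 90},
    {"id": 4, "gender": "M", "score": 85},
    {"id": 5, "gender": "F", "score": 65},
    {"id": 6, "gender": "M", "score": 78},
]

# Original selection-based query: score >= 80.
original = [r for r in applicants if r["score"] >= 80]

# Hand-relaxed variant of the same query: score >= 70.
relaxed = [r for r in applicants if r["score"] >= 70]

violations_original = uncovered_groups(original, "gender", ["F", "M"], 2)
violations_relaxed = uncovered_groups(relaxed, "gender", ["F", "M"], 2)

print("original query violations:", violations_original)
print("relaxed query violations:", violations_relaxed)
```

In this toy data the original selection leaves one group below the threshold, while the relaxed predicate does not; an actual coverage-based rewriting would search for such a relaxation automatically rather than by hand.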
Testing Data Transformations in MapReduce Programs
MapReduce is a parallel data processing paradigm designed to process large volumes of information in data-intensive applications, such as Big Data environments. A characteristic of these applications is that they can have different data sources and data formats. For these reasons, the inputs could contain some poor-quality data that could produce a failure if the program functionality does not properly handle the variety of input data. The output of these programs is obtained from a number of input transformations that represent the program logic. This paper proposes a testing technique called MRFlow that is based on data flow test criteria and oriented to the analysis of transformations between the input and the output in order to detect defects in MapReduce programs. MRFlow is applied to some MapReduce programs and detects several defects.
A Guide to Integrate Plant Cover Data From Two Different Methods
There is a lack of consensus on how to monitor (measure) plant cover in tidal marshes. Multiple methods exist to estimate plant cover, which can confound interpretation when making comparisons across methods. Here, we provide a novel and more accurate approach, building on traditional data transformations, designed to integrate the two most common methods: Point Intercept and Ocular Cover.
Notes on the use of data transformations
Data transformations are commonly used tools that can serve many functions in quantitative analysis of data. The goal of this paper is to focus on the use of three data transformations most commonly discussed in statistics texts (square root, log, and inverse) for improving the normality of variables. While these are important options for analysts, they do fundamentally transform the nature of the variable, making the interpretation of the results somewhat more complex. Further, few (if any) statistical texts discuss the tremendous influence a distribution's minimum value has on the efficacy of a transformation. The goal of this paper is to promote thoughtful and informed use of data transformations. Accessed 244,249 times on https://pareonline.net from May 30, 2002 to December 31, 2019.
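The three transformations named in this abstract, and the effect of a distribution's minimum value, can be sketched as follows. The data values, the `+ 999` shift, and the helpers `skewness` and `anchor_min_at_one` are illustrative assumptions, not material from the paper; shifting the minimum to 1 is one common way to make the transformations well defined and effective.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def anchor_min_at_one(values):
    """Shift the data by a constant so that its minimum equals 1."""
    shift = 1 - min(values)
    return [v + shift for v in values]


# Hypothetical positively skewed variable with minimum 1.
scores = [1, 2, 3, 4, 5, 10, 20, 50, 100]

# The three transformations discussed in the abstract.
sqrt_scores = [math.sqrt(v) for v in scores]
log_scores = [math.log(v) for v in scores]
inverse_scores = [1 / v for v in scores]  # note: reverses the ordering

# Same shape of data, but with the minimum moved to 1000.
shifted = [v + 999 for v in scores]
log_shifted = [math.log(v) for v in shifted]

# Re-anchoring the minimum at 1 before taking logs.
log_reanchored = [math.log(v) for v in anchor_min_at_one(shifted)]

print("skew raw:            ", round(skewness(scores), 3))
print("skew sqrt:           ", round(skewness(sqrt_scores), 3))
print("skew log (min 1):    ", round(skewness(log_scores), 3))
print("skew log (min 1000): ", round(skewness(log_shifted), 3))
print("skew log (reanchored):", round(skewness(log_reanchored), 3))
```

With the minimum far from 1, the log is nearly linear over the data's range and removes little of the skew; after re-anchoring, the same transformation is much more effective.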
Data Transformations for Inference with Linear Regression: Clarifications and Recommendations
Data transformations have been promoted as a popular and easy-to-implement remedy to address the assumption of normally distributed errors (in the population) in linear regression. However, the application of data transformations introduces non-ignorable complexities which should be fully appreciated before their implementation. This paper adds to existing Practical Research and Assessment Evaluation (PARE) publications on data transformations by providing a broad overview of the use of data transformations for the specific purpose of statistical inference and interpreting meaningful effect sizes. Data transformations not only potentially change the scale of the transformed variable; they also alter the fundamental relationships among variables while simultaneously changing the distribution of the errors. Given these repercussions, we clarify the nature of certain data transformations and strongly recommend the use of data transformations when they can enhance the interpretation of effect sizes. Accessed 5,515 times on https://pareonline.net from October 11, 2017 to December 31, 2019.