Data pre-processing is one of the most time consuming and relevant steps in a
data analysis process (e.g., classification task). A given data pre-processing
operator (e.g., transformation) can have positive, negative or zero impact on
the final result of the analysis. Expert users have the required knowledge to
find the right pre-processing operators. However, when it comes to non-experts,
they are overwhelmed by the amount of pre-processing operators and it is
challenging for them to find operators that would positively impact their
analysis (e.g., increase the predictive accuracy of a classifier). Existing
solutions either assume that users have expert knowledge, or they recommend
pre-processing operators that are only "syntactically" applicable to a dataset,
without taking into account their impact on the final analysis. In this work,
we aim at providing assistance to non-expert users by recommending data
pre-processing operators that are ranked according to their impact on the final
analysis. We developed a tool PRESISTANT, that uses Random Forests to learn the
impact of pre-processing operators on the performance (e.g., predictive
accuracy) of 5 different classification algorithms, such as J48, Naive Bayes,
PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations on the
recommendations provided by our tool, show that PRESISTANT can effectively help
non-experts in order to achieve improved results in their analytical tasks