Search CORE

10,317 research outputs found

PRESISTANT: Learning based assistant for data pre-processing

Author: Abelló Alberto
Aluja-Banet Tomàs
Bilalli Besim
Wrembel Robert
Publication venue
Publication date: 02/03/2018
Field of study

Data pre-processing is one of the most time consuming and relevant steps in a data analysis process (e.g., classification task). A given data pre-processing operator (e.g., transformation) can have positive, negative or zero impact on the final result of the analysis. Expert users have the required knowledge to find the right pre-processing operators. However, when it comes to non-experts, they are overwhelmed by the amount of pre-processing operators and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim at providing assistance to non-expert users by recommending data pre-processing operators that are ranked according to their impact on the final analysis. We developed a tool PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of 5 different classification algorithms, such as J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations on the recommendations provided by our tool, show that PRESISTANT can effectively help non-experts in order to achieve improved results in their analytical tasks

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC

Actionable Recourse in Linear Classification

Author: Biran Or
Chouldechova Alexandra
Citron Danielle Keats
Crawford Kate
Edwards Lilian
Foundation Deutschland Open Knowledge
Poulin Brett
Ribeiro Marco Tulio
Spangher Alexander
Taylor Winnie F
Tramèr Florian
United
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 08/11/2019
Field of study

Machine learning models are increasingly used to automate decisions that affect humans - deciding who should receive a loan, a job interview, or a social service. In such applications, a person should have the ability to change the decision of a model. When a person is denied a loan by a credit score, for example, they should be able to alter its input variables in a way that guarantees approval. Otherwise, they will be denied the loan as long as the model is deployed. More importantly, they will lack the ability to influence a decision that affects their livelihood. In this paper, we frame these issues in terms of recourse, which we define as the ability of a person to change the decision of a model by altering actionable input variables (e.g., income vs. age or marital status). We present integer programming tools to ensure recourse in linear classification problems without interfering in model development. We demonstrate how our tools can inform stakeholders through experiments on credit scoring problems. Our results show that recourse can be significantly affected by standard practices in model development, and motivate the need to evaluate recourse in practice.Comment: Extended version. ACM Conference on Fairness, Accountability and Transparency [FAT2019

arXiv.org e-Print Archive

Crossref

QCBA: Postoptimization of Quantitative Attributes in Classifiers based on Association Rules

Author: Kliegr Tomas
Publication venue
Publication date: 18/10/2019
Field of study

The need to prediscretize numeric attributes before they can be used in association rule learning is a source of inefficiencies in the resulting classifier. This paper describes several new rule tuning steps aiming to recover information lost in the discretization of numeric (quantitative) attributes, and a new rule pruning strategy, which further reduces the size of the classification models. We demonstrate the effectiveness of the proposed methods on postoptimization of models generated by three state-of-the-art association rule classification algorithms: Classification based on Associations (Liu, 1998), Interpretable Decision Sets (Lakkaraju et al, 2016), and Scalable Bayesian Rule Lists (Yang, 2017). Benchmarks on 22 datasets from the UCI repository show that the postoptimized models are consistently smaller -- typically by about 50% -- and have better classification performance on most datasets

arXiv.org e-Print Archive

Research and Education in Computational Science and Engineering

Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers of all persuasions with algorithmic inventions and software systems that transcend disciplines and scales. Carried on a wave of digital technology, CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society; and the CSE community is at the core of this transformation. However, a combination of disruptive developments---including the architectural complexity of extreme-scale computing, the data revolution that engulfs the planet, and the specialization required to follow the applications to new frontiers---is redefining the scope and reach of the CSE endeavor. This report describes the rapid expansion of CSE and the challenges to sustaining its bold advances. The report also presents strategies and directions for CSE research and education for the next decade.Comment: Major revision, to appear in SIAM Revie

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Development of code evaluation criteria for assessing predictive capability and performance

Author: Barson S. L.
Lin Shyi-Jang
Prueger G. H.
Sindir M. M.
Publication venue
Publication date
Field of study

Computational Fluid Dynamics (CFD), because of its unique ability to predict complex three-dimensional flows, is being applied with increasing frequency in the aerospace industry. Currently, no consistent code validation procedure is applied within the industry. Such a procedure is needed to increase confidence in CFD and reduce risk in the use of these codes as a design and analysis tool. This final contract report defines classifications for three levels of code validation, directly relating the use of CFD codes to the engineering design cycle. Evaluation criteria by which codes are measured and classified are recommended and discussed. Criteria for selecting experimental data against which CFD results can be compared are outlined. A four phase CFD code validation procedure is described in detail. Finally, the code validation procedure is demonstrated through application of the REACT CFD code to a series of cases culminating in a code to data comparison on the Space Shuttle Main Engine High Pressure Fuel Turbopump Impeller

NASA Technical Reports Server

Recommended from our members

Modeling Space-Time Data Using Stochastic Differential Equations

Author: Duan Jason A.
Gelfand Alan E.
Sirmans C. F.
Publication venue
Publication date: 01/01/2009
Field of study

This paper demonstrates the use and value of stochastic differential equations for modeling space-time data in two common settings. The first consists of point-referenced or geostatistical data where observations are collected at fixed locations and times. The second considers random point pattern data where the emergence of locations and times is random. For both cases, we employ stochastic differential equations to describe a latent process within a hierarchical model for the data. The intent is to view this latent process mechanistically and endow it with appropriate simple features and interpretable parameters. A motivating problem for the second setting is to model urban development through observed locations and times of new home construction; this gives rise to a space-time point pattern. We show that a spatio-temporal Cox process whose intensity is driven by a stochastic logistic equation is a viable mechanistic model that affords meaningful interpretation for the results of statistical inference. Other applications of stochastic logistic differential equations with space-time varying parameters include modeling population growth and product diffusion, which motivate our first, point-referenced data application. We propose a method to discretize both time and space in order to fit the model. We demonstrate the inference for the geostatistical model through a simulated dataset. Then, we fit the Cox process model to a real dataset taken from the greater Dallas metropolitan area.Business Administratio

Texas ScholarWorks