10 research outputs found

    A simple and efficient kNN variant with embedded feature selection

    Predictive modeling aims to estimate an unknown variable, the target, from a set of known ones, the inputs. The k Nearest Neighbors (kNN) algorithm is one of the best-known predictive algorithms because of its simplicity and good behavior. However, this class of models has some drawbacks, such as its lack of robustness to irrelevant input features and the need to transform qualitative variables into dummies, with the corresponding loss of information for ordinal ones. In this work, a kNN regression variant, easily adaptable for classification purposes, is suggested. The proposal handles all types of input variables while embedding feature selection in a simple and efficient manner, which shortens the tuning phase. More precisely, by using the weighted Gower distance with different weighting schemes, we develop a tool that copes with these drawbacks. The proposed method is applied to a collection of 20 data sets that differ in size, data type, and the distribution of the target variable. Moreover, the results are compared with previously proposed kNN variants, showing its superiority, particularly when the weighting scheme is based on non-linear association measures and in data sets that contain at least one ordinal input variable.
    Funding: NextGenerationEU Funds, Programa Investigo, CT36/22-04-UCM-INV.
    Moreno-Ribera, A.; Calviño, A. (2023). A simple and efficient kNN variant with embedded feature selection. Editorial Universitat Politècnica de València. 237-238. http://hdl.handle.net/10251/20179123723
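As a rough illustration of the idea in this abstract, the sketch below implements kNN regression on a weighted Gower distance in plain Python. It is not the authors' implementation: the function names, the integer coding of ordinal levels, and the hand-picked weights are assumptions made for the example, and the abstract's association-based weighting schemes are not reproduced. A weight of zero removes a feature from the distance, which is how feature selection can be embedded.

```python
def gower_distance(a, b, kinds, ranges, weights):
    """Weighted Gower distance between two records.

    kinds[j] is "num", "ord" (ordinal, coded as integer ranks) or "cat".
    ranges[j] is max - min of feature j in the training data (ignored for "cat").
    weights[j] >= 0; a zero weight drops the feature entirely.
    """
    num = 0.0
    den = 0.0
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if kind == "cat":
            d = 0.0 if x == y else 1.0          # simple matching for nominal data
        else:
            d = abs(x - y) / rng if rng > 0 else 0.0  # range-scaled difference
        num += w * d
        den += w
    return num / den if den > 0 else 0.0


def knn_regress(train_X, train_y, query, k, kinds, ranges, weights):
    """Predict the target of `query` as the mean target of its k nearest rows."""
    scored = sorted(
        (gower_distance(row, query, kinds, ranges, weights), y)
        for row, y in zip(train_X, train_y)
    )
    nearest = scored[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical toy data: one numeric, one ordinal, one nominal feature.
train_X = [(1.0, 1, "a"), (2.0, 2, "b"), (9.0, 3, "a"), (10.0, 3, "b")]
train_y = [10.0, 12.0, 30.0, 32.0]
kinds = ("num", "ord", "cat")
ranges = (9.0, 2.0, 0.0)
weights = (0.6, 0.4, 0.0)   # illustrative weights; the nominal feature is switched off

print(knn_regress(train_X, train_y, (1.5, 1, "b"), 2, kinds, ranges, weights))  # 11.0
```

Coding ordinal levels as ranks keeps their ordering inside the distance, which is the information that dummy encoding would discard.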

    A state of the art of sensor location, flow observability, estimation, and prediction problems in traffic networks

    A state-of-the-art review of flow observability, estimation, and prediction problems in traffic networks is performed. Since mathematical optimization provides a general framework for all of them, an integrated approach is used: the problems are analyzed as different optimization problems, each characterized by the data, variables, constraints, and objective functions proposed by different authors. For example, counted, scanned or “a priori” data are the most common data sources; conservation laws, flow nonnegativity, link capacity, flow definition, observation, flow propagation, and specific model requirements form the most common constraints; and least squares, likelihood, possible relative error, mean absolute relative error, and so forth constitute the bases for the objective functions or metrics. The high number of possible combinations of these elements justifies the existence of a wide collection of methods for analyzing static and dynamic situations.
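To make the "data, variables, constraints, objective" framing concrete, here is a minimal sketch of one such combination: counted link data, origin-destination (OD) flows as variables, a nonnegativity constraint, and a least-squares objective. The incidence matrix, the counts, and the function name are hypothetical, and projected gradient descent is simply one convenient solver chosen for the example, not a method taken from the review.

```python
def estimate_od_flows(A, counts, steps=500, lr=0.3):
    """Least-squares OD flow estimate under flow nonnegativity.

    A[i][j] = 1 if OD flow j uses link i (link-flow definition: counts ~ A t).
    Minimizes 0.5 * ||A t - counts||^2 subject to t >= 0 by projected gradient.
    lr is an assumed step size; it must be small enough for the given A.
    """
    n_links, n_od = len(A), len(A[0])
    t = [0.0] * n_od
    for _ in range(steps):
        # Residual between modeled and counted link flows.
        resid = [
            sum(A[i][j] * t[j] for j in range(n_od)) - counts[i]
            for i in range(n_links)
        ]
        # Gradient of the least-squares objective: A^T (A t - counts).
        grad = [
            sum(A[i][j] * resid[i] for i in range(n_links))
            for j in range(n_od)
        ]
        # Gradient step followed by projection onto t >= 0.
        t = [max(0.0, t[j] - lr * grad[j]) for j in range(n_od)]
    return t


# Hypothetical network: 3 counted links, 2 OD flows; link 2 carries both flows.
A = [[1, 0], [1, 1], [0, 1]]
counts = [30.0, 50.0, 20.0]
print(estimate_od_flows(A, counts))
```

Swapping the objective (for example, a likelihood) or adding constraints (for example, link capacities) changes the method while keeping the same overall structure.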

    A Simple Method for Limiting Disclosure in Continuous Microdata Based on Principal Component Analysis

    In this article we propose a simple and versatile method for limiting disclosure in continuous microdata based on Principal Component Analysis (PCA). Instead of perturbing the original variables, we propose to alter the principal components: they contain the same information but are uncorrelated, so each component can be processed separately, which reduces processing times. The number and weight of the perturbed components determine the level of protection and distortion of the masked data. The method preserves the mean vector and the variance-covariance matrix. Furthermore, depending on the technique chosen to perturb the principal components, the proposed method can provide masked, hybrid or fully synthetic data sets. Some examples of application and comparison with other methods previously proposed in the literature (in terms of disclosure risk and data utility) are also included.
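The sketch below shows one way perturbing principal components can leave the mean vector and the covariance matrix unchanged. It assumes NumPy and more records than variables, and it uses one particular perturbation chosen for illustration (replacing the scores of the trailing components with random, mutually orthogonal ones); the article's own perturbation techniques are not specified in this abstract, and `pca_mask` is a hypothetical name.

```python
import numpy as np


def pca_mask(X, n_keep, seed=0):
    """Mask X by replacing all but the first n_keep principal components.

    Assumes X has shape (n, p) with n > p and full column rank after centering.
    The masked data have the same sample mean and covariance as X.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean
    # Thin SVD: Xc = U diag(s) Vt. Columns of U are the standardized PC scores.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Random replacement scores, centered so the column means stay zero.
    noise = rng.standard_normal((n, p - n_keep))
    noise -= noise.mean(axis=0)
    # Orthonormalize [kept scores | noise]; the new columns are orthogonal
    # to the kept ones and to each other.
    Q, _ = np.linalg.qr(np.hstack([U[:, :n_keep], noise]))
    Q[:, :n_keep] = U[:, :n_keep]   # undo any sign flip QR applied to kept scores
    # Rebuild the data with the original component variances and loadings.
    return mean + (Q * s) @ Vt


rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3)) @ np.array([[2.0, 0.5, 0.0],
                                             [0.0, 1.0, 0.3],
                                             [0.0, 0.0, 0.5]])
M = pca_mask(X, n_keep=1)
print(np.allclose(np.cov(X, rowvar=False), np.cov(M, rowvar=False)))
```

With `n_keep=0` every component is replaced, which corresponds to a fully synthetic data set in this sketch; larger `n_keep` retains more of the original records' structure.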

    Algunas herramientas estadísticas y matemáticas para la modelización del tráfico

    In this thesis, the following original statistical and mathematical models are presented:
    - Two static traffic assignment models with heterogeneous users that yield the link and path flows from the flows between origin-destination pairs. These models distinguish user classes by their desire for punctuality and for overtaking, respectively.
    - A Bayesian origin-destination matrix estimation model based on hierarchical optimization. The estimates are obtained by means of counted links.
    - The minimum number of sensors to be installed on links for total link observability is derived.
    - A continuous dynamic network loading model with overtaking that gives link travel times and flows over the whole network at any time in the period under study.
    - Some graphical methods for analyzing trajectory plots with and without overtaking, which allow the system state (speed, acceleration, etc.) to be evaluated from the physical characteristics of the trajectories (slope, curvature, etc.).
    All these models have been tested on fictitious and real traffic networks (the Spanish cities of Cuenca and Ciudad Real), with the aim of analyzing their characteristics and performance. Furthermore, a literature review of existing traffic problems and the most widely used models to solve them is included, which places the original models proposed in this thesis in context.

    A Posteriori Random Forests for Stochastic Downscaling of Precipitation by Predicting Probability Distributions

    This work presents a comprehensive assessment of the suitability of random forests, a well-known machine learning technique, for the statistical downscaling of precipitation. Building on the experimental and validation framework proposed in Experiment 1 of the COST action VALUE (the largest, most exhaustive intercomparison study of statistical downscaling methods to date), we introduce and thoroughly analyze a posteriori random forests (AP-RFs), which use all the information contained in the leaves to reliably predict the shape and scale parameters of the gamma probability distribution of precipitation on wet days. Therefore, as opposed to traditional random forests, which typically provide deterministic predictions, our AP-RFs allow realistic stochastic precipitation samples to be generated for wet days. Indeed, compared to one particular implementation of a generalized linear model that exhibited good overall performance in VALUE, our AP-RFs yield better distributional similarity with observations without loss of predictive power. Notably, the new methodology proposed in this paper has substantial potential for hydrologists and other impact communities that need reliable, local-scale stochastic climate information.
    The authors would like to acknowledge the projects MULTI-SDM (CGL2015-66583-R, MINECO/FEDER); IS-ENES3, funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement ID 824084 (https://is.enes.org); and Contribución a la nueva generación de proyecciones climáticas regionales de CORDEX mediante técnicas dinámicas y estadísticas (CORDyS): PID2020-116595RB-I00, funded by the Agencia Estatal de Investigación of the Spanish Government.
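The step from leaf contents to a gamma distribution can be sketched as follows. This is only an illustration under stated assumptions: the wet-day values found in the leaves reached by one prediction are taken as given (no forest is trained here), the parameters are estimated by the method of moments, which may differ from the estimator used in the paper, and both function names are hypothetical.

```python
import random
from statistics import fmean, pvariance


def gamma_from_leaves(leaf_values):
    """Estimate gamma (shape, scale) from wet-day values pooled over leaves.

    leaf_values: one list of observed wet-day precipitation amounts per tree,
    taken from the leaf that the query day falls into (assumed input).
    Method of moments: shape = mean^2 / var, scale = var / mean.
    """
    pooled = [v for leaf in leaf_values for v in leaf]
    mean = fmean(pooled)
    var = pvariance(pooled, mu=mean)
    if mean <= 0 or var <= 0:
        raise ValueError("need positive mean and variance to fit a gamma")
    return mean * mean / var, var / mean


def sample_precipitation(shape, scale, n, seed=None):
    """Draw n stochastic wet-day amounts from the fitted gamma distribution."""
    rng = random.Random(seed)
    return [rng.gammavariate(shape, scale) for _ in range(n)]


# Hypothetical leaf contents (mm/day) from a two-tree forest.
leaves = [[2.0, 4.0], [6.0, 8.0]]
shape, scale = gamma_from_leaves(leaves)
print(shape, scale)
print(sample_precipitation(shape, scale, 3, seed=42))
```

Averaging the leaf values alone would give the usual deterministic forest prediction; keeping the spread of the values as well is what allows stochastic samples to be drawn.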