14,776 research outputs found

    Towards Data Wrangling Automation through Dynamically-Selected Background Knowledge

    Data science is essential for extracting value from data. However, the most tedious part of the process, data wrangling, involves a range of mostly manual formatting, identification and cleansing operations. Data wrangling still resists automation, partly because the problem depends strongly on domain information, which becomes a bottleneck for state-of-the-art systems as the diversity of domains, formats and structures of the data increases. In this thesis we focus on generating algorithms that take advantage of domain knowledge to automate parts of the data wrangling process. We illustrate how general program induction techniques, instead of domain-specific languages, can be applied flexibly to problems where knowledge is important, through the dynamic use of domain-specific knowledge. More generally, we argue that a combination of knowledge-based and dynamic learning approaches leads to successful solutions. We propose several strategies to automatically select or construct the appropriate background knowledge for several data wrangling scenarios. The key idea is to choose the best specialised background primitives according to the context of the particular problem to solve. We address two scenarios. In the first, we handle personal data (names, dates, telephone numbers, etc.) that are presented in very different string formats and have to be transformed into a unified format. The problem is how to build a compositional transformation from a large set of primitives in the domain (e.g., handling months, years, days of the week, etc.). We develop a system (BK-ADAPT) that guides the search through the background knowledge by extracting several meta-features from the examples characterising the column domain. In the second scenario, we face the transformation of data matrices in generic programming languages such as R, using an input matrix and some cells of the output matrix as examples. We also develop a system guided by a tree-based search (AUTOMAT[R]IX) that uses several constraints, prior primitive probabilities and textual hints to learn the transformations efficiently. With these systems, we show that combining inductive programming with the dynamic selection of the appropriate primitives from the background knowledge improves on the results of other state-of-the-art and more specific data wrangling approaches.
    This research was supported by the Spanish MECD Grant FPU15/03219; and partially by the Spanish MINECO TIN2015-69175-C4-1-R (Lobass) and RTI2018-094403-B-C32-AR (FreeTech) in Spain; and by the ERC Advanced Grant Synthesising Inductive Data Models (Synth) in Belgium.
    Contreras Ochando, L. (2020). Towards Data Wrangling Automation through Dynamically-Selected Background Knowledge [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/160724

    Grammar-Guided Genetic Programming For Fuzzy Rule-Based Classification in Credit Management

    Qualitative System Identification from Imperfect Data

    Experience in the physical sciences suggests that the only realistic means of understanding complex systems is through the use of mathematical models. Typically, this has come to mean the identification of quantitative models expressed as differential equations. Quantitative modelling works best when the structure of the model (i.e., the form of the equations) is known, and the primary concern is one of estimating the values of the parameters in the model. For complex biological systems, the model structure is rarely known and the modeler has to deal with both model identification and parameter estimation. In this paper we are concerned with providing automated assistance for the first of these problems. Specifically, we examine the identification by machine of the structural relationships between experimentally observed variables. These relationships will be expressed in the form of qualitative abstractions of a quantitative model. Such qualitative models may not only provide clues to the precise quantitative model, but also assist in understanding the essence of that model. Our position in this paper is that background knowledge incorporating system modelling principles can be used to constrain effectively the set of good qualitative models. Utilising the model-identification framework provided by Inductive Logic Programming (ILP), we present empirical support for this position using a series of increasingly complex artificial datasets. The results are obtained with qualitative and quantitative data subject to varying amounts of noise and different degrees of sparsity. The results also point to the presence of a set of qualitative states, which we term kernel subsets, that may be necessary for a qualitative model-learner to learn correct models. We demonstrate scalability of the method to biological system modelling by identification of the glycolysis metabolic pathway from data.

    A new approach for discovering business process models from event logs.

    Process mining is the automated acquisition of process models from the event logs of information systems. Although process mining has many useful applications, not all inherent difficulties have been sufficiently solved. A first difficulty is that process mining is often limited to a setting of unsupervised learning, since negative information is often not available. Moreover, state transitions in processes are often dependent on the traversed path, which limits the appropriateness of search techniques based on local information in the event log. Another difficulty is that case data and resource properties that can also influence state transitions are time-varying properties, such that they cannot be considered as cross-sectional. This article investigates the use of first-order, ILP classification learners for process mining and describes techniques for dealing with each of the above-mentioned difficulties. To make process mining a supervised learning task, we propose to include negative events in the event log. When event logs contain no negative information, a technique is described to add artificial negative examples to a process log. To capture history-dependent behavior, the article proposes to take advantage of the multi-relational nature of ILP classification learners. Multi-relational process mining allows searching for patterns among multiple event rows in the event log, effectively basing its search on global information. To deal with time-varying case data and resource properties, a closed-world version of the Event Calculus has to be added as background knowledge, effectively transforming the event log into a temporal database. First experiments on synthetic event logs show that first-order classification learners are capable of predicting the behavior with high accuracy, even under conditions of noise.
    Keywords: Credit; Credit scoring; Models; Model; Applications; Performance; Space; Decision; Yield; Real life; Risk; Evaluation; Rules; Neural networks; Networks; Classification; Research; Business; Processes; Event; Information; Information systems; Systems; Learning; Data; Behavior; Patterns; IT; Event calculus; Knowledge; Database; Noise

    A review of the state of the art in Machine Learning on the Semantic Web: Technical Report CSTR-05-003
