7 research outputs found

    FlashProfile: A Framework for Synthesizing Data Profiles

    We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words, etc.), and provide no control over the granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, which also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼0.7 s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as FlashFill. Comment: 28 pages, SPLASH (OOPSLA) 2018.
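
    The profiling idea described above, clustering strings by syntactic shape and summarizing each cluster with a regex-like pattern, can be illustrated with a small Python sketch. This is not the FlashProfile algorithm: it uses only a fixed set of character classes and trivial signature-based clustering, and all function names are illustrative.

        # Toy syntactic profiler: abstract each string into character-class
        # tokens, cluster strings with identical token sequences, and render
        # one regex-like pattern per cluster. Illustrative only.
        import re
        from collections import defaultdict

        def signature(s):
            # Map maximal runs of digits, letters, or single other characters
            # to abstract (kind, detail) tokens.
            tokens = []
            for run in re.finditer(r"\d+|[A-Za-z]+|.", s):
                t = run.group()
                if t.isdigit():
                    tokens.append(("Digits", len(t)))
                elif t.isalpha():
                    tokens.append(("Letters", len(t)))
                else:
                    tokens.append(("Literal", t))
            return tuple(tokens)

        def profile(strings):
            # Cluster by signature, then render one pattern per cluster.
            clusters = defaultdict(list)
            for s in strings:
                clusters[signature(s)].append(s)
            patterns = []
            for sig, members in clusters.items():
                parts = []
                for kind, val in sig:
                    if kind == "Digits":
                        parts.append(r"\d{%d}" % val)
                    elif kind == "Letters":
                        parts.append(r"[A-Za-z]{%d}" % val)
                    else:
                        parts.append(re.escape(val))
                patterns.append(("".join(parts), len(members)))
            return patterns

        # One (pattern, support) pair per syntactic cluster:
        print(profile(["2024-01-15", "1999-12-31", "Jan 5, 2020"]))

    A real profiler such as FlashProfile additionally searches a richer, user-specifiable pattern language and supports interactive refinement to a requested number of clusters, as described in the abstract.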

    Democratizing Self-Service Data Preparation through Example Guided Program Synthesis

    The majority of real-world data we can access today have one thing in common: they are not immediately usable in their original state. Trapped in a swamp of data usability issues like non-standard data formats and heterogeneous data sources, most data analysts and machine learning practitioners have to burden themselves with "data janitor" work, writing ad-hoc Python, Perl, or SQL scripts, which is tedious and inefficient. It is estimated that data scientists or analysts typically spend 80% of their time preparing data, a significant amount of human effort that could be redirected to better goals. In this dissertation, we address this problem by harnessing knowledge such as examples and other useful hints from the end user. We develop program synthesis techniques guided by heuristics and machine learning, which effectively make data preparation less painful and more efficient to perform for data users, particularly those with little to no programming experience. Data transformation, also called data wrangling or data munging, is an important task in data preparation, seeking to convert data from one format to a different (often more structured) format. Our system Foofah shows that allowing end users to describe their desired transformation through small input-output transformation examples can significantly reduce the overall user effort. The underlying program synthesizer can often find meaningful data transformation programs within a reasonably short amount of time. Our second system, CLX, demonstrates that sometimes the user does not even need to provide complete input-output examples, but only to label values that are already in the desired form, if any exist in the original dataset. The system is still capable of suggesting reasonable and explainable transformation operations to fix non-standard data format issues in a dataset full of heterogeneous data with varied formats. PRISM, our third system, targets the data preparation task of data integration, i.e., combining multiple relations to form a desired schema. PRISM allows the user to describe the target schema using not only high-resolution (precise) constraints, i.e., complete example data records in the target schema, but also (imprecise) constraints of varied resolutions, such as incomplete data record examples with missing values, value ranges, or multiple possible values in each element (cell), so as to require less familiarity with the database contents from the end user. PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163059/1/markjin_1.pd
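
    The programming-by-example workflow behind Foofah, i.e. searching for a transformation program that is consistent with a few input-output examples, can be sketched as follows. This is a toy enumerative synthesizer over a handful of hypothetical string primitives, not Foofah's actual operator set, cost model, or search heuristics.

        # Toy PBE synthesizer: enumerate short compositions of primitive
        # string operations and return the first one consistent with all
        # user-provided input/output examples. Primitives are illustrative.
        from itertools import product

        PRIMITIVES = {
            "strip": str.strip,
            "lower": str.lower,
            "upper": str.upper,
            "drop_commas": lambda s: s.replace(",", ""),
            "first_token": lambda s: s.split()[0] if s.split() else s,
        }

        def synthesize(examples, max_depth=3):
            # Breadth-first enumeration over compositions of primitives.
            for depth in range(1, max_depth + 1):
                for names in product(PRIMITIVES, repeat=depth):
                    def program(s, names=names):
                        for n in names:
                            s = PRIMITIVES[n](s)
                        return s
                    if all(program(i) == o for i, o in examples):
                        return names, program
            return None

        examples = [(" Ann Smith ", "ann"), (" Bob Lee ", "bob")]
        found = synthesize(examples)
        if found:
            names, prog = found
            print(names, prog("  Carol Jones  "))  # applies the learned program

    Real synthesizers make this search tractable on realistic transformations by pruning it with heuristics or learned guidance, which is the focus of the dissertation.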

    Just-In-Time Data Virtualization: Lightweight Data Management with ViDa

    As the size of data and its heterogeneity increase, traditional database system architecture becomes an obstacle to data analysis. Integrating and ingesting (loading) data into databases is quickly becoming a bottleneck in the face of massive data as well as increasingly heterogeneous data formats. Still, state-of-the-art approaches typically rely on copying and transforming data into one (or few) repositories. Queries, on the other hand, are often ad-hoc and supported by pre-cooked operators which are not adaptive enough to optimize access to data. As data formats and queries increasingly vary, there is a need to depart from the current status quo of static query processing primitives and build dynamic, fully adaptive architectures. We build ViDa, a system which reads data in its raw format and processes queries using adaptive, just-in-time operators. Our key insight is the use of virtualization, i.e., abstracting data and manipulating it regardless of its original format, and dynamic generation of operators. ViDa's query engine is generated just-in-time; its caches and its query operators adapt to the current query and the workload, while also treating raw datasets as its native storage structures. Finally, ViDa features a language expressive enough to support heterogeneous data models, and to which existing languages can be translated. Users therefore have the power to choose the language best suited for an analysis.
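
    The core idea, querying raw data in place with operators specialized at query time, can be illustrated with a short Python sketch. It is not ViDa's code-generating engine: the format dispatch and the select-project query below are simplified stand-ins for just-in-time operator generation, and the file names are hypothetical.

        # Toy "query the raw file in place" engine: a scan operator is
        # specialized per file format when the query arrives, so no data
        # is loaded or transformed into a database beforehand.
        import csv
        import json

        def make_scan(path):
            # Specialize the scan operator to the file's format at query time.
            if path.endswith(".csv"):
                def scan():
                    with open(path, newline="") as f:
                        for row in csv.DictReader(f):
                            yield row
            elif path.endswith(".jsonl"):
                def scan():
                    with open(path) as f:
                        for line in f:
                            yield json.loads(line)
            else:
                raise ValueError("unsupported raw format: " + path)
            return scan

        def query(path, predicate, projection):
            # A simple select-project query evaluated over the raw file.
            scan = make_scan(path)
            return [{k: rec[k] for k in projection}
                    for rec in scan() if predicate(rec)]

        # Hypothetical usage over a raw CSV file named "orders.csv":
        # rows = query("orders.csv", lambda r: r["country"] == "CH", ["id", "total"])

    ViDa goes much further, generating its whole query engine, caches, and operators just-in-time and adapting them to the workload, but the sketch captures the notion of treating raw datasets as native storage.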

    Detecting anomalies in modern IT systems through the inference of structure and the detection of novelties in system logs

    Anomalies in the logs of information systems are often a sign of faults or vulnerabilities. Detecting them automatically is challenging because logs lack structure and the anomalies themselves are complex. Existing structure-inference methods are not very flexible: they are either not parametric or rely on strong syntactic assumptions that sometimes prove inadequate. Anomaly detection methods, in turn, adopt a data representation that neglects the time elapsed between log entries, and are therefore unsuitable for detecting temporal anomalies. The contribution of this thesis is twofold. We first propose METING, a parametric and modular structure inference method. METING does not rely on any strong syntactic assumption; instead, it mines frequent patterns by studying the n-grams of the logs. We show experimentally that METING outperforms existing methods, with substantial improvements on some datasets. We also show that the sensitivity of our method to its hyper-parameters allows it to explore many configurations and to adapt to the heterogeneity of datasets. Finally, we extend METING to the context of stemming in text processing, and show that our approach provides a multilingual, rule-free stemmer that is more effective than Porter's method, the state-of-the-art reference. We also present NoTIL, a deep-learning-based novelty detection method. NoTIL uses a data representation capable of capturing temporal irregularities in the logs. Our method learns an intermediate prediction task to model the nominal behavior of the logs. We compare our method to state-of-the-art baselines and conclude that NoTIL handles the greatest variety of anomalies, thanks to the choice of its data representation.
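
    The frequent-pattern intuition behind the structure inference described above can be sketched in a few lines of Python: tokens that recur across many log lines are kept as template text, while rare tokens are masked as variables. For brevity this sketch uses unigram token frequencies rather than the n-gram mining METING actually performs, and the 0.5 support threshold is an arbitrary choice.

        # Toy log-template inference via frequent tokens. Tokens whose
        # corpus frequency exceeds a support threshold stay literal; the
        # rest are abstracted into a <*> placeholder.
        from collections import Counter

        def infer_templates(lines, min_support=0.5):
            tokenized = [line.split() for line in lines]
            counts = Counter(tok for toks in tokenized for tok in toks)
            threshold = min_support * len(lines)
            templates = set()
            for toks in tokenized:
                templates.add(" ".join(
                    t if counts[t] >= threshold else "<*>" for t in toks))
            return sorted(templates)

        logs = [
            "Connection from 10.0.0.1 closed",
            "Connection from 10.0.0.7 closed",
            "Disk usage at 91 percent",
        ]
        print(infer_templates(logs))
        # ['<*> <*> <*> <*> <*>', 'Connection from <*> closed']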

    LearnPADS++: Incremental Inference of Ad Hoc Data Formats

    An ad hoc data source is any semi-structured, non-standard data source. The format of such data sources is often evolving and frequently lacks documentation. Consequently, off-the-shelf tools for processing such data often do not exist, forcing analysts to develop their own tools, a costly and time-consuming process. In this paper, we present an incremental algorithm that automatically infers the format of large-scale data sources. From the resulting format descriptions, we can generate a suite of data processing tools automatically. The system can handle large-scale or streaming data sources whose formats evolve over time. Furthermore, it allows analysts to modify inferred descriptions as desired and incorporates those changes in future revisions.
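
    The incremental flavor of the inference can be illustrated with a small Python sketch: a running per-field description of a delimited ad hoc source is widened as new chunks of records arrive, so the description evolves with the data. This uses a toy type lattice (int, float, string) and a comma separator, not the PADS description language or the LearnPADS++ algorithm.

        # Toy incremental format inference for a delimited ad hoc source.
        import re

        FIELD_TYPES = [("int", r"-?\d+"), ("float", r"-?\d+\.\d+")]

        def field_type(value):
            for name, pattern in FIELD_TYPES:
                if re.fullmatch(pattern, value):
                    return name
            return "string"

        def refine(description, chunk, sep=","):
            # Widen the per-field types of the current description so that
            # it also covers every record in the new chunk.
            for record in chunk:
                fields = record.split(sep)
                if description is None:
                    description = [field_type(f) for f in fields]
                    continue
                for i, f in enumerate(fields[:len(description)]):
                    if field_type(f) != description[i]:
                        description[i] = "string"  # top of this toy lattice
            return description

        desc = None
        desc = refine(desc, ["42,3.14,alice", "7,2.71,bob"])  # first batch
        desc = refine(desc, ["oops,1.0,carol"])               # format drift
        print(desc)  # ['string', 'float', 'string']

    In the real system, the inferred description is rich enough to automatically generate parsers and other processing tools, and analyst edits to the description are carried forward into future revisions, as the abstract notes.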