
    Probabilistic type inference for the construction of data dictionaries

    The data understanding stage plays a central role in the entire process of data analytics, as it allows the analyst to gain familiarity with the data, identify data quality issues, and discover initial insights into the data before further analysis (Chapman et al., 2000). These tasks become easier in the presence of well-documented background information such as a data dictionary, defined as “a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format” (McDaniel, 1994). However, data dictionaries are often missing or incomplete. In this thesis we focus on the inference of data types (both syntactic and semantic) and develop probabilistic approaches that enable the automatic construction of a data dictionary for a given dataset. Unlike existing rule-based methods, our proposed methods express uncertainty in a principled way and provide accurate type predictions even for messy datasets with missing and anomalous values. The thesis makes the following contributions. First, we present ptype, a probabilistic generative model that uses Probabilistic Finite-State Machines (PFSMs) to represent data types. By detecting missing and anomalous data, ptype infers syntactic data types accurately and improves on the performance of existing approaches to type inference. Moreover, it generates weighted predictions when a column of messy data is consistent with more than one type assignment, in contrast to more familiar finite-state machines (e.g., regular expressions). Second, we propose ptype-cat, an extension of ptype for better detection of the categorical type. ptype treats non-Boolean categorical variables as either integers or strings; by combining the output of ptype with additional features that indicate whether a column represents a categorical variable, ptype-cat correctly detects the general categorical type (including non-Boolean variables). In addition, we adapt ptype to the task of identifying the values associated with the corresponding categorical variable. Finally, we present ptype-semantics to demonstrate how ptype can be enriched with semantic information. Here we focus on dimension and unit inference: identifying the dimension of a data column and the units of its entries, respectively. Syntactic type inference methods, including ptype, do not address these tasks. ptype-semantics, however, can extract such extra semantic information (dimension and unit) about data columns and treat them as floats or integers rather than strings.
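    The abstract gives no code, but the key idea of weighted type predictions for a messy column can be illustrated with a small sketch. The following is a minimal toy model, not ptype's actual PFSM-based implementation: it assumes hypothetical per-type regular-expression recognisers with rough alphabet sizes in place of PFSMs, a fixed missing-value dictionary, and a uniform prior over types, and mixes a missing and an anomaly component into each row's likelihood so that a few bad entries do not force a column to be typed as string.

        import math
        import re

        # Toy, hypothetical recognisers standing in for ptype's PFSMs. Each pairs a
        # regex with a rough "alphabet size", so a value drawn from a small language
        # (e.g. digits only) is more likely under that type than under the
        # catch-all string type.
        TYPES = {
            "integer": (re.compile(r"[+-]?\d+"), 11),
            "float":   (re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?"), 14),
            "string":  (re.compile(r".+"), 100),
        }
        MISSING_TOKENS = {"", "NA", "N/A", "null", "-"}
        ANOMALY_ALPHABET = 100  # anomaly component: uniform over printable characters

        def column_type_posterior(values, w_type=0.90, w_missing=0.05, w_anomaly=0.05):
            """Posterior over candidate column types, with a per-row mixture over
            {type, missing, anomaly} so a few bad rows do not derail inference."""
            log_lik = {}
            for t, (pattern, alphabet) in TYPES.items():
                total = 0.0
                for v in values:
                    v = str(v).strip()
                    p_t = alphabet ** -len(v) if v and pattern.fullmatch(v) else 0.0
                    p_m = 1.0 if v in MISSING_TOKENS else 0.0
                    p_a = ANOMALY_ALPHABET ** -max(len(v), 1)
                    total += math.log(w_type * p_t + w_missing * p_m + w_anomaly * p_a)
                log_lik[t] = total
            # Uniform prior over types; normalise in log space for stability.
            m = max(log_lik.values())
            z = sum(math.exp(l - m) for l in log_lik.values())
            return {t: math.exp(l - m) / z for t, l in log_lik.items()}

        # A messy column: mostly integers, one missing token, one anomalous entry.
        print(column_type_posterior(["1", "2", "N/A", "3", "four", "5"]))

    In this toy setup most of the posterior mass goes to integer, a smaller share to float, and almost none to string, mirroring the weighted predictions described above.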

    ptype: probabilistic type inference

    Type inference refers to the task of inferring the data type of a given column of data. Current approaches often fail when data contain missing values and anomalies, which are common in real-world datasets. In this paper, we propose ptype, a probabilistic robust type inference method that allows us to detect such entries and infer data types. We further show that the proposed method outperforms existing methods.
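    As a hypothetical illustration of the failure mode above (not of ptype itself), a conventional tool such as pandas collapses a numeric column to a generic object type as soon as one anomalous entry appears, and coercion then silently discards that entry:

        import pandas as pd
        from io import StringIO

        # A column of counts with one missing token and one anomalous entry.
        csv = StringIO("id,count\n1,10\n2,20\n3,N/A\n4,error\n5,50\n")
        df = pd.read_csv(csv)

        # The stray 'error' string hides the numeric type of the whole column.
        print(df["count"].dtype)  # -> object
        # Coercing salvages the numbers but silently turns 'error' into NaN,
        # with no distinction between missing and anomalous entries.
        print(pd.to_numeric(df["count"], errors="coerce"))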

    A real-time SIP network simulation and monitoring system

    In this work we present a real-time SIP network simulation and monitoring system. The SIP network simulator is based on a probabilistic generative model that mimics a social network of VoIP subscribers calling each other at random times. The monitoring system, installed at a SIP server, provides services for collecting network data and server statistics in real time. The system provides a robust framework for developing SIP network applications such as security monitors. Keywords: SIP networks, Network simulation, DDoS attack detection
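    A minimal sketch of the kind of generative call model described above: subscribers on a random contact graph place calls to their contacts at exponentially distributed intervals (a Poisson process), producing a stream of INVITE/BYE events. The subscriber URIs, rates, and event format here are assumptions for illustration, not the simulator's actual design.

        import random

        def simulate_calls(n_subscribers=20, n_contacts=3, rate_per_hour=0.5,
                           horizon_hours=24.0, seed=0):
            rng = random.Random(seed)
            subscribers = [f"sip:user{i}@example.org" for i in range(n_subscribers)]
            # Random "social network": each subscriber calls a fixed set of contacts.
            contacts = {s: rng.sample([u for u in subscribers if u != s], n_contacts)
                        for s in subscribers}

            events = []
            for caller in subscribers:
                t = 0.0
                while True:
                    # Exponential inter-call times, i.e. a Poisson call process.
                    t += rng.expovariate(rate_per_hour)
                    if t > horizon_hours:
                        break
                    callee = rng.choice(contacts[caller])
                    duration = rng.expovariate(1 / 0.05)  # ~3-minute mean call, in hours
                    events.append((t, "INVITE", caller, callee))
                    events.append((t + duration, "BYE", caller, callee))
            return sorted(events)

        for t, method, src, dst in simulate_calls()[:5]:
            print(f"{t:6.2f}h {method:6s} {src} -> {dst}")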

    CodeCheck: How do our food choices affect climate change?

    Several approaches were proposed to predict the carbon footprint of products from the datasets provided by CodeCheck. Multivariate linear regression and random forest regression models performed well in predicting carbon footprint, especially when product categories learned through Latent Dirichlet Allocation (LDA) were used as extra features alongside the nutrition information. The prediction accuracy of the models considered varied across datasets. A potential way to display the footprint estimates in the app was also proposed.
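    A rough sketch of the modelling pipeline described above, using scikit-learn. The data, feature names, and hyperparameters below are invented placeholders (the CodeCheck datasets are not reproduced here); the point is only to show LDA-derived category features being appended to nutrition features before fitting linear and random forest regressors.

        import numpy as np
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split

        # Toy stand-in data: nutrition values per 100g and short product descriptions.
        rng = np.random.default_rng(0)
        n = 200
        nutrition = rng.uniform(0, 50, size=(n, 4))  # e.g. fat, carbs, protein, sugar
        descriptions = rng.choice(
            ["beef stew", "oat milk", "cheddar cheese", "lentil soup"], size=n)
        footprint = (nutrition @ np.array([0.08, 0.01, 0.05, 0.02])
                     + (descriptions == "beef stew") * 5.0)

        # Learn soft product categories from the text with LDA and append them
        # to the nutrition features.
        counts = CountVectorizer().fit_transform(descriptions)
        topics = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(counts)
        X = np.hstack([nutrition, topics])

        X_train, X_test, y_train, y_test = train_test_split(X, footprint, random_state=0)
        for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
            print(type(model).__name__, model.fit(X_train, y_train).score(X_test, y_test))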