Probabilistic type inference for the construction of data dictionaries
The data understanding stage plays a central role in the entire process of data analytics,
as it allows the analyst to gain familiarity with the data, identify data quality issues,
and discover initial insights into the data before further analysis (Chapman et al., 2000).
These tasks become easier in the presence of well-documented background information such as a data dictionary, which is defined as “a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format”
(McDaniel, 1994). However, data dictionaries are often missing or incomplete.
In this thesis we focus on inference of data types (both syntactic and semantic),
and develop probabilistic approaches that enable the automatic construction of a data
dictionary for a given dataset. Unlike existing rule-based methods, our proposed methods allow us to express uncertainty in a principled way and can provide accurate type
predictions even for messy datasets with missing and anomalous values.
The thesis makes the following contributions. First, we present ptype, a probabilistic generative model that uses Probabilistic Finite-State Machines (PFSMs) to
represent data types. By detecting missing and anomalous data, ptype infers syntactic
data types accurately and improves over the performance of existing approaches for
type inference. Moreover, it offers the advantage of generating weighted predictions
when a column of messy data is consistent with more than one type assignment, in
contrast to more familiar finite-state machines (e.g., regular expressions).
Second, we propose ptype-cat, an extension of ptype that better detects the categorical type. ptype itself treats non-Boolean categorical variables as either
integers or strings. By combining the output of ptype and additional features that
can indicate whether a column represents a categorical variable or not, ptype-cat can
correctly detect the general categorical type (including non-Boolean variables). In
addition, we adapt ptype to the task of identifying the values associated with the corresponding categorical variable.
Finally, we present ptype-semantics to demonstrate how ptype can be enriched
by semantic information. In this regard, we focus on dimension and unit inference,
which are respectively the task of identifying the dimension of a data column and the
task of identifying the units of its entries. Syntactic type inference methods including
ptype do not address these tasks. However, ptype-semantics can extract extra semantic
information (such as dimension and unit) about data columns and treat them as either
floats or integers rather than strings.
ptype: probabilistic type inference
Type inference refers to the task of inferring the data type of a given
column of data. Current approaches often fail when the data contain missing values
and anomalies, which are common in real-world data sets. In this paper,
we propose ptype, a robust probabilistic type inference method that allows us
to detect such entries and infer data types. We further show that the proposed
method outperforms existing methods.
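The idea of scoring each candidate type while allowing for missing and anomalous entries can be illustrated with a minimal sketch. This is not the ptype implementation: regular expressions with fixed acceptance likelihoods stand in for the PFSMs, and the names `TYPE_MODELS`, `MISSING_TOKENS`, `WEIGHTS`, and all numeric values are assumptions chosen for illustration.

```python
import math
import re

# Hypothetical stand-ins for per-type machines: a regex plus one fixed
# likelihood for every string it accepts. A real PFSM would assign each
# string its own probability.
TYPE_MODELS = {
    "integer": (re.compile(r"[+-]?\d+"), 1e-2),
    "float": (re.compile(r"[+-]?\d+(\.\d+)?"), 1e-3),
    "string": (re.compile(r"[A-Za-z0-9 ._-]+"), 1e-6),
}
MISSING_TOKENS = {"", "NA", "NULL", "-", "?"}  # assumed missing-data encodings
ANOMALY_LIKELIHOOD = 1e-9                      # assumed: any string, but unlikely
WEIGHTS = {"type": 0.90, "missing": 0.05, "anomaly": 0.05}  # assumed mixture weights


def components(value, type_name):
    """Weighted likelihood of one entry under each mixture component."""
    pattern, accept_prob = TYPE_MODELS[type_name]
    p_type = accept_prob if pattern.fullmatch(value) else 0.0
    p_missing = 1.0 / len(MISSING_TOKENS) if value in MISSING_TOKENS else 0.0
    return {
        "type": WEIGHTS["type"] * p_type,
        "missing": WEIGHTS["missing"] * p_missing,
        "anomaly": WEIGHTS["anomaly"] * ANOMALY_LIKELIHOOD,
    }


def type_posterior(column):
    """Posterior over column types under a uniform prior."""
    log_scores = {
        t: sum(math.log(sum(components(v, t).values())) for v in column)
        for t in TYPE_MODELS
    }
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {t: math.exp(s - top) for t, s in log_scores.items()}
    total = sum(unnorm.values())
    return {t: u / total for t, u in unnorm.items()}


def entry_role(value, type_name):
    """Most probable component ('type', 'missing' or 'anomaly') for one entry."""
    parts = components(value, type_name)
    return max(parts, key=parts.get)
```

Calling `type_posterior(["1", "2", "NA", "35", "error!"])` returns a probability for each candidate type rather than a single hard decision, and `entry_role` labels individual entries as regular, missing, or anomalous under a chosen type.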
A real-time SIP network simulation and monitoring system
In this work we present a real-time SIP network simulation and monitoring system. The SIP network simulator is based on a probabilistic generative model that mimics a social network of VoIP subscribers calling each other at random times. The monitoring system, installed at a SIP server, provides services for collecting network data and server statistics in real time. The system provides a robust framework for developing SIP network applications such as security monitors. Keywords: SIP networks, Network simulation, DDoS attack detection
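A generative model of subscribers calling each other at random times can be sketched as follows. The abstract does not specify the model, so everything here is an assumption: each subscriber gets a random contact list, and call times follow independent Poisson processes (exponential gaps between calls). The function `simulate_calls` and its parameters are hypothetical and produce only a call log, not SIP traffic.

```python
import heapq
import random


def simulate_calls(n_users=5, n_contacts=2, rate_per_hour=1.0, hours=24.0, seed=0):
    """Return a time-ordered list of (time, caller, callee) call events."""
    rng = random.Random(seed)
    users = list(range(n_users))
    # Assumed social graph: every user has a random set of contacts.
    contacts = {
        u: rng.sample([v for v in users if v != u], n_contacts) for u in users
    }
    # Event queue holding the next call time of each user.
    queue = [(rng.expovariate(rate_per_hour), u) for u in users]
    heapq.heapify(queue)
    calls = []
    while queue:
        t, caller = heapq.heappop(queue)
        if t > hours:
            continue  # past the simulation horizon: drop without rescheduling
        callee = rng.choice(contacts[caller])
        calls.append((t, caller, callee))
        # Schedule this user's next call after an exponential waiting time.
        heapq.heappush(queue, (t + rng.expovariate(rate_per_hour), caller))
    return calls
```

Because events are popped from the heap in time order, the returned log is already sorted, which makes it straightforward to replay against a monitoring component.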
CodeCheck: How do our food choices affect climate change?
Different approaches were proposed to predict the carbon footprint of products from the different datasets provided by CodeCheck. Multivariate linear regression and random forest regression models performed well in predicting carbon footprint, especially when the product categories, learned through Latent Dirichlet Allocation (LDA), were used as extra features alongside the nutrition information. The prediction accuracy of the models that were considered varied across datasets. A potential way to display the footprint estimates in the app was proposed.