Search CORE

6 research outputs found

Probabilistic type inference for the construction of data dictionaries

Author: Ceritli Taha Yusuf
Publication venue: The University of Edinburgh
Publication date: 30/11/2021

The data understanding stage plays a central role in the entire process of data analytics, as it allows the analyst to gain familiarity with the data, identify data quality issues, and discover initial insights into the data before further analysis (Chapman et al., 2000). These tasks become easier in the presence of well-documented background information such as a data dictionary, which is defined as “a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format” (McDaniel, 1994). However, data dictionaries are often missing or incomplete. In this thesis we focus on inference of data types (both syntactic and semantic), and develop probabilistic approaches that enable the automatic construction of a data dictionary for a given dataset. Unlike existing rule-based methods, our proposed methods allow us to express uncertainty in a principled way and can provide accurate type predictions even for messy datasets with missing and anomalous values. The thesis makes the following contributions: First, we present ptype, a probabilistic generative model that uses Probabilistic Finite-State Machines (PFSMs) to represent data types. By detecting missing and anomalous data, ptype infers syntactic data types accurately and improves over the performance of existing approaches for type inference. Moreover, it offers the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment, in contrast to more familiar finite-state machines (e.g., regular expressions). Second, we propose ptype-cat, an extension of ptype for better detection of the categorical type. ptype treats non-Boolean categorical variables as either integers or strings.
By combining the output of ptype and additional features that can indicate whether a column represents a categorical variable or not, ptype-cat can correctly detect the general categorical type (including non-Boolean variables). In addition, we adapt ptype to the task of identifying the values associated with the corresponding categorical variable. Finally, we present ptype-semantics to demonstrate how ptype can be enriched by semantic information. In this regard, we focus on dimension and unit inference, which are respectively the task of identifying the dimension of a data column and the task of identifying the units of its entries. Syntactic type inference methods including ptype do not address these tasks. However, ptype-semantics can extract extra semantic information (such as dimension and unit) about data columns and treat them as either floats or integers rather than strings.

Edinburgh Research Archive

ptype: probabilistic type inference

Author: Ceritli Taha
Geddes James
Williams Christopher K. I.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 23/03/2020

Type inference refers to the task of inferring the data type of a given column of data. Current approaches often fail when the data contain missing values and anomalies, which are common in real-world data sets. In this paper, we propose ptype, a robust probabilistic type inference method that allows us to detect such entries and infer data types. We further show that the proposed method outperforms existing methods.

arXiv.org e-Print Archive

Edinburgh Research Explorer

AI Assistants: A Framework for Semi-Automated Data Wrangling

Author: Ceritli Taha
Jiménez-Ruiz Ernesto
Nazábal Alfredo
Petricek Tomas
van Den Burg Gerrit J.J.
Williams Christopher K. I.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 16/11/2022

Edinburgh Research Explorer

A real-time SIP network simulation and monitoring system

Author: Ali Taylan Cemgil
Barış Kurt
Bülent Sankur
Taha Yusuf Ceritli
Çağatay Yıldız
Publication venue: 'Elsevier BV'
Publication date: 01/07/2018

In this work we present a real-time SIP network simulation and monitoring system. The SIP network simulator is based on a probabilistic generative model that mimics a social network of VoIP subscribers calling each other at random times. The monitoring system, installed at a SIP server, provides services for collecting network data and server statistics in real time. The system provides a robust framework for developing SIP network applications such as security monitors. Keywords: SIP networks, Network simulation, DDoS attack detection

Directory of Open Access Journals

CodeCheck: How do our food choices affect climate change?

Author: Arenas Diego
Boustati Ayman
Ceritli Taha
Chang Marina
de Wiljes Jan-Hendrik
Drikvandi Reza
Ezer Daphne
Groves Matthew
Varga Marton
Williams Angus
Publication venue: [No known commissioning body]
Publication date: 01/09/2018

Different approaches were proposed to predict the carbon footprint of products from the different datasets provided by CodeCheck. Multivariate linear regression and random forest regression models perform well in predicting carbon footprint, especially when, in addition to the nutrition information, the product categories learned through Latent Dirichlet Allocation (LDA) were used as extra features in the models. The prediction accuracy of the models that were considered varied across datasets. A potential way to display the footprint estimates in the app was proposed.

Durham Research Online

E-space: Manchester Metropolitan University's Research Repository

A Bayesian change point model for detecting SIP-based DDoS attacks

Author: Ali Taylan Cemgil
Barış Kurt
Bülent Sankur
Taha Yusuf Ceritli
Çağatay Yıldız
Publication venue: 'Elsevier BV'

Crossref