148 research outputs found

BClean: A Bayesian Data Cleaning System

Author: Huang Sifan
Mao Rui
Miao Yukai
Onizuka Makoto
Qin Jianbin
Wang Yaoshu
Xiao Chuan
Zhang Yifan
Zhu Jing
Publication venue
Publication date: 11/11/2023
Field of study

There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One prevalent approach is probabilistic methods, including Bayesian methods. However, existing probabilistic methods often assume a simplistic distribution (e.g., a Gaussian distribution), which frequently underfits in practice, or they require experts to provide a complex prior distribution (e.g., via a programming language). This requirement is both labor-intensive and costly, rendering these methods less suitable for real-world applications. In this paper, we propose BClean, a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction. We recast the data cleaning problem as a Bayesian inference that fully exploits the relationships between attributes in the observed dataset and any prior information provided by users. To this end, we present an automatic Bayesian network construction method that extends a structure learning-based functional dependency discovery method with similarity functions to capture the relationships between attributes. Furthermore, our system allows users to modify the generated Bayesian network in order to specify prior information or correct inaccuracies identified by the automatic generation process. We also design an effective scoring model (called the compensative scoring model) necessary for the Bayesian inference. To enhance the efficiency of data cleaning, we propose several approximation strategies for the Bayesian inference, including graph partitioning, domain pruning, and pre-detection. By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning, outperforming existing Bayesian methods by 2% and other data cleaning methods by 15%. Comment: Our source code is available at https://github.com/yyssl88/BClea
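
To illustrate the Bayesian-inference view of cleaning described in this abstract, the sketch below repairs a single cell by picking the candidate value with the highest posterior score, estimated from co-occurrence counts in the observed rows. It is a minimal naive-Bayes-style sketch, not BClean's network construction or its compensative scoring model; the function name `repair_cell`, the toy table, and the Laplace smoothing constant `alpha` are all assumptions made for the example.

```python
from collections import Counter

def repair_cell(rows, target, evidence, alpha=1.0):
    """Return the most probable value of column `target` given `evidence`.

    rows: list of dicts (the observed, possibly dirty table)
    evidence: dict mapping other column names to the values in the row being repaired
    alpha: Laplace smoothing constant (an assumption of this sketch)
    """
    domain = sorted({r[target] for r in rows})
    prior = Counter(r[target] for r in rows)
    best_value, best_score = None, float("-inf")
    for v in domain:
        # Unnormalized posterior: P(v) * prod_c P(evidence[c] | v)
        score = prior[v] / len(rows)
        for col, seen in evidence.items():
            joint = sum(1 for r in rows if r[target] == v and r[col] == seen)
            n_values = len({r[col] for r in rows})
            score *= (joint + alpha) / (prior[v] + alpha * n_values)
        if score > best_score:
            best_value, best_score = v, score
    return best_value

# Hypothetical toy table: one misspelled city among rows sharing a zip code.
table = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},   # typo to be repaired
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
]
repaired = repair_cell(table, target="city", evidence={"zip": "10001"})
```

In this toy table the misspelled value loses to the majority spelling that co-occurs with the same zip code. A real system would also need error detection and a learned dependency structure, which this sketch leaves out.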

arXiv.org e-Print Archive

Numerical methods and accurate computations with structured matrices

Author: Delgado Gracia Jorge
Orera Hernández Héctor
Peña Ferrández Juan Manuel
Publication venue: Universidad de Zaragoza, Prensas de la Universidad
Publication date: 01/01/2022
Field of study

This doctoral thesis is a compendium of 11 scientific articles. Its main topic is Numerical Linear Algebra, with emphasis on two classes of structured matrices: totally positive matrices and M-matrices. For some subclasses of these matrices, it is possible to develop algorithms that numerically solve several of the most common problems in linear algebra with high relative accuracy, independently of the condition number of the matrix. The key to achieving accurate computations lies in using a different parametrization that represents the special structure of the matrix and in developing adapted algorithms that work with that parametrization. Nonsingular totally positive matrices admit a unique factorization as a product of nonnegative bidiagonal matrices, called the bidiagonal factorization. If this representation is known with high relative accuracy, it can be used to solve certain systems of equations and to compute the inverse, the eigenvalues, and the singular values with high relative accuracy. Our contribution in this field has been to obtain, with high relative accuracy, the bidiagonal factorization of collocation matrices of generalized Laguerre polynomials, of collocation matrices of Bessel polynomials, of classes of matrices that generalize the Pascal matrix, and of matrices of q-integers. We have also studied the extension of several optimal properties of collocation matrices of normalized B-bases (which, in particular, are totally positive matrices). In particular, we have proved optimality properties of the collocation matrices of the tensor product of normalized B-bases. If the row sums and the off-diagonal entries of a nonsingular diagonally dominant M-matrix are known with high relative accuracy, then its inverse, determinant, and singular values can also be computed with high relative accuracy.
We have sought new methods to achieve accurate computations with new classes of M-matrices or related matrices. We have proposed a parametrization for Nekrasov Z-matrices with positive diagonal entries that can be used to compute their inverse and determinant with high relative accuracy. We have also studied the class called B-matrices, which is closely related to M-matrices. We have obtained a method to compute the determinants of this class with high relative accuracy, and another to compute the determinants of B-Nekrasov matrices, also with high relative accuracy. Based on two scaling matrices that we have introduced, we have developed new bounds for the infinity norm of the inverse of a Nekrasov matrix and for the error of the linear complementarity problem when its associated matrix is a Nekrasov matrix. We have also obtained new bounds for the infinity norm of the inverses of Bpi-matrices, a class that extends B-matrices, and have used them to obtain new error bounds for the linear complementarity problem whose associated matrix is a Bpi-matrix. Some classes of matrices have been generalized to the higher-dimensional case in order to develop a theory for tensors that extends the one known for matrices. For example, the definition of the class of B-matrices has been extended to the class of B-tensors, giving rise to a simple criterion for identifying a new class of positive definite tensors. We have proposed an extension of the class of Bpi-matrices to Bpi-tensors, thereby defining a new class of positive definite tensors that can be identified by a simple criterion based only on computations involving the entries of the tensor.
Finally, we have characterized the cases in which tridiagonal Toeplitz matrices are P-matrices and have studied when they can be represented in terms of a bidiagonal factorization that serves as a parametrization for achieving computations with high relative accuracy.
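
The idea of a bidiagonal factorization can be illustrated on the simplest case related to this abstract, the lower triangular Pascal matrix, which is a product of unit lower bidiagonal factors whose multipliers are all 1. The sketch below is only an illustration of that well-known special case, not one of the thesis's algorithms; the helper names `bidiagonal_factor`, `matmul`, and `pascal_from_factors` are assumptions made for the example.

```python
from math import comb

def bidiagonal_factor(n, k):
    """Unit lower bidiagonal matrix E_k: ones on the diagonal and on the
    subdiagonal entries (i, i-1) for rows i >= k (0-based indices)."""
    E = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i in range(k, n):
        E[i][i - 1] = 1
    return E

def matmul(A, B):
    """Plain square matrix product on nested lists."""
    n = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def pascal_from_factors(n):
    """Form E_{n-1} ... E_2 E_1. Every product and sum involves
    nonnegative numbers only, so no subtraction occurs."""
    L = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for k in range(1, n):
        L = matmul(bidiagonal_factor(n, k), L)
    return L

# Reference: entry (i, j) of the lower triangular Pascal matrix is C(i, j).
expected = [[comb(i, j) for j in range(6)] for i in range(6)]
pascal = pascal_from_factors(6)
```

Because the factors are nonnegative, the product is assembled without subtractions, which is the usual reason such parametrizations avoid cancellation errors in floating-point arithmetic.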

Repositorio Universidad de Zaragoza

Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference

Author
Publication venue: AUAI Press
Publication date: 01/09/2018
Field of study

UCL Discovery

Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty

Author: Peng Liping
Publication venue: ScholarWorks@UMass Amherst
Publication date: 23/03/2018
Field of study

Data management is becoming increasingly important in many applications, in particular in large scientific databases where (1) data can be naturally modeled by continuous random variables, and (2) queries can involve complex predicates and/or be difficult for users to express explicitly. My thesis work aims to provide efficient support for both data uncertainty and query uncertainty. When data is uncertain, an important class of queries requires query answers to be returned if their existence probabilities pass a threshold. I start with optimizing such threshold query processing for continuous uncertain data in the relational model by (i) expediting selections by reducing the dimensionality of integration and using faster filters, (ii) expediting joins using new indexes on uncertain data, and (iii) optimizing a query plan using a dynamic, per-tuple based approach. Evaluation results using real-world data and benchmark queries show the accuracy and efficiency of my techniques; the dynamic query planning achieves over 50% performance gains in most cases over a state-of-the-art threshold query optimizer and is very close to the optimal planning in all cases. Next, I address uncertain data management in the array model, which has recently gained popularity for scientific data processing due to its performance benefits. I define the formal semantics of array operations on uncertain data involving both value uncertainty within individual tuples and position uncertainty regarding where a tuple should belong in an array given uncertain dimension attributes, and propose a suite of storage and evaluation strategies for array operators, with a focus on a novel scheme that bounds the overhead of querying by strategically placing a few replicas of the tuples with large variances. Evaluation results show that for common workloads, my best-performing techniques outperform baselines by up to 1 to 2 orders of magnitude while incurring only small storage overhead.
Finally, to bridge the increasing gap between the fast growth of data and the limited human ability to comprehend data, and to help the user retrieve high-value content from data more effectively, I propose to build interactive data exploration as a new database service, using an approach called “explore-by-example”. To build an effective system, my work is grounded in a rigorous SVM-based active learning framework and focuses on the following three problems: (i) accuracy-based and convergence-based stopping criteria, (ii) expediting example acquisition in each iteration, and (iii) expediting the final result retrieval. Evaluation results using real-world data and query patterns show that my system significantly outperforms state-of-the-art systems in accuracy (18x accuracy improvement for 4-dimensional workloads) while achieving the desired efficiency for interactive exploration (2 to 5 seconds per iteration).
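
The threshold queries mentioned in this abstract can be sketched for the simplest case: each tuple carries a one-dimensional Gaussian value, and a range selection returns the tuples whose probability of satisfying the predicate reaches a threshold. This is a brute-force illustration under those assumptions, not the thesis's filters, indexes, or per-tuple planning; the names `normal_cdf`, `threshold_select`, and the sample readings are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(x, mean, std):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))

def threshold_select(tuples, low, high, threshold):
    """Return (id, probability) for tuples whose value lies in [low, high]
    with probability at least `threshold`."""
    result = []
    for tid, mean, std in tuples:
        # P(low <= X <= high) for X ~ Normal(mean, std^2)
        p = normal_cdf(high, mean, std) - normal_cdf(low, mean, std)
        if p >= threshold:
            result.append((tid, p))
    return result

# Hypothetical uncertain readings: (id, mean, standard deviation).
readings = [("a", 10.0, 1.0), ("b", 12.5, 1.0), ("c", 20.0, 0.5)]
hits = threshold_select(readings, low=9.0, high=12.0, threshold=0.5)
```

With multi-attribute predicates the probability becomes a multidimensional integral, which is why the abstract emphasizes reducing the dimensionality of integration and filtering before integrating.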

ScholarWorks@UMass Amherst

Large-scale integrated quantum photonics: development and applications

Author: Paesani Stefano
Publication venue
Publication date: 25/06/2019
Field of study

Explore Bristol Research

Deep learning for internet of underwater things and ocean data analytics

Author: Jahanbakht Mohammad
Publication venue
Publication date: 01/01/2022
Field of study

The Internet of Underwater Things (IoUT) is an emerging technological ecosystem developed for connecting objects in maritime and underwater environments. IoUT technologies are empowered by a very large number of deployed sensors and actuators. In this thesis, multiple IoUT sensory data are augmented with machine intelligence for forecasting purposes.

ResearchOnline at James Cook University

PERICLES Deliverable 4.3: Content Semantics and Use Context Analysis Techniques

Author: Chatzilari E
Corubolo F
Darányi Sandor
De Weerdt David
Gill Alastair
Kontopoulos Efstratios
Maronidis A
Mitzias P
Nikopoulos S
Riga M
Sauter Christine
Tonkin Emma L.
Waddington Simon
Wittek Peter
Publication venue
Publication date: 01/01/2016
Field of study

The current deliverable summarises the work conducted within task T4.3 of WP4, focusing on the extraction and the subsequent analysis of semantic information from digital content, which is imperative for its preservability. More specifically, the deliverable defines content semantic information from a visual and textual perspective, explains how this information can be exploited in long-term digital preservation and proposes novel approaches for extracting this information in a scalable manner. Additionally, the deliverable discusses novel techniques for retrieving and analysing the context of use of digital objects. Although this topic has not been extensively studied in the existing literature, we believe use context is vital in augmenting the semantic information and maintaining the usability and preservability of the digital objects, as well as their ability to be accurately interpreted as initially intended. PERICLE

University of Borås

Digitala Vetenskapliga Arkivet - Academic Archive On-line

King's Research Portal

Explore Bristol Research

Tools and Algorithms for the Construction and Analysis of Systems

Author
Publication venue: Springer Science and Business Media LLC
Publication date: 13/04/2022
Field of study

This open access book constitutes the proceedings of the 28th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2022, which was held during April 2-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 46 full papers and 4 short papers presented in this volume were carefully reviewed and selected from 159 submissions. The proceedings also contain 16 tool papers of the affiliated competition SV-Comp and 1 paper consisting of the competition report. TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, flexibility, and efficiency of tools and algorithms for building computer-controlled systems.

Directory of Open Access Books (DOAB)

Aggregated OD tracks of mobile phone data for the recognition of daily mobility spaces: an application to Lombardia region

Author: F. Manfredini
P. Pucci
P. Tagliolato
Publication venue
Publication date: 01/01/2013
Field of study

Archivio istituzionale della ricerca - Politecnico di Milano

Recognizing complex faces and gaits via novel probabilistic models

Author: Venkatasubramanian Ibrahim Venkat Krishnamurthy
Publication venue: Mathematical and Computer Science
Publication date: 01/10/2010
Field of study

In the field of computer vision, developing automated systems to recognize people under unconstrained scenarios is a partially solved problem. In unconstrained scenarios, a number of common variations and complexities such as occlusion, illumination, and cluttered background impose vast uncertainty on the recognition process. Among the various biometrics that have been emerging recently, this dissertation focuses on two of them, namely face and gait recognition. Firstly, we address the problem of recognizing faces with major occlusions amidst other variations such as pose, scale, expression and illumination using a novel PRObabilistic Component based Interpretation Model (PROCIM) inspired by key psychophysical principles that are closely related to reasoning under uncertainty. The model employs Bayesian Networks to establish, learn, interpret and exploit intrinsic similarity mappings from the face domain. Then, by incorporating efficient inference strategies, robust decisions are made for successfully recognizing faces under uncertainty. PROCIM reports improved recognition rates over recent approaches. Secondly, we address the emerging gait recognition problem and show that PROCIM can be easily adapted to the gait domain as well. We scientifically define and formulate sub-gaits and propose a novel modular training scheme to efficiently learn subtle sub-gait characteristics from the gait domain. Our results show that the proposed model is robust to several uncertainties and yields significant recognition performance. Apart from PROCIM, finally we show how a simple component based gait reasoning can be coherently modeled using the recently prominent Markov Logic Networks (MLNs) by intuitively fusing imaging, logic and graphs. We have discovered that face and gait domains exhibit interesting similarity mappings between object entities and their components.
We have proposed intuitive probabilistic methods to model these mappings to perform recognition under various uncertainty elements. Extensive experimental validation justifies the robustness of the proposed methods over the state-of-the-art techniques.

ROS: The Research Output Service. Heriot-Watt University Edinburgh