BClean: A Bayesian Data Cleaning System
There is a considerable body of work on data cleaning which employs various
principles to rectify erroneous data and transform a dirty dataset into a
cleaner one. One prevalent approach is probabilistic methods, including
Bayesian methods. However, existing probabilistic methods often assume a
simplistic distribution (e.g., a Gaussian distribution), which frequently
underfits in practice, or they require experts to provide a complex prior
distribution (e.g., via a programming language). This requirement is both
labor-intensive and costly, rendering these methods less suitable for
real-world applications. In this paper, we propose BClean, a Bayesian Cleaning
system that features automatic Bayesian network construction and user
interaction. We recast the data cleaning problem as a Bayesian inference that
fully exploits the relationships between attributes in the observed dataset and
any prior information provided by users. To this end, we present an automatic
Bayesian network construction method that extends a structure learning-based
functional dependency discovery method with similarity functions to capture the
relationships between attributes. Furthermore, our system allows users to
modify the generated Bayesian network in order to specify prior information or
correct inaccuracies identified by the automatic generation process. We also
design an effective scoring model (called the compensative scoring model)
necessary for the Bayesian inference. To enhance the efficiency of data
cleaning, we propose several approximation strategies for the Bayesian
inference, including graph partitioning, domain pruning, and pre-detection. By
evaluating on both real-world and synthetic datasets, we demonstrate that
BClean is capable of achieving an F-measure of up to 0.9 in data cleaning,
outperforming existing Bayesian methods by 2% and other data cleaning methods
by 15%. Comment: Our source code is available at https://github.com/yyssl88/BClea
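The idea of treating a repair as Bayesian inference can be illustrated with a minimal Python sketch. It is not BClean's implementation: the toy table, the single zip-to-city dependency, the helper names, and the use of `difflib.SequenceMatcher` as the similarity function are all assumptions made for illustration. Each candidate value is scored by its conditional frequency given the parent attribute, multiplied by its string similarity to the observed value, and the best-scoring candidate replaces the cell.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

# Hypothetical dirty table: the third row contains a typo in "city".
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
]


def conditional_counts(rows, parent, child):
    """Count child values per parent value (an empirical P(child | parent))."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[parent]][row[child]] += 1
    return counts


def repair(row, counts, parent, child):
    """Pick the candidate maximizing P(candidate | parent) * similarity."""
    candidates = counts[row[parent]]
    total = sum(candidates.values())

    def score(value):
        prior = candidates[value] / total
        # Assumed error model: observed strings resemble the true value.
        likelihood = SequenceMatcher(None, row[child], value).ratio()
        return prior * likelihood

    return max(candidates, key=score)


counts = conditional_counts(rows, "zip", "city")
cleaned = [dict(row, city=repair(row, counts, "zip", "city")) for row in rows]
print(cleaned[2])
```

In this toy setting the misspelled cell is outvoted by the more frequent, highly similar value, while cells that are already consistent stay unchanged.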
Numerical methods and accurate computations with structured matrices
This doctoral thesis is a compendium of 11 scientific articles. Its main topic is Numerical Linear Algebra, with emphasis on two classes of structured matrices: totally positive matrices and M-matrices. For some subclasses of these matrices, it is possible to develop algorithms that numerically solve several of the most common problems in linear algebra with high relative accuracy, independently of the condition number of the matrix. The key to achieving accurate computations lies in using a different parametrization that represents the special structure of the matrix and in developing adapted algorithms that work with that parametrization. Nonsingular totally positive matrices admit a unique factorization as a product of nonnegative bidiagonal matrices, called the bidiagonal factorization. If this representation is known with high relative accuracy, it can be used to solve certain systems of equations and to compute the inverse, the eigenvalues, and the singular values with high relative accuracy. Our contribution in this field has been obtaining, with high relative accuracy, the bidiagonal factorization of collocation matrices of generalized Laguerre polynomials, of collocation matrices of Bessel polynomials, of classes of matrices that generalize the Pascal matrix, and of matrices of q-integers. We have also studied the extension of several optimal properties of collocation matrices of normalized B-bases (which, in particular, are totally positive matrices). In particular, we have proved optimality properties of the collocation matrices of the tensor product of normalized B-bases. If the row sums and the off-diagonal entries of a nonsingular diagonally dominant M-matrix are known with high relative accuracy, then its inverse, determinant, and singular values can also be computed with high relative accuracy.
We have sought new methods to achieve accurate computations with new classes of M-matrices or related matrices. We have proposed a parametrization for Nekrasov Z-matrices with positive diagonal entries that can be used to compute their inverse and determinant with high relative accuracy. We have also studied the class called B-matrices, which is closely related to M-matrices. We have obtained a method for computing the determinants of this class with high relative accuracy and another for computing the determinants of B-Nekrasov matrices, also with high relative accuracy. Based on two scaling matrices that we have introduced, we have developed new bounds for the infinity norm of the inverse of a Nekrasov matrix and for the error of the linear complementarity problem when its associated matrix is a Nekrasov matrix. We have also obtained new bounds for the infinity norm of the inverses of Bpi-matrices, a class that extends B-matrices, and have used them to obtain new error bounds for the linear complementarity problem whose associated matrix is a Bpi-matrix. Some classes of matrices have been generalized to the higher-dimensional case in order to develop a theory for tensors that extends the one known for the matrix case. For example, the definition of the class of B-matrices has been extended to the class of B-tensors, giving rise to a simple criterion for identifying a new class of positive definite tensors. We have proposed an extension of the class of Bpi-matrices to Bpi-tensors, thereby defining a new class of positive definite tensors that can be identified by a simple criterion based only on computations involving the entries of the tensor.
Finally, we have characterized the cases in which tridiagonal Toeplitz matrices are P-matrices and have studied when they can be represented in terms of a bidiagonal factorization that serves as a parametrization for achieving computations with high relative accuracy.
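The role of the parametrization by row sums and off-diagonal entries can be seen in a small Python sketch. It is an illustration under stated assumptions, not an algorithm from the thesis: the 2x2 matrix A = [[1 + s1, -1], [-1, 1 + s2]] with tiny positive row sums s1 and s2, and both function names, are hypothetical. Its determinant is (1 + s1)(1 + s2) - 1 = s1 + s2 + s1*s2. Working from the stored entries loses the row sums to rounding and then cancels, whereas working from the parameters uses only sums and products of positive numbers.

```python
def det_from_entries(s1, s2):
    """Determinant computed from the matrix entries (suffers cancellation)."""
    d1 = 1.0 + s1  # the tiny row sum is absorbed by rounding here
    d2 = 1.0 + s2
    return d1 * d2 - 1.0  # subtraction of nearly equal numbers


def det_from_parameters(s1, s2):
    """Same determinant computed from row sums; no subtractions involved."""
    return s1 + s2 + s1 * s2


s1 = s2 = 1e-20
naive = det_from_entries(s1, s2)
accurate = det_from_parameters(s1, s2)
print(naive, accurate)
```

In double precision the entry-based result has no correct digits, while the parameter-based result is accurate to roughly machine precision even though the matrix is extremely ill conditioned.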
Supporting Scientific Analytics under Data Uncertainty and Query Uncertainty
Data management is becoming increasingly important in many applications, in particular in large scientific databases where (1) data can be naturally modeled by continuous random variables, and (2) queries can involve complex predicates and/or be difficult for users to express explicitly. My thesis work aims to provide efficient support for both data uncertainty and query uncertainty.
When data is uncertain, an important class of queries requires query answers to be returned if their existence probabilities pass a threshold. I start with optimizing such threshold query processing for continuous uncertain data in the relational model by (i) expediting selections by reducing the dimensionality of integration and using faster filters, (ii) expediting joins using new indexes on uncertain data, and (iii) optimizing a query plan using a dynamic, per-tuple based approach. Evaluation results using real-world data and benchmark queries show the accuracy and efficiency of my techniques: the dynamic query planning achieves performance gains of over 50% in most cases over a state-of-the-art threshold query optimizer and is very close to optimal planning in all cases.
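The semantics of a threshold selection can be sketched in a few lines of Python. This is only an illustration of the query class, not the optimized processing described in the thesis: it assumes each tuple carries a single attribute modeled as a Gaussian with known mean and standard deviation, and the function names and sample readings are hypothetical. A tuple is returned when the probability that its attribute falls in the query range reaches the threshold.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian random variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def range_probability(mean, std, low, high):
    """Probability that the uncertain attribute lies in [low, high]."""
    return normal_cdf(high, mean, std) - normal_cdf(low, mean, std)


def threshold_select(tuples, low, high, threshold):
    """Return (id, probability) for tuples whose probability >= threshold."""
    result = []
    for tuple_id, mean, std in tuples:
        p = range_probability(mean, std, low, high)
        if p >= threshold:
            result.append((tuple_id, p))
    return result


# Hypothetical uncertain readings: (id, mean, standard deviation).
readings = [("t1", 20.0, 1.0), ("t2", 25.0, 5.0), ("t3", 40.0, 2.0)]
selected = threshold_select(readings, low=18.0, high=22.0, threshold=0.5)
print(selected)
```

With multi-attribute predicates the probability becomes a multidimensional integral, which is where reducing the dimensionality of integration and cheap filters would matter.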
Next, I address uncertain data management in the array model, which has recently gained popularity for scientific data processing due to its performance benefits. I define the formal semantics of array operations on uncertain data involving both value uncertainty within individual tuples and position uncertainty regarding where a tuple should belong in an array given uncertain dimension attributes, and propose a suite of storage and evaluation strategies for array operators, with a focus on a novel scheme that bounds the overhead of querying by strategically placing a few replicas of the tuples with large variances. Evaluation results show that for common workloads, my best-performing techniques outperform baselines by 1 to 2 orders of magnitude while incurring only small storage overhead.
Finally, to bridge the increasing gap between the fast growth of data and the limited human ability to comprehend data, and to help the user retrieve high-value content from data more effectively, I propose to build interactive data exploration as a new database service, using an approach called “explore-by-example”. To build an effective system, my work is grounded in a rigorous SVM-based active learning framework and focuses on the following three problems: (i) accuracy-based and convergence-based stopping criteria, (ii) expediting example acquisition in each iteration, and (iii) expediting the final result retrieval. Evaluation results using real-world data and query patterns show that my system significantly outperforms state-of-the-art systems in accuracy (18x accuracy improvement for 4-dimensional workloads) while achieving desired efficiency for interactive exploration (2 to 5 seconds per iteration).
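The interaction loop behind explore-by-example can be sketched in Python under strong simplifying assumptions. The sketch replaces the SVM-based framework with a one-dimensional threshold model, so it is not the system described in the thesis: the user's interest is assumed to be a region of the form x >= t, and the function name and the labeling callback are hypothetical. In each iteration the system asks the user to label the example it is least certain about, the middle of the still-undecided range, and stops when no undecided examples remain.

```python
def explore_by_example(points, is_relevant):
    """Learn a 1-D region 'x >= t' from user labels.

    Returns the points predicted relevant and the number of labels requested.
    """
    pts = sorted(points)
    lo, hi = 0, len(pts)  # index of the first relevant point lies in [lo, hi]
    labels_requested = 0
    while lo < hi:  # stopping criterion: no undecided examples remain
        mid = (lo + hi) // 2  # most uncertain example under this model
        labels_requested += 1
        if is_relevant(pts[mid]):
            hi = mid
        else:
            lo = mid + 1
    return pts[lo:], labels_requested


# Hypothetical user whose hidden interest is "x >= 637".
data = list(range(1000))
relevant, n_labels = explore_by_example(data, lambda x: x >= 637)
print(len(relevant), n_labels)
```

Choosing the most uncertain example each round is what keeps the number of labels small here; in higher dimensions a classifier such as an SVM would take over the role of the threshold.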
Deep learning for internet of underwater things and ocean data analytics
The Internet of Underwater Things (IoUT) is an emerging technological ecosystem developed for connecting objects in maritime and underwater environments. IoUT technologies are empowered by an extreme number of deployed sensors and actuators. In this thesis, multiple IoUT sensory data are augmented with machine intelligence for forecasting purposes
PERICLES Deliverable 4.3: Content Semantics and Use Context Analysis Techniques
The current deliverable summarises the work conducted within task T4.3 of WP4, focusing on the extraction and the subsequent analysis of semantic information from digital content, which is imperative for its preservability. More specifically, the deliverable defines content semantic information from a visual and textual perspective, explains how this information can be exploited in long-term digital preservation and proposes novel approaches for extracting this information in a scalable manner. Additionally, the deliverable discusses novel techniques for retrieving and analysing the context of use of digital objects. Although this topic has not been extensively studied in the existing literature, we believe use context is vital in augmenting the semantic information and maintaining the usability and preservability of the digital objects, as well as their ability to be accurately interpreted as initially intended. PERICLE
Tools and Algorithms for the Construction and Analysis of Systems
This open access book constitutes the proceedings of the 28th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2022, which was held during April 2-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 46 full papers and 4 short papers presented in this volume were carefully reviewed and selected from 159 submissions. The proceedings also contain 16 tool papers of the affiliated competition SV-Comp and 1 paper consisting of the competition report. TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, flexibility, and efficiency of tools and algorithms for building computer-controlled systems
Recognizing complex faces and gaits via novel probabilistic models
In the field of computer vision, developing automated systems to recognize people
under unconstrained scenarios is a partially solved problem. In unconstrained
scenarios, a number of common variations and complexities such as occlusion,
illumination, and cluttered background impose vast uncertainty on the recognition
process. Among the various biometrics that have emerged recently, this
dissertation focuses on two of them, namely face and gait recognition.
Firstly, we address the problem of recognizing faces with major occlusions amidst
other variations such as pose, scale, expression and illumination using a novel
PRObabilistic Component based Interpretation Model (PROCIM) inspired by key
psychophysical principles that are closely related to reasoning under uncertainty.
The model employs Bayesian Networks to establish, learn, interpret and
exploit intrinsic similarity mappings from the face domain. Then, by incorporating
efficient inference strategies, robust decisions are made for successfully recognizing
faces under uncertainty. PROCIM reports improved recognition rates over recent
approaches.
Secondly, we address the emerging gait recognition problem and show that
PROCIM can be easily adapted to the gait domain as well. We scientifically
define and formulate sub-gaits and propose a novel modular training scheme to
efficiently learn subtle sub-gait characteristics from the gait domain. Our results
show that the proposed model is robust to several uncertainties and yields
significant recognition performance. Apart from PROCIM, we finally show how
simple component based gait reasoning can be coherently modeled using the
recently prominent Markov Logic Networks (MLNs) by intuitively fusing imaging,
logic and graphs.
We have discovered that the face and gait domains exhibit interesting similarity
mappings between object entities and their components. We have proposed intuitive
probabilistic methods to model these mappings to perform recognition under
various uncertainty elements. Extensive experimental validations justify the
robustness of the proposed methods over the state-of-the-art techniques.