547 research outputs found
Consistent Query Answering without Repairs in Tables with Nulls and Functional Dependencies
In this paper, we study consistent query answering in tables with nulls and
functional dependencies. Given such a table T, we consider the set Tuples of
all tuples that can be built up from constants appearing in T, and we use
set-theoretic semantics for tuples and functional dependencies to characterize the
tuples of Tuples in two orthogonal ways: first as true or false tuples, and
then as consistent or inconsistent tuples. Queries are issued against T and
evaluated in Tuples. In this setting, we consider a query Q: select X from T
where Condition over T and define its consistent answer to be the set of tuples
x in Tuples such that: x is a true and consistent tuple with schema X and there
exists a true super-tuple t of x in Tuples satisfying the condition. We show
that, depending on the status that the super-tuple t has in Tuples, there are
different types of consistent answers to Q. The main contributions of the paper
are: (a) a novel approach to consistent query answering not using table
repairs; (b) polynomial algorithms for computing the sets of true-false tuples
and the sets of consistent-inconsistent tuples of Tuples; (c) algorithms,
polynomial in the size of T, for computing different types of consistent answers
for both conjunctive and disjunctive queries; and (d) a detailed discussion of
the differences between our approach and the approaches using table repairs.
Comment: 42 pages
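The consistent/inconsistent split can be illustrated with a minimal sketch (ignoring nulls for brevity, and simplifying the paper's set-theoretic construction over Tuples): a tuple is flagged inconsistent when it participates in a violation of a functional dependency.

```python
from collections import defaultdict

def split_by_fd(tuples, lhs, rhs):
    """Partition tuples into consistent/inconsistent w.r.t. the FD lhs -> rhs.

    A tuple is inconsistent when another tuple agrees with it on the lhs
    attributes but differs on the rhs attribute. Illustrative simplification
    of the paper's semantics.
    """
    groups = defaultdict(set)
    for t in tuples:
        groups[tuple(t[a] for a in lhs)].add(t[rhs])
    consistent, inconsistent = [], []
    for t in tuples:
        if len(groups[tuple(t[a] for a in lhs)]) == 1:
            consistent.append(t)
        else:
            inconsistent.append(t)
    return consistent, inconsistent

# Hypothetical relation Emp(name, dept, mgr) with FD dept -> mgr
T = [
    {"name": "ann", "dept": "cs", "mgr": "joe"},
    {"name": "bob", "dept": "cs", "mgr": "sue"},   # conflicts with ann's row
    {"name": "eve", "dept": "math", "mgr": "pat"},
]
ok, bad = split_by_fd(T, ["dept"], "mgr")
# eve's row is consistent; the two cs rows jointly violate dept -> mgr
```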
Assessing compounding risks across multiple systems and sectors: a socio-environmental systems risk-triage approach
Physical and societal risks across the natural, managed, and built environments are becoming increasingly complex, multi-faceted, and compounding. Such risks stem from socio-economic and environmental stresses that co-evolve and force tipping points and instabilities. Robust decision-making necessitates extensive analyses and model assessments for insights toward solutions. However, these exercises consume considerable computational and investigative resources, so in practical terms they cannot be performed extensively, only selectively in terms of priority and scale. Therefore, an efficient analysis platform is needed through which the variety of multi-system/sector observational and simulated data can be readily incorporated, combined, diagnosed, and visualized, and through which "hotspots" of salient compounding threats can be identified. In view of this, we have constructed a "triage-based" visualization and data-sharing platform, the System for the Triage of Risks from Environmental and Socio-Economic Stressors (STRESS), that brings together data across socio-environmental systems, economics, demographics, health, biodiversity, and infrastructure. Through the STRESS website, users can display risk indices that result from weighted combinations of risk metrics they select. Currently, these risk metrics cover land, water, and energy systems and biodiversity, as well as demographics, environmental equity, and transportation networks. We highlight the utility of the STRESS platform through several demonstrative analyses over the United States, from the national to the county level. STRESS is an open-science tool available to the community at large. We will continue to develop it with an open, accessible, and interactive approach, involving academics, researchers, industry, and the general public.
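The weighted-combination idea behind the risk indices can be sketched as follows; the metric names, min-max normalization, and weights are illustrative assumptions, not the platform's actual formulas.

```python
def risk_index(metrics, weights):
    """Combine per-region risk metrics into a single index.

    metrics: {metric: {region: raw_value}}; each metric is min-max
    normalized across regions before the weighted sum. Illustrative only.
    """
    regions = next(iter(metrics.values()))
    norm = {}
    for m, values in metrics.items():
        lo, hi = min(values.values()), max(values.values())
        span = (hi - lo) or 1.0  # avoid division by zero for flat metrics
        norm[m] = {r: (v - lo) / span for r, v in values.items()}
    return {r: sum(weights[m] * norm[m][r] for m in metrics) for r in regions}

# Two hypothetical metrics over two counties, weighted 70/30
index = risk_index(
    {"water_stress": {"A": 0.0, "B": 10.0},
     "energy_demand": {"A": 4.0, "B": 2.0}},
    {"water_stress": 0.7, "energy_demand": 0.3},
)
```

A user changing the weights on the website corresponds to re-running the weighted sum over the already-normalized metrics.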
Toward relevant answers to queries on incomplete databases
Incomplete and uncertain information is ubiquitous in database management applications. However, the techniques specifically developed to handle incomplete data are
not sufficient. Even the evaluation of SQL queries on databases containing NULL
values remains a challenge after 40 years. There is no consensus on what an answer
to a query on an incomplete database should be, and the existing notions often have
limited applicability.
One of the most prevalent techniques in the literature is based on finding answers
that are certainly true, independently of how missing values are interpreted. However,
this notion has yielded several conflicting formal definitions for certain answers. Based
on the fact that incomplete data can be enriched by some additional knowledge, we
designed a notion able to unify and explain the different definitions for certain answers.
Moreover, the knowledge-preserving certain answers notion is able to provide the first
well-founded definition of certain answers for the relational bag data model and value-inventing queries, addressing some key limitations of previous approaches. However,
it doesn't provide any guarantee about the relevance of the answers it captures.
To understand what would be relevant answers to queries on incomplete databases,
we designed and conducted a survey on the everyday usage of NULL values among
database users. One of the findings from this socio-technical study is that even when
users agree on the possible interpretation of NULL values, they may not agree on
what a satisfactory query answer is. Therefore, to be relevant, query evaluation on
incomplete databases must account for users' tasks and preferences.
We model users' preferences and tasks with the notion of regret. The regret function
captures the task-dependent loss a user incurs when considering one database as
ground truth instead of another. Thanks to this notion, we designed the first framework
able to provide a score accounting for the risk associated with query answers. It allows
us to define the risk-minimizing answers to queries on incomplete databases. We
show that for some regret functions, regret-minimizing answers coincide with certain
answers. Moreover, as the notion is more flexible, it can capture more nuanced answers
and more interpretations of incompleteness.
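The regret-based selection described above can be sketched as follows, with possible completions of the database standing in as weighted "worlds"; the input shapes are hypothetical, chosen only to make the idea concrete.

```python
def risk(candidate, worlds, regret):
    """Expected task-dependent loss of returning `candidate` as the answer,
    averaged over weighted possible worlds (weight, true_answer)."""
    return sum(w * regret(candidate, truth) for w, truth in worlds)

def risk_minimizing_answer(candidates, worlds, regret):
    """Answer with the lowest expected regret."""
    return min(candidates, key=lambda c: risk(c, worlds, regret))

# Two interpretations of a NULL yield two possible true answer sets.
worlds = [(0.6, frozenset({"ann"})), (0.4, frozenset({"ann", "bob"}))]
candidates = [frozenset(), frozenset({"ann"}), frozenset({"ann", "bob"})]
symm_diff = lambda c, t: len(c ^ t)  # loss = number of wrong or missing tuples
best = risk_minimizing_answer(candidates, worlds, symm_diff)
# here `best` is {"ann"}, which happens to coincide with the certain answer
```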
A different approach to improve the relevancy of an answer is to explain its provenance.
We propose to partition the incompleteness into sources and measure their respective contribution to the risk of an answer. As a first milestone, we study several models
to predict the evolution of the risk when we clean a source of incompleteness. We
implemented the framework, and it exhibits promising results on relational databases
and queries with aggregate and grouping operations. Indeed, the model allows us
to infer the risk reduction obtained by cleaning an attribute. Finally, by considering a
game theoretical approach, the model can provide an explanation for answers based
on the contribution of each attribute to the risk.
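The game-theoretic attribution mentioned last can be sketched with an exact Shapley-value computation over incompleteness sources; the additive risk function below is a made-up stand-in for the framework's risk model.

```python
from itertools import permutations
from math import factorial

def shapley_contributions(sources, residual_risk):
    """Average marginal risk reduction of cleaning each source, taken over
    all cleaning orders (exact Shapley value; exponential in #sources,
    fine for a handful of attributes)."""
    contrib = {s: 0.0 for s in sources}
    for order in permutations(sources):
        cleaned = frozenset()
        for s in order:
            before = residual_risk(cleaned)
            cleaned = cleaned | {s}
            contrib[s] += before - residual_risk(cleaned)
    n_orders = factorial(len(sources))
    return {s: c / n_orders for s, c in contrib.items()}

# Hypothetical additive risk: cleaning "age" removes 0.7, "city" removes 0.3.
risk_fn = lambda cleaned: 1.0 - 0.7 * ("age" in cleaned) - 0.3 * ("city" in cleaned)
phi = shapley_contributions(["age", "city"], risk_fn)
```

For an additive risk function the Shapley values equal the individual contributions; non-additive risk models are where the averaging over orders actually matters.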
Modern data analytics in the cloud era
Cloud computing has been the groundbreaking technology of the last decade. The ease of use of the managed environment, in combination with a nearly infinite amount of resources and a pay-per-use price model, enables fast and cost-efficient project realization for a broad range of users. Cloud computing also changes the way software is designed, deployed and used. This thesis focuses on database systems deployed in the cloud environment.
We identify three major interaction points of the database engine with the environment that show changed requirements compared to traditional on-premise data warehouse solutions. First, software is deployed on elastic resources. Consequently, systems should support elasticity in order to match workload requirements and be cost-effective. We present an elastic scaling mechanism for distributed database engines, combined with a partition manager that provides load balancing while minimizing partition reassignments in the case of elastic scaling. Furthermore, we introduce a buffer pre-heating strategy that mitigates the cold start after scaling and yields an immediate performance benefit from the added resources. Second, cloud-based systems are accessible and available from nearly everywhere. Consequently, data is frequently ingested from numerous endpoints, which differs from the bulk loads or ETL pipelines of a traditional data warehouse solution. Many users do not define database constraints in order to avoid transaction aborts due to conflicts or to speed up data ingestion. To mitigate this issue we introduce the concept of PatchIndexes, which allow the definition of approximate constraints. PatchIndexes maintain exceptions to constraints, make them usable in query optimization and execution, and offer efficient update support. The concept can be applied to arbitrary constraints, and we provide examples of approximate uniqueness and approximate sorting constraints. Moreover, we show how PatchIndexes can be exploited to define advanced constraints such as an approximate multi-key partitioning, which offers robust query performance over workloads with different partition key requirements. Third, data-centric workloads have changed over the last decade. Besides traditional SQL workloads for business intelligence, data science workloads are of significant importance nowadays.
In these cases the database system might act only as a data provider, while the computational effort takes place in data science or machine learning (ML) environments. As this workflow has several drawbacks, we pursue the goal of pushing advanced analytics towards the database engine and introduce the Grizzly framework as a DataFrame-to-SQL transpiler. Based on this, we identify user-defined functions (UDFs) and ML inference as important tasks that would benefit from a deeper engine integration, and we investigate approaches to push these operations towards the database engine.
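The PatchIndex idea (a constraint that holds except for an explicitly maintained exception set) can be sketched for approximate uniqueness as follows; this is a toy illustration of the concept, not the engine-integrated data structure from the thesis.

```python
class ApproxUniqueIndex:
    """Tracks row ids that violate uniqueness of `col`; all remaining rows
    can be treated as satisfying the constraint, e.g. by the optimizer.
    Toy sketch of the PatchIndex idea."""

    def __init__(self, rows, col):
        self.col = col
        self.seen = {}          # value -> first row id carrying it
        self.exceptions = set() # "patches": rows violating uniqueness
        for rid, row in enumerate(rows):
            self.insert(rid, row)

    def insert(self, rid, row):
        v = row[self.col]
        if v in self.seen:
            self.exceptions.add(rid)  # record the violation instead of aborting
        else:
            self.seen[v] = rid

rows = [{"id": 1}, {"id": 2}, {"id": 1}]  # one duplicate key, no abort
idx = ApproxUniqueIndex(rows, "id")
```

Note how ingestion never fails: the duplicate is recorded as an exception rather than rejected, which matches the motivation of avoiding transaction aborts during loading.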
Tree-Based Approaches for Predicting Financial Performance
The lending industry has commonly relied on assessing borrowers' repayment performance to make lending decisions, in order to safeguard its assets and maintain its profitability. With the rise of Artificial Intelligence, lenders have resorted to Machine Learning (ML) algorithms to solve this problem.
In this study, the novelty introduced is applying ML tree-based methods to a large dataset and accurately predicting financial repayment performance without using any repayment history, which was utilized in all of the literature reviewed. Instead, the attributes used were only the demographics and psychographics of applicants. The study's proprietary US-based dataset comprises an anonymous population whose owner does not wish to be disclosed; it contains the information of about half a million beneficiaries with a very balanced bimodal binary target distribution.
An Area Under the Receiver Operating Characteristic curve (ROC-AUC) of 85% was achieved on a binary classification target using the CatBoost API. The study also experimented with a given tri-class target. Furthermore, this research used ML to gain insight into which attributes contribute the most to the repayment prediction. The study also tested whether similar results can be achieved with fewer attributes, for the sake of practicality of application by the data owner. The best model was applied to one of the biggest publicly available financial datasets for verification. The original research on said dataset reported an accuracy score of 82%; this study achieved 79% using 5-fold Cross-Validation (CV). This result was achieved with tree-based models with a complexity of O(log n), compared to O(2^n) in the original research, which is a significant efficiency enhancement.
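The headline ROC-AUC metric has a simple rank-based reading: the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal, library-free computation (not the study's CatBoost pipeline):

```python
def roc_auc(labels, scores):
    """ROC-AUC via the Mann-Whitney U statistic; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])  # 3 of 4 pairs ranked correctly
```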
Incident validation and remote monitoring tools for a bioleaching plant
An overview of the development process of a web application designed to facilitate remote monitoring and control of a bioleaching chemical plant is presented in this master's thesis. Copper is recovered from electrical devices by exposing them to oxidizing bacteria in this plant. The application was designed to meet the specific needs of plant technicians and managers, who wanted remote access to sensor data, control of actuators, and alerts for any process anomalies.
This system consists of a web application that can be accessed from any web browser, ensuring accessibility and flexibility across multiple platforms. An application programming interface (API) powered by Spring Boot (Java) serves as the core of the system, providing access to this valuable information from multiple user interfaces and for future applications. Communication is maintained between the server and the chemical plant controller, which orchestrates chemical processes, activates pumps and solenoid valves, transmits sensor data, and accepts remote commands. Furthermore, an intuitive user interface is provided by a React (JavaScript) front-end that enhances the system's accessibility.
An email notification system was implemented to meet the alerting requirements, allowing users to enable notifications according to their preferences. A system notification function was also added.
This system is a significant step toward improving the efficiency and reliability of bioleaching chemical plant operations. In addition to addressing the immediate needs of plant personnel, this work lays the groundwork for future advancements in remote monitoring and control.
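The alerting behaviour can be sketched as a threshold check combined with per-user subscription preferences; the sensor names, threshold ranges, and addresses below are hypothetical, and actual delivery in the thesis goes through the Spring Boot email service.

```python
def pending_alerts(readings, thresholds, prefs):
    """Return (user, sensor, value) triples for out-of-range readings,
    honouring each user's subscription preferences."""
    alerts = []
    for sensor, value in readings.items():
        lo, hi = thresholds[sensor]
        if not (lo <= value <= hi):
            for user, subscribed in prefs.items():
                if sensor in subscribed:
                    alerts.append((user, sensor, value))
    return alerts

alerts = pending_alerts(
    readings={"ph": 1.2, "temp_c": 30.0},
    thresholds={"ph": (1.5, 2.5), "temp_c": (20.0, 40.0)},
    prefs={"ops@plant.example": {"ph", "temp_c"},
           "mgr@plant.example": {"temp_c"}},
)
# only the pH reading is out of range, and only ops subscribed to it
```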
Querying Incomplete Data: Complexity and Tractability via Datalog and First-Order Rewritings
To answer database queries over incomplete data the gold standard is finding
certain answers: those that are true regardless of how incomplete data is
interpreted. Such answers can be found efficiently for conjunctive queries and
their unions, even in the presence of constraints. With negation added, however,
the problem becomes intractable. We concentrate on the complexity of
certain answers under constraints, and on efficiently answering queries
outside the usual classes of (unions) of conjunctive queries by means of
rewriting as Datalog and first-order queries. We first notice that there are
three different ways in which query answering can be cast as a decision
problem. We complete the existing picture and provide precise complexity bounds
on all versions of the decision problem, for certain and best answers. We then
study a well-behaved class of queries that extends unions of conjunctive
queries with a mild form of negation. We show that for them, certain answers
can be expressed in Datalog with negation, even in the presence of functional
dependencies, thus making them tractable in data complexity. We show that in
general Datalog cannot be replaced by first-order logic, but without
constraints such a rewriting can be done in first-order logic. The paper is under
consideration in Theory and Practice of Logic Programming (TPLP).
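For unions of conjunctive queries, certain answers can be computed by naive evaluation: run the query treating each null as just another value, then discard answers that contain nulls. A single-table select-project sketch (the relational encoding is an assumption for illustration; distinct marked nulls are collapsed to a single None marker for brevity):

```python
def certain_answers(table, cond, proj):
    """Naive evaluation: evaluate the query normally, then keep only
    null-free answers. This computes certain answers for (unions of)
    conjunctive queries; nulls are represented by None."""
    out = set()
    for row in table:
        if cond(row):
            ans = tuple(row[a] for a in proj)
            if all(v is not None for v in ans):
                out.add(ans)
    return out

emp = [{"name": "ann", "dept": "cs"},
       {"name": "bob", "dept": None}]      # bob's department is unknown
ans = certain_answers(emp, lambda r: r["dept"] == "cs", ["name"])
# bob might or might not be in cs, so only ann is a certain answer
```

With negation in the query this simple strategy is no longer sound, which is exactly where the Datalog and first-order rewritings studied in the paper come in.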
- …