64 research outputs found
Models and algorithms for promoting diverse and fair query results
Fairness and diversity of search results are two key concerns in search and recommendation applications. This work explicitly studies these two aspects given multiple users' preferences as inputs, in an effort to create a single ranking or top-k result set that satisfies different fairness and diversity criteria. From a group fairness standpoint, it adapts demographic-parity-like group fairness criteria and proposes new models that are suitable for ranking or for producing a top-k set of results. The dissertation also studies equitable exposure of individual search results in long-tail data, a concept related to individual fairness. First, the dissertation focuses on aggregating ranks while achieving proportionate fairness (proportionate representation of every group) for multiple protected groups. Then it explores how to minimally modify the original users' preferences under plurality voting, aiming to produce a top-k result set that satisfies complex fairness constraints. A concept referred to as manipulation by modifications is introduced, which involves making minimal changes to the original user preferences to ensure query satisfaction; this problem is formalized as the margin finding problem. A follow-up work studies this problem with a popular ranked-choice voting mechanism, Instant Run-off Voting (IRV), as the preference aggregation method. From the standpoint of individual fairness, the dissertation studies an exposure concern that top-k set-based algorithms exhibit when the underlying data has long-tail properties, and designs techniques to make those results equitable. For result diversification, the work studies efficiency opportunities in existing diversification algorithms and designs a generic access primitive, DivGetBatch(), to enable them. The contributions of this dissertation lie in (a) formalizing the principal problems and studying them analytically, (b) designing scalable algorithms with theoretical guarantees, and (c) an extensive experimental study that evaluates the efficacy and scalability of the designed solutions against state-of-the-art solutions on large-scale datasets.
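As intuition for the group fairness constraints discussed above, here is a minimal sketch, under stated assumptions, of selecting a top-k set by score while guaranteeing every group a proportionate number of slots. It is not the dissertation's algorithm: the greedy scheme, the quota rule, and all names (fair_top_k, the toy candidate pool) are illustrative only.

```python
# Sketch only: greedy top-k selection under a proportionate-fairness quota.
# Each group is guaranteed at least floor(k * its share of the pool) slots.
from collections import Counter
import math

def fair_top_k(candidates, k):
    """candidates: list of (item_id, score, group). Returns a k-sized result set."""
    total = len(candidates)
    share = Counter(g for _, _, g in candidates)
    quota = {g: math.floor(k * n / total) for g, n in share.items()}

    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    result = []

    # First pass: satisfy each group's quota with its best-scoring items.
    for g, q in quota.items():
        result.extend([c for c in ranked if c[2] == g][:q])

    # Second pass: fill the remaining slots purely by score.
    chosen = {c[0] for c in result}
    for c in ranked:
        if len(result) >= k:
            break
        if c[0] not in chosen:
            result.append(c)
            chosen.add(c[0])

    return sorted(result, key=lambda c: c[1], reverse=True)[:k]

pool = [("a", 0.9, "G1"), ("b", 0.8, "G1"), ("c", 0.7, "G2"),
        ("d", 0.6, "G2"), ("e", 0.5, "G2"), ("f", 0.4, "G1")]
print(fair_top_k(pool, k=4))   # each group keeps at least 2 of the 4 slots
```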
AIUCD 2022 - Proceedings
The eleventh edition of the National Conference of the AIUCD (Associazione di Informatica Umanistica) bears the title Culture digitali. Intersezioni: filosofia, arti, media (Digital Cultures. Intersections: Philosophy, Arts, Media). The title explicitly calls for a methodological and theoretical reflection on the interrelation between digital technologies, information sciences, philosophical disciplines, the world of the arts, and cultural studies.
Assessing the risk profile of a retail investor with a focus on supplementary retirement investments: a fuzzy logic approach
In order to provide the best services to their customers for undertaking investments for retirement as part of the third pillar of the World Bank's retirement reform, financial institutions assess private (retail) investors' risk profiles. There are many approaches to assessing a risk profile. Since risk profile definitions are vague and the assessment deals with much imprecision, uncertainty, and missing information, we proposed an approach based on fuzzy logic, which proved capable of dealing with those phenomena. We described a framework in which risk profile components are translated into fuzzy sets and risk profiles are assessed using fuzzy inference. We argued for its flexibility, understandability (to a non-technical audience), and transparency. We proposed that future work focus on individual parts of the framework, as we have specified them only broadly here as a proof of concept.
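As a rough illustration of the kind of machinery the abstract refers to, and not the paper's actual framework, the sketch below fuzzifies two hypothetical inputs (investment horizon and loss tolerance) with triangular membership functions and combines them through two min/max rules. Every breakpoint, rule, and name is an assumption made for the example.

```python
# Minimal fuzzy-inference sketch (illustrative assumptions throughout).

def tri(x, a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def risk_profile(horizon_years, loss_tolerance_pct):
    # Fuzzify the crisp inputs (breakpoints are made up).
    short = tri(horizon_years, -1, 0, 10)
    long_ = tri(horizon_years, 5, 20, 41)
    low   = tri(loss_tolerance_pct, -1, 0, 20)
    high  = tri(loss_tolerance_pct, 10, 40, 101)

    # Rules: conservative if short horizon OR low tolerance;
    #        aggressive if long horizon AND high tolerance.
    conservative = max(short, low)
    aggressive   = min(long_, high)

    # Defuzzify to a single score in [0, 1] (0 = conservative, 1 = aggressive).
    total = conservative + aggressive
    return 0.5 if total == 0 else aggressive / total

print(risk_profile(horizon_years=25, loss_tolerance_pct=35))  # -> leans aggressive
```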
The RDBMS-only architecture: a database-centric architecture for Web applications
Multi-tier architectures have been the de facto standard for web applications, leaving little room for alternative architectures. In industry there is a product for developing and running web applications that follows a different architecture, centered on the RDBMS to the extreme of not needing any other component to function. There are not many papers in academia that address RDBMS-centric architectures in general, and this extreme architecture in particular has not been considered. In this work, the state of the art of RDBMS-centric architectures is surveyed, and an example of an extreme database-centric architecture is analyzed. The general case of the architecture, which I have called RDBMS-only, is described, and the guidelines of this architecture are followed in the development of a functional prototype. Based on the implementation of that prototype, the feasibility of the architecture for a class of applications is shown and a critical analysis of the architecture is carried out.
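To make the idea concrete, here is a deliberately small sketch, under stated assumptions, of what pushing all application logic into the RDBMS can look like: an HTTP handler that does nothing but call a stored procedure returning the finished page. The procedure name render_page and the connection string are placeholders, and in the extreme architecture analyzed in this work even this thin shim would live inside the database; the sketch illustrates the division of responsibility, not the actual product.

```python
# Sketch: a pass-through web tier; routing, business logic and templating
# are assumed to live in database code (e.g. a stored procedure render_page).
from http.server import BaseHTTPRequestHandler, HTTPServer
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # placeholder connection details

class PassThroughHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        with conn.cursor() as cur:
            cur.execute("SELECT render_page(%s)", (self.path,))  # hypothetical procedure
            html = cur.fetchone()[0]
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(html.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("", 8080), PassThroughHandler).serve_forever()
```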
Accelerating Event Stream Processing in On- and Offline Systems
Due to a growing number of data producers and their ever-increasing data volume, the ability to ingest, analyze, and store potentially never-ending streams of data is a mission-critical task in today's data processing landscape.
A widespread form of data streams are event streams, which consist of continuously arriving notifications about some real-world phenomena. For example, a temperature sensor naturally generates an event stream by periodically measuring the temperature and reporting it, together with the measurement time, whenever it changes substantially from the previously reported value.
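A toy version of that sensor, with a made-up change threshold and polling interval, could look like this:

```python
# Illustrative event source: emit (timestamp, temperature) only on substantial change.
import random
import time

def temperature_events(threshold=0.5, interval_s=1.0):
    last_reported = None
    while True:
        reading = 20.0 + random.gauss(0, 1)   # stand-in for a real sensor read
        if last_reported is None or abs(reading - last_reported) >= threshold:
            last_reported = reading
            yield (time.time(), reading)      # one event in the stream
        time.sleep(interval_s)
```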
In this thesis, we consider two kinds of event stream processing: online and offline. Online refers to processing events solely in main memory as soon as they arrive, while offline means processing event data previously persisted to non-volatile storage. Both modes are supported by widely used scale-out general-purpose stream processing engines (SPEs) like Apache Flink or Spark Streaming. However, such engines suffer from two significant deficiencies that severely limit their processing performance. First, for offline processing, they load the entire stream from non-volatile secondary storage and replay all data items into the associated online engine in order of their original arrival. While this naturally ensures unified query semantics for on- and offline processing, the costs for reading the entire stream from non-volatile storage quickly dominate the overall processing costs.
Second, modern SPEs focus on scaling out computations across the nodes of a cluster, but use only a fraction of the available resources of individual nodes. This thesis tackles those problems with three different approaches.
First, we present novel techniques for the offline processing of two important query types (windowed aggregation and sequential pattern matching). Our methods utilize well-understood indexing techniques to reduce the total amount of data to read from non-volatile storage. We show that this improves the overall query runtime significantly. In particular, this thesis develops the first index-based algorithms for pattern queries expressed with the Match_Recognize clause, a new and powerful language feature of SQL that has received little attention so far.
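One generic way to achieve that effect, shown here as a sketch under assumptions rather than as the thesis' actual index structures, is to keep small per-block summaries next to the persisted events, so that an offline windowed aggregation reads raw events only for blocks that straddle a window boundary:

```python
# Sketch: per-block summaries let a windowed SUM skip or short-cut most blocks.

class BlockIndex:
    def __init__(self, events, block_size=1024):
        # events: list of (timestamp, value) pairs, sorted by timestamp
        self.blocks = [events[i:i + block_size] for i in range(0, len(events), block_size)]
        self.summaries = [(b[0][0], b[-1][0], sum(v for _, v in b)) for b in self.blocks]

    def window_sum(self, start, end):
        total = 0.0
        for block, (t_min, t_max, block_sum) in zip(self.blocks, self.summaries):
            if t_max < start or t_min >= end:
                continue                       # pruned: block never read
            if start <= t_min and t_max < end:
                total += block_sum             # fully covered: answered from the summary
            else:
                total += sum(v for t, v in block if start <= t < end)  # boundary block: scan
        return total
```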
Second, we show how to maximize the resource utilization of single nodes by exploiting the capabilities of modern hardware. To this end, we develop a prototypical shared-memory CPU-GPU-enabled event processing system. The system provides implementations of all major event processing operators (filtering, windowed aggregation, windowed join, and sequential pattern matching). Our experiments reveal that, in terms of resource utilization and processing throughput, such a hardware-enabled system is superior to hardware-agnostic general-purpose engines.
Finally, we present TPStream, a new operator for pattern matching over temporal intervals. TPStream achieves low processing latency and, in contrast to sequential pattern matching, is easily parallelizable even for unpartitioned input streams. This results in maximized resource utilization, especially on modern multi-core CPUs.
A Domain Specific Language for Digital Forensics and Incident Response Analysis
One of the longstanding conceptual problems in digital forensics is the dichotomy between the need for verifiable and reproducible forensic investigations and the lack of practical mechanisms to accomplish them. After nearly four decades of professional digital forensic practice, investigator notes are still the primary source of reproducibility information, and much of that information is tied to the functions of specific, often proprietary, tools.
The lack of a formal means of specification for digital forensic operations results in three major problems. Specifically, there is a critical lack of:
a) a standardized and automated means to scientifically verify the accuracy of digital forensic tools;
b) methods to reliably reproduce forensic computations (and their results); and
c) a framework for interoperability among forensic tools.
Additionally, there is no standardized means for communicating software requirements between users, researchers and developers, resulting in a mismatch in expectations. Combined with the exponential growth in data volume and complexity of applications and systems to be investigated, all of these concerns result in major case backlogs and inherently reduce the reliability of the digital forensic analyses.
This work proposes a new approach to the specification of forensic computations, such that the above concerns can be addressed on a scientific basis with a new domain specific language (DSL) called nugget. DSLs are specialized languages that aim to address the concerns of particular domains by providing practical abstractions. Successful DSLs, such as SQL, can transform an application domain by providing a standardized way for users to communicate what they need without specifying how the computation should be performed.
This is the first effort to build a DSL for (digital) forensic computations with the following research goals:
1) provide an intuitive formal specification language that covers core types of forensic computations and common data types;
2) provide a mechanism to extend the language that can incorporate arbitrary computations;
3) provide a prototype execution environment that allows the fully automatic execution of the computation;
4) provide a complete, formal, and auditable log of computations that can be used to reproduce an investigation;
5) demonstrate cloud-ready processing that can match the growth in data volumes and complexity.
Functional inferences over heterogeneous data
Inference enables an agent to create new knowledge from old or discover implicit relationships between concepts in a knowledge base (KB), provided that appropriate techniques are employed to deal with ambiguous, incomplete and sometimes erroneous data.
The ever-increasing volumes of KBs on the web, available for use by automated systems, present an opportunity to leverage the available knowledge in order to improve the inference process in automated query answering systems. This thesis focuses on the FRANK (Functional Reasoning for Acquiring Novel Knowledge) framework, which responds to queries where no suitable answer is readily contained in any available data source, using a variety of inference operations.
Most question answering and information retrieval systems assume that answers to queries are stored in some form in the KB, thereby limiting the range of answers they can find. We take an approach motivated by rich forms of inference, using techniques such as regression for prediction. For instance, FRANK can answer “what country in Europe will have the largest population in 2021?” by decomposing Europe geo-spatially, using regression on country populations for past years, and selecting the country with the largest predicted value. Our technique, which we refer to as Rich Inference, combines heuristics, logic and statistical methods to infer novel answers to queries. It also determines what facts are needed for inference, searches for them, and then integrates the diverse facts and their formalisms into a local query-specific inference tree.
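A self-contained sketch of that example, with invented numbers standing in for facts retrieved from KBs and a plain least-squares trend standing in for FRANK's regression machinery, looks roughly like this:

```python
# Illustrative only: decompose Europe into countries, fit a per-country trend,
# and select the country with the largest predicted 2021 population.

def predict(years, values, target_year):
    """Least-squares linear trend evaluated at target_year."""
    n = len(years)
    mean_x, mean_y = sum(years) / n, sum(values) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
             / sum((x - mean_x) ** 2 for x in years))
    return mean_y + slope * (target_year - mean_x)

kb_facts = {  # populations in millions; made-up stand-ins for retrieved facts
    "Germany": {2017: 82.7, 2018: 82.9, 2019: 83.1, 2020: 83.2},
    "France":  {2017: 66.9, 2018: 67.2, 2019: 67.4, 2020: 67.6},
}

predictions = {
    country: predict(list(series), list(series.values()), 2021)
    for country, series in kb_facts.items()
}
print(max(predictions, key=predictions.get))  # country with the largest prediction
```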
Our primary contribution in this thesis is the inference algorithm on which FRANK works. This includes (1) the process of recursively decomposing queries in a way that allows variables in the query to be instantiated by facts in KBs; (2) the use of aggregate functions to perform arithmetic and statistical operations (e.g. prediction) to infer new values from child nodes; and (3) the estimation and propagation of uncertainty values into the returned answer, based on errors introduced by noise in the KBs or by the aggregate functions.
We also discuss many of the core concepts and modules that constitute FRANK. We explain the internal “alist” representation of FRANK that gives it the required flexibility to tackle different kinds of problems with minimal changes to its internal representation. We discuss the grammar for a simple query language that allows users to express queries in a formal way, such that we avoid the complexities of natural language queries, a problem that falls outside the scope of this thesis. We evaluate the framework with datasets from open sources.
Software for the collaborative editing of the Greek New Testament
This project was responsible for developing the Virtual Manuscript Room Collaborative Research Environment (VMR CRE), which offers a facility for the critical editing workflow from raw data collection, through processing, to publication, within an open and online collaborative framework for the Institut für Neutestamentliche Textforschung (INTF) and their global partners while editing the Editio Critica Maior (ECM), the paramount critical edition of the Greek New Testament, which analyses over 5,600 Greek witnesses and includes a comprehensive apparatus of chosen manuscripts, weighted by quotations and early translations. Additionally, this project produced the first digital edition of the ECM. This case study of transitioning the workflow at the INTF to an online collaborative research environment seeks to convey successful methods and lessons learned by describing a professional software engineer's foray into the world of academic digital humanities. It compares development roles and practices in the software industry with those in the academic environment, offers insights into how this software engineer found a software team therein, suggests how a fledgling online community can successfully achieve critical mass, provides an outsider's perspective on what a digital critical scholarly edition might be, and hopes to offer useful software, datasets, and a thriving online community for manuscript researchers.
Land information systems: an overview and outline of software requirements
This thesis looks at some aspects of land information systems. The introduction gives the rationale for this study, and the second chapter outlines the development of land information systems with particular reference to the cadastre. In the third chapter the software requirements for the development of land information systems are considered; programming languages and databases are discussed. The fourth chapter deals with the organisation and hardware needed for a land information system. Finally, in the fifth chapter some of the algorithms used in land information systems are presented. Four appendices cover the programs which were developed in the course of this study, the software specification for an operational system, an example of LIS-related data in a large organisation, and the syntax of Modula-2, the programming language used for the examples.
- …