151 research outputs found
Predictive Querying for Autoregressive Neural Sequence Models
In reasoning about sequential events it is natural to pose probabilistic
queries such as "when will event A occur next" or "what is the probability of A
occurring before B", with applications in areas such as user modeling,
medicine, and finance. However, with machine learning shifting towards neural
autoregressive models such as RNNs and transformers, probabilistic querying has
been largely restricted to simple cases such as next-event prediction. This is
in part due to the fact that future querying involves marginalization over
large path spaces, which is not straightforward to do efficiently in such
models. In this paper we introduce a general typology for predictive queries in
neural autoregressive sequence models and show that such queries can be
systematically represented by sets of elementary building blocks. We leverage
this typology to develop new query estimation methods based on beam search,
importance sampling, and hybrids. Across four large-scale sequence datasets
from different application domains, as well as for the GPT-2 language model, we
demonstrate the ability to make query answering tractable for arbitrary queries
in exponentially-large predictive path-spaces, and find clear differences in
cost-accuracy tradeoffs between search and sampling methods.
Comment: Oral Presentation at the Intl. Conference on Neural Information
Processing Systems (NeurIPS 2022)
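To make the "A before B" query concrete, here is a minimal sketch that estimates it by plain Monte Carlo rollouts. This is naive sampling, not the beam search or importance sampling estimators the abstract describes. The three-event vocabulary and the `NEXT` table are hypothetical stand-ins for a neural model, and the toy model conditions only on the last event so that an exact answer can be computed for comparison.

```python
import random

# Hypothetical toy model: next-event probabilities given only the last event.
# A real neural autoregressive model would condition on the full history.
NEXT = {
    "A": {"A": 0.2, "B": 0.3, "C": 0.5},
    "B": {"A": 0.4, "B": 0.1, "C": 0.5},
    "C": {"A": 0.1, "B": 0.2, "C": 0.7},
}


def sample_next(last, rng):
    """Draw the next event from the toy model's conditional distribution."""
    events, probs = zip(*NEXT[last].items())
    return rng.choices(events, weights=probs, k=1)[0]


def estimate_a_before_b(start, horizon, n_samples, seed=0):
    """Monte Carlo estimate of P(A occurs before B within `horizon` steps)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        last = start
        for _ in range(horizon):
            last = sample_next(last, rng)
            if last == "A":
                hits += 1
                break
            if last == "B":
                break
    return hits / n_samples


def exact_a_before_b(start, horizon):
    """Exact answer by dynamic programming (possible only for this first-order toy)."""
    # p[s] = P(A before B within the remaining steps | last event was s)
    p = {s: 0.0 for s in NEXT}
    for _ in range(horizon):
        p = {s: NEXT[s]["A"] + NEXT[s]["C"] * p["C"] for s in NEXT}
    return p[start]
```

With a history-dependent model the exact recursion is no longer available, because the number of distinct paths grows exponentially with the horizon; that is the setting in which sampling and search estimators become necessary.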
Probabilistic data types
Integrated master's dissertation in Informatics Engineering.
Conflict-Free Replicated Data Types (CRDTs) provide deterministic outcomes from concurrent
executions. The conflict resolution mechanism uses information on the ordering of the last
operations performed, which indicates if a given operation is known by a replica, typically
using some variant of version vectors. This thesis will explore the construction of CRDTs
that use a novel stochastic mechanism that can track with high accuracy knowledge of the
occurrence of recently performed operations and with less accuracy for older operations.
The aim is to obtain better scaling properties and avoid the use of metadata that is linear in
the number of replicas.
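For context, the conventional mechanism that the thesis aims to improve on can be sketched as a plain version vector. This is a minimal illustration of the standard technique, not the stochastic mechanism the thesis proposes; the function names and replica identifiers are assumptions made for the example.

```python
def increment(vv, replica):
    """Return a copy of the version vector with one more operation from `replica`."""
    out = dict(vv)
    out[replica] = out.get(replica, 0) + 1
    return out


def merge(a, b):
    """Pointwise maximum: the knowledge held after two replicas synchronize."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def knows(vv, replica, seq):
    """True if the operation numbered `seq` from `replica` is covered by `vv`."""
    return vv.get(replica, 0) >= seq


def concurrent(a, b):
    """True if neither vector dominates the other."""
    keys = a.keys() | b.keys()
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in keys)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in keys)
    return a_ahead and b_ahead
```

Each vector holds one counter per replica that has ever issued an operation, which is the linear metadata cost the abstract refers to.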
Evaluating Methods for Privacy-Preserving Data Sharing in Genomics
The availability of genomic data is often essential to progress in biomedical research, personalized medicine, drug development, etc. However, its extreme sensitivity makes it problematic, if not outright impossible, to publish or share it. In this dissertation, we study and build systems that are geared towards privacy-preserving genomic data sharing. We first look at the Matchmaker Exchange, a platform that connects multiple distributed databases through an API and allows researchers to query for genetic variants in other databases through the network. However, queries are broadcast to all researchers that made a similar query in any of the connected databases, which can lead to a reluctance to use the platform, due to loss of privacy or competitive advantage. In order to overcome this reluctance, we propose a framework to support anonymous querying on the platform. Since genomic data’s sensitivity does not degrade over time, we analyze the real-world guarantees provided by the only tool available for long-term genomic data storage. We find that the system offers low security when the adversary has access to side information, and we support our claims with empirical evidence. We also study the viability of synthetic data for privacy-preserving data sharing. Since for genomic data research, the utility of the data provided is of the utmost importance, we first perform a utility evaluation on generative models for different types of datasets (i.e., financial data, images, and locations). Then, we propose a privacy evaluation framework for synthetic data. We then perform a measurement study assessing state-of-the-art generative models specifically geared for human genomic data, looking at both utility and privacy perspectives. Overall, we find that there is no single approach for generating synthetic data that performs well across the board from both utility and privacy perspectives.
TopX : efficient and versatile top-k query processing for text, structured, and semistructured data
TopX is a top-k retrieval engine for text and XML data. Unlike Boolean engines, it stops query processing as soon as it can safely determine the k top-ranked result objects according to a monotonic score aggregation function with respect to a multidimensional query. The main contributions of the thesis fall into four areas, each backed by previous publications at international conferences or workshops:
• Top-k query processing with probabilistic guarantees.
• Index-access optimized top-k query processing.
• Dynamic and self-tuning, incremental query expansion for top-k query processing.
• Efficient support for ranked XML retrieval and full-text search.
Our experiments demonstrate the viability and improved efficiency of our approach compared to existing related work for a broad variety of retrieval scenarios.
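The early-stopping idea can be illustrated with the classic threshold algorithm for top-k retrieval. This is a minimal sketch of that textbook technique, assuming summation as the monotonic aggregation function and in-memory dictionaries as index lists; it is not TopX's own implementation, and the function name and data layout are hypothetical.

```python
import heapq


def threshold_topk(lists, k):
    """Top-k by summed score with early stopping.

    `lists` holds one dict per query dimension, mapping object id -> score >= 0.
    Returns (top-k list of (id, score), number of sorted-access rounds used).
    """
    # Sorted access: each index list ordered by descending score.
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = {}
    depth = 0
    max_depth = max(len(s) for s in sorted_lists)
    while depth < max_depth:
        threshold = 0.0  # best total score any still-unseen object could reach
        for s in sorted_lists:
            if depth < len(s):
                obj, score = s[depth]
                threshold += score
                if obj not in seen:
                    # Random access: look up the object's score in every list.
                    seen[obj] = sum(l.get(obj, 0.0) for l in lists)
        depth += 1
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth  # no unseen object can enter the top k
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth
```

Stopping is safe only because the aggregation is monotonic: an object that has not yet appeared in any list cannot score higher than the sum of the scores at the current scan positions.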
Analysis and improvement of security and privacy techniques for genomic information
The purpose of this thesis is to review the recent literature on privacy-preserving techniques for genomic information. Based on this analysis, we propose a long-term classification system for the reviewed techniques. We also develop a proposal that improves the security of the Beacon system without hindering research utility.
Temporal multimodal video and lifelog retrieval
The past decades have seen exponential growth of both consumption and production of data, with multimedia such as images and videos contributing significantly to said growth. The widespread proliferation of smartphones has provided everyday users with the ability to consume and produce such content easily. As the complexity and diversity of multimedia data have grown, so has the need for more complex retrieval models which address the information needs of users. Finding relevant multimedia content is central in many scenarios, from internet search engines and medical retrieval to querying one's personal multimedia archive, also called lifelog. Traditional retrieval models have often focused on queries targeting small units of retrieval, yet users usually remember temporal context and expect results to include this. However, there is little research into enabling these information needs in interactive multimedia retrieval.
In this thesis, we aim to close this research gap by making several contributions to multimedia retrieval with a focus on two scenarios, namely video and lifelog retrieval. We provide a retrieval model for complex information needs with temporal components, including a data model for multimedia retrieval, a query model for complex information needs, and a modular and adaptable query execution model which includes novel algorithms for result fusion. The concepts and models are implemented in vitrivr, an open-source multimodal multimedia retrieval system, which covers all aspects from extraction to query formulation and browsing. vitrivr has proven its usefulness in evaluation campaigns and is now used in two large-scale interdisciplinary research projects. We show the feasibility and effectiveness of our contributions in two ways: firstly, through results from user-centric evaluations which pit different user-system combinations against one another. Secondly, we perform a system-centric evaluation by creating a new dataset for temporal information needs in video and lifelog retrieval with which we quantitatively evaluate our models.
The results show significant benefits for systems that enable users to specify more complex information needs with temporal components. Participation in interactive retrieval evaluation campaigns over multiple years provides insight into possible future developments and challenges of such campaigns.
Typilus: Neural Type Hints
Type inference over partial contexts in dynamically typed languages is
challenging. In this work, we present a graph neural network model that
predicts types by probabilistically reasoning over a program's structure,
names, and patterns. The network uses deep similarity learning to learn a
TypeSpace -- a continuous relaxation of the discrete space of types -- and how
to embed the type properties of a symbol (i.e. identifier) into it.
Importantly, our model can employ one-shot learning to predict an open
vocabulary of types, including rare and user-defined ones. We realise our
approach in Typilus for Python that combines the TypeSpace with an optional
type checker. We show that Typilus accurately predicts types. Typilus
confidently predicts types for 70% of all annotatable symbols; when it predicts
a type, that type optionally type checks 95% of the time. Typilus can also find
incorrect type annotations; two important and popular open source libraries,
fairseq and allennlp, accepted our pull requests that fixed the annotation
errors Typilus discovered.
Comment: Accepted to PLDI 2020
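The open-vocabulary prediction step can be pictured as a nearest-neighbour lookup in an embedding space. The following is a minimal sketch under strong simplifying assumptions: the two-dimensional embeddings are made up, the vote is unweighted, and the names `predict_type` and `add_type` are hypothetical. It does not reproduce the graph neural network or the exact lookup used by Typilus.

```python
import math
from collections import Counter


def predict_type(query, type_map, k=3):
    """Vote among the k nearest labelled embeddings; return (type, confidence)."""
    nearest = sorted(type_map, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


def add_type(type_map, embedding, label):
    """Register a new type from a single labelled example, with no retraining."""
    type_map.append((embedding, label))


# Hypothetical labelled points in a toy 2-D type space.
type_map = [
    ((0.0, 0.0), "int"),
    ((0.1, 0.0), "int"),
    ((1.0, 1.0), "str"),
    ((1.1, 1.0), "str"),
]
```

Because a prediction is a lookup rather than a fixed-size classifier output, a rare or user-defined type becomes predictable as soon as one labelled embedding for it is added to the map.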