151 research outputs found

    Predictive Querying for Autoregressive Neural Sequence Models

    Full text link
    In reasoning about sequential events it is natural to pose probabilistic queries such as "when will event A occur next" or "what is the probability of A occurring before B", with applications in areas such as user modeling, medicine, and finance. However, with machine learning shifting towards neural autoregressive models such as RNNs and transformers, probabilistic querying has been largely restricted to simple cases such as next-event prediction. This is in part due to the fact that future querying involves marginalization over large path spaces, which is not straightforward to do efficiently in such models. In this paper we introduce a general typology for predictive queries in neural autoregressive sequence models and show that such queries can be systematically represented by sets of elementary building blocks. We leverage this typology to develop new query estimation methods based on beam search, importance sampling, and hybrids. Across four large-scale sequence datasets from different application domains, as well as for the GPT-2 language model, we demonstrate the ability to make query answering tractable for arbitrary queries in exponentially-large predictive path-spaces, and find clear differences in cost-accuracy tradeoffs between search and sampling methods.Comment: Oral Presentation at the Intl. Conference on Neural Information Processing Systems (NeurIPS 2022

    Probabilistic data types

    Get PDF
    Dissertação de mestrado integrado em Engenharia InformáticaConflict-Free Replicated Data Types (CRDTs) provide deterministic outcomes from concurrent executions. The conflict resolution mechanism uses information on the ordering of the last operations performed, which indicates if a given operation is known by a replica, typically using some variant of version vectors. This thesis will explore the construction of CRDTs that use a novel stochastic mechanism that can track with high accuracy knowledge of the occurrence of recently performed operations and with less accuracy for older operations. The aim is to obtain better scaling properties and avoid the use of metadata that is linear on the number of replicas.Conflict-Free Replicated Data Types (CRDTs) oferecem resultados determinísticos de execuções concorrentes. O mecanismo de resolução de conflitos usa informação sobre a ordenação das últimas operações realizadas, que indica se uma dada operação é conhecida por uma réplica, geralmente usando alguma variante de version vectors. Esta tese explorara a construção de CRDTs que utilizam um novo mecanismo estocástico que pode identificar com alta precisão o conhecimento sobre a ocorrência de operações realizadas recentemente e com menor precisão para operações mais antigas. O objetivo é a obtenção de melhores propriedades de escalabilidade e evitar o uso de metadados em quantidade linear em relação ao número de réplicas

    Evaluating Methods for Privacy-Preserving Data Sharing in Genomics

    Get PDF
    The availability of genomic data is often essential to progress in biomedical re- search, personalized medicine, drug development, etc. However, its extreme sensitivity makes it problematic, if not outright impossible, to publish or share it. In this dissertation, we study and build systems that are geared towards privacy preserving genomic data sharing. We first look at the Matchmaker Exchange, a platform that connects multiple distributed databases through an API and allows researchers to query for genetic variants in other databases through the network. However, queries are broadcast to all researchers that made a similar query in any of the connected databases, which can lead to a reluctance to use the platform, due to loss of privacy or competitive advantage. In order to overcome this reluctance, we propose a framework to support anonymous querying on the platform. Since genomic data’s sensitivity does not degrade over time, we analyze the real-world guarantees provided by the only tool available for long term genomic data storage. We find that the system offers low security when the adversary has access to side information, and we support our claims by empirical evidence. We also study the viability of synthetic data for privacy preserving data sharing. Since for genomic data research, the utility of the data provided is of the utmost importance, we first perform a utility evaluation on generative models for different types of datasets (i.e., financial data, images, and locations). Then, we propose a privacy evaluation framework for synthetic data. We then perform a measurement study assessing state-of-the-art generative models specifically geared for human genomic data, looking at both utility and privacy perspectives. Overall, we find that there is no single approach for generating synthetic data that performs well across the board from both utility and privacy perspectives

    Spatial Keyword Querying: Ranking Evaluation and Efficient Query Processing

    Get PDF

    TopX : efficient and versatile top-k query processing for text, structured, and semistructured data

    Get PDF
    TopX is a top-k retrieval engine for text and XML data. Unlike Boolean engines, it stops query processing as soon as it can safely determine the k top-ranked result objects according to a monotonous score aggregation function with respect to a multidimensional query. The main contributions of the thesis unfold into four main points, confirmed by previous publications at international conferences or workshops: • Top-k query processing with probabilistic guarantees. • Index-access optimized top-k query processing. • Dynamic and self-tuning, incremental query expansion for top-k query processing. • Efficient support for ranked XML retrieval and full-text search. Our experiments demonstrate the viability and improved efficiency of our approach compared to existing related work for a broad variety of retrieval scenarios.TopX ist eine Top-k Suchmaschine für Text und XML Daten. Im Gegensatz zu Boole\u27; schen Suchmaschinen terminiert TopX die Anfragebearbeitung, sobald die k besten Ergebnisobjekte im Hinblick auf eine mehrdimensionale Anfrage gefunden wurden. Die Hauptbeiträge dieser Arbeit teilen sich in vier Schwerpunkte basierend auf vorherigen Veröffentlichungen bei internationalen Konferenzen oder Workshops: • Top-k Anfragebearbeitung mit probabilistischen Garantien. • Zugriffsoptimierte Top-k Anfragebearbeitung. • Dynamische und selbstoptimierende, inkrementelle Anfrageexpansion für Top-k Anfragebearbeitung. • Effiziente Unterstützung für XML-Anfragen und Volltextsuche. Unsere Experimente bestätigen die Vielseitigkeit und gesteigerte Effizienz unserer Verfahren gegenüber existierenden, führenden Ansätzen für eine weite Bandbreite von Anwendungen in der Informationssuche

    Analysis and improvement of security and privacy techniques for genomic information

    Get PDF
    The purpose of this thesis is to review the current literature of privacy preserving techniques for genomic information on the last years. Based on the analysis, we propose a long-term classification system for the reviewed techniques. We also develop a security improvement proposal for the Beacon system without hindering research utility

    Temporal multimodal video and lifelog retrieval

    Get PDF
    The past decades have seen exponential growth of both consumption and production of data, with multimedia such as images and videos contributing significantly to said growth. The widespread proliferation of smartphones has provided everyday users with the ability to consume and produce such content easily. As the complexity and diversity of multimedia data has grown, so has the need for more complex retrieval models which address the information needs of users. Finding relevant multimedia content is central in many scenarios, from internet search engines and medical retrieval to querying one's personal multimedia archive, also called lifelog. Traditional retrieval models have often focused on queries targeting small units of retrieval, yet users usually remember temporal context and expect results to include this. However, there is little research into enabling these information needs in interactive multimedia retrieval. In this thesis, we aim to close this research gap by making several contributions to multimedia retrieval with a focus on two scenarios, namely video and lifelog retrieval. We provide a retrieval model for complex information needs with temporal components, including a data model for multimedia retrieval, a query model for complex information needs, and a modular and adaptable query execution model which includes novel algorithms for result fusion. The concepts and models are implemented in vitrivr, an open-source multimodal multimedia retrieval system, which covers all aspects from extraction to query formulation and browsing. vitrivr has proven its usefulness in evaluation campaigns and is now used in two large-scale interdisciplinary research projects. We show the feasibility and effectiveness of our contributions in two ways: firstly, through results from user-centric evaluations which pit different user-system combinations against one another. Secondly, we perform a system-centric evaluation by creating a new dataset for temporal information needs in video and lifelog retrieval with which we quantitatively evaluate our models. The results show significant benefits for systems that enable users to specify more complex information needs with temporal components. Participation in interactive retrieval evaluation campaigns over multiple years provides insight into possible future developments and challenges of such campaigns

    Typilus: Neural Type Hints

    Get PDF
    Type inference over partial contexts in dynamically typed languages is challenging. In this work, we present a graph neural network model that predicts types by probabilistically reasoning over a program's structure, names, and patterns. The network uses deep similarity learning to learn a TypeSpace -- a continuous relaxation of the discrete space of types -- and how to embed the type properties of a symbol (i.e. identifier) into it. Importantly, our model can employ one-shot learning to predict an open vocabulary of types, including rare and user-defined ones. We realise our approach in Typilus for Python that combines the TypeSpace with an optional type checker. We show that Typilus accurately predicts types. Typilus confidently predicts types for 70% of all annotatable symbols; when it predicts a type, that type optionally type checks 95% of the time. Typilus can also find incorrect type annotations; two important and popular open source libraries, fairseq and allennlp, accepted our pull requests that fixed the annotation errors Typilus discovered.Comment: Accepted to PLDI 202
    • …
    corecore