589 research outputs found

    Navigating Diverse Datasets in the Face of Uncertainty

    When exploring large volumes of data, one of the challenging aspects is their diversity of origin. Multiple files that have not yet been ingested into a database system may contain information of interest to a researcher, who must curate, understand and sift their content before being able to extract knowledge. Performance is one of the greatest difficulties in exploring these datasets. On the one hand, examining non-indexed, unprocessed files can be inefficient. On the other hand, processing the data before understanding it introduces latency and potentially unnecessary work if the chosen schema matches the data poorly. We have surveyed the state of the art and, fortunately, there exist multiple proposed solutions for handling data in situ efficiently. Another major difficulty is matching files from multiple origins, since their schema and layout may not be compatible or properly documented. Most surveyed solutions overlook this problem, especially for numeric, uncertain data, as is typical in fields like astronomy. The main objective of our research is to assist data scientists during the exploration of unprocessed, numerical, raw data distributed across multiple files, based solely on its intrinsic distribution. In this thesis, we first introduce the concept of Equally-Distributed Dependencies (EDD), which provides the foundations to match this kind of dataset. We propose PresQ, a novel algorithm that finds quasi-cliques on hypergraphs based on their expected statistical properties. The probabilistic approach of PresQ can be successfully exploited to mine EDD between diverse datasets when the underlying populations can be assumed to be the same. Finally, we propose a two-sample statistical test based on Self-Organizing Maps (SOM). This method can outperform, in terms of power, other classifier-based two-sample tests, being in some cases comparable to kernel-based methods, with the advantage of being interpretable. Both PresQ and the SOM-based statistical test can provide insights that drive serendipitous discoveries.
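
As a rough illustration of what a two-sample test decides, the sketch below runs a plain permutation test on the difference of sample means. It is not the SOM-based test proposed in the thesis, nor a classifier-based or kernel-based one; the function name `permutation_test`, its parameters, and the synthetic data are assumptions made for this example.

```python
import random


def permutation_test(x, y, n_permutations=2000, seed=0):
    """Return a p-value for the null hypothesis that x and y share a distribution.

    Illustrative only: the statistic is the absolute difference of means,
    so the test is sensitive to location shifts and little else.
    """
    rng = random.Random(seed)
    n_x = len(x)
    observed = abs(sum(x) / n_x - sum(y) / len(y))
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so reshuffling the pooled sample simulates the null distribution.
        rng.shuffle(pooled)
        px, py = pooled[:n_x], pooled[n_x:]
        stat = abs(sum(px) / n_x - sum(py) / len(py))
        if stat >= observed:
            extreme += 1
    # The +1 correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    data_rng = random.Random(42)
    a = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
    b = [data_rng.gauss(1.0, 1.0) for _ in range(200)]  # shifted population
    print("p-value, shifted populations:", permutation_test(a, b))
```

A small p-value suggests the two files were not drawn from the same population. The power mentioned in the abstract is the probability that a test rejects when the populations really do differ.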

    Navigating diverse datasets in the face of uncertainty

    One of the biggest problems of big data is the diverse origin of the data. A researcher may be interested in aggregating data from multiple files that have not yet been pre-processed and inserted into a database system, and must clean and filter their content before being able to extract knowledge. Direct exploration of these files presents serious performance problems: examining files without any kind of preparation or indexing can be inefficient both in terms of data reads and of execution time. On the other hand, ingesting them into a database system before understanding them introduces latency and potentially redundant work if the chosen schema does not fit the queries that will be run. Fortunately, our review of the state of the art shows that there are multiple possible solutions for exploring data in situ effectively. Another major difficulty is handling files of diverse provenance, since their schema and layout may not be compatible or may not be properly documented. Most of the solutions found overlook this problem, especially with regard to numeric and uncertain data, such as those related to physical attributes generated in fields like astronomy. Our main objective is to help researchers explore this kind of data without prior processing, stored in multiple files, using only its intrinsic distribution. In this thesis we first introduce the concept of Equally-Distributed Dependencies (EDD), laying the foundations needed to match datasets with different schemas but with attributes in common. We then present PresQ, a new probabilistic algorithm for finding quasi-cliques in hypergraphs. The statistical approach of PresQ makes it possible to map the problem of finding EDD onto that of finding quasi-cliques. Finally, we propose a statistical test based on Self-Organizing Maps (SOM). This method can outperform, in terms of statistical power, other classifier-based techniques, being in some cases comparable to kernel-based methods, with the additional advantage of being interpretable. Both PresQ and the SOM-based statistical test can drive serendipitous discoveries. 211 pages.

    BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

    The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model. Comment: 22 pages, SIGMOD 201

    Similarity processing in multi-observation data

    Many real-world application domains, such as sensor-monitoring systems for environmental research or medical diagnostic systems, deal with data that is represented by multiple observations. In contrast to single-observation data, where each object is assigned to exactly one occurrence, multi-observation data is based on several occurrences that are subject to two key properties: temporal variability and uncertainty. When defining similarity between data objects, these properties play a significant role. In general, methods designed for single-observation data hardly apply to multi-observation data, as they are either not supported by the data models or do not provide sufficiently efficient or effective solutions. Prominent directions incorporating the key properties are the fields of time series, where data is created by temporally successive observations, and uncertain data, where observations are mutually exclusive. This thesis provides research contributions for similarity processing (similarity search and data mining) on time series and uncertain data. The first part of this thesis focuses on similarity processing in time series databases. A variety of similarity measures have recently been proposed that support similarity processing w.r.t. various aspects. In particular, this part deals with time series that consist of periodic occurrences of patterns. Examining an application scenario from the medical domain, a solution for activity recognition is presented. Finally, the extraction of feature vectors allows the application of spatial index structures, which accelerate search and mining tasks, resulting in a significant efficiency gain. As feature vectors are potentially of high dimensionality, this part introduces indexing approaches for the high-dimensional space, for the full-dimensional case as well as for arbitrary subspaces. The second part of this thesis focuses on similarity processing in probabilistic databases. The presence of uncertainty is inherent in many applications dealing with data collected by sensing devices. Often, the collected information is noisy or incomplete due to measurement or transmission errors. Furthermore, data may be rendered uncertain for privacy-preserving reasons when confidential information is present. This creates a number of challenges in terms of effectively and efficiently querying and mining uncertain data. Existing work in this field either neglects the presence of dependencies or provides only approximate results while applying methods designed for certain data. Other approaches dealing with uncertain data are not able to provide efficient solutions. This part presents query processing approaches that outperform existing solutions for probabilistic similarity ranking. This part finally leads to the application of the introduced techniques to data mining tasks, such as the prominent problem of probabilistic frequent itemset mining.
    Many application areas, such as environmental research or medical diagnostics, use sensor-monitoring systems. Such systems must often be able to handle data that is represented by multiple observations. In contrast to data with only one observation (single-observation data), data from multiple observations (multi-observation data) is based on a large number of observations that are subject to two key properties: temporal variability and data uncertainty. These properties play an important role in similarity search and data mining. Common solutions in these areas that were developed for single-observation data are generally not applicable to handling multiple observations per object. The reason is that these approaches are either incompatible with the data models or do not offer solutions that meet current demands on solution quality or efficiency. Well-known research directions that deal with multi-observation data and its key properties are time series analysis and similarity search in probabilistic databases. While the former assumes a temporal order of the observations of an object, uncertain data objects are based on observations that are mutually dependent or mutually exclusive. This dissertation comprises current research contributions from both areas, presenting methods for similarity search and for application in data mining. The first part of this work deals with similarity search and data mining in time series databases. In particular, it considers time series that consist of periodically occurring patterns. In the context of a medical application scenario, an approach to activity recognition is presented. By means of feature extraction, it allows efficient storage and analysis with the help of spatial index structures. For the case of high-dimensional feature vectors, this part presents two indexing methods for accelerating similarity queries. The first method takes all attributes of the feature vectors into account, while the second allows the query to be projected onto a user-defined subspace of the vector space. The second part of this work addresses similarity search in the context of probabilistic databases. Data from sensor measurements often has properties that are subject to a certain degree of uncertainty. Due to measurement or transmission errors, measured values are often incomplete or noisy. In various scenarios, for example with personal or medically confidential data, noise can also be added to data manually after the fact, so that an exact reconstruction of the original information is not possible. These circumstances pose several challenges for query techniques and data mining methods. Existing research in the field of uncertain databases often disregards various problems. Either the presence of dependencies is ignored, or only approximate solutions are offered, which allow methods for certain data to be applied. Other approaches compute exact solutions but do not return the answers in acceptable runtime. This part of the work presents efficient methods for answering similarity queries that return the results in descending order of relevance, that is, as a ranking of the results. Finally, the techniques applied are transferred to problems in probabilistic data mining, for example to solve the frequent itemset mining problem while taking the full uncertainty information into account.

    SHiFT: An Efficient, Flexible Search Engine for Transfer Learning

    Transfer learning can be seen as a data- and compute-efficient alternative to training models from scratch. The emergence of rich model repositories, such as TensorFlow Hub, enables practitioners and researchers to unleash the potential of these models across a wide range of downstream tasks. As these repositories keep growing exponentially, efficiently selecting a good model for the task at hand becomes paramount. By carefully comparing various selection and search strategies, we find that no single method outperforms the others, and that hybrid or mixed strategies can be beneficial. Therefore, we propose SHiFT, the first downstream task-aware, flexible, and efficient model search engine for transfer learning. These properties are enabled by a custom query language, SHiFT-QL, together with a cost-based decision maker, which we empirically validate. Motivated by the iterative nature of machine learning development, we further support efficient incremental executions of our queries, which requires a careful implementation when jointly used with our optimizations.

    Recherche d'images par le contenu, analyse multirésolution et modèles de régression logistique

    This thesis presents our contributions to content-based image retrieval using multiresolution analysis, as well as to linear and nonlinear classification. In the first part, we propose a simple and fast method for content-based image retrieval. To represent color images, we introduce new feature descriptors, which are histograms weighted by the multispectral gradient. To measure the degree of similarity between two images quickly and efficiently, we use a weighted pseudo-metric that relies on the wavelet decomposition and compression of the histograms extracted from the images. The weights of the pseudo-metric are tuned using the classical logistic regression model in order to improve its discriminative power and the retrieval precision. In the second part, we propose a new Bayesian logistic regression model based on a variational method. This new model is compared with the classical logistic regression model in the context of image retrieval. We then show that, relative to the classical model, the Bayesian model yields a notable improvement in the discriminative power of the pseudo-metric and in retrieval precision. In the third part, we detail the derivation of the new Bayesian logistic regression model based on a variational method, and we compare this model with the classical logistic regression model as well as with other linear classifiers from the literature. We then compare our retrieval method, using the Bayesian logistic regression model, with other previously published retrieval methods. In the fourth part, we introduce feature selection to improve our retrieval method using the model introduced above. Indeed, feature selection automatically gives more importance to the most discriminative features and less importance to the least discriminative ones. Finally, in the fifth part, we propose a new Bayesian model of logistic discriminant analysis built with kernels, thereby allowing flexible nonlinear classification.

    Learned Sorted Table Search and Static Indexes in Small-Space Data Models

    Machine-learning techniques, properly combined with data structures, have resulted in Learned Static Indexes, innovative and powerful tools that speed up Binary Search at the cost of additional space relative to the table being searched. That space is devoted to the machine-learning models. Although in their infancy, these indexes are methodologically and practically important, due to the pervasiveness of Sorted Table Search procedures. In modern applications, model space is a key factor, and a major open question in this area is to what extent one can enjoy the speed-up of Binary Search achieved by Learned Indexes while using constant or nearly constant-space models. In this paper, we investigate this question by (a) introducing two new models, i.e., the Learned k-ary Search Model and the Synoptic Recursive Model Index; and (b) systematically exploring the time–space trade-offs of a hierarchy of existing models, i.e., the ones in the reference software platform Searching on Sorted Data, together with the new ones proposed here. We document a novel and rather complex time–space trade-off picture, which is informative for users as well as designers of Learned Indexing data structures. By adhering to and extending the current benchmarking methodology, we experimentally show that the Learned k-ary Search Model is competitive in time with respect to Binary Search in constant additional space. Our second model, together with the bi-criteria Piece-wise Geometric Model Index, can achieve a speed-up of Binary Search with a model space of (Formula presented.) more than the one taken by the table, thereby being competitive in terms of the time–space trade-off with existing proposals. The Synoptic Recursive Model Index and the bi-criteria Piece-wise Geometric Model complement each other quite well across the various levels of the internal memory hierarchy. Finally, our findings stimulate research in this area since they highlight the need for further studies regarding the time–space relation in Learned Indexes.
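
The general idea of learned sorted table search can be sketched as follows: a model predicts where a key should sit in the sorted table, and a binary search is then confined to a small window around that prediction. The sketch below uses a single least-squares line plus its maximum error, which takes constant extra space. It is a generic illustration under that assumption, not the Learned k-ary Search Model or the Synoptic Recursive Model Index from the paper; `fit_linear` and `learned_search` are hypothetical names, and the table is assumed to hold distinct numeric keys.

```python
from bisect import bisect_left


def fit_linear(keys):
    """Fit position ~ slope * key + intercept on a sorted table of distinct keys.

    Returns (slope, intercept, err), where err bounds the distance between
    the truncated prediction and the true position of every key in the table.
    """
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    if var == 0:
        slope = 0.0
    else:
        slope = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys)) / var
    intercept = mean_p - slope * mean_k
    max_err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))
    # One extra position absorbs the truncation of the prediction to an int.
    return slope, intercept, int(max_err) + 1


def learned_search(keys, model, target):
    """Return the index of target in keys, or -1 if it is absent."""
    slope, intercept, err = model
    n = len(keys)
    guess = int(slope * target + intercept)
    # Clamp the search window to the table; it has at most 2 * err + 1 slots.
    lo = min(max(0, guess - err), n)
    hi = max(lo, min(n, guess + err + 1))
    i = bisect_left(keys, target, lo, hi)
    if i < n and keys[i] == target:
        return i
    return -1


if __name__ == "__main__":
    table = [3, 7, 8, 15, 21, 40, 41, 90]
    model = fit_linear(table)
    print([learned_search(table, model, k) for k in table])
    print(learned_search(table, model, 10))
```

The closer the keys are to evenly spaced, the smaller the error window and the fewer comparisons the final binary search needs. On skewed tables a single line fits poorly, which is one reason richer models trade extra space for speed.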