
    Method and system for data clustering for very large databases

    Multi-dimensional data contained in very large databases is efficiently and accurately clustered to determine patterns and extract useful information from those patterns. Conventional computer processors with limited memory capacity and conventional operating speeds may be used, allowing massive data sets to be processed in a reasonable time and with reasonable computer resources. The clustering process is organized using a clustering feature tree structure wherein each clustering feature comprises the number of data points in the cluster, the linear sum of the data points in the cluster, and the square sum of the data points in the cluster. A dense region of data points is treated collectively as a single cluster, and points in sparsely occupied regions can be treated as outliers and removed from the clustering feature tree. The clustering can be carried out continuously as new data points are received and processed, with the clustering feature tree being restructured as necessary to accommodate the information from the newly received data points.
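
    The clustering feature triple is compact enough to sketch directly. Below is a minimal Python illustration of a clustering feature with the additive merge that makes the tree structure workable; the class and method names are ours for illustration, not the patent's.

```python
import math

class ClusteringFeature:
    """CF triple from the abstract: N points, their linear sum LS,
    and their square sum SS (sum of squared norms)."""

    def __init__(self, dims):
        self.n = 0
        self.ls = [0.0] * dims
        self.ss = 0.0

    def add(self, point):
        """Absorb one data point into the cluster summary."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
        self.ss += sum(x * x for x in point)

    def merge(self, other):
        """CFs are additive, so two subclusters combine in O(dims);
        this is what lets a dense region be treated as one cluster."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        """Root-mean-square distance of points from the centroid,
        computable from (N, LS, SS) alone: R^2 = SS/N - ||c||^2."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

cf = ClusteringFeature(2)
for p in [(1.0, 2.0), (1.2, 1.8), (0.9, 2.1)]:
    cf.add(p)
print(cf.centroid(), cf.radius())
```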

    Genome inventory and analysis of nuclear hormone receptors in Tetraodon nigroviridis

    Nuclear hormone receptors (NRs) form a large superfamily of ligand-activated transcription factors, which regulate genes underlying a wide range of (patho)physiological phenomena. Availability of the full genome sequence of Tetraodon nigroviridis facilitated a genome-wide analysis of the NRs in a fish genome. Seventy-one NRs were found in Tetraodon and were compared with mammalian and fish NR family members. In general, there is a higher representation of NRs in fish genomes compared to mammalian ones. They showed high diversity across classes, as observed by phylogenetic analysis. Nucleotide substitution rates show strong negative selection among fish NRs, except for pregnane X receptor (PXR), estrogen receptor (ER), and liver X receptor (LXR). This may be attributed to the crucial role these receptors play in the metabolism and detoxification of xenobiotic and endobiotic compounds, which might have resulted in slight positive selection. Chromosomal mapping and pairwise comparisons of NR distribution in Tetraodon and humans led to the identification of nine syntenic NR regions, of which three are common among fully sequenced vertebrate genomes. Gene structure analysis shows strong conservation of exon structures among orthologues, whereas paralogous members show different splicing patterns; intron gain or loss and the addition or substitution of exons played a major role in the evolution of the NR superfamily.

    Srql: Sorted relational query language

    A relation is an unordered collection of records. Often, however, there is an underlying order (e.g., a sequence of stock prices), and users want to pose queries that reflect this order (e.g., find a weekly moving average). SQL provides no support for posing such queries. In this paper, we show how a rich class of queries reflecting sort order can be naturally expressed and efficiently executed with simple extensions to SQL.
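
    SRQL predates the window functions later standardized in SQL, which cover much of the same ground. As a rough modern rendering of the paper's weekly moving-average example, here is a sketch using SQLite's window functions through Python's sqlite3 module (requires SQLite 3.25 or newer; the table and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (day INTEGER, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)",
                 [(d, 100.0 + d % 5) for d in range(1, 15)])

# SRQL makes sequence order a first-class property of the relation;
# standard SQL recovers it here with ORDER BY inside the window.
rows = conn.execute("""
    SELECT day,
           AVG(price) OVER (ORDER BY day
                            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
               AS weekly_avg
    FROM prices
""").fetchall()
for day, avg in rows:
    print(day, round(avg, 2))
```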

    XTable in Action: Seamless Interoperability in Data Lakes

    Contemporary approaches to data management are increasingly relying on unified analytics and AI platforms to foster collaboration, interoperability, seamless access to reliable data, and high performance. Data Lakes featuring open standard table formats such as Delta Lake, Apache Hudi, and Apache Iceberg are central components of these data architectures. Choosing the right format for managing a table is crucial for achieving the objectives mentioned above. The challenge lies in selecting the best format, a task that is onerous and whose outcome may be temporary, as the ideal choice can shift over time with data growth, evolving workloads, and the competitive development of table formats and processing engines. Moreover, restricting data access to a single format can hinder data sharing, resulting in diminished business value over the long term. The ability to interoperate seamlessly between formats, with negligible overhead, can effectively address these challenges. Our solution in this direction is an innovative omni-directional translator, XTable, that facilitates writing data in one format and reading it in any format, thus achieving the desired format interoperability. In this work, we demonstrate the effectiveness of XTable through application scenarios inspired by real-world use cases.
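
    All three formats ultimately describe the same underlying Parquet data files and differ mainly in their metadata layers, which is what makes metadata-only translation feasible. The toy sketch below illustrates that idea through a format-neutral intermediate; the intermediate representation and the writer functions are invented for illustration and are not XTable's actual design or API.

```python
from dataclasses import dataclass, field

@dataclass
class TableState:
    """Invented format-neutral snapshot: the shared Parquet files plus
    a schema. Loosely mirrors the observation that all three formats
    describe the same data files."""
    name: str
    schema: dict
    data_files: list = field(default_factory=list)

def to_iceberg_metadata(state: TableState) -> dict:
    # Illustrative only: a real translator would emit Iceberg
    # manifest and snapshot files, not a dict.
    return {"format": "iceberg", "table": state.name,
            "schema": state.schema, "manifests": state.data_files}

def to_delta_metadata(state: TableState) -> dict:
    # Illustrative only: a real translator would append JSON actions
    # to the Delta transaction log.
    return {"format": "delta", "table": state.name,
            "schema": state.schema, "add_actions": state.data_files}

# Write the data once (say, as Hudi), then expose the same files
# to engines expecting other formats by translating metadata only:
state = TableState("events", {"id": "long", "ts": "timestamp"},
                   ["part-0001.parquet", "part-0002.parquet"])
print(to_iceberg_metadata(state))
print(to_delta_metadata(state))
```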

    Iterative MapReduce for Large Scale Machine Learning

    Large datasets ("Big Data") are becoming ubiquitous because the potential value in deriving insights from data, across a wide range of business and scientific applications, is increasingly recognized. In particular, machine learning - one of the foundational disciplines for data analysis, summarization and inference - on Big Data has become routine at most organizations that operate large clouds, usually based on systems such as Hadoop that support the MapReduce programming paradigm. It is now widely recognized that while MapReduce is highly scalable, it suffers from a critical weakness for machine learning: it does not support iteration. Consequently, one has to program around this limitation, leading to fragile, inefficient code. Further, reliance on the programmer is inherently flawed in a multi-tenanted cloud environment, since the programmer does not have visibility into the state of the system when his or her program executes. Prior work has sought to address this problem either by developing specialized systems aimed at stylized applications, or by augmenting MapReduce with ad hoc support for saving state across iterations (driven by an external loop). In this paper, we advocate support for looping as a first-class construct, and propose an extension of the MapReduce programming paradigm called Iterative MapReduce. We then develop an optimizer for a class of Iterative MapReduce programs that cover most machine learning techniques, provide theoretical justifications for the key optimization steps, and empirically demonstrate that system-optimized programs for significant machine learning tasks are competitive with state-of-the-art specialized solutions.
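
    The external driver-loop workaround the paper argues against looks roughly like the sketch below: one k-means step per map/reduce pass, with the iteration living in plain Python where the system cannot see or optimize it. This is a toy one-dimensional illustration, not the paper's system.

```python
from collections import defaultdict

def kmeans_mapreduce(points, centroids, iterations=10):
    """One map/reduce pass per k-means step, driven by an external
    loop the underlying system cannot see (toy 1-D illustration)."""
    for _ in range(iterations):
        # Map: emit (nearest-centroid-id, point) for every point.
        pairs = []
        for p in points:
            cid = min(range(len(centroids)),
                      key=lambda c: (p - centroids[c]) ** 2)
            pairs.append((cid, p))
        # Shuffle: group points by centroid id.
        groups = defaultdict(list)
        for cid, p in pairs:
            groups[cid].append(p)
        # Reduce: recompute each centroid as the mean of its group
        # (assumes every centroid keeps at least one point).
        centroids = [sum(groups[c]) / len(groups[c])
                     for c in sorted(groups)]
    return centroids

print(kmeans_mapreduce([1.0, 1.2, 7.9, 8.1, 8.0], [0.0, 5.0]))
```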

    The EDAM Project: Mining Atmospheric Aerosol Datasets

    Data mining has been a very active area of research in the database, machine learning, and mathematical programming communities in recent years. EDAM (Exploratory Data Analysis and Management) is a joint project between researchers in Atmospheric Chemistry and Computer Science at Carleton College and the University of Wisconsin-Madison that aims to develop data mining techniques for advancing the state of the art in analyzing atmospheric aerosol datasets. There is a great need to better understand the sources, dynamics, and compositions of atmospheric aerosols. The traditional approach to particle measurement, the collection of bulk samples of particulates on filters, is not adequate for studying particle dynamics and real-time correlations. This has led to the development of a new generation of real-time instruments that provide continuous or semi-continuous streams of data about certain aerosol properties. However, these instruments have added a significant level of complexity to atmospheric aerosol data, and dramatically increased the amounts of data to be collected, managed, and analyzed. Our ability to integrate the data from all of these new and complex instruments now lags far behind our data-collection capabilities, and severely limits our ability to understand the data and act upon it in a timely manner. In this paper, we present an overview of the EDAM project. The goal of the project, which is in its early stages, is to develop novel data mining algorithms and approaches to managing and monitoring multiple complex data streams. An important objective is data quality assurance, where real-time data mining offers great potential. The approach we take should also provide good techniques for dealing with gas-phase and semi-volatile data. While atmospheric aerosol analysis is an important and challenging domain that motivates us with real problems and serves as a concrete test of our results, our objective is to develop techniques that have broader applicability, and to explore some fundamental challenges in data mining that are not specific to any given application domain.
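
    For a flavor of the kind of real-time quality check such instrument streams invite, here is a toy sliding-window monitor that flags readings far from the recent mean. The window size and threshold are arbitrary, and this is not one of EDAM's algorithms.

```python
from collections import deque
import math

def monitor(stream, window=50, z_threshold=4.0):
    """Yield (time, value) readings that sit far outside the recent
    window: candidate data-quality problems in a continuous stream."""
    recent = deque(maxlen=window)
    for t, value in stream:
        if len(recent) >= 10:  # wait for a minimal history
            mean = sum(recent) / len(recent)
            var = sum((x - mean) ** 2 for x in recent) / len(recent)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > z_threshold:
                yield (t, value)
        recent.append(value)

readings = [(t, 10.0 + 0.1 * (t % 3)) for t in range(100)]
readings[60] = (60, 55.0)  # an injected spike
print(list(monitor(readings)))  # -> [(60, 55.0)]
```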

    LST-Bench: Benchmarking Log-Structured Tables in the Cloud

    Log-Structured Tables (LSTs), also commonly referred to as table formats, have recently emerged to bring consistency and isolation to object stores. With the separation of compute and storage, object stores have become the go-to for highly scalable and durable storage. However, this comes with its own set of challenges, such as the lack of the recovery and concurrency management that traditional database management systems provide. This is where LSTs such as Delta Lake, Apache Iceberg, and Apache Hudi come into play, providing an automatic metadata layer that manages tables defined over object stores and effectively addressing these challenges. This paradigm shift in the design of these systems necessitates updated evaluation methodologies. In this paper, we examine the characteristics of LSTs and propose extensions to existing benchmarks, including workload patterns and metrics, to accurately capture their performance. We introduce our framework, LST-Bench, which enables users to execute benchmarks tailored for the evaluation of LSTs. Our evaluation demonstrates how these benchmarks can be used to assess the performance, efficiency, and stability of LSTs. The code for LST-Bench is open source and available at https://github.com/microsoft/lst-bench/
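
    The framework itself lives at the repository above. Purely to illustrate the phase-and-metric structure such a benchmark takes, here is a toy harness that times named workload phases against a stand-in engine; all names are invented, and a real run would target engines over Delta Lake, Iceberg, or Hudi.

```python
import sqlite3
import time

def run_workload(cursor, phases):
    """Time each named phase of a workload; `phases` is a list of
    (phase name, list of SQL statements) pairs. A toy illustration of
    the phase/metric structure, not LST-Bench's actual interface."""
    metrics = {}
    for name, statements in phases:
        start = time.perf_counter()
        for sql in statements:
            cursor.execute(sql)
        metrics[name] = time.perf_counter() - start
    return metrics

# SQLite stands in for the engine; a real run would target engines
# over Delta Lake, Apache Iceberg, or Apache Hudi.
cur = sqlite3.connect(":memory:").cursor()
print(run_workload(cur, [
    ("load",   ["CREATE TABLE t (k INTEGER, v REAL)",
                "INSERT INTO t VALUES (1, 1.5), (2, 2.5)"]),
    ("query",  ["SELECT AVG(v) FROM t"]),
    ("update", ["UPDATE t SET v = v + 1 WHERE k = 1"]),
]))
```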