Method and system for data clustering for very large databases
Multi-dimensional data contained in very large databases is efficiently and accurately clustered to determine patterns therein and extract useful information from such patterns. Conventional computer processors with limited memory capacity and conventional operating speed may be used, allowing massive data sets to be processed in a reasonable time and with reasonable computer resources. The clustering process is organized using a clustering feature tree structure, wherein each clustering feature comprises the number of data points in the cluster, the linear sum of the data points in the cluster, and the square sum of the data points in the cluster. A dense region of data points is treated collectively as a single cluster, and points in sparsely occupied regions can be treated as outliers and removed from the clustering feature tree. The clustering can be carried out continuously as new data points are received and processed, with the clustering feature tree being restructured as necessary to accommodate the information from the newly received data points.
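The clustering feature triple described above (point count, linear sum, square sum) is easy to sketch. The following Python snippet is a minimal illustration, not the patented system's implementation: it shows how two clustering features merge by simple addition and how the cluster centroid and radius follow from the triple.

import numpy as np

class ClusteringFeature:
    """Minimal sketch of a clustering feature: (N, LS, SS)."""

    def __init__(self, point):
        point = np.asarray(point, dtype=float)
        self.n = 1                      # number of data points in the cluster
        self.ls = point.copy()          # linear sum of the data points
        self.ss = float(point @ point)  # square sum of the data points

    def merge(self, other):
        """Merging two clusters only requires adding their triples."""
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """Average distance of points from the centroid, derived from (N, LS, SS)."""
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - c @ c, 0.0))

Because a dense region is summarized by one such triple, inserting a new point or merging two tree nodes never requires revisiting the raw data, which is what keeps memory usage bounded.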
Genome inventory and analysis of nuclear hormone receptors in Tetraodon nigroviridis
Nuclear hormone receptors (NRs) form a large superfamily of ligand-activated transcription factors, which regulate genes underlying a wide range of (patho)physiological phenomena. Availability of the full genome sequence of Tetraodon nigroviridis facilitated a genome-wide analysis of the NRs in a fish genome. Seventy-one NRs were found in Tetraodon and were compared with mammalian and fish NR family members. In general, there is a higher representation of NRs in fish genomes compared to mammalian ones. They showed high diversity across classes, as observed by phylogenetic analysis. Nucleotide substitution rates show strong negative selection among fish NRs except for pregnane X receptor (PXR), estrogen receptor (ER), and liver X receptor (LXR). This may be attributed to the crucial role these receptors play in the metabolism and detoxification of xenobiotic and endobiotic compounds, which might have resulted in slight positive selection. Chromosomal mapping and pairwise comparisons of NR distribution in Tetraodon and humans led to the identification of nine syntenic NR regions, of which three are common among fully sequenced vertebrate genomes. Gene structure analysis shows strong conservation of exon structures among orthologues, whereas paralogous members show different splicing patterns, with intron gain or loss and the addition or substitution of exons playing a major role in the evolution of the NR superfamily.
SRQL: Sorted Relational Query Language
A relation is an unordered collection of records. Often, however, there is an underlying order (e.g., a sequence of stock prices), and users want to pose queries that reflect this order (e.g., find a weekly moving average). SQL provides no support for posing such queries. In this paper, we show how a rich class of queries reflecting sort order can be naturally expressed and efficiently executed with simple extensions to SQL.
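The abstract does not show SRQL syntax, but the weekly moving-average example it mentions can be sketched in ordinary Python to make the order-aware query concrete; the table and column names below are hypothetical.

import pandas as pd

# Hypothetical daily closing prices keyed by trading date.
prices = pd.DataFrame(
    {
        "day": pd.date_range("2024-01-01", periods=10, freq="D"),
        "close": [10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.2, 11.5, 11.3, 11.8],
    }
)

# The underlying order matters: sort by date, then take a 7-day rolling mean.
weekly_avg = (
    prices.sort_values("day")
          .assign(weekly_avg=lambda df: df["close"].rolling(window=7).mean())
)
print(weekly_avg)

The point of a sorted relational language is to let exactly this kind of order-dependent computation be stated declaratively over a relation, rather than procedurally as above.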
XTable in Action: Seamless Interoperability in Data Lakes
Contemporary approaches to data management are increasingly relying on unified analytics and AI platforms to foster collaboration, interoperability, seamless access to reliable data, and high performance. Data Lakes featuring open standard table formats such as Delta Lake, Apache Hudi, and Apache Iceberg are central components of these data architectures. Choosing the right format for managing a table is crucial for achieving the objectives mentioned above. The challenge lies in selecting the best format, a task that is onerous and can yield temporary results, as the ideal choice may shift over time with data growth, evolving workloads, and the competitive development of table formats and processing engines. Moreover, restricting data access to a single format can hinder data sharing, resulting in diminished business value over the long term. The ability to interoperate seamlessly between formats, with negligible overhead, can effectively address these challenges. Our solution in this direction is an innovative omni-directional translator, XTable, that facilitates writing data in one format and reading it in any format, thus achieving the desired format interoperability. In this work, we demonstrate the effectiveness of XTable through application scenarios inspired by real-world use cases.
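The core idea of an omni-directional translator can be sketched as follows. This is a conceptual illustration only, not XTable's actual API: it shows table metadata being read from a source format and re-expressed for each target format while the underlying data files stay in place.

from dataclasses import dataclass

@dataclass
class TableMetadata:
    """Hypothetical format-neutral view of a table: schema, partitioning, data files."""
    schema: dict
    partition_cols: list
    data_files: list

# Toy per-format adapters; real formats keep this state in logs, timelines, or manifests.
def read_source_metadata(fmt: str, table_path: str) -> TableMetadata:
    print(f"reading {fmt} metadata from {table_path}")
    return TableMetadata(
        schema={"id": "long", "ts": "timestamp"},
        partition_cols=["ts"],
        data_files=[f"{table_path}/part-000.parquet"],
    )

def write_target_metadata(fmt: str, table_path: str, meta: TableMetadata) -> None:
    print(f"writing {fmt} metadata for {len(meta.data_files)} shared data file(s)")

def translate(table_path: str, source: str, targets: list) -> None:
    """Read the source format's metadata once and emit equivalent metadata
    for every requested target format; data files are never copied."""
    neutral = read_source_metadata(source, table_path)
    for fmt in targets:
        write_target_metadata(fmt, table_path, neutral)

translate("s3://bucket/orders", source="delta", targets=["iceberg", "hudi"])

Because only metadata is translated, the overhead of keeping several formats in sync stays small relative to the size of the table itself.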
Iterative MapReduce for Large Scale Machine Learning
Large datasets ("Big Data") are becoming ubiquitous because the potential
value in deriving insights from data, across a wide range of business and
scientific applications, is increasingly recognized. In particular, machine
learning - one of the foundational disciplines for data analysis, summarization
and inference - on Big Data has become routine at most organizations that
operate large clouds, usually based on systems such as Hadoop that support the
MapReduce programming paradigm. It is now widely recognized that while
MapReduce is highly scalable, it suffers from a critical weakness for machine
learning: it does not support iteration. Consequently, one has to program
around this limitation, leading to fragile, inefficient code. Further, reliance
on the programmer is inherently flawed in a multi-tenanted cloud environment,
since the programmer does not have visibility into the state of the system when
his or her program executes. Prior work has sought to address this problem by
either developing specialized systems aimed at stylized applications, or by
augmenting MapReduce with ad hoc support for saving state across iterations
(driven by an external loop). In this paper, we advocate support for looping as
a first-class construct, and propose an extension of the MapReduce programming
paradigm called {\em Iterative MapReduce}. We then develop an optimizer for a
class of Iterative MapReduce programs that cover most machine learning
techniques, provide theoretical justifications for the key optimization steps,
and empirically demonstrate that system-optimized programs for significant
machine learning tasks are competitive with state-of-the-art specialized
solutions
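The pattern the paper targets can be sketched in a few lines of Python: a machine learning computation (here, a toy k-means step) expressed as map and reduce functions driven by an external loop. The function names and looping structure below are illustrative, not the paper's Iterative MapReduce constructs.

import numpy as np

def map_assign(point, centroids):
    """Map: emit (nearest-centroid-id, (point, 1)) for one data point."""
    cid = int(np.argmin(np.linalg.norm(centroids - point, axis=1)))
    return cid, (point, 1)

def reduce_update(values):
    """Reduce: combine partial sums for one centroid into its new position."""
    total = np.sum([p for p, _ in values], axis=0)
    count = sum(c for _, c in values)
    return total / count

def kmeans(points, centroids, max_iters=20, tol=1e-6):
    """Driver ("external") loop around the map and reduce phases.
    Each iteration is one MapReduce job; the centroids are state that must be
    carried across iterations, which is exactly what plain MapReduce lacks."""
    for _ in range(max_iters):
        grouped = {}
        for p in points:                       # map phase
            cid, val = map_assign(p, centroids)
            grouped.setdefault(cid, []).append(val)
        new_centroids = centroids.copy()
        for cid, vals in grouped.items():      # reduce phase
            new_centroids[cid] = reduce_update(vals)
        if np.linalg.norm(new_centroids - centroids) < tol:   # convergence test
            break
        centroids = new_centroids
    return centroids

points = np.random.default_rng(0).normal(size=(200, 2))
print(kmeans(points, centroids=points[:3].copy()))

Making the loop a first-class construct means the system, rather than the programmer, decides how to cache loop-invariant data, schedule iterations, and test for convergence.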
The EDAM Project: Mining Atmospheric Aerosol Datasets
Data mining has been a very active area of research in the database, machine learning, and mathematical programming communities in recent years. EDAM (Exploratory Data Analysis and Management) is a joint project between researchers in Atmospheric Chemistry and Computer Science at Carleton College and the University of Wisconsin-Madison that aims to develop data mining techniques for advancing the state of the art in analyzing atmospheric aerosol datasets. There is a great need to better understand the sources, dynamics, and compositions of atmospheric aerosols. The traditional approach for particle measurement, which is the collection of bulk samples of particulates on filters, is not adequate for studying particle dynamics and real-time correlations. This has led to the development of a new generation of real-time instruments that provide continuous or semi-continuous streams of data about certain aerosol properties. However, these instruments have added a significant level of complexity to atmospheric aerosol data, and dramatically increased the amounts of data to be collected, managed, and analyzed. Our ability to integrate the data from all of these new and complex instruments now lags far behind our data-collection capabilities, and severely limits our ability to understand the data and act upon it in a timely manner. In this paper, we present an overview of the EDAM project. The goal of the project, which is in its early stages, is to develop novel data mining algorithms and approaches to managing and monitoring multiple complex data streams. An important objective is data quality assurance, and real-time data mining offers great potential. The approach that we take should also provide good techniques to deal with gas-phase and semi-volatile data. While atmospheric aerosol analysis is an important and challenging domain that motivates us with real problems and serves as a concrete test of our results, our objective is to develop techniques that have broader applicability, and to explore some fundamental challenges in data mining that are not specific to any given application domain.
LST-Bench: Benchmarking Log-Structured Tables in the Cloud
Log-Structured Tables (LSTs), also commonly referred to as table formats, have recently emerged to bring consistency and isolation to object stores. With the separation of compute and storage, object stores have become the go-to for highly scalable and durable storage. However, this comes with its own set of challenges, such as the lack of recovery and concurrency management that traditional database management systems provide. This is where LSTs such as Delta Lake, Apache Iceberg, and Apache Hudi come into play, providing an automatic metadata layer that manages tables defined over object stores, effectively addressing these challenges. This paradigm shift in the design of these systems necessitates updating evaluation methodologies. In this paper, we examine the characteristics of LSTs and propose extensions to existing benchmarks, including workload patterns and metrics, to accurately capture their performance. We introduce our framework, LST-Bench, which enables users to execute benchmarks tailored for the evaluation of LSTs. Our evaluation demonstrates how these benchmarks can be utilized to evaluate the performance, efficiency, and stability of LSTs. The code for LST-Bench is open sourced and is available at https://github.com/microsoft/lst-bench/.
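The abstract does not spell out the added metrics, so as one hypothetical example of a stability-style measurement, the short sketch below computes how the latency of a repeated workload phase degrades across sessions as a table accumulates changes. The data and the metric are illustrative, not LST-Bench's actual output or formula.

import numpy as np

# Hypothetical per-session latencies (seconds) for the same workload phase,
# repeated as the table accumulates updates.
session_latency = np.array([42.0, 44.5, 47.1, 51.0, 55.8, 61.2])

# One plausible stability metric: the average relative slowdown per session,
# i.e. the mean of successive latency ratios minus one.
ratios = session_latency[1:] / session_latency[:-1]
degradation_rate = ratios.mean() - 1.0
print(f"average per-session degradation: {degradation_rate:.1%}")

A metric of this kind captures behavior that a single-run benchmark cannot, which is why workload patterns with repeated, interleaved phases matter for evaluating LSTs.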
- …