In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes
The Data Science domain has expanded monumentally in both research and
industry communities during the past decade, predominantly owing to the Big
Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are
bringing more complexities to data engineering applications, which are now
integrated into data processing pipelines to process terabytes of data.
Typically, a significant amount of time is spent on data preprocessing in these
pipelines, and hence improving its efficiency directly impacts the overall
pipeline performance. The community has recently embraced the concept of
Dataframes as the de-facto data structure for data representation and
manipulation. However, the most widely used serial Dataframes today (R, pandas)
experience performance limitations while working on even moderately large data
sets. We believe that there is plenty of room for improvement by taking a look
at this problem from a high-performance computing point of view. In a prior
publication, we presented a set of parallel processing patterns for distributed
dataframe operators and the reference runtime implementation, Cylon [1]. In
this paper, we are expanding on the initial concept by introducing a cost model
for evaluating the said patterns. Furthermore, we evaluate the performance of
Cylon on the ORNL Summit supercomputer.
DALiuGE: A Graph Execution Framework for Harnessing the Astronomical Data Deluge
The Data Activated Liu Graph Engine - DALiuGE - is an execution framework for
processing large astronomical datasets at a scale required by the Square
Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex
data reduction pipelines consisting of both data sets and algorithmic
components and an implementation run-time to execute such pipelines on
distributed resources. By mapping the logical view of a pipeline to its
physical realisation, DALiuGE separates the concerns of multiple stakeholders,
allowing them to collectively optimise large-scale data processing solutions in
a coherent manner. The execution in DALiuGE is data-activated, where each
individual data item autonomously triggers the processing on itself. Such
decentralisation also makes the execution framework very scalable and flexible,
supporting pipeline sizes ranging from less than ten tasks running on a laptop
to tens of millions of concurrent tasks on the second fastest supercomputer in
the world. DALiuGE has been used in production for reducing interferometry data
sets from the Karl E. Jansky Very Large Array and the Mingantu Ultrawide
Spectral Radioheliograph; and is being developed as the execution framework
prototype for the Science Data Processor (SDP) consortium of the Square
Kilometre Array (SKA) telescope. This paper presents a technical overview of
DALiuGE and discusses case studies from the CHILES and MUSER projects that use
DALiuGE to execute production pipelines. In a companion paper, we provide
in-depth analysis of DALiuGE's scalability to very large numbers of tasks on
two supercomputing facilities.
Comment: 31 pages, 12 figures, currently under review by Astronomy and
Computing
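The data-activated execution described in the abstract above, where a data item triggers the processing that depends on it, can be illustrated with a minimal sketch. The `DataItem` and `Task` classes and all names below are hypothetical and are not DALiuGE's API; the sketch only shows the general idea of a task firing once all of its inputs have been written.

```python
class DataItem:
    """A piece of data that notifies its consumers when it is written."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.consumers = []  # tasks waiting on this item

    def write(self, value):
        self.value = value
        # The data item itself triggers downstream processing.
        for task in self.consumers:
            task.input_ready()


class Task:
    """Runs func once every input item has been written."""

    def __init__(self, func, inputs, output):
        self.func = func
        self.inputs = inputs
        self.output = output
        self.pending = len(inputs)
        for item in inputs:
            item.consumers.append(self)

    def input_ready(self):
        self.pending -= 1
        if self.pending == 0:
            result = self.func(*[item.value for item in self.inputs])
            self.output.write(result)


# Hypothetical two-stage pipeline: sum two inputs, then double the sum.
raw_a, raw_b = DataItem("a"), DataItem("b")
summed, doubled = DataItem("sum"), DataItem("doubled")
Task(lambda x, y: x + y, [raw_a, raw_b], summed)
Task(lambda s: 2 * s, [summed], doubled)

# No central scheduler: writing the inputs drives the whole pipeline.
raw_a.write(3)
raw_b.write(4)
```

In this sketch there is no central scheduler loop; each write propagates to whatever depends on it, which is the decentralisation the abstract credits for scalability.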
Earth and environmental science in the 1980's: Part 1: Environmental data systems, supercomputer facilities and networks
Overview descriptions of on-line environmental data systems, supercomputer facilities, and networks are presented. Each description addresses the concepts of content, capability, and user access relevant to the point of view of potential utilization by the Earth and environmental science community. The information on similar systems or facilities is presented in parallel fashion to encourage and facilitate intercomparison. In addition, summary sheets are given for each description, and a summary table precedes each section.
Parallel computing in information retrieval - An updated review
The progress of parallel computing in Information Retrieval (IR) is reviewed. In particular, we stress the importance of the motivation for using parallel computing in Text Retrieval. We analyse parallel IR systems using a classification due to Rasmussen [1] and describe some parallel IR systems. We give a description of the retrieval models used in parallel Information Processing. We describe areas of research which we believe are needed.
Neural RELAGGS
Multi-relational databases are the basis of most consolidated data
collections in science and industry today. Most learning and mining algorithms,
however, require data to be represented in a propositional form. While there is
a variety of specialized machine learning algorithms that can operate directly
on multi-relational data sets, propositionalization algorithms transform
multi-relational databases into propositional data sets, thereby allowing the
application of traditional machine learning and data mining algorithms without
their modification. One prominent propositionalization algorithm is RELAGGS by
Krogel and Wrobel, which transforms the data by nested aggregations. We propose
a new neural network based algorithm in the spirit of RELAGGS that employs
trainable composite aggregate functions instead of the static aggregate
functions used in the original approach. In this way, we can jointly train the
propositionalization with the prediction model, or, alternatively, use the
learned aggregations as embeddings in other algorithms. We demonstrate the
increased predictive performance by comparing N-RELAGGS with RELAGGS and
multiple other state-of-the-art algorithms.
Comment: Submitted to Machine Learning Journal
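For readers unfamiliar with propositionalization, the static aggregation approach that the abstract above contrasts against can be sketched in plain Python. The tables, column names, and the `propositionalize` helper are hypothetical, and the aggregate set (count, sum, mean, min, max) is an assumption chosen for illustration, not the RELAGGS implementation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical one-to-many data: each customer has several orders.
customers = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]


def propositionalize(parents, children, key, fk, column):
    """Flatten a one-to-many relation into one feature row per parent."""
    groups = defaultdict(list)
    for child in children:
        groups[child[fk]].append(child[column])

    rows = []
    for parent in parents:
        values = groups.get(parent[key], [])
        row = dict(parent)
        # Fixed (static) aggregate functions over the child values.
        row[f"{column}_count"] = len(values)
        row[f"{column}_sum"] = sum(values)
        row[f"{column}_mean"] = mean(values) if values else 0.0
        row[f"{column}_min"] = min(values) if values else 0.0
        row[f"{column}_max"] = max(values) if values else 0.0
        rows.append(row)
    return rows


table = propositionalize(customers, orders, "id", "customer_id", "amount")
```

Each row of `table` is now a flat feature vector that a conventional learner can consume. The abstract's proposal is to replace fixed aggregates like these with trainable composite ones.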
Corporate influence and the academic computer science discipline. [4: CMU]
Prosopographical work on the four major centers for computer
research in the United States has now been conducted, resulting in big
questions about the independence of so-called computer science.
Text-to-SQL Error Correction with Language Models of Code
Despite recent progress in text-to-SQL parsing, current semantic parsers are
still not accurate enough for practical use. In this paper, we investigate how
to build automatic text-to-SQL error correction models. Noticing that
token-level edits are out of context and sometimes ambiguous, we propose
building clause-level edit models instead. Besides, while most language models
of code are not specifically pre-trained for SQL, they know common data
structures and their operations in programming languages such as Python. Thus,
we propose a novel representation for SQL queries and their edits that adheres
more closely to the pre-training corpora of language models of code. Our error
correction model improves the exact set match accuracy of different parsers by
2.4-6.5 and obtains up to 4.3 point absolute improvement over two strong
baselines. Our code and data are available at
https://github.com/OSU-NLP-Group/Auto-SQL-Correction.
Comment: ACL 2023 Short Paper
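As a rough illustration of clause-level edits expressed with ordinary Python data structures, consider the sketch below. The dictionary layout, the edit tuples, and the helpers `apply_edits` and `to_sql` are all hypothetical; this is not the representation used in the paper, only an example of editing whole clauses rather than individual tokens.

```python
# Hypothetical clause-keyed representation of a parsed SQL query.
query = {
    "select": "name",
    "from": "singer",
    "where": "age > 30",
}

# Each edit targets a whole clause: (operation, clause, new value).
edits = [
    ("replace", "where", "age >= 30"),
    ("insert", "order_by", "age DESC"),
]

# Order in which clauses are emitted when rendering SQL text.
CLAUSE_ORDER = ["select", "from", "where", "group_by", "having", "order_by", "limit"]


def apply_edits(query, edits):
    """Return a copy of the query with clause-level edits applied."""
    fixed = dict(query)
    for op, clause, value in edits:
        if op == "delete":
            fixed.pop(clause, None)
        else:  # "insert" and "replace" both set the clause
            fixed[clause] = value
    return fixed


def to_sql(query):
    """Render the clause dictionary back into a SQL string."""
    parts = []
    for clause in CLAUSE_ORDER:
        if clause in query:
            keyword = clause.replace("_", " ").upper()
            parts.append(f"{keyword} {query[clause]}")
    return " ".join(parts)


corrected = to_sql(apply_edits(query, edits))
```

Because every edit names the clause it touches, each edit carries its own context, which is the motivation the abstract gives for preferring clause-level over token-level edits.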