Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy
Data collection for scientific applications is increasing exponentially and
is forecasted to soon reach peta- and exabyte scales. Applications which
process and analyze scientific data must be scalable and focus on execution
performance to keep pace. In the field of radio astronomy, in addition to
increasingly large datasets, tasks such as the identification of transient
radio signals from extrasolar sources are computationally expensive. We present
a scalable approach to radio pulsar detection written in Scala that
parallelizes candidate identification to take advantage of in-memory task
processing using Apache Spark on a YARN distributed system. Furthermore, we
introduce a novel automated multiclass supervised machine learning technique
that we combine with feature selection to reduce the time required for
candidate classification. Experimental testing on a Beowulf cluster with 15
data nodes shows that the parallel implementation of the identification
algorithm offers a speedup of up to 5X over a similar multithreaded
implementation. Further, we show that the combination of automated multiclass
classification and feature selection speeds up the execution performance of the
RandomForest machine learning algorithm by an average of 54% with less than a
2% average reduction in the algorithm's ability to correctly classify pulsars.
The generalizability of these results is demonstrated by using two real-world
radio astronomy data sets.
Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 pages.
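The pipeline the abstract describes (distributed candidate data, feature selection to shrink the inputs, then multiclass RandomForest classification) maps naturally onto Spark's ML pipeline API. The authors' implementation is in Scala; the sketch below is a minimal PySpark illustration of the same pattern, not their code, and the input path, column names, and parameter values are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, UnivariateFeatureSelector
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("pulsar-candidates").getOrCreate()

# One row per candidate: numeric features plus a multiclass "label" column.
df = spark.read.parquet("hdfs:///candidates.parquet")   # hypothetical path

assembler = VectorAssembler(
    inputCols=[c for c in df.columns if c != "label"],
    outputCol="raw_features")

# Feature selection shrinks the vectors the forest must split on, trading
# a little classification accuracy for faster training.
selector = (UnivariateFeatureSelector(
                featuresCol="raw_features", outputCol="features",
                labelCol="label", selectionMode="numTopFeatures")
            .setFeatureType("continuous")
            .setLabelType("categorical")
            .setSelectionThreshold(8))       # hypothetical feature budget

rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=100)    # hypothetical forest size

model = Pipeline(stages=[assembler, selector, rf]).fit(df)
```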
Online Matrix Completion Through Nuclear Norm Regularisation
The main goal of this paper is to propose a novel method for performing
matrix completion online. Motivated by a wide variety of applications, ranging from
the design of recommender systems to sensor network localization through
seismic data reconstruction, we consider the matrix completion problem when
entries of the matrix of interest are observed gradually. Precisely, we place
ourselves in the situation where the predictive rule should be refined
incrementally, rather than recomputed from scratch each time the sample of
observed entries increases. The extension of existing matrix completion methods
to the sequential prediction context is indeed a major issue in the Big Data
era, and yet little addressed in the literature. The algorithm promoted in this
article builds upon the Soft Impute approach introduced in Mazumder et al.
(2010). The major novelty essentially arises from the use of a randomised
technique for both computing and updating the Singular Value Decomposition
(SVD) involved in the algorithm. Though disarmingly simple, the proposed
method turns out to be very efficient while requiring reduced computation.
Several numerical experiments based on real datasets illustrate its
performance, together with preliminary results giving the method a
theoretical basis.
Comment: Corrected a typo in the affiliation.
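The core move, swapping the exact SVD inside Soft-Impute for a randomized one, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions (a fixed target rank, plain Gaussian sketching, and no warm-starting across updates, which the paper's sequential setting would exploit); it is not the authors' algorithm.

```python
import numpy as np

def randomized_svd(A, rank, oversample=10):
    # Halko et al.-style sketching: project onto a random subspace, then
    # take the exact SVD of the much smaller projected matrix.
    G = np.random.randn(A.shape[1], rank + oversample)
    Q, _ = np.linalg.qr(A @ G)
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

def soft_impute(X, observed, lam, rank, n_iters=100):
    # observed[i, j] is True where X[i, j] is known; Soft-Impute alternates
    # between filling the missing entries with the current estimate and
    # soft-thresholding the singular values of the filled matrix.
    Z = np.zeros_like(X, dtype=float)
    for _ in range(n_iters):
        filled = np.where(observed, X, Z)
        U, s, Vt = randomized_svd(filled, rank)
        Z = (U * np.maximum(s - lam, 0.0)) @ Vt   # scale columns of U
    return Z
```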
A Brief Tour through Provenance in Scientific Workflows and Databases
Within computer science, the term provenance has multiple meanings, due to different motivations, perspectives, and assumptions prevalent in the respective communities. This chapter provides a high-level “sightseeing tour” of some of those different notions and uses of provenance in scientific workflows and databases.
On the Reusability of Data Cleaning Workflows
The goal of data cleaning is to make data fit for purpose, i.e., to improve data quality, through updates and data transformations, such that downstream analyses can be conducted and lead to trustworthy results. A transparent and reusable data cleaning workflow can save time and effort through automation, and make subsequent data cleaning on new data less error-prone. However, reusability of data cleaning workflows has received little to no attention in the research community. We identify some challenges and opportunities for reusing data cleaning workflows. We present a high-level conceptual model to clarify what we mean by reusability and propose ways to improve reusability along different dimensions. We use the opportunity of presenting at IDCC to invite the community to share their use cases, experiences, and desiderata for the reuse of data cleaning workflows and recipes in order to foster new collaborations and guide future work.
Games and Argumentation: Time for a Family Reunion!
The rule "defeated(X) attacks(Y,X), defeated(Y)" states
that an argument is defeated if it is attacked by an argument that is not
defeated. The rule "win(X) move(X,Y), win(Y)" states that
in a game a position is won if there is a move to a position that is not won.
Both logic rules can be seen as close relatives (even identical twins) and both
rules have been at the center of attention at various times in different
communities: The first rule lies at the core of argumentation frameworks and
has spawned a large family of models and semantics of abstract argumentation.
The second rule has played a key role in the quest to find the "right"
semantics for logic programs with recursion through negation, and has given
rise to the stable and well-founded semantics. Both semantics have been widely
studied by the logic programming and nonmonotonic reasoning community. The
second rule has also received much attention by the database and finite model
theory community, e.g., when studying the expressive power of query languages
and fixpoint logics. Although close connections between argumentation
frameworks, logic programming, and dialogue games have been known for a long
time, the overlap and cross-fertilization between the communities appears to be
smaller than one might expect. To this end, we recall some of the key results
from database theory in which the win-move query has played a central role,
e.g., on normal forms and expressive power of query languages. We introduce
some notions that naturally emerge from games and that may provide new
perspectives and research opportunities for argumentation frameworks. We
discuss how solved query evaluation games reveal how- and why-not provenance of
query answers. These techniques can be used to explain how results were derived
via the given query, game, or argumentation framework.
Comment: Fourth Workshop on Explainable Logic-Based Knowledge Representation (XLoKR), Sept 2, 2023, Rhodes, Greece.
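Since both rules instantiate the same negation-through-recursion pattern, a minimal sketch of the fixpoint computation behind them may help: positions are labelled won or lost until nothing changes, and whatever remains unlabelled, typically positions caught in cycles, is drawn, i.e. undefined in the well-founded model. The function and representation below are illustrative, not taken from the paper.

```python
def solve(moves):
    # moves maps each position to the list of positions reachable in one
    # move; returns the well-founded labelling "won" / "lost" / "drawn".
    label = {}
    changed = True
    while changed:
        changed = False
        for x, ys in moves.items():
            if x in label:
                continue
            if any(label.get(y) == "lost" for y in ys):
                label[x] = "won"    # some move leads to a lost position
                changed = True
            elif all(label.get(y) == "won" for y in ys):
                label[x] = "lost"   # no moves, or every move leads to a won position
                changed = True
    return {x: label.get(x, "drawn") for x in moves}

# Positions a and b move to each other: neither won nor lost, so both are
# "drawn" (undefined in the well-founded model); d has no moves, so it is
# lost, and c wins by moving to d.
print(solve({"a": ["b"], "b": ["a"], "c": ["d"], "d": []}))
# {'a': 'drawn', 'b': 'drawn', 'c': 'won', 'd': 'lost'}
```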
Automatic Module Detection in Data Cleaning Workflows: Enabling Transparency and Recipe Reuse
Before data from multiple sources can be analyzed, data cleaning workflows (“recipes”) usually need to be employed to improve data quality. We identify a number of technical problems that make application of FAIR principles to data cleaning recipes challenging. We then demonstrate how transparency and reusability of recipes can be improved by analyzing dataflow dependencies within recipes. In particular column-level dependencies can be used to automatically detect independent subworkflows, which then can be reused individually as data cleaning modules. We have prototypically implemented this approach as part of an ongoing project to develop open-source companion tools for OpenRefine.
Keywords: Data Cleaning, Provenance, Workflow Analysis
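A minimal sketch of the module-detection idea, assuming each recipe step exposes the sets of columns it reads and writes: steps are linked through shared columns, and the connected components of the resulting graph are the independent sub-workflows that can be reused as modules. This is an illustration, not the OpenRefine companion tools' code.

```python
def detect_modules(steps):
    # steps: list of (reads, writes) pairs, each a set of column names.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for i, (reads, writes) in enumerate(steps):
        for col in reads | writes:
            union(("step", i), ("col", col))  # a step touches its columns
    modules = {}
    for i in range(len(steps)):
        modules.setdefault(find(("step", i)), []).append(i)
    return list(modules.values())

# Steps 1 and 2 are linked through the "state" column; step 0 stands alone.
print(detect_modules([({"name"}, {"name"}),
                      ({"zip"}, {"zip", "state"}),
                      ({"state"}, {"region"})]))   # -> [[0], [1, 2]]
```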
EquiX---A Search and Query Language for XML
EquiX is a search language for XML that combines the power of querying with
the simplicity of searching. Requirements for such languages are discussed and
it is shown that EquiX meets the necessary criteria. Both a graphical abstract
syntax and a formal concrete syntax are presented for EquiX queries. In
addition, the semantics is defined and an evaluation algorithm is presented.
The evaluation algorithm is polynomial under combined complexity.
EquiX combines pattern matching, quantification and logical expressions to
query both the data and meta-data of XML documents. The result of a query in
EquiX is a set of XML documents. A DTD describing the result documents is
derived automatically from the query.
Comment: Technical report of the Hebrew University of Jerusalem, Israel.
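The abstract does not reproduce EquiX's concrete syntax, so the snippet below only gestures at the flavour of such queries using the limited XPath support in Python's standard library: each bracketed predicate acts as an existential condition on an element's children, combining structural pattern matching with a logical test. The sample document is invented.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<library>
  <book year="1999"><title>Foundations</title><author>Lee</author></book>
  <book year="2005"><title>Foundations</title></book>
</library>""")

# "Books that have an author child and whose title equals 'Foundations'":
# each [...] predicate is an existential condition over child elements.
for book in doc.findall("./book[author][title='Foundations']"):
    print(book.get("year"))   # -> 1999
```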
Plasma Edge Kinetic-MHD Modeling in Tokamaks Using Kepler Workflow for Code Coupling, Data Management and Visualization
A new predictive computer simulation tool targeting the development of the H-mode pedestal at the plasma edge in tokamaks and the triggering and dynamics of edge localized modes (ELMs) is presented in this report. This tool brings together, in a coordinated and effective manner, several first-principles physics simulation codes, stability analysis packages, and data processing and visualization tools. A Kepler workflow is used to carry out an edge plasma simulation that loosely couples the kinetic code, XGC0, with an ideal MHD linear stability analysis code, ELITE, and an extended MHD initial value code such as M3D or NIMROD. XGC0 includes the neoclassical ion-electron-neutral dynamics needed to simulate pedestal growth near the separatrix. The Kepler workflow processes the XGC0 simulation results into simple images that can be selected and displayed via the Dashboard, a monitoring tool implemented in AJAX that allows the scientist to track computational resources, examine running and archived jobs, and view key physics data, all within a standard Web browser. The XGC0 simulation is monitored for the conditions needed to trigger an ELM crash by periodically assessing the edge plasma pressure and current density profiles using the ELITE code. If an ELM crash is triggered, the Kepler workflow launches the M3D code on a moderate-size Opteron cluster to simulate the nonlinear ELM crash and to compute the relaxation of plasma profiles after the crash. This process is monitored through periodic outputs of plasma fluid quantities that are automatically visualized with AVS/Express and may be displayed on the Dashboard. Finally, the Kepler workflow archives all data outputs and processed images using HPSS, as well as provenance information about the software and hardware used to create the simulation. The complete process of preparing, executing and monitoring a coupled-code simulation of the edge pressure pedestal buildup and the ELM cycle using the Kepler scientific workflow system is described in this paper.
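Stripped of the actual physics codes, the control flow the workflow automates is a monitoring loop. In the hypothetical sketch below every function body is a toy placeholder; only the loop structure (advance XGC0, periodically check edge stability with ELITE, launch M3D when a crash is triggered, archive everything) mirrors the description above.

```python
import random

# Every function below is a toy placeholder for the corresponding Kepler
# actor; only the control flow mirrors the workflow described above.
def run_xgc0(step):
    return {"step": step, "edge_pressure": random.random()}

def elite_growth_rate(profiles):
    return profiles["edge_pressure"]           # stand-in stability metric

def launch_m3d(profiles):
    return {**profiles, "edge_pressure": 0.1}  # crash relaxes the profile

def archive(step, profiles):
    print("archived step", step, profiles)     # stand-in for HPSS archiving

def elm_cycle(max_steps=10, check_every=2, threshold=0.9):
    for step in range(max_steps):
        profiles = run_xgc0(step)              # advance the kinetic edge code
        if step % check_every == 0 and \
           elite_growth_rate(profiles) > threshold:
            profiles = launch_m3d(profiles)    # simulate the nonlinear ELM crash
        archive(step, profiles)

elm_cycle()
```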
Exploring Geopolitical Realities through Taxonomies: The Case of Taiwan
In the face of heterogeneous standards and large-scale datasets, it has become increasingly difficult to understand the underlying knowledge structures within complex information systems. These structures may encode latent assumptions that could be susceptible to issues such as ghettoization, bias, erasure, or omission. Inspired by a series of current events in the China-Taiwan conflict on the sovereignty of Taiwan, our research aims to develop methods that can elucidate multiple, often conflicting perspectives and hidden assumptions. We propose the use of a logic-based taxonomy alignment approach to first align and then reconcile distinct but overlapping taxonomies. We specifically examine three relevant taxonomies that list the world's entities: (1) ISO 3166 for country codes and subdivisions; (2) the geographic regions of the US Department of Homeland Security; (3) the Central Intelligence Agency's World Factbook. Our results highlight multiple alternate views (or Possible Worlds) for situating Taiwan relative to other neighboring entities. We hope that this work can be a first step to demonstrate how different geopolitical perspectives can be represented using multiple, interrelated taxonomies.