Scalable Exact Parent Sets Identification in Bayesian Networks Learning with Apache Spark
In machine learning, the parent set identification problem is to find a set
of random variables that best explains a selected variable given the data and
some predefined scoring function. This problem is a critical component of
structure learning of Bayesian networks and of Markov blanket discovery, and
thus has many practical applications, ranging from fraud detection to clinical
decision support. In this paper, we introduce a new distributed-memory approach
to the exact parent set assignment problem. To achieve scalability, we derive
theoretical bounds to constrain the search space when the MDL scoring function
is used, and we reorganize the underlying dynamic programming such that
computational density is increased and fine-grain synchronization is
eliminated. We then design an efficient realization of our approach on the
Apache Spark platform. Through experimental results, we demonstrate that the
method maintains strong scalability on a 500-core standalone Spark cluster,
and that it can be used to efficiently process data sets with 70 variables,
far beyond the reach of currently available solutions.
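The core of exact parent set identification can be illustrated in a few lines: score every candidate parent set with MDL and keep the lowest-scoring one. The following is a sequential toy sketch, not the paper's bound-pruned, distributed Spark realization; the scoring details and the toy data layout are assumptions made for illustration.

```python
# Toy sketch of exact parent set selection with an MDL-style score.
# Data is a list of tuples of discrete values; variables are column indices.
from itertools import combinations
from collections import Counter
from math import log

def mdl_score(data, target, parents):
    """MDL score of `parents` for `target`: negative log-likelihood
    plus a (log N / 2) penalty per free parameter (lower is better)."""
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[target]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    ll = sum(c * log(c / marg[key]) for (key, _), c in joint.items())
    states_t = len({row[target] for row in data})
    params = (states_t - 1) * max(len(marg), 1)
    return -ll + 0.5 * log(n) * params

def best_parent_set(data, target, candidates, max_size=2):
    """Exhaustively score every candidate parent set up to `max_size`."""
    return min(
        (ps for k in range(max_size + 1) for ps in combinations(candidates, k)),
        key=lambda ps: mdl_score(data, target, ps),
    )
```

The exhaustive enumeration above is exactly what makes the problem expensive; the paper's contribution is making this search scale via theoretical pruning bounds and a restructured dynamic program.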
Applications and Challenges of Real-time Mobile DNA Analysis
DNA sequencing is the process of identifying the exact order of
nucleotides within a given DNA molecule. New portable and relatively
inexpensive DNA sequencers, such as the Oxford Nanopore MinION, have the
potential to move DNA sequencing outside of the laboratory, leading to faster
and more accessible DNA-based diagnostics. However, portable DNA sequencing
and analysis are challenging for mobile systems, owing to high data throughputs
and computationally intensive processing performed in environments with
unreliable connectivity and power.
In this paper, we provide an analysis of the challenges that mobile systems
and mobile computing must address to maximize the potential of portable DNA
sequencing and in situ DNA analysis. We explain the DNA sequencing process and
highlight the main differences between traditional and portable DNA sequencing
in the context of current and envisioned applications. We look at the
identified challenges from the perspective of both algorithm and system
design, showing the need for careful co-design.
Error Metrics for Learning Reliable Manifolds from Streaming Data
Spectral dimensionality reduction is frequently used to identify
low-dimensional structure in high-dimensional data. However, learning
manifolds, especially from streaming data, is computationally and memory
expensive. In this paper, we argue that a stable manifold can be learned using
only a fraction of the stream, and that the remaining stream can be mapped onto
the manifold at significantly lower cost. Identifying the transition point at
which the manifold becomes stable is the key step. We present error metrics
that allow us to identify the transition point for a given stream by
quantitatively assessing the quality of a manifold learned using Isomap. We
further propose an efficient mapping algorithm, called S-Isomap, that can be
used to map new samples onto the stable manifold. We describe experiments on a
variety of data sets that show that the proposed approach is computationally
efficient without sacrificing accuracy.
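The cheap out-of-sample step behind this idea can be illustrated with a simplified sketch: once a stable embedding of the early stream exists, a new sample is placed by interpolating the embeddings of its nearest training points. This is a nearest-neighbor interpolation stand-in, not the actual S-Isomap algorithm; all names are illustrative.

```python
import numpy as np

def map_to_manifold(x_new, X_train, Y_train, k=3):
    """Place a new sample on a learned manifold as the inverse-distance
    weighted average of its k nearest training points' embeddings.
    X_train: (n, d) inputs; Y_train: (n, m) their low-dim embeddings."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    idx = np.argsort(d)[:k]                # k nearest neighbors
    w = 1.0 / (d[idx] + 1e-12)             # inverse-distance weights
    w /= w.sum()
    return (w[:, None] * Y_train[idx]).sum(axis=0)
```

The point of the sketch is the cost asymmetry: this step is a single k-nearest-neighbor query per sample, far cheaper than re-running a full spectral decomposition as the stream grows.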
Predicting the Impact of Batch Refactoring Code Smells on Application Resource Consumption
Automated batch refactoring has become a de facto mechanism to restructure
software that may have significant design flaws negatively impacting code
quality and maintainability. Although automated batch refactoring techniques
are known to significantly improve overall software quality and
maintainability, their impact on resource utilization is not well studied. This
paper aims to bridge the gap between batch refactoring of code smells and
resource consumption. First, it determines the relationship between software
code smell batch refactoring and resource consumption. Next, it designs
algorithms to predict the impact of code smell refactoring on resource
consumption. This paper investigates 16 code smell types and their joint effect
on resource utilization for 31 open-source applications. It provides a detailed
empirical analysis of the change in application CPU and memory utilization
after refactoring specific code smells in isolation and in batches. This
analysis is then used to train regression algorithms to predict the impact of
batch refactoring on CPU and memory utilization before making any refactoring
decisions. Experimental results show that our ANN-based regression model
provides highly accurate predictions of the impact of batch refactoring on
resource consumption. It allows software developers to intelligently decide
which code smells they should refactor jointly to achieve high code quality and
maintainability without increasing application resource utilization. This
paper responds to the important and urgent need of software engineers across a
broad range of software applications who are looking to refactor code smells
and, at the same time, reduce resource consumption. Finally, it brings forward
the concept of resource-aware code smell refactoring for the most crucial
software applications.
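A minimal stand-in for such a predictive model can be sketched with ordinary least squares in place of the paper's ANN-based regressor: features count how many of each smell type a batch refactors, and the target is the observed change in resource usage. The feature layout and all names are assumptions for illustration.

```python
import numpy as np

def fit_impact_model(smell_counts, resource_delta):
    """Ordinary least squares: predict resource-usage change from the
    number of each code smell type refactored in a batch.
    smell_counts: (n_batches, n_smell_types); resource_delta: (n_batches,)."""
    X = np.column_stack([smell_counts, np.ones(len(smell_counts))])  # add bias
    coef, *_ = np.linalg.lstsq(X, resource_delta, rcond=None)
    return coef

def predict_impact(coef, smell_counts):
    """Predicted resource change for proposed refactoring batches."""
    return np.column_stack([smell_counts, np.ones(len(smell_counts))]) @ coef
```

A developer would query `predict_impact` before committing to a batch, keeping only combinations whose predicted CPU or memory delta is acceptable; the paper's ANN plays the same role with a more expressive model.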
Parallel Framework for Dimensionality Reduction of Large-Scale Datasets
Dimensionality reduction refers to a set of mathematical techniques used to reduce the complexity of the original high-dimensional data while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying spectral dimensionality reduction techniques and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate the applicability of our framework, we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.
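One key component shared by spectral techniques is the eigendecomposition of a dense (double-centered) distance or kernel matrix. The classical MDS form of this step can be sketched as follows; this is a serial illustration of the mathematical core, not the parallel framework itself.

```python
import numpy as np

def spectral_embed(D, dim=2):
    """Classical MDS: double-center the squared distance matrix to get a
    Gram matrix, then take its top eigenpairs as the embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # Gram matrix
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:dim]     # largest eigenvalues first
    return V[:, order] * np.sqrt(np.maximum(w[order], 0))
```

The dense matrix products and the eigensolve are exactly the stages that dominate at scale, which is why a parallel realization of these components is what enables million-point datasets.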
Microstructure design using graphs
Thin films with tailored microstructures are an emerging class of materials with applications such as battery electrodes, organic electronics, and biosensors. Such thin film devices typically exhibit a multi-phase microstructure that is confined, and show large anisotropy. Current approaches to microstructure design focus on optimizing bulk properties by tuning features that are statistically averaged over a representative volume. Here, we report a tool for morphogenesis posed as a graph-based optimization problem that evolves microstructures recognizing confinement and anisotropy constraints. We illustrate the approach by designing optimized morphologies for photovoltaic applications, and evolve an initial morphology into an optimized morphology exhibiting substantially improved short circuit current (68% improvement over a conventional bulk-heterojunction morphology). We show optimized morphologies across a range of thicknesses exhibiting self-similar behavior. Results suggest that thicker films (250 nm) can be used to harvest more incident energy. Our graph-based morphogenesis is broadly applicable to microstructure-sensitive design of batteries, biosensors, and related applications.
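The graph view makes confinement and connectivity constraints easy to state and check; for example, whether a given phase percolates from the top of the film to the bottom, a property relevant to charge transport in photovoltaic morphologies. The toy grid sketch below is illustrative only, not the paper's morphogenesis optimizer.

```python
from collections import deque

def phase_percolates(grid, phase):
    """Treat grid cells of `phase` as graph nodes with 4-neighbor edges,
    then BFS from the top row to test top-to-bottom connectivity.
    `grid` is a list of rows of phase labels (a toy two-phase film)."""
    rows, cols = len(grid), len(grid[0])
    seen = {(0, c) for c in range(cols) if grid[0][c] == phase}
    queue = deque(seen)
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in seen and grid[nr][nc] == phase):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return any(r == rows - 1 for r, _ in seen)
```

An optimizer in this spirit would propose local phase flips and accept those that improve an objective while keeping such connectivity predicates satisfied.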
A software framework for data dimensionality reduction: application to chemical crystallography
Materials science research has witnessed an increasing use of data mining techniques in establishing process–structure–property relationships. Significant advances in high-throughput experiments and computational capability have resulted in the generation of huge amounts of data. Various statistical methods are currently employed to reduce the noise, redundancy, and dimensionality of the data to make analysis more tractable. Popular methods for reduction (like principal component analysis) assume a linear relationship between the input and output variables. Recent developments in non-linear reduction (neural networks, self-organizing maps), though successful, have computational issues associated with convergence and scalability. Another significant barrier to the use of dimensionality reduction techniques in materials science is the lack of ease of use owing to their complex mathematical formulations. This paper reviews various spectral-based techniques that efficiently unravel linear and non-linear structures in the data, which can subsequently be used to tractably investigate process–structure–property relationships. In addition, we describe techniques (based on graph-theoretic analysis) to estimate the optimal dimensionality of the low-dimensional parametric representation. We show how these techniques can be packaged into a modular, computationally scalable software framework with a graphical user interface, the Scalable Extensible Toolkit for Dimensionality Reduction (SETDiR). This interface helps to separate the mathematics and computational aspects from the materials science applications, thus significantly enhancing utility to the materials science community. The applicability of this framework in constructing reduced-order models of complicated materials datasets is illustrated with an example dataset of apatites described in structural descriptor space.
Cluster analysis of the low-dimensional plots yielded interesting insights into the correlation between several structural descriptors, such as ionic radius and covalence, and characteristic properties such as apatite stability. This information is crucial, as it can promote the use of apatite materials as a potential host system for immobilizing toxic elements.