ASlib: A Benchmark Library for Algorithm Selection
The task of algorithm selection involves choosing an algorithm from a set of
algorithms on a per-instance basis in order to exploit the varying performance
of algorithms over a set of instances. The algorithm selection problem is
attracting increasing attention from researchers and practitioners in AI. Years
of fruitful applications in a number of domains have resulted in a large amount
of data, but the community lacks a standard format or repository for this data.
This situation makes it difficult to share and compare different approaches
effectively, as is done in other, more established fields. It also
unnecessarily hinders new researchers who want to work in this area. To address
this problem, we introduce a standardized format for representing algorithm
selection scenarios and a repository that contains a growing number of data
sets from the literature. Our format has been designed to be able to express a
wide variety of different scenarios. Demonstrating the breadth and power of our
platform, we describe a set of example experiments that build and evaluate
algorithm selection models through a common interface. The results display the
potential of algorithm selection to achieve significant performance
improvements across a broad range of problems and algorithms.
Comment: Accepted to be published in Artificial Intelligence Journal
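Per-instance algorithm selection as described above can be sketched in a few lines. This toy example (hypothetical data, not the ASlib format itself) uses a 1-nearest-neighbour model over instance features to pick the algorithm expected to run fastest on each instance:

```python
# Minimal per-instance algorithm selection sketch: for a new instance,
# find the most similar training instance and choose the algorithm
# that was fastest on it.

def select_algorithm(features, train_features, train_runtimes):
    """Return the index of the algorithm with the lowest runtime on the
    training instance closest to `features` (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(range(len(train_features)),
                  key=lambda i: dist(features, train_features[i]))
    runtimes = train_runtimes[nearest]
    return min(range(len(runtimes)), key=lambda j: runtimes[j])

# Two training instances, two algorithms (runtimes in seconds).
train_features = [[0.0, 0.0], [10.0, 10.0]]
train_runtimes = [[1.0, 5.0],   # algorithm 0 wins near the origin
                  [9.0, 2.0]]   # algorithm 1 wins far from it

choice_near = select_algorithm([1.0, 1.0], train_features, train_runtimes)
choice_far = select_algorithm([9.0, 9.0], train_features, train_runtimes)
```

Real selectors use richer models (regression, ranking, or cost-sensitive classification), but the interface — features in, chosen algorithm out — is the same one the ASlib scenarios standardize.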
On the role of pre and post-processing in environmental data mining
The quality of discovered knowledge is highly dependent on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies, or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of getting low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details about how this can be achieved and discusses the role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems.
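The pre-processing step the abstract emphasizes can be illustrated with a small sketch. The helper below is hypothetical (not from the paper): it fills missing environmental sensor readings with the median of the observed values and clips obvious outliers to a plausible range:

```python
# Toy pre-processing sketch: impute missing values, then clip outliers.

def clean_series(values, low, high):
    """Replace None with the (upper) median of observed values, then
    clip every value to the plausible range [low, high]."""
    observed = sorted(v for v in values if v is not None)
    mid = observed[len(observed) // 2]
    filled = [mid if v is None else v for v in values]
    return [min(max(v, low), high) for v in filled]

readings = [2.0, None, 3.0, 250.0, 2.5]   # 250.0 is a sensor glitch
cleaned = clean_series(readings, low=0.0, high=50.0)
```

Even this trivial example shows why domain knowledge matters: the plausible range [0, 50] is exactly the kind of information an environmental expert supplies before any mining begins.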
User Review-Based Change File Localization for Mobile Applications
In the current mobile app development, novel and emerging DevOps practices
(e.g., Continuous Delivery, Integration, and user feedback analysis) and tools
are becoming more widespread. For instance, the integration of user feedback
(provided in the form of user reviews) in the software release cycle represents
a valuable asset for the maintenance and evolution of mobile apps. To fully
make use of these assets, it is highly desirable for developers to establish
semantic links between the user reviews and the software artefacts to be
changed (e.g., source code and documentation), and thus to localize the
potential files to change for addressing the user feedback. In this paper, we
propose RISING (Review Integration via claSsification, clusterIng, and
linkiNG), an automated approach to support the continuous integration of user
feedback via classification, clustering, and linking of user reviews. RISING
leverages domain-specific constraint information and semi-supervised learning
to group user reviews into multiple fine-grained clusters concerning similar
users' requests. Then, by combining the textual information from both commit
messages and source code, it automatically localizes potential change files to
accommodate the users' requests. Our empirical studies demonstrate that the
proposed approach outperforms the state-of-the-art baseline work in terms of
clustering and localization accuracy, and thus produces more reliable results.
Comment: 15 pages, 3 figures, 8 tables
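The review-grouping idea can be sketched with a toy similarity-based clustering. RISING's actual approach uses domain-specific constraints and semi-supervised learning; this hypothetical sketch only illustrates the grouping step, using word-set Jaccard similarity:

```python
# Toy sketch: group user reviews whose word overlap exceeds a threshold.

def jaccard(a, b):
    """Jaccard similarity between the word sets of two reviews."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def cluster_reviews(reviews, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose
    representative (first member) is similar enough, else start a new one."""
    clusters = []
    for review in reviews:
        for cluster in clusters:
            if jaccard(review, cluster[0]) >= threshold:
                cluster.append(review)
                break
        else:
            clusters.append([review])
    return clusters

reviews = [
    "app crashes on login screen",
    "crashes on the login screen every time",
    "please add a dark mode theme",
]
clusters = cluster_reviews(reviews)
```

The two crash reports share enough vocabulary to land in one cluster, while the feature request starts its own; the localization step would then match each cluster's text against commit messages and source files.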
Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy
Data collection for scientific applications is increasing exponentially and
is forecasted to soon reach peta- and exabyte scales. Applications which
process and analyze scientific data must be scalable and focus on execution
performance to keep pace. In the field of radio astronomy, in addition to
increasingly large datasets, tasks such as the identification of transient
radio signals from extrasolar sources are computationally expensive. We present
a scalable approach to radio pulsar detection written in Scala that
parallelizes candidate identification to take advantage of in-memory task
processing using Apache Spark on a YARN distributed system. Furthermore, we
introduce a novel automated multiclass supervised machine learning technique
that we combine with feature selection to reduce the time required for
candidate classification. Experimental testing on a Beowulf cluster with 15
data nodes shows that the parallel implementation of the identification
algorithm offers a speedup of up to 5X that of a similar multithreaded
implementation. Further, we show that the combination of automated multiclass
classification and feature selection speeds up the execution performance of the
RandomForest machine learning algorithm by an average of 54% with less than a
2% average reduction in the algorithm's ability to correctly classify pulsars.
The generalizability of these results is demonstrated by using two real-world
radio astronomy data sets.
Comment: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 pages
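The paper's pipeline parallelizes candidate identification with Apache Spark in Scala; as a language-agnostic stand-in, a thread pool can illustrate the same embarrassingly parallel filtering step. The SNR threshold and record layout here are hypothetical, not taken from the paper:

```python
# Sketch of parallel candidate identification: filter pulses whose
# signal-to-noise ratio clears a threshold, mapping the predicate over
# a worker pool (Spark's map/filter plays this role in the real system).
from concurrent.futures import ThreadPoolExecutor

SNR_THRESHOLD = 8.0  # hypothetical cutoff

def is_candidate(pulse):
    """A pulse is a candidate if its signal-to-noise ratio is high enough."""
    return pulse["snr"] >= SNR_THRESHOLD

def find_candidates(pulses, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(is_candidate, pulses))  # order-preserving
    return [p for p, keep in zip(pulses, flags) if keep]

pulses = [{"id": i, "snr": s} for i, s in enumerate([3.1, 9.4, 12.0, 7.9])]
candidates = find_candidates(pulses)
```

On a cluster, the win comes from partitioning the pulse records across nodes so each worker filters its shard in memory, which is what the Spark-on-YARN deployment in the paper provides.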
Search based software engineering: Trends, techniques and applications
© ACM, 2012. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version is available from the link below.
In the past five years there has been a dramatic increase in work on Search-Based Software Engineering (SBSE), an approach to Software Engineering (SE) in which Search-Based Optimization (SBO) algorithms are used to address problems in SE. SBSE has been applied to problems throughout the SE lifecycle, from requirements and project planning to maintenance and reengineering. The approach is attractive because it offers a suite of adaptive automated and semiautomated solutions in situations typified by large complex problem spaces with multiple competing and conflicting objectives.
This article provides a review and classification of literature on SBSE. The work identifies research trends and relationships between the techniques applied and the applications to which they have been applied, and highlights gaps in the literature and avenues for further research.
Spatially Aware Dictionary Learning and Coding for Fossil Pollen Identification
We propose a robust approach for performing automatic species-level
recognition of fossil pollen grains in microscopy images that exploits both
global shape and local texture characteristics in a patch-based matching
methodology. We introduce a novel criterion for selecting meaningful and
discriminative exemplar patches. We optimize this function during training
using a greedy submodular function optimization framework that gives a
near-optimal solution with bounded approximation error. We use these selected
exemplars as a dictionary basis and propose a spatially-aware sparse coding
method to match testing images for identification while maintaining global
shape correspondence. To accelerate the coding process for fast matching, we
introduce a relaxed form that uses spatially-aware soft-thresholding during
coding. Finally, we carry out an experimental study that demonstrates the
effectiveness and efficiency of our exemplar selection and classification
mechanisms, achieving accuracy on a difficult fine-grained species
classification task distinguishing three types of fossil spruce pollen.
Comment: CVMI 201
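The relaxed coding step mentioned above relies on soft-thresholding. The spatially-aware variant is specific to the paper, but the plain soft-thresholding operator it builds on is standard (the proximal operator of the L1 norm) and looks like this:

```python
# Soft-thresholding: shrink each coefficient toward zero by lambda,
# zeroing out anything within the threshold. This is what makes the
# resulting codes sparse.

def soft_threshold(x, lam):
    """Proximal operator of lam * |x|: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

codes = [soft_threshold(v, 0.5) for v in [2.0, 0.3, -1.2]]
```

Small coefficients (here 0.3) are set exactly to zero while large ones survive slightly shrunk, which is why replacing the full sparse-coding solve with one thresholding pass gives the speedup the abstract describes.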
Test Set Diameter: Quantifying the Diversity of Sets of Test Cases
A common and natural intuition among software testers is that test cases need
to differ if a software system is to be tested properly and its quality
ensured. Consequently, much research has gone into formulating distance
measures for how test cases, their inputs and/or their outputs differ. However,
common to these proposals is that they are data type specific and/or calculate
the diversity only between pairs of test inputs, traces or outputs.
We propose a new metric to measure the diversity of sets of tests: the test
set diameter (TSDm). It extends our earlier, pairwise test diversity metrics
based on recent advances in information theory regarding the calculation of the
normalized compression distance (NCD) for multisets. An advantage is that TSDm
can be applied regardless of data type and on any test-related information, not
only the test inputs. A downside is the increased computational time compared
to competing approaches.
Our experiments on four different systems show that the test set diameter can
help select test sets with higher structural and fault coverage than random
selection even when only applied to test inputs. This can enable early test
design and selection, prior to even having a software system to test, and
complement other types of test automation and analysis. We argue that this
quantification of test set diversity creates a number of opportunities to
better understand software quality and provides practical ways to increase it.
Comment: In submission
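The quantity underlying TSDm is the normalized compression distance. TSDm generalizes it to whole multisets of tests; the familiar pairwise form, approximated here with zlib as the compressor, is NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)):

```python
# Pairwise NCD sketch: similar inputs compress well together, so their
# concatenation adds little to the compressed length of either alone.
import zlib

def C(data: bytes) -> int:
    """Compressed length: a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

similar = ncd(b"abcabcabc" * 20, b"abcabcabc" * 20)
different = ncd(b"abcabcabc" * 20, b"xyzqrstuv" * 20)
```

Identical inputs yield a distance near zero while unrelated inputs score much higher; the multiset extension lets a single number summarize the diversity of an entire test set rather than one pair at a time, at the computational cost the abstract notes.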
Automated reliability assessment for spectroscopic redshift measurements
We present a new approach to automate the spectroscopic redshift reliability
assessment based on machine learning (ML) and characteristics of the redshift
probability density function (PDF).
We propose to rephrase the spectroscopic redshift estimation into a Bayesian
framework, in order to incorporate all sources of information and uncertainties
related to the redshift estimation process, and produce a redshift posterior
PDF that will be the starting-point for ML algorithms to provide an automated
assessment of a redshift reliability.
As a use case, public data from the VIMOS VLT Deep Survey is exploited to
present and test this new methodology. We first tried to reproduce the existing
reliability flags using supervised classification to describe different types
of redshift PDFs, but due to the subjective definition of these flags, soon
opted for a new homogeneous partitioning of the data into distinct clusters via
unsupervised classification. After assessing the accuracy of the new clusters
via resubstitution and test predictions, unlabelled data from preliminary mock
simulations for the Euclid space mission are projected into this mapping to
predict their redshift reliability labels.
Comment: Submitted on 02 June 2017 (v1). Revised on 08 September 2017 (v2). Latest version 28 September 2017 (this version v3).
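A classifier of redshift reliability needs descriptors of the posterior PDF's shape. The features below are hypothetical illustrations (not the paper's actual feature set): the number of significant modes and the dispersion of a discretized redshift PDF, both of which distinguish a clean unimodal solution from an ambiguous multimodal one:

```python
# Toy PDF descriptors: count significant local maxima and compute the
# standard deviation of a normalized, discretized redshift PDF.

def pdf_features(z, p):
    """Return (mode_count, std) for a PDF p sampled at redshifts z."""
    mean = sum(zi * pi for zi, pi in zip(z, p))
    var = sum(pi * (zi - mean) ** 2 for zi, pi in zip(z, p))
    peak = max(p)
    # A mode is a local maximum above 10% of the global peak.
    modes = sum(1 for i in range(1, len(p) - 1)
                if p[i] > p[i - 1] and p[i] >= p[i + 1] and p[i] > 0.1 * peak)
    return modes, var ** 0.5

z = [0.0, 0.5, 1.0, 1.5, 2.0]
unimodal = [0.0, 0.1, 0.8, 0.1, 0.0]   # one confident solution
bimodal = [0.1, 0.4, 0.0, 0.4, 0.1]    # two competing solutions
```

Feeding such descriptors to an unsupervised clusterer, as the paper does, groups PDFs of similar shape without relying on the subjective flags it set out to replace.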
Identifying smart design attributes for Industry 4.0 customization using a clustering Genetic Algorithm
Industry 4.0 aims at achieving mass customization at a
mass production cost. A key component to realizing this is accurate
prediction of customer needs and wants, which is however a
challenging issue due to the lack of smart analytics tools. This
paper investigates this issue in depth and then develops a predictive
analytic framework for integrating cloud computing, big data
analysis, business informatics, communication technologies, and
digital industrial production systems. Computational intelligence
in the form of a cluster k-means approach is used to manage
relevant big data for feeding potential customer needs and wants
to smart designs for targeted productivity and customized mass
production. The identification of patterns from big data is achieved
with cluster k-means and with the selection of optimal attributes
using genetic algorithms. A car customization case study shows
how it may be applied and where to assign new clusters with
growing knowledge of customer needs and wants. This approach
offers a number of features suitable for smart design in realizing
Industry 4.0.
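The clustering component described above can be illustrated with a minimal one-dimensional k-means on toy data (the paper works on high-dimensional customer data, and couples the clustering with genetic-algorithm attribute selection not shown here):

```python
# Minimal 1-D k-means sketch: alternate assigning points to the nearest
# center and recomputing each center as its group's mean.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers

# Hypothetical customer "sportiness" scores clustering around 1 and 9.
centers = kmeans_1d([0.9, 1.1, 1.0, 8.9, 9.1, 9.0], centers=[0.0, 5.0])
```

Each resulting center stands for a customer segment; in the car customization case study, new reviews or orders would be assigned to the nearest cluster, and clusters would be added as knowledge of customer needs and wants grows.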