Temporal Feature Selection with Symbolic Regression
Building and discovering useful features when constructing machine learning models is the central task for the machine learning practitioner. Good features are useful not only in increasing the predictive power of a model but also in illuminating the underlying drivers of a target variable. In this research we propose a novel feature learning technique in which symbolic regression is endowed with a "Range Terminal" that allows it to explore functions of the aggregate of variables over time. We test the Range Terminal on a synthetic data set and a real-world data set in which we predict seasonal greenness using satellite-derived temperature and snow data over a portion of the Arctic. On the synthetic data set we find symbolic regression with the Range Terminal outperforms standard symbolic regression and Lasso regression. On the Arctic data set we find it outperforms standard symbolic regression and, while it fails to beat Lasso regression, finds useful features describing the interaction between Land Surface Temperature, snow, and seasonal vegetative growth in the Arctic.
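The "Range Terminal" idea above can be sketched minimally: a terminal node that exposes an aggregate of a time-indexed variable over a window, rather than a single lagged value. This is an illustrative reconstruction, not the paper's implementation; the function name and defaults are assumptions.

```python
# Minimal sketch of a "range terminal": a symbolic-regression terminal
# that aggregates a time-indexed variable over a window, instead of
# exposing a single time step. Names and defaults are illustrative.

def range_terminal(series, start, end, agg=None):
    """Aggregate series[start:end]; defaults to the mean."""
    window = series[start:end]
    if not window:
        return 0.0
    if agg is None:
        return sum(window) / len(window)
    return agg(window)

# Toy example: seasonal greenness driven by mean temperature over a
# multi-day window -- a feature a pointwise terminal cannot express.
daily_temps = [4.0, 5.5, 6.0, 7.5, 8.0, 6.5]
print(range_terminal(daily_temps, 1, 5))        # mean over days 1..4
print(range_terminal(daily_temps, 1, 5, max))   # max over the window
```

In a genetic-programming setting, the window bounds and aggregate would be evolved alongside the rest of the expression tree.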
FIBS: A Generic Framework for Classifying Interval-based Temporal Sequences
We study the problem of classifying interval-based temporal sequences
(IBTSs). Since common classification algorithms cannot be directly applied to
IBTSs, the main challenge is to define a set of features that effectively
represents the data such that classifiers can be applied. Most prior work
utilizes frequent pattern mining to define a feature set based on discovered
patterns. However, frequent pattern mining is computationally expensive and
often discovers many irrelevant patterns. To address this shortcoming, we
propose the FIBS framework for classifying IBTSs. FIBS extracts features
relevant to classification from IBTSs based on relative frequency and temporal
relations. To avoid selecting irrelevant features, a filter-based selection
strategy is incorporated into FIBS. Our empirical evaluation on eight
real-world datasets demonstrates the effectiveness of our methods in practice.
The results provide evidence that FIBS effectively represents IBTSs for
classification algorithms, which contributes to similar or significantly better
accuracy compared to state-of-the-art competitors. It also suggests that the
feature selection strategy is beneficial to FIBS's performance. Comment: In: Big Data Analytics and Knowledge Discovery. DaWaK 2020. Springer, Cha
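The kind of features FIBS describes can be sketched as follows: relative frequencies of event labels plus counts of pairwise temporal relations between intervals. The coarse three-relation scheme and the feature names here are illustrative assumptions, not FIBS's exact definitions.

```python
# Sketch of extracting classification features from an interval-based
# temporal sequence (IBTS): per-label relative frequency plus counts
# of pairwise temporal relations. Relation set is deliberately coarse
# (before / overlaps / after) and illustrative, not FIBS's exact one.

def relation(a, b):
    """Coarse temporal relation between intervals a=(s,e), b=(s,e)."""
    if a[1] <= b[0]:
        return "before"
    if a[0] < b[1] and b[0] < a[1]:
        return "overlaps"
    return "after"

def extract_features(seq):
    """seq: list of (label, start, end) triples, sorted by start."""
    n = len(seq)
    feats = {}
    for label, s, e in seq:
        feats[f"freq:{label}"] = feats.get(f"freq:{label}", 0) + 1 / n
    for i in range(n):
        for j in range(i + 1, n):
            key = f"{seq[i][0]}-{relation(seq[i][1:], seq[j][1:])}-{seq[j][0]}"
            feats[key] = feats.get(key, 0) + 1
    return feats

seq = [("A", 0, 5), ("B", 3, 8), ("A", 9, 12)]
f = extract_features(seq)
print(f["A-overlaps-B"], round(f["freq:A"], 2))
```

A standard classifier can then be trained directly on these fixed-length feature dictionaries, which is the gap the framework addresses.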
Scallop: A Language for Neurosymbolic Programming
We present Scallop, a language which combines the benefits of deep learning
and logical reasoning. Scallop enables users to write a wide range of
neurosymbolic applications and train them in a data- and compute-efficient
manner. It achieves these goals through three key features: 1) a flexible
symbolic representation that is based on the relational data model; 2) a
declarative logic programming language that is based on Datalog and supports
recursion, aggregation, and negation; and 3) a framework for automatic and
efficient differentiable reasoning that is based on the theory of provenance
semirings. We evaluate Scallop on a suite of eight neurosymbolic applications
from the literature. Our evaluation demonstrates that Scallop is capable of
expressing algorithmic reasoning in diverse and challenging AI tasks, provides
a succinct interface for machine learning programmers to integrate logical
domain knowledge, and yields solutions that are comparable or superior to
state-of-the-art models in terms of accuracy. Furthermore, Scallop's solutions
outperform these models in aspects such as runtime and data efficiency,
interpretability, and generalizability.
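The provenance-semiring machinery can be illustrated with a tiny example: Datalog-style transitive closure over probabilistic edges, where conjunction multiplies tags and disjunction takes the maximum (the Viterbi semiring). This is an illustrative sketch of the idea, not Scallop's engine, and the edge probabilities are made up.

```python
# Sketch of reasoning with a provenance semiring: transitive closure
# over probabilistic edges, combining tags with (max, *) -- the
# Viterbi semiring. Illustrative only; not Scallop's implementation.

edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.5}

def transitive_closure(facts):
    """Datalog-style fixpoint: path(x,z) :- path(x,y), edge(y,z)."""
    path = dict(facts)
    changed = True
    while changed:
        changed = False
        for (x, y), p1 in list(path.items()):
            for (y2, z), p2 in facts.items():
                if y == y2:
                    tag = p1 * p2                    # conjunction: multiply
                    if tag > path.get((x, z), 0.0):  # disjunction: max
                        path[(x, z)] = tag
                        changed = True
    return path

paths = transitive_closure(edges)
print(round(paths[("a", "c")], 2))  # best derivation: 0.9 * 0.8 = 0.72
```

Swapping in a different semiring (e.g. sum-product over dual numbers) is what makes the same fixpoint computation differentiable.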
A Formal Verification Environment for Use in the Certification of Safety-Related C Programs
In this thesis the design of an environment for the formal verification of functional properties of safety-related software written in the programming language C is described. The focus lies on the verification of (primarily) geometric computations. We give an overview of the applicable regulations for safety-related software systems. We define a combination of higher-order logic as formalised in the theorem prover Isabelle and a specification language syntactically based on C expressions. The language retains the mathematical character of higher-level specifications in code specifications. A memory model for C is formalised which is appropriate to model low-level memory operations while keeping the entailed verification overhead in tolerable bounds. Finally, a Hoare-style proof calculus is devised so that correctness proofs can be performed in one integrated framework. The applicability of the approach is demonstrated by describing its use in an industrial project.
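The core move in a Hoare-style calculus can be shown in miniature: the weakest precondition of an assignment is the postcondition with the expression substituted for the variable. The toy calculator below is a hedged illustration of that rule only; the thesis's actual calculus is mechanized in Isabelle over a full C memory model.

```python
# Tiny sketch of one Hoare-logic rule: wp(var := expr, post) is post
# with expr substituted for var. Textual substitution is safe here
# because variables are single whole words. Purely illustrative.
import re

def wp_assign(var, expr, post):
    """Weakest precondition of the assignment var := expr."""
    return re.sub(rf"\b{var}\b", f"({expr})", post)

# Verify {x + 1 > 0} x := x + 1 {x > 0}: the computed wp should be
# implied by the stated precondition.
pre = wp_assign("x", "x + 1", "x > 0")
print(pre)  # (x + 1) > 0
```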
Prediction of high-performance concrete compressive strength through a comparison of machine learning techniques
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.
High-performance concrete (HPC) is a highly complex composite material whose characteristics are extremely difficult to model. One of those characteristics is the concrete compressive strength, a nonlinear function of the same ingredients that compose HPC: cement, fly ash, blast furnace slag, water, superplasticizer, age, and coarse and fine aggregates. Research has shown time and again that concrete strength is not determined just by the water-to-cement ratio, which was for years the go-to metric. In addition, traditional methods that attempt to model HPC, such as regression analysis, do not provide sufficient predictive power due to the nonlinear properties of the mixture. Therefore, this study attempts to optimize the prediction and modeling of the compressive strength of HPC by analyzing seven different machine learning (ML) algorithms: three regularization algorithms (Lasso, Ridge, and Elastic Net), three ensemble algorithms (Random Forest, Gradient Boost, and AdaBoost), and Artificial Neural Networks. All techniques were built and tested with a dataset composed of data from 17 different concrete strength test laboratories, under the same experimental conditions, which enabled a fair comparison among them and with previous studies in the field. Feature importance analysis and outlier analysis were also performed, and all models were subjected to a Wilcoxon Signed-Ranks Test to ensure statistically significant results. The final results show that the more complex ML algorithms provided greater accuracy than the regularization techniques, with Gradient Boost being the superior model among them, providing more accurate predictions than the state-of-the-art. Better results were achieved using all variables and without removing outlier observations.
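The Wilcoxon Signed-Ranks Test used above for model comparison is easy to sketch: rank the absolute per-fold error differences and compare the positive and negative rank sums. The fold errors below are made-up numbers for illustration, and the sketch computes only the W statistic (no significance table, ties ignored).

```python
# Sketch of the Wilcoxon signed-ranks statistic for comparing two
# models' per-fold errors. Illustrative: returns W only, no p-value;
# tie handling is omitted for brevity. Fold errors are made up.

def wilcoxon_w(errors_a, errors_b):
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    # ranks 1..n assigned by absolute difference
    w_plus = sum(r + 1 for r, i in enumerate(ranked) if diffs[i] > 0)
    w_minus = sum(r + 1 for r, i in enumerate(ranked) if diffs[i] < 0)
    return min(w_plus, w_minus)

gboost = [2.1, 1.9, 2.4, 2.0, 2.2]  # hypothetical RMSE per CV fold
lasso = [2.8, 2.6, 2.5, 2.9, 2.7]
print(wilcoxon_w(gboost, lasso))  # 0: one model wins on every fold
```

A small W (far below the rank-sum total) is what licenses the claim that one model's improvement is statistically significant rather than fold noise.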
Prospects for Declarative Mathematical Modeling of Complex Biological Systems
Declarative modeling uses symbolic expressions to represent models. With such
expressions one can formalize high-level mathematical computations on models
that would be difficult or impossible to perform directly on a
lower-level simulation program written in a general-purpose programming
language. Examples of such
computations on models include model analysis, relatively general-purpose
model-reduction maps, and the initial phases of model implementation, all of
which should preserve or approximate the mathematical semantics of a complex
biological model. The potential advantages are particularly relevant in the
case of developmental modeling, wherein complex spatial structures exhibit
dynamics at molecular, cellular, and organogenic levels to relate genotype to
multicellular phenotype. Multiscale modeling can benefit from both the
expressive power of declarative modeling languages and the application of model
reduction methods to link models across scale. Based on previous work, here we
define declarative modeling of complex biological systems by defining the
operator algebra semantics of an increasingly powerful series of declarative
modeling languages including reaction-like dynamics of parameterized and
extended objects; we define semantics-preserving implementation and
semantics-approximating model reduction transformations; and we outline a
"meta-hierarchy" for organizing declarative models and the mathematical methods
that can fruitfully manipulate them.
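The declarative-modeling idea can be sketched concretely with reaction-like dynamics: reactions are represented as data (a symbolic model), and the mass-action ODE right-hand side is derived from that representation rather than hand-coded. The species, stoichiometries, and rate constants below are illustrative assumptions, not from the paper.

```python
# Sketch of declarative modeling: reactions are symbolic data, and the
# mass-action ODE right-hand side is *derived* from them, so model
# analysis and reduction can operate on the declarative form. The
# species and rate constants here are made up for illustration.

reactions = [
    # (reactants, products, rate constant)
    ({"A": 1, "B": 1}, {"C": 1}, 0.5),   # A + B -> C
    ({"C": 1}, {"A": 1, "B": 1}, 0.1),   # C -> A + B
]

def rhs(state):
    """Derive d[species]/dt from the declarative reaction list."""
    deriv = {s: 0.0 for s in state}
    for reactants, products, k in reactions:
        flux = k
        for s, n in reactants.items():
            flux *= state[s] ** n
        for s, n in reactants.items():
            deriv[s] -= n * flux
        for s, n in products.items():
            deriv[s] += n * flux
    return deriv

d = rhs({"A": 2.0, "B": 1.0, "C": 0.5})
print(d["C"])  # 0.5*2*1 - 0.1*0.5 = 0.95
```

Because the model is data, a model-reduction map is just another function over the reaction list, which is the advantage the abstract argues for.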
How Well Do Feature-Additive Explainers Explain Feature-Additive Predictors?
Surging interest in deep learning from high-stakes domains has precipitated
concern over the inscrutable nature of black box neural networks. Explainable
AI (XAI) research has led to an abundance of explanation algorithms for these
black boxes. Such post hoc explainers produce human-comprehensible
explanations; however, their fidelity with respect to the model is not
well understood: explanation evaluation remains one of the most
challenging issues
in XAI. In this paper, we ask a targeted but important question: can popular
feature-additive explainers (e.g., LIME, SHAP, SHAPR, MAPLE, and PDP) explain
feature-additive predictors? Herein, we evaluate such explainers on ground
truth that is analytically derived from the additive structure of a model. We
demonstrate the efficacy of our approach in understanding these explainers
applied to symbolic expressions, neural networks, and generalized additive
models on thousands of synthetic and several real-world tasks. Our results
suggest that all explainers eventually fail to correctly attribute the
importance of features, especially when a decision-making process involves
feature interactions. Comment: Accepted to NeurIPS Workshop XAI in Action: Past, Present, and Future
Applications. arXiv admin note: text overlap with arXiv:2106.0837
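The evaluation idea in the abstract can be sketched directly: for an additive predictor f(x) = f1(x1) + f2(x2), per-feature ground truth is analytic, so an attribution method can be checked exactly. The baseline-difference "explainer" below is a deliberately simple stand-in, not LIME or SHAP, and the component functions are made up.

```python
# Sketch of evaluating a feature-additive explainer against analytic
# ground truth from an additive model f(x) = f1(x1) + f2(x2). The
# baseline-difference explainer is illustrative, not LIME/SHAP.

f1 = lambda x: 3 * x      # additive components: the ground truth
f2 = lambda x: x * x
f = lambda x1, x2: f1(x1) + f2(x2)

def baseline_diff_explainer(x1, x2, base=(0.0, 0.0)):
    """Attribute each feature by toggling it from a baseline value."""
    return (f(x1, base[1]) - f(*base), f(base[0], x2) - f(*base))

x = (2.0, 3.0)
truth = (f1(x[0]) - f1(0.0), f2(x[1]) - f2(0.0))
attr = baseline_diff_explainer(*x)
print(attr == truth)  # exact for additive models with no interactions
```

The paper's finding is precisely that this exactness breaks down once the model contains feature interactions, which no single per-feature attribution can represent faithfully.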
A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus
This paper describes an algorithm for automatically extracting multiword expressions (MWEs) from a corpus. The algorithm is node-based, i.e., it extracts MWEs that contain the item specified by the user, using a fixed window size around the node. The main idea is to detect the frequency anomalies that occur at the starting and ending points of an ngram that constitutes an MWE. This is achieved by locally comparing matrices of observed frequencies to matrices of expected frequencies, and determining, for each individual input, one or more sub-sequences that have the highest probability of being an MWE. Top-performing sub-sequences are then combined in a score-aggregation and ranking stage, thus producing a single list of score-ranked MWE candidates, without having to indiscriminately generate all possible sub-sequences of the input strings. The knowledge-poor and computationally efficient algorithm attempts to solve certain recurring problems in MWE extraction, such as the inability to deal with MWEs of arbitrary length, the repetitive counting of nested ngrams, and excessive sensitivity to frequency. Evaluation results show that the best-performing version generates top-50 precision values between 0.71 and 0.88 on Turkish and English data, and performs better than the baseline method even at n = 1000.
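The observed-versus-expected frequency comparison at the heart of this approach can be sketched for the bigram case: a bigram is anomalous when its observed count far exceeds the count expected if its words co-occurred independently. The toy corpus and the simple ratio score are illustrative, not the paper's matrix formulation.

```python
# Sketch of the observed-vs-expected frequency idea behind MWE
# extraction: score a bigram by how much its observed count exceeds
# the count expected under word independence. Toy corpus; the ratio
# score is a simplification of the paper's matrix-based comparison.
from collections import Counter

corpus = "new york is big . new york is far . the city is big .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def association(w1, w2):
    """Ratio of observed to expected bigram frequency."""
    expected = unigrams[w1] * unigrams[w2] / total
    return bigrams[(w1, w2)] / expected

# "new york" co-occurs more than chance predicts; "is big" less so.
print(association("new", "york") > association("is", "big"))
```

The full algorithm extends this comparison to matrices over all sub-sequences in the window, so anomalies at an ngram's start and end points mark the MWE boundary.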