Identification of significant factors for air pollution levels using a neural network based knowledge discovery system
Artificial neural networks (ANNs) are a commonly used approach to estimate or forecast air pollution levels, which are usually assessed by the concentrations of air contaminants such as nitrogen dioxide, sulfur dioxide, carbon monoxide, ozone, and suspended particulate matter (PM) in the atmosphere of the areas concerned. Even though ANNs can accurately estimate air pollution levels, they are numerical black boxes and cannot provide explicit knowledge of how air pollution factors (e.g. traffic and meteorological factors) drive those levels. This paper proposes a neural network based knowledge discovery system aimed at overcoming this limitation. The system consists of two units: a) an ANN unit, which estimates air pollution levels from relevant air pollution factors; b) a knowledge discovery unit, which extracts explicit knowledge from the ANN unit. To demonstrate the practicability of this system, numerical data on mass concentrations of PM2.5 and PM1.0, together with meteorological and traffic data measured near a busy traffic road in Hangzhou city, were used to investigate air pollution levels and the potential factors that may affect the concentrations of these PMs. Results suggest that the proposed system can accurately estimate air pollution levels and identify the significant factors that affect them.
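The knowledge discovery unit is described only at a high level here. One generic way to probe a trained black-box estimator for significant input factors is a perturbation-based sensitivity check; the sketch below illustrates that idea only and is not the paper's actual unit. The names `factor_importance`, `model`, and `delta` are assumptions of this sketch.

```python
def factor_importance(model, samples, n_factors, delta=0.1):
    """Rank input factors by how much perturbing each one changes the
    model's output (a generic sensitivity probe over a black-box model)."""
    scores = []
    for j in range(n_factors):
        change = 0.0
        for x in samples:
            bumped = list(x)
            bumped[j] += delta  # nudge one factor, hold the rest fixed
            change += abs(model(bumped) - model(x))
        scores.append(change / len(samples))
    # Factors sorted from most to least influential on the output.
    return sorted(range(n_factors), key=lambda j: -scores[j])
```

Applied to a trained ANN, a ranking like this would point at, say, traffic volume dominating wind speed, which is the kind of explicit statement the abstract says a bare ANN cannot provide.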
The Grow-Shrink strategy for learning Markov network structures constrained by context-specific independences
Markov networks are models for compactly representing complex probability
distributions. They are composed of a structure and a set of numerical weights.
The structure qualitatively describes independences in the distribution, which
can be exploited to factorize the distribution into a set of compact functions.
A key application for learning structures from data is to automatically
discover knowledge. In practice, structure learning algorithms focused on
"knowledge discovery" present a limitation: they use a coarse-grained
representation of the structure. As a result, this representation cannot
describe context-specific independences. Very recently, an algorithm called
CSPC was designed to overcome this limitation, but it has a high computational
complexity. This work mitigates this downside by presenting CSGS, an
algorithm that uses the Grow-Shrink strategy for reducing unnecessary
computations. In an empirical evaluation, the structures learned by CSGS
achieve competitive accuracies and lower computational complexity with respect
to those obtained by CSPC. Comment: 12 pages and 8 figures. This work was presented at IBERAMIA 2014.
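The Grow-Shrink strategy that CSGS adopts descends from the classic GS algorithm for Markov blanket discovery. A minimal sketch of that underlying strategy, assuming a caller-supplied conditional-independence test `indep` (the CSGS specifics, including context-specific independences, are omitted):

```python
def grow_shrink(target, variables, indep):
    """Classic Grow-Shrink sketch: estimate the Markov blanket of `target`.

    `indep(x, y, given)` is a caller-supplied conditional-independence test
    returning True when x and y are independent given the set `given`
    (in practice a statistical test on data; here it is an assumption).
    """
    blanket = set()
    # Grow phase: add any variable still dependent on the target
    # given the current blanket candidate.
    changed = True
    while changed:
        changed = False
        for v in variables:
            if v != target and v not in blanket and not indep(target, v, blanket):
                blanket.add(v)
                changed = True
    # Shrink phase: remove false positives that became independent
    # once the rest of the blanket was conditioned on.
    for v in list(blanket):
        if indep(target, v, blanket - {v}):
            blanket.discard(v)
    return blanket
```

With a separation oracle for the chain A - B - C, for example, the blanket of B comes out as {A, C}; the grow phase may overshoot, which is exactly the "unnecessary computations" the shrink phase and the CSGS refinements aim to cut down.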
Material Named Entity Recognition (MNER) for Knowledge-driven Materials Using Deep Learning Approach
The scientific literature contains a wealth of cutting-edge knowledge in the
field of materials science, as well as useful data (e.g., numerical data from
experimental results, material properties and structure). These data are
critical for data-driven machine learning (ML) and deep learning (DL) methods
to accelerate material discovery. Due to the large and growing number of
publications, it is difficult for humans to manually retrieve and retain this
knowledge. In this context, we investigate a deep neural network model based on
Bi-LSTM to retrieve knowledge from published scientific articles. The proposed
deep neural network-based model achieves an F1 score of ~97% for the
Material Named Entity Recognition (MNER) task. The study addresses motivation,
relevant work, methodology, hyperparameters, and overall performance
evaluation. The analysis provides insight into the results of the experiment
and points to future directions for current research. Comment: 10 pages.
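A named-entity recognition model such as the one described emits one label per token, and turning those labels into entity spans is a standard decoding step. A sketch of that step under the common BIO tagging scheme (the tag set and `MAT` type are assumptions; the abstract does not specify the scheme used):

```python
def decode_bio(tokens, tags):
    """Collect (entity_text, entity_type) spans from BIO-tagged tokens.

    "B-TYPE" begins an entity, "I-TYPE" continues one of the same type,
    and "O" marks tokens outside any entity.
    """
    entities, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the previous entity before starting a new one
                entities.append((" ".join(current), ctype))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), ctype))
            current, ctype = [], None
    if current:  # flush an entity that runs to the end of the sentence
        entities.append((" ".join(current), ctype))
    return entities
```

Given tokens tagged `B-MAT I-MAT`, for instance, the decoder returns a single material mention, which is the structured output downstream ML pipelines would consume.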
Learning causal models that make correct manipulation predictions with time series data
One of the fundamental purposes of causal models is using them to predict the effects of manipulating various components of a system. It has been argued by Dash (2005, 2003) that the Do operator will fail when applied to an equilibrium model, unless the underlying dynamic system obeys what he calls Equilibration-Manipulation Commutability (EMC). Unfortunately, this fact renders most existing causal discovery algorithms unreliable for reasoning about manipulations. Motivated by this caveat, in this paper we present a novel approach to causal discovery of dynamic models from time series. The approach uses a representation of dynamic causal models motivated by Iwasaki and Simon (1994), which asserts that all "causation across time" occurs because a variable's derivative has been affected instantaneously. We present an algorithm that exploits this representation within a constraint-based learning framework by numerically calculating derivatives and learning instantaneous relationships. We argue that due to numerical errors in higher-order derivatives, care must be taken when learning causal structure, but we show that the Iwasaki-Simon representation reduces the search space considerably, allowing us to forego calculating many high-order derivatives. In order for our algorithm to discover the dynamic model, it is necessary that the time-scale of the data is much finer than any temporal process of the system. Finally, we show that our approach can correctly recover the structure of a fairly complex dynamic system, and can predict the effect of manipulations accurately when a manipulation does not cause an instability. To our knowledge, this is the first causal discovery algorithm that has been demonstrated to correctly predict the effects of manipulations for a system that does not obey the EMC condition.
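The numerical-derivative step the abstract relies on can be sketched with central finite differences; repeated application shows concretely why higher-order derivatives amplify noise (each pass divides by the sampling step, so additive measurement noise grows). The function names are assumptions, not the paper's code:

```python
def central_derivative(series, dt):
    """First derivative of a uniformly sampled series by central differences."""
    return [(series[i + 1] - series[i - 1]) / (2.0 * dt)
            for i in range(1, len(series) - 1)]

def nth_derivative(series, dt, n):
    """Apply central differences n times. Each pass shortens the series by
    two samples and scales any additive noise by roughly 1/dt, which is why
    the abstract cautions against relying on many high-order derivatives."""
    for _ in range(n):
        series = central_derivative(series, dt)
    return series
```

On noiseless samples of t^2 this recovers 2t and then the constant 2; on real sensor data the second and third derivatives degrade quickly, which is the numerical-error concern the abstract raises.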
A framework for dependency estimation in heterogeneous data streams
Estimating dependencies from data is a fundamental task of Knowledge Discovery. Identifying the relevant variables leads to a better understanding of data and improves both the runtime and the outcomes of downstream Data Mining tasks. Dependency estimation from static numerical data has received much attention. However, real-world data often occurs as heterogeneous data streams: On the one hand, data is collected online and is virtually infinite. On the other hand, the various components of a stream may be of different types, e.g., numerical, ordinal or categorical. For this setting, we propose Monte Carlo Dependency Estimation (MCDE), a framework that quantifies multivariate dependency as the average statistical discrepancy between marginal and conditional distributions, via Monte Carlo simulations. MCDE handles heterogeneity by leveraging three statistical tests: the Mann–Whitney U test, the Kolmogorov–Smirnov test and the Chi-Squared test. We demonstrate that MCDE goes beyond the state of the art in dependency estimation by meeting a broad set of requirements. Finally, we show with a real-world use case that MCDE can discover useful patterns in heterogeneous data streams.
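The core MCDE idea, averaging a marginal-vs-conditional discrepancy over random slices, can be sketched in a few lines. This is a simplified illustration, not the published framework: only the Kolmogorov–Smirnov statistic is used (MCDE selects a test per attribute type), and the slicing scheme over x-quantiles is an assumption:

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    values = sorted(set(a) | set(b))
    d = 0.0
    for v in values:
        fa = sum(1 for x in a if x <= v) / len(a)
        fb = sum(1 for x in b if x <= v) / len(b)
        d = max(d, abs(fa - fb))
    return d

def mcde(xs, ys, iterations=100, slice_frac=0.5, seed=0):
    """Average discrepancy between the marginal distribution of ys and its
    distribution conditioned on random contiguous slices of xs."""
    rng = random.Random(seed)
    n = len(xs)
    width = max(2, int(n * slice_frac))
    order = sorted(range(n), key=lambda i: xs[i])  # slice along x-quantiles
    total = 0.0
    for _ in range(iterations):
        start = rng.randrange(n - width + 1)
        inside = [ys[i] for i in order[start:start + width]]
        total += ks_statistic(inside, ys)
    return total / iterations
```

When y depends on x, conditioning on a slice of x visibly shifts the distribution of y, so the averaged statistic is large; for independent pairs it stays near the sampling noise floor.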
A symbolic data-driven technique based on evolutionary polynomial regression
This paper describes a new hybrid regression method that combines the best features of conventional numerical regression techniques with the genetic programming symbolic regression technique. The key idea is to employ an evolutionary computing methodology to search for a model of the system/process being modelled, and to employ least-squares parameter estimation to obtain the constants. The new technique, termed Evolutionary Polynomial Regression (EPR), overcomes shortcomings of the GP process, such as poor computational performance, the number of evolutionary parameters to tune, and the complexity of the symbolic models. Similarly, it alleviates issues arising from numerical regression, including difficulties in using physical insight and over-fitting problems. This paper demonstrates that EPR performs well both in interpolating data and in scientific knowledge discovery. As an illustration, EPR is used to identify polynomial formulæ with progressively increasing levels of noise, to interpolate the Colebrook-White formula for a pipe resistance coefficient, and to discover a formula for a resistance coefficient from experimental data.
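EPR couples an evolutionary search over candidate symbolic terms with ordinary least squares for the constants. The least-squares half of that loop can be sketched via the normal equations; the evolutionary term search itself is omitted here, and the fixed polynomial term set is an assumption for illustration:

```python
def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_terms(xs, ys, terms):
    """Least-squares constants for a candidate model y = sum(a_j * term_j(x)),
    via the normal equations (A^T A) a = A^T y."""
    A = [[t(x) for t in terms] for x in xs]
    m = len(terms)
    AtA = [[sum(A[i][j] * A[i][k] for i in range(len(xs))) for k in range(m)]
           for j in range(m)]
    Aty = [sum(A[i][j] * ys[i] for i in range(len(xs))) for j in range(m)]
    return solve_linear(AtA, Aty)
```

In the full method, the evolutionary layer proposes which `terms` to include and this linear solve prices each candidate, which is how EPR sidesteps evolving the constants symbolically.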
Consistent Second-Order Conic Integer Programming for Learning Bayesian Networks
Bayesian Networks (BNs) represent conditional probability relations among a
set of random variables (nodes) in the form of a directed acyclic graph (DAG),
and have found diverse applications in knowledge discovery. We study the
problem of learning the sparse DAG structure of a BN from continuous
observational data. The central problem can be modeled as a mixed-integer
program with an objective function composed of a convex quadratic loss function
and a regularization penalty subject to linear constraints. The optimal
solution to this mathematical program is known to have desirable statistical
properties under certain conditions. However, the state-of-the-art optimization
solvers are not able to obtain provably optimal solutions to the existing
mathematical formulations for medium-size problems within reasonable
computational times. To address this difficulty, we tackle the problem from
both computational and statistical perspectives. On the one hand, we propose a
concrete early stopping criterion to terminate the branch-and-bound process in
order to obtain a near-optimal solution to the mixed-integer program, and
establish the consistency of this approximate solution. On the other hand, we
improve the existing formulations by replacing the linear "big-M" constraints
that represent the relationship between the continuous and binary indicator
variables with second-order conic constraints. Our numerical results
demonstrate the effectiveness of the proposed approaches.
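The mixed-integer program the abstract describes requires a solver and branch-and-bound; as a toy analogue, the same penalized objective (quadratic loss plus an edge penalty, subject to acyclicity) can be scored by brute force on a tiny instance. Everything below is an illustrative simplification, not the paper's formulation: each node is allowed at most one parent, and the search is exhaustive:

```python
from itertools import product

def fit_loss(y, x=None):
    """Squared loss of regressing y on one parent (or on its mean if none)."""
    n = len(y)
    my = sum(y) / n
    if x is None:
        return sum((v - my) ** 2 for v in y)
    mx = sum(x) / n
    sxx = sum((v - mx) ** 2 for v in x)
    b = sum((x[i] - mx) * (y[i] - my) for i in range(n)) / sxx
    return sum((y[i] - my - b * (x[i] - mx)) ** 2 for i in range(n))

def is_acyclic(parents):
    """With at most one parent per node, a cycle is a revisited parent chain."""
    for start in parents:
        seen, v = set(), start
        while v is not None:
            if v in seen:
                return False
            seen.add(v)
            v = parents[v]
    return True

def best_dag(data, lam=1.0):
    """Exhaustively score every acyclic parent assignment and return the one
    minimizing quadratic loss + lam * (number of edges)."""
    nodes = list(data)
    best, best_score = None, float("inf")
    for choice in product(*[[None] + [p for p in nodes if p != v] for v in nodes]):
        parents = dict(zip(nodes, choice))
        if not is_acyclic(parents):
            continue
        loss = sum(fit_loss(data[v], data[p] if p else None)
                   for v, p in parents.items())
        score = loss + lam * sum(1 for p in choice if p)
        if score < best_score:
            best, best_score = parents, score
    return best, best_score
```

The enumeration explodes combinatorially, which is precisely why the paper needs a mixed-integer formulation, tight conic constraints, and a principled early-stopping rule instead.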
Detection of a close supernova gravitational wave burst in a network of interferometers, neutrino and optical detectors
Trying to detect the gravitational wave (GW) signal emitted by a type II
supernova is a major challenge for the GW community. Indeed, the corresponding
waveform is not accurately modeled as the supernova physics is very complex; in
addition, all the existing numerical simulations agree that the GW emission is
weak, thus limiting the number of potentially detectable sources.
Consequently, triggering on the GW signal with a confidence level high enough
to claim a detection outright is very difficult, even with the use of a
network of interferometric detectors. On the other hand, one can hope to
benefit from the neutrino and optical emissions associated with the supernova
explosion, in order to discover and study GW radiation in an event already
detected independently. This article aims at presenting some realistic
scenarios for the search of the supernova GW bursts, based on the present
knowledge of the emitted signals and on the results of network data analysis
simulations. Both the direct search and the confirmation of the supernova event
are considered. In addition, some physical studies following the discovery of a
supernova GW emission are also mentioned: from the absolute neutrino mass to
the supernova physics or the black hole signature, the potential spectrum of
discoveries is wide. Comment: Revised version, accepted for publication in Astroparticle Physics.
Cooperation between expert knowledge and data mining discovered knowledge: Lessons learned
Expert systems are built from knowledge traditionally elicited from the human expert. It is precisely knowledge elicitation from the expert that is the bottleneck in expert system construction. On the other hand, a data mining system, which automatically extracts knowledge, needs expert guidance on the successive decisions to be made in each of the system's phases. In this context, expert knowledge and data mining discovered knowledge can cooperate, maximizing their individual capabilities: data mining discovered knowledge can be used as a complementary source of knowledge for the expert system, whereas expert knowledge can be used to guide the data mining process. This article summarizes different examples of systems where there is cooperation between expert knowledge and data mining discovered knowledge, and reports our experience of such cooperation gathered from a medical diagnosis project called Intelligent Interpretation of Isokinetics Data, which we developed. From that experience, a series of lessons was learned throughout project development. Some of these lessons are generally applicable and others pertain exclusively to certain project types.