Algorithms and implementation of functional dependency discovery in XML : a thesis presented in partial fulfilment of the requirements for the degree of Master of Information Sciences in Information Systems at Massey University
1.1 Background Following the advent of the web, there has been great demand for data interchange between applications over internet infrastructure. XML (eXtensible Markup Language) provides a structured representation of data, empowered by broad adoption and easy deployment. A subset of SGML (Standard Generalized Markup Language), XML has been standardized by the World Wide Web Consortium (W3C) [Bray et al., 2004]. It has become the prevalent data exchange format on the World Wide Web and is increasingly significant for storing semi-structured data. Since its initial release in 1996, it has evolved and been applied extensively in all fields where the exchange of structured documents in electronic form is required. With the growing popularity of XML, the issue of functional dependency in XML has recently received well-deserved attention. The driving force for the study of dependencies in XML is that they are as crucial to XML schema design as functional dependencies are to relational database (RDB) design [Abiteboul et al., 1995]
An Algorithm for Pattern Discovery in Time Series
We present a new algorithm for discovering patterns in time series and other
sequential data. We exhibit a reliable procedure for building the minimal set
of hidden, Markovian states that is statistically capable of producing the
behavior exhibited in the data -- the underlying process's causal states.
Unlike conventional methods for fitting hidden Markov models (HMMs) to data,
our algorithm makes no assumptions about the process's causal architecture (the
number of hidden states and their transition structure), but rather infers it
from the data. It starts with assumptions of minimal structure and introduces
complexity only when the data demand it. Moreover, the causal states it infers
have important predictive optimality properties that conventional HMM states
lack. We introduce the algorithm, review the theory behind it, prove its
asymptotic reliability, use large deviation theory to estimate its rate of
convergence, and compare it to other algorithms which also construct HMMs from
data. We also illustrate its behavior on an example process, and report
selected numerical results from an implementation.
Comment: 26 pages, 5 figures, 5 tables; http://www.santafe.edu/projects/CompMech
Added discussion of algorithm parameters; improved treatment of convergence and
time complexity; added comparison to older method
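The state-splitting idea behind the algorithm can be illustrated with a minimal sketch (a simplification, not the authors' implementation): group the length-L histories of a symbol sequence by their empirical next-symbol distribution, and merge histories whose distributions agree within a tolerance into one candidate causal state. The function name and the tolerance-based merge below are illustrative stand-ins for the statistical tests the actual algorithm uses.

```python
from collections import defaultdict

def candidate_causal_states(seq, L=2, tol=0.1):
    """Group length-L histories of `seq` by their empirical
    next-symbol distribution; histories whose distributions agree
    within `tol` share one candidate causal state (a crude stand-in
    for the hypothesis tests used in the real algorithm)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq) - L):
        hist, nxt = seq[i:i + L], seq[i + L]
        counts[hist][nxt] += 1
    dists = {}
    for hist, c in counts.items():
        total = sum(c.values())
        dists[hist] = {s: n / total for s, n in c.items()}
    states = []  # list of (representative distribution, member histories)
    for hist, d in sorted(dists.items()):
        for rep, members in states:
            if all(abs(d.get(s, 0) - rep.get(s, 0)) <= tol
                   for s in set(d) | set(rep)):
                members.append(hist)
                break
        else:
            states.append((d, [hist]))
    return [members for _, members in states]

# A period-2 process "0101..." collapses to two candidate states:
# after "01" the next symbol is always "0", after "10" always "1".
print(candidate_causal_states("01" * 50, L=2))
```

Structure is introduced only as the data demand it: with a larger `tol` (or a shorter sequence) histories merge into fewer states, mirroring the paper's "start minimal, split when forced" principle.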
Structure induction by lossless graph compression
This work is motivated by the necessity to automate the discovery of
structure in vast and ever-growing collections of relational data commonly
represented as graphs, for example genomic networks. A novel algorithm, dubbed
Graphitour, for structure induction by lossless graph compression is presented
and illustrated by a clear and broadly known case of nested structure in a DNA
molecule. This work extends to graphs some well established approaches to
grammatical inference previously applied only to strings. The bottom-up graph
compression problem is related to the maximum cardinality (non-bipartite)
matching problem. The algorithm accepts a variety of graph
types including directed graphs and graphs with labeled nodes and arcs. The
resulting structure could be used for representation and classification of
graphs.
Comment: 10 pages, 7 figures, 2 tables; published in Proceedings of the Data
Compression Conference, 200
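One bottom-up compression step of the kind the abstract describes can be sketched as follows. This is a simplification: it uses a greedy matching instead of a true maximum cardinality matching, and the node labels and merge scheme are illustrative, not Graphitour's actual data structures.

```python
from collections import Counter

def compress_step(nodes, edges, labels):
    """One bottom-up compression step: pick the most frequent edge
    type (by endpoint labels), greedily match node-disjoint edges of
    that type, and contract each matched pair into one composite
    node. Greedy matching is a stand-in for maximum matching."""
    etype = lambda u, v: tuple(sorted((labels[u], labels[v])))
    freq = Counter(etype(u, v) for u, v in edges)
    if not freq:
        return nodes, edges, labels
    target = freq.most_common(1)[0][0]
    used, matched = set(), []
    for u, v in sorted(edges):  # deterministic greedy matching
        if etype(u, v) == target and u not in used and v not in used:
            used.update((u, v))
            matched.append((u, v))
    merged = {x: f"{u}+{v}" for u, v in matched for x in (u, v)}
    new_nodes = {merged.get(n, n) for n in nodes}
    new_labels = {merged.get(n, n): ("+".join(target) if n in merged
                                     else labels[n]) for n in nodes}
    new_edges = {tuple(sorted((merged.get(u, u), merged.get(v, v))))
                 for u, v in edges if merged.get(u, u) != merged.get(v, v)}
    return new_nodes, new_edges, new_labels

# A DNA-like chain: repeated A-T pairs are contracted into composite
# nodes, exposing the nested structure one level up.
nodes = {0, 1, 2, 3}
edges = {(0, 1), (1, 2), (2, 3)}
labels = {0: "A", 1: "T", 2: "A", 3: "T"}
print(compress_step(nodes, edges, labels))
```

Iterating the step until no edge type repeats yields a hierarchy of composite nodes, which is the lossless "grammar" the abstract alludes to.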
ExplaiNE: An Approach for Explaining Network Embedding-based Link Predictions
Networks are powerful data structures, but are challenging to work with for
conventional machine learning methods. Network Embedding (NE) methods attempt
to resolve this by learning vector representations for the nodes, for
subsequent use in downstream machine learning tasks.
Link Prediction (LP) is one such downstream machine learning task that is an
important use case and popular benchmark for NE methods. Unfortunately, while
NE methods perform exceedingly well at this task, they are lacking in
transparency as compared to simpler LP approaches.
We introduce ExplaiNE, an approach to offer counterfactual explanations for
NE-based LP methods, by identifying existing links in the network that explain
the predicted links. ExplaiNE is applicable to a broad class of NE algorithms.
An extensive empirical evaluation for the NE method `Conditional Network
Embedding' in particular demonstrates its accuracy and scalability.
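The counterfactual idea can be illustrated with a deliberately tiny stand-in for an NE method (binary adjacency rows as "embeddings" and common-neighbour counting as the link score, not CNE's actual model): score each existing edge by how much its removal lowers the predicted score of the candidate link, and report the edges with the largest drops as the explanation.

```python
def embed(adj):
    """Toy 'embedding': each node is its binary adjacency row --
    an illustrative stand-in for a real NE method."""
    return {u: {v: 1.0 for v in nbrs} for u, nbrs in adj.items()}

def score(emb, i, j):
    # Link-prediction score: inner product of the two sparse vectors
    # (here, the number of common neighbours of i and j).
    return sum(w * emb[j].get(k, 0.0) for k, w in emb[i].items())

def explain(adj, i, j):
    """Rank existing edges by the drop in score(i, j) caused by
    their removal; the largest drops 'explain' the prediction."""
    base = score(embed(adj), i, j)
    ranked = []
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                cut = {n: set(nbrs) for n, nbrs in adj.items()}
                cut[u].discard(v)
                cut[v].discard(u)
                ranked.append((base - score(embed(cut), i, j), (u, v)))
    return sorted(ranked, reverse=True)

# Nodes 0 and 3 share neighbours 1 and 2; the four edges to those
# shared neighbours explain the predicted link (0, 3), while the
# unrelated edge (4, 5) contributes nothing.
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
print(explain(adj, 0, 3)[:2])
```

ExplaiNE avoids this brute-force re-embedding by estimating the score change analytically via gradients of the NE objective; the exhaustive loop above only makes the counterfactual question concrete.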
Differential analysis of biological networks
In cancer research, the comparison of gene expression or DNA methylation
networks inferred from healthy controls and patients can lead to the discovery
of biological pathways associated with the disease. As a cancer progresses, its
signalling and control networks are subject to some degree of localised
re-wiring. Being able to detect disrupted interaction patterns induced by the
presence or progression of the disease can lead to the discovery of novel
molecular diagnostic and prognostic signatures. Currently there is a lack of
scalable statistical procedures for two-network comparisons aimed at detecting
localised topological differences. We propose the dGHD algorithm, a methodology
for detecting differential interaction patterns in two-network comparisons. The
algorithm relies on a statistic, the Generalised Hamming Distance (GHD), for
assessing the degree of topological difference between networks and evaluating
its statistical significance. dGHD builds on a non-parametric permutation
testing framework but achieves computational efficiency through an asymptotic
normal approximation. We show that the GHD is able to detect more subtle
topological differences compared to a standard Hamming distance between
networks. This results in the dGHD algorithm achieving high performance in
simulation studies as measured by sensitivity and specificity. An application
to the problem of detecting differential DNA co-methylation subnetworks
associated with ovarian cancer demonstrates the potential benefits of the
proposed methodology for discovering network-derived biomarkers associated with
a trait of interest.
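The core two-network machinery can be sketched in a naive permutation form. The mean-centred statistic below is one common variant of the Generalised Hamming Distance, not necessarily the paper's exact definition, and the explicit permutation loop is precisely what dGHD replaces with a closed-form normal approximation for scalability.

```python
import random

def ghd(A, B):
    """Mean-centred Generalised Hamming Distance between two networks
    on the same n nodes (one common form):
    GHD = 1/(n(n-1)) * sum_{i != j} (a'_ij - b'_ij)^2,
    where a'_ij, b'_ij are mean-centred adjacency entries."""
    n = len(A)
    ma = sum(map(sum, A)) / (n * (n - 1))
    mb = sum(map(sum, B)) / (n * (n - 1))
    return sum(((A[i][j] - ma) - (B[i][j] - mb)) ** 2
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))

def perm_test(A, B, n_perm=500, seed=0):
    """Permutation test: relabel the nodes of B uniformly at random
    and recompute the GHD. The p-value is the share of permuted
    values at or below the observed one, i.e. it asks whether the
    two topologies are closer than random relabelling would give."""
    rng = random.Random(seed)
    obs, n = ghd(A, B), len(A)
    hits = 0
    for _ in range(n_perm):
        p = list(range(n))
        rng.shuffle(p)
        Bp = [[B[p[i]][p[j]] for j in range(n)] for i in range(n)]
        hits += ghd(A, Bp) <= obs
    return obs, hits / n_perm

# Identical 6-node rings: GHD is 0 and only the ring's few
# automorphisms tie it, so the p-value is small.
ring = [[1 if abs(i - j) in (1, 5) else 0 for j in range(6)]
        for i in range(6)]
print(perm_test(ring, ring))
```

Localising *where* two networks differ (the paper's differential subnetworks) then amounts to repeating such a comparison while peeling off the least differential nodes, which the explicit loop above would make prohibitively slow without the asymptotic approximation.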
Understanding Learned Models by Identifying Important Features at the Right Resolution
In many application domains, it is important to characterize how complex
learned models make their decisions across the distribution of instances. One
way to do this is to identify the features and interactions among them that
contribute to a model's predictive accuracy. We present a model-agnostic
approach to this task that makes the following specific contributions. Our
approach (i) tests feature groups, in addition to base features, and tries to
determine the level of resolution at which important features can be
determined, (ii) uses hypothesis testing to rigorously assess the effect of
each feature on the model's loss, (iii) employs a hierarchical approach to
control the false discovery rate when testing feature groups and individual
base features for importance, and (iv) uses hypothesis testing to identify
important interactions among features and feature groups. We evaluate our
approach by analyzing random forest and LSTM neural network models learned in
two challenging biomedical applications.
Comment: First two authors contributed equally to this work. Accepted for
presentation at the Thirty-Third AAAI Conference on Artificial Intelligence
(AAAI-19).
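The testing-based contributions (ii)–(iv) can be sketched in a model-agnostic form: a permutation test scores one feature group at a time against the model's loss, and a Benjamini–Hochberg step (a standard FDR control, used here as a stand-in for the paper's hierarchical procedure) decides which groups survive. All names and the toy data are illustrative.

```python
import random

def perm_pvalue(predict, X, y, cols, n_perm=200, seed=0):
    """Permutation test for one feature group `cols`: shuffle those
    columns jointly (breaking their link to y while preserving their
    joint marginal), recompute the loss, and take as p-value the
    share of permutations whose loss is no worse than the baseline.
    Important groups make almost every permuted loss larger."""
    rng = random.Random(seed)
    mse = lambda data: sum((predict(r) - t) ** 2
                           for r, t in zip(data, y)) / len(y)
    base, hits = mse(X), 0
    for _ in range(n_perm):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        Xp = [list(r) for r in X]
        for c in cols:
            for i, j in enumerate(idx):
                Xp[i][c] = X[j][c]
        hits += mse(Xp) <= base
    return (hits + 1) / (n_perm + 1)

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= alpha * k / m; controls the false discovery rate."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m, k = len(pvals), 0
    for rank, i in enumerate(order, 1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return set(order[:k])

# Toy check: y depends only on feature 0, so only feature 0 survives.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(60)]
y = [r[0] for r in X]
p0 = perm_pvalue(lambda r: r[0], X, y, [0])
p1 = perm_pvalue(lambda r: r[0], X, y, [1])
print(p0, p1, benjamini_hochberg([p0, p1]))
```

Testing `cols` with more than one index is what lets coarse feature groups be screened before descending to individual base features, which is the resolution-finding idea of the paper in miniature.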