DALE: Differential Accumulated Local Effects for efficient and accurate global explanations
Accumulated Local Effects (ALE) is a method for accurately estimating feature
effects, overcoming fundamental failure modes of earlier methods,
such as Partial Dependence Plots. However, ALE's approximation, i.e. the method
for estimating ALE from the limited samples of the training set, faces two
weaknesses. First, it does not scale well in cases where the input has high
dimensionality, and, second, it is vulnerable to out-of-distribution (OOD)
sampling when the training set is relatively small. In this paper, we propose a
novel ALE approximation, called Differential Accumulated Local Effects (DALE),
which can be used in cases where the ML model is differentiable and an
automatic differentiation framework is available. Our proposal has significant
computational advantages, making feature effect estimation applicable to
high-dimensional Machine Learning scenarios with near-zero computational
overhead. Furthermore, DALE does not create artificial points for calculating
the feature effect, resolving misleading estimations due to OOD sampling.
Finally, we formally prove that, under some hypotheses, DALE is an unbiased
estimator of ALE and we present a method for quantifying the standard error of
the explanation. Experiments using both synthetic and real datasets demonstrate
the value of the proposed approach.
Comment: 16 pages, to be published in Asian Conference on Machine Learning
(ACML) 202
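The idea in the DALE abstract above (average the model's partial derivative at the training points inside each bin, then accumulate across bins) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name dale_effect, the fixed-width binning, and the toy model are assumptions made for illustration, and the gradient is written by hand here, whereas in practice it would come from an automatic differentiation framework.

```python
import numpy as np

def dale_effect(x_s, grads, n_bins=10):
    """Sketch of a derivative-based ALE estimate for one feature.

    x_s   : (N,) values of the feature of interest at the training points
    grads : (N,) partial derivative of the model w.r.t. that feature,
            evaluated at the same training points (no artificial points)
    Returns the bin edges and the accumulated effect at each edge.
    """
    edges = np.linspace(x_s.min(), x_s.max(), n_bins + 1)
    # Bin index in [0, n_bins - 1] for every sample (interior edges only).
    idx = np.digitize(x_s, edges[1:-1])
    bin_mean = np.zeros(n_bins)
    for k in range(n_bins):
        mask = idx == k
        if mask.any():  # empty bins contribute zero effect in this sketch
            bin_mean[k] = grads[mask].mean()
    # Accumulate: mean local effect in each bin times the bin width.
    effect = np.concatenate([[0.0], np.cumsum(bin_mean * np.diff(edges))])
    return edges, effect

# Hypothetical toy model f(x0, x1) = x0**2 + x0 * x1, so df/dx0 = 2*x0 + x1.
x0 = np.linspace(0.0, 1.0, 1001)
x1 = np.zeros_like(x0)
edges, effect = dale_effect(x0, 2 * x0 + x1, n_bins=10)
```

On this toy data the accumulated effect at each bin edge is close to the edge value squared, which is the expected main effect of x0 up to the usual ALE centering constant (omitted here).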
RHALE: Robust and Heterogeneity-aware Accumulated Local Effects
Accumulated Local Effects (ALE) is a widely-used explainability method for
isolating the average effect of a feature on the output, because it handles
cases with correlated features well. However, it has two limitations. First, it
does not quantify the deviation of instance-level (local) effects from the
average (global) effect, known as heterogeneity. Second, for estimating the
average effect, it partitions the feature domain into user-defined, fixed-sized
bins, where different bin sizes may lead to inconsistent ALE estimations. To
address these limitations, we propose Robust and Heterogeneity-aware ALE
(RHALE). RHALE quantifies the heterogeneity by considering the standard
deviation of the local effects and automatically determines an optimal
variable-size bin-splitting. In this paper, we prove that to achieve an
unbiased approximation of the standard deviation of local effects within each
bin, bin splitting must follow a set of sufficient conditions. Based on these
conditions, we propose an algorithm that automatically determines the optimal
partitioning, balancing the estimation bias and variance. Through evaluations
on synthetic and real datasets, we demonstrate the superiority of RHALE
compared to other methods, including the advantages of automatic bin splitting,
especially in cases with correlated features.
Comment: Accepted at ECAI 2023 (European Conference on Artificial
Intelligence)
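The heterogeneity measure described in the RHALE abstract above (the standard deviation of the instance-level effects around the bin average) can be sketched as follows. This is only an illustration under stated assumptions: the function name bin_heterogeneity and the toy model are hypothetical, and the bin edges are supplied by the caller, so the automatic variable-size bin splitting that the paper proposes is not implemented here.

```python
import numpy as np

def bin_heterogeneity(x_s, local_effects, edges):
    """Per-bin mean and standard deviation of local effects.

    x_s           : (N,) values of the feature of interest
    local_effects : (N,) instance-level effects (e.g. partial derivatives)
    edges         : (K + 1,) bin edges chosen by the caller (assumption)
    """
    n_bins = len(edges) - 1
    idx = np.digitize(x_s, edges[1:-1])  # bin index in [0, n_bins - 1]
    mean = np.full(n_bins, np.nan)
    std = np.full(n_bins, np.nan)
    for k in range(n_bins):
        vals = local_effects[idx == k]
        if vals.size:  # leave NaN for empty bins
            mean[k] = vals.mean()
            std[k] = vals.std()
    return mean, std

# Hypothetical toy model f(x0, x1) = x0 * x1: the local effect of x0 is x1.
x0 = np.linspace(0.0, 1.0, 1000)
x1 = np.where(np.arange(1000) % 2 == 0, 1.0, -1.0)
mean, std = bin_heterogeneity(x0, x1, np.linspace(0.0, 1.0, 5))
```

In this toy case the average effect in every bin is zero while the standard deviation is one, which is the kind of situation where an average-only plot hides the interaction between the two features.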
Efficient evaluation of generalized path pattern queries on XML data
Finding the occurrences of structural patterns in XML data is a key operation in XML query processing. Existing algorithms for this operation focus almost exclusively on path-patterns or tree-patterns. Requirements for flexible querying of XML data have recently motivated the introduction of query languages that allow a partial specification of path-patterns in a query. In this paper, we focus on the efficient evaluation of partial path queries, a generalization of path pattern queries. Our approach explicitly deals with repeated labels (that is, multiple occurrences of the same label in a query). We show that partial path queries can be represented as rooted dags for which a topological ordering of the nodes exists. We present three algorithms for the efficient evaluation of these queries under the indexed streaming evaluation model. The first one exploits a structural summary of data to generate a set of path-patterns that together are equivalent to a partial path query. To evaluate these path-patterns, we extend PathStack so that it can work on path-patterns with repeated labels. The second one extracts a spanning tree from the query dag, uses a stack-based algorithm to find the matches of the root-to-leaf paths in the tree, and merge-joins the matches to compute the answer. Finally, the third one exploits multiple pointers of stack entries and a topological ordering of the query dag to apply a stack-based holistic technique. An analysis of the algorithms and an extensive experimental evaluation show that the holistic algorithm outperforms the other ones.
TarBase 6.0: capturing the exponential growth of miRNA targets with experimental support
As the relevant literature and the number of experiments increase at a superlinear rate, databases that curate and collect experimentally verified microRNA (miRNA) targets have gradually emerged. These databases attempt to provide efficient access to this wealth of experimental data, which is scattered across thousands of manuscripts. The aim of TarBase 6.0 (http://www.microrna.gr/tarbase) is to meet this challenge by providing a significant increase in available miRNA targets derived from all contemporary experimental techniques (gene-specific and high-throughput), while incorporating a powerful set of tools in a user-friendly interface. TarBase 6.0 hosts detailed information for each miRNA–gene interaction, ranging from miRNA- and gene-related facts to information specific to their interaction, the experimental validation methodologies and their outcomes. All database entries are enriched with function-related data, as well as general information derived from external databases such as UniProt, Ensembl and RefSeq. DIANA microT miRNA target prediction scores and the relevant prediction details are available for each interaction. TarBase 6.0 hosts the largest collection of manually curated, experimentally validated miRNA–gene interactions (more than 65 000 targets), presenting a 16.5–175-fold increase over other available manually curated databases.
miRGen 2.0: a database of microRNA genomic information and regulation
MicroRNAs are small, non-protein-coding RNA molecules known to regulate the expression of genes by binding to the 3′UTR region of mRNAs. MicroRNAs are produced from longer transcripts, which can code for more than one mature miRNA. miRGen 2.0 is a database that aims to provide comprehensive information about the position of human and mouse microRNA coding transcripts and their regulation by transcription factors, including a unique compilation of both predicted and experimentally supported data. Expression profiles of microRNAs in several tissues and cell lines, single nucleotide polymorphism locations, microRNA target prediction on protein-coding genes and mapping of miRNA targets of co-regulated miRNAs on biological pathways are also integrated into the database and user interface. The miRGen database will be continuously maintained and freely available at http://www.microrna.gr/mirgen/