Discontinuities in pattern inference
This paper deals with the inferrability of classes of E-pattern languages—also referred
to as extended or erasing pattern languages—from positive data in Gold’s
model of identification in the limit. The first main part of the paper shows that
the recently presented negative result on terminal-free E-pattern languages over binary
alphabets does not hold for other alphabet sizes, so that the full class of these
languages is inferrable from positive data if and only if the corresponding terminal
alphabet does not consist of exactly two distinct letters. The second main part yields
the insight that the positive result on terminal-free E-pattern languages over alphabets
with three or four letters cannot be extended to the class of general E-pattern
languages. With regard to larger alphabets, the extensibility remains open.
The proof methods developed for these main results do not directly discuss the
(non-)existence of appropriate learning strategies, but they deal with structural
properties of classes of E-pattern languages, and, in particular, with the problem
of finding telltales for these languages. It is shown that the inferrability of classes
of E-pattern languages is closely connected to some problems on the ambiguity
of morphisms so that the technical contributions of the paper largely consist of
combinatorial insights into morphisms in word monoids.
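To make the central object concrete, here is a minimal sketch of membership testing for a terminal-free E-pattern language: a word belongs to the language of a pattern if some substitution of the variables by words, the empty word included, turns the pattern into that word. The function name and the one-character-per-variable encoding are assumptions made for illustration only. The brute-force search says nothing about the paper's telltale constructions or about learnability.

```python
def in_e_pattern_language(pattern, word):
    """Brute-force membership test for a terminal-free E-pattern language.

    pattern: string whose characters are variable names, e.g. "xyyx"
    word:    string over the terminal alphabet
    Variables may be replaced by the empty word (the "erasing" case).
    """
    def match(i, pos, assign):
        if i == len(pattern):
            # All variables consumed: succeed only if the whole word is covered.
            return pos == len(word)
        var = pattern[i]
        if var in assign:
            # Variable already bound: its image must recur at this position.
            image = assign[var]
            return word.startswith(image, pos) and match(i + 1, pos + len(image), assign)
        # Unbound variable: try every image, starting with the empty word.
        for end in range(pos, len(word) + 1):
            assign[var] = word[pos:end]
            if match(i + 1, end, assign):
                return True
            del assign[var]
        return False

    return match(0, 0, {})


print(in_e_pattern_language("xx", "abab"))    # x -> "ab"
print(in_e_pattern_language("xx", "aba"))     # no substitution works
print(in_e_pattern_language("xyyx", "abba"))  # x -> "a", y -> "b"
```

The search is exponential in the worst case, which is acceptable for a toy illustration but not for long words or patterns.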
A discontinuity in pattern inference
This paper examines the learnability of a major subclass
of E-pattern languages – also known as erasing or extended pattern languages
– in Gold’s learning model: We show that the class of terminal-free
E-pattern languages is inferrable from positive data if the corresponding
terminal alphabet consists of three or more letters. Consequently, the
recently presented negative result for binary alphabets is unique.
PMP: Privacy-Aware Matrix Profile against Sensitive Pattern Inference
The recent rapid development of sensor technology has allowed massive fine-grained time series (TS) data to be collected and has laid the foundation for data-driven services and applications. In the process, data are often shared so that third-party modelers can perform specific time series data mining (TSDM) tasks according to the needs of the data owner. The high resolution of TS brings new challenges in protecting privacy. While meaningful information in high-resolution TS shifts from concrete point values to local shape-based segments, numerous studies have found that long shape-based patterns can contain more sensitive information and may be extracted and misused by a malicious third party. However, the privacy issue for TS patterns is surprisingly seldom explored in the privacy-preserving literature. In this work, we consider a new privacy-preserving problem: preventing malicious inference on long shape-based patterns while preserving short-segment information for utility task performance. To address this challenge, we investigate an alternative approach: sharing the Matrix Profile (MP), which is a non-linear transformation of the original data and a versatile data structure that supports many data mining tasks. We find that while the MP can prevent concrete shape leakage, the canonical correlation in the MP index can still reveal the location of sensitive long patterns. Based on this observation, we design two attacks, named Location Attack and Entropy Attack, to extract the pattern location from the MP. To protect the MP from these two attacks, we propose a Privacy-Aware Matrix Profile (PMP), which perturbs the local correlation and breaks the canonical correlation in the MP index vector. We evaluate the proposed PMP against baseline noise-adding methods through quantitative analysis and real-world case studies to show the effectiveness of the proposed method.
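For readers unfamiliar with the Matrix Profile, the following is a minimal brute-force sketch of the plain (unprotected) structure the abstract refers to: for each length-m subsequence, the profile stores the z-normalized Euclidean distance to its nearest non-trivial neighbor, and the index stores where that neighbor starts. The function names and the exclusion-zone width (m // 2) are assumptions chosen for illustration. The sketch does not implement PMP or either of the attacks.

```python
import math


def znorm(seq):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in seq]


def naive_matrix_profile(ts, m):
    """Return (profile, index) for window length m by O(n^2) comparison."""
    n_sub = len(ts) - m + 1
    subs = [znorm(ts[i:i + m]) for i in range(n_sub)]
    excl = max(1, m // 2)  # assumed exclusion zone to skip trivial self-matches
    profile = [math.inf] * n_sub
    index = [-1] * n_sub
    for i in range(n_sub):
        for j in range(n_sub):
            if abs(i - j) < excl:
                continue
            d = math.dist(subs[i], subs[j])
            if d < profile[i]:
                profile[i] = d
                index[i] = j
    return profile, index


# Toy series in which the shape [0, 1, 0] occurs at positions 0 and 6.
ts = [0, 1, 0, 5, 9, 2, 0, 1, 0, 7, 3, 8]
profile, index = naive_matrix_profile(ts, 3)
print(index[0], index[6])
```

In this toy series the two occurrences of the repeated shape point at each other in the index vector, which is the kind of positional information the abstract says can leak even when the raw shapes are not shared.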
Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping
We propose Quootstrap, a method for extracting quotations, as well as the
names of the speakers who uttered them, from large news corpora. Whereas prior
work has addressed this problem primarily with supervised machine learning, our
approach follows a fully unsupervised bootstrapping paradigm. It leverages the
redundancy present in large news corpora, more precisely, the fact that the
same quotation often appears across multiple news articles in slightly
different contexts. Starting from a few seed patterns, such as ["Q", said S.],
our method extracts a set of quotation-speaker pairs (Q, S), which are in turn
used for discovering new patterns expressing the same quotations; the process
is then repeated with the larger pattern set. Our algorithm is highly scalable,
which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus.
Validating our results against a crowdsourced ground truth, we obtain 90%
precision at 40% recall using a single seed pattern, with significantly higher
recall values for more frequently reported (and thus likely more interesting)
quotations. Finally, we showcase the usefulness of our algorithm's output for
computational social science by analyzing the sentiment expressed in our
extracted quotations.
Comment: Accepted at the 12th International Conference on Web and Social Media
(ICWSM), 201
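The bootstrapping loop described in this abstract can be sketched on a toy scale as follows. This is an illustrative assumption, not the authors' implementation: the `<Q>`/`<S>` placeholder syntax, the speaker regex (capitalized words), and the function names are all hypothetical, and the pattern filtering and scalability machinery a real system would need are omitted.

```python
import re


def pattern_to_regex(pattern):
    """Compile a pattern such as '"<Q>", said <S>.' into a regex."""
    parts = re.split(r"(<Q>|<S>)", pattern)
    out = []
    for part in parts:
        if part == "<Q>":
            out.append(r'(?P<q>[^"]+)')  # quotation: anything but a quote mark
        elif part == "<S>":
            out.append(r"(?P<s>[A-Z]\w*(?: [A-Z]\w*)*)")  # capitalized name
        else:
            out.append(re.escape(part))  # literal context
    return re.compile("".join(out))


def bootstrap(sentences, seed_patterns, rounds=2):
    patterns = set(seed_patterns)
    pairs = set()
    for _ in range(rounds):
        # Step 1: use the current patterns to extract (quotation, speaker) pairs.
        for pattern in patterns:
            regex = pattern_to_regex(pattern)
            for sentence in sentences:
                m = regex.fullmatch(sentence)
                if m:
                    pairs.add((m.group("q"), m.group("s")))
        # Step 2: find known pairs in other contexts and generalize those
        # contexts into new patterns.
        for quote, speaker in pairs:
            for sentence in sentences:
                if quote in sentence and speaker in sentence:
                    patterns.add(
                        sentence.replace(quote, "<Q>").replace(speaker, "<S>")
                    )
    return pairs, patterns


sentences = [
    '"We will win", said Ann Lee.',
    'Ann Lee told reporters: "We will win"',
    'Bob Ray told reporters: "Taxes are too high"',
]
pairs, patterns = bootstrap(sentences, ['"<Q>", said <S>.'])
print(sorted(pairs))
print(sorted(patterns))
```

Here the seed pattern finds the first pair, the second sentence repeats that quotation in a new context and so yields a new pattern, and the new pattern then extracts the pair in the third sentence. This is the redundancy effect the abstract relies on.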
FixMiner: Mining Relevant Fix Patterns for Automated Program Repair
Patching is a common activity in software development. It is generally
performed on a source code base to address bugs or add new functionalities. In
this context, given the recurrence of bugs across projects, the associated
similar patches can be leveraged to extract generic fix actions. While the
literature includes various approaches leveraging similarity among patches to
guide program repair, these approaches often do not yield fix patterns that are
tractable and reusable as actionable input to APR systems. In this paper, we
propose a systematic and automated approach to mining relevant and actionable
fix patterns based on an iterative clustering strategy applied to atomic
changes within patches. The goal of FixMiner is thus to infer separate and
reusable fix patterns that can be leveraged in other patch generation systems.
Our technique, FixMiner, leverages Rich Edit Script which is a specialized tree
structure of the edit scripts that captures the AST-level context of the code
changes. FixMiner uses different tree representations of Rich Edit Scripts for
each round of clustering to identify similar changes. These are abstract syntax
trees, edit actions trees, and code context trees. We have evaluated FixMiner
on thousands of software patches collected from open source projects.
Preliminary results show that we are able to mine accurate patterns,
efficiently exploiting change information in Rich Edit Scripts. We further
integrated the mined patterns into an automated program repair prototype,
PARFixMiner, with which we are able to correctly fix 26 bugs of the Defects4J
benchmark. Beyond this quantitative performance, we show that the mined fix
patterns are sufficiently relevant to produce patches with a high probability
of correctness: 81% of PARFixMiner's generated plausible patches are correct.
Comment: 31 pages, 11 figures
Omics analysis in Caenorhabditis elegans: pattern inference and interpretation
High-throughput molecular technologies have greatly enhanced our understanding of biological processes by characterizing expression changes of genes (microarray and RNA-Seq data) and proteins (proteomics data), or transcription factor targets and epigenetics states (ChIP-chip and ChIP-Seq data). Among them, transcriptome studies based on microarrays or RNA-Seq have the ability to identify genes involved in the response to environmental change or specific stressors, thereby helping us to infer the underlying biological processes.
During my PhD, I mainly focused on transcriptomic data analysis, using in most cases the nematode Caenorhabditis elegans as a model taxon. In particular, I have addressed seven specific projects: i) development of ABSSeq, an improved detection approach of differential gene expression for RNA-Seq data; ii) development of aFold, a method to fully moderate fold-change of RNA-Seq data and to improve gene ranking and visualization; iii) development of WormExp, a knowledge-based approach for interpreting gene sets in C. elegans; iv) exploration of the regulation of the C. elegans immune system using curated data sets from WormExp; v) characterization of putative major effectors (GATA transcription factors) in the C. elegans innate immune system; vi) comparison of the immune response of C. elegans at protein and transcript level.
In general, our work facilitates high-throughput data analysis by improving pattern inference and interpretation, which in practice provides new insights into the immune system of C. elegans.
Interpreting CNN Knowledge via an Explanatory Graph
This paper learns a graphical model, namely an explanatory graph, which
reveals the knowledge hierarchy hidden inside a pre-trained CNN. Considering
that each filter in a conv-layer of a pre-trained CNN usually represents a
mixture of object parts, we propose a simple yet efficient method to
automatically disentangle different part patterns from each filter, and
construct an explanatory graph. In the explanatory graph, each node represents
a part pattern, and each edge encodes co-activation relationships and spatial
relationships between patterns. More importantly, we learn the explanatory
graph for a pre-trained CNN in an unsupervised manner, i.e., without the need to
annotate object parts. Experiments show that each graph node consistently
represents the same object part through different images. We transfer part
patterns in the explanatory graph to the task of part localization, and our
method significantly outperforms other approaches.
Comment: in AAAI 201