Finding Frequent Subsequences in a Set of Texts
Given a set of strings, the Common Subsequence Automaton accepts all common subsequences of these strings. Such an automaton can be deduced from other automata, like the Directed Acyclic Subsequence Graph or the Subsequence Automaton. In this paper, we introduce some new issues in text algorithms on the basis of problems related to common subsequences. Firstly, we give an overview of the different existing automata, focusing on their similarities and differences. Secondly, we present a new automaton, the Constrained Subsequence Automaton, which extends the Common Subsequence Automaton by adding an integer parameter called the quorum.
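As a concrete reference point, the acceptance condition behind such automata can be stated without any automaton at all: a string is a common subsequence iff it is a subsequence of every string in the set, and the quorum constraint relaxes "every" to "at least q". A minimal Python sketch (function names are mine, not from the paper):

```python
def is_subsequence(t, s):
    """Greedy scan: t is a subsequence of s iff each character of t
    can be matched left-to-right in s."""
    it = iter(s)
    return all(c in it for c in t)  # 'in' advances the iterator

def is_common_subsequence(t, strings):
    """t is a common subsequence iff it is a subsequence of every string."""
    return all(is_subsequence(t, s) for s in strings)

def meets_quorum(t, strings, q):
    """Quorum variant: t need only be a subsequence of at least q strings."""
    return sum(is_subsequence(t, s) for s in strings) >= q
```

These checks run in linear time per string; the point of the automata in the paper is to answer many such queries without rescanning the input strings.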
Subsequence Automata with Default Transitions
Let $S$ be a string of length $n$ with characters from an alphabet of size
$\sigma$. The \emph{subsequence automaton} of $S$ (often called the
\emph{directed acyclic subsequence graph}) is the minimal deterministic finite
automaton accepting all subsequences of $S$. A straightforward construction
shows that the size (number of states and transitions) of the subsequence
automaton is $O(n\sigma)$ and that this bound is asymptotically optimal.
In this paper, we consider subsequence automata with \emph{default
transitions}, that is, special transitions to be taken only if none of the
regular transitions match the current character, and which do not consume the
current character. We show that with default transitions, much smaller
subsequence automata are possible, and provide a full trade-off between the
size of the automaton and the \emph{delay}, i.e., the maximum number of
consecutive default transitions followed before consuming a character.
Specifically, given any integer parameter $\tau$, $1 < \tau \leq \sigma$, we
present a subsequence automaton with default transitions of size
$O(n\tau\log_{\tau}\sigma)$ and delay $O(\log_{\tau}\sigma)$. Hence, with $\tau = 2$ we
obtain an automaton of size $O(n\log\sigma)$ and delay $O(\log\sigma)$. On
the other extreme, with $\tau = \sigma$, we obtain an automaton of size $O(n\sigma)$ and delay $O(1)$, thus matching the bound for the standard subsequence
automaton construction. Finally, we generalize the result to multiple strings.
The key component of our result is a novel hierarchical automata construction
of independent interest.
Comment: Corrected typo
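The straightforward construction mentioned above is commonly realized as a next-occurrence table: state $i$ means "the first $i$ characters of the string have been consumed", and the transition on character $c$ jumps just past the next occurrence of $c$. A sketch in Python (note that this table is the unminimized automaton; the true subsequence automaton additionally merges equivalent states):

```python
def subsequence_automaton(s):
    """Next-occurrence table: nxt[i][c] = smallest j > i with s[j-1] == c.
    States 0..n; total size is O(n * sigma), matching the straightforward
    bound from the abstract."""
    n = len(s)
    nxt = [dict() for _ in range(n + 1)]
    # Fill right-to-left: state n has no outgoing transitions.
    for i in range(n - 1, -1, -1):
        nxt[i] = dict(nxt[i + 1])
        nxt[i][s[i]] = i + 1
    return nxt

def accepts(nxt, t):
    """t is a subsequence iff every transition exists along the way."""
    state = 0
    for c in t:
        if c not in nxt[state]:
            return False
        state = nxt[state][c]
    return True
```

A default transition would replace many of the copied entries in `nxt[i]` with a single fallback pointer, which is exactly the size/delay trade-off the paper quantifies.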
Improving legibility of natural deduction proofs is not trivial
In formal proof checking environments such as Mizar it is not merely the
validity of mathematical formulas that is evaluated in the process of adoption
to the body of accepted formalizations, but also the readability of the proofs
that witness validity. As in the case of computer programs, such proof scripts may
sometimes be more and sometimes be less readable. To better understand the
notion of readability of formal proofs, and to assess and improve their
readability, we propose in this paper a method of improving proof readability
based on Behaghel's First Law of sentence structure. Our method maximizes the
number of local references to the directly preceding statement in a proof
linearisation. We show that the underlying optimization problem is NP-complete.
Comment: 33 pages
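To make the optimization target concrete, here is a toy model under my own assumptions (not the paper's Mizar machinery): a proof is a set of steps with dependencies, a linearisation is a topological order of the steps, and the score counts steps that reference the directly preceding statement. Since the optimization problem is NP-complete, the sketch simply brute-forces all orders, which is feasible only for tiny proofs:

```python
from itertools import permutations

def is_valid(order, deps):
    """A linearisation is valid if every referenced step comes earlier."""
    pos = {s: i for i, s in enumerate(order)}
    return all(pos[d] < pos[s] for s in deps for d in deps[s])

def local_refs(order, deps):
    """Count steps that reference the directly preceding statement."""
    pos = {s: i for i, s in enumerate(order)}
    return sum(1 for s in order if any(pos[d] == pos[s] - 1 for d in deps[s]))

def best_linearisation(deps):
    """Brute force over all topological orders (exponential in general)."""
    best = max((o for o in permutations(list(deps)) if is_valid(o, deps)),
               key=lambda o: local_refs(o, deps))
    return best, local_refs(best, deps)
```

In the diamond-shaped example below, at most one of the two middle steps can immediately follow the shared premise, so the optimum is 2 local references, not 3.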
Specifying ODP computational objects in Z
The computational viewpoint contained within the Reference Model of Open Distributed Processing (RM-ODP) shows how collections of objects can be configured within a distributed system to enable interworking. It prescribes certain capabilities that such objects are expected to possess and structuring rules that apply to how these objects can be configured with one another. This paper highlights how the specification language Z can be used to formalise these capabilities and the associated structuring rules, thereby enabling specifications of ODP systems from the computational viewpoint to be achieved.
Improving the smoothed complexity of FLIP for max cut problems
Finding locally optimal solutions for max-cut and max-$k$-cut are well-known
PLS-complete problems. An instinctive approach to finding such a locally
optimum solution is the FLIP method. Even though FLIP requires exponential time
in worst-case instances, it tends to terminate quickly in practical instances.
To explain this discrepancy, the run-time of FLIP has been studied in the
smoothed complexity framework. Etscheid and R\"{o}glin showed that the smoothed
complexity of FLIP for max-cut in arbitrary graphs is quasi-polynomial. Angel,
Bubeck, Peres, and Wei showed that the smoothed complexity of FLIP for max-cut
in complete graphs is $O(\varphi^5 n^{15.1})$, where $\varphi$ is an upper bound on
the random edge-weight density and $n$ is the number of vertices in the input
graph.
While Angel et al.'s result showed the first polynomial smoothed complexity,
they also conjectured that their run-time bound is far from optimal. In this
work, we make substantial progress towards improving the run-time bound. We
prove that the smoothed complexity of FLIP in complete graphs is $O(\varphi n^{7.83})$. Our results are based on a carefully chosen matrix whose rank
captures the run-time of the method along with improved rank bounds for this
matrix and an improved union bound based on this matrix. In addition, our
techniques provide a general framework for analyzing FLIP in the smoothed
framework. We illustrate this general framework by showing that the smoothed
complexity of FLIP for max-$3$-cut in complete graphs is polynomial and for
max-$k$-cut in arbitrary graphs is quasi-polynomial. We believe that our
techniques should also be of interest towards addressing the smoothed
complexity of FLIP for max-$k$-cut in complete graphs for larger constants $k$.
Comment: 36 pages
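For reference, the FLIP method itself is easy to state: starting from any bipartition, repeatedly move a single vertex to the other side whenever that strictly increases the cut value, and stop at a local optimum. A minimal sketch, assuming a symmetric weight matrix `w` with zero diagonal (smoothed analysis would draw these weights from bounded-density distributions):

```python
def cut_value(w, side):
    """Total weight of edges crossing the bipartition given by side[v] in {0,1}."""
    n = len(side)
    return sum(w[u][v] for u in range(n) for v in range(u + 1, n)
               if side[u] != side[v])

def flip(w, side):
    """FLIP local search: moving v across the cut changes the cut value by
    (weight to same-side neighbours) - (weight to cross neighbours);
    flip while some move is strictly improving."""
    n = len(side)
    improved = True
    while improved:
        improved = False
        for v in range(n):
            gain = sum(w[v][u] * (1 if side[u] == side[v] else -1)
                       for u in range(n) if u != v)
            if gain > 0:
                side[v] ^= 1
                improved = True
    return side
```

Each iteration strictly increases the cut value, so termination is guaranteed; the papers above bound how many such improving moves can occur under random perturbations of the weights.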
Interpreting and using CPDAGs with background knowledge
We develop terminology and methods for working with maximally oriented
partially directed acyclic graphs (maximal PDAGs). Maximal PDAGs arise from
imposing restrictions on a Markov equivalence class of directed acyclic graphs,
or equivalently on its graphical representation as a completed partially
directed acyclic graph (CPDAG), for example when adding background knowledge
about certain edge orientations. Although maximal PDAGs often arise in
practice, causal methods have been mostly developed for CPDAGs. In this paper,
we extend such methodology to maximal PDAGs. In particular, we develop
methodology to read off possible ancestral relationships, we introduce a
graphical criterion for covariate adjustment to estimate total causal effects,
and we adapt the IDA and joint-IDA frameworks to estimate multi-sets of
possible causal effects. We also present a simulation study that illustrates
the gain in identifiability of total causal effects as the background knowledge
increases. All methods are implemented in the R package pcalg.
Comment: 17 pages, 6 figures, UAI 2017
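One of the ancestral-relationship queries mentioned above can be phrased as simple graph reachability: x is a possible ancestor of y in a PDAG if there is a path from x to y on which every edge is either undirected or directed towards y. The sketch below is my simplification of that idea (not the paper's full criterion), using a hypothetical edge-set representation:

```python
from collections import deque

def possible_descendants(x, directed, undirected, nodes):
    """BFS along edges that are either directed away from the current node
    (a -> b) or undirected (a - b); x is then a possible ancestor of every
    node returned. 'directed' holds (tail, head) pairs, 'undirected' holds
    frozenset pairs."""
    seen = {x}
    queue = deque([x])
    while queue:
        a = queue.popleft()
        for b in nodes:
            if b not in seen and ((a, b) in directed
                                  or frozenset((a, b)) in undirected):
                seen.add(b)
                queue.append(b)
    seen.discard(x)
    return seen
```

Because undirected edges are traversable in both directions, background knowledge that orients an edge can only shrink these sets, which is the identifiability gain the simulation study measures.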
Minimally supervised induction of morphology through bitexts
A knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment, and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. Consequently, there have been many attempts to reduce this cost through the development of unsupervised or minimally supervised algorithms and learning methods for the acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems.
Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner, but one that is more linguistically informed than previous unsupervised approaches. That is, this study attempts to induce, from an unannotated text, clusters of words that are inflectional variants of each other. A set of inflectional suffixes by part of speech is then induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one language (the source language) to another (the target language). This approach has the further advantage of allowing a reduction in the amount of training data without a significant degradation in performance, making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target, for ease of evaluation and for certain typological properties of German. The two main tasks, clustering and segmentation, are approached sequentially, with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed the current roster of unsupervised or minimally supervised approaches to morphology acquisition, this work attempts to integrate more learning methods than previous studies. Furthermore, it attempts to learn inflectional morphology as opposed to derivational morphology, which is a crucial distinction in linguistics.
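As an illustration of the second step (inducing suffixes from a cluster of inflectional variants), a deliberately crude sketch: take the longest common prefix of the cluster as the stem and the remainders as candidate suffixes. This is my simplification, not the dissertation's method, shown on a German verb paradigm since German is the target language:

```python
import os

def induce_suffixes(cluster):
    """Treat the cluster's longest common prefix as a crude stem and
    collect the remainders as candidate inflectional suffixes."""
    stem = os.path.commonprefix(cluster)
    return stem, sorted({w[len(stem):] for w in cluster})
```

Real stems are not always a literal common prefix (umlaut and stem alternations break this), which is one reason the clustering step needs the cross-lingual signal from the aligned bitext rather than string similarity alone.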