Optimal string clustering based on a Laplace-like mixture and EM algorithm on a set of strings
In this study, we address the problem of clustering string data in an
unsupervised manner by developing a theory of a mixture model and an EM
algorithm for string data based on probability theory on a topological monoid
of strings developed in our previous studies. We first construct a parametric
distribution on a set of strings, modeled on the Laplace distribution on the
real numbers, and establish its basic properties. This Laplace-like
distribution has two parameters: a string that represents the location of the
distribution and a positive real number that represents the dispersion. It is
difficult to write the maximum likelihood estimators of the parameters
explicitly because the log-likelihood function is complicated and one of its
variables is a string; however, we construct estimators that almost surely
converge to the maximum likelihood estimators as the number of observed strings
increases and demonstrate that they are strongly consistent estimators of
the parameters. Next, we develop an iterative algorithm for estimating the
parameters of the mixture model of the Laplace-like distributions and
demonstrate that the algorithm almost surely converges to the EM algorithm for
the Laplace-like mixture and strongly consistently estimates its parameters as
the numbers of observed strings and iterations increase. Finally, we derive a
procedure for unsupervised string clustering from the Laplace-like mixture that
is asymptotically optimal in the sense that the posterior probability of making
correct classifications is maximized.
Comment: 56 pages
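The clustering rule summarized in the abstract above (assign each string to the mixture component with the largest posterior probability) can be illustrated with a small sketch. This is not the authors' construction: it assumes the Levenshtein edit distance as the distance on strings, uses an unnormalized weight exp(-d/dispersion) in place of the paper's Laplace-like probability function, ignores normalizing constants, and takes the component parameters as given rather than estimating them by EM. The names `edit_distance`, `kernel`, `posteriors`, and `assign` are hypothetical.

```python
import math


def edit_distance(s, t):
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute or match
        prev = curr
    return prev[-1]


def kernel(s, location, dispersion):
    """Assumed, unnormalized Laplace-style weight exp(-d(s, location) / dispersion)."""
    return math.exp(-edit_distance(s, location) / dispersion)


def posteriors(s, components):
    """Posterior weights of s; components is a list of (mixing_weight, location, dispersion)."""
    weights = [pi * kernel(s, loc, rho) for pi, loc, rho in components]
    total = sum(weights)
    return [w / total for w in weights]


def assign(s, components):
    """Index of the component with the largest posterior weight."""
    post = posteriors(s, components)
    return max(range(len(post)), key=post.__getitem__)


# Hypothetical two-component mixture over DNA-like strings.
components = [(0.5, "acgt", 1.0), (0.5, "ttga", 1.0)]
labels = [assign(s, components) for s in ["acgg", "ttgaa", "acg"]]
```

In this toy setting each string is assigned to the component whose location string is closest in edit distance, because the mixing weights and dispersions are equal. With unequal dispersions or a properly normalized distribution, the assignment could differ.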
Evolutionary model of a population of DNA sequences through the interaction with an environment and its application to speciation analysis
In this study, we construct an evolutionary model of a population of DNA
sequences interacting with the surrounding environment on the topological
monoid A* of strings on the alphabet A = { a, c, g, t }. A partial differential
equation governing the evolution of the DNA population is derived as a kind of
diffusion equation on A*. Analyzing the constructed model in a theoretical
manner, we present conditions for sympatric speciation, the possibility of
which remains under discussion. We show that, all other conditions being equal,
a single condition determines whether sympatric speciation occurs or the DNA
population continues to move around randomly in a subset of A*. We next
demonstrate that the population maintains a kind of equilibrium state under
certain conditions. In this situation, the population remains nearly unchanged
and does not differentiate even though it is capable of doing so.
Furthermore, we calculate the probability of sympatric speciation and the time
expected to elapse before it