A Quantile Variant of the EM Algorithm and Its Applications to Parameter Estimation with Interval Data
The expectation-maximization (EM) algorithm is a powerful computational
technique for finding the maximum likelihood estimates for parametric models
when the data are not fully observed. The EM is best suited for situations
where the expectation in each E-step and the maximization in each M-step are
straightforward. A difficulty with the implementation of the EM algorithm is
that each E-step requires the integration of the log-likelihood function in
closed form. The explicit integration can be avoided by using what is known as
the Monte Carlo EM (MCEM) algorithm. The MCEM uses a random sample to estimate
the integral at each E-step. However, the problem with the MCEM is that it
often converges to the integral quite slowly and the convergence behavior can
also be unstable, which causes a computational burden. In this paper, we
propose what we refer to as the quantile variant of the EM (QEM) algorithm. We
prove that the proposed QEM method attains a higher order of approximation accuracy
than the MCEM method. Thus, the proposed QEM method possesses faster and more stable
convergence properties when compared with the MCEM algorithm. The improved performance
is illustrated through numerical studies. Several practical examples illustrating its
use in interval-censored data problems are also provided.
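For readers who want to see the contrast between the two E-step approximations concretely, here is a toy sketch for a single interval-censored observation from a Normal(mu, sigma) model: the Monte Carlo E-step averages the complete-data log-likelihood over random draws from the conditional (truncated normal) distribution, while a quantile-based E-step averages over evenly spaced quantiles of that same distribution. The model, the mid-point quantile grid and all function names are illustrative assumptions, not the paper's QEM implementation.

```python
# Illustrative contrast of a Monte Carlo vs. a quantile-based E-step for one
# interval-censored observation X in [a, b] under a Normal(mu, sigma) model.
# This is a toy sketch, not the paper's QEM implementation.
import numpy as np
from scipy.stats import norm, truncnorm

def e_step_mc(a, b, mu, sigma, mu_new, sigma_new, n_draws=1000, seed=0):
    """Monte Carlo E-step: average the log-likelihood over random draws from
    the conditional (truncated normal) distribution of X given X in [a, b]."""
    lo, hi = (a - mu) / sigma, (b - mu) / sigma
    x = truncnorm.rvs(lo, hi, loc=mu, scale=sigma, size=n_draws, random_state=seed)
    return norm.logpdf(x, loc=mu_new, scale=sigma_new).mean()

def e_step_quantile(a, b, mu, sigma, mu_new, sigma_new, n_points=100):
    """Quantile-based E-step: average the log-likelihood over evenly spaced
    quantiles of the same truncated normal distribution (no randomness)."""
    lo, hi = (a - mu) / sigma, (b - mu) / sigma
    p = (np.arange(n_points) + 0.5) / n_points          # mid-point quantile levels
    x = truncnorm.ppf(p, lo, hi, loc=mu, scale=sigma)   # conditional quantiles
    return norm.logpdf(x, loc=mu_new, scale=sigma_new).mean()

print(e_step_mc(0.0, 2.0, 1.0, 1.0, 0.8, 1.1))
print(e_step_quantile(0.0, 2.0, 1.0, 1.0, 0.8, 1.1))
```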
Annotation of genomics data using bidirectional hidden Markov models unveils variations in Pol II transcription cycle
DNA replication, transcription and repair involve the recruitment of protein complexes that change their composition as they progress along the genome in a directed or strand-specific manner. Chromatin immunoprecipitation in conjunction with hidden Markov models (HMMs) has been instrumental in understanding these processes, as they segment the genome into discrete states that can be related to DNA-associated protein complexes. However, current HMM-based approaches are not able to assign forward or reverse direction to states or properly integrate strand-specific (e.g., RNA expression) with non-strand-specific (e.g., ChIP) data, which is indispensable to accurately characterize directed processes. To overcome these limitations, we introduce bidirectional HMMs which infer directed genomic states from occupancy profiles de novo. Application to RNA polymerase II-associated factors in yeast and chromatin modifications in human T cells recovers the majority of transcribed loci, reveals gene-specific variations in the yeast transcription cycle and indicates the existence of directed chromatin state patterns at transcribed, but not at repressed, regions in the human genome. In yeast, we identify 32 new transcribed loci, a regulated initiation-elongation transition, the absence of elongation factors Ctk1 and Paf1 from a class of genes, a distinct transcription mechanism for highly expressed genes and novel DNA sequence motifs associated with transcription termination. We anticipate bidirectional HMMs to significantly improve the analyses of genome-associated directed processes.
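The key modelling idea above is that genomic states carry a direction. One way to picture this (shown here only as a toy illustration, not the authors' bidirectional-HMM implementation) is to give each directed state a forward and a reverse twin whose strand-specific emission parameters are mirrored, and then decode with a standard HMM algorithm over the doubled state space.

```python
# Toy sketch of the "directed states" idea: each directed state gets a forward
# and a reverse twin; strand-specific emission means are swapped (mirrored) for
# the reverse twin. Illustration of the concept only, not the bdHMM code.
import numpy as np

def viterbi(log_emis, log_trans, log_start):
    """Standard Viterbi decoding over T observations and S states."""
    T, S = log_emis.shape
    delta = log_start + log_emis[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans            # S x S: previous -> current
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emis[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two data tracks: column 0 is sense-strand signal, column 1 antisense signal.
obs = np.array([[2.0, 0.1], [1.8, 0.2], [0.2, 1.9], [0.1, 2.1]])

# One directed "transcribed" state with mean (high sense, low antisense);
# its reverse twin simply swaps the two strand-specific means.
fwd_mean = np.array([2.0, 0.1])
means = np.stack([fwd_mean, fwd_mean[::-1]])           # [forward, reverse]

log_emis = -0.5 * ((obs[:, None, :] - means[None]) ** 2).sum(-1)  # unit-variance Gaussian, up to a constant
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_start = np.log(np.array([0.5, 0.5]))

print(viterbi(log_emis, log_trans, log_start))          # [0, 0, 1, 1]: forward then reverse
```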
Adaptive Seeding for Gaussian Mixture Models
We present new initialization methods for the expectation-maximization
algorithm for multivariate Gaussian mixture models. Our methods are adaptations
of the well-known k-means++ initialization and the Gonzalez algorithm.
Thereby we aim to close the gap between simple random methods, e.g. uniform seeding, and
complex methods that crucially depend on the right choice of hyperparameters.
Our extensive experiments indicate the usefulness of our methods compared to
common techniques and methods, which e.g. apply the original k-means++ and
Gonzalez directly, with respect to artificial as well as real-world data sets.
Comment: This is a preprint of a paper that has been accepted for publication
in the Proceedings of the 20th Pacific Asia Conference on Knowledge Discovery
and Data Mining (PAKDD) 2016. The final publication is available at
link.springer.com (http://link.springer.com/chapter/10.1007/978-3-319-31750-2_24).
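For orientation, the sketch below shows a generic k-means++-style seeding of GMM component means, where each new mean is drawn with probability proportional to the squared distance to the nearest mean already chosen; the adaptive variants proposed in the paper refine this scheme, and the data and parameters here are made up for illustration.

```python
# Generic k-means++-style seeding of GMM component means (illustrative only;
# the paper's adaptive variants differ in how candidate points are weighted).
import numpy as np

def kmeanspp_seeds(X, k, rng=None):
    """Pick k rows of X as initial means: the first uniformly at random, each
    subsequent one with probability proportional to the squared distance to
    the nearest mean chosen so far."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    means = [X[rng.integers(n)]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None, :] - np.array(means)[None]) ** 2).sum(-1), axis=1)
        means.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(means)

# Example: seed a 3-component GMM on synthetic 2-D data; the seeds can then be
# handed to any EM implementation (e.g. scikit-learn's GaussianMixture via means_init).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 3])])
print(kmeanspp_seeds(X, 3, rng=1))
```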
Authorship Attribution With Few Training Samples
This chapter discusses authorship attribution with only a few training samples. The focus on authorship attribution discussed in this chapter differs in two ways from the traditional authorship identification problem discussed in the earlier chapters of this book. Firstly, the traditional authorship attribution studies [63, 65] only work in the presence of large training samples from each candidate author, which are typically enough to build a classification model. In the setting considered here, the emphasis is on using a few training samples for each suspect. In some scenarios, no training samples may exist, and the suspects may be asked (usually through court orders) to produce a writing sample for investigation purposes. Secondly, in traditional authorship studies, the goal is to attribute a single anonymous document to its true author. In this chapter, we look at cases where we have more than one anonymous message that needs to be attributed to the true author(s). It is likely that the perpetrator may either create a ghost e-mail account or hack an existing account, and then use it for sending illegitimate messages in order to remain anonymous. To address the aforementioned shortfalls, the authorship attribution problem has been redefined as follows: given a collection of anonymous messages potentially written by a set of suspects {S1, ···, Sn}, a cybercrime investigator first wants to identify the major groups of messages based on stylometric features; intuitively, each message group is written by one suspect. Then s/he wants to identify the author of each anonymous message collection from the given candidate suspects. To address the newly defined authorship attribution problem, the stylometric pattern-based approach of AuthorMinerl (described previously in Sect. 5.4.1) is extended and called AuthorMinerSmall. When applying this approach, the stylometric features are first extracted from the given anonymous message collection Ω.
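As a purely illustrative sketch of the two-step workflow just described (group the anonymous messages by stylometric features, then attribute each group to a suspect), the snippet below uses character n-gram features, k-means grouping and a nearest-sample match; the features, the clustering choice and the matching rule are placeholder assumptions and not the AuthorMinerSmall algorithm.

```python
# Illustrative two-step workflow: (1) group anonymous messages by stylometric
# features, (2) attribute each group to the closest suspect sample.
# Placeholder features and clustering; NOT the AuthorMinerSmall algorithm.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

anonymous = ["hey, send me the files asap!!",
             "u need to pay now, last warning",
             "hey, did u get the files i sent??",
             "final warning: pay or else"]
suspect_samples = {"S1": "hey, can u send the report asap??",
                   "S2": "this is your last warning, pay the amount now"}

# Character n-grams act as a crude stand-in for stylometric features.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vec.fit_transform(anonymous + list(suspect_samples.values()))
X_anon, X_susp = X[: len(anonymous)], X[len(anonymous):]

# Step 1: form message groups (one group per presumed author).
groups = KMeans(n_clusters=len(suspect_samples), n_init=10, random_state=0).fit_predict(X_anon)

# Step 2: attribute each group to the suspect whose writing sample is most
# similar, on average, to the group's messages.
for g in range(len(suspect_samples)):
    idx = np.where(groups == g)[0]
    sims = cosine_similarity(X_anon[idx], X_susp).mean(axis=0)
    print(f"group {g} -> {list(suspect_samples)[int(sims.argmax())]}")
```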
Parameter estimation for load-sharing system subject to Wiener degradation process using the expectation-maximization algorithm
In practice, many systems exhibit load-sharing behavior, where the surviving components share the total load imposed on the system. Different from general systems, the components of load-sharing systems are interdependent in nature, in such a way that when one component fails, the system load has to be shared by the remaining components, which increases the failure rate or degradation rate of the remaining components. Because of the load-sharing mechanism among components, parameter estimation and reliability assessment are usually complicated for load-sharing systems. Although load-sharing systems with components subject to sudden failures have been intensively studied in the literature, with detailed estimation and analysis approaches, those with components subject to degradation are rarely investigated. In this paper, we propose a parameter estimation method for load-sharing systems subject to continuous degradation with a constant load. A likelihood function based on the degradation data of the components is established as a first step. The maximum likelihood estimators of the unknown parameters are then derived and obtained via the expectation-maximization (EM) algorithm, since the likelihood function has no closed form. Numerical examples are used to illustrate the effectiveness of the proposed method.
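For context, a Wiener degradation process has independent Gaussian increments, ΔX ~ N(μΔt, σ²Δt), so the likelihood that such an EM scheme builds on is a sum of Gaussian terms. The toy sketch below only shows closed-form maximum likelihood estimation of one component's drift and volatility from its observed increments; the load-sharing dependence between components and the E-step of the paper's algorithm are not reproduced.

```python
# Closed-form MLE of drift (mu) and volatility (sigma^2) for a single Wiener
# degradation path X(t) with increments dX ~ N(mu*dt, sigma^2*dt).
# Only the Gaussian-increment building block; not the paper's load-sharing EM.
import numpy as np

def wiener_mle(times, x):
    """times, x: monotone observation times and degradation measurements."""
    dt = np.diff(times)
    dx = np.diff(x)
    mu_hat = dx.sum() / dt.sum()                         # MLE of the drift
    sigma2_hat = np.mean((dx - mu_hat * dt) ** 2 / dt)   # MLE of the diffusion parameter
    return mu_hat, sigma2_hat

# Simulated path with true mu = 0.5 and sigma = 0.2
rng = np.random.default_rng(1)
t = np.linspace(0, 10, 101)
increments = 0.5 * np.diff(t) + 0.2 * np.sqrt(np.diff(t)) * rng.standard_normal(100)
x = np.concatenate([[0.0], np.cumsum(increments)])
print(wiener_mle(t, x))
```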
Crowd Learning with Candidate Labeling: an EM-based Solution
Crowdsourcing is widely used nowadays in machine learning for data labeling. Although in the traditional case annotators are
asked to provide a single label for each instance, novel approaches allow annotators, in case of doubt, to choose a subset of labels as a way to extract more information from them. In both the traditional and these novel approaches, the reliability of the labelers can be modeled based on the collections of labels that they provide. In this paper, we propose an Expectation-Maximization-based method for crowdsourced data with candidate sets. Iteratively, the likelihood of the parameters that model
the reliability of the labelers is maximized, while the ground truth is estimated. The experimental results suggest that the proposed method performs better than the baseline aggregation schemes in terms of estimated accuracy.
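To make the iterative scheme concrete, the sketch below implements a plain Dawid-Skene-style EM for single-label crowd annotations: the E-step computes the posterior over each item's true label from the current labeler confusion matrices, and the M-step re-estimates those matrices. The paper's candidate-set variant would replace the single observed label with a likelihood over the provided label subset; that extension is not shown here, and the toy data are made up.

```python
# Minimal Dawid-Skene-style EM for single-label crowd annotations
# (the paper's candidate-set likelihood is not reproduced here).
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """labels[i][j] = class given by labeler j to item i (or -1 if missing)."""
    labels = np.asarray(labels)
    n_items, n_workers = labels.shape
    # Initialize ground-truth posteriors with majority voting.
    post = np.zeros((n_items, n_classes))
    for i in range(n_items):
        for j in range(n_workers):
            if labels[i, j] >= 0:
                post[i, labels[i, j]] += 1
    post /= post.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: class priors and per-labeler confusion matrices.
        prior = post.mean(axis=0)
        conf = np.full((n_workers, n_classes, n_classes), 1e-6)
        for i in range(n_items):
            for j in range(n_workers):
                if labels[i, j] >= 0:
                    conf[j, :, labels[i, j]] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)
        # E-step: posterior over each item's true class.
        log_post = np.tile(np.log(prior), (n_items, 1))
        for i in range(n_items):
            for j in range(n_workers):
                if labels[i, j] >= 0:
                    log_post[i] += np.log(conf[j, :, labels[i, j]])
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post, conf

# Three items, three labelers; the third labeler is unreliable.
labels = [[0, 0, 1], [1, 1, 0], [0, 0, 0]]
post, conf = dawid_skene(labels, n_classes=2)
print(post.argmax(axis=1))   # estimated ground truth, e.g. [0, 1, 0]
```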
Fast non-negative deconvolution for spike train inference from population calcium imaging
Calcium imaging for observing spiking activity from large populations of
neurons is quickly gaining popularity. While the raw data are fluorescence
movies, the underlying spike trains are of interest. This work presents a fast
non-negative deconvolution filter to infer the approximately most likely spike
train for each neuron, given the fluorescence observations. This algorithm
outperforms optimal linear deconvolution (Wiener filtering) on both simulated
and biological data. The performance gains come from restricting the inferred
spike trains to be positive (using an interior-point method), unlike the Wiener
filter. The algorithm is fast enough that even when imaging over 100 neurons,
inference can be performed on the set of all observed traces faster than
real-time. Performing optimal spatial filtering on the images further refines
the estimates. Importantly, all the parameters required to perform the
inference can be estimated using only the fluorescence data, obviating the need
to perform joint electrophysiological and imaging calibration experiments.
Comment: 22 pages, 10 figures.
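As a minimal illustration of non-negative deconvolution (using a generic non-negative least-squares solver on an explicit convolution matrix rather than the paper's fast interior-point filter), the sketch below models fluorescence as a spike train convolved with an exponentially decaying calcium kernel plus noise and recovers a non-negative spike estimate; the kernel, decay constant and noise level are illustrative assumptions.

```python
# Toy non-negative deconvolution: fluorescence = exponential kernel * spikes + noise.
# Uses a generic NNLS solver on an explicit convolution matrix; the paper's
# algorithm is a much faster interior-point method built on the same
# non-negativity constraint.
import numpy as np
from scipy.optimize import nnls

T, gamma = 200, 0.9                      # time steps, per-step calcium decay
rng = np.random.default_rng(0)
spikes = (rng.random(T) < 0.05).astype(float)

# Convolution matrix: calcium at time t is sum_{s <= t} gamma^(t-s) * spike_s.
A = np.tril(gamma ** (np.arange(T)[:, None] - np.arange(T)[None, :]))
fluor = A @ spikes + 0.1 * rng.standard_normal(T)

est_spikes, _ = nnls(A, fluor)           # non-negative least-squares spike estimate
print(np.round(est_spikes[:20], 2))
```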
Evidence Propagation and Consensus Formation in Noisy Environments
We study the effectiveness of consensus formation in multi-agent systems
where there is both belief updating based on direct evidence and also belief
combination between agents. In particular, we consider the scenario in which a
population of agents collaborate on the best-of-n problem where the aim is to
reach a consensus about which is the best (alternatively, true) state from
amongst a set of states, each with a different quality value (or level of
evidence). Agents' beliefs are represented within Dempster-Shafer theory by
mass functions and we investigate the macro-level properties of four well-known
belief combination operators for this multi-agent consensus formation problem:
Dempster's rule, Yager's rule, Dubois & Prade's operator and the averaging
operator. The convergence properties of the operators are considered and
simulation experiments are conducted for different evidence rates and noise
levels. Results show that a combination of updating on direct evidence and
belief combination between agents results in better consensus to the best state
than does evidence updating alone. We also find that in this framework the
operators are robust to noise. Broadly, Yager's rule is shown to be the best-performing
operator across the parameter settings considered, i.e. in terms of convergence to the
best state, robustness to noise, and scalability.
Comment: 13th International Conference on Scalable Uncertainty Management.
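For reference, Dempster's rule, one of the four combination operators compared above, multiplies the masses of intersecting focal sets and renormalizes by the non-conflicting mass. The small sketch below applies it on a three-state frame; the frame and the mass values are made up for illustration.

```python
# Dempster's rule of combination on a small frame of discernment.
# Focal sets are frozensets of states; the masses are illustrative values only.
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb                 # mass falling on the empty set
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

states = {"s1", "s2", "s3"}
m1 = {frozenset({"s1"}): 0.6, frozenset(states): 0.4}          # agent 1's beliefs
m2 = {frozenset({"s1", "s2"}): 0.7, frozenset(states): 0.3}    # agent 2's beliefs
print(dempster_combine(m1, m2))
```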
A reliability-based approach for influence maximization using the evidence theory
Influence maximization is the problem of finding a set of social network
users, called influencers, that can trigger a large cascade of propagation.
Influencers are very beneficial for making a marketing campaign go viral through
social networks, for example. In this paper, we propose an influence measure
that combines many influence indicators. In addition, we consider the reliability
of each influence indicator and present a distance-based process that allows us
to estimate that reliability. The proposed measure is defined
under the framework of the theory of belief functions. Furthermore, the
reliability-based influence measure is used with an influence maximization
model to select a set of users that are able to maximize the influence in the
network. Finally, we present a set of experiments on a dataset collected from
Twitter. These experiments show the performance of the proposed solution in
detecting high-quality social influencers.
Comment: 14 pages, 8 figures, DaWaK 2017 conference.
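Once every candidate user has an influence score, the selection step described above can be approximated greedily; the sketch below picks k users by the marginal gain of weighted follower coverage. The toy graph and the per-follower weights are placeholders and do not reproduce the paper's belief-function-based influence measure.

```python
# Greedy selection of k influencers by marginal gain of weighted coverage.
# The graph and per-follower weights are placeholders, not the paper's
# belief-function-based influence measure.
def greedy_influencers(graph, weights, k):
    """graph: dict user -> set of followers; weights: dict follower -> score.
    Greedily pick k users maximizing the total score of newly covered followers."""
    chosen, covered = [], set()
    for _ in range(k):
        best, best_gain = None, -1.0
        for u in graph:
            if u in chosen:
                continue
            gain = sum(weights.get(v, 0.0) for v in graph[u] - covered)
            if gain > best_gain:
                best, best_gain = u, gain
        chosen.append(best)
        covered |= graph[best]
    return chosen

graph = {"a": {"x", "y", "z"}, "b": {"y", "z"}, "c": {"w"}}
weights = {"w": 1.0, "x": 0.5, "y": 0.8, "z": 0.2}
print(greedy_influencers(graph, weights, 2))   # ['a', 'c']
```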
Outlier detection with partial information: Application to emergency mapping
This paper addresses the problem of novelty detection in the case where the observed data are a mixture of a known 'background' process contaminated with an unknown other process, which generates the outliers, or novel observations. The framework we describe here is quite general, employing univariate classification with incomplete information, based on knowledge of the distribution (the 'probability density function', 'pdf') of the data generated by the 'background' process. The relative proportion of this 'background' component (the 'prior' 'background' probability), the 'pdf' and the 'prior' probabilities of all other components are all assumed unknown. The main contribution is a new classification scheme that identifies the maximum proportion of observed data following the known 'background' distribution. The method exploits the Kolmogorov-Smirnov test to estimate the proportions, and afterwards data are Bayes optimally separated. Results, demonstrated with synthetic data, show that this approach can produce more reliable results than a standard novelty detection scheme. The classification algorithm is then applied to the problem of identifying outliers in the SIC2004 data set, in order to detect the radioactive release simulated in the 'joker' data set. We propose this method as a reliable means of novelty detection in the emergency situation, which can also be used to identify outliers prior to the application of a more general automatic mapping algorithm. © Springer-Verlag 2007
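To illustrate the mechanism described above, the sketch below estimates the largest proportion of data compatible with a known 'background' distribution by checking the empirical CDF against a Kolmogorov-Smirnov-style band, and then separates points with a simple Bayes rule. The standard-normal background, the grid search, the tolerance and the assumed uniform outlier density are simplifications for illustration, not the paper's exact procedure.

```python
# Toy version of the idea: find the largest background proportion pi such that
# the empirical CDF stays within a KS-style band around what a pi-weighted
# background CDF allows, then separate points with a simple Bayes rule.
# The uniform outlier density and the tolerance are illustrative assumptions.
import numpy as np
from scipy.stats import norm, uniform

rng = np.random.default_rng(0)
background = rng.standard_normal(900)               # known N(0,1) background
outliers = rng.uniform(4, 8, 100)                   # unknown contaminating process
x = np.concatenate([background, outliers])

n = len(x)
xs = np.sort(x)
ecdf = np.arange(1, n + 1) / n
tol = 1.36 / np.sqrt(n)                             # ~95% KS critical value

def compatible(pi):
    """Is the empirical CDF consistent with a pi-fraction N(0,1) background?"""
    lower = pi * norm.cdf(xs)                       # F(x) >= pi * F0(x)
    upper = pi * norm.cdf(xs) + (1 - pi)            # F(x) <= pi * F0(x) + (1 - pi)
    return np.all(ecdf >= lower - tol) and np.all(ecdf <= upper + tol)

pi_hat = max(p for p in np.linspace(0.01, 1.0, 100) if compatible(p))

# Bayes separation, assuming (for illustration) uniform outliers on the data range.
f0 = norm.pdf(x)
f1 = uniform.pdf(x, loc=x.min(), scale=x.max() - x.min())
posterior_bg = pi_hat * f0 / (pi_hat * f0 + (1 - pi_hat) * f1)
print(pi_hat, int((posterior_bg < 0.5).sum()), "points flagged as outliers")
```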